AIPI Term Project

Developer: Keese Phillips

About:

The purpose of this project is to perform basic intelligent document processing (IDP) to extract a table from a document image. The input can be a PDF page or an image whose contents cannot be mapped directly to a CSV file. The steps in this process are table detection, optical character recognition (OCR), table extraction, and conversion to CSV format.
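
At a high level, the pipeline chains those steps together: a YOLOv8 detector finds the table region, pytesseract reads the cropped region, and the recognized text is written out as CSV. The sketch below illustrates that flow under stated assumptions; the model path, the image_to_csv helper, and the naive whitespace-based row splitting are illustrative, not the project's exact code.

  from ultralytics import YOLO
  from PIL import Image
  import pytesseract
  import csv

  def image_to_csv(image_path: str, csv_path: str) -> None:
      # 1. Table detection: take the highest-confidence box as the table region.
      detector = YOLO("models/trained_yolov8.pt")  # assumed path of the trained detector
      page = Image.open(image_path)
      result = detector(page)[0]
      if len(result.boxes) == 0:
          raise ValueError("no table detected")
      x1, y1, x2, y2 = map(int, result.boxes.xyxy[0].tolist())
      table = page.crop((x1, y1, x2, y2))

      # 2. OCR: read the cropped table region with Tesseract.
      text = pytesseract.image_to_string(table)

      # 3. Conversion: naively split OCR lines on whitespace and write rows to CSV.
      with open(csv_path, "w", newline="") as f:
          writer = csv.writer(f)
          for line in text.splitlines():
              if line.strip():
                  writer.writerow(line.split())

  image_to_csv("page.png", "table.csv")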

How to run the project

If you want to run the full pipeline and train the model from scratch

  1. You will need to install all of the necessary packages required by the setup.py script beforehand
  2. You will need to install the Tesseract OCR engine (used by pytesseract) and add it to your PATH if you are using Windows OS
  3. You will then need to run setup.py to create the data pipeline and train the model (a sketch of the steps it runs follows the commands below)
  4. You will then need to run the frontend to use the model
pip install -r requirements.txt
python setup.py
streamlit run main.py
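
As a rough illustration of what running setup.py does, it is approximately equivalent to running the three pipeline scripts under scripts/ in order; the subprocess-style invocation below is an assumption, and setup.py may import the scripts directly instead.

  import subprocess
  import sys

  # Assumed orchestration: run the pipeline scripts listed under "Project Structure"
  # in order (get data, prepare features, train the table detector).
  steps = [
      "scripts/make_dataset.py",    # download the raw PubLayNet data
      "scripts/build_features.py",  # prepare the dataset for training
      "scripts/model.py",           # train the YOLOv8 table detector
  ]
  for step in steps:
      subprocess.run([sys.executable, step], check=True)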

If you want to just run the frontend

  1. You will need to install all of the necessary packages to run the frontend beforehand, along with the Tesseract OCR engine used by pytesseract
  2. You will then need to run the frontend to use the model (a sketch of the Streamlit flow is shown after the commands below)
pip install -r requirements.txt
streamlit run main.py
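
main.py provides the Streamlit interface. The snippet below is a minimal sketch of that kind of upload-and-extract flow; for simplicity it skips the detection step and OCRs the whole page, so the widget labels and the CSV conversion are illustrative rather than the app's exact code.

  import streamlit as st
  from PIL import Image
  import pytesseract

  st.title("Table Extraction Demo")  # illustrative title, not necessarily the app's
  uploaded = st.file_uploader("Upload a document image", type=["png", "jpg", "jpeg"])
  if uploaded is not None:
      page = Image.open(uploaded)
      st.image(page, caption="Uploaded page")
      # In the real app the page would first pass through the YOLOv8 table detector;
      # here the whole page is OCR'd just to illustrate the flow.
      text = pytesseract.image_to_string(page)
      csv_text = "\n".join(",".join(line.split()) for line in text.splitlines() if line.strip())
      st.download_button("Download CSV", csv_text, file_name="table.csv")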

Project Structure

  • requirements.txt: list of Python libraries to install before running the project
  • setup.py: script to set up project (get data, train model)
  • main.py: main script/notebook to run streamlit user interface
  • assets: directory for images used in frontend
  • scripts: directory for pipeline scripts or utility scripts
    • make_dataset.py: script to get data
    • build_features.py: script to prepare the dataset for training
    • model.py: script to train model and predict
  • models: directory for trained models
    • trained_yolov8.pt: PyTorch-trained YOLOv8 model for table detection
    • gpt_model: directory to store the GPT model
  • data: directory for project data
    • raw: directory for raw data
    • processed: directory to store the processed data
    • outputs: directory to store the prepared data
  • notebooks: directory to store any exploration notebooks used
  • .gitignore: git ignore file

Data source

The data used to train the model was provided by IBM's PubLayNet, the largest dataset ever released for document layout analysis. As per their dataset description:

PubLayNet is a large dataset of document images, of which the layout is annotated with both bounding boxes and polygonal segmentations. The source of the documents is PubMed Central Open Access Subset (commercial use collection). The annotations are automatically generated by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.
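
In practice these annotations ship as COCO-style JSON. The sketch below shows one way to filter them down to the table class before building detector training labels; the file path is an assumption about how the download is laid out, while the field names follow the standard COCO convention.

  import json

  # Load one of the PubLayNet annotation files (assumed location under data/raw).
  with open("data/raw/train.json") as f:
      coco = json.load(f)

  # Look up the "table" category id instead of hard-coding it.
  table_id = next(c["id"] for c in coco["categories"] if c["name"] == "table")

  # Group table bounding boxes ([x, y, width, height]) by the page image they belong to.
  tables_per_image = {}
  for ann in coco["annotations"]:
      if ann["category_id"] == table_id:
          tables_per_image.setdefault(ann["image_id"], []).append(ann["bbox"])

  print(f"{len(tables_per_image)} pages contain at least one table")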

Contributions

Brinnae Bent
Jon Reifschneider
Xu Zhong
Jianbin Tang
Antonio Jimeno Yepes
