# AIPI Term Project

## Developer: Keese Phillips

## About:
The purpose of this project is to perform very basic intelligent document processing (IDP) to extract a table from a document image. The input can be a document in PDF or image format whose contents cannot be mapped directly to a CSV file. The steps in this process are table detection, optical character recognition (OCR), table extraction, and conversion to CSV format (see the pipeline sketch at the end of this README).

## How to run the project

### If you want to run the full pipeline and train the model from scratch
1. You will need to install all of the necessary packages to run the setup.py script beforehand
2. You will need to install Tesseract (required by pytesseract) and add it to your PATH if you are using Windows
3. You will then need to run setup.py to create the data pipeline and train the model
4. You will then need to run the frontend to use the model

```bash
pip install -r requirements.txt
python setup.py
streamlit run main.py
```

### If you want to just run the frontend
1. You will need to install all of the necessary packages beforehand and install pytesseract
2. You will then need to run the frontend to use the model

```bash
pip install -r requirements.txt
streamlit run main.py
```

## Project Structure
> - requirements.txt: list of Python libraries to install before running the project
> - setup.py: script to set up the project (get data, train the model)
> - main.py: main script to run the Streamlit user interface
> - assets: directory for images used in the frontend
> - scripts: directory for pipeline and utility scripts
>   - make_dataset.py: script to get the data
>   - build_features.py: script to prepare the dataset for training
>   - model.py: script to train the model and predict
> - models: directory for trained models
>   - trained_yolov8.pt: PyTorch-trained YOLOv8 model for table detection
>   - gpt_model: directory to store the GPT model
> - data: directory for project data
>   - raw: directory for raw data
>   - processed: directory to store the processed data
>   - outputs: directory to store the prepared data
> - notebooks: directory to store any exploration notebooks used
> - .gitignore: git ignore file

## [Data source](https://github.com/ibm-aur-nlp/PubLayNet)
The data used to train the model was provided by [IBM](https://developer.ibm.com/exchanges/data/all/publaynet/) and is described in [PubLayNet: largest dataset ever for document layout analysis](https://arxiv.org/abs/1908.07836). As per their dataset description:
> PubLayNet is a large dataset of document images, of which the layout is annotated with both bounding boxes and polygonal segmentations. The source of the documents is PubMed Central Open Access Subset (commercial use collection). The annotations are automatically generated by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.

## Contributions
- Brinnae Bent
- Jon Reifschneider
- Xu Zhong
- Jianbin Tang
- Antonio Jimeno Yepes
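
## Pipeline sketch
The sketch below illustrates how the detection, OCR, table-extraction, and CSV steps described in the About section fit together. It is a minimal, illustrative example only: it assumes a YOLOv8 weights file at `models/trained_yolov8.pt` (as listed in the project structure) and a working Tesseract/pytesseract install, and the `extract_table_to_csv` helper, the sample file paths, and the whitespace-based row splitting are hypothetical rather than taken from the project's scripts.

```python
# Minimal sketch: table detection -> OCR -> table extraction -> CSV.
# Assumes trained YOLOv8 weights at models/trained_yolov8.pt and Tesseract on PATH.
from ultralytics import YOLO
from PIL import Image
import pytesseract
import pandas as pd


def extract_table_to_csv(image_path: str, csv_path: str) -> None:
    # 1. Table detection: locate table regions in the document image.
    model = YOLO("models/trained_yolov8.pt")
    result = model(image_path)[0]

    image = Image.open(image_path)
    for i, box in enumerate(result.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = map(int, box)
        table_crop = image.crop((x1, y1, x2, y2))

        # 2. OCR: read the text inside the detected table region.
        text = pytesseract.image_to_string(table_crop)

        # 3. Table extraction: naive whitespace split into rows and cells.
        rows = [line.split() for line in text.splitlines() if line.strip()]

        # 4. Conversion: write the extracted cells to a CSV file per table.
        pd.DataFrame(rows).to_csv(
            csv_path.replace(".csv", f"_{i}.csv"), index=False, header=False
        )


if __name__ == "__main__":
    # Hypothetical input/output paths within the repo's data layout.
    extract_table_to_csv("data/raw/sample_page.png", "data/outputs/table.csv")
```

In practice, a real implementation would use layout-aware OCR output (e.g. word bounding boxes) rather than whitespace splitting to recover columns reliably; the Streamlit frontend in `main.py` wraps this kind of flow behind a file-upload interface.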