AIPI Term Project

Developer: Keese Phillips

About:

The purpose of this project is to perform basic intelligent document processing (IDP) to extract a table from a document image. The input can be a PDF page or an image whose contents cannot be mapped directly to a CSV file. The steps in this process are table detection, optical character recognition (OCR), table extraction, and conversion to CSV format.
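
At a high level, the pipeline chains those steps together: a YOLOv8 detector finds the table region, pytesseract reads the cropped region, and the recognized text is written out as CSV. The sketch below illustrates that flow under stated assumptions; the model path, the image_to_csv helper, and the naive whitespace-based row splitting are illustrative, not the project's exact code.

  from ultralytics import YOLO
  from PIL import Image
  import pytesseract
  import csv

  def image_to_csv(image_path: str, csv_path: str) -> None:
      # 1. Table detection: take the highest-confidence box as the table region.
      detector = YOLO("models/trained_yolov8.pt")  # assumed path of the trained detector
      page = Image.open(image_path)
      result = detector(page)[0]
      if len(result.boxes) == 0:
          raise ValueError("no table detected")
      x1, y1, x2, y2 = map(int, result.boxes.xyxy[0].tolist())
      table = page.crop((x1, y1, x2, y2))

      # 2. OCR: read the cropped table region with Tesseract.
      text = pytesseract.image_to_string(table)

      # 3. Conversion: naively split OCR lines on whitespace and write rows to CSV.
      with open(csv_path, "w", newline="") as f:
          writer = csv.writer(f)
          for line in text.splitlines():
              if line.strip():
                  writer.writerow(line.split())

  image_to_csv("page.png", "table.csv")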

How to run the project

If you want to run the full pipeline and train the model from scratch

  1. You will need to install all of the necessary packages required by the setup.py script beforehand
  2. You will need to install the Tesseract OCR engine (used by pytesseract) and add it to your PATH if you are using Windows OS
  3. You will then need to run setup.py to create the data pipeline and train the model (a sketch of the steps it runs follows the commands below)
  4. You will then need to run the frontend to use the model
pip install -r requirements.txt
python setup.py
streamlit run main.py
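
As a rough illustration of what running setup.py does, it is approximately equivalent to running the three pipeline scripts under scripts/ in order; the subprocess-style invocation below is an assumption, and setup.py may import the scripts directly instead.

  import subprocess
  import sys

  # Assumed orchestration: run the pipeline scripts listed under "Project Structure"
  # in order (get data, prepare features, train the table detector).
  steps = [
      "scripts/make_dataset.py",    # download the raw PubLayNet data
      "scripts/build_features.py",  # prepare the dataset for training
      "scripts/model.py",           # train the YOLOv8 table detector
  ]
  for step in steps:
      subprocess.run([sys.executable, step], check=True)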

If you want to just run the frontend

  1. You will need to install all of the necessary packages to run the frontend beforehand, along with the Tesseract OCR engine used by pytesseract
  2. You will then need to run the frontend to use the model (a sketch of the Streamlit flow is shown after the commands below)
pip install -r requirements.txt
streamlit run main.py
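
main.py provides the Streamlit interface. The snippet below is a minimal sketch of that kind of upload-and-extract flow; for simplicity it skips the detection step and OCRs the whole page, so the widget labels and the CSV conversion are illustrative rather than the app's exact code.

  import streamlit as st
  from PIL import Image
  import pytesseract

  st.title("Table Extraction Demo")  # illustrative title, not necessarily the app's
  uploaded = st.file_uploader("Upload a document image", type=["png", "jpg", "jpeg"])
  if uploaded is not None:
      page = Image.open(uploaded)
      st.image(page, caption="Uploaded page")
      # In the real app the page would first pass through the YOLOv8 table detector;
      # here the whole page is OCR'd just to illustrate the flow.
      text = pytesseract.image_to_string(page)
      csv_text = "\n".join(",".join(line.split()) for line in text.splitlines() if line.strip())
      st.download_button("Download CSV", csv_text, file_name="table.csv")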

Project Structure

  • requirements.txt: list of Python libraries to install before running the project
  • setup.py: script to set up project (get data, train model)
  • main.py: main script/notebook to run streamlit user interface
  • assets: directory for images used in frontend
  • scripts: directory for pipeline scripts or utility scripts
    • make_dataset.py: script to get data
    • build_features.py: script to prepare the dataset for training
    • model.py: script to train model and predict
  • models: directory for trained models
    • trained_yolov8.pt: PyTorch-trained YOLOv8 model for table detection
    • gpt_model: directory to store the GPT model
  • data: directory for project data
    • raw: directory for raw data
    • processed: directory to store the processed data
    • outputs: directory to store the prepared data
  • notebooks: directory to store any exploration notebooks used
  • .gitignore: git ignore file

Data source

The data used to train the model was provided by IBM's PubLayNet, the largest dataset ever released for document layout analysis. As per their dataset description:

PubLayNet is a large dataset of document images, of which the layout is annotated with both bounding boxes and polygonal segmentations. The source of the documents is PubMed Central Open Access Subset (commercial use collection). The annotations are automatically generated by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.
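
In practice these annotations ship as COCO-style JSON. The sketch below shows one way to filter them down to the table class before building detector training labels; the file path is an assumption about how the download is laid out, while the field names follow the standard COCO convention.

  import json

  # Load one of the PubLayNet annotation files (assumed location under data/raw).
  with open("data/raw/train.json") as f:
      coco = json.load(f)

  # Look up the "table" category id instead of hard-coding it.
  table_id = next(c["id"] for c in coco["categories"] if c["name"] == "table")

  # Group table bounding boxes ([x, y, width, height]) by the page image they belong to.
  tables_per_image = {}
  for ann in coco["annotations"]:
      if ann["category_id"] == table_id:
          tables_per_image.setdefault(ann["image_id"], []).append(ann["bbox"])

  print(f"{len(tables_per_image)} pages contain at least one table")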

Contributions

Brinnae Bent
Jon Reifschneider
Xu Zhong
Jianbin Tang
Antonio Jimeno Yepes
