Visually Guided Generative Text-Layout Pre-training for Document Intelligence

ViTLP, a generative foundation model for document intelligence, was proposed in the paper Visually Guided Generative Text-Layout Pre-training for Document Intelligence. We provide the pre-trained checkpoint ViTLP-medium (380M). The pre-trained ViTLP model can natively perform OCR text localization and recognition.

Demo on Document Text Recognition & Localization

The code for ViTLP inference and the demo is available at https://github.com/Veason-silverbullet/ViTLP.

ocr-demo-1.png

ocr-demo-2.png

Preset FAQ

  • Why ViTLP-medium (380M)?

When I started this project, it was the eve of the LLM era (more precisely, just before ChatGPT). ViTLP-base, presented in our paper, is actually a rather small pre-trained model. We know that ViTLP is expected to be scaled up in this LLM era. However, the pre-training scale is commonly constrained by computation resources and the size of the pre-training dataset. Given these constraints, ViTLP-medium (380M) is the largest pre-training scale we can support so far.

Besides, this scale of ViTLP also brings inference benefits in both speed and memory usage. Typically, OCR on one page of a document image can be processed within 5~10 seconds on an Nvidia 4090, which is comparable to, or faster than, most OCR engines and LLMs.
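To put the per-page latency in perspective, the figure above can be converted into a rough throughput estimate. The snippet below is only a back-of-the-envelope sketch: it assumes pages are processed one after another on a single GPU with no batching, and the helper name `pages_per_hour` is hypothetical, not part of the ViTLP code base.

```python
def pages_per_hour(seconds_per_page: float) -> float:
    """Convert a per-page latency in seconds into pages processed per hour."""
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return 3600.0 / seconds_per_page


# Latency range quoted above for one Nvidia 4090: 5 to 10 seconds per page.
# Assumption: strictly sequential processing, one page at a time.
FAST_SECONDS, SLOW_SECONDS = 5.0, 10.0

low = pages_per_hour(SLOW_SECONDS)
high = pages_per_hour(FAST_SECONDS)
print(f"Estimated throughput: {low:.0f} to {high:.0f} pages per hour")
```

Under these assumptions, a single GPU would handle roughly 360 to 720 pages per hour. Actual throughput depends on page density, image resolution, and hardware.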

Note

ViTLP is pronounced /ˈvai·tlp/ (vital). The first version of our paper was submitted to OpenReview in June 2023.
