abhinand committed 4e7a42d (parent: 7f28edc)

Create README.md

Files changed (1): README.md (+177, -0)
---
tags:
- ocr
- vision
---
**Note:** ORIGINAL MODEL REPO: https://github.com/Ucas-HaoranWei/GOT-OCR2.0

---

<h3><a href="">General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model</a></h3>

<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/"><img src="https://img.shields.io/badge/Project-Page-Green"></a>
<a href="https://arxiv.org/abs/2409.01704"><img src="https://img.shields.io/badge/Paper-PDF-orange"></a>
<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/assets/wechat.jpg"><img src="https://img.shields.io/badge/Wechat-blue"></a>
<a href="https://zhuanlan.zhihu.com/p/718163422"><img src="https://img.shields.io/badge/zhihu-red"></a>

[Haoran Wei*](https://scholar.google.com/citations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, [Zheng Ge](https://joker316701882.github.io/), Liang Zhao, [Jianjian Sun](https://scholar.google.com/citations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https://scholar.google.com.hk/citations?user=J0ko04IAAAAJ&hl=zh-CN&oi=ao), Chunrui Han, [Xiangyu Zhang](https://scholar.google.com/citations?user=yuB-cfoAAAAJ&hl=en)

<p align="center">
<img src="assets/got_logo.png" style="width: 200px" align=center>
</p>

## Release

- [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this [repo](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/GOT-OCR-2.0-paper.pdf). We have also submitted it to arXiv.
- [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)

**Usage and License Notices**: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of Vary.

## Community contributions
We encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions:

[Colab of GOT](https://colab.research.google.com/drive/1nmiNciZ5ugQVp4rFbL9ZWpEPd92Y9o7p?usp=sharing) ~ contributor: [@Zizhe Wang](https://github.com/PaperPlaneDeemo)

## Contents
- [Install](#install)
- [GOT Weights](#got-weights)
- [Demo](#demo)
- [Train](#train)
- [Eval](#eval)

***
<p align="center">
<img src="assets/got_support.jpg" style="width: 800px" align=center>
</p>
<p align="center">
<a href="">Towards OCR-2.0 via a Unified End-to-end Model</a>
</p>

***
## Install
0. Our environment is CUDA 11.8 + torch 2.0.1.
1. Clone this repository and navigate to the GOT folder
```bash
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
```
2. Install the package
```Shell
conda create -n got python=3.10 -y
conda activate got
pip install -e .
```

3. Install Flash-Attention
```Shell
pip install ninja
pip install flash-attn --no-build-isolation
```
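After these steps, a quick check like the one below (a sketch, not part of the original instructions) can confirm that the CUDA build of torch and flash-attn import cleanly before you move on to the weights:

```python
# Minimal environment sanity check (assumes the conda env above is active).
import torch
import flash_attn  # noqa: F401  # raises ImportError if the build failed

print("torch:", torch.__version__)        # expected around 2.0.1
print("cuda:", torch.version.cuda)        # expected around 11.8
print("gpu available:", torch.cuda.is_available())
```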
## GOT Weights
- [Google Drive](https://drive.google.com/drive/folders/1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN?usp=sharing)
- [BaiduYun](https://pan.baidu.com/s/1G4aArpCOt6I_trHv_1SE2g) code: OCR2
## Demo
1. plain-text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr
```
2. formatted-text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format
```
3. fine-grained OCR (by box or by color):
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --box [x1,y1,x2,y2]
```
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --color red/green/blue
```
4. multi-crop OCR:
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /an/image/file.png
```
5. multi-page OCR (the image path contains multiple .png files):
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /images/path/ --multi-page
```
6. render the formatted OCR results:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format --render
```
**Note**:
The rendered results are saved to /results/demo.html. Open demo.html to view them.
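If you prefer to call GOT from Python rather than through the demo scripts, the sketch below shows one way to do it via `transformers` remote code. It assumes the weights are packaged in the upstream GOT-OCR2.0 Hugging Face format, where the remote code exposes a `model.chat` helper; the repo id is only a placeholder for wherever you keep the weights, and none of this is prescribed by the demo commands above.

```python
# A minimal sketch, assuming the upstream GOT-OCR2.0 Hugging Face packaging
# (AutoModel + trust_remote_code exposing a model.chat helper). The repo id
# below is a placeholder; substitute your local weights path or HF repo.
from transformers import AutoModel, AutoTokenizer

model_name = "ucaslcl/GOT-OCR2_0"  # placeholder weight location
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# ocr_type="ocr" for plain text, "format" for formatted output.
result = model.chat(tokenizer, "/an/image/file.png", ocr_type="ocr")
print(result)
```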
## Train
1. This codebase only supports post-training (stage-2/stage-3) on top of our GOT weights.
2. If you want to train from stage-1 as described in our paper, you need this [repo](https://github.com/Ucas-HaoranWei/Vary-tiny-600k).
```Shell
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
  --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json \
  --model_name_or_path /GOT_weights/ \
  --use_im_start_end True \
  --bf16 True \
  --gradient_accumulation_steps 2 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --weight_decay 0. \
  --warmup_ratio 0.001 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --tf32 True \
  --model_max_length 8192 \
  --gradient_checkpointing True \
  --dataloader_num_workers 8 \
  --report_to none \
  --per_device_train_batch_size 2 \
  --num_train_epochs 1 \
  --learning_rate 2e-5 \
  --datasets pdf-ocr+scence \
  --output_dir /your/output.path
```
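For reference, the per-step global batch size follows directly from the flags above; the snippet below just spells out that arithmetic (the GPU count is an assumption, not something the command fixes).

```python
# Effective global batch size implied by the flags above (a quick check;
# adjust num_gpus to your setup -- one 8-GPU node is only an assumption).
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_gpus = 8

print(per_device_train_batch_size * gradient_accumulation_steps * num_gpus)  # 32
```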
**Note**:
1. Change the corresponding data information in constant.py (a sketch of the kind of entry involved follows below).
2. Change line 37 in conversation_dataset_qwen.py to your data_name.
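For note 1, the data registration in constant.py is essentially a name-to-paths mapping. The sketch below only illustrates the shape of such an entry; the dict name, keys, and paths are assumptions rather than the exact contents of the file, so mirror whatever the file actually uses.

```python
# Hypothetical sketch of a dataset entry in constant.py; the dict name,
# keys, and paths are placeholders for illustration only.
CONVERSATION_DATA = {
    "pdf-ocr": {
        "images": "/path/to/pdf_ocr/images/",          # image root folder
        "annotations": "/path/to/pdf_ocr/train.json",  # conversation-style labels
    },
}
```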
## Eval
1. We use the [Fox](https://github.com/ucaslcl/Fox) and [OneChart](https://github.com/LingyvKong/OneChart) benchmarks; other benchmarks can be found in the weights download link.
2. The eval code can be found in GOT/eval.
3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, --num-chunks can be set to 8.
```Shell
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
```
## Contact
If you are interested in this work or have questions about the code or the paper, please join our [Wechat]() communication group.
## Acknowledgement
- [Vary](https://github.com/Ucas-HaoranWei/Vary/): the codebase we built upon!
- [Qwen](https://github.com/QwenLM/Qwen): the LLM base model of Vary, which is good at both English and Chinese!
## Citation
```bibtex
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```