abhinand committed 4e7a42d (parent: 7f28edc)

Create README.md

Files changed (1): README.md (+177, -0)
---
tags:
- ocr
- vision
---
**Note:** ORIGINAL MODEL REPO: https://github.com/Ucas-HaoranWei/GOT-OCR2.0

---

<h3><a href="">General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model</a></h3>

<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/"><img src="https://img.shields.io/badge/Project-Page-Green"></a>
<a href="https://arxiv.org/abs/2409.01704"><img src="https://img.shields.io/badge/Paper-PDF-orange"></a>
<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/assets/wechat.jpg"><img src="https://img.shields.io/badge/Wechat-blue"></a>
<a href="https://zhuanlan.zhihu.com/p/718163422"><img src="https://img.shields.io/badge/zhihu-red"></a>

[Haoran Wei*](https://scholar.google.com/citations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, [Zheng Ge](https://joker316701882.github.io/), Liang Zhao, [Jianjian Sun](https://scholar.google.com/citations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https://scholar.google.com.hk/citations?user=J0ko04IAAAAJ&hl=zh-CN&oi=ao), Chunrui Han, [Xiangyu Zhang](https://scholar.google.com/citations?user=yuB-cfoAAAAJ&hl=en)

<p align="center">
<img src="assets/got_logo.png" style="width: 200px" align=center>
</p>

## Release

- [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this [repo](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/GOT-OCR-2.0-paper.pdf). We have also submitted it to arXiv.
- [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)

**Usage and License Notices**: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of Vary.

## Community contributions
We encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions:

[Colab of GOT](https://colab.research.google.com/drive/1nmiNciZ5ugQVp4rFbL9ZWpEPd92Y9o7p?usp=sharing) ~ contributor: [@Zizhe Wang](https://github.com/PaperPlaneDeemo)

## Contents
- [Install](#install)
- [GOT Weights](#got-weights)
- [Demo](#demo)
- [Train](#train)
- [Eval](#eval)

***
<p align="center">
<img src="assets/got_support.jpg" style="width: 800px" align=center>
</p>
<p align="center">
<a href="">Towards OCR-2.0 via a Unified End-to-end Model</a>
</p>

***
## Install
0. Our environment is CUDA 11.8 + torch 2.0.1.
1. Clone this repository and navigate to the GOT folder
```bash
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
```
2. Install the package
```Shell
conda create -n got python=3.10 -y
conda activate got
pip install -e .
```

3. Install Flash-Attention
```Shell
pip install ninja
pip install flash-attn --no-build-isolation
```
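After these steps, a quick check like the one below (a sketch, not part of the original instructions) can confirm that the CUDA build of torch and flash-attn import cleanly before you move on to the weights:

```python
# Minimal environment sanity check (assumes the conda env above is active).
import torch
import flash_attn  # noqa: F401  # raises ImportError if the build failed

print("torch:", torch.__version__)        # expected around 2.0.1
print("cuda:", torch.version.cuda)        # expected around 11.8
print("gpu available:", torch.cuda.is_available())
```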
## GOT Weights
- [Google Drive](https://drive.google.com/drive/folders/1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN?usp=sharing)
- [BaiduYun](https://pan.baidu.com/s/1G4aArpCOt6I_trHv_1SE2g) code: OCR2
## Demo
1. plain-text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr
```
2. formatted-text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format
```
3. fine-grained OCR (by box or by color):
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --box [x1,y1,x2,y2]
```
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --color red/green/blue
```
4. multi-crop OCR:
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /an/image/file.png
```
5. multi-page OCR (the image path contains multiple .png files):
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /images/path/ --multi-page
```
6. render the formatted OCR results:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format --render
```
**Note**:
The rendered results are saved to /results/demo.html. Open demo.html to view them.
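If you prefer to call GOT from Python rather than through the demo scripts, the sketch below shows one way to do it via `transformers` remote code. It assumes the weights are packaged in the upstream GOT-OCR2.0 Hugging Face format, where the remote code exposes a `model.chat` helper; the repo id is only a placeholder for wherever you keep the weights, and none of this is prescribed by the demo commands above.

```python
# A minimal sketch, assuming the upstream GOT-OCR2.0 Hugging Face packaging
# (AutoModel + trust_remote_code exposing a model.chat helper). The repo id
# below is a placeholder; substitute your local weights path or HF repo.
from transformers import AutoModel, AutoTokenizer

model_name = "ucaslcl/GOT-OCR2_0"  # placeholder weight location
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval()

# ocr_type="ocr" for plain text, "format" for formatted output.
result = model.chat(tokenizer, "/an/image/file.png", ocr_type="ocr")
print(result)
```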
## Train
1. This codebase only supports post-training (stage-2/stage-3) on top of our GOT weights.
2. If you want to train from stage-1 as described in our paper, you need this [repo](https://github.com/Ucas-HaoranWei/Vary-tiny-600k).
```Shell
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
  --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json \
  --model_name_or_path /GOT_weights/ \
  --use_im_start_end True \
  --bf16 True \
  --gradient_accumulation_steps 2 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --weight_decay 0. \
  --warmup_ratio 0.001 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --tf32 True \
  --model_max_length 8192 \
  --gradient_checkpointing True \
  --dataloader_num_workers 8 \
  --report_to none \
  --per_device_train_batch_size 2 \
  --num_train_epochs 1 \
  --learning_rate 2e-5 \
  --datasets pdf-ocr+scence \
  --output_dir /your/output.path
```
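For reference, the per-step global batch size follows directly from the flags above; the snippet below just spells out that arithmetic (the GPU count is an assumption, not something the command fixes).

```python
# Effective global batch size implied by the flags above (a quick check;
# adjust num_gpus to your setup -- one 8-GPU node is only an assumption).
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_gpus = 8

print(per_device_train_batch_size * gradient_accumulation_steps * num_gpus)  # 32
```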
**Note**:
1. Change the corresponding data information in constant.py (a sketch of the kind of entry involved follows below).
2. Change line 37 in conversation_dataset_qwen.py to your data_name.
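For note 1, the data registration in constant.py is essentially a name-to-paths mapping. The sketch below only illustrates the shape of such an entry; the dict name, keys, and paths are assumptions rather than the exact contents of the file, so mirror whatever the file actually uses.

```python
# Hypothetical sketch of a dataset entry in constant.py; the dict name,
# keys, and paths are placeholders for illustration only.
CONVERSATION_DATA = {
    "pdf-ocr": {
        "images": "/path/to/pdf_ocr/images/",          # image root folder
        "annotations": "/path/to/pdf_ocr/train.json",  # conversation-style labels
    },
}
```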
## Eval
1. We use the [Fox](https://github.com/ucaslcl/Fox) and [OneChart](https://github.com/LingyvKong/OneChart) benchmarks; other benchmarks can be found in the weights download link.
2. The eval code can be found in GOT/eval.
3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, --num-chunks can be set to 8.
```Shell
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
```
## Contact
If you are interested in this work or have questions about the code or the paper, please join our [Wechat]() communication group.
## Acknowledgement
- [Vary](https://github.com/Ucas-HaoranWei/Vary/): the codebase we built upon!
- [Qwen](https://github.com/QwenLM/Qwen): the LLM base model of Vary, which is good at both English and Chinese!
## Citation
```bibtex
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```