---
title: alps
app_file: app.py
sdk: gradio
sdk_version: 4.44.0
---
# Alps

Pipeline for OCRing PDFs and tables

This repository contains different OCR methods using various libraries/models.

## Running Gradio:
Run `python app.py` in a terminal.


## Installation
Build the Docker image and run the container.

Clone this repository and install the required dependencies:
```
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117

apt install weasyprint

```
Note: You need a GPU to run this code.
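
Before running the pipeline, it can help to confirm that a GPU is actually visible. The snippet below is a minimal sketch, assuming PyTorch is installed through the requirements above; the helper name `gpu_available` is hypothetical and not part of this repository.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumed to be pulled in by requirements.txt
    except ImportError:
        # PyTorch is missing, so no GPU can be used from Python.
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("CUDA GPU visible:", gpu_available())
```

If this prints `False`, the CUDA build of PyTorch or the GPU driver is probably not set up correctly.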

## Example Usage 

Run `python main.py` inside the repository directory. Provide the path to the test file (the file must be placed inside the repository, and its path must be relative to the repository root, alps). Next, provide the folder in which to save intermediate outputs from the run (for example, cell bounding boxes drawn on the table, or table detection results shown in the PDF), and specify which component to run.

Outputs are printed in the terminal.

```
usage: main.py [-h] [--test_file TEST_FILE] [--debug_folder DEBUG_FOLDER] [--englishFlag ENGLISHFLAG] [--denoise DENOISE] ocr

```
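
For readers who want to see how such a command line could be declared, here is a minimal `argparse` sketch that accepts the same arguments as the usage string above. It is not the parser from `main.py`: the help texts, the defaults, and the `str_to_bool` helper are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert strings such as 'True' or 'false' to a boolean (hypothetical helper)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the arguments listed in the usage string."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Positional argument: the component to run, e.g. ocr1 or pdftable2.
    parser.add_argument("ocr", help="component to run")
    parser.add_argument("--test_file", help="path to the input file, relative to the repository")
    parser.add_argument("--debug_folder", help="folder for intermediate outputs")
    # Defaults below are assumptions; the real defaults live in main.py.
    parser.add_argument("--englishFlag", type=str_to_bool, default=False)
    parser.add_argument("--denoise", type=str_to_bool, default=False)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
example_args = build_parser().parse_args(
    ["ocr1", "--test_file", "TestingFiles/OCRTest3English.pdf", "--englishFlag", "True"]
)
print(example_args.ocr, example_args.test_file, example_args.englishFlag)
```

Flags such as `--englishFlag True` pass the value as a string, which is why the sketch converts it explicitly rather than using `type=bool`, which would treat any non-empty string as true.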
Description of the components:

### ocr1 

Input: Path to a PDF file
Output: Dictionary mapping each page to a list of line_annotations. Each LineAnnotation contains the bbox of a line and a list of its child wordAnnotation objects. Each wordAnnotation contains a bbox and the text inside it.
What it does: Runs the Ragflow textline detector + OCR with DocTR

Example: 
```
python main.py ocr1 --test_file TestingFiles/OCRTest1German.pdf --debug_folder ./res/ocrdebug1/ 
python main.py ocr1 --test_file TestingFiles/OCRTest3English.pdf --debug_folder ./res/ocrdebug1/ --englishFlag True
```
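
The nested output described for ocr1 can be pictured with the sketch below. The class and field names (`WordAnnotation`, `LineAnnotation`, `bbox`, `text`, `words`) are assumptions chosen to mirror the description; the actual definitions in this repository may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as [xmin, ymin, xmax, ymax], matching the bbox section of this README.
BBox = Tuple[float, float, float, float]


@dataclass
class WordAnnotation:
    """One recognized word: its bounding box and text (hypothetical layout)."""
    bbox: BBox
    text: str


@dataclass
class LineAnnotation:
    """One detected text line and the words it contains (hypothetical layout)."""
    bbox: BBox
    words: List[WordAnnotation] = field(default_factory=list)


def page_text(lines: List[LineAnnotation]) -> str:
    """Join the words of each line with spaces and the lines with newlines."""
    return "\n".join(" ".join(word.text for word in line.words) for line in lines)


# A tiny hand-made result: page index -> list of line annotations.
result: Dict[int, List[LineAnnotation]] = {
    0: [
        LineAnnotation(
            bbox=(10, 10, 200, 30),
            words=[
                WordAnnotation(bbox=(10, 10, 90, 30), text="Hello"),
                WordAnnotation(bbox=(100, 10, 200, 30), text="world"),
            ],
        )
    ]
}
print(page_text(result[0]))
```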

### table1
Input: File path to an image of a cropped table
Output: Parsed table in HTML form
What it does: Uses Unitable + DocTR

```
python main.py table1 --test_file cropped_table.png --debug_folder ./res/table1/ 

```

### table2 
Input: File path to an image of a cropped table
Output: Parsed table in HTML form
What it does: Uses Unitable

```
python main.py table2 --test_file cropped_table.png  --debug_folder ./res/table2/ 

```
### pdftable1
Input: PDF file path
Output: Parsed table in HTML form
What it does: Uses Unitable + DocTR


```
python main.py pdftable1 --test_file TestingFiles/OCRTest5English.pdf --debug_folder ./res/table_debug1/
```


### pdftable2
Input: PDF file path
Output: Parsed table in HTML form
What it does: Detects tables and parses them; runs full Unitable table detection

```
python main.py pdftable2 --test_file TestingFiles/OCRTest5English.pdf --debug_folder ./res/table_debug2/
```


### pdftable3
Input: PDF file path
Output: Parsed table in HTML form
What it does: Detects tables with YOLO, then runs Unitable + DocTR

```
python main.py pdftable3 --test_file TestingFiles/TableOCRTestEnglish.pdf --debug_folder ./res/poor_relief2
```


### pdftable4 
Input: PDF file path
Output: Parsed table in HTML form
What it does: Detects tables with YOLO, then runs full DocTR table detection

```
python main.py pdftable4 --test_file TestingFiles/TableOCRTestEasier.pdf --debug_folder ./res/table_debug3/
```


## bbox
Bounding boxes are ordered as [xmin, ymin, xmax, ymax]. Coordinates are measured from (0,0), which is the upper left corner of the image, so:

xmin, ymin - upper left corner
xmax, ymax - lower right corner

![alt text](image-2.png)
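
To make the convention concrete, the sketch below derives the width, height, and four corners of a box given as [xmin, ymin, xmax, ymax]. The function names are hypothetical and are not taken from this repository.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # [xmin, ymin, xmax, ymax]


def bbox_size(bbox: BBox) -> Tuple[int, int]:
    """Return (width, height) of a box in pixels."""
    xmin, ymin, xmax, ymax = bbox
    return xmax - xmin, ymax - ymin


def bbox_corners(bbox: BBox) -> Dict[str, Tuple[int, int]]:
    """Return the four corners as (x, y) points.

    The origin (0, 0) is the upper left corner of the image, so y grows downward:
    the smaller y value is the top edge and the larger y value is the bottom edge.
    """
    xmin, ymin, xmax, ymax = bbox
    return {
        "upper_left": (xmin, ymin),
        "upper_right": (xmax, ymin),
        "lower_left": (xmin, ymax),
        "lower_right": (xmax, ymax),
    }


box: BBox = (20, 40, 120, 90)
print(bbox_size(box))
print(bbox_corners(box)["lower_right"])
```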