---
datasets:
- ds4sd/DocLayNet
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---

# DETR-layout-detection

We present cmarkea/detr-layout-detection, a model that extracts layout entities (Text, Picture, Caption, Footnote, etc.) from document images.
It is a fine-tuning of [detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50) on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
dataset. The model jointly predicts masks and bounding boxes for document objects, which makes it well suited for processing document corpora before
ingestion into an ODQA (Open-Domain Question Answering) system.

The model detects 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
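
As a quick check, the class names and their ids can be read from the checkpoint's configuration. A minimal sketch using the standard `transformers` config API (it assumes the `id2label` mapping is populated in this checkpoint's config, which is standard for fine-tuned detection models):

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the label set.
config = AutoConfig.from_pretrained("cmarkea/detr-layout-detection")

# id2label maps class indices to the 11 layout entity names listed above.
for idx, name in sorted(config.id2label.items()):
    print(idx, name)
```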

## Performance

In this section, we assess the model's performance on semantic segmentation and object detection separately. In both cases, no post-processing was
applied to the model's predictions.

For semantic segmentation, we use the F1-score to evaluate the per-pixel classification. For object detection, we assess performance using the
Generalized Intersection over Union (GIoU) and the classification accuracy of the predicted bounding boxes. The evaluation is conducted on 500 pages
from the PDF evaluation subset of DocLayNet.
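
For reference, GIoU extends IoU with a penalty based on the smallest enclosing box, so even non-overlapping predictions receive a graded score in [-1, 1]. A minimal sketch for axis-aligned boxes in `(x0, y0, x1, y1)` format (illustrative only; this is not the evaluation code used here):

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes (x0, y0, x1, y1)."""
    # Intersection area.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    # Union area.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest box enclosing both inputs.
    ex0, ey0 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex1, ey1 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex1 - ex0) * (ey1 - ey0)

    # GIoU = IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (enclose - union) / enclose

# Example: two partially overlapping boxes.
print(giou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ -0.079
```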

|      Class     | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
|   Background   |      95.82      |      NA     |        NA       |
|     Caption    |      82.68      |    74.71    |      69.05      |
|    Footnote    |      78.19      |    74.71    |      74.19      |
|     Formula    |      87.25      |    76.31    |      97.79      |
|    List-item   |      81.43      |    77.0     |      90.62      |
|   Page-footer  |      82.01      |    69.86    |      96.64      |
|   Page-header  |      68.32      |    77.68    |      88.3       |
|     Picture    |      81.04      |    81.84    |      90.88      |
| Section-header |      73.52      |    73.46    |      85.96      |
|      Table     |      78.59      |    85.45    |      90.58      |
|      Text      |      91.93      |    83.16    |      91.8       |
|      Title     |      70.38      |    74.13    |      63.33      |

## Benchmark

Now, let's compare the performance of this model with other models.

|      Model                                                                                    | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:---------------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/detr-layout-detection                                                                 |      91.27      |    80.66    |      90.46      |
| [cmarkea/dit-base-layout-detection](https://huggingface.co/cmarkea/dit-base-layout-detection) |      90.77      |    56.29    |      85.26      |

## Direct Use

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers.models.detr import DetrForSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/detr-layout-detection"
)
model = DetrForSegmentation.from_pretrained(
    "cmarkea/detr-layout-detection"
)

# The document page to analyze, e.g. Image.open("page.png").
img: Image.Image

with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)

# Confidence threshold applied to both post-processing steps.
threshold = 0.4

# img.size is (width, height); target_sizes expects (height, width).
segmentation_mask = img_proc.post_process_segmentation(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)

bbox_pred = img_proc.post_process_object_detection(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)
```
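
The object-detection output is a list with one dict per input image, containing `scores`, `labels`, and `boxes` tensors. A short sketch of how the detections could be read out, continuing from the snippet above:

```python
# bbox_pred holds one dict per input image; take the first (and only) one.
detections = bbox_pred[0]
for score, label, box in zip(
    detections["scores"], detections["labels"], detections["boxes"]
):
    x0, y0, x1, y1 = box.tolist()
    print(f"{model.config.id2label[label.item()]}: "
          f"score={score.item():.2f}, "
          f"box=({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f})")
```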

### Example

![example](https://i.postimg.cc/1X6zr216/detr.png)

### Citation

```
@online{DeDetrLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/detr-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```