---
datasets:
- ds4sd/DocLayNet
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---
# DETR-layout-detection
The model cmarkea/detr-layout-detection extracts layout elements (Text, Picture, Caption, Footnote, etc.) from an image of a document.
It is a fine-tuned version of [detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50) trained on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
dataset. The model jointly predicts masks and bounding boxes for document objects, which makes it well suited to processing document corpora before they are ingested into an
open-domain question answering (ODQA) system.
The model detects 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
## Performance
This section assesses the model's performance on semantic segmentation and on object detection separately. In both cases, no post-processing was
applied to the model's predictions.
For semantic segmentation, the F1-score evaluates the classification of each pixel. For object detection, performance is measured with the
Generalized Intersection over Union (GIoU) and the accuracy of the predicted bounding-box class. The evaluation is conducted on 500 pages from the PDF evaluation
dataset of DocLayNet.
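
To make the two metrics concrete, here is a minimal pure-Python sketch of a per-class pixel F1-score and of the GIoU between two boxes. The card does not publish its evaluation code, so this is not the authors' implementation: the function names `pixel_f1` and `giou`, the label-map representation (lists of rows of class ids), and the `(x_min, y_min, x_max, y_max)` box format with positive area are all assumptions made for illustration.

```python
def pixel_f1(pred_mask, true_mask, class_id):
    """F1-score of one class over all pixels of two label maps (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred == class_id and true == class_id:
                tp += 1  # pixel correctly assigned to the class
            elif pred == class_id:
                fp += 1  # pixel wrongly assigned to the class
            elif true == class_id:
                fn += 1  # pixel of the class that was missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def giou(box_a, box_b):
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Overlap is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    # IoU minus the share of the enclosing box covered by neither input.
    return inter / union - (hull - union) / hull


# Toy inputs, invented for illustration only.
pred = [[0, 1, 1], [0, 1, 0]]
true = [[0, 1, 0], [0, 1, 1]]
print(pixel_f1(pred, true, class_id=1))
print(giou((0, 0, 2, 2), (1, 1, 3, 3)))
```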
| Class | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
| Background | 95.82 | NA | NA |
| Caption | 82.68 | 74.71 | 69.05 |
| Footnote | 78.19 | 74.71 | 74.19 |
| Formula | 87.25 | 76.31 | 97.79 |
| List-item | 81.43 | 77.00 | 90.62 |
| Page-footer | 82.01 | 69.86 | 96.64 |
| Page-header | 68.32 | 77.68 | 88.30 |
| Picture | 81.04 | 81.84 | 90.88 |
| Section-header | 73.52 | 73.46 | 85.96 |
| Table | 78.59 | 85.45 | 90.58 |
| Text | 91.93 | 83.16 | 91.80 |
| Title | 70.38 | 74.13 | 63.33 |
## Benchmark
The table below compares the overall performance of this model with that of another layout-detection model.
| Model | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:---------------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/detr-layout-detection | 91.27 | 80.66 | 90.46 |
| [cmarkea/dit-base-layout-detection](https://huggingface.co/cmarkea/dit-base-layout-detection) | 90.77 | 56.29 | 85.26 |
## Direct Use
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers.models.detr import DetrForSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/detr-layout-detection"
)
model = DetrForSegmentation.from_pretrained(
    "cmarkea/detr-layout-detection"
)

# Placeholder path (not part of the model): any document page as a PIL image.
img = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)

# Minimum confidence score for a prediction to be kept.
threshold = 0.4

# PIL reports (width, height); the post-processing expects (height, width).
segmentation_mask = img_proc.post_process_segmentation(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)

bbox_pred = img_proc.post_process_object_detection(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)
```
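
`post_process_object_detection` returns one dictionary per image with the keys `scores`, `labels`, and `boxes`, where each box is given as `(x_min, y_min, x_max, y_max)` in pixels of the target size. The following sketch shows one way to turn such a dictionary into readable records. The helper `format_detections` is hypothetical, and the toy values and class indices below are invented for illustration; with the real model, pass `bbox_pred[0]` and `model.config.id2label`.

```python
def to_list(values):
    """Convert a tensor-like object (anything with .tolist()) or a sequence to a list."""
    return values.tolist() if hasattr(values, "tolist") else list(values)


def format_detections(result, id2label):
    """Return detections as dictionaries, sorted by decreasing confidence."""
    scores = to_list(result["scores"])
    labels = to_list(result["labels"])
    boxes = to_list(result["boxes"])
    detections = []
    for score, label, box in zip(scores, labels, boxes):
        detections.append({
            # Fall back to the raw index if the mapping has no entry for it.
            "label": id2label.get(int(label), str(label)),
            "score": round(float(score), 3),
            "box": [round(float(v), 1) for v in box],
        })
    detections.sort(key=lambda d: d["score"], reverse=True)
    return detections


# Toy stand-in for bbox_pred[0]; the indices 8 and 9 are assumptions,
# not the model's actual label mapping.
example = {
    "scores": [0.55, 0.91],
    "labels": [9, 8],
    "boxes": [[10, 20, 110, 60], [15, 80, 300, 240]],
}
for det in format_detections(example, {8: "Table", 9: "Text"}):
    print(det["label"], det["score"], det["box"])
```

Accepting anything with a `.tolist()` method keeps the helper usable with both PyTorch tensors and plain lists.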
### Citation
```
@online{DeDetrLay,
AUTHOR = {Cyrile Delestre},
URL = {https://huggingface.co/cmarkea/detr-base-layout-detection},
YEAR = {2024},
KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```