---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
datasets:
- ds4sd/DocLayNet
---
# dit-base-layout-detection

We present **cmarkea/dit-base-layout-detection**, a model that extracts the different layout elements (Text, Picture, Caption, Footnote, etc.) from a document image.
It is a fine-tuned version of [dit-base](https://huggingface.co/microsoft/dit-base) on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
dataset, making it well suited for processing document corpora to be ingested into an ODQA system.

The model can extract 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
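For reference, the full class-index mapping can be read directly from the model configuration; a minimal sketch, assuming only that the checkpoint is reachable through `transformers`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("cmarkea/dit-base-layout-detection")
# id2label maps the segmentation head's class indices to entity names
print(config.id2label)
```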
## Performance

In this section, we assess the model's performance on semantic segmentation and object detection separately. No post-processing was applied to the semantic
segmentation output; for object detection, we only applied OpenCV's `findContours`, with no further post-processing.

For semantic segmentation, we use the F1-score to evaluate the classification of each pixel. For object detection, we report the Generalized Intersection over
Union (GIoU) and the accuracy of the predicted bounding-box class. The evaluation is conducted on 500 pages from the PDF evaluation dataset of DocLayNet.
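For context, GIoU extends plain IoU with a penalty based on the smallest box enclosing both prediction and ground truth, so even non-overlapping boxes are ranked by how far apart they are. Below is a minimal sketch of the metric (not the evaluation code itself), assuming boxes in `[x1, y1, x2, y2]` format and non-degenerate areas:

```python
def generalized_iou(box_a, box_b) -> float:
    """GIoU between two boxes given as [x1, y1, x2, y2]."""
    # intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest box enclosing both inputs
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclosing = (cx2 - cx1) * (cy2 - cy1)
    # GIoU = IoU - |C \ (A ∪ B)| / |C|
    return inter / union - (enclosing - union) / enclosing
```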
| Class          | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
| Background     | 94.98           | NA          | NA              |
| Caption        | 75.54           | 55.61       | 72.62           |
| Footnote       | 72.29           | 50.08       | 70.97           |
| Formula        | 82.29           | 49.91       | 94.48           |
| List-item      | 67.56           | 35.19       | 69.00           |
| Page-footer    | 83.93           | 57.99       | 94.06           |
| Page-header    | 62.33           | 65.25       | 79.39           |
| Picture        | 78.32           | 58.22       | 92.71           |
| Section-header | 69.55           | 56.64       | 78.29           |
| Table          | 83.69           | 63.03       | 90.13           |
| Text           | 90.94           | 51.89       | 88.09           |
| Title          | 61.19           | 52.64       | 70.00           |
## Benchmark
Now, let's compare the performance of this model with that of other models.
| Model                                                                                 | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:-------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/dit-base-layout-detection                                                     | 90.77           | 56.29       | 85.26           |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection) | 91.27           | 80.66       | 90.46           |
### Direct Use
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# load the document page to analyze (path is a placeholder)
img = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)
    # resize the logits back to the original resolution;
    # PIL's size is (W, H), while target_sizes expects (H, W)
    segmentation = img_proc.post_process_semantic_segmentation(
        output,
        target_sizes=[img.size[::-1]]
    )
```
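`segmentation[0]` is an `(H, W)` tensor of per-pixel class indices at the original image resolution. As a quick sanity check (an addition to the snippet above, not part of the original card), the indices found on the page can be mapped back to entity names:

```python
present = segmentation[0].unique().tolist()
# .get guards against indices absent from id2label (e.g. background)
print([model.config.id2label.get(int(i), "background") for i in present])
```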
Below is a simple method for detecting bounding boxes from the semantic segmentation output. It is the method used to compute the model's object-detection
performance reported in the "Performance" section, with no additional post-processing.
```python
import cv2
import numpy as np


def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding-box detection function based on OpenCV contours.
    """
    detected_blocks = []
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        if len(contour) >= 4:
            # smallest rectangle containing all points of the contour
            x, y, width, height = cv2.boundingRect(contour)
            bounding_box = [x, y, x + width, y + height]
            detected_blocks.append(bounding_box)
    return detected_blocks


bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # iterate over all class ids, skipping the background class (id 0)
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii] * len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
```
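To inspect the result visually, the predicted boxes can be drawn back onto the page with PIL. A minimal sketch, reusing `img` from the first snippet; the output filename and styling are illustrative only:

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(img)
for box, label in zip(bbox_pred[0]["boxes"], bbox_pred[0]["labels"]):
    # box is [x1, y1, x2, y2] in original image coordinates
    draw.rectangle(box, outline="red", width=2)
    draw.text((box[0], box[1]), model.config.id2label.get(label, str(label)), fill="red")
img.save("layout_pred.png")
```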
### Example
![example](https://i.postimg.cc/rFXswV59/dit1.png)
### Citation
```
@online{DeDitLay,
AUTHOR = {Cyrile Delestre},
URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
YEAR = {2024},
KEYWORDS = {Image Processing ; Transformers ; Layout},
}
``` |