---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
datasets:
- ds4sd/DocLayNet
---

# DiT-base-layout-detection

We present the model cmarkea/dit-base-layout-detection, which extracts different layout entities (Text, Picture, Caption, Footnote, etc.) from an image of a document.
It is a fine-tuning of the model [dit-base](https://huggingface.co/microsoft/dit-base) on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
dataset. It is well suited for processing document corpora to be ingested into an
Open-Domain Question Answering (ODQA) system.

The model detects 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
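
The class names and their integer ids are stored in the checkpoint configuration, so the mapping can be inspected directly. A quick sketch (the exact ids come from the model config, with id 0 typically reserved for the background):

```python
from transformers import BeitForSemanticSegmentation

model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
print(model.config.id2label)  # integer class id -> entity name
```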

## Performance

In this section, we assess the model's performance on semantic segmentation and object detection separately. No post-processing was applied to the semantic
segmentation output; for object detection, we only applied OpenCV's `findContours`, with no further post-processing.

For semantic segmentation, we use the per-pixel F1-score. For object detection, we measure the Generalized Intersection over Union (GIoU) and the accuracy
of the predicted bounding-box class. The evaluation is conducted on 500 pages from the PDF evaluation set of DocLayNet.
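
For reference, here is a minimal sketch of how GIoU can be computed for two axis-aligned boxes in `[x1, y1, x2, y2]` format. This illustrates the metric itself, not the exact evaluation code used for the numbers below:

```python
def giou(box_a: list, box_b: list) -> float:
    """Generalized IoU between two non-degenerate [x1, y1, x2, y2] boxes."""
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union of the two boxes
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest enclosing box
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    # GIoU = IoU minus the fraction of the enclosing box not covered by the union
    return inter / union - (c_area - union) / c_area
```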

|      Class     | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
|   Background   |      94.98      |      NA     |        NA       |
|     Caption    |      75.54      |    55.61    |      72.62      |
|    Footnote    |      72.29      |    50.08    |      70.97      |
|     Formula    |      82.29      |    49.91    |      94.48      |
|    List-item   |      67.56      |    35.19    |      69         |
|   Page-footer  |      83.93      |    57.99    |      94.06      |
|   Page-header  |      62.33      |    65.25    |      79.39      |
|     Picture    |      78.32      |    58.22    |      92.71      |
| Section-header |      69.55      |    56.64    |      78.29      |
|      Table     |      83.69      |    63.03    |      90.13      |
|      Text      |      90.94      |    51.89    |      88.09      |
|      Title     |      61.19      |    52.64    |      70         |

## Benchmark

Now, let's compare this model's performance with that of other layout-detection models.

|      Model                                                                                    | f1-score (x100) | GIoU (x100) | accuracy (x100) |
|:---------------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/dit-base-layout-detection                                                             |      90.77      |    56.29    |      85.26      |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection)         |      91.27      |    80.66    |      90.46      |

## Direct Use

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

img: Image.Image  # the PIL image of the document page to analyze

with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)

# One segmentation map per image, resized to the original page size;
# each pixel holds a class id.
segmentation = img_proc.post_process_semantic_segmentation(
    output,
    target_sizes=[img.size[::-1]]  # PIL size is (width, height)
)
```

Below is a simple method for deriving bounding boxes from the semantic segmentation output. It is the same method used to compute the object-detection
performance reported in the "Performance" section, with no additional post-processing.

```python
import cv2
import numpy as np

def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding box detection function based on the external
    contours of a binary mask.
    """
    detected_blocks = []
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        if len(contour) >= 4:
            # smallest upright rectangle containing all contour points
            x, y, width, height = cv2.boundingRect(contour)
            detected_blocks.append([x, y, x + width, y + height])
    return detected_blocks

bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # skip class 0 (background) and scan each entity class separately
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii] * len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
```
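
To inspect the predictions visually, the detected boxes can be drawn back onto the page. Here is a minimal sketch using PIL's `ImageDraw`; the variables `img`, `model`, and `bbox_pred` come from the snippets above, and the color, line width, and output filename are arbitrary choices:

```python
from PIL import ImageDraw

id2label = model.config.id2label  # maps class ids back to entity names

annotated = img.convert("RGB").copy()
draw = ImageDraw.Draw(annotated)
for box, label in zip(bbox_pred[0]["boxes"], bbox_pred[0]["labels"]):
    draw.rectangle(box, outline="red", width=2)
    # place the class name just above the box, clamped to the page
    draw.text((box[0], max(0, box[1] - 12)), id2label[label], fill="red")
annotated.save("layout_prediction.png")
```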

## Example

![example](https://i.postimg.cc/rFXswV59/dit1.png)

## Citation

```
@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```