metadata

library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation

DIT-base-layout-detection

We present cmarkea/dit-base-layout-detection, a model that extracts layout elements (Text, Picture, Caption, Footnote, etc.) from an image of a document. It is a fine-tuned version of dit-base, trained on the DocLayNet dataset. The model jointly predicts masks and bounding boxes for document objects, which makes it well suited to processing document corpora for ingestion into an open-domain question answering (ODQA) system.

The model recognizes 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.

Performance

In this section, we assess the model's performance on semantic segmentation and object detection separately. No post-processing is applied for semantic segmentation. For object detection, bounding boxes are obtained with OpenCV's findContours, without any further post-processing.

For semantic segmentation, we use the F1-score to evaluate per-pixel classification. For object detection, we report the Generalized Intersection over Union (GIoU) and the accuracy of the predicted bounding-box class. The evaluation is conducted on 500 pages from the PDF evaluation dataset of DocLayNet.
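
To make the two metrics concrete, here is a minimal pure-Python sketch of a per-class pixel F1-score and of the GIoU between two boxes. The function names `pixel_f1` and `giou`, the `(x_min, y_min, x_max, y_max)` box format, and the flattened-list inputs are illustrative assumptions, not the evaluation code behind the reported scores. How predicted boxes were matched to ground-truth boxes is not specified here, so that step is left out.

```python
def pixel_f1(pred, truth, class_id):
    """F1-score of one class over two flat sequences of per-pixel class ids."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == class_id and t == class_id:
            tp += 1          # pixel correctly assigned to the class
        elif p == class_id:
            fp += 1          # predicted as the class, but labeled otherwise
        elif t == class_id:
            fn += 1          # labeled as the class, but predicted otherwise
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def giou(box_a, box_b):
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes, in [-1, 1].

    Assumes both boxes have strictly positive width and height.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Overlap is clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (hull - union) / hull
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes more negative as the boxes move apart. Multiply either value by 100 to compare with the scale used in the tables.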

Class	f1-score (x100)	GIoU (x100)	accuracy (x100)
Background	94.98	NA	NA
Caption	75.54	55.61	72.62
Footnote	72.29	50.08	70.97
Formula	82.29	49.91	94.48
List-item	67.56	35.19	69.00
Page-footer	83.93	57.99	94.06
Page-header	62.33	65.25	79.39
Picture	78.32	58.22	92.71
Section-header	69.55	56.64	78.29
Table	83.69	63.03	90.13
Text	90.94	51.89	88.09
Title	61.19	52.64	70.00

Benchmark

The table below compares this model with cmarkea/detr-layout-detection.

Model	f1-score (x100)	GIoU (x100)	accuracy (x100)
cmarkea/dit-base-layout-detection	90.77	56.29	85.26
cmarkea/detr-layout-detection	84.23	43.84	71.98

Direct Use

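import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
# Load the model with its segmentation head: plain AutoModel would return
# the bare backbone, whose outputs carry no logits to post-process.
model = AutoModelForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# Placeholder path: replace with your own document page image.
img = Image.open("document.png").convert("RGB")

with torch.inference_mode():
    inputs = img_proc(img, return_tensors="pt")
    outputs = model(**inputs)

# Returns one (height, width) tensor of class ids per input image.
segmentation_mask = img_proc.post_process_semantic_segmentation(
    outputs,
    target_sizes=[img.size[::-1]]  # PIL size is (width, height)
)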
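
The object detection results reported above rely on OpenCV's findContours to turn masks into boxes. As a dependency-free stand-in for that step (not the author's code), the sketch below groups same-class pixels into 8-connected regions with a flood fill and returns each region's bounding box. The helper name `extract_boxes` is hypothetical, and treating class id 0 as Background is an assumption; check `model.config.id2label` for the actual mapping.

```python
from collections import deque


def extract_boxes(mask, background_id=0):
    """Return [(class_id, (x_min, y_min, x_max, y_max)), ...] for a 2D mask.

    `mask` is a list of rows of integer class ids. Box coordinates are
    inclusive pixel indices. `background_id=0` is an assumption.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            cls = mask[y0][x0]
            if seen[y0][x0] or cls == background_id:
                continue
            # Breadth-first flood fill over 8-connected pixels of this class.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and not seen[ny][nx]
                                and mask[ny][nx] == cls):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((cls, (x_min, y_min, x_max, y_max)))
    return boxes


# Toy 4x5 mask with one region of class 1 and one region of class 2.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
]
toy_boxes = extract_boxes(toy_mask)
```

With the snippet above, `extract_boxes(segmentation_mask[0].tolist())` would produce boxes for the first page. A pure-Python loop is slow on full-resolution pages, so OpenCV or another vectorized routine is preferable in practice.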

Citation

@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}