|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: image-segmentation |
|
datasets: |
|
- ds4sd/DocLayNet |
|
--- |
|
|
|
# DiT-base-layout-detection
|
|
|
We present cmarkea/dit-base-layout-detection, a model that extracts layout entities (Text, Picture, Caption, Footnote, etc.) from document images.
|
It is a fine-tuned version of [dit-base](https://huggingface.co/microsoft/dit-base) trained on the [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet) dataset, making it well suited for preprocessing document corpora before ingestion into an ODQA (Open-Domain Question Answering) system.
|
|
|
The model detects 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
|
|
|
## Performance |
|
|
|
In this section, we assess the model's performance on semantic segmentation and object detection separately. No post-processing was applied to the semantic segmentation output. For object detection, we only applied OpenCV's `findContours` to the segmentation masks, with no further post-processing.
|
|
|
For semantic segmentation, we use the per-pixel F1-score. For object detection, we measure the Generalized Intersection over Union (GIoU) and the classification accuracy of the predicted bounding boxes. The evaluation was conducted on 500 pages from DocLayNet's PDF evaluation set.
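
To make these criteria concrete, here is a minimal sketch of the GIoU computation for two axis-aligned boxes in `[x1, y1, x2, y2]` format. It is an illustrative implementation, not the exact evaluation script; the per-pixel F1-score can be computed with standard tooling such as scikit-learn's `f1_score`.

```python
def generalized_iou(box_a, box_b):
    """
    GIoU = IoU - (C - U) / C, where U is the union area of the two
    boxes and C is the area of the smallest box enclosing both.
    """
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union of the two box areas
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest enclosing box
    c_area = (
        (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0]))
        * (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1]))
    )
    return inter / union - (c_area - union) / c_area
```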
|
|
|
| Class          | F1-score (x100) | GIoU (x100) | Accuracy (x100) |
|:--------------:|:---------------:|:-----------:|:---------------:|
| Background     | 94.98           | NA          | NA              |
| Caption        | 75.54           | 55.61       | 72.62           |
| Footnote       | 72.29           | 50.08       | 70.97           |
| Formula        | 82.29           | 49.91       | 94.48           |
| List-item      | 67.56           | 35.19       | 69.00           |
| Page-footer    | 83.93           | 57.99       | 94.06           |
| Page-header    | 62.33           | 65.25       | 79.39           |
| Picture        | 78.32           | 58.22       | 92.71           |
| Section-header | 69.55           | 56.64       | 78.29           |
| Table          | 83.69           | 63.03       | 90.13           |
| Text           | 90.94           | 51.89       | 88.09           |
| Title          | 61.19           | 52.64       | 70.00           |
|
|
|
## Benchmark |
|
|
|
Now, let's compare this model's performance with that of other models.
|
|
|
| Model                                                                                  | F1-score (x100) | GIoU (x100) | Accuracy (x100) |
|:--------------------------------------------------------------------------------------:|:---------------:|:-----------:|:---------------:|
| cmarkea/dit-base-layout-detection                                                      | 90.77           | 56.29       | 85.26           |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection) | 91.27           | 80.66       | 90.46           |
|
|
|
### Direct Use |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# Load the document page to analyze (placeholder path).
img = Image.open("path/to/your/document.png").convert("RGB")

with torch.inference_mode():
    # The image processor returns pixel values, not token ids.
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)

# Resize the predicted masks to the original image size;
# PIL's img.size is (width, height), the processor expects (height, width).
segmentation = img_proc.post_process_semantic_segmentation(
    output,
    target_sizes=[img.size[::-1]]
)
```
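
The post-processed output is a list containing one `(height, width)` tensor of class indices per input image. As a quick sanity check, the predicted indices can be mapped back to their class names (a minimal sketch using the standard `transformers` `id2label` config mapping):

```python
# List the layout classes detected on the page;
# segmentation[0] is a (height, width) tensor of class indices.
for idx in segmentation[0].unique().tolist():
    print(idx, model.config.id2label[idx])
```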
|
|
|
Below is a simple method for deriving bounding boxes from the semantic segmentation output. It is the same method used to measure the model's object-detection performance in the "Performance" section, with no additional post-processing.
|
|
|
```python
import cv2
import numpy as np


def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding box detection function based on the external
    contours of a binary mask.
    """
    detected_blocks = []
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        # keep only contours with enough points to form a meaningful box
        if len(contour) >= 4:
            # smallest rectangle containing all points of the contour
            x, y, width, height = cv2.boundingRect(contour)
            bounding_box = [x, y, x + width, y + height]
            detected_blocks.append(bounding_box)
    return detected_blocks


bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # skip index 0, which corresponds to the background class
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii] * len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
```
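
For a quick visual check, the predicted boxes can be drawn back onto the page with Pillow (a minimal sketch; the output filename is an arbitrary choice):

```python
from PIL import ImageDraw

# Draw each predicted box and its class name on a copy of the page.
annotated = img.copy()
draw = ImageDraw.Draw(annotated)
for box, label in zip(bbox_pred[0]["boxes"], bbox_pred[0]["labels"]):
    draw.rectangle(box, outline="red", width=2)
    draw.text(
        (box[0], max(box[1] - 12, 0)),
        model.config.id2label[label],
        fill="red",
    )
annotated.save("layout_annotated.png")
```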
|
|
|
### Example |
|
|
|
![example](https://i.postimg.cc/rFXswV59/dit1.png) |
|
|
|
### Citation |
|
|
|
```
@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```