---
license: cc-by-4.0
datasets:
- CATMuS/medieval-segmentation
pipeline_tag: object-detection
tags:
- medieval
- manuscript
---

# Florence 2 Medieval Zone Object Detection

This is Microsoft's Florence 2 model, fine-tuned for 10 epochs on the [CATMuS Medieval Segmentation dataset](https://huggingface.co/datasets/CATMuS/medieval-segmentation) with a learning rate of `1e-6`. This model would not be possible without the numerous annotators behind the various datasets available on HTR-United (see the dataset for details). Special thanks to [Thibault Clérice](https://huggingface.co/ponteineptique), who converted the original CATMuS dataset (for HTR) to a segmentation dataset.

# Model Details

- **Developed by**: [William J.B. Mattingly](https://huggingface.co/wjbmattingly)
- **License**: CC-BY 4.0
- **Finetuned from model**: [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)

## Labels

The following table lists the labels, marks the ones used to train this model, gives the count of each label in the train, validation, and test splits (a label can occur several times per image), and provides each label's definition with a link to the original documentation.

| Label | Zone | Line | Train Count | Validation Count | Test Count | Definition |
|-------|------|------|-------------|------------------|------------|------------|
| DefaultLine |  | ✓ | 81702 | 13554 | 12209 | [A line of text that is not distinguished by any particular features and is part of the main text flow.](https://segmonto.github.io/gd/gdL/DefaultLine/) |
| InterlinearLine |  | ✓ | 2808 | 27 | 2234 | [A line of text written between two lines of main text, typically containing glosses, translations, or comments.](https://segmonto.github.io/gd/gdL/InterlinearLine/) |
| MainZone | ✓ |  | 2314 | 365 | 275 | [The main textual zone of a page, usually containing the main body of text.](https://segmonto.github.io/gd/gdZ/MainZone/) |
| HeadingLine |  | ✓ | 1381 | 701 | 135 | [A line of text that functions as a heading or title for a section of the main text.](https://segmonto.github.io/gd/gdL/HeadingLine/) |
| MarginTextZone | ✓ |  | 916 | 146 | 199 | [A text zone in the margin of a page, often containing annotations, commentaries, or other secondary information.](https://segmonto.github.io/gd/gdZ/MarginTextZone/) |
| DropCapitalZone | ✓ |  | 1566 | 102 | 124 | [A zone containing a large ornamental initial letter of a paragraph or section, typically extending below the first line of text.](https://segmonto.github.io/gd/gdZ/DropCapitalZone/) |
| NumberingZone | ✓ |  | 632 | 102 | 94 | [A zone containing page numbers, folio numbers, or other numerical identifiers for the page.](https://segmonto.github.io/gd/gdZ/NumberingZone/) |
| TironianSignLine |  |  | 282 | 0 | 0 | [A line containing Tironian notes, an ancient system of shorthand.](https://segmonto.github.io/gd/gdL/TironianSignLine/) |
| DropCapitalLine |  |  | 1175 | 105 | 92 | [A line of text that begins with a drop capital.](https://segmonto.github.io/gd/gdL/DropCapitalLine/) |
| RunningTitleZone | ✓ |  | 340 | 91 | 18 | [A zone containing a running title, typically located at the top of a page and repeating throughout a section or the entire document.](https://segmonto.github.io/gd/gdZ/RunningTitleZone/) |
| GraphicZone | ✓ |  | 300 | 7 | 10 | [A zone containing non-textual elements such as images, drawings, or decorative elements.](https://segmonto.github.io/gd/gdZ/GraphicZone/) |
| DigitizationArtefactZone |  |  | 28 | 0 | 0 | [A zone containing artefacts from the digitization process, such as color bars or reference marks.](https://segmonto.github.io/gd/gdZ/DigitizationArtefactZone/) |
| QuireMarksZone | ✓ |  | 86 | 9 | 8 | [A zone containing marks used to indicate the gathering or quire to which a leaf belongs, often found at the bottom of the page.](https://segmonto.github.io/gd/gdZ/QuireMarksZone/) |
| StampZone | ✓ |  | 39 | 5 | 4 | [A zone containing a stamp, such as a library stamp or ownership mark.](https://segmonto.github.io/gd/gdZ/StampZone/) |
| DamageZone | ✓ |  | 12 | 1 | 0 | [A zone indicating an area of the page that has been damaged or is otherwise illegible due to physical deterioration.](https://segmonto.github.io/gd/gdZ/DamageZone/) |
| MusicZone | ✓ |  | 179 | 0 | 0 | [A zone containing musical notation.](https://segmonto.github.io/gd/gdZ/MusicZone/) |
| MusicLine |  |  | 167 | 0 | 0 | [A line containing musical notation.](https://segmonto.github.io/gd/gdL/MusicLine/) |
| TitlePageZone | ✓ |  | 4 | 1 | 1 | [A zone encompassing the entire title page of a book or document.](https://segmonto.github.io/gd/gdZ/TitlePageZone/) |
| SealZone | ✓ |  | 3 | 0 | 0 | [A zone containing a seal, typically used for authentication or closure of a document.](https://segmonto.github.io/gd/gdZ/SealZone/) |


# How to Get Started with the Model

Use the code below to get started with the model. All models are trained with float16.

```python
import os
from unittest.mock import patch

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.dynamic_module_utils import get_imports
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Mac solution => https://huggingface.co/microsoft/Florence-2-large-ft/discussions/4
def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    """Work around for https://huggingface.co/microsoft/phi-1_5/discussions/72."""
    imports = get_imports(filename)
    # Drop the flash_attn requirement for Florence-2's modeling file only,
    # and only when it is listed (list.remove raises ValueError otherwise).
    if str(filename).endswith("/modeling_florence2.py") and "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports


with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):

    model = AutoModelForCausalLM.from_pretrained("medieval-data/florence2-medieval-bbox-zone-detection", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("medieval-data/florence2-medieval-bbox-zone-detection", trust_remote_code=True)

def process_image(url):
    prompt = "<OD>"

    # Download the image and normalize it to 3-channel RGB before processing
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    result = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
    return result, image


url = "https://huggingface.co/datasets/CATMuS/medieval-segmentation/resolve/main/data/train/cambridge-corpus-christi-college-ms-111/page-002-of-003.jpg"

result, image = process_image(url)
fig, ax = plt.subplots(1, figsize=(15, 15))
ax.imshow(image)

# Add bounding boxes and labels to the plot
for bbox, label in zip(result['<OD>']['bboxes'], result['<OD>']['labels']):
    x, y, width, height = bbox[0], bbox[1], bbox[2] - bbox[0], bbox[3] - bbox[1]
    rect = patches.Rectangle((x, y), width, height, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    plt.text(x, y, label, fontsize=12, bbox=dict(facecolor='yellow', alpha=0.5))

# Display the plot
plt.show()
```
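
The plotting loop above reads `result['<OD>']['bboxes']` and `result['<OD>']['labels']` as two parallel lists. For downstream use it can be handy to group the boxes by label and to separate zone labels from line labels. The following is a minimal sketch of one way to do that, not part of the model's API: the helper names `group_boxes_by_label` and `split_zones_and_lines` are hypothetical, the zone/line split assumes every label ends in `Zone` or `Line` as in the table above, and `example_result` is hand-written for illustration rather than real model output.

```python
from collections import defaultdict


def group_boxes_by_label(result, task="<OD>"):
    """Group [x1, y1, x2, y2] boxes by their predicted label."""
    detections = result[task]
    grouped = defaultdict(list)
    for bbox, label in zip(detections["bboxes"], detections["labels"]):
        grouped[label].append(bbox)
    return dict(grouped)


def split_zones_and_lines(grouped):
    """Separate zone labels from line labels using the label suffix."""
    zones = {k: v for k, v in grouped.items() if k.endswith("Zone")}
    lines = {k: v for k, v in grouped.items() if k.endswith("Line")}
    return zones, lines


# Hand-written stand-in for the dict returned by process_image (assumption:
# same structure as the result used in the plotting loop; values are made up).
example_result = {
    "<OD>": {
        "bboxes": [
            [10.0, 20.0, 110.0, 220.0],
            [15.0, 30.0, 100.0, 45.0],
            [15.0, 50.0, 100.0, 65.0],
        ],
        "labels": ["MainZone", "DefaultLine", "DefaultLine"],
    }
}

grouped = group_boxes_by_label(example_result)
zones, lines = split_zones_and_lines(grouped)

# Print how many boxes were found per label
print({label: len(boxes) for label, boxes in grouped.items()})
```

With a real prediction, pass the `result` returned by `process_image` in place of `example_result`.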