|
--- |
|
license: mit |
|
language: |
|
- la |
|
- fr |
|
- es |
|
- de |
|
base_model: |
|
- microsoft/trocr-large-handwritten |
|
tags: |
|
- handwritten-text-recognition |
|
- image-to-text
|
--- |
|
|
|
|
|
## TrOCR model adapted to Handwriting Text Recognition on medieval manuscripts (12th-16th centuries)
|
|
|
**TRIDIS** (*Tria Digita Scribunt*) is a Handwriting Text Recognition model trained on semi-diplomatic transcriptions of medieval and Early Modern manuscripts. It is suitable for work on documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, such as registers, feudal books, charters, proceedings, and accounting records, mostly from the Late Middle Ages (13th century onwards). It can also perform well on documents from other domains, such as literary works, scholarly treatises, and cartularies, providing a versatile tool for historians and philologists in transcribing and analyzing historical texts.
|
|
|
A paper presenting the first version of the model is available here: |
|
Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163 |
|
|
|
A paper presenting the second version of the model (this one) is available here:
|
Sergio Torres Aguilar. Handwritten Text Recognition for Historical Documents using Visual Language Models and GANs. 2023. https://hal.science/hal-04716654 |
|
|
|
#### Rules of transcription
|
|
|
The main feature of a semi-diplomatic edition is that abbreviations have been resolved:
|
- Both abbreviations by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>) have been resolved.
|
- Likewise, those using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved. |
|
- The named entities (names of persons, places and institutions) have been capitalized. |
|
- The beginning of a block of text as well as the original capitals used by the scribe are also capitalized. |
|
- The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin. |
|
- The punctuation marks used in the manuscript like: <mark>.</mark> or <mark>/</mark> or <mark>|</mark> have not been systematically transcribed as the transcription has been standardized with modern punctuation. |
|
- Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign <mark>$</mark> at the beginning and at the end (see the filtering sketch after this list).
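
For downstream processing, the <mark>$</mark> convention makes cancelled words easy to filter out of model output. Below is a minimal sketch; the helper name and the sample line are illustrative, not part of the model:

```python
import re

def drop_cancelled(text: str) -> str:
    """Remove $...$ spans (cancelled words) and collapse leftover spaces."""
    text = re.sub(r"\$[^$]*\$", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(drop_cancelled("sui je ses hons $hon$ et serai"))
# -> "sui je ses hons et serai"
```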
|
|
|
|
|
#### Corpora |
|
The model was trained on documents produced between the 11th and 16th centuries.
|
|
|
The training and evaluation ground-truth datasets comprise 2,950 pages, 245k lines of text, and almost 2.3M tokens, drawn from several freely available ground-truth corpora:
|
|
|
- The Alcar-HOME database: https://zenodo.org/record/5600884 |
|
- The e-NDP corpus: https://zenodo.org/record/7575693 |
|
- The Himanis project: https://zenodo.org/record/5535306 |
|
- Königsfelden Abbey corpus: https://zenodo.org/record/5179361 |
|
- CODEA |
|
- Monumenta Luxemburgensia
|
|
|
Additionally, 400k synthetic lines were used to reinforce the pre-training phase of the encoder-decoder. These lines were generated with a GAN system (https://github.com/ganji15/HiGANplus) trained on medieval manuscript pages.
|
|
|
|
|
#### Accuracy |
|
TRIDIS was trained using an encoder-decoder architecture that combines a fine-tuned version of TrOCR-large-handwritten ([microsoft/trocr-large-handwritten](https://huggingface.co/microsoft/trocr-large-handwritten)) with a RoBERTa model trained on medieval texts ([magistermilitum/Roberta_Historical](https://huggingface.co/magistermilitum/Roberta_Historical)).
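
As an illustration of this architecture, here is how a vision encoder and a RoBERTa decoder can be paired in `transformers`. This is only a sketch, not the TRIDIS training script, and the ViT checkpoint below is a generic placeholder (TRIDIS starts from the TrOCR-large encoder):

```python
from transformers import VisionEncoderDecoderModel

# Cross-attention layers are added to the decoder automatically,
# so the RoBERTa model can attend to the image features.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # placeholder vision encoder
    "magistermilitum/Roberta_Historical",  # RoBERTa trained on medieval texts
)
```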
|
|
|
This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and is capable of recognizing several Latin script families (mostly Textualis and Cursiva) in documents produced between the 11th and 16th centuries.
|
|
|
During evaluation, the model showed an accuracy of 96.8% on the validation set, a CER (Character Error Rate) of about 0.05 to 0.10 on three external unseen datasets, and a WER (Word Error Rate) of about 0.13 to 0.24, respectively, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
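
For reference, the evaluation code below computes CER as the character-level edit distance divided by the length of the longer of the two strings. A tiny worked example with the same `editdistance` package (the two strings are invented):

```python
import editdistance

gt, pred = "domini", "dnmini"
cer = editdistance.eval(list(pred), list(gt)) / max(len(pred), len(gt))
print(round(cer, 3))  # one substitution over six characters -> 0.167
```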
|
|
|
### Other formats |
|
A CRNN+CTC version of this model, trained with Kraken 4.0 (https://github.com/mittagessen/kraken) on the same gold-standard and synthetic annotations, is available on Zenodo:
|
|
|
Torres Aguilar, S. (2024). TRIDIS v2 : HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th) (Version 2). Zenodo. https://doi.org/10.5281/zenodo.13862096 |
|
|
|
## Testing the Model |
|
The following snippets can be used to run model inference on manuscript lines.
|
|
|
1. Clone the model repository: `git lfs clone https://huggingface.co/magistermilitum/tridis_v2_HTR_historical_manuscripts`
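
Alternatively (optional and equivalent to the clone above), the repository can be downloaded with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Fetch the model weights and processor files into a local folder
snapshot_download(
    repo_id="magistermilitum/tridis_v2_HTR_historical_manuscripts",
    local_dir="./tridis_v2_HTR_historical_manuscripts",
)
```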
|
|
|
2. Here is how to test the model on a single image:
|
|
|
```python |
|
from transformers import TrOCRProcessor, AutoTokenizer, VisionEncoderDecoderModel |
|
from safetensors.torch import load_file |
|
import torch.nn as nn |
|
|
|
from PIL import Image |
|
|
|
# load a manuscript line image
path = "/path/to/image/file.png"
image = Image.open(path).convert("RGB")
|
|
|
# Load the processor from the downloaded model and the base TrOCR architecture
processor = TrOCRProcessor.from_pretrained("./tridis_v2_HTR_historical_manuscripts")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# Load the fine-tuned weights from the downloaded model
safetensors_path = "./tridis_v2_HTR_historical_manuscripts/model.safetensors"
state_dict = load_file(safetensors_path)
|
|
|
# Resize the decoder vocabulary and embeddings to match the custom tokenizer
model.config.decoder.vocab_size = processor.tokenizer.vocab_size
model.config.vocab_size = model.config.decoder.vocab_size
model.decoder.output_projection = nn.Linear(1024, processor.tokenizer.vocab_size)
# depending on the transformers version, the embedding attribute path may instead be:
# model.decoder.model.decoder.embed_tokens = nn.Embedding(processor.tokenizer.vocab_size, 1024, padding_idx=1)
model.decoder.embed_tokens = nn.Embedding(processor.tokenizer.vocab_size, 1024, padding_idx=1)
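
# set special tokens used for generation (mirrors the evaluation snippet below)
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id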
|
|
|
# set beam search parameters |
|
model.config.eos_token_id = processor.tokenizer.sep_token_id |
|
model.config.max_length = 160 |
|
model.config.early_stopping = True |
|
model.config.no_repeat_ngram_size = 3 |
|
model.config.length_penalty = 2.0 |
|
model.config.num_beams = 3 |
|
|
|
model.load_state_dict(state_dict) |
|
|
|
pixel_values = processor(images=image, return_tensors="pt").pixel_values |
|
|
|
generated_ids = model.generate(pixel_values) |
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
print(generated_text) |
|
``` |
|
|
|
3. Here is how to test the model on a dataset. Ideally, the test dataset should be passed to the model as a JSON list pointing to the line images, each entry having the form `[graphical_line_path, line_text_content]`:

```json
[
  ["liber_eSc_line_b9f83857", "Et pour ces deniers que je ai ressus de"],
  ["liber_eSc_line_8da10559", "lui , sui je ses hons et serai tant con je vive-"],
  ...
]
```
|
|
|
```python |
|
import json
import random
import re
import string
import unicodedata

import editdistance
import numpy as np
import pandas as pd
from tqdm import tqdm
|
|
|
def ocr_metrics(predicts, ground_truth, norm_accentuation=True, norm_punctuation=False):
    """Calculate Character Error Rate (CER), Word Error Rate (WER) and Sequence Error Rate (SER)"""

    if len(predicts) == 0 or len(ground_truth) == 0:
        return (1, 1, 1)

    cer, wer, ser = [], [], []

    for (pred, gt) in zip(predicts, ground_truth):  # `pred` avoids shadowing pandas' `pd`
        pred, gt = pred.lower(), gt.lower()

        if norm_accentuation:
            pred = unicodedata.normalize("NFKD", pred).encode("ASCII", "ignore").decode("ASCII")
            gt = unicodedata.normalize("NFKD", gt).encode("ASCII", "ignore").decode("ASCII")
        if norm_punctuation:
            pred = pred.translate(str.maketrans("", "", string.punctuation))
            gt = gt.translate(str.maketrans("", "", string.punctuation))

        # CER: character-level edit distance over the length of the longer string
        pred_cer, gt_cer = list(pred), list(gt)
        dist = editdistance.eval(pred_cer, gt_cer)
        cer.append(dist / max(len(pred_cer), len(gt_cer)))

        # WER: word-level edit distance over the longer token sequence
        pred_wer, gt_wer = pred.split(), gt.split()
        dist = editdistance.eval(pred_wer, gt_wer)
        wer.append(dist / max(len(pred_wer), len(gt_wer)))

        # SER: 1 if the whole sequence differs, 0 otherwise
        pred_ser, gt_ser = [pred], [gt]
        dist = editdistance.eval(pred_ser, gt_ser)
        ser.append(dist / max(len(pred_ser), len(gt_ser)))

    return np.mean([cer, wer, ser], axis=1)
|
|
|
def cleaning_output(text):
    clean_output = re.sub(r"[,.;]", "", text)         # remove punctuation
    clean_output = re.sub(r"\s+", " ", clean_output)  # collapse extra spaces
    return clean_output
|
|
|
import torch |
|
from torch.utils.data import Dataset |
|
from PIL import Image |
|
|
|
# Define the dataset class
class IAMDataset(Dataset):
    def __init__(self, root_dir, df, processor, max_target_length=160):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(text,
                                          padding="max_length",
                                          max_length=self.max_target_length).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        # include `file_name` in the results dict
        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels), "file_name": file_name}
        return encoding
|
|
|
# Load the dataset |
|
from transformers import TrOCRProcessor, AutoTokenizer |
|
|
|
# Load the processor from the downloaded model
processor = TrOCRProcessor.from_pretrained("./tridis_v2_HTR_historical_manuscripts")
|
|
|
# Define the dataset
# Open the file with the text lines
with open('/your/lines/file.json', encoding='utf-8') as fh:
    transcriptions = json.load(fh)

random.shuffle(transcriptions)
transcriptions = list(filter(lambda x: x is not None, transcriptions))
# keep string transcriptions of reasonable length (optional); the *.png extension is assumed by default
transcriptions = [[x[0] + ".png", x[1]] for x in transcriptions if (isinstance(x[1], str) and 3 < len(x[1]) < 201)]
print(len(transcriptions))

df = pd.DataFrame(transcriptions, columns=["file_name", "text"])
print(df.head())
print(sum(len(x[1]) for x in transcriptions))  # total number of characters
|
|
|
# Point the dataset at the folder with the line images
test_dataset = IAMDataset(root_dir='/your/images/folder/',
                          df=df,
                          processor=processor)
|
print("Number of test examples:", len(test_dataset)) |
|
|
|
|
|
# Load the test dataloader |
|
from torch.utils.data import DataLoader |
|
import torch.nn as nn |
|
|
|
test_dataloader = DataLoader(test_dataset, batch_size=16)  # adapt batch size to your GPU

# sanity check: decode the labels of the first batch
batch = next(iter(test_dataloader))
labels = batch["labels"]
labels[labels == -100] = processor.tokenizer.pad_token_id
label_str = processor.batch_decode(labels, skip_special_tokens=True)
print(label_str)
|
|
|
# Load the model |
|
from transformers import VisionEncoderDecoderModel
|
import torch |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
from safetensors.torch import load_file

# Load the fine-tuned weights from the downloaded model
safetensors_path = "./tridis_v2_HTR_historical_manuscripts/model.safetensors"
state_dict = load_file(safetensors_path)
|
|
|
# Load the trocr model |
|
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten") |
|
|
|
# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

# Resize the decoder vocabulary and embeddings to match the custom tokenizer
model.config.decoder.vocab_size = processor.tokenizer.vocab_size
model.config.vocab_size = model.config.decoder.vocab_size
model.decoder.output_projection = nn.Linear(1024, processor.tokenizer.vocab_size)
# depending on the transformers version, the embedding attribute path may instead be:
# model.decoder.model.decoder.embed_tokens = nn.Embedding(processor.tokenizer.vocab_size, 1024, padding_idx=1)
model.decoder.embed_tokens = nn.Embedding(processor.tokenizer.vocab_size, 1024, padding_idx=1)
|
|
|
# Useful hyperparameters (optional)
|
model.config.decoder.activation_function="gelu" |
|
model.config.decoder.layernorm_embedding=True |
|
model.config.decoder.max_position_embeddings=514 |
|
model.config.decoder.scale_embedding=False |
|
model.config.decoder.use_learned_position_embeddings=True |
|
|
|
# set beam search parameters |
|
model.config.eos_token_id = processor.tokenizer.sep_token_id |
|
model.config.max_length = 160 |
|
model.config.early_stopping = True |
|
model.config.no_repeat_ngram_size = 3 |
|
model.config.length_penalty = 2.0 |
|
model.config.num_beams = 3 |
|
|
|
# update the model weights |
|
model.load_state_dict(state_dict) |
|
model.to(device) |
|
|
|
# Load the metrics
# note: `datasets.load_metric` is deprecated; on recent versions use `evaluate.load("bertscore")` instead
from datasets import load_metric
bert = load_metric("bertscore")
|
|
|
# Evaluate the model |
|
print("Running evaluation...") |
|
|
|
dictionary = []  # will hold [file_name, prediction, reference] triplets
|
for batch in tqdm(test_dataloader):
    pixel_values = batch["pixel_values"].to(device)
    outputs = model.generate(pixel_values)
    # decode predictions and references
    pred_str = processor.batch_decode(outputs, skip_special_tokens=True)
    labels = batch["labels"]
    labels[labels == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(labels, skip_special_tokens=True)
    file_names = batch["file_name"]  # the default collate_fn returns the file names as a list of strings
    dictionary.extend([[file_name, pred, ref] for file_name, pred, ref in zip(file_names, pred_str, label_str)])
|
|
|
# Save the results to a JSON file
with open("/your/save/path/dictionary_of_results.json", "w", encoding='utf-8') as jsonfile:
    json.dump(dictionary, jsonfile, ensure_ascii=False, indent=1)
|
|
|
# compute the BERT score (dictionary entries are [file_name, prediction, reference])
bert_score = bert.compute(predictions=[x[1] for x in dictionary], references=[x[2] for x in dictionary], model_type="bert-base-multilingual-cased")
bert_score_mean = np.mean(bert_score["f1"])
bert_score_std = np.std(bert_score["f1"])
|
|
|
# Print the results according to the metrics
print("BERT_SCORE_MEAN : ", bert_score_mean, "BERT_SCORE_STD : ", bert_score_std)
print("RAW metrics : ", ocr_metrics([x[1] for x in dictionary], [x[2] for x in dictionary]))
print("CLEAN metrics : ", ocr_metrics([cleaning_output(x[1]) for x in dictionary], [cleaning_output(x[2]) for x in dictionary]))
print(*dictionary[1:], sep="\n\n")  # show the collected [file_name, prediction, reference] triplets
|
``` |
|
|
|
- **Developed by:** Sergio Torres Aguilar
- **Model type:** TrOCR (VisionEncoderDecoder)
- **Language(s) (NLP):** Medieval Latin, Old French, Old Spanish, Middle German
- **Finetuned from model:** [microsoft/trocr-large-handwritten](https://huggingface.co/microsoft/trocr-large-handwritten)