magistermilitum/tridis_HTR

 ---
 license: mit
+widget:
+- text: Universis presentes [MASK] inspecturis
+- text: eandem [MASK] per omnia parati observare
+- text: yo [MASK] rey de Galicia, de las Indias
+- text: en avant contre les choses [MASK] contenues
+datasets:
+- cc100
+- bigscience-historical-texts/Open_Medieval_French
+- latinwikipedia
+language:
+- la
+- fr
+- es
 ---
+## TrOCR model adapted for Handwritten Text Recognition on medieval manuscripts (12th-16th centuries)
+**TRIDIS** (*Tria Digita Scribunt*) is a Handwritten Text Recognition model trained on semi-diplomatic transcriptions
+of medieval and early modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising
+from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards).
+It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies,
+giving historians and philologists a versatile tool for transforming and analyzing historical texts.
+A paper presenting the first version of the model is available here:
+Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163
+#### Rules of transcription:
+The defining feature of the semi-diplomatic edition is that abbreviations have been resolved. The full conventions are:
+- Abbreviations by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>) have been resolved.
+- Likewise, abbreviations using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved.
+- The named entities (names of persons, places and institutions) have been capitalized.
+- The first word of a block of text is capitalized, and the original capitals used by the scribe are kept.
+- The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin.
+- Manuscript punctuation marks such as <mark>.</mark>, <mark>/</mark> or <mark>|</mark> have not been systematically transcribed, because punctuation has been standardized to modern usage.
+- Corrections and words that appear cancelled in the manuscript have been transcribed enclosed between <mark>$</mark> signs.
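
Because cancelled words and corrections are wrapped in `$` signs, downstream users may want to separate them from the running text. The following is a minimal sketch, assuming the markers always come in matched pairs on the same line; the function names and the sample line are hypothetical and not part of the model or its tooling.

```python
import re

# Assumption: each cancelled or corrected span opens and closes with "$" on one line.
CANCELLED = re.compile(r"\$(.*?)\$")


def extract_cancelled(line):
    """Return the spans that the transcription marks between $ signs."""
    return [span.strip() for span in CANCELLED.findall(line)]


def strip_cancelled(line):
    """Remove the $-marked spans and collapse the whitespace left behind."""
    without_spans = CANCELLED.sub(" ", line)
    return " ".join(without_spans.split())


# Hypothetical model output containing one cancelled word.
sample = "Universis $presentibus$ presentes litteras inspecturis"
print(extract_cancelled(sample))
print(strip_cancelled(sample))
```

Whether to drop or keep the marked spans depends on the editorial goal: a reading text would usually drop them, while a study of scribal practice would keep them.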
+#### Corpora
+The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).
+Training and evaluation used 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from five ground-truth corpora, four of which are freely available on Zenodo:
+- The Alcar-HOME database: https://zenodo.org/record/5600884
+- The e-NDP corpus: https://zenodo.org/record/7575693
+- The Himanis project: https://zenodo.org/record/5535306
+- The Königsfelden Abbey corpus: https://zenodo.org/record/5179361
+- The Monumenta Luxemburgensia corpus.
+#### Accuracy
+TRIDIS was trained using an encoder-decoder architecture that combines a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) with a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).
+The final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between the 11th and 16th centuries.
+During evaluation, the model reached an accuracy of 94.3% on the validation set. On three external unseen datasets it obtained a CER (Character Error Rate) of about 0.06 to 0.12
+and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
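
For readers unfamiliar with these metrics, the sketch below computes CER and WER in their usual form: the Levenshtein edit distance between reference and prediction, divided by the reference length in characters or words. It is an illustration under that assumption, not the evaluation script used for TRIDIS, and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace tokens)."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Hypothetical ground truth and prediction differing by one dropped letter.
truth = "Universis presentes litteras inspecturis"
guess = "Universis presentes literas inspecturis"
print(f"CER = {cer(truth, guess):.3f}, WER = {wer(truth, guess):.3f}")
```

A single wrong character counts as one error out of 40 characters for CER but makes one word out of four wrong for WER, which is why WER figures run higher than CER figures on the same output.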