magistermilitum
committed on
Commit
•
0705695
1
Parent(s):
b37e0eb
Update README.md
README.md
CHANGED
@@ -1,4 +1,61 @@
|
|
1 |
-
TrOCR model adapted to Handwritting Text Recognition on medieval manuscripts (12th-16th centuries)
|
2 |
---
|
3 |
license: mit
|
4 |
---
|
1 |
---
|
2 |
license: mit
|
3 |
+
widget:
|
4 |
+
- text: Universis presentes [MASK] inspecturis
|
5 |
+
- text: eandem [MASK] per omnia parati observare
|
6 |
+
- text: yo [MASK] rey de Galicia, de las Indias
|
7 |
+
- text: en avant contre les choses [MASK] contenues
|
8 |
+
datasets:
|
9 |
+
- cc100
|
10 |
+
- bigscience-historical-texts/Open_Medieval_French
|
11 |
+
- latinwikipedia
|
12 |
+
language:
|
13 |
+
- la
|
14 |
+
- fr
|
15 |
+
- es
|
16 |
---
|
17 |
+
|
18 |
+
|
19 |
+
## TrOCR model adapted to Handwritten Text Recognition on medieval manuscripts (12th-16th centuries)
|
20 |
+
|
21 |
+
**TRIDIS** (*Tria Digita Scribunt*) is a Handwritten Text Recognition model trained on semi-diplomatic transcriptions
|
22 |
+
from medieval and early modern manuscripts. It is suited to documentary manuscripts, that is, manuscripts arising
|
23 |
+
from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards).
|
24 |
+
It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies,
|
25 |
+
providing a versatile tool for historians and philologists in transforming and analyzing historical texts.
|
26 |
+
|
27 |
+
A paper presenting the first version of the model is available here:
|
28 |
+
Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163
|
29 |
+
|
30 |
+
|
31 |
+
#### Rules of transcription:
|
32 |
+
|
33 |
+
The main feature of the semi-diplomatic edition is that abbreviations have been resolved:
|
34 |
+
- both those by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>).
|
35 |
+
- Likewise, those using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved.
|
36 |
+
- The named entities (names of persons, places and institutions) have been capitalized.
|
37 |
+
- The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
|
38 |
+
- The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin.
|
39 |
+
- The punctuation marks used in the manuscript, such as <mark>.</mark>, <mark>/</mark> or <mark>|</mark>, have not been systematically transcribed, as the transcription has been standardized with modern punctuation.
|
40 |
+
- Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign <mark>$</mark> at the beginning and at the end.
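The <mark>$</mark> convention in the last rule can be handled in post-processing. The following is a minimal Python sketch, not part of the model's own tooling: the helper name `split_cancelled` and the sample line are hypothetical, and it assumes that every cancelled span is enclosed in a matched, non-nested pair of <mark>$</mark> signs.

```python
import re

# Assumption: each cancelled span is written as $...$ with no nested or unpaired "$".
CANCELLED = re.compile(r"\$([^$]*)\$")


def split_cancelled(line: str) -> tuple[str, list[str]]:
    """Return the line without cancelled spans, plus the list of cancelled spans."""
    cancelled = [span.strip() for span in CANCELLED.findall(line)]
    # Replace each cancelled span with a space, then collapse repeated whitespace.
    kept = CANCELLED.sub(" ", line)
    kept = re.sub(r"\s+", " ", kept).strip()
    return kept, cancelled


# Hypothetical transcription line, not taken from the training corpora.
example = "Universis presentes $litteris$ litteras inspecturis"
kept, cancelled = split_cancelled(example)
print(kept)       # text as finally intended by the scribe
print(cancelled)  # words struck out in the manuscript
```

Keeping the cancelled spans in a separate list, rather than discarding them, preserves the scribe's corrections for later philological analysis.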
|
41 |
+
|
42 |
+
|
43 |
+
#### Corpora
|
44 |
+
The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).
|
45 |
+
|
46 |
+
The training and evaluation involved 2950 pages, 245k lines of text, and almost 2.3M tokens, conducted using the following freely available ground-truth corpora:
|
47 |
+
|
48 |
+
- The Alcar-HOME database: https://zenodo.org/record/5600884
|
49 |
+
- The e-NDP corpus: https://zenodo.org/record/7575693
|
50 |
+
- The Himanis project: https://zenodo.org/record/5535306
|
51 |
+
- Königsfelden Abbey corpus: https://zenodo.org/record/5179361
|
52 |
+
- Monumenta Luxemburgensia.
|
53 |
+
|
54 |
+
|
55 |
+
#### Accuracy
|
56 |
+
TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) and a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).
|
57 |
+
|
58 |
+
This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and is capable of recognizing several Latin script families (mostly Textualis and Cursiva) in documents produced between roughly the 11th and 16th centuries.
|
59 |
+
|
60 |
+
During evaluation, the model showed an accuracy of 94.3% on the validation set and a CER (Character Error Rate) of about 0.06 to 0.12 on three external unseen datasets
|
61 |
+
and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
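For readers who want to reproduce comparable figures on their own data, the sketch below computes CER and WER in pure Python. It assumes the common definition (Levenshtein edit distance divided by the length of the reference, in characters for CER and in whitespace-separated tokens for WER). The source does not state exactly how the reported scores were computed, so normalization and tokenization details may differ. The function names and the sample strings are illustrative only.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: token edits divided by number of reference tokens."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Illustrative strings, not taken from the evaluation sets.
reference = "Universis presentes litteras inspecturis"
hypothesis = "Universis presentes literas inspecturis"
print(f"CER = {cer(reference, hypothesis):.3f}")  # one missing character out of 40
print(f"WER = {wer(reference, hypothesis):.3f}")  # one wrong word out of 4
```

A single character error changes one character in forty but one word in four, which is why WER figures are typically much higher than CER figures for the same output.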
|