magistermilitum committed on
Commit
0705695
1 Parent(s): b37e0eb

Update README.md

  1. README.md +58 -1
README.md CHANGED
@@ -1,4 +1,61 @@
1
- TrOCR model adapted to Handwriting Text Recognition on medieval manuscripts (12th-16th centuries)
2
  ---
3
  license: mit
4
  ---
1
  ---
2
  license: mit
3
+ widget:
4
+ - text: Universis presentes [MASK] inspecturis
5
+ - text: eandem [MASK] per omnia parati observare
6
+ - text: yo [MASK] rey de Galicia, de las Indias
7
+ - text: en avant contre les choses [MASK] contenues
8
+ datasets:
9
+ - cc100
10
+ - bigscience-historical-texts/Open_Medieval_French
11
+ - latinwikipedia
12
+ language:
13
+ - la
14
+ - fr
15
+ - es
16
  ---
17
+
18
+
19
+ ## TrOCR model adapted to Handwriting Text Recognition on medieval manuscripts (12th-16th centuries)
20
+
21
+ **TRIDIS** (*Tria Digita Scribunt*) is a Handwriting Text Recognition model trained on semi-diplomatic transcriptions
22
+ of medieval and Early Modern manuscripts. It is suitable for work on documentary manuscripts, that is, manuscripts arising
23
+ from legal, administrative, and memorial practices, most commonly from the Late Middle Ages (13th century onwards).
24
+ It can also perform well on documents from other domains, such as literary books, scholarly treatises and cartularies,
25
+ providing a versatile tool for historians and philologists in transforming and analyzing historical texts.
26
+
27
+ A paper presenting the first version of the model is available here:
28
+ Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163
29
+
30
+
31
+ #### Rules of transcription:
32
+
33
+ The main feature of the semi-diplomatic edition is that abbreviations have been resolved:
34
+ - both those by suspension (<mark>facimꝰ</mark> --> <mark>facimus</mark>) and by contraction (<mark>dñi</mark> --> <mark>domini</mark>).
35
+ - Likewise, those using conventional signs (<mark>⁊</mark> --> <mark>et</mark> ; <mark>ꝓ</mark> --> <mark>pro</mark>) have been resolved. 
36
+ - The named entities (names of persons, places and institutions) have been capitalized.
37
+ - The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
38
+ - The consonantal <mark>i</mark> and <mark>u</mark> characters have been transcribed as <mark>j</mark> and <mark>v</mark> in both French and Latin.
39
+ - Punctuation marks used in the manuscript, such as <mark>.</mark> or <mark>/</mark> or <mark>|</mark>, have not been systematically transcribed, as the transcription has been standardized with modern punctuation.
40
+ - Corrections and words that appear cancelled in the manuscript have been transcribed enclosed between two <mark>$</mark> signs, one at the beginning and one at the end.
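
The conventions above can be illustrated with a minimal Python sketch. It is not part of the TRIDIS pipeline: the lookup tables contain only the examples quoted in the rules, and the names `WORD_EXPANSIONS`, `SIGN_EXPANSIONS`, `expand_abbreviations` and `split_cancelled` are hypothetical helpers introduced for illustration.

```python
import re
import unicodedata

# Hypothetical lookup tables, filled only with the examples quoted in the rules.
# Keys are NFC-normalized so that composed and decomposed input both match.
WORD_EXPANSIONS = {
    unicodedata.normalize("NFC", k): v
    for k, v in {"facimꝰ": "facimus", "dñi": "domini"}.items()
}
SIGN_EXPANSIONS = {"⁊": "et", "ꝓ": "pro"}

# A cancelled word or correction is enclosed between two "$" signs.
CANCELLED = re.compile(r"\$(.*?)\$")


def expand_abbreviations(text: str) -> str:
    """Resolve the abbreviated forms listed in the tables above."""
    text = unicodedata.normalize("NFC", text)
    words = [WORD_EXPANSIONS.get(word, word) for word in text.split()]
    expanded = " ".join(words)
    for sign, expansion in SIGN_EXPANSIONS.items():
        expanded = expanded.replace(sign, expansion)
    return expanded


def split_cancelled(text: str):
    """Return the text without cancelled spans, plus the cancelled spans."""
    cancelled = CANCELLED.findall(text)
    kept = " ".join(CANCELLED.sub("", text).split())
    return kept, cancelled
```

For instance, `split_cancelled` could be applied to a transcribed line when only the scribe's final reading is needed, while keeping the cancelled words available separately.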
41
+
42
+
43
+ #### Corpora
44
+ The model was trained on charters, registers, feudal books and legal proceedings from the Late Medieval period (11th-16th centuries).
45
+
46
+ Training and evaluation were conducted on 2950 pages, 245k lines of text, and almost 2.3M tokens, drawn from the following freely available ground-truth corpora:
47
+
48
+ - The Alcar-HOME database: https://zenodo.org/record/5600884
49
+ - The e-NDP corpus: https://zenodo.org/record/7575693
50
+ - The Himanis project: https://zenodo.org/record/5535306
51
+ - Königsfelden Abbey corpus: https://zenodo.org/record/5179361
52
+ - Monumenta Luxemburgensia.
53
+
54
+
55
+ #### Accuracy
56
+ TRIDIS was trained using an encoder-decoder architecture based on a fine-tuned version of TrOCR-large handwritten (microsoft/trocr-large-handwritten) and a RoBERTa model trained on medieval texts (magistermilitum/RoBERTa_medieval).
57
+
58
+ This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and is capable of recognizing several Latin script families (mostly Textualis and Cursiva) in documents produced circa the 11th-16th centuries.
59
+
60
+ During evaluation, the model showed an accuracy of 94.3% on the validation set and, on three external unseen datasets, a CER (Character Error Rate) of about 0.06 to 0.12
61
+ and a WER (Word Error Rate) of about 0.14 to 0.26, which is about 30% lower than CRNN+CTC solutions trained on the same corpora.
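
The README does not state how CER and WER were computed. The sketch below assumes the usual definition, Levenshtein edit distance divided by the length of the reference, counted in characters for CER and in whitespace-separated words for WER. The function names are illustrative and not taken from the TRIDIS code.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                             # deletion
                current[j - 1] + 1,                          # insertion
                previous[j - 1] + (ref_item != hyp_item),    # substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)
```

With these definitions a single wrong character in a six-letter word gives a CER of 1/6, and one wrong word in a four-word line gives a WER of 0.25, which is why WER figures are typically higher than CER figures on the same data.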