RoBERTa Latin model, version 2 (model card not finished yet)
This is version 2 of a Latin RoBERTa-based language model.
The intention behind the Transformer-based language model is twofold: on the one hand, it will be used to evaluate HTR (handwritten text recognition) results; on the other, it should serve as the decoder in a TrOCR architecture.
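Below is a minimal sketch of both intended uses with the Hugging Face transformers library. The model identifier and the vision encoder checkpoint are placeholders chosen for illustration, since they are not named in this card.

```python
# Sketch only: "pstroe/latin-roberta-v2" is a placeholder model id,
# not the actual repository name of this model.
from transformers import pipeline, VisionEncoderDecoderModel

MODEL_ID = "pstroe/latin-roberta-v2"  # placeholder, replace with the real id

# 1) Masked-LM predictions, e.g. to sanity-check HTR output against the LM.
fill_mask = pipeline("fill-mask", model=MODEL_ID)
print(fill_mask("Gallia est omnis divisa in partes <mask>."))

# 2) Plugging the model in as the text decoder of a TrOCR-style
#    encoder-decoder; the ViT checkpoint below is only an example encoder.
trocr = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder (example choice)
    MODEL_ID,                             # this Latin RoBERTa as decoder
)
```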
The training data is largely the same as that used by Bamman and Burns (2020), although more heavily filtered (see below). It includes several born-digital texts from online Latin archives; other Latin texts were crawled by Bamman and Smith and therefore contain many OCR errors.
The overall downsampled corpus amounts to 577 MB of text data.
Preprocessing
I undertook the following preprocessing steps:
- Removal of all "pseudo-Latin" text ("Lorem ipsum ...").
- Use of CLTK for sentence splitting and normalisation.
- Retention of only those lines containing letters of the Latin alphabet, numerals, and certain punctuation, using the following command:
  grep -P '^[A-z0-9ÄÖÜäöüÆ挜ᵫĀāūōŌ.,;:?!\- Ęę]+$' la.nolorem.tok.txt
- Deduplication of the corpus (a rough sketch of these filtering steps follows below).
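The following Python sketch illustrates the Lorem-ipsum removal, the character filtering (using the same character class as the grep call above), and the exact-line deduplication. The function name and output file name are illustrative, and the CLTK sentence splitting and normalisation step is assumed to have been run beforehand.

```python
# Assumed re-implementation of the filtering and deduplication steps;
# the original filtering was done with the grep command above.
import re

# Same character class as in the grep command above.
ALLOWED = re.compile(r'^[A-z0-9ÄÖÜäöüÆ挜ᵫĀāūōŌ.,;:?!\- Ęę]+$')

def filter_and_dedup(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if "Lorem ipsum" in line:        # drop pseudo-Latin filler
                continue
            if not ALLOWED.match(line):      # keep only allowed characters
                continue
            if line in seen:                 # exact-duplicate removal
                continue
            seen.add(line)
            fout.write(line + "\n")

filter_and_dedup("la.nolorem.tok.txt", "la.filtered.dedup.txt")
```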
The result is a corpus of ~390 million tokens.
The dataset used to train this model is available HERE.
Contact
For questions, reach out to Phillip Ströbel via email or via Twitter.