license: cc-by-4.0

Automatic Translation Alignment of Ancient Greek Texts

GRC-ALIGNMENT is an XLM-RoBERTa-based model fine-tuned for automatic multilingual text alignment at the word level. The model was first trained on 12 million monolingual Ancient Greek tokens with a masked language model (MLM) objective. It was then fine-tuned on 45k parallel sentences, mainly Ancient Greek-English, Ancient Greek-Latin, and Ancient Greek-Georgian.
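As a rough illustration of how word-level alignments can be read off contextual embeddings, the sketch below implements a mutual-argmax rule of the kind usually meant by "ArgMax" in the alignment literature: a source token and a target token are linked only when each is the other's most similar counterpart. It assumes alignments are extracted from a cosine similarity matrix over token vectors. The helper names (`cosine`, `argmax_align`) and the toy two-dimensional vectors are hypothetical stand-ins for real model embeddings, not this model's own code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def argmax_align(src_vecs, tgt_vecs):
    """Return (source_index, target_index) pairs that are mutual best matches."""
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sim):
        # Best target for source token i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source for that target token.
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])
        if best_i == i:
            pairs.append((i, j))
    return pairs


# Toy vectors standing in for contextual embeddings (hypothetical values).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
alignment = argmax_align(src, tgt)
print(alignment)
```

The third source token stays unaligned here because neither target token prefers it, which is the conservative behaviour that makes a mutual-argmax rule favour precision over recall.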

Multilingual Training Dataset

| Languages | Sentences | Source |
|---|---|---|
| GRC-ENG | 32,500 | Perseus Digital Library (Iliad, Odyssey, Xenophon, New Testament) |
| GRC-LAT | 8,200 | Digital Fragmenta Historicorum Graecorum project |
| GRC-KAT, GRC-ENG, GRC-LAT, GRC-ITA, GRC-POR | 4,000 | UGARIT Translation Alignment Editor |

Model Performance

| Languages | Alignment Error Rate |
|---|---|
| GRC-ENG | 19.73% (IterMax) |
| GRC-POR | 23.91% (IterMax) |
| GRC-LAT | 10.60% (ArgMax) |
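For readers unfamiliar with the metric, Alignment Error Rate is commonly defined (following Och and Ney) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure gold links, and P the possible gold links, with S contained in P. The sketch below computes it under that standard definition. Whether the gold standards used here distinguish sure from possible links is an assumption, and the function name and example links are hypothetical.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Alignment Error Rate under the standard sure/possible definition.

    Links are (source_index, target_index) tuples. If no possible links
    are given, P defaults to S.
    """
    predicted = set(predicted)
    sure = set(sure)
    # By convention every sure link is also a possible link.
    possible = set(possible or ()) | sure
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Hypothetical example: three predicted links, three sure gold links,
# and one extra link that annotators marked as merely possible.
predicted_links = {(0, 0), (1, 1), (2, 3)}
sure_links = {(0, 0), (1, 1), (2, 2)}
possible_links = {(2, 3)}
aer = alignment_error_rate(predicted_links, sure_links, possible_links)
print(f"AER = {aer:.2%}")
```

Lower values are better: a prediction identical to the sure links scores 0, and one sharing no link with the gold standard scores 1.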

The gold standard datasets are available on GitHub.

If you use this model, please cite our papers:

@InProceedings{yousef-EtAl:2022:LREC,
  author    = {Yousef, Tariq  and  Palladino, Chiara  and  Shamsian, Farnoosh  and  d’Orange Ferreira, Anise  and  Ferreira dos Reis, Michel},
  title     = {An automatic model and Gold Standard for translation alignment of Ancient Greek},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {5894--5905},
  abstract  = {This paper illustrates a workflow for developing and evaluating automatic translation alignment models for Ancient Greek. We designed an annotation Style Guide and a gold standard for the alignment of Ancient Greek-English and Ancient Greek-Portuguese, measured inter-annotator agreement and used the resulting dataset to evaluate the performance of various translation alignment models. We proposed a fine-tuning strategy that employs unsupervised training with mono- and bilingual texts and supervised training using manually aligned sentences. The results indicate that the fine-tuned model based on XLM-Roberta is superior in performance, and it achieved good results on language pairs that were not part of the training data.},
  url       = {https://aclanthology.org/2022.lrec-1.634}
}

@InProceedings{yousef-EtAl:2022:LT4HALA2022,
  author    = {Yousef, Tariq  and  Palladino, Chiara  and  Wright, David J.  and  Berti, Monica},
  title     = {Automatic Translation Alignment for Ancient Greek and Latin},
  booktitle      = {Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {101--107},
  abstract  = {This paper presents the results of automatic translation alignment experiments on a corpus of texts in Ancient Greek translated into Latin. We used a state-of-the-art alignment workflow based on a contextualized multilingual language model that is fine-tuned on the alignment task for Ancient Greek and Latin. The performance of the alignment model is evaluated on an alignment gold standard consisting of 100 parallel fragments aligned manually by two domain experts, with a 90.5\% Inter-Annotator-Agreement (IAA). An interactive online interface is provided to enable users to explore the aligned fragments collection and examine the alignment model's output.},
  url       = {https://aclanthology.org/2022.lt4hala2022-1.14}
}