---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: "Gallia est omnis divisa in [MASK] tres ."
  example_title: "Commentary on Gallic Wars"
- text: "[MASK] sum Caesar ."
  example_title: "Who is Caesar?"
- text: "[MASK] it ad forum ."
  example_title: "Who is going to the forum?"
- text: "Ovidius paratus est ad [MASK] ."
  example_title: "What is Ovidius up to?"
- text: "[MASK], veni!"
  example_title: "Calling someone to come closer"
- text: "Roma in Italia [MASK] ."
  example_title: "Ubi est Roma?"
---

# Model Card for Simple Latin BERT

A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use. This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** its input.

Check the `training notebooks` folder for the preprocessing and training scripts.

Inspired by:

- [This repo](https://github.com/dbamman/latin-bert), which has a BERT model for Latin that is actually useful!
- [This tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples)
- [This tutorial](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=VNZZs-r6iKAV)
- [This tutorial](https://huggingface.co/blog/how-to-train)

# Table of Contents

- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)

# Model Details

## Model Description

A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use. This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer!

Check the `notebooks` folder for the preprocessing and training scripts.

- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** la
- **License:** mit

# Uses

## Direct Use

This model can be used directly for Masked Language Modelling.

## Downstream Use

This model could be used as a base model for other NLP tasks, for example text classification (that is, using transformers' `BertForSequenceClassification`).

# Training Details

## Training Data

The training data comes from the corpora freely available through the [Classical Language Toolkit](http://cltk.org/):

- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)

## Training Procedure

### Preprocessing

For preprocessing, the raw text from each corpus was extracted by parsing, **lowercased**, and written to `txt` files, ideally with one sentence per line. Other data from the corpora, such as entity tags and POS tags, were discarded.
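The actual parsing differs per corpus and lives in the `training notebooks` folder; the sketch below only illustrates the lowercasing and one-sentence-per-line step. The directory layout and the naive regex sentence splitter are assumptions for illustration, not the real pipeline.

```python
import re
from pathlib import Path


def preprocess_corpus(raw_dir: str, out_path: str) -> None:
    """Lowercase raw corpus text and write it out with roughly one sentence per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for txt_file in sorted(Path(raw_dir).glob("*.txt")):
            text = txt_file.read_text(encoding="utf-8").lower()
            # Naive split on ., ! and ? -- a stand-in for a proper Latin sentence tokenizer.
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                sentence = sentence.strip()
                if sentence:
                    out.write(sentence + "\n")


# Hypothetical paths, shown only to make the sketch runnable end to end.
preprocess_corpus("corpora/latin_library_raw", "data/latin_sentences.txt")
```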
Training hyperparameters:

- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens

### Speeds, Sizes, Times

With the dataset ready, training this model on a 16 GB NVIDIA graphics card took around 10 hours.

# Evaluation

No evaluation was performed on this model.
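# Usage Example

As a minimal sketch of the direct Masked Language Modelling use described above, the model can be loaded with the `transformers` fill-mask pipeline. The checkpoint identifier below is a placeholder; replace it with this model's actual ID on the Hugging Face Hub.

```python
from transformers import pipeline

# Placeholder checkpoint identifier -- substitute the model's real Hub ID or a local path.
fill_mask = pipeline("fill-mask", model="path/to/simple-latin-bert")

# The bundled tokenizer lowercases input automatically.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], prediction["score"])
```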