--- language: cs license: cc-by-nc-sa-4.0 tags: - RobeCzech - Czech - RoBERTa - ÚFAL --- # Model Card for RobeCzech **If you are having issues with the tokenizer, please see https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314.** # Model Details ## Model Description RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. - **Developed by:** Institute of Formal and Applied Linguistics, Charles University, Prague (UFAL) - **Shared by:** Hugging Face and [LINDAT/CLARIAH-CZ](https://hdl.handle.net/11234/1-3691) - **Model type:** Fill-Mask - **Language(s) (NLP):** cs - **License:** cc-by-nc-sa-4.0 - **Model Architecture:** RoBERTa - **Resources for more information:** - [RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model](https://doi.org/10.1007/978-3-030-83527-9_17) - [arXiv preprint is also available](https://arxiv.org/abs/2105.11314) # Uses ## Direct Use Fill-Mask tasks. ## Downstream Use Morphological tagging and lemmatization, dependency parsing, named entity recognition, and semantic parsing. # Bias, Risks, and Limitations Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. ## Recommendations Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. # Training Details ## Training Data The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf): > We trained RobeCzech on a collection of the following publicly available texts: > - SYN v4, a large corpus of contemporary written Czech, 4,188M tokens; > - Czes, a collection of Czech newspaper and magazine articles, 432M tokens; > - documents with at least 400 tokens from the Czech part of the web corpus.W2C , tokenized with MorphoDiTa, 16M tokens; > - plain texts extracted from Czech Wikipedia dump 20201020 using WikiEx-tractor, tokenized with MorphoDiTa, 123M tokens > All these corpora contain whole documents, even if the SYN v4 is > block-shuffled (blocks with at most 100 words respecting sentence boundaries > are permuted in a document) and in total contain 4,917M tokens. ## Training Procedure ### Preprocessing The texts are tokenized into subwords with a byte-level BPE (BBPE) tokenizer, which was trained on the entire corpus and we limit its vocabulary size to 52,000 items. ### Speeds, Sizes, Times The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf): > The training batch size is 8,192 and each training batch consists of sentences > sampled contiguously, even across document boundaries, such that the total > length of each sample is at most 512 tokens (FULL-SENTENCES setting). We use > Adam optimizer with β1 = 0.9 and β2 = 0.98 to minimize the masked > language-modeling objective. ### Software Used The [Fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/roberta) implementation was used for training. # Evaluation ## Testing Data, Factors & Metrics ### Testing Data The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf): > We evaluate RobeCzech in five NLP tasks, three of them leveraging frozen > contextualized word embeddings, two approached with fine-tuning: > - morphological analysis and lemmatization: frozen contextualized word embeddings, > - dependency parsing: frozen contextualized word embeddings, > - named entity recognition: frozen contextualized word embeddings, > - semantic parsing: fine-tuned, > - sentiment analysis: fine-tuned. ## Results | Model | Morphosynt PDT3.5 (POS) (LAS) | Morphosynt UD2.3 (XPOS) (LAS) | NER CNEC1.1 (nested) (flat) | Semant. PTG (Avg) (F1) | |-----------|---------------------------------|--------------------------------|------------------------------|-------------------------| | RobeCzech | 98.50 91.42 | 98.31 93.77 | 87.82 87.47 | 92.36 80.13 | # Environmental Impact - **Hardware Type:** 8 QUADRO P5000 GPU - **Hours used:** 2190 (~3 months) # Citation ``` @InProceedings{10.1007/978-3-030-83527-9_17, author={Straka, Milan and N{\'a}plava, Jakub and Strakov{\'a}, Jana and Samuel, David}, editor={Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav}, title={{RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model}}, booktitle="Text, Speech, and Dialogue", year="2021", publisher="Springer International Publishing", address="Cham", pages="197--209", isbn="978-3-030-83527-9" } ``` # How to Get Started with the Model Use the code below to get started with the model.
Click to expand ```python from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base") model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base") ```