---
language: cs
license: cc-by-nc-sa-4.0
tags:
- RobeCzech
- Czech
- RoBERTa
- ÚFAL
---
# Model Card for RobeCzech
**If you are having issues with the tokenizer, please see https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314.**
# Model Details
## Model Description
RobeCzech is a monolingual RoBERTa language representation model trained on Czech data.
- **Developed by:** Institute of Formal and Applied Linguistics, Charles University, Prague (UFAL)
- **Shared by:** Hugging Face and [LINDAT/CLARIAH-CZ](https://hdl.handle.net/11234/1-3691)
- **Model type:** Fill-Mask
- **Language(s) (NLP):** cs
- **License:** cc-by-nc-sa-4.0
- **Model Architecture:** RoBERTa
- **Resources for more information:**
- [RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model](https://doi.org/10.1007/978-3-030-83527-9_17)
- [arXiv preprint is also available](https://arxiv.org/abs/2105.11314)
# Uses
## Direct Use
Fill-Mask tasks.
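For quick experimentation, the model can be queried through the `fill-mask` pipeline in Transformers. The snippet below is a minimal sketch; the Czech example sentence is illustrative only.

```python
# Minimal fill-mask sketch; the example sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ufal/robeczech-base")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Praha je hlavní <mask> České republiky."):
    print(prediction["token_str"], prediction["score"])
```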
## Downstream Use
Morphological tagging and lemmatization, dependency parsing, named entity
recognition, and semantic parsing.
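A common pattern for such downstream tasks is to use RobeCzech as an encoder and feed its contextual embeddings to a task-specific model. The sketch below only shows how these embeddings can be obtained with Transformers; it is not the authors' evaluation pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModel.from_pretrained("ufal/robeczech-base")
model.eval()

# The example sentence is illustrative only.
inputs = tokenizer("Vláda dnes schválila nový zákon.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per subword; a downstream tagger or parser would typically
# pool these back to word level before classification.
embeddings = outputs.last_hidden_state  # shape: (1, num_subwords, 768)
```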
# Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models
(see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf)
and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
Predictions generated by the model may include disturbing and harmful
stereotypes across protected classes; identity characteristics; and sensitive,
social, and occupational groups.
## Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and
limitations of the model. More information is needed for further recommendations.
# Training Details
## Training Data
The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> We trained RobeCzech on a collection of the following publicly available texts:
> - SYN v4, a large corpus of contemporary written Czech, 4,188M tokens;
> - Czes, a collection of Czech newspaper and magazine articles, 432M tokens;
> - documents with at least 400 tokens from the Czech part of the web corpus W2C, tokenized with MorphoDiTa, 16M tokens;
> - plain texts extracted from the Czech Wikipedia dump 20201020 using WikiExtractor, tokenized with MorphoDiTa, 123M tokens.
> All these corpora contain whole documents, even if the SYN v4 is
> block-shuffled (blocks with at most 100 words respecting sentence boundaries
> are permuted in a document) and in total contain 4,917M tokens.
## Training Procedure
### Preprocessing
The texts are tokenized into subwords with a byte-level BPE (BBPE) tokenizer,
which was trained on the entire corpus with its vocabulary size limited to
52,000 items.
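The tokenizer distributed with this repository can be inspected directly; the snippet below is a small illustration of the subword segmentation (the example word is arbitrary).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")

# The vocabulary holds roughly 52,000 items, as described above.
print(len(tokenizer))

# Byte-level BPE splits long or rare words into subword pieces.
print(tokenizer.tokenize("nejneobhospodařovávatelnějšími"))
```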
### Speeds, Sizes, Times
The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> The training batch size is 8,192 and each training batch consists of sentences
> sampled contiguously, even across document boundaries, such that the total
> length of each sample is at most 512 tokens (FULL-SENTENCES setting). We use
> Adam optimizer with β1 = 0.9 and β2 = 0.98 to minimize the masked
> language-modeling objective.
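For readers who want to mirror the quoted optimizer settings outside Fairseq, a minimal sketch using PyTorch's Adam is shown below; the learning rate is a placeholder and not a value reported in the paper.

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Adam with beta1 = 0.9 and beta2 = 0.98, as quoted above; the learning
# rate below is a placeholder, not taken from the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
```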
### Software Used
The [Fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/roberta)
implementation was used for training.
# Evaluation
## Testing Data, Factors & Metrics
### Testing Data
The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> We evaluate RobeCzech in five NLP tasks, three of them leveraging frozen
> contextualized word embeddings, two approached with fine-tuning:
> - morphological analysis and lemmatization: frozen contextualized word embeddings,
> - dependency parsing: frozen contextualized word embeddings,
> - named entity recognition: frozen contextualized word embeddings,
> - semantic parsing: fine-tuned,
> - sentiment analysis: fine-tuned.
## Results
| Model     | PDT3.5 (POS) | PDT3.5 (LAS) | UD2.3 (XPOS) | UD2.3 (LAS) | CNEC1.1 (nested) | CNEC1.1 (flat) | PTG (Avg) | PTG (F1) |
|-----------|--------------|--------------|--------------|-------------|------------------|----------------|-----------|----------|
| RobeCzech | 98.50        | 91.42        | 98.31        | 93.77       | 87.82            | 87.47          | 92.36     | 80.13    |
# Environmental Impact
- **Hardware Type:** 8× NVIDIA Quadro P5000 GPUs
- **Hours used:** 2190 (~3 months)
# Citation
```
@InProceedings{10.1007/978-3-030-83527-9_17,
  author    = {Straka, Milan and N{\'a}plava, Jakub and Strakov{\'a}, Jana and Samuel, David},
  editor    = {Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
  title     = {{RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model}},
  booktitle = {Text, Speech, and Dialogue},
  year      = {2021},
  publisher = {Springer International Publishing},
  address   = {Cham},
  pages     = {197--209},
  isbn      = {978-3-030-83527-9}
}
```
# How to Get Started with the Model
Use the code below to get started with the model.
<details>
<summary> Click to expand </summary>
```python
# Load the tokenizer and the masked-language-model head from the Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
</details>
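Continuing from the snippet above, a quick sanity check is to predict a masked token directly from the model's logits; the sentence is illustrative only.

```python
import torch

# Uses the tokenizer and model loaded in the snippet above.
sentence = "RobeCzech je <mask> jazykový model."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and print the highest-scoring replacement.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```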