---
language: cs
license: cc-by-nc-sa-4.0
tags:
- RobeCzech
- Czech
- RoBERTa
- ÚFAL
---

# Model Card for RobeCzech

**If you are having issues with the tokenizer, please see https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314.**

# Model Details

## Model Description

RobeCzech is a monolingual RoBERTa language representation model trained on Czech data.

- **Developed by:** Institute of Formal and Applied Linguistics, Charles University, Prague (UFAL)
- **Shared by:** Hugging Face and [LINDAT/CLARIAH-CZ](https://hdl.handle.net/11234/1-3691)
- **Model type:** Fill-Mask
- **Language(s) (NLP):** cs
- **License:** cc-by-nc-sa-4.0
- **Model Architecture:** RoBERTa
- **Resources for more information:**
  - [RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model](https://doi.org/10.1007/978-3-030-83527-9_17)
    - [arXiv preprint is also available](https://arxiv.org/abs/2105.11314)


# Uses

## Direct Use

Fill-Mask tasks.
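
For example, the model can be used directly through the `fill-mask` pipeline (a minimal sketch; the example sentence and its predictions are illustrative only):

```python
from transformers import pipeline

# Fill-mask pipeline built on top of RobeCzech
fill_mask = pipeline("fill-mask", model="ufal/robeczech-base")

# Use the tokenizer's own mask token rather than hard-coding it
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"Praha je {mask} město České republiky."):
    print(prediction["token_str"], prediction["score"])
```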

## Downstream Use

Morphological tagging and lemmatization, dependency parsing, named entity
recognition, and semantic parsing.


# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models
(see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf)
and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
Predictions generated by the model may include disturbing and harmful
stereotypes across protected classes; identity characteristics; and sensitive,
social, and occupational groups.

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and
limitations of the model. More information needed for further recommendations.


# Training Details

## Training Data

The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> We trained RobeCzech on a collection of the following publicly available texts:
> - SYN v4, a large corpus of contemporary written Czech, 4,188M tokens;
> - Czes, a collection of Czech newspaper and magazine articles, 432M tokens;
> - documents with at least 400 tokens from the Czech part of the web corpus W2C, tokenized with MorphoDiTa, 16M tokens;
> - plain texts extracted from the Czech Wikipedia dump 20201020 using WikiExtractor, tokenized with MorphoDiTa, 123M tokens.

> All these corpora contain whole documents, even if the SYN v4 is
> block-shuffled (blocks with at most 100 words respecting sentence boundaries
> are permuted in a document) and in total contain 4,917M tokens.

## Training Procedure

### Preprocessing

The texts are tokenized into subwords with a byte-level BPE (BBPE) tokenizer,
which was trained on the entire corpus with its vocabulary size limited to
52,000 items.
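
A quick way to inspect this subword vocabulary is to load the released tokenizer (a minimal sketch; the example sentence is illustrative only):

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer released with the model
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")

# The vocabulary was limited to 52,000 items during training
print(len(tokenizer))

# Tokenize an illustrative Czech sentence into subwords
print(tokenizer.tokenize("Praha je hlavní město České republiky."))
```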

### Speeds, Sizes, Times
The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> The training batch size is 8,192 and each training batch consists of sentences
> sampled contiguously, even across document boundaries, such that the total
> length of each sample is at most 512 tokens (FULL-SENTENCES setting). We use
> Adam optimizer with β1 = 0.9 and β2 = 0.98 to minimize the masked
> language-modeling objective.
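
As an illustration of these hyperparameters only (the actual training used the Fairseq RoBERTa recipe; the learning rate, warm-up schedule, and the effective batch size of 8,192 are not reproduced here), the optimizer setting corresponds to:

```python
import torch
from transformers import AutoModelForMaskedLM

# Illustration only: the real training was run with Fairseq, not this snippet
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Adam with β1 = 0.9 and β2 = 0.98, minimizing the masked-language-modeling loss
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98))
```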

### Software Used
The [Fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/roberta)
implementation was used for training.


# Evaluation

## Testing Data, Factors & Metrics

### Testing Data
The model creators note in the [associated paper](https://arxiv.org/pdf/2105.11314.pdf):
> We evaluate RobeCzech in five NLP tasks, three of them leveraging frozen
> contextualized word embeddings, two approached with fine-tuning:
> - morphological analysis and lemmatization: frozen contextualized word embeddings,
> - dependency parsing: frozen contextualized word embeddings,
> - named entity recognition: frozen contextualized word embeddings,
> - semantic parsing: fine-tuned,
> - sentiment analysis: fine-tuned.
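
For the tasks that use frozen contextualized word embeddings, the representations can be obtained along the lines of the following sketch (an illustration of the general approach, not the exact evaluation pipeline from the paper):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModel.from_pretrained("ufal/robeczech-base")
model.eval()

# Encode a sentence and take the last hidden layer as frozen contextual embeddings
inputs = tokenizer("Praha je hlavní město České republiky.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch, sequence_length, hidden_size)
```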

## Results

| Model     | Morphosynt. PDT3.5 (POS) | Morphosynt. PDT3.5 (LAS) | Morphosynt. UD2.3 (XPOS) | Morphosynt. UD2.3 (LAS) | NER CNEC1.1 (nested) | NER CNEC1.1 (flat) | Semant. PTG (Avg) | Semant. PTG (F1) |
|-----------|--------------------------|--------------------------|--------------------------|-------------------------|----------------------|--------------------|-------------------|------------------|
| RobeCzech | 98.50                    | 91.42                    | 98.31                    | 93.77                   | 87.82                | 87.47              | 92.36             | 80.13            |


# Environmental Impact

- **Hardware Type:** 8× NVIDIA Quadro P5000 GPUs
- **Hours used:** 2,190 (approximately 3 months)


# Citation

```
@InProceedings{10.1007/978-3-030-83527-9_17,
  author={Straka, Milan and N{\'a}plava, Jakub and Strakov{\'a}, Jana and Samuel, David},
  editor={Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
  title={{RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model}},
  booktitle="Text, Speech, and Dialogue",
  year="2021",
  publisher="Springer International Publishing",
  address="Cham",
  pages="197--209",
  isbn="978-3-030-83527-9"
}
```


# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the RobeCzech tokenizer and the masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
</details>