---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!"
---
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre-annotated datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.
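Since the model weights are stored in the standard `transformers` format, the classifier can presumably also be used without `simpletransformers`; a minimal sketch with the `transformers` pipeline API (the label names returned depend on the `id2label` mapping stored in the model configuration):
```python
from transformers import pipeline

# Load the fine-tuned genre classifier directly from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="TajaKuzman/xlm-roberta-base-multilingual-text-genres",
)

# The returned labels follow the id2label mapping in the model config.
print(classifier("On our site, you can find a great genre identification model. Available for free!"))
```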
## Model description
### Fine-tuning hyperparameters
Fine-tuning was performed with the `simpletransformers` library. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
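These arguments can be passed to a `simpletransformers` fine-tuning run. A minimal sketch of what such a run looks like, assuming a hypothetical pandas DataFrame `train_df` with `text` and integer `labels` columns (the actual training data combine the GINCO, CORE and FTD datasets described above):
```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical toy training data; the real training data combine GINCO, CORE and FTD.
train_df = pd.DataFrame({
    "text": ["First, prepare the data and run the exploratory analysis ...",
             "Order now and get a great discount on all our products!"],
    "labels": [0, 1],  # integer-encoded genre categories
})

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

# Start from the multilingual base model and fine-tune it on the genre labels.
model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=train_df["labels"].nunique(),
    use_cuda=True,
    args=model_args,
)
model.train_model(train_df)
```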
## Intended use and limitations
## Usage
### Use examples
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta", "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args
)

predictions, logit_output = model.predict(
    ["How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
     "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can quickly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!"]
)

predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
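The second returned value, `logit_output`, contains one row of raw logits per input text. If you want to post-process the predictions and keep only reliable ones (as the first example text suggests), one option is to turn the logits into probabilities with a softmax and apply a confidence threshold. A minimal sketch that continues from the snippet above; the 0.8 threshold is an arbitrary assumption, not a value recommended by the authors:
```python
import numpy as np

def softmax(logits):
    # Turn one row of raw logits into a probability distribution.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

probabilities = np.array([softmax(row) for row in logit_output])
confidences = probabilities.max(axis=1)

# Keep a prediction only when the model is sufficiently confident (threshold is an assumption).
reliable_predictions = [
    (label, float(conf)) if conf >= 0.8 else ("Unreliable", float(conf))
    for label, conf in zip(predictions, confidences)
]
print(reliable_predictions)
```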
## Performance
## Citation
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
```
@misc{Kuzman2022,
author = {Kuzman, Taja},
title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```
and the following paper, on which the base model is built:
```
@article{DBLP:journals/corr/abs-1911-02116,
author = {Alexis Conneau and
Kartikay Khandelwal and
Naman Goyal and
Vishrav Chaudhary and
Guillaume Wenzek and
Francisco Guzm{\'{a}}n and
Edouard Grave and
Myle Ott and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {Unsupervised Cross-lingual Representation Learning at Scale},
journal = {CoRR},
volume = {abs/1911.02116},
year = {2019},
url = {http://arxiv.org/abs/1911.02116},
eprinttype = {arXiv},
eprint = {1911.02116},
timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
To cite the datasets that were used for fine-tuning:
CORE dataset:
```
@article{egbert2015developing,
title={Developing a bottom-up, user-based method of web register classification},
author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
journal={Journal of the Association for Information Science and Technology},
volume={66},
number={9},
pages={1817--1831},
year={2015},
publisher={Wiley Online Library}
}
```
GINCO dataset:
```
@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
author = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
title = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1584--1594},
url = {https://aclanthology.org/2022.lrec-1.170}
}
```
FTD dataset:
```
@article{sharoff2018functional,
title={Functional text dimensions for the annotation of web corpora},
author={Sharoff, Serge},
journal={Corpora},
volume={13},
number={1},
pages={65--95},
year={2018},
publisher={Edinburgh University Press}
}
```
The datasets are available at:
1. http://hdl.handle.net/11356/1467 (GINCO)
2. https://github.com/TurkuNLP/CORE-corpus (CORE)
3. https://github.com/ssharoff/genre-keras (FTD) |