|
--- |
|
license: cc-by-sa-4.0 |
|
|
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- br |
|
- bs |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- no |
|
- om |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- ug |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- zh |
|
|
|
tags: |
|
- text-classification |
|
- genre |
|
- text-genre |
|
|
|
widget: |
|
- text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!" |
|
|
|
--- |
|
|
|
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres |
|
|
|
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre-annotated datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.
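
For reference, the checkpoint can also be loaded directly with the `transformers` library, as in the minimal sketch below; whether the configuration exposes human-readable genre names through `id2label` is an assumption here, and the `simpletransformers`-based usage shown further down is the documented path.

```python
# Minimal sketch of loading the checkpoint with plain transformers; whether
# id2label holds readable genre names (rather than LABEL_0, ...) is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "TajaKuzman/xlm-roberta-base-multilingual-text-genres"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer(
    "Mix the flour and water, then knead the dough.",
    return_tensors="pt", truncation=True, max_length=512,
)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])
```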
|
|
|
## Model description |
|
|
|
### Fine-tuning hyperparameters |
|
|
|
Fine-tuning was performed with the `simpletransformers` library. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
|
|
|
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
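
For reference, a fine-tuning run with these arguments might look roughly like the sketch below; the toy training DataFrame and its integer-encoded genre labels are illustrative assumptions, not the exact training setup.

```python
# Hypothetical fine-tuning sketch with simpletransformers; the toy DataFrame
# and the integer label encoding are illustrative assumptions only.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({
    "text": [
        "Preheat the oven, mix the flour and water, then knead the dough.",
        "Visit our store today and enjoy unbeatable discounts!",
    ],
    "labels": [0, 1],  # e.g. 0 = Instruction, 1 = Promotion
})

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

# Start from the multilingual base model and fine-tune it on genre labels
model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=train_df["labels"].nunique(),
    args=model_args,
    use_cuda=True,  # set to False if no GPU is available
)
model.train_model(train_df)
```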
|
|
|
## Intended use and limitations |
|
|
|
## Usage |
|
|
|
### Use examples |
|
|
|
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

# Load the fine-tuned model from the Hugging Face Hub;
# set use_cuda=False if no GPU is available
model = ClassificationModel(
    "xlmroberta",
    "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args,
)

# predict() returns the predicted genre labels and the raw logits
# (one score per genre category) for each input text
predictions, logit_output = model.predict([
    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
])

predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
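
Since `predict()` returns raw logits alongside the label strings, a softmax over the logit output gives rough per-genre scores if those are more convenient (a minimal sketch; assumes `scipy` is installed):

```python
from scipy.special import softmax

# Turn the raw logits into per-genre scores that sum to 1 for each text
probabilities = softmax(logit_output, axis=1)
for label, probs in zip(predictions, probabilities):
    print(label, round(float(probs.max()), 3))  # top label and its score
```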
|
|
|
## Performance |
|
|
|
|
|
## Citation |
|
|
|
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained: |
|
|
|
``` |
|
@misc{Kuzman2022, |
|
author = {Kuzman, Taja}, |
|
title = {{Comparison of genre datasets: CORE, GINCO and FTD}}, |
|
year = {2022}, |
|
publisher = {GitHub}, |
|
journal = {GitHub repository}, |
|
howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}} |
|
} |
|
``` |
|
|
|
and the following paper, which introduces the base `xlm-roberta-base` model:
|
``` |
|
@article{DBLP:journals/corr/abs-1911-02116, |
|
author = {Alexis Conneau and |
|
Kartikay Khandelwal and |
|
Naman Goyal and |
|
Vishrav Chaudhary and |
|
Guillaume Wenzek and |
|
Francisco Guzm{\'{a}}n and |
|
Edouard Grave and |
|
Myle Ott and |
|
Luke Zettlemoyer and |
|
Veselin Stoyanov}, |
|
title = {Unsupervised Cross-lingual Representation Learning at Scale}, |
|
journal = {CoRR}, |
|
volume = {abs/1911.02116}, |
|
year = {2019}, |
|
url = {http://arxiv.org/abs/1911.02116}, |
|
eprinttype = {arXiv}, |
|
eprint = {1911.02116}, |
|
timestamp = {Mon, 11 Nov 2019 18:38:09 +0100}, |
|
biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib}, |
|
bibsource = {dblp computer science bibliography, https://dblp.org} |
|
} |
|
``` |
|
|
|
To cite the datasets that were used for fine-tuning: |
|
|
|
CORE dataset: |
|
|
|
``` |
|
@article{egbert2015developing, |
|
title={Developing a bottom-up, user-based method of web register classification}, |
|
author={Egbert, Jesse and Biber, Douglas and Davies, Mark}, |
|
journal={Journal of the Association for Information Science and Technology}, |
|
volume={66}, |
|
number={9}, |
|
pages={1817--1831}, |
|
year={2015}, |
|
publisher={Wiley Online Library} |
|
} |
|
``` |
|
|
|
GINCO dataset: |
|
|
|
``` |
|
@InProceedings{kuzman-rupnik-ljubesic:2022:LREC,
|
author = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola}, |
|
title = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}}, |
|
booktitle = {Proceedings of the Language Resources and Evaluation Conference}, |
|
  month     = {June},
|
year = {2022}, |
|
address = {Marseille, France}, |
|
publisher = {European Language Resources Association}, |
|
pages = {1584--1594}, |
|
url = {https://aclanthology.org/2022.lrec-1.170} |
|
} |
|
``` |
|
|
|
FTD dataset: |
|
|
|
``` |
|
@article{sharoff2018functional, |
|
title={Functional text dimensions for the annotation of web corpora}, |
|
author={Sharoff, Serge}, |
|
journal={Corpora}, |
|
volume={13}, |
|
number={1}, |
|
pages={65--95}, |
|
year={2018}, |
|
  publisher={Edinburgh University Press}
|
} |
|
``` |
|
|
|
The datasets are available at: |
|
1. http://hdl.handle.net/11356/1467 (GINCO) |
|
2. https://github.com/TurkuNLP/CORE-corpus (CORE) |
|
3. https://github.com/ssharoff/genre-keras (FTD) |