---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: >-
On our site, you can find a great genre identification model which you can
use for thousands of different tasks. For free!
---

# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres

Text classification model based on xlm-roberta-base and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO dataset, the English CORE dataset and the English FTD dataset. The model can be used for automatic genre identification of any text in a language supported by xlm-roberta-base.
## Model description

### Fine-tuning hyperparameters

Fine-tuning was performed with the simpletransformers library. A brief hyperparameter optimization was performed beforehand, and the presumed optimal hyperparameters are:
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
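For context, a minimal fine-tuning sketch with these hyperparameters might look as follows. This is not the original training script: `train_df` is a hypothetical toy pandas DataFrame with `text` and `labels` columns, and the label mapping is purely illustrative.

```python
# A hedged sketch of a fine-tuning run with the hyperparameters above;
# not the original training script used for this model.
import pandas as pd
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

# Hypothetical genre-annotated training data with integer-encoded labels.
train_df = pd.DataFrame({
    "text": [
        "First, preheat the oven to 200 degrees and mix the dry ingredients.",
        "Visit our store today and get 20% off all orders!",
    ],
    "labels": [0, 1],  # e.g. 0 = Instruction, 1 = Promotion (illustrative mapping)
})

# Start from the base multilingual encoder and fine-tune it on the genre data.
model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=2, use_cuda=True, args=model_args,
)
model.train_model(train_df)
```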
## Intended use and limitations

### Usage

#### Use examples
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta",
    "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args,
)

predictions, logit_output = model.predict([
    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
])
predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
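The second element returned by `model.predict()` holds one raw logit per genre label. As a rough illustration (not part of the original card), the logits can be turned into probabilities with a softmax to inspect how confident each prediction is:

```python
import numpy as np

# Convert the raw logits returned by model.predict() into per-class probabilities
# and print the confidence of the top prediction for each input text.
logits = np.array(logit_output)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

for label, p in zip(predictions, probs):
    print(f"{label}: {p.max():.3f}")
```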
## Performance
## Citation
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
```bibtex
@misc{Kuzman2022,
  author = {Kuzman, Taja},
  title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```
and the following paper on which the original model is based:
```bibtex
@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
To cite the datasets that were used for fine-tuning:
CORE dataset:
```bibtex
@article{egbert2015developing,
  title={Developing a bottom-up, user-based method of web register classification},
  author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
  journal={Journal of the Association for Information Science and Technology},
  volume={66},
  number={9},
  pages={1817--1831},
  year={2015},
  publisher={Wiley Online Library}
}
```
GINCO dataset:
```bibtex
@InProceedings{kuzman-rupnik-ljubesic:2022:LREC,
  author    = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
  title     = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1584--1594},
  url       = {https://aclanthology.org/2022.lrec-1.170}
}
```
FTD dataset:
```bibtex
@article{sharoff2018functional,
  title={Functional text dimensions for the annotation of web corpora},
  author={Sharoff, Serge},
  journal={Corpora},
  volume={13},
  number={1},
  pages={65--95},
  year={2018},
  publisher={Edinburgh University Press}
}
```
The datasets are available at: