---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!"
---
# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre-annotated datasets: the Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.
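Since the model weights are stored in the standard `transformers` format, the classifier can presumably also be used without `simpletransformers`; a minimal sketch with the `transformers` pipeline API (the label names returned depend on the `id2label` mapping stored in the model configuration):
```python
from transformers import pipeline

# Load the fine-tuned genre classifier directly from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="TajaKuzman/xlm-roberta-base-multilingual-text-genres",
)

# The returned labels follow the id2label mapping in the model config.
print(classifier("On our site, you can find a great genre identification model. Available for free!"))
```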
## Model description
### Fine-tuning hyperparameters
Fine-tuning was performed with the `simpletransformers` library. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
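These arguments can be passed to a `simpletransformers` fine-tuning run. A minimal sketch of what such a run looks like, assuming a hypothetical pandas DataFrame `train_df` with `text` and integer `labels` columns (the actual training data combine the GINCO, CORE and FTD datasets described above):
```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical toy training data; the real training data combine GINCO, CORE and FTD.
train_df = pd.DataFrame({
    "text": ["First, prepare the data and run the exploratory analysis ...",
             "Order now and get a great discount on all our products!"],
    "labels": [0, 1],  # integer-encoded genre categories
})

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "overwrite_output_dir": True,
}

# Start from the multilingual base model and fine-tune it on the genre labels.
model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=train_df["labels"].nunique(),
    use_cuda=True,
    args=model_args,
)
model.train_model(train_df)
```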
## Intended use and limitations
## Usage
### Use examples
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta", "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args
)

predictions, logit_output = model.predict(
    ["How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
     "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can quickly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!"]
)

predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
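The second returned value, `logit_output`, contains one row of raw logits per input text. If you want to post-process the predictions and keep only reliable ones (as the first example text suggests), one option is to turn the logits into probabilities with a softmax and apply a confidence threshold. A minimal sketch that continues from the snippet above; the 0.8 threshold is an arbitrary assumption, not a value recommended by the authors:
```python
import numpy as np

def softmax(logits):
    # Turn one row of raw logits into a probability distribution.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

probabilities = np.array([softmax(row) for row in logit_output])
confidences = probabilities.max(axis=1)

# Keep a prediction only when the model is sufficiently confident (threshold is an assumption).
reliable_predictions = [
    (label, float(conf)) if conf >= 0.8 else ("Unreliable", float(conf))
    for label, conf in zip(predictions, confidences)
]
print(reliable_predictions)
```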
## Performance
## Citation
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
```
@misc{Kuzman2022,
author = {Kuzman, Taja},
title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```
and the following paper, on which the base model is built:
```
@article{DBLP:journals/corr/abs-1911-02116,
author = {Alexis Conneau and
Kartikay Khandelwal and
Naman Goyal and
Vishrav Chaudhary and
Guillaume Wenzek and
Francisco Guzm{\'{a}}n and
Edouard Grave and
Myle Ott and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {Unsupervised Cross-lingual Representation Learning at Scale},
journal = {CoRR},
volume = {abs/1911.02116},
year = {2019},
url = {http://arxiv.org/abs/1911.02116},
eprinttype = {arXiv},
eprint = {1911.02116},
timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
To cite the datasets that were used for fine-tuning:
CORE dataset:
```
@article{egbert2015developing,
title={Developing a bottom-up, user-based method of web register classification},
author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
journal={Journal of the Association for Information Science and Technology},
volume={66},
number={9},
pages={1817--1831},
year={2015},
publisher={Wiley Online Library}
}
```
GINCO dataset:
```
@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
author = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
title = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1584--1594},
url = {https://aclanthology.org/2022.lrec-1.170}
}
```
FTD dataset:
```
@article{sharoff2018functional,
title={Functional text dimensions for the annotation of web corpora},
author={Sharoff, Serge},
journal={Corpora},
volume={13},
number={1},
pages={65--95},
year={2018},
publisher={Edinburgh University Press}
}
```
The datasets are available at:
1. http://hdl.handle.net/11356/1467 (GINCO)
2. https://github.com/TurkuNLP/CORE-corpus (CORE)
3. https://github.com/ssharoff/genre-keras (FTD) |