---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: >-
On our site, you can find a great genre identification model which you can
use for thousands of different tasks. For free!
---

# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres

Text classification model based on xlm-roberta-base and fine-tuned on a combination of three datasets of texts annotated with genre categories: the Slovene GINCO dataset, the English CORE dataset and the English FTD dataset. The model can be used for automatic genre identification of any text in a language supported by xlm-roberta-base.
## Model description

### Fine-tuning hyperparameters

Fine-tuning was performed with the simpletransformers library. A brief hyperparameter optimization was performed beforehand, and the presumed optimal hyperparameters are:
```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```
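For context, a minimal fine-tuning sketch with these hyperparameters might look as follows. This is not the original training script: `train_df` is a hypothetical toy pandas DataFrame with `text` and `labels` columns, and the label mapping is purely illustrative.

```python
# A hedged sketch of a fine-tuning run with the hyperparameters above;
# not the original training script used for this model.
import pandas as pd
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

# Hypothetical genre-annotated training data with integer-encoded labels.
train_df = pd.DataFrame({
    "text": [
        "First, preheat the oven to 200 degrees and mix the dry ingredients.",
        "Visit our store today and get 20% off all orders!",
    ],
    "labels": [0, 1],  # e.g. 0 = Instruction, 1 = Promotion (illustrative mapping)
})

# Start from the base multilingual encoder and fine-tune it on the genre data.
model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=2, use_cuda=True, args=model_args,
)
model.train_model(train_df)
```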
## Intended use and limitations

### Usage

#### Use examples
```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta",
    "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args,
)

predictions, logit_output = model.predict([
    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
])
predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
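The second element returned by `model.predict()` holds one raw logit per genre label. As a rough illustration (not part of the original card), the logits can be turned into probabilities with a softmax to inspect how confident each prediction is:

```python
import numpy as np

# Convert the raw logits returned by model.predict() into per-class probabilities
# and print the confidence of the top prediction for each input text.
logits = np.array(logit_output)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

for label, p in zip(predictions, probs):
    print(f"{label}: {p.max():.3f}")
```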
## Performance
## Citation
If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:
```bibtex
@misc{Kuzman2022,
  author = {Kuzman, Taja},
  title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```
and the following paper on which the original model is based:
```bibtex
@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
To cite the datasets that were used for fine-tuning:
CORE dataset:
```bibtex
@article{egbert2015developing,
  title={Developing a bottom-up, user-based method of web register classification},
  author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
  journal={Journal of the Association for Information Science and Technology},
  volume={66},
  number={9},
  pages={1817--1831},
  year={2015},
  publisher={Wiley Online Library}
}
```
GINCO dataset:
```bibtex
@InProceedings{kuzman-rupnik-ljubesic:2022:LREC,
  author    = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
  title     = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1584--1594},
  url       = {https://aclanthology.org/2022.lrec-1.170}
}
```
FTD dataset:
```bibtex
@article{sharoff2018functional,
  title={Functional text dimensions for the annotation of web corpora},
  author={Sharoff, Serge},
  journal={Corpora},
  volume={13},
  number={1},
  pages={65--95},
  year={2018},
  publisher={Edinburgh University Press}
}
```
The datasets are available at: