---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: "On our site, you can find a great genre identification model which you can use for thousands of different tasks. For free!"
---

# Multilingual text genre classifier xlm-roberta-base-multilingual-text-genres

Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre-annotated datasets: the Slovene GINCO dataset, the English CORE dataset and the English FTD dataset. The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.

## Model description

### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. A brief hyperparameter optimization was performed beforehand, and the presumed optimal hyperparameters are:

```python
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}
```

## Intended use and limitations

## Usage

### Use examples

```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta",
    "TajaKuzman/xlm-roberta-base-multilingual-text-genres",
    use_cuda=True,
    args=model_args,
)

predictions, logit_output = model.predict(
    [
        "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
        "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
    ]
)

predictions
### Output:
### (['Instruction', 'Promotion'],
###  array([[-1.44140625, -0.63183594, -1.14453125,  7.828125  , -1.05175781,
###          -0.80957031, -0.86083984, -0.81201172, -0.71777344],
###         [-0.78564453, -1.15429688, -1.26660156, -0.29980469, -1.19335938,
###          -1.20410156, -1.33300781, -0.87890625,  7.7890625 ]]))
```
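As the output above shows, `model.predict` returns the predicted genre labels together with raw logits (one score per genre label). If per-label confidence is more convenient, the logits can be turned into probabilities with a softmax. The snippet below is a minimal sketch, not part of the original example; it only assumes the `predictions` and `logit_output` values returned by the call above.

```python
import numpy as np

def softmax(logits):
    # Subtract each row's max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Each row of `probs` sums to 1; the predicted label is the argmax column.
probs = softmax(np.asarray(logit_output))
for label, row in zip(predictions, probs):
    print(f"{label}: {row.max():.3f}")
```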
## Performance

## Citation

If you use the model, please cite the GitHub repository where the fine-tuning experiments are explained:

```
@misc{Kuzman2022,
  author = {Kuzman, Taja},
  title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}
```

and the following paper on which the original model is based:

```
@article{DBLP:journals/corr/abs-1911-02116,
  author = {Alexis Conneau and Kartikay Khandelwal and Naman Goyal and Vishrav Chaudhary and Guillaume Wenzek and Francisco Guzm{\'{a}}n and Edouard Grave and Myle Ott and Luke Zettlemoyer and Veselin Stoyanov},
  title = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal = {CoRR},
  volume = {abs/1911.02116},
  year = {2019},
  url = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

To cite the datasets that were used for fine-tuning:

CORE dataset:

```
@article{egbert2015developing,
  title={Developing a bottom-up, user-based method of web register classification},
  author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
  journal={Journal of the Association for Information Science and Technology},
  volume={66},
  number={9},
  pages={1817--1831},
  year={2015},
  publisher={Wiley Online Library}
}
```

GINCO dataset:

```
@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
  author = {Kuzman, Taja and Rupnik, Peter and Ljube{\v{s}}i{\'c}, Nikola},
  title = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  year = {2022},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {1584--1594},
  url = {https://aclanthology.org/2022.lrec-1.170}
}
```

FTD dataset:

```
@article{sharoff2018functional,
  title={Functional text dimensions for the annotation of web corpora},
  author={Sharoff, Serge},
  journal={Corpora},
  volume={13},
  number={1},
  pages={65--95},
  year={2018},
  publisher={Edinburgh University Press}
}
```

The datasets are available at:

1. http://hdl.handle.net/11356/1467 (GINCO)
2. https://github.com/TurkuNLP/CORE-corpus (CORE)
3. https://github.com/ssharoff/genre-keras (FTD)