metadata

widget:
  - text: >-
      My name is Mark and I live in London. I am a postgraduate student at Queen
      Mary University.
language:
  - en
license: mit

Multilingual Hate Speech Classifier for Social Media Content

A multilingual model for hate speech classification of social media content. The model builds on the pre-trained multilingual representations of XLM-T (https://arxiv.org/abs/2104.12250) and was jointly fine-tuned on five languages: Arabic, Croatian, English, German and Slovenian. The F1 scores on the test data for each of the five languages are as follows:

Language	F1
Arabic	0.8704
Croatian	0.7226
English	0.7851
German	0.7826
Slovenian	0.7596

Tokenizer

During training, the text was preprocessed with the original XLM-T tokenizer. The pretrained tokenizer files are included in this repository. We recommend using the same tokenizer for inference.
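
As an illustration of reusing the bundled tokenizer at inference time, the following is a minimal sketch rather than an official usage recipe. It assumes the Hugging Face `transformers` library and PyTorch are installed, that `model_dir` (a hypothetical argument) points to a local copy of this repository, and that the checkpoint loads through `AutoModelForSequenceClassification`. The function names are illustrative and not part of the repository.

```python
def load_tokenizer_and_model(model_dir):
    """Load tokenizer and classifier from the same directory (assumed local repo copy)."""
    # Imported lazily so this sketch can be imported without the dependencies installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Loading both from one location keeps inference tokenization
    # identical to the tokenization used during training.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict_logits(texts, tokenizer, model):
    """Return one list of raw class scores (logits) per input text."""
    import torch

    # Pad to the longest text in the batch and truncate to the model's maximum length.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits.tolist()
```

A call such as `predict_logits(["some post"], *load_tokenizer_and_model("path/to/this/repo"))` would then yield the raw scores for each text, where the path is a placeholder for wherever the repository files live.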

Model output

The model assigns each input to one of two classes:

0 - not-offensive
1 - offensive
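
The mapping from class index to label can be sketched as below. This is a hedged example: it assumes the model returns two raw scores (logits) per input, ordered by class index, which is the usual layout for a two-class classification head but is not stated in this card. The names `ID2LABEL`, `softmax`, and `decode` are illustrative.

```python
import math

# Class indices and names as listed in this card.
ID2LABEL = {0: "not-offensive", 1: "offensive"}


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (class index, label, probability) for one input's logits."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, ID2LABEL[class_id], probs[class_id]
```

For example, `decode([-0.5, 1.5])` selects index 1 and therefore the label `offensive`, since the second score is the larger one.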