---
widget:

- text: "My name is Mark and I live in London. I am a postgraduate student at Queen Mary University."
language:
  - ar
  - hr
  - en
  - de
  - sl
license: mit
---

# Multilingual Hate Speech Classifier for Social Media Content

A multilingual model for hate speech classification of social media content. The model builds on the pre-trained multilingual representations of [XLM-T](https://arxiv.org/abs/2104.12250) and was jointly fine-tuned on five languages: Arabic, Croatian, English, German and Slovenian. Test results for these five languages, reported as F1 score, are as follows:

| Language  |   F1   |
|-----------|:------:|
| Arabic    | 0.8704 |
| Croatian  | 0.7226 |
| English   | 0.7851 |
| German    | 0.7826 |
| Slovenian | 0.7596 |

## Tokenizer

During training, the text was preprocessed with the original XLM-T tokenizer. The pretrained tokenizer files are included in this repository, and we recommend using the same tokenizer for inference.
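One way to keep the tokenizer and the model matched at inference time is to load both from the same location. The sketch below assumes the checkpoint works with the `transformers` auto classes (`AutoTokenizer`, `AutoModelForSequenceClassification`) and that `model_path` points to a local clone of this repository or its Hub identifier. The helper names and the `max_length=128` default are illustrative choices, not settings taken from training.

```python
def load_classifier(model_path):
    """Load tokenizer and model from the same location so they stay matched.

    `model_path` is a placeholder: a local clone of this repository
    or its Hub identifier.
    """
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def encode(tokenizer, texts, max_length=128):
    """Tokenize a list of strings into padded PyTorch tensors.

    max_length=128 is an assumed default, not a documented training setting.
    """
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

Loading both objects from one path avoids accidentally pairing the fine-tuned weights with a different vocabulary.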

## Model output

The model assigns each input to one of two classes:
* 0 - not-offensive
* 1 - offensive
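The class ids above can be turned into readable labels with a small decoding step. This is a minimal sketch that assumes the model emits one raw score (logit) per class for each input, which is typical for a two-class classification head but is not stated in this card. The names `ID2LABEL`, `softmax` and `decode` are illustrative.

```python
import math

# Class ids and names as listed above.
ID2LABEL = {0: "not-offensive", 1: "offensive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for one example's two-element logit list."""
    probs = softmax(logits)
    # Pick the index of the highest probability.
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]
```

For example, `decode([2.0, -1.0])` yields the label `not-offensive` with a probability of roughly 0.95.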