---
widget:
- text: "My name is Mark and I live in London. I am a postgraduate student at Queen Mary University."
language:
- en
license: mit
---

# Multilingual Hate Speech Classifier for Social Media Content

This is a multilingual model for hate speech classification of social media content. It is based on the pre-trained multilingual representations of the XLM-T model (https://arxiv.org/abs/2104.12250) and was jointly fine-tuned on five languages: Arabic, Croatian, English, German and Slovenian. The test results for these five languages, in terms of F1 score, are as follows:

| Language  |   F1   |
|-----------|:------:|
| Arabic    | 0.8704 |
| Croatian  | 0.7226 |
| English   | 0.7851 |
| German    | 0.7826 |
| Slovenian | 0.7596 |

## Tokenizer

During training, the text was preprocessed with the original XLM-T tokenizer. The pretrained tokenizer files are included in this repository, and we suggest using the same tokenizer for inference.
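
As a minimal sketch of that suggestion, the tokenizer and the classifier can be loaded from the same location so that inference uses the tokenizer files shipped with the model. This assumes the Hugging Face `transformers` library and that the checkpoint exposes a sequence-classification head; `MODEL_PATH` and `load_tokenizer_and_model` are hypothetical names used only for illustration.

```python
from typing import Any, Tuple

# Hypothetical placeholder: replace with the local path or Hub id of this repository.
MODEL_PATH = "path/to/this/repository"


def load_tokenizer_and_model(model_path: str = MODEL_PATH) -> Tuple[Any, Any]:
    """Load the tokenizer and the classifier from the same location."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Both objects come from one path, so inference reuses the tokenizer
    # files that are included alongside the model weights.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Assumption: the checkpoint carries a sequence-classification head.
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    return tokenizer, model
```

Loading both from a single path avoids accidentally pairing the model with a tokenizer from a different checkpoint.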

## Model output

The model classifies each input into one of two classes:

* 0 - not-offensive
* 1 - offensive
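
The class ids above can be mapped back to readable labels as sketched below. This assumes the classifier emits one raw score (logit) per class, with the position of each score matching its class id; `ID2LABEL`, `softmax` and `decode_prediction` are hypothetical helper names, and the logits in the example are made up rather than real model output.

```python
import math
from typing import Dict, List, Sequence

# Class ids and labels as listed in this section.
ID2LABEL: Dict[int, str] = {0: "not-offensive", 1: "offensive"}


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits: Sequence[float]) -> Dict[str, object]:
    """Pick the higher-scoring class and report its id, label and probability."""
    if len(logits) != len(ID2LABEL):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return {"id": class_id, "label": ID2LABEL[class_id], "score": probs[class_id]}


# Made-up logits, for illustration only (not real model output).
example = decode_prediction([-1.2, 2.3])
print(example["id"], example["label"])  # prints "1 offensive"
```

The returned `score` is the softmax probability of the chosen class, which can be compared against a threshold if a more conservative decision rule than plain argmax is wanted.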