|
--- |
|
license: cc |
|
language: |
|
- pt |
|
tags: |
|
- Hate Speech |
|
- kNOwHATE |
|
- not-for-all-audiences |
|
widget: |
|
- text: Os [MASK] são todos uns animais, deviam voltar para a sua terra. |
|
--- |
|
--- |
|
<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg"> |
|
<p style="text-align: center;"> This is the model card for HateBERTimbau. |
|
You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>. |
|
</p> |
|
|
|
--- |
|
|
|
# HateBERTimbau |
|
|
|
**HateBERTimbau** is a foundation large language model for European **Portuguese** (as used in **Portugal**), specialized in hate speech content.
|
|
|
It is an **encoder** of the BERT family, based on the Transformer neural architecture. It was developed on top of the [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model, which was retrained on a dataset of 229,103 tweets specifically focused on potential hate speech.
|
|
|
## Model Description |
|
|
|
- **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu) |
|
- **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal) |
|
- **Model type:** Transformer-based model retrained for Hate Speech in Portuguese social media text |
|
- **Language:** Portuguese |
|
- **Retrained from model:** [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) |
|
|
|
Several models were developed by fine-tuning the base HateBERTimbau model for hate speech detection. They are listed in the table below:
|
|
|
| HateBERTimbau's Family of Models | |
|
|---------------------------------------------------------------------------------------------------------| |
|
| [**HateBERTimbau YouTube**](https://huggingface.co/knowhate/HateBERTimbau-youtube) | |
|
| [**HateBERTimbau Twitter**](https://huggingface.co/knowhate/HateBERTimbau-twitter) | |
|
| [**HateBERTimbau YouTube+Twitter**](https://huggingface.co/knowhate/HateBERTimbau-yt-tt)| |
|
|
|
# Uses |
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python
from transformers import pipeline

# Load a fill-mask pipeline backed by HateBERTimbau
unmasker = pipeline('fill-mask', model='knowhate/HateBERTimbau')

unmasker("Os [MASK] são todos uns animais, deviam voltar para a sua terra.")

# Example output:
# [{'score': 0.6771652698516846,
#   'token': 12714,
#   'token_str': 'africanos',
#   'sequence': 'Os africanos são todos uns animais, deviam voltar para a sua terra.'},
#  {'score': 0.08679857850074768,
#   'token': 15389,
#   'token_str': 'homossexuais',
#   'sequence': 'Os homossexuais são todos uns animais, deviam voltar para a sua terra.'},
#  {'score': 0.03806231543421745,
#   'token': 4966,
#   'token_str': 'portugueses',
#   'sequence': 'Os portugueses são todos uns animais, deviam voltar para a sua terra.'},
#  {'score': 0.035253893584012985,
#   'token': 16773,
#   'token_str': 'Portugueses',
#   'sequence': 'Os Portugueses são todos uns animais, deviam voltar para a sua terra.'},
#  {'score': 0.023521048948168755,
#   'token': 8618,
#   'token_str': 'brancos',
#   'sequence': 'Os brancos são todos uns animais, deviam voltar para a sua terra.'}]
```
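The pipeline returns a list of dictionaries, so its output can be post-processed with ordinary Python. The following is a minimal sketch that keeps only the candidates above a score threshold. The helper name `confident_tokens` and the `0.05` threshold are illustrative assumptions, and the predictions are copied from the example output above rather than produced by a live call.

```python
# Predictions copied from the example output above (token ids and sequences omitted)
predictions = [
    {"score": 0.6771652698516846, "token_str": "africanos"},
    {"score": 0.08679857850074768, "token_str": "homossexuais"},
    {"score": 0.03806231543421745, "token_str": "portugueses"},
    {"score": 0.035253893584012985, "token_str": "Portugueses"},
    {"score": 0.023521048948168755, "token_str": "brancos"},
]


def confident_tokens(preds, min_score=0.05):
    """Return the token strings whose score is at least min_score.

    Hypothetical helper; the default threshold is an arbitrary choice.
    """
    return [p["token_str"] for p in preds if p["score"] >= min_score]


print(confident_tokens(predictions))
```

With the default threshold, only the two highest-scoring candidates remain.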
|
|
|
Alternatively, the model can be fine-tuned for a specific task or dataset:
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

# Load the tokenizer and the encoder with a newly initialized classification head
tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau")
dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    # "sentence1" and "sentence2" must match the text columns of the dataset in use
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Evaluate at the end of every epoch (recent transformers releases name this argument eval_strategy)
training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```
|
|
|
# Training |
|
|
|
## Data |
|
|
|
229,103 tweets associated with offensive content were used to retrain the base model. |
|
|
|
## Training Hyperparameters |
|
|
|
- Batch Size: 4 samples |
|
- Epochs: 100 |
|
- Learning Rate: 5e-5 with Adam optimizer |
|
- Maximum Sequence Length: 512 sentence pieces |
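To give a sense of scale, the hyperparameters above imply a large number of optimizer steps. The sketch below works this out from the figures in this card. It assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these details are stated here, so the result is an estimate only.

```python
import math

# Values taken from this model card
num_tweets = 229_103
batch_size = 4
epochs = 100

# Assumptions (not stated in the card): one device, no gradient
# accumulation, and the last partial batch of each epoch is kept.
steps_per_epoch = math.ceil(num_tweets / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 57276
print(total_steps)      # 5727600
```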
|
|
|
# Testing |
|
|
|
## Data |
|
|
|
We used two different datasets for testing: one of [YouTube comments](https://huggingface.co/datasets/knowhate/youtube-test) and one of [tweets](https://huggingface.co/datasets/knowhate/twitter-test).
|
|
|
## Hate Speech Classification Results (with no fine-tuning) |
|
|
|
| Dataset | Precision | Recall | F1-score | |
|
|:----------------|:-----------|:----------|:-------------| |
|
| **YouTube** | 0.928 | 0.108 | **0.193** | |
|
| **Twitter** | 0.686 | 0.211 | **0.323** | |
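The F1-scores in the table follow from the reported precision and recall, since F1 is their harmonic mean. The short check below recomputes them. The helper name `f1_score` is ours and is not part of any library used in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "YouTube": (0.928, 0.108, 0.193),
    "Twitter": (0.686, 0.211, 0.323),
}

for name, (precision, recall, reported_f1) in reported.items():
    # Print the recomputed value next to the reported one
    print(name, round(f1_score(precision, recall), 3), reported_f1)
```

The low F1-scores are driven by recall: without fine-tuning, precision is high on YouTube but only about one in ten positive examples is recovered.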
|
|
|
# BibTeX Citation |
|
|
|
```bibtex
@inproceedings{DBLP:conf/slate/MatosS00B22,
  author       = {Bernardo Cunha Matos and
                  Raquel Bento Santos and
                  Paula Carvalho and
                  Ricardo Ribeiro and
                  Fernando Batista},
  editor       = {Jo{\~{a}}o Cordeiro and
                  Maria Jo{\~{a}}o Pereira and
                  Nuno F. Rodrigues and
                  Sebasti{\~{a}}o Pais},
  title        = {Comparing Different Approaches for Detecting Hate Speech in Online
                  Portuguese Comments},
  booktitle    = {11th Symposium on Languages, Applications and Technologies, {SLATE}
                  2022, July 14-15, 2022, Universidade da Beira Interior, Covilh{\~{a}},
                  Portugal},
  series       = {OASIcs},
  volume       = {104},
  pages        = {10:1--10:12},
  publisher    = {Schloss Dagstuhl - Leibniz-Zentrum f{\"{u}}r Informatik},
  year         = {2022},
  url          = {https://doi.org/10.4230/OASIcs.SLATE.2022.10},
  doi          = {10.4230/OASICS.SLATE.2022.10},
}
```
|
|
|
# Acknowledgements |
|
|
|
This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306). |
|
However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE Project.
Neither the European Union nor the kNOwHATE Project can be held responsible for them.