---
license: cc
language:
- pt
tags:
- Hate Speech
- kNOwHATE
- not-for-all-audiences
widget:
- text: >-
    as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é
    deixar de ser humano 😂😂
pipeline_tag: text-classification
datasets:
- knowhate/youtube-test
---
---
<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
<p style="text-align: center;"> This is the model card for HateBERTimbau-YouTube.
You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
</p>

---

# HateBERTimbau-YouTube

**HateBERTimbau-YouTube** is a transformer-based encoder model for identifying Hate Speech in Portuguese social media text. It is a fine-tuned version of the [HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau) model, retrained on a dataset of 23,912 YouTube comments focused specifically on Hate Speech.

## Model Description

- **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
- **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
- **Model type:** Transformer-based text classification model fine-tuned for Hate Speech detection in Portuguese social media text
- **Language:** Portuguese
- **Fine-tuned from model:** [knowhate/HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau)

# Uses

You can use this model directly with a pipeline for text classification:

```python
from transformers import pipeline
classifier = pipeline('text-classification', model='knowhate/HateBERTimbau-youtube')

classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")

# Example output:
# [{'label': 'Hate Speech', 'score': 0.9228119850158691}]
```
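
As a minimal sketch of how the pipeline output might be consumed, the helper below flags predictions whose label is `Hate Speech` and whose score clears a threshold. The function name `flag_hate_speech` and the default threshold of 0.5 are illustrative assumptions, not part of the released model; only the `Hate Speech` label and the output format are taken from the example above.

```python
from typing import Dict, List

HATE_LABEL = "Hate Speech"  # label string as it appears in the example output


def flag_hate_speech(predictions: List[Dict], threshold: float = 0.5) -> List[bool]:
    """Return one boolean per prediction: True when the predicted label is
    HATE_LABEL and its score is at least `threshold` (hypothetical default)."""
    return [
        p["label"] == HATE_LABEL and p["score"] >= threshold
        for p in predictions
    ]


# Output copied from the pipeline example, so no model download is needed here.
example_output = [{"label": "Hate Speech", "score": 0.9228119850158691}]
flags = flag_hate_speech(example_output)
strict_flags = flag_hate_speech(example_output, threshold=0.95)
```

A stricter threshold trades recall for precision; the appropriate value depends on the application and is not specified by the source.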

The model can also be fine-tuned for a specific task or dataset:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-youtube")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-youtube")
dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    # Column names follow the original example; adjust them to the text
    # column(s) of the dataset you actually load.
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Newer transformers releases rename this argument to eval_strategy.
training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```

# Training

## Data

23,912 YouTube comments associated with offensive content were used to fine-tune the base model.

## Training Hyperparameters

- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5 with Adam optimizer
- Maximum Sequence Length: 350 tokens
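
One way to express these settings with the `transformers` API is sketched below. That training used `Trainer` is an assumption; the source only lists the values. The names `HYPERPARAMETERS`, `MAX_SEQ_LENGTH`, `build_training_args`, and `tokenizer_kwargs` are hypothetical. Note also that `Trainer` defaults to AdamW rather than plain Adam, and that mapping the batch size to `per_device_train_batch_size` assumes a single device.

```python
# Values taken from the hyperparameter list above; the names are illustrative.
HYPERPARAMETERS = {
    "per_device_train_batch_size": 32,  # assumes training on a single device
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
}
MAX_SEQ_LENGTH = 350  # applied at tokenization time, not in TrainingArguments


def build_training_args(output_dir: str = "hatebertimbau"):
    """Build TrainingArguments from the listed values. The import is lazy so
    the constants can be inspected without transformers installed."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)


def tokenizer_kwargs() -> dict:
    """Keyword arguments that truncate and pad inputs to MAX_SEQ_LENGTH tokens."""
    return {"truncation": True, "padding": "max_length", "max_length": MAX_SEQ_LENGTH}
```

Calling `tokenizer(texts, **tokenizer_kwargs())` would then cap every input at 350 tokens.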

# Testing

## Data

The model was evaluated on the [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test) dataset.

## Results

| Dataset                   | Precision | Recall | F1-score  |
|:--------------------------|:----------|:-------|:----------|
| **knowhate/youtube-test** | 0.856     | 0.892  | **0.874** |
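
As a quick consistency check, the F1-score is conventionally the harmonic mean of precision and recall. The sketch below assumes the reported figures follow that definition for a single class, since the source does not state the averaging scheme; the function name `f1_from_precision_recall` is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the results table.
f1 = f1_from_precision_recall(0.856, 0.892)
print(round(f1, 3))  # 0.874
```

Under that assumption, the computed value agrees with the reported F1-score to three decimal places.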

# BibTeX Citation

Currently in peer review.

```latex
@article{

}
```

# Acknowledgements

This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE Project.
Neither the European Union nor the kNOwHATE Project can be held responsible for them.