	---
	license: cc
	language:
	- pt
	tags:
	- Hate Speech
	- kNOwHATE
	- not-for-all-audiences
	widget:
	- text: Os [MASK] são todos uns animais, deviam voltar para a sua terra.
	---
	---
	<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
	<p style="text-align: center;">    This is the model card for HateBERTimbau.
	You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
	</p>

	---

	# HateBERTimbau

	HateBERTimbau is a foundation large language model for European Portuguese, specialized in hate speech content.

	It is an encoder of the BERT family, based on the Transformer neural architecture.
	It was built on the [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model and retrained on a dataset of 229,103 tweets specifically focused on potential hate speech.

	## Model Description

	- Developed by: [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
	- Funded by: [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
	- Model type: Transformer-based model retrained for Hate Speech in Portuguese social media text
	- Language: Portuguese
	- Retrained from model: [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)

	Several models were developed by fine-tuning the base HateBERTimbau model for hate speech detection. They are listed in the table below:

	| HateBERTimbau's Family of Models |
	|---------------------------------------------------------------------------------------------------------|
	| [HateBERTimbau YouTube](https://huggingface.co/knowhate/HateBERTimbau-youtube) |
	| [HateBERTimbau Twitter](https://huggingface.co/knowhate/HateBERTimbau-twitter) |
	| [HateBERTimbau YouTube+Twitter](https://huggingface.co/knowhate/HateBERTimbau-yt-tt) |

	# Uses

	You can use this model directly with a pipeline for masked language modeling:

	```python
	from transformers import pipeline

	# Load the fill-mask pipeline with HateBERTimbau
	unmasker = pipeline('fill-mask', model='knowhate/HateBERTimbau')

	# Predict the most likely tokens for the [MASK] position
	unmasker("Os [MASK] são todos uns animais, deviam voltar para a sua terra.")

	# Output:
	# [{'score': 0.6771652698516846,
	#   'token': 12714,
	#   'token_str': 'africanos',
	#   'sequence': 'Os africanos são todos uns animais, deviam voltar para a sua terra.'},
	#  {'score': 0.08679857850074768,
	#   'token': 15389,
	#   'token_str': 'homossexuais',
	#   'sequence': 'Os homossexuais são todos uns animais, deviam voltar para a sua terra.'},
	#  {'score': 0.03806231543421745,
	#   'token': 4966,
	#   'token_str': 'portugueses',
	#   'sequence': 'Os portugueses são todos uns animais, deviam voltar para a sua terra.'},
	#  {'score': 0.035253893584012985,
	#   'token': 16773,
	#   'token_str': 'Portugueses',
	#   'sequence': 'Os Portugueses são todos uns animais, deviam voltar para a sua terra.'},
	#  {'score': 0.023521048948168755,
	#   'token': 8618,
	#   'token_str': 'brancos',
	#   'sequence': 'Os brancos são todos uns animais, deviam voltar para a sua terra.'}]
	```
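The pipeline scores case variants as separate tokens: in the output above, `portugueses` and `Portugueses` appear as two independent predictions. A minimal sketch of merging such variants, assuming the list-of-dicts structure shown above. The helper name `merge_case_variants` is hypothetical and not part of `transformers`, and the scores are rounded copies of the example output:

```python
from collections import defaultdict

def merge_case_variants(predictions):
    """Sum fill-mask scores of tokens that differ only by letter case."""
    totals = defaultdict(float)
    for pred in predictions:
        totals[pred["token_str"].lower()] += pred["score"]
    # Sort from most to least likely after merging
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Scores copied (rounded to 4 decimals) from the example output above
predictions = [
    {"token_str": "africanos", "score": 0.6772},
    {"token_str": "homossexuais", "score": 0.0868},
    {"token_str": "portugueses", "score": 0.0381},
    {"token_str": "Portugueses", "score": 0.0353},
    {"token_str": "brancos", "score": 0.0235},
]

merged = merge_case_variants(predictions)
for token, score in merged:
    print(f"{token}: {score:.4f}")
```

In a real script, the list returned by `unmasker(...)` could be passed to the helper directly, since it only reads the `token_str` and `score` keys.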

	Alternatively, the model can be fine-tuned for a specific task or dataset:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
	from datasets import load_dataset

	# Load the tokenizer and the encoder with a new classification head
	tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
	model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau")
	dataset = load_dataset("knowhate/youtube-train")

	def tokenize_function(examples):
	    # Adjust the column names to match the text column(s) of your dataset
	    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

	tokenized_datasets = dataset.map(tokenize_function, batched=True)

	# Evaluate after every epoch (newer transformers releases rename this argument to eval_strategy)
	training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")
	trainer = Trainer(
	    model=model,
	    args=training_args,
	    train_dataset=tokenized_datasets["train"],
	    eval_dataset=tokenized_datasets["validation"],  # the dataset must provide a "validation" split
	)

	trainer.train()

	```
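The `Trainer` above reports only the evaluation loss unless it is given a metric function. Below is a minimal, dependency-free sketch of one that returns precision, recall and F1 for the positive class. It assumes binary labels where `1` marks hate speech; that label convention is an assumption, not something this card states. `Trainer` accepts such a function through its `compute_metrics` argument and calls it with an object that unpacks into `(logits, labels)`:

```python
def compute_metrics(eval_pred):
    """Precision, recall and F1 for the positive class (label 1, assumed to mean hate speech)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]

    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy check with hand-written logits (not real model output)
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

To use it during fine-tuning, pass `compute_metrics=compute_metrics` when constructing the `Trainer`.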

	# Training

	## Data

	229,103 tweets associated with offensive content were used to retrain the base model.

	## Training Hyperparameters

	- Batch Size: 4 samples
	- Epochs: 100
	- Learning Rate: 5e-5 with Adam optimizer
	- Maximum Sequence Length: 512 sentence pieces
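
As a rough illustration of what these settings imply, the sketch below estimates the number of optimizer steps. It assumes that all 229,103 tweets formed the training set and that training ran on a single device without gradient accumulation. The card confirms none of these, so treat the result as an order-of-magnitude estimate. The dictionary keys are the corresponding `TrainingArguments` parameter names in `transformers`.

```python
import math

# Values taken from this model card
num_tweets = 229_103
batch_size = 4
epochs = 100

# Assumption: one device, no gradient accumulation, every tweet used for training
steps_per_epoch = math.ceil(num_tweets / batch_size)
total_steps = steps_per_epoch * epochs

# The same settings expressed as TrainingArguments keyword arguments
training_kwargs = {
    "per_device_train_batch_size": batch_size,
    "num_train_epochs": epochs,
    "learning_rate": 5e-5,
}

print(steps_per_epoch)  # 57276
print(total_steps)      # 5727600
```

The maximum sequence length of 512 is applied at tokenization time (for example `max_length=512` together with `truncation=True`), not through `TrainingArguments`.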

	# Testing

	## Data

	We used two datasets for testing: one of [YouTube comments](https://huggingface.co/datasets/knowhate/youtube-test) and one of [tweets](https://huggingface.co/datasets/knowhate/twitter-test).

	## Hate Speech Classification Results (with no fine-tuning)

	| Dataset | Precision | Recall | F1-score |
	|:----------------|:-----------|:----------|:-------------|
	| YouTube | 0.928 | 0.108 | 0.193 |
	| Twitter | 0.686 | 0.211 | 0.323 |
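
The F1-score is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. The short check below recomputes it from the precision and recall values in the table. It illustrates that the low F1-scores are driven by low recall, since precision is comparatively high on both datasets.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above
results = {
    "YouTube": (0.928, 0.108),
    "Twitter": (0.686, 0.211),
}

for dataset, (precision, recall) in results.items():
    print(f"{dataset}: F1 = {f1_score(precision, recall):.3f}")
# YouTube: F1 = 0.193
# Twitter: F1 = 0.323
```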

	# BibTeX Citation

	```bibtex
	@inproceedings{DBLP:conf/slate/MatosS00B22,
	  author    = {Bernardo Cunha Matos and
	               Raquel Bento Santos and
	               Paula Carvalho and
	               Ricardo Ribeiro and
	               Fernando Batista},
	  editor    = {Jo{\~{a}}o Cordeiro and
	               Maria Jo{\~{a}}o Pereira and
	               Nuno F. Rodrigues and
	               Sebasti{\~{a}}o Pais},
	  title     = {Comparing Different Approaches for Detecting Hate Speech in Online
	               Portuguese Comments},
	  booktitle = {11th Symposium on Languages, Applications and Technologies, {SLATE}
	               2022, July 14-15, 2022, Universidade da Beira Interior, Covilh{\~{a}},
	               Portugal},
	  series    = {OASIcs},
	  volume    = {104},
	  pages     = {10:1--10:12},
	  publisher = {Schloss Dagstuhl - Leibniz-Zentrum f{\"{u}}r Informatik},
	  year      = {2022},
	  url       = {https://doi.org/10.4230/OASIcs.SLATE.2022.10},
	  doi       = {10.4230/OASICS.SLATE.2022.10},
	}
	```

	# Acknowledgements

	This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
	However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE project.
	Neither the European Union nor the kNOwHATE project can be held responsible for them.