IndoBERTweet-IdentityAttack

Model Description

IndoBERTweet fine-tuned on the IndoToxic2024 dataset, reaching an accuracy of 0.89 and a macro-F1 of 0.78. Both scores were obtained through stratified 10-fold cross-validation.
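The gap between accuracy and macro-F1 is typical when one class is much rarer than the other, since macro-F1 weights every class equally. The following is a minimal pure-Python sketch of how the two metrics and a stratified fold assignment can be computed. It is not the authors' evaluation code: the helper names (`accuracy`, `macro_f1`, `stratified_folds`) and the toy labels are hypothetical, and the fold assignment is a simple round-robin without shuffling.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def stratified_folds(labels, k):
    """Assign example indices to k folds so each fold keeps similar class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's examples out to the folds in turn (no shuffling here).
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Made-up labels for illustration only: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
print(stratified_folds(y_true, 2))
```

On this toy data a single missed positive leaves accuracy high while pulling macro-F1 down noticeably, which is the same pattern as the scores reported above.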

Supported Tokenizer

  • indolem/indobertweet-base-uncased

Example Code

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Specify the model and tokenizer names
model_name = "Exqrch/IndoBERTweet-IdentityAttack"
tokenizer_name = "indolem/indobertweet-base-uncased"

# Load the fine-tuned model and the base tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Tokenize the input and run inference without tracking gradients
text = "selamat pagi semua!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted class label
predicted_class = torch.argmax(logits, dim=-1).item()
print(predicted_class)
--- Output ---
> 0
--- End of Output ---
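The example above prints only the predicted class index. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below shows the computation in plain Python so it runs without downloading the model. The logit values are hypothetical, and the two-class output shape is an assumption based on the 0/1 predictions shown in this card. On a real tensor, `torch.softmax(logits, dim=-1)` performs the same operation.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input, e.g. what logits[0].tolist() might return.
example_logits = [2.0, -1.0]

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(probs[predicted_class], 3))
```

A probability close to 0.5 suggests the model is uncertain about that input, which can be useful when deciding whether to route a text to manual review.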

Limitations

The model was trained only on Indonesian text. No information is available on its performance with code-switched text.

Sample Output

Model name: Exqrch/IndoBERTweet-IdentityAttack
Text 1: ayolah, jaga kebersihan bersama
Prediction: 0
Text 2: dia itu loh, udah hitam, dengkil lagi
Prediction: 1

Citation

If you use this model, please cite:

@article{susanto2024indotoxic2024,
      title={IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language}, 
      author={Lucky Susanto and Musa Izzanardi Wijanarko and Prasetia Anugrah Pratama and Traci Hong and Ika Idris and Alham Fikri Aji and Derry Wijaya},
      year={2024},
      eprint={2406.19349},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19349}, 
}
Model size: 111M params (Safetensors, tensor type F32)