# Model Card for IJELID (Indonesian-Javanese-English Language IDentification) Model

## Model Description
This model is designed for the task of language identification, specifically focusing on code-mixed language data from Indonesian, Javanese, and English Twitter posts. It can identify whether a given text is in Indonesian (ID), Javanese (JV), English (EN), a mix of Indonesian and English (MIX_ID_EN), a mix of Indonesian and Javanese (MIX_ID_JV), a mix of Javanese and English (MIX_JV_EN), or other (OTH).
## Intended Use
This model is intended for academic researchers and practitioners who need to identify and analyze the language of text data, particularly in the context of social media where code-mixing is common.
## Training Data
This model is a fine-tuned version of IndoJavE-IndoBERTweet on a dataset of code-mixed Indonesian-Javanese-English Twitter data. Further details and access to the dataset can be found here.
### Hyperparameter Search Values and Ranges for Fine-Tuning

We conducted a hyperparameter search using Optuna over the following values and ranges:
| Hyperparameter | Values or Range |
|---|---|
| Number of training epochs | 2 to 10 |
| Learning rate | 1e-4, 3e-4, 2e-5, 3e-5, 5e-5 |
| Per-device batch size | 8, 16, 32, 64 |
| Weight decay | 4e-5 to 0.01 |
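As a rough illustration, the search space above could be expressed as an Optuna-style search-space function of the kind accepted by `Trainer.hyperparameter_search`. This is a hedged sketch, not the original training script; the function and parameter names are assumptions chosen to mirror the `transformers` API.

```python
# Illustrative sketch (not the original training code): the search space
# from the table above, written as an Optuna-style hp_space function.
def hp_space(trial):
    return {
        # Number of training epochs: integer range 2 to 10
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        # Learning rate: discrete candidate values
        "learning_rate": trial.suggest_categorical(
            "learning_rate", [1e-4, 3e-4, 2e-5, 3e-5, 5e-5]
        ),
        # Per-device batch size: discrete candidate values
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32, 64]
        ),
        # Weight decay: continuous range 4e-5 to 0.01
        "weight_decay": trial.suggest_float("weight_decay", 4e-5, 0.01),
    }
```

A function like this would typically be passed to `Trainer.hyperparameter_search(hp_space=hp_space, backend="optuna", ...)`.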
## Training Procedure
The model was fine-tuned using 4 NVIDIA Tesla V100-SXM2-32GB GPUs provided by Universitas Islam Indonesia. The fine-tuning process was conducted using the following best hyperparameter values:
```json
{
  "num_epochs": 8,
  "learning_rate": 2e-05,
  "per_device_train_batch_size": 8,
  "per_device_eval_batch_size": 64,
  "weight_decay": 0.0017197732539373108
}
```
## Evaluation Results
The model achieved the following scores on the test set:
- Macro Average Precision: 93.91%
- Macro Average Recall: 94.51%
- Macro Average F1-Score: 94.20%
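For readers unfamiliar with macro averaging: each of the seven labels contributes equally, regardless of how many test examples it has. The sketch below shows the computation on a tiny toy example with made-up labels and predictions, not the actual IJELID test set.

```python
# Toy illustration of macro-averaged precision, recall, and F1:
# per-label scores are computed first, then averaged with equal weight.
def macro_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because rare labels (e.g. MIX_JV_EN) weigh as much as frequent ones, macro scores are a stricter measure than accuracy on imbalanced code-mixed data.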
## Ethical Considerations
The dataset used for training includes user-generated content from Twitter, which has been anonymized and is compliant with ethical standards for research. Personal information has been removed to ensure privacy.
## Caveats and Recommendations
Performance may vary on text data that significantly differs from the Twitter data it was trained on. It is recommended to evaluate the model on specific data of interest before using it for critical applications.
## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "fathan/ijelid-ft-indojave-indobertweet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Replace the following text with your own input
text = "Productnya bagus bgt guys, nek bales chat cepet tur pelayanane apik."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map the highest-scoring logit back to its label (e.g. MIX_ID_JV)
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
```
## Citation

```bibtex
@misc{ijelid-ft-indojave-indobertweet,
  author       = {Ahmad Fathan Hidayatullah},
  title        = {Indonesian-Javanese-English Language Identification (IJELID) using IndoJavE-IndoBERTweet pre-trained model},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/fathan/ijelid-ft-indojave-indobertweet}}
}
```