# Model Card for IJELID (Indonesian-Javanese-English Language IDentification) Model

## Model Description
This model is designed for the task of language identification, specifically focusing on code-mixed language data from Indonesian, Javanese, and English Twitter posts. It can identify whether a given text is in Indonesian (ID), Javanese (JV), English (EN), a mix of Indonesian and English (MIX_ID_EN), a mix of Indonesian and Javanese (MIX_ID_JV), a mix of Javanese and English (MIX_JV_EN), or other (OTH).
## Intended Use
This model is intended for academic researchers and practitioners who need to identify and analyze the language of text data, particularly in the context of social media where code-mixing is common.
## Training Data
This model is a fine-tuned version of IndoJavE-IndoBERTweet on a dataset of code-mixed Indonesian-Javanese-English Twitter data. Further details and access to the dataset can be found here.
### Hyperparameter Search Values and Ranges for Fine-Tuning

We conducted a hyperparameter search using Optuna over the following values and ranges:
| Hyperparameter | Values or Range |
|---|---|
| Number of training epochs | 2 to 10 |
| Learning rate | 1e-4, 3e-4, 2e-5, 3e-5, 5e-5 |
| Per-device batch size | 8, 16, 32, 64 |
| Weight decay | 4e-5 to 0.01 |
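As a rough illustration, the search space above could be expressed as an Optuna-style search-space function of the kind accepted by `Trainer.hyperparameter_search`. This is a hedged sketch, not the original training script; the function and parameter names are assumptions chosen to mirror the `transformers` API.

```python
# Illustrative sketch (not the original training code): the search space
# from the table above, written as an Optuna-style hp_space function.
def hp_space(trial):
    return {
        # Number of training epochs: integer range 2 to 10
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        # Learning rate: discrete candidate values
        "learning_rate": trial.suggest_categorical(
            "learning_rate", [1e-4, 3e-4, 2e-5, 3e-5, 5e-5]
        ),
        # Per-device batch size: discrete candidate values
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32, 64]
        ),
        # Weight decay: continuous range 4e-5 to 0.01
        "weight_decay": trial.suggest_float("weight_decay", 4e-5, 0.01),
    }
```

A function like this would typically be passed to `Trainer.hyperparameter_search(hp_space=hp_space, backend="optuna", ...)`.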
## Training Procedure
The model was fine-tuned using 4 NVIDIA Tesla V100-SXM2-32GB GPUs provided by Universitas Islam Indonesia. The fine-tuning process was conducted using the following best hyperparameter values:
```json
{
  "num_epochs": 8,
  "learning_rate": 2e-05,
  "per_device_train_batch_size": 8,
  "per_device_eval_batch_size": 64,
  "weight_decay": 0.0017197732539373108
}
```
## Evaluation Results
The model achieved the following scores on the test set:
- Macro Average Precision: 93.91%
- Macro Average Recall: 94.51%
- Macro Average F1-Score: 94.20%
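For readers unfamiliar with macro averaging: each of the seven labels contributes equally, regardless of how many test examples it has. The sketch below shows the computation on a tiny toy example with made-up labels and predictions, not the actual IJELID test set.

```python
# Toy illustration of macro-averaged precision, recall, and F1:
# per-label scores are computed first, then averaged with equal weight.
def macro_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because rare labels (e.g. MIX_JV_EN) weigh as much as frequent ones, macro scores are a stricter measure than accuracy on imbalanced code-mixed data.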
## Ethical Considerations
The dataset used for training includes user-generated content from Twitter, which has been anonymized and is compliant with ethical standards for research. Personal information has been removed to ensure privacy.
## Caveats and Recommendations
Performance may vary on text data that significantly differs from the Twitter data it was trained on. It is recommended to evaluate the model on specific data of interest before using it for critical applications.
## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "fathan/ijelid-ft-indojave-indobertweet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Replace the following text with your own input
text = "Productnya bagus bgt guys, nek bales chat cepet tur pelayanane apik."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map the highest-scoring logit back to its label (e.g. MIX_ID_JV)
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
```
## Citation

```bibtex
@misc{ijelid-ft-indojave-indobertweet,
  author       = {Ahmad Fathan Hidayatullah},
  title        = {Indonesian-Javanese-English Language Identification (IJELID) using IndoJavE-IndoBERTweet pre-trained model},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/fathan/ijelid-ft-indojave-indobertweet}}
}
```