# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic
## Model Description
EgBERT is a fine-tuned version of the pre-trained AraBERT model designed for Egyptian Arabic. This model was developed to enhance performance on tasks requiring understanding and generation of Egyptian dialect text, with a focus on Masked Language Modeling (MLM). The fine-tuning process involved a custom dataset containing colloquial Egyptian Arabic, making the model particularly suited for casual and conversational text.
Key Features:
- Based on aubmindlab/bert-base-arabert.
- Fine-tuned specifically for Egyptian Arabic.
- Optimized for Masked Language Modeling (MLM) tasks.
## Training Details
- Dataset:
  - A custom dataset of Egyptian Arabic collected from conversational text sources.
  - Preprocessed to include common colloquial phrases and reduce noise in the data.
- Training Setup (a fine-tuning sketch follows this list):
  - Pre-trained model: aubmindlab/bert-base-arabert
  - Fine-tuning performed for 3 epochs with a batch size of 16.
  - Learning rate: 2e-5.
  - MLM probability: 15%.
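The training script itself is not included in this card, but the setup above can be approximated with the Hugging Face Trainer and DataCollatorForLanguageModeling. In the sketch below, the corpus file name (egyptian_arabic_corpus.txt), the 128-token max_length, and the output directory are illustrative placeholders, not details from the original run.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical path to the (unreleased) Egyptian Arabic corpus, one sentence per line
dataset = load_dataset("text", data_files={"train": "egyptian_arabic_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
model = AutoModelForMaskedLM.from_pretrained("aubmindlab/bert-base-arabert")

def tokenize(batch):
    # max_length=128 is an assumption; the card does not state the sequence length used
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, matching the MLM probability listed above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="egbert-mlm",          # placeholder output directory
    num_train_epochs=3,               # 3 epochs
    per_device_train_batch_size=16,   # batch size 16
    learning_rate=2e-5,               # learning rate 2e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```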
## Evaluation Results
### Model Perplexity
- Baseline Model: 36.2377
- Fine-Tuned Model: 26.5359
The fine-tuned model achieves a lower perplexity than the baseline AraBERT model, indicating better masked-token prediction on Egyptian Arabic text.
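The evaluation script is not part of this card. One common way to obtain such figures is to take the exponential of the average masked-token cross-entropy over a held-out set of Egyptian Arabic sentences. The helper below is a minimal sketch of that approach, assuming the same 15% masking probability; mlm_perplexity and its arguments are illustrative names, not the authors' released code.

```python
import math

import torch
from transformers import DataCollatorForLanguageModeling


def mlm_perplexity(model, tokenizer, texts, mlm_probability=0.15, seed=0):
    """Approximate MLM perplexity as exp(mean masked-token cross-entropy)."""
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability
    )
    torch.manual_seed(seed)  # fix the random masking so runs are comparable
    model.eval()

    losses = []
    for text in texts:
        enc = tokenizer(text, truncation=True, max_length=128)
        batch = collator([enc])          # pads, masks 15% of tokens, builds labels
        with torch.no_grad():
            loss = model(**batch).loss
        if not torch.isnan(loss):        # NaN means no token happened to be masked
            losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))
```

Calling this once with the baseline checkpoint and once with EgBERT on the same held-out sentences yields directly comparable perplexities.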
## How to Use
Here’s an example of how to use EgBERT in your project:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and locate the [MASK] position
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]
print(f"Predicted words: {predicted_words}")
```
## Citation
```bibtex
@misc{EgBERT,
  author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/noortamerr/EgBERT}
}
```