# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic
## Model Description
EgBERT is a fine-tuned version of the pre-trained AraBERT model designed for Egyptian Arabic. This model was developed to enhance performance on tasks requiring understanding and generation of Egyptian dialect text, with a focus on Masked Language Modeling (MLM). The fine-tuning process involved a custom dataset containing colloquial Egyptian Arabic, making the model particularly suited for casual and conversational text.
Key Features:
- Based on aubmindlab/bert-base-arabert.
- Fine-tuned specifically for Egyptian Arabic.
- Optimized for Masked Language Modeling (MLM) tasks.
## Training Details
- Dataset:
  - A custom dataset of Egyptian Arabic collected from conversational text sources.
  - Preprocessed to include common colloquial phrases and reduce noise in the data.
- Training Setup (a fine-tuning sketch follows this list):
  - Pre-trained model: aubmindlab/bert-base-arabert
  - Fine-tuning performed for 3 epochs with a batch size of 16.
  - Learning rate: 2e-5.
  - MLM probability: 15%.
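The training script itself is not included in this card, but the setup above can be approximated with the Hugging Face Trainer and DataCollatorForLanguageModeling. In the sketch below, the corpus file name (egyptian_arabic_corpus.txt), the 128-token max_length, and the output directory are illustrative placeholders, not details from the original run.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical path to the (unreleased) Egyptian Arabic corpus, one sentence per line
dataset = load_dataset("text", data_files={"train": "egyptian_arabic_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
model = AutoModelForMaskedLM.from_pretrained("aubmindlab/bert-base-arabert")

def tokenize(batch):
    # max_length=128 is an assumption; the card does not state the sequence length used
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, matching the MLM probability listed above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="egbert-mlm",          # placeholder output directory
    num_train_epochs=3,               # 3 epochs
    per_device_train_batch_size=16,   # batch size 16
    learning_rate=2e-5,               # learning rate 2e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```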
## Evaluation Results
### Model Perplexity
- Baseline Model: 36.2377
- Fine-Tuned Model: 26.5359
The fine-tuned model achieves a lower perplexity than the baseline AraBERT model, indicating better masked-token prediction on Egyptian Arabic text.
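The evaluation script is not part of this card. One common way to obtain such figures is to take the exponential of the average masked-token cross-entropy over a held-out set of Egyptian Arabic sentences. The helper below is a minimal sketch of that approach, assuming the same 15% masking probability; mlm_perplexity and its arguments are illustrative names, not the authors' released code.

```python
import math

import torch
from transformers import DataCollatorForLanguageModeling


def mlm_perplexity(model, tokenizer, texts, mlm_probability=0.15, seed=0):
    """Approximate MLM perplexity as exp(mean masked-token cross-entropy)."""
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability
    )
    torch.manual_seed(seed)  # fix the random masking so runs are comparable
    model.eval()

    losses = []
    for text in texts:
        enc = tokenizer(text, truncation=True, max_length=128)
        batch = collator([enc])          # pads, masks 15% of tokens, builds labels
        with torch.no_grad():
            loss = model(**batch).loss
        if not torch.isnan(loss):        # NaN means no token happened to be masked
            losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))
```

Calling this once with the baseline checkpoint and once with EgBERT on the same held-out sentences yields directly comparable perplexities.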
## How to Use
Here’s an example of how to use EgBERT in your project:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")

# Input text with a masked token
text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."

# Tokenize and locate the [MASK] position
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Decode the top 5 predictions for the [MASK] token
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]
print(f"Predicted words: {predicted_words}")
```
## Citation
```bibtex
@misc{EgBERT,
  author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
  title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/noortamerr/EgBERT}
}
```