noorrtamerr committed
Commit 69c7571 · verified · 1 Parent(s): d8a4238

Update README.md

Files changed (1): README.md (+72 -1)
README.md CHANGED
@@ -15,4 +15,75 @@ tags:
  - fine-tuned
  - arabert
  license: apache-2.0 # Add a license (choose one appropriate for your work)
- ---
+ ---
+
+ # EgBERT: Fine-Tuned AraBERT for Egyptian Arabic
+
+ ## Model Description
+
+ EgBERT is a fine-tuned version of the pre-trained AraBERT model, adapted for Egyptian Arabic. It was developed to enhance performance on tasks requiring understanding and generation of Egyptian-dialect text, with a focus on Masked Language Modeling (MLM). The fine-tuning process used a custom dataset of colloquial Egyptian Arabic, making the model particularly suited to casual and conversational text.
+
+ Key Features:
+ - Based on **[aubmindlab/bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)**.
+ - Fine-tuned specifically for **Egyptian Arabic**.
+ - Optimized for **Masked Language Modeling (MLM)** tasks.
+
+ ## Training Details
+
+ - **Dataset**:
+   - A custom dataset of Egyptian Arabic collected from conversational text sources.
+   - Preprocessed to include common colloquial phrases and to reduce noise in the data.
+ - **Training Setup** (a code sketch follows this list):
+   - Pre-trained model: `aubmindlab/bert-base-arabert`
+   - Fine-tuning performed for 3 epochs with a batch size of 16.
+   - Learning rate: 2e-5.
+   - MLM probability: 15%.
+ - **Tools**:
+   - **Hugging Face Transformers Library**
+   - **PyTorch**
+
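+ The exact training script is not bundled with this repository; the snippet below is a minimal sketch of how a run with the hyperparameters above could be set up using the Transformers `Trainer`. The corpus file name (`egyptian_arabic.txt`) and the tokenization settings are illustrative assumptions, not the original configuration.
+
+ ```python
+ # Minimal MLM fine-tuning sketch (illustrative, not the original training script).
+ # Assumes a plain-text corpus "egyptian_arabic.txt" with one sentence per line.
+ from datasets import load_dataset
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForMaskedLM,
+     DataCollatorForLanguageModeling,
+     Trainer,
+     TrainingArguments,
+ )
+
+ model_name = "aubmindlab/bert-base-arabert"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForMaskedLM.from_pretrained(model_name)
+
+ # Load and tokenize the Egyptian Arabic corpus (file name is a placeholder)
+ dataset = load_dataset("text", data_files={"train": "egyptian_arabic.txt"})
+
+ def tokenize(batch):
+     return tokenizer(batch["text"], truncation=True, max_length=128)
+
+ tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
+
+ # Dynamic masking: 15% of tokens are masked, as listed above
+ collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer, mlm=True, mlm_probability=0.15
+ )
+
+ # Hyperparameters from the Training Setup section
+ args = TrainingArguments(
+     output_dir="egbert-mlm",
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     learning_rate=2e-5,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=tokenized["train"],
+     data_collator=collator,
+ )
+ trainer.train()
+ ```
+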
+ ## Evaluation Results
+
+ ### Model Perplexity
+ - **Baseline Model**: 36.2377
+ - **Fine-Tuned Model**: 26.5359
+
+ The fine-tuned model achieves lower (better) perplexity than the baseline AraBERT model, indicating stronger performance on MLM tasks in Egyptian Arabic.
+
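+ The card does not specify how these perplexity values were computed; a common recipe is to exponentiate the average masked-LM cross-entropy loss on a held-out split, as sketched below (`trainer` and a tokenized `eval_dataset` are assumed to exist, e.g. from the training sketch above).
+
+ ```python
+ # Sketch: perplexity as exp(average MLM evaluation loss) on a held-out set.
+ import math
+
+ metrics = trainer.evaluate(eval_dataset=eval_dataset)  # eval_dataset is assumed
+ perplexity = math.exp(metrics["eval_loss"])
+ print(f"Perplexity: {perplexity:.4f}")
+ ```
+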
+ ## How to Use
+
+ Here’s an example of how to use EgBERT in your project:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Load the fine-tuned model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
+ model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")
+
+ # Input text with a masked token
+ text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."
+
+ # Tokenize and locate the [MASK] position
+ inputs = tokenizer(text, return_tensors="pt")
+ mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+ # Run the model without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = outputs.logits
+
+ # Decode the top 5 predictions for the [MASK] token
+ mask_token_logits = predictions[0, mask_token_index, :]
+ top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+ predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]
+
+ print(f"Predicted words: {predicted_words}")
+ ```
+
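+ For quick experiments, the `fill-mask` pipeline gives an equivalent, more compact way to run the same prediction:
+
+ ```python
+ from transformers import pipeline
+
+ # The pipeline wraps the tokenizer/model steps shown above
+ fill_mask = pipeline("fill-mask", model="noortamerr/EgBERT")
+ for prediction in fill_mask("الكورة في مصر [MASK] حاجة كل الناس بتتابعها."):
+     print(prediction["token_str"], prediction["score"])
+ ```
+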
+ ## Citation
+
+ ```bibtex
+ @misc{EgBERT,
+   author    = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
+   title     = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
+   year      = {2024},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/noortamerr/EgBERT}
+ }
+ ```