riotu-lab committed on
Commit
2d3b862
·
1 Parent(s): 742037c

update readme

Files changed (1)
  1. README.md +35 -58
README.md CHANGED
@@ -7,73 +7,50 @@ tags:
7
  - 'arabic '
8
  - text-generation
9
  ---
10
- # Model Description
11
-
12
- ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
13
-
14
-
15
- | Specification | Value |
16
- |-----------------------|----------|
17
- | Model Name | ArabianGPT|
18
- | Architecture |GPT-2 |
19
- | Layers | 12 |
20
- | MAL (Model Attention Layers) | 12 |
21
- | Model Size | 134M |
22
- | Context Window Size | 768 |
23
-
24
-
25
- # Training
26
- * Dataset: Abu Elkhiar Corpus
27
- * Size: 15.5 GB
28
- * Number of Words: 237,814,541
29
- * Number of Tokens: 1,752,421,071
30
-
31
-
32
- # Compute
33
-
34
- | Model | Hardware | Num of Examples (seq len = 768) | Batch Size | Num of Steps | Time (in days) | Loss |
35
- |------------------|---------------|---------------------------------|------------|--------------|----------------|----------------|
36
- | ArabianGPT-base | NVIDIA A100 | 7.5M | 512 | 313.5K | 3 | 3.97 |
37
-
38
-
39
- > The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
40
-
41
- # Tokenizer
42
- Type: Custom-trained SentencePiece tokenizer
43
- Vocabulary Size: 64K
44
-
45
- > We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language.
46
-
47
- More info about AraNizer can be found [here](https://github.com/omarnj-lab/aranizer/tree/main)
48
-
49
-
50
- # Usage
51
- ArabianGPT can be used for text generation tasks in Arabic.
52
-
53
- ### How to use
54
-
55
- Here is how to use this model for Arabic text generation with the Transformers pipeline:
56
-
57
  ```python
58
  from transformers import pipeline
59
 
60
- pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base" , max_new_tokens = 512)
61
-
62
  text = ''  # replace with an Arabic prompt
63
-
64
  pipe(text)
65
  ```
66
 
67
- # Limitations
68
 
69
- > As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.
 
70
 
71
- # Ethical Considerations
72
- We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.
73
 
 
74
 
75
- # Acknowledgments
76
- > We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.
77
 
78
- # Contact
79
- For inquiries regarding ArabianGPT, please contact riotu@psu.edu.sa.
 
7
  - 'arabic '
8
  - text-generation
9
  ---
10
+ # ArabianGPT Model Overview
11
+
12
+ ## Introduction
13
+ ArabianGPT is a GPT-2 based model, custom-trained for the Arabic language, as part of the ArabianLLM initiatives at Prince Sultan University's Robotics and Internet of Things Lab.
14
+
15
+ ## Key Features
16
+ - **Architecture**: GPT-2
17
+ - **Model Size**: 134 million parameters
18
+ - **Layers**: 12
19
+ - **Model Attention Layers (MAL)**: 12
20
+ - **Context Window Size**: 768 tokens
21
+
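The 134M figure can be roughly cross-checked from these dimensions. A back-of-envelope sketch, assuming the GPT-2 base hidden size of 768 and a 64K-entry vocabulary (the AraNizer tokenizer); neither value is restated in the list above:

```python
# Rough GPT-2-style parameter count from the specs above.
# Assumed (not stated in the list): hidden size d = 768 (GPT-2 base
# default) and a 64K-entry vocabulary.
d = 768           # hidden size
n_layers = 12
vocab = 64_000
context = 768     # one positional embedding per context slot

embeddings = vocab * d + context * d   # token + position embeddings
per_layer = 12 * d * d                 # ~4*d^2 attention + 8*d^2 MLP weights
total = embeddings + n_layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # lands near the quoted 134M
```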
22
+ ## Training
23
+ - **Dataset**: Abu Elkhiar Corpus
24
+ - **Data Size**: 15.5 GB
25
+ - **Words**: 237.8 million
26
+ - **Tokens**: Over 1.75 billion
27
+ - **Hardware**: NVIDIA A100
28
+ - **Training Scale**: 7.5 million examples
29
+ - **Training Duration**: 3 days
30
+ - **Performance**: Final loss of 3.97
31
+
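These figures can be sanity-checked against one another; a quick back-of-envelope calculation using only the numbers quoted above:

```python
# Consistency check on the training figures quoted above.
words = 237_814_541
tokens = 1_752_421_071
examples = 7_500_000       # training sequences
seq_len = 768              # context window

tokens_per_word = tokens / words       # tokenizer fertility on this corpus
tokens_consumed = examples * seq_len   # if sequences are fully packed
epochs = tokens_consumed / tokens      # approximate passes over the corpus

print(f"~{tokens_per_word:.1f} tokens/word, ~{epochs:.1f} epochs")
```

If the 7.5M examples are non-overlapping packed sequences, this corresponds to roughly three passes over the corpus.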
32
+ ## Role in ArabianLLM Initiatives
33
+ ArabianGPT 0.1B contributes to advancing Arabic language processing, addressing challenges specific to Arabic morphology and dialects.
34
+
35
+ ## Usage
36
+ Suitable for Arabic text generation tasks. Example usage with the Transformers `text-generation` pipeline:
37
  ```python
38
  from transformers import pipeline
39
 
40
+ pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)
 
41
  text = ''  # replace with an Arabic prompt
 
42
  pipe(text)
43
  ```
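The minimal call above uses the pipeline's default decoding settings; the standard `transformers` generation keyword arguments can be passed through the pipeline call. A sketch with sampling enabled (the Arabic prompt is only an illustrative placeholder):

```python
from transformers import pipeline

# Same checkpoint as above; max_new_tokens kept short for a quick demo.
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base")

prompt = "اللغة العربية"  # illustrative placeholder prompt
outputs = pipe(
    prompt,
    max_new_tokens=64,
    do_sample=True,   # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
)
print(outputs[0]["generated_text"])  # continuation includes the prompt
```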
44
 
45
+ ## Limitations and Ethical Considerations
46
 
47
+ - The model may have context understanding or text generation limitations in certain scenarios.
48
+ - Emphasis on ethical use to prevent misinformation or harmful content propagation.
49
 
50
+ ## Acknowledgments
 
51
 
52
+ Special thanks to Prince Sultan University, particularly the Robotics and Internet of Things Lab.
53
 
54
+ ## Contact Information
 
55
 
56
+ For inquiries: [riotu@psu.edu.sa](mailto:riotu@psu.edu.sa).