RaymondAISG committed
Commit
e41fd97
1 Parent(s): b4220f3

Update README.md

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -7,10 +7,10 @@ language:
  - vi
  license: llama3
  ---
- # LLaMA3 8B SEA-LIONv2
+ # Llama3 8B SEA-LIONv2

  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
- This is the card for the LLaMA3 8B SEA-LIONv2 base model which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
+ This is the card for the Llama3 8B SEA-LIONv2 base model which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.

  SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

@@ -19,18 +19,18 @@ SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

  ### Model Description

- The continued pre-training data for LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.
+ The continued pre-training data for Llama3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
  - **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- - **License:** [LLaMA3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)
+ - **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

  For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

  ### Benchmark Performance
- We evaluated LLaMA3 8B SEA-LIONv2 base model on general language capabilities.
+ We evaluated Llama3 8B SEA-LIONv2 base model on general language capabilities.

  #### General Language Capabilities
  For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
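The hunk above notes that the model reuses the default Meta-Llama-3-8B-Instruct tokenizer. As a minimal sketch (assuming the `transformers` library and gated access to the upstream Meta checkpoint; the SEA-LIONv2 repo id itself is not shown here), that tokenizer can be loaded directly to see how SEA-language text is segmented:

```python
# Sketch only: load the upstream tokenizer named in this card and count tokens
# for a short Indonesian sentence. Access is gated behind the Llama 3 license.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sample = "Selamat pagi, apa kabar?"  # Indonesian, one of the card's listed languages
ids = tokenizer(sample)["input_ids"]
print(len(ids), ids)
```

This is also the tokenizer behind the token counts in the Data section below ("counted using Llama3 tokenizer").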
@@ -60,7 +60,7 @@ We also evaluated the model on English capabilities using tasks from the Open LL

  ### Data

- LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:
+ Llama3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
  |---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
@@ -81,16 +81,16 @@ LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the f
  | Wiki* - Vietnamese | 0.31 | 4 | 1.24 | 2.58 |

  Note:
- - All token counts are counted using LLaMA3 tokenizer
+ - All token counts are counted using Llama3 tokenizer
  - wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

  ### Infrastructure

- LLaMA3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
+ Llama3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
  on the following hardware:

- | Training Details | LLaMA3 8B SEA-LIONv2 |
+ | Training Details | Llama3 8B SEA-LIONv2 |
  |----------------------|:--------------------:|
  | AWS EC2 p5d.24xlarge | 8 instances |
  | Nvidia H100 80GB GPU | 64 |
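For reading the data table: each row's Total Tokens is its Unique Tokens times the Multiplier, and the Percentage is that total over the 48B-token mix. For example, the Wiki* - Vietnamese row gives 0.31 B × 4 = 1.24 B tokens, and 1.24 / 48 ≈ 2.58%.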
@@ -99,7 +99,7 @@ on the following hardware:

  ### Configuration

- | HyperParameter | LLaMA3 8B SEA-LIONv2 |
+ | HyperParameter | Llama3 8B SEA-LIONv2 |
  |-------------------|:--------------------:|
  | Precision | bfloat16 |
  | Optimizer | decoupled_adamw |
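To make the Infrastructure and Configuration rows concrete, here is an illustrative Composer-style sketch of continued pre-training with bfloat16 precision and decoupled AdamW. It is not AI Singapore's actual training script: the dataset, batch size, learning rate, and duration below are placeholders, and real runs stream the 48B-token corpus described under Data across the 64 H100 GPUs listed above.

```python
# Illustrative sketch only (placeholders, not the real SEA-LIONv2 recipe):
# continued pre-training of the Llama 3 8B Instruct checkpoint with
# MosaicML Composer, bfloat16 precision, and decoupled AdamW.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import HuggingFaceModel
from composer.optim import DecoupledAdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # starting checkpoint named in this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

# Placeholder data: random token ids standing in for the 48B-token corpus.
ids = torch.randint(0, tokenizer.vocab_size, (8, 512))

def collate(batch):
    input_ids = torch.stack([row[0] for row in batch])
    return {"input_ids": input_ids, "labels": input_ids}  # causal LM loss

train_loader = DataLoader(TensorDataset(ids), batch_size=2, collate_fn=collate)

optimizer = DecoupledAdamW(model.parameters(), lr=1e-5)  # placeholder LR

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    optimizers=optimizer,
    max_duration="10ba",   # placeholder; not the real token budget
    precision="amp_bf16",  # bfloat16, per the Configuration table
)
trainer.fit()
```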
 