---
language:
- en
- id
- ta
- th
- vi
license: llama3
---
# Llama3 8B SEA-LIONv2

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.

This is the card for the Llama3 8B SEA-LIONv2 base model, which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

### Model Description

The continued pre-training data for the Llama3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.
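
A minimal usage sketch with the Hugging Face `transformers` library is shown below. The repository id used here is an assumption for illustration and may differ from the actual id on the Hugging Face Hub.

```python
# Minimal loading-and-generation sketch (the repository id below is an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-sea-lionv2-base"  # assumed id; replace with the actual Hub id

# The tokenizer is the default Meta-Llama-3-8B-Instruct tokenizer, as noted above.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

prompt = "Ibu kota Indonesia adalah"  # Indonesian: "The capital of Indonesia is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this card describes the base model, the example uses plain text completion rather than a chat template.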

### Benchmark Performance

We evaluated the Llama3 8B SEA-LIONv2 base model on general language capabilities.

#### General Language Capabilities

For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
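
As a rough illustration of what such zero-shot, prompt-based evaluation looks like, the sketch below scores a toy Indonesian sentiment task. The prompts, examples, and scoring are illustrative assumptions, not BHASA's actual tasks or harness, and the repository id is likewise assumed.

```python
# Illustrative zero-shot evaluation loop (toy data; NOT the BHASA harness).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-sea-lionv2-base"  # assumed id; replace with the actual Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Toy (sentence, gold label) pairs in Indonesian.
examples = [
    ("Makanan di restoran ini enak sekali.", "positif"),   # "The food here is very tasty." -> positive
    ("Pelayanannya sangat mengecewakan.", "negatif"),      # "The service is very disappointing." -> negative
]

correct = 0
for text, gold in examples:
    prompt = f"Kalimat: {text}\nSentimen (positif atau negatif):"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Keep only the newly generated tokens and check them against the gold label.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    correct += int(gold in completion.lower())

print(f"accuracy: {correct / len(examples):.2f}")
```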

### Data

The Llama3 8B SEA-LIONv2 base model underwent continued pre-training on 48B tokens of the following data:

| Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
|---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
| Wiki* - Vietnamese | 0.31 | 4 | 1.24 | 2.58 |

Note:
- All token counts are counted using the Llama3 tokenizer (see the token-counting sketch after this list)
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
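
As an illustration of how the token counts above might be produced, the sketch below counts tokens in a text file with the Llama3 tokenizer via `transformers`; the corpus path and file format are assumptions. The total trained-on tokens for a source are then its unique tokens times its multiplier, e.g. 0.31B × 4 ≈ 1.24B for Wiki* - Vietnamese.

```python
# Sketch: counting corpus tokens with the Llama3 tokenizer (path and format are assumptions).
from transformers import AutoTokenizer

# Any Llama3 checkpoint shares the same tokenizer; Meta-Llama-3-8B-Instruct is used here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

unique_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:  # assumed: one document per line
    for line in f:
        unique_tokens += len(tokenizer(line, add_special_tokens=False)["input_ids"])

multiplier = 4  # number of passes over this source, as in the table
print(f"unique tokens: {unique_tokens:,}  total tokens: {unique_tokens * multiplier:,}")
```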

### Infrastructure

Llama3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer) on the following hardware:

| Training Details     | Llama3 8B SEA-LIONv2 |
|----------------------|:--------------------:|
| AWS EC2 p5d.24xlarge | 8 instances          |
| Nvidia H100 80GB GPU | 64                   |

### Configuration

| HyperParameter    | Llama3 8B SEA-LIONv2 |
|-------------------|:--------------------:|
| Precision         | bfloat16             |
| Optimizer         | decoupled_adamw      |
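
The run can be pictured as a standard Composer training loop over the continued pre-training corpus. Below is a minimal single-process sketch using MosaicML Composer with the precision and optimizer from the table above; the starting checkpoint follows the card, while the dataset, learning rate, batch size, and duration are placeholders rather than the actual SEA-LIONv2 training configuration.

```python
# Minimal Composer sketch (placeholder data and hyperparameters; not the actual training script).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer import Trainer
from composer.models import HuggingFaceModel
from composer.optim import DecoupledAdamW

class ToyCausalLMDataset(Dataset):
    """Placeholder dataset yielding fixed-length token blocks (stand-in for the 48B-token corpus)."""
    def __init__(self, tokenizer, n_samples=32, seq_len=128):
        text = "SEA-LION stands for Southeast Asian Languages In One Network. "
        ids = tokenizer(text * 50, return_tensors="pt")["input_ids"][0][:seq_len]
        self.samples = [ids.clone() for _ in range(n_samples)]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        ids = self.samples[idx]
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids.clone()}

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # starting checkpoint, per the card
tokenizer = AutoTokenizer.from_pretrained(base)
hf_model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

composer_model = HuggingFaceModel(hf_model, tokenizer=tokenizer)
optimizer = DecoupledAdamW(composer_model.parameters(), lr=1e-5)  # decoupled_adamw; lr is a placeholder

trainer = Trainer(
    model=composer_model,
    train_dataloader=DataLoader(ToyCausalLMDataset(tokenizer), batch_size=2),
    optimizers=optimizer,
    max_duration="1ep",     # the real run covered roughly 48B tokens
    precision="amp_bf16",   # bfloat16, as in the configuration table
    device="gpu",
)
trainer.fit()
```

A real multi-node run on the hardware listed above would additionally use Composer's distributed launcher and sharded training, which this single-process sketch omits.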
|