waiyiaisg committed
Commit d0d5235
1 Parent(s): 29bf150

Update English benchmark results and format of data table

Files changed (1):
  1. README.md +18 -18

README.md CHANGED
@@ -44,7 +44,7 @@ The evaluation was done **five-shot** with native prompts and only a sample of 1
 
 **BHASA**
 
-
+To be released soon
 
 **English**
 
@@ -64,23 +64,23 @@ The evaluation was done **five-shot** with native prompts and only a sample of 1
 
 LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:
 
-| Data Source               | Unique Tokens | Multiplier | Total Tokens | Percentage |
-|---------------------------|:-------------:|:----------:|:------------:|:----------:|
-| Dolma RefinedWeb - English| 7.650B        | 1          | 7.650B       | 15.90%     |
-| Dolma C4 - English        | 1.160B        | 1          | 1B           | 9.21%      |
-| Dolma Reddit - English    | 1.339B        | 1          | 14.7B        | 2.42%      |
-| Dolma Semantic Scholar    | 0.959B        | 1          | 2.9B         | 2.79%      |
-| Dolma arXiv               | 0.469B        | 1          | 5.3B         | 1.99%      |
-| Dolma StarCoder           | 4.422B        | 1          | 4.9B         | 0.98%      |
-| SEA-LION Pile - Indonesian| 3.4B          | 1          | 6.8B         | 14.17%     |
-| Wiki* - Indonesian        | 0.3B          | 4          | 1.2B         | 2.50%      |
-| SEA-LION Pile - Tamil     | 5.6B          | 1          | 5.6B         | 11.67%     |
-| Wiki* + News - Tamil      | 0.6B          | 4          | 2.4B         | 5.00%      |
-| SEA-LION Pile - Thai      | 2.28B         | 1          | 2.28B        | 4.75%      |
-| WangChanBERTa - Thai      | 5B            | 1          | 5B           | 10.42%     |
-| Wiki* - Thai              | 0.18B         | 4          | 0.72B        | 1.50%      |
-| SEA-LION Pile - Vietnamese| 6.76B         | 1          | 6.76B        | 14.08%     |
-| Wiki* - Vietnamese        | 0.31B         | 4          | 1.24B        | 2.58%      |
+| Data Source               | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
+|---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
+| Dolma RefinedWeb - English| 7.650             | 1          | 7.650            | 15.90          |
+| Dolma C4 - English        | 1.160             | 1          | 1                | 9.21           |
+| Dolma Reddit - English    | 1.339             | 1          | 14.7             | 2.42           |
+| Dolma Semantic Scholar    | 0.959             | 1          | 2.9              | 2.79           |
+| Dolma arXiv               | 0.469             | 1          | 5.3              | 1.99           |
+| Dolma StarCoder           | 4.422             | 1          | 4.9              | 0.98          |
+| SEA-LION Pile - Indonesian| 3.4               | 1          | 6.8              | 14.17          |
+| Wiki* - Indonesian        | 0.3               | 4          | 1.2              | 2.50           |
+| SEA-LION Pile - Tamil     | 5.6               | 1          | 5.6              | 11.67          |
+| Wiki* + News - Tamil      | 0.6               | 4          | 2.4              | 5.00           |
+| SEA-LION Pile - Thai      | 2.28              | 1          | 2.28             | 4.75           |
+| WangChanBERTa - Thai      | 5                 | 1          | 5                | 10.42          |
+| Wiki* - Thai              | 0.18              | 4          | 0.72             | 1.50           |
+| SEA-LION Pile - Vietnamese| 6.76              | 1          | 6.76             | 14.08          |
+| Wiki* - Vietnamese        | 0.31              | 4          | 1.24             | 2.58           |
 
 Note:
 - All token counts are counted using LLaMA3 tokenizer
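The data-mixture table above implies two arithmetic relations: Total Tokens should equal Unique Tokens × Multiplier, and the Percentage column should sum to roughly 100%. A minimal sketch of a cross-check (rows transcribed from this commit's table; the script itself is illustrative and not part of the repository) could look like:

```python
# Consistency check for the continued pre-training data mixture.
# Each row: (source, unique_tokens_B, multiplier, total_tokens_B, percentage).
rows = [
    ("Dolma RefinedWeb - English", 7.650, 1, 7.650, 15.90),
    ("Dolma C4 - English",         1.160, 1, 1,     9.21),
    ("Dolma Reddit - English",     1.339, 1, 14.7,  2.42),
    ("Dolma Semantic Scholar",     0.959, 1, 2.9,   2.79),
    ("Dolma arXiv",                0.469, 1, 5.3,   1.99),
    ("Dolma StarCoder",            4.422, 1, 4.9,   0.98),
    ("SEA-LION Pile - Indonesian", 3.4,   1, 6.8,   14.17),
    ("Wiki* - Indonesian",         0.3,   4, 1.2,   2.50),
    ("SEA-LION Pile - Tamil",      5.6,   1, 5.6,   11.67),
    ("Wiki* + News - Tamil",       0.6,   4, 2.4,   5.00),
    ("SEA-LION Pile - Thai",       2.28,  1, 2.28,  4.75),
    ("WangChanBERTa - Thai",       5,     1, 5,     10.42),
    ("Wiki* - Thai",               0.18,  4, 0.72,  1.50),
    ("SEA-LION Pile - Vietnamese", 6.76,  1, 6.76,  14.08),
    ("Wiki* - Vietnamese",         0.31,  4, 1.24,  2.58),
]

# Percentages should account for (nearly) the whole mixture.
pct_sum = sum(r[4] for r in rows)
print(f"percentages sum to {pct_sum:.2f}%")  # ~100% up to rounding

# Rows where Unique x Multiplier differs from the stated Total flag
# likely transcription slips worth double-checking against the source data.
for name, unique, mult, total, _ in rows:
    if abs(unique * mult - total) > 0.01:
        print(f"mismatch: {name}: {unique} x {mult} != {total}")
```

Run against this commit's table, the percentage column does close to ~100%, but several English (Dolma) rows fail the Unique × Multiplier = Total check, which suggests those columns may not have been updated consistently in this edit.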