---
license: llama2
inference: false
---
<p align="center">
<img src="seal_logo.png" width="200" />
</p>

# SeaLLM - An Assistant for South East Asian Languages

<!-- - DEMO: [DAMO-NLP-SG/damo-seal-v0](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) -->

<p align="center">
🤗 <a href="https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0">Hugging Face DEMO</a>
</p>

We introduce SeaLLM - a family of language models optimized for South East Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf) on a tailored, publicly available dataset that comprises mainly Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭 texts, along with those in English 🇬🇧 and Chinese 🇨🇳. Pre-training proceeds in multiple stages with dynamic data control to preserve the original knowledge base of Llama-2 while gaining new abilities in SEA languages.

The [SeaLLM-chat](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) model underwent supervised finetuning (SFT) on a mix of public instruction data (e.g. [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca)) and a small amount of internally collected natural queries from SEA native speakers, which **adapts the model to the local cultural norms, customs, styles and laws in these regions**, together with other SFT enhancement techniques (to be revealed later).

Our customized SFT process helps enhance our models' ability to understand, respond to and serve communities whose languages are often neglected by previous [English-dominant LLMs](https://arxiv.org/abs/2307.09288), while outperforming existing polyglot LLMs such as [BLOOM](https://arxiv.org/abs/2211.05100) or [PolyLM](https://arxiv.org/pdf/2307.06018.pdf).

Our [first released SeaLLM](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) supports Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭. Future versions will endeavor to cover all languages spoken in South East Asia.

<!-- - Model links: [DAMO-NLP-SG/seal-13b-chat-a](https://huggingface.co/DAMO-NLP-SG/seal-13b-chat-a) -->
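
Once the chat weights are published, loading should follow the standard `transformers` workflow. The snippet below is a minimal sketch only: the repository id is taken from the tentative, commented-out model link above and may change before release.

```python
# Minimal loading sketch; assumes the weights are eventually released under
# the tentative repo id from the commented-out link above (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DAMO-NLP-SG/seal-13b-chat-a"  # tentative; may change at release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Xin chào! Bạn có thể giúp gì cho tôi?"  # example Vietnamese query
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```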

<blockquote style="color:red">
<p><strong style="color: red">Terms of Use</strong>: By using our released weights, codes and demos, you agree to and comply with the following terms and conditions:</p>
<ul>
<li>Follow the Llama-2 <a rel="noopener nofollow" href="https://ai.meta.com/llama/license/">License</a> and <a rel="noopener nofollow" href="https://ai.meta.com/llama/use-policy/">Terms of Use</a>.</li>
<li>Strictly comply with the local regulations where you operate, and do not attempt to generate, or elicit our models to generate, locally or internationally illegal or inappropriate content.</li>
</ul>
</blockquote>

> **Disclaimer**:
> We must note that even though the weights, codes and demos are released openly, similar to other pre-trained language models, and despite our best efforts in red teaming, safety finetuning and enforcement, our models come with potential risks influenced by complex factors, including but not limited to over-diversified, inaccurate, misleading or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages or other liability arising from the use of the released weights, codes or demos.

> The logo was generated by DALL-E 3.

The following sections summarize the technical specifications and performance evaluations.

## Pre-training

### Vocabulary Expansion

Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin text into unsustainably long byte-level sequences that carry far less semantic content per token, leading to degraded performance [(Nguyen et al., 2023)](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai than in English. As a result, the model fails at summarization and comprehension tasks whose inputs would exceed the context length.

Our goals for vocabulary expansion are threefold: (1) the number of newly added tokens must be minimal and only cover the new languages, (2) the new tokens should bring the compression ratios of the new languages close to that of English, and (3) the expansion should minimally disrupt existing tokens, so as to preserve Llama-2's knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.

As seen in the table below, our new vocabulary reduces the compression ratio for Thai from 4.29 to 1.57, meaning the model can now encode 2.7x more Thai text within the same context length. Meanwhile, English is affected by only 0.3%, preserving its integrity.

| Language | Llama's ratio | Our ratio | # New tokens |
| --- | --- | --- | --- |
| Vi | 2.91 | 1.2488 | 2304 |
| Zh | 1.99 | 1.1806 | 3456 |
| Th | 4.29 | 1.5739 | 1536 |
| Id | 1.76 | 1.1408 | 3840 |
| En | 1.00 | 0.9976 | 0 |
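
For reference, a compression ratio of this kind can be measured with a short script. The sketch below is a best-guess formulation rather than the authors' exact procedure: it averages the ratio of token counts between a target-language sentence and its English translation over a parallel sample.

```python
# Sketch of measuring a tokenizer's compression ratio relative to English.
# Assumes a parallel corpus of (English, target-language) sentence pairs;
# the single pair below is only a placeholder.
from transformers import AutoTokenizer

def compression_ratio(tokenizer, parallel_pairs):
    """Mean ratio of target-language token count to English token count."""
    ratios = []
    for en_text, xx_text in parallel_pairs:
        n_en = len(tokenizer.encode(en_text, add_special_tokens=False))
        n_xx = len(tokenizer.encode(xx_text, add_special_tokens=False))
        ratios.append(n_xx / n_en)
    return sum(ratios) / len(ratios)

pairs = [("How are you today?", "วันนี้คุณเป็นอย่างไรบ้าง")]  # placeholder sample
# Requires access to the gated meta-llama repository.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
print(f"Thai/En ratio (base tokenizer): {compression_ratio(base_tok, pairs):.2f}")
```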

### Pre-training Data

Details of the pre-training data will be provided in the upcoming technical report.

### Pre-training Strategies

We conduct pre-training in 4 different stages. Each stage serves a different specific objective and involves dynamic control of the data mixture, both unsupervised and supervised, as well as data specification and categorization. We also employ novel sequence construction and masking techniques during these stages. More details are to be provided in the technical report.

As our goal is for Llama-2 to learn the new languages with the fewest tokens and computing resources, we control an appropriate data mix of new (Vi, Id & Th) and old (En & Zh) languages, so that the new vocabulary and knowledge are acquired quickly, while the performance of the original Llama-2 model is largely maintained and a knowledge bridge is established between the new and existing languages.

We pre-train our SeaLLM-base models for ~4 weeks on 32 GPUs, consuming ~150B tokens.
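
The exact stage schedule is unpublished, so the following is only an illustrative sketch of stage-dependent language mixing: each training batch draws documents from per-language buckets with weights that change by stage. All weight values below are made up.

```python
import random

# Hypothetical per-stage sampling weights over language buckets.
# The real 4-stage schedule and weights are not public; these values are
# placeholders to illustrate dynamic data-mixture control.
STAGE_MIX = {
    1: {"en": 0.35, "zh": 0.15, "vi": 0.20, "id": 0.20, "th": 0.10},
    2: {"en": 0.25, "zh": 0.10, "vi": 0.25, "id": 0.25, "th": 0.15},
}

def sample_batch_languages(stage: int, batch_size: int, rng: random.Random) -> list[str]:
    """Pick a source-language bucket for each document in a batch."""
    langs = list(STAGE_MIX[stage])
    weights = [STAGE_MIX[stage][lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=batch_size)

rng = random.Random(0)
print(sample_batch_languages(stage=1, batch_size=8, rng=rng))
```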

## Supervised Finetuning (SFT)

### SFT Data

Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As these datasets are predominantly English, we employ several established or novel automatic techniques to gather more instruction data for SEA languages.

More importantly, we engaged native speakers to collect a small amount of natural query and response data, which adapts to the local cultural customs, norms and laws. We also collect country-relevant safety data that covers many culturally and legally sensitive topics in each of these countries, topics that are often ignored by, or even in conflict with, Western-centric safety data. Therefore, we believe our models are more local-friendly and abide by local rules to a higher degree.
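
As an illustration of how such SFT examples might be rendered for training, the sketch below formats a record with the standard Llama-2 chat template. The field names and the example record are hypothetical, not the actual data schema.

```python
# Hypothetical SFT record and a renderer into the standard Llama-2 chat
# template ([INST] ... [/INST]); the record below is illustrative only.
record = {
    "lang": "vi",
    "category": "natural_query",
    "system": "You are a helpful assistant for South East Asian users.",
    "query": "Tết Nguyên Đán là gì?",
    "response": "Tết Nguyên Đán là dịp lễ quan trọng nhất trong năm của người Việt...",
}

def to_llama2_chat(example: dict) -> str:
    """Render one (system, query, response) triple in Llama-2 chat format."""
    return (
        f"<s>[INST] <<SYS>>\n{example['system']}\n<</SYS>>\n\n"
        f"{example['query']} [/INST] {example['response']} </s>"
    )

print(to_llama2_chat(record))
```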

### SFT Strategies

Details of our SFT strategies are to be provided in the technical report.

## Evaluation

### Peer Comparison

Evaluated by

<!-- ! Add the stack chart better -->
| vs ChatGPT | win | lose | tie |
| --- | --- | --- | --- |
| Polylm-13b-chat | 204 | 1517 | 122 |
| Qwen-14b-chat | 433 | 1128 | 306 |
| SeaLLM-13bChat/SFT/v1 | 454 | 1185 | 209 |
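
Reading the table: each row compares a model's responses head-to-head against ChatGPT's. For example, SeaLLM-13bChat/SFT/v1 wins 454, loses 1185 and ties 209 of its 1,848 comparisons. A simple aggregate score, counting a tie as half a win (one common convention, not necessarily the authors'), can be computed as follows.

```python
# Win-rate score over the peer-comparison counts above, counting a tie as
# half a win (one common convention; not necessarily the authors').
def score(win: int, lose: int, tie: int) -> float:
    return (win + 0.5 * tie) / (win + lose + tie)

print(f"{score(454, 1185, 209):.1%}")  # SeaLLM-13bChat/SFT/v1 -> 30.2%
```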

### M3Exam - World Knowledge in Regional Languages

[M3Exam](https://arxiv.org/abs/2306.05179) is a benchmark built from authentic human exam questions collected from official school examinations in multiple countries and languages, making it a test of world knowledge and reasoning in regional languages.

<!-- | Qwen-7b-chat | 33.91 | 60.85 | 29.57 | 0.00 | 18.04
| Qwen-13b-v3-pro | 75.30 | 89.27 | 56.68 | 49.46 | 39.35
| Qwen-13b-v3-pro-SFT | 38.20 | 4.23 | 46.39 | 33.97 | 19.79
| Qwen-14b | 75.56 | 88.78 | 54.61 | 49.97 | 42.62
| Qwen-14b-SFT | 49.50 | 41.79 | 54.84 | 44.91 | 19.51 -->

| M3-exam / 3-shot | En | Zh | Vi | Id | Th |
|-----------| ------- | ------- | ------- | ------- | ------- |
| Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 |
| ChatGPT | 75.46 | 60.20 | 58.64 | ? | 37.41 |
| Llama-2-13b | 59.88 | 43.40 | 41.70 | 34.80 | 23.18 |
| Llama-2-13b-chat | 61.17 | 43.29 | 39.97 | 35.50 | 23.74 |
| Polylm-13b-chat | 32.23 | 29.26 | 29.01 | 25.36 | 18.08 |
| Qwen-PolyLM-7b-chat | 53.65 | 61.58 | 39.26 | 33.69 | 29.02 |
| SeaLLM-13b/78k-step | 58.19 | 41.95 | 46.56 | 37.63 | 31.00 |
| SeaLLM-13bChat/SFT/v1 | 63.53 | 45.47 | 50.25 | 39.85 | 36.07 |
| SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49 |
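
"3-shot" means each test question is preceded by worked exemplars. Below is a generic sketch of how such a multiple-choice prompt can be assembled; the formatting is illustrative and not the exact harness used for the numbers above.

```python
# Illustrative 3-shot multiple-choice prompt builder; the exact prompt
# format used for the M3Exam numbers above is not specified in this card.
def build_prompt(exemplars: list[dict], question: dict) -> str:
    parts = []
    for ex in exemplars + [question]:
        options = "\n".join(f"({k}) {v}" for k, v in ex["options"].items())
        answer = ex.get("answer", "")  # empty for the test question
        parts.append(f"Question: {ex['text']}\n{options}\nAnswer: {answer}")
    # rstrip() leaves the prompt ending in "Answer:" for the model to complete.
    return "\n\n".join(parts).rstrip()

shots = [
    {"text": "2 + 2 = ?", "options": {"A": "3", "B": "4"}, "answer": "B"},
] * 3  # placeholder exemplars; real ones come from the benchmark's dev split
test_q = {"text": "5 + 7 = ?", "options": {"A": "12", "B": "13"}}
print(build_prompt(shots, test_q))
```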

<!-- ! Considering removing zero-shot -->
<!-- | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 -->
<!-- | M3-exam / 0-shot | En | Zh | Vi | Id | Th
|-----------| ------- | ------- | ------- | ------- | ------- |
| ChatGPT | 75.98 | 61.00 | 57.18 | 48.58 | 34.09
| Llama-2-13b | 19.49 | 39.07 | 35.38 | 23.66 | 12.44
| Llama-2-13b-chat | 52.57 | 39.52 | 36.56 | 27.39 | 10.40
| Polylm-13b-chat | 28.74 | 27.71 | 25.77 | 22.01 | 13.65
| Qwen-PolyLM-7b-chat | 52.51 | 56.14 | 32.34 | 25.49 | 24.64
| SeaLLM-13b/78k-step | 36.68 | 36.58 | 41.98 | 25.87 | 20.11
| SeaLLM-13bChat/SFT/v1 | 64.30 | 45.58 | 48.13 | 37.76 | 30.77
| SeaLLM-13bChat/SFT/v2 | 62.23 | 41.00 | 47.23 | 35.10 | 30.77 -->

### MMLU - Retaining English-based knowledge

| MMLU | Average | STEM | Social Sciences | Humanities | Others |
|-----------| ------- | ------- | ------- | ------- | ------- |
| Llama-2-13b | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| Llama-2-13b-chat? | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| SeaLLM-13bChat/SFT/v1 | 64.30 | 45.58 | 48.13 | 37.76 | 30.77 |
| SeaLLM-13bChat/SFT/v2 | 62.23 | 41.00 | 47.23 | 35.10 | 30.77 |

### NLP tasks

#### Reading Comprehension (XQuAD & IndoQA)

Evaluated with 1-shot prompting.

| Reading Comprehension | En | Zh | Vi | Id | Th | ALL | SEA |
|-----------| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| Llama-2-13b | 83.22 | 78.02 | 71.03 | 59.31 | 30.73 | 64.46 | 59.77 |
| Llama-2-13b-chat | 80.46 | 70.54 | 62.87 | 63.05 | 25.73 | 60.93 | 51.21 |
| SeaLLM-13b-chat-v1 | 83.12 | 73.95 | 74.16 | 61.37 | 60.94 | 70.71 | 65.49 |
| SeaLLM-13b-chat-v2 | 81.51 | 76.10 | 73.64 | 69.11 | 64.54 | 72.98 | 69.10 |
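
The card does not state the scoring metric; XQuAD-style extractive QA is conventionally scored with SQuAD token-level F1, sketched below under that assumption.

```python
# SQuAD-style token-level F1, the conventional metric for XQuAD-type
# extractive QA. This metric choice is an assumption (the card does not
# state it); non-whitespace languages need their own tokenization.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```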

#### Translation

Evaluated with 4-shot prompting.

| Model | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En |
|-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Llama-2-13b | 24.36 | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55 |
| Llama-2-13b-chat | 19.58 | 51.70 | 57.14 | 21.18 | 37.40 | 52.27 | 54.32 | 60.55 | 30.18 | 49.33 |
| SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85 |
| SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10 |

#### Summarization

Evaluated 2-shot on XL-Sum, reporting ROUGE-L.

| XL-Sum (ROUGE-L) | En | Zh | Vi | Id | Th |
|-------- | ---- | ---- | ---- | ---- | ---- |
| Llama-2-13b | 32.57 | 34.37 | 18.61 | 25.14 | 16.91 |
| Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51 |
| SeaLLM-13b-chat-v2 | 27.00 | 33.31 | 20.31 | 25.69 | 21.97 |
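
For reproducibility, ROUGE-L can be computed with the open-source `rouge-score` package. This is a generic sketch, not necessarily the authors' exact scoring setup; in particular, non-whitespace-delimited languages such as Thai require language-specific tokenization before scoring.

```python
# Generic ROUGE-L computation with the `rouge-score` package
# (pip install rouge-score). Not necessarily the authors' exact setup:
# Thai and Chinese require language-specific tokenization first.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
reference = "the cat sat on the mat"
prediction = "the cat was sitting on the mat"
result = scorer.score(reference, prediction)  # score(target, prediction)
print(f"ROUGE-L F1: {result['rougeL'].fmeasure:.4f}")
```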

## Citation

If you find our project useful, we hope you will star our repo and cite our work as follows:

```
@article{damonlpsg2023seallm,
  author = {???},
  title = {SeaLLM: A language model for South East Asian Languages},
  year = 2023,
}
```