dreamerdeo committed on
Commit
9c219a7
1 Parent(s): b9ae2c4

Create README.md

  1. README.md +143 -0
README.md ADDED
@@ -0,0 +1,143 @@
+ ---
+ language:
+ - en
+ - zh
+ - id
+ - th
+ - vi
+ - ms
+ - lo
+ - my
+ - jv
+ - km
+ - su
+ - tl
+ tags:
+ - multilingual
+ - sea
+ - sailor
+ - sft
+ - chat
+ - instruction
+ widget:
+ - text: 如何制作烤鱼?
+   example_title: Chinese
+ - text: How to bake fish?
+   example_title: English
+ - text: Bagaimana cara memanggang ikan?
+   example_title: Malay
+ - text: วิธีย่างปลา?
+   example_title: Thai
+ - text: Bagaimana membuat bakaran ikan?
+   example_title: Indonesian
+ - text: Làm thế nào để nướng cá?
+   example_title: Vietnamese
+ license: apache-2.0
+ base_model:
+ - sail/Sailor2-1B
+ ---
+
+ <div align="center">
+   <img src="sailor2_banner.jpg" width="700"/>
+ </div>
+
+ > The logo was generated by Midjourney.
+
+ Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA).
+ Our research highlights a strong demand for models in the **8B and 20B parameter** range for production use, alongside **1B models** for specialized applications
+ such as speculative decoding and research.
+ These models, released under the **Apache 2.0 license**, provide enhanced accessibility to advanced language technologies across the region.
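
Since speculative decoding is named as a use case for the 1B model, here is a minimal sketch of the greedy variant of the idea. It uses toy stand-in "models" (plain Python functions, hypothetical and unrelated to the actual Sailor2 checkpoints): a small draft model proposes a few tokens, the large target model checks them, and the output matches what the target model would have produced on its own. Sampling-based speculative decoding uses a probabilistic accept/reject rule instead, which is not shown here.

```python
from typing import Callable, List

# A "model" here is a hypothetical stand-in: a function that maps a token
# sequence to the single next token it would choose greedily.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference behaviour: greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each proposed position. A real
        #    implementation scores all positions in one forward pass.
        for i, proposed in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != proposed:
                # First mismatch: keep the accepted prefix and take the
                # target model's own token for this position.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched the target model's choice.
            tokens.extend(proposal)
    return tokens


def toy_target(tokens: List[int]) -> int:
    """Toy 'large' model: a fixed arithmetic rule on the last token."""
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after token 5."""
    return 0 if tokens[-1] == 5 else toy_target(tokens)


print(speculative_decode(toy_target, toy_draft, [1], 10))
print(greedy_decode(toy_target, [1], 10))
```

Both calls print the same sequence, which is the point of the technique: the draft model only affects how many target-model steps are needed, not the result. In Hugging Face `transformers`, assisted generation is exposed through the `assistant_model` argument of `generate`; this README does not state which Sailor2 model pairs have been evaluated that way.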
+
+
+ Sailor2 builds on the multilingual model [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and
+ is continually pre-trained on **500B tokens** to better support **15 languages** with a single unified model.
+ These languages are English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.
+ By addressing the growing demand for diverse, robust, and accessible language models,
+ Sailor2 seeks to serve underserved communities in SEA with open, inclusive, and accessible multilingual LLMs.
+
+ Refer to the [Sailor2 website](https://sailorllm.github.io/) for more training details.
+
+ ## Model Summary
+ - **Model Collections:** [Base Model & Chat Model](https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b)
+ - **Project Website:** [sailorllm.github.io](https://sailorllm.github.io/)
+ - **Codebase:** [github.com/sail-sg/sailor2](https://github.com/sail-sg/sailor2)
+ - **Technical Report:** Coming Soon
+
+
+ ## Training details
+
+
+ ## Requirements
+ Sailor2 support is included in recent releases of Hugging Face `transformers`; we recommend installing `transformers==4.46.3`.
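
As a quick sanity check before running the Quickstart, the installed `transformers` version can be compared against the recommended one. This is only a sketch: `parse_release` and `check_transformers` are hypothetical helper names introduced here, and the comparison looks only at the leading numeric components of the version string.

```python
from importlib.metadata import PackageNotFoundError, version

RECOMMENDED = "4.46.3"  # version recommended in the Requirements section


def parse_release(v: str) -> tuple:
    """Turn '4.46.3' or '4.47.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def check_transformers(recommended: str = RECOMMENDED) -> str:
    """Return a short, human-readable status message."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return f"transformers is not installed; try: pip install transformers=={recommended}"
    if parse_release(installed) < parse_release(recommended):
        return f"transformers {installed} is older than the recommended {recommended}"
    return f"transformers {installed} is installed"


print(check_transformers())
```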
+
+ ## Quickstart
+
+ The following snippet shows how to load the tokenizer and model and generate a response.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the model and the tokenizer from the same checkpoint.
+ model_id = "sail/Sailor2-1B-Chat"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ system_prompt = (
+     "You are an AI assistant named Sailor2, created by Sea AI Lab. "
+     "As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages "
+     "such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. "
+     "Your responses should be friendly, unbiased, informative, detailed, and faithful."
+ )
+
+ # Indonesian prompt; the commented alternatives are Vietnamese and Thai.
+ prompt = "Beri saya pengenalan singkat tentang model bahasa besar."
+ # prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."
+ # prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"
+
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": prompt},
+ ]
+ # Render the conversation with the model's chat template.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # Move the inputs to the device chosen by device_map="auto".
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Pass input_ids and attention_mask together.
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512,
+ )
+
+ # Drop the prompt tokens so that only the newly generated tokens are decoded.
+ generated_ids = [
+     output_ids[len(input_ids):]
+     for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
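
The last step of the Quickstart slices each output row because, for decoder-only models, `generate` returns the prompt tokens followed by the new tokens. The sketch below isolates that step on plain Python lists so it can be run without downloading a model. `strip_prompt_tokens` is a hypothetical helper name, and the token IDs are made up for illustration.

```python
from typing import List, Sequence


def strip_prompt_tokens(
    prompt_ids: Sequence[Sequence[int]],
    output_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """For each batch row, drop the leading prompt tokens from the output."""
    return [list(out[len(prompt):]) for prompt, out in zip(prompt_ids, output_ids)]


# Made-up token IDs: each output row starts with a copy of its prompt row.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

print(strip_prompt_tokens(prompts, outputs))  # [[42, 43], [44, 45]]
```

Slicing by prompt length works here because every row of a padded batch has the same input length, so the new tokens always start at the same offset within a row.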
123
+
124
+ # License
125
+
126
+ Sailor2 is distributed under the terms of the Apache License 2.0.
127
+ No restrict on the research and the commercial use.
128
+
129
+ ## Citation
130
+
131
+ If you find Sailor2 useful, please cite our work as follows:
132
+
133
+ ```
134
+ @misc{sailor2report,
135
+ title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
136
+ author={Sailor2 Team},
137
+ year={2024}
138
+ }
139
+ ```
140
+
141
+ # Contact Us
142
+
143
+ If you have any questions, please raise an issue or contact us at [doulx@sea.com](mailto:doulx@sea.com) or [liuqian.sea@gmail.com](mailto:liuqian.sea@gmail.com).