jujbob committed
Commit 31f16eb • 1 Parent(s): 3975922

First commit
Files changed (1): README.md (+233, -3)

README.md: the previous front matter, which contained only `license: llama3`, is replaced by the expanded model card below.
---
language:
- en
- ko
license: llama3
library_name: transformers
base_model:
- meta-llama/Meta-Llama-3-8B
---

<a href="https://github.com/MLP-Lab/Bllossom">
  <img src="https://github.com/teddysum/bllossom/blob/main//bllossom_icon.png?raw=true" width="40%" height="50%">
</a>

# Bllossom | [Demo]() | [Homepage](https://www.bllossom.ai/) | [Github](https://github.com/MLP-Lab/Bllossom) | [Colab-tutorial](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing) |

```text
Our Bllossom team has released Bllossom, a Korean-English bilingual language model!
With support from the Seoultech supercomputing center, the entire model was fully fine-tuned on more than 100GB of Korean data, making it a Korean-enhanced bilingual model!
Looking for a model that is good at Korean?
- A first for Korean: vocabulary expansion of more than 30,000 Korean tokens
- Can handle Korean contexts roughly 25% longer than Llama 3
- Korean-English knowledge linking using a Korean-English parallel corpus (pre-training)
- Fine-tuning on data crafted by linguists with Korean culture and language in mind
- Reinforcement learning
All of this is applied at once and commercial use is allowed, so build your own model with Bllossom!
This model is a quantized model that can run on a 4GB GPU!

1. Bllossom-8B is a practically oriented language model built in collaboration with linguists from Seoultech, Teddysum, and the Yonsei University language resource lab! We will maintain it with continuous updates, so please make good use of it 🙂
2. We also have the extremely powerful Advanced-Bllossom 8B and 70B models, as well as vision-language models! (If you are curious, please contact us individually!)
3. Bllossom was accepted for presentation at NAACL 2024 and LREC-COLING 2024 (oral).
4. We will keep updating with good language models! Anyone who would like to collaborate on strengthening Korean (especially on papers) is always welcome!
   In particular, teams that can lend even a small number of GPUs are always welcome to contact us! We will help you build whatever you want.
```

The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the connection between Korean and English knowledge and has the following features:

* **Knowledge Linking**: Linking Korean and English knowledge through additional training
* **Vocabulary Expansion**: Expanding the Korean vocabulary to enhance Korean expressiveness (see the tokenizer sketch after this list)
* **Instruction Tuning**: Tuning with custom-made instruction-following data specialized for the Korean language and Korean culture
* **Human Feedback**: DPO has been applied
* **Vision-Language Alignment**: Aligning the vision transformer with this language model

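As a quick way to see what the vocabulary expansion means in practice, the short sketch below counts how many tokens the Bllossom tokenizer and the base `meta-llama/Meta-Llama-3-8B` tokenizer need for the same Korean sentence; fewer tokens per sentence is what allows roughly 25% more Korean text in the same context window. This is an illustrative snippet rather than part of the official examples, and it assumes you have access to the gated base-model repository.

```python
from transformers import AutoTokenizer

# Illustrative only: compare Korean token counts between the expanded
# Bllossom tokenizer and the base Llama 3 tokenizer.
# meta-llama/Meta-Llama-3-8B is a gated repo, so this assumes you have
# accepted its license and logged in with `huggingface-cli login`.
bllossom_tokenizer = AutoTokenizer.from_pretrained("MLP-KTLim/llama-3-Korean-Bllossom-8B")
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다."

print("Bllossom tokens    :", len(bllossom_tokenizer.tokenize(text)))
print("Base Llama 3 tokens:", len(base_tokenizer.tokenize(text)))
# Fewer tokens per Korean sentence means more Korean text fits into the same context window.
```
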
**This model was developed by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/), and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim).**

## Demo Video

<div style="display: flex; justify-content: space-between;">
  <!-- First column -->
  <div style="width: 49%;">
    <a>
      <img src="https://github.com/lhsstn/lhsstn/blob/main/x-llava_dem.gif?raw=true" style="width: 100%; height: auto;">
    </a>
    <p style="text-align: center;">Bllossom-V Demo</p>
  </div>

  <!-- Second column (if needed) -->
  <div style="width: 49%;">
    <a>
      <img src="https://github.com/lhsstn/lhsstn/blob/main/bllossom_demo_kakao.gif?raw=true" style="width: 70%; height: auto;">
    </a>
    <p style="text-align: center;">Bllossom Demo (Kakao)</p>
  </div>
</div>


## NEWS
* [2024.05.08] Vocab Expansion Model Update
* [2024.04.25] We released Bllossom v2.0, based on Llama 3.
* [2023.12] We released Bllossom-Vision v1.0, based on Bllossom.
* [2023.08] We released Bllossom v1.0, based on Llama 2.
* [2023.07] We released Bllossom v0.7, based on polyglot-ko.

## Example Code

### Colab Tutorial
- [Inference-Code-Link](https://colab.research.google.com/drive/1fBOzUVZ6NRKk_ugeoTbAOokWKqSN47IG?usp=sharing)

### Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
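
As an optional sanity check before running the examples, the small snippet below (not part of the original instructions) confirms which `transformers` version was installed and whether a CUDA GPU is visible, since the bfloat16 examples assume one.

```python
# Optional environment check: confirm the installed version and GPU visibility.
import torch
import transformers

print("transformers:", transformers.__version__)    # expected 4.40.0 per the install line above
print("CUDA available:", torch.cuda.is_available())  # the examples below assume a GPU is present
```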

### Python Code with Pipeline
```python
import transformers
import torch

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"

# Build a text-generation pipeline that loads the model in bfloat16 across available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline.model.eval()

# Bilingual (Korean + English) system prompt.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction}
]

# Render the chat template to a plain string so the pipeline can consume it.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop on either the standard EOS token or Llama 3's <|eot_id|> turn delimiter.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.1
)

# Print only the newly generated text (strip the echoed prompt).
print(outputs[0]["generated_text"][len(prompt):])

# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
# (The MLP Lab at Seoul National University of Science and Technology conducts multimodal natural
#  language processing research. Its members are Professor Kyungtae Lim and the students Minjun Kim,
#  Sangmin Kim, Changsu Choi, Inho Won, Hangyeol Yoo, Hyeonseok Lim, Seungwoo Song, Jeonghun Yuk,
#  and Dongjae Shin.)
```
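
If you would rather see tokens as they are produced instead of waiting for the full completion, a minimal streaming sketch on top of the pipeline example above could look like the following. It reuses `pipeline`, `prompt`, and `terminators` from that block; `TextStreamer` is the standard transformers utility for this, but treat the snippet as an illustration rather than part of the official example.

```python
from transformers import TextStreamer

# Reuses `pipeline`, `prompt`, and `terminators` from the example above.
# TextStreamer prints decoded tokens to stdout as soon as they are generated.
streamer = TextStreamer(pipeline.tokenizer, skip_prompt=True, skip_special_tokens=True)

pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    streamer=streamer,
)
```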

### Python Code with AutoModel
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'MLP-KTLim/llama-3-Korean-Bllossom-8B'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model.eval()

# Bilingual (Korean + English) system prompt.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction}
]

# Tokenize the chat-formatted conversation and move it to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the standard EOS token or Llama 3's <|eot_id|> turn delimiter.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
# (The MLP Lab at Seoul National University of Science and Technology conducts multimodal natural
#  language processing research. Its members are Professor Kyungtae Lim and the students Minjun Kim,
#  Sangmin Kim, Changsu Choi, Inho Won, Hangyeol Yoo, Hyeonseok Lim, Seungwoo Song, Jeonghun Yuk,
#  and Dongjae Shin.)
```
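
The announcement above mentions a quantized variant that runs on roughly a 4GB GPU. The examples here load the weights in bfloat16; if you want to experiment with 4-bit quantization yourself, one way you might load the model is sketched below. This is an illustration, not an official recipe: it uses the standard `BitsAndBytesConfig` API from transformers and requires the extra `bitsandbytes` package, which is not in the install line above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"

# Illustrative 4-bit (NF4) loading; requires `pip install bitsandbytes` and a CUDA GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.eval()
# From here, generation works exactly as in the AutoModel example above.
```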


## Citation
**Language Model**
```text
@misc{bllossom,
  author = {ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}}
}
```

**Vision-Language Model**
```text
@misc{bllossom-V,
  author = {Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}}
}
```

## Contact
- 임경태 (KyungTae Lim), Professor at Seoultech. `ktlim@seoultech.ac.kr`
- 함영균 (Younggyun Hahm), CEO of Teddysum. `hahmyg@teddysum.ai`
- 김한샘 (Hansaem Kim), Professor at Yonsei. `khss@yonsei.ac.kr`

## Contributors
- 최창수 (Chansu Choi), `choics2623@seoultech.ac.kr`
- 김상민 (Sangmin Kim), `sangmin9708@naver.com`
- 원인호 (Inho Won), `wih1226@seoultech.ac.kr`
- 김민준 (Minjun Kim), `mjkmain@seoultech.ac.kr`
- 송승우 (Seungwoo Song), `sswoo@seoultech.ac.kr`
- 신동재 (Dongjae Shin), `dylan1998@seoultech.ac.kr`
- 임현석 (Hyeonseok Lim), `gustjrantk@seoultech.ac.kr`
- 육정훈 (Jeonghun Yuk), `usually670@gmail.com`
- 유한결 (Hangyeol Yoo), `21102372@seoultech.ac.kr`
- 송서현 (Seohyun Song), `alexalex225225@gmail.com`