myeongho-jeong committed on
Commit 05009de
1 Parent(s): d7a7019

Update README.md

Files changed (1)
  1. README.md +13 -15
README.md CHANGED
@@ -30,7 +30,14 @@ This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0]
  <p align="left">
  <img src="https://huggingface.co/yanolja/EEVE-Korean-10.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
  <p>
- Here’s a glimpse into our technical approach:

  ```python
  # number_of_old_tokens is the size of tokenizer before vocab extension. For example, in case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
@@ -47,15 +54,6 @@ for name, param in model.named_parameters():
  param.requires_grad = False
  ```

- Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:
-
- 1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
- 2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.
-
- As a result, we froze the internal layers and the first 32,000 `embed_tokens`, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
-
- For detail, please refer our technical report - [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
-
  ### Usage and Limitations

  Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
@@ -93,11 +91,11 @@ This rigorous approach ensured a comprehensive and contextually rich Korean voca
  ## Citation

  ```
- @misc{cui2023ultrafeedback,
- title={UltraFeedback: Boosting Language Models with High-quality Feedback},
- author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
- year={2023},
- eprint={2310.01377},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }
 
  <p align="left">
  <img src="https://huggingface.co/yanolja/EEVE-Korean-10.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
  <p>
+
+ To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing.
+ This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean.
+ Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training.
+ We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.
+ For details, please refer to our technical report: [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
+
+ Here’s a simplified code snippet for our key approach:

  ```python
  # number_of_old_tokens is the size of tokenizer before vocab extension. For example, in case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
 
  param.requires_grad = False
  ```
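
The diff only shows the first and last lines of that snippet, so here is a minimal, self-contained sketch of the partial-freezing idea it illustrates, not the exact elided code: a gradient hook zeroes updates for the original 32,000 embedding rows so only the newly added Korean rows are trained, the output head is left trainable (the earlier revision of this README notes that unfreezing `lm_head` for existing tokens helps), and all internal layers are frozen. The hook name `freeze_old_token_rows` and the loading details are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM

number_of_old_tokens = 32000  # tokenizer size before the Korean vocab extension

# Illustrative checkpoint; any causal LM whose tokenizer was extended works the same way.
model = AutoModelForCausalLM.from_pretrained("yanolja/EEVE-Korean-10.8B-v1.0")

def freeze_old_token_rows(grad):
    # Zero the gradient for rows of the original vocabulary so that only the
    # newly added token rows receive updates.
    grad = grad.clone()
    grad[:number_of_old_tokens] = 0
    return grad

for name, param in model.named_parameters():
    if "embed_tokens" in name:
        # Input embeddings: train only the newly added rows via the hook above.
        param.register_hook(freeze_old_token_rows)
    elif "lm_head" in name:
        # Output head: kept fully trainable in this sketch.
        pass
    else:
        # All other (internal) parameters stay frozen.
        param.requires_grad = False
```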
 
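As a rough illustration of how the seven-stage schedule described above could be wired up, the skeleton below reuses `model` from the sketch and widens the set of trainable parameters stage by stage before each round of causal language modeling pre-training. The `STAGES` patterns and the `set_trainable` helper are placeholders for this page, not the actual schedule from the technical report.

```python
# Placeholder stage schedule: each entry lists parameter-name patterns that are
# trainable during that stage. These are illustrative assumptions only.
STAGES = [
    ["embed_tokens"],             # e.g. start from the input embeddings
    ["embed_tokens", "lm_head"],  # ...then also train the output head
    [""],                         # ...and finally every parameter ("" matches all names)
]

def set_trainable(model, patterns):
    # Freeze everything except parameters whose names contain one of `patterns`.
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in patterns)

for stage, patterns in enumerate(STAGES, start=1):
    set_trainable(model, patterns)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {trainable / 1e9:.2f}B trainable parameters")
    # The causal-LM pre-training pass on the Korean/multilingual corpus would run here.
```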
  ### Usage and Limitations

  Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
 
  ## Citation

  ```
+ @misc{Kim2024Efficient,
+ title={Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models},
+ author={Seungduk Kim and Seungtaek Choi and Myeongho Jeong},
+ year={2024},
+ eprint={2402.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }