myeongho-jeong committed on
Commit 05009de
1 Parent(s): d7a7019

Update README.md

Files changed (1)
  1. README.md +13 -15
README.md CHANGED
@@ -30,7 +30,14 @@ This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0]
  <p align="left">
  <img src="https://huggingface.co/yanolja/EEVE-Korean-10.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
  <p>
- Here’s a glimpse into our technical approach:

  ```python
  # number_of_old_tokens is the size of tokenizer before vocab extension. For example, in case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
@@ -47,15 +54,6 @@ for name, param in model.named_parameters():
  param.requires_grad = False
  ```

- Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:
-
- 1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
- 2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.
-
- As a result, we froze the internal layers and the first 32,000 `embed_tokens`, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
-
- For detail, please refer our technical report - [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
-
  ### Usage and Limitations

  Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
@@ -93,11 +91,11 @@ This rigorous approach ensured a comprehensive and contextually rich Korean voca
  ## Citation

  ```
- @misc{cui2023ultrafeedback,
- title={UltraFeedback: Boosting Language Models with High-quality Feedback},
- author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
- year={2023},
- eprint={2310.01377},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }
 
  <p align="left">
  <img src="https://huggingface.co/yanolja/EEVE-Korean-10.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
  <p>
+
+ To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing.
+ This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean.
+ Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training.
+ We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.
+ For details, please refer to our technical report: [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
+
+ Here’s a simplified code snippet for our key approach:

  ```python
  # number_of_old_tokens is the size of tokenizer before vocab extension. For example, in case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
 
  param.requires_grad = False
  ```
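
The diff only shows the first and last lines of that snippet, so here is a minimal, self-contained sketch of the partial-freezing idea it illustrates, not the exact elided code: a gradient hook zeroes updates for the original 32,000 embedding rows so only the newly added Korean rows are trained, the output head is left trainable (the earlier revision of this README notes that unfreezing `lm_head` for existing tokens helps), and all internal layers are frozen. The hook name `freeze_old_token_rows` and the loading details are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM

number_of_old_tokens = 32000  # tokenizer size before the Korean vocab extension

# Illustrative checkpoint; any causal LM whose tokenizer was extended works the same way.
model = AutoModelForCausalLM.from_pretrained("yanolja/EEVE-Korean-10.8B-v1.0")

def freeze_old_token_rows(grad):
    # Zero the gradient for rows of the original vocabulary so that only the
    # newly added token rows receive updates.
    grad = grad.clone()
    grad[:number_of_old_tokens] = 0
    return grad

for name, param in model.named_parameters():
    if "embed_tokens" in name:
        # Input embeddings: train only the newly added rows via the hook above.
        param.register_hook(freeze_old_token_rows)
    elif "lm_head" in name:
        # Output head: kept fully trainable in this sketch.
        pass
    else:
        # All other (internal) parameters stay frozen.
        param.requires_grad = False
```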
 
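As a rough illustration of how the seven-stage schedule described above could be wired up, the skeleton below reuses `model` from the sketch and widens the set of trainable parameters stage by stage before each round of causal language modeling pre-training. The `STAGES` patterns and the `set_trainable` helper are placeholders for this page, not the actual schedule from the technical report.

```python
# Placeholder stage schedule: each entry lists parameter-name patterns that are
# trainable during that stage. These are illustrative assumptions only.
STAGES = [
    ["embed_tokens"],             # e.g. start from the input embeddings
    ["embed_tokens", "lm_head"],  # ...then also train the output head
    [""],                         # ...and finally every parameter ("" matches all names)
]

def set_trainable(model, patterns):
    # Freeze everything except parameters whose names contain one of `patterns`.
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in patterns)

for stage, patterns in enumerate(STAGES, start=1):
    set_trainable(model, patterns)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {trainable / 1e9:.2f}B trainable parameters")
    # The causal-LM pre-training pass on the Korean/multilingual corpus would run here.
```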
  ### Usage and Limitations

  Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
 
  ## Citation

  ```
+ @misc{Kim2024Efficient,
+ title={Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models},
+ author={Seungduk Kim and Seungtaek Choi and Myeongho Jeong},
+ year={2024},
+ eprint={2402.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }