Facilitating large language model Russian adaptation with Learned Embedding Propagation
Abstract
Rapid advancements in large language model (LLM) technology have led to the introduction of powerful open-source instruction-tuned LLMs that match the text generation quality of state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, their authors do not disclose the training data necessary to replicate the results, making the achievements model-exclusive. Since these open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options, such as vocabulary extension and subsequent continued pre-training, are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM's task-solving capabilities. To address these limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements due to minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implants the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.
Community
The article proposes a method for adapting large language models to the Russian language. Adaptation in this context primarily means transferring the model to a new tokenization that is more efficient for the target language.
The main feature of the work is that the adaptation is performed on the base version of the model, and then, using the Learned Embedding Propagation methodology proposed in the article, any instruction-tuned version derived from that base can be adapted afterwards (see the sketch below). Although the work focuses on the Russian language, the approach should be applicable to any other alphabetic language (languages based on logographic scripts require additional research).
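For intuition, here is a minimal, hypothetical Python sketch of what such propagation could look like with the Hugging Face transformers API. The model names are placeholders and the simple copy-plus-delta rule is an assumption for illustration, not the exact procedure from the paper: the idea is that shared tokens keep the shift introduced by instruction tuning, while new tokens take the embeddings learned during vocabulary adaptation.

```python
# Hypothetical sketch of the embedding-propagation idea, NOT the authors' exact
# algorithm. Embeddings learned while adapting the *base* model to a new
# tokenizer are transplanted into an *instruction-tuned* variant of that base.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("org/base-model")                # placeholder names
adapted = AutoModelForCausalLM.from_pretrained("org/base-model-ru-vocab")    # base after vocabulary adaptation
instruct = AutoModelForCausalLM.from_pretrained("org/base-model-instruct")   # instruct-tuned variant of the base

old_vocab = AutoTokenizer.from_pretrained("org/base-model").get_vocab()
new_tok = AutoTokenizer.from_pretrained("org/base-model-ru-vocab")

with torch.no_grad():
    new_emb = adapted.get_input_embeddings().weight.clone()
    base_emb = base.get_input_embeddings().weight
    inst_emb = instruct.get_input_embeddings().weight

    # Tokens shared by both vocabularies additionally receive the shift that
    # instruction tuning introduced on top of the base embeddings; brand-new
    # tokens keep the embeddings learned during vocabulary adaptation as-is.
    for token, new_id in new_tok.get_vocab().items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] += inst_emb[old_id] - base_emb[old_id]

    # Swap the propagated embeddings into the instruct model (the output
    # LM head would be handled analogously if the weights are not tied).
    instruct.resize_token_embeddings(new_emb.shape[0])
    instruct.get_input_embeddings().weight.copy_(new_emb)
```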
Based on the article, a series of experiments was conducted to adapt the Qwen 2.5 series of models to the Russian language. This improved not only the efficiency of text generation in Russian (from 2.5 characters per token to 4 characters per token, i.e. roughly a 30-60% acceleration in terms of characters/words generated), but also the quality on benchmarks and arenas.
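The characters-per-token figure quoted above can be estimated directly from any tokenizer; a small illustrative snippet (the tokenizer name and sample sentence are arbitrary choices, not the ones used in the experiments):

```python
# Estimate characters per token for a given tokenizer on a Russian sample.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
sample = "Адаптация больших языковых моделей к русскому языку ускоряет генерацию текста."

n_tokens = len(tokenizer(sample, add_special_tokens=False)["input_ids"])
print(f"{len(sample) / n_tokens:.2f} characters per token")
```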
Another important property of the resulting models is that they stopped spontaneously switching to Chinese (since practically no Chinese characters remain in the new tokenization), while quality in English stayed practically unchanged.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation (2024)
- LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks (2024)
- LLMs are Also Effective Embedding Models: An In-depth Overview (2024)
- Development of Pre-Trained Transformer-based Models for the Nepali Language (2024)
- A Practical Guide to Fine-tuning Language Models with Limited Data (2024)
- BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment (2024)
- A Comparative Analysis of Instruction Fine-Tuning Large Language Models for Financial Text Classification (2024)
Models citing this paper: 9
Datasets citing this paper: 0
Spaces citing this paper: 3
Collections including this paper: 0