File size: 8,099 Bytes
ee0c097 018f4e8 3757c00 018f4e8 ee0c097 018f4e8 2d9c1d0 018f4e8 2d9c1d0 b868179 a9ade51 018f4e8 aafa59f 018f4e8 9542349 2d9c1d0 9542349 018f4e8 541053d 2d9c1d0 9542349 018f4e8 b662482 018f4e8 b868179 018f4e8 fb39f32 018f4e8 b868179 018f4e8 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 |
---
license: llama2
datasets:
- HuggingFaceH4/ultrachat_200k
- HuggingFaceH4/ultrafeedback_binarized
- HuggingFaceH4/cai-conversation-harmless
language:
- bg
- en
---
# SambaLingo-Bulgarian-Chat
<img src="SambaLingo_Logo.png" width="340" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
<!-- Provide a quick summary of what the model is/does. -->
SambaLingo-Bulgarian-Chat is a human aligned chat model trained in Bulgarian and English. It is trained using direct preference optimization on top the base model [SambaLingo-Bulgarian-Base](https://huggingface.co/sambanovasystems/SambaLingo-Bulgarian-Base). The base model adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Bulgarian by training on 63 billion tokens from the Bulgarian split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. Try this model at [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
## Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** [SambaNova Systems](https://sambanova.ai/)
- **Model type:** Language Model
- **Language(s):** Bulgarian, English
- **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Try this model:** [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space)
- **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
- **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)
## Getting Started
### Loading Model With Hugging Face
Please make sure to set use_fast=False when loading the tokenizer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Bulgarian-Chat", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Bulgarian-Chat", device_map="auto", torch_dtype="auto")
```
### Interacting With Model Pipeline
Please make sure to set use_fast=False when loading the tokenizer.
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="sambanovasystems/SambaLingo-Bulgarian-Chat", device_map="auto", use_fast=False)
messages = [
{"role": "user", "content": {YOUR_QUESTION}},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)[0]
outputs = outputs["generated_text"]
```
### Suggested Inference Parameters
- Temperature: 0.8
- Repetition penalty: 1.0
- Top-p: 0.9
### Prompting Guidelines
To prompt this model, please use the following chat template:
```
<|user|>\n{question}</s>\n<|assistant|>\n
```
### Example Prompts and Generations
```
<|user|>
Какъв извод можеш да извадиш за чувството, което изпитват героите от следната кратка история:
Мария чакаше с трепет да дойде 27-ми. Още само три дни и корабът щеше да акостира на пристанището. След 6 дълги месеца тя най-после щеше да види Иван. Когато денят най-после дойде, тя отиде на пристанището заедно с десетки други жени. Моряците един по един започнаха да слизат. Най-накрая тя съзря погледа на Иван, и двамата се прегърнаха силно.</s>
<|assistant|>
Мария изпитва силно чувство на очакване и нетърпение да види Иван след шестмесечна раздяла. Тя е развълнувана и щастлива да го види отново и прегръдката им показва, че и двамата са били много развълнувани да се съберат отново.
```
## Training Details
The alignment phase follows the recipe for [Zephyr-7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), and comprises two stages: supervised fine-tuning (SFT) and Direct Performance Optimization (DPO).
The SFT phase was done on the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset mixed with the Google translated version of the ultrachat_200k dataset. It was trained for one epoch with global batch size 512 and max sequence length 2048 tokens. We used a linear decay learning rate of 2e-5 and 10% warmup.
The DPO phase was done on the [ultrafeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset and [cai-conversation-harmless](https://huggingface.co/datasets/HuggingFaceH4/cai-conversation-harmless) dataset, mixed with 10% of the data Google translated. It was trained with global batch size 32 and for three epochs. We used a linear decay learning rate of 5e-7, 10% warmup and β=0.1 as the regularization factor for DPO.
## Tokenizer Details
We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
## Evaluation
For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Use of this model is governed by the Meta’s [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/). Please review and accept the license before downloading the model weights.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
SambaLingo should NOT be used for:
- Mission-critical applications
- Applications that involve the safety of others
- Making highly important decisions
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Like all LLMs, SambaLingo has certain limitations:
- Hallucination: Model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
- Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
- Repetition: The Model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
- Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
- Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.
## Acknowledgments
We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been possible without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.
We would like to give a special thanks to the following groups:
- Meta for open sourcing LLama 2 and open sourcing FLORES-200 dataset
- Nguyen et al for open sourcing CulturaX dataset
- CohereAI for releasing AYA-101 and open sourcing a multilingual instruction tuning dataset
- EleutherAI for their open source evaluation framework
- Hugging Face-H4 team for open source the zephyr training recipe and alignment handbook repo
## Cite SambaLingo
```
@misc{csaki2024sambalingo,
title={SambaLingo: Teaching Large Language Models New Languages},
author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
year={2024},
eprint={2404.05829},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |