zolicsaki committed
Commit e7c52c6
1 Parent(s): 1ffd6f6

Update README.md

Files changed (1): README.md (+10, -19)
README.md CHANGED

@@ -29,6 +29,7 @@ SambaLingo-Arabic-Base is a pretrained Bi-lingual Arabic and English model that
  - **Language(s):** Arabic, English
  - **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
  - **Try the chat version of this model**: [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
+ - **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
  - **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)

  ## Getting Started
@@ -52,19 +53,9 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl

  ## Tokenizer Details
  We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
- ## Evaluation

- || SambaLingo-Arabic-Base | Jais-13b | bloomz-7 | xglm-7.5 | mGPT-13B |
- |------------------------------|----------|----------|----------|----------|--------|
- | Perplexity (Lower Is Better) | **1.422** | 1.504 | 1.578 | 1.623 | 2.066 |
- | FLORES en->ar (8 shot, CHRF) | **0.501** | 0.493 | 0.259 | 0.415 | 0.138 |
- | FLORES ar->en (8 shot, CHRF) | **0.610** | 0.605 | 0.176 | 0.133 | 0.141 |
- | FLORES en->ar (8 shot, BLEU) | **0.169** | 0.160 | 0.011 | 0.009 | 0.003 |
- | FLORES ar->en (8 shot, BLEU) | **0.339** | 0.331 | 0.036 | 0.153 | 0.005 |
- | Belebele (3 shot) | **39.00%** | 34.40% | 29.00% | 21.89% | 23.67% |
- | SIB-200 (3 shot) | 71.57% | **76.47%** | 63.24% | 65.20% | 46.57% |
- | XNLI (0 shot) | 33.57% | **36.33%** | 33.79% | 33.37% | 33.43% |
- | XStoryCloze (0 shot) | **66.25%** | 63.34% | 58.50% | 56.19% | 51.62% |
+ ## Evaluation
+ For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)

  ## Uses
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
@@ -107,12 +98,12 @@ We would like to give a special thanks to the following groups:

  ## Cite SambaLingo
  ```
- @software{sambalingo,
- title = {{SambaLingo: Open Source Language Experts}},
- author = {SambaNova Systems},
- url = {https://huggingface.co/sambanovasystems/SambaLingo-Arabic-Base}
- month = {2},
- year = {2024},
- version = {1.0},
+ @misc{csaki2024sambalingo,
+ title={SambaLingo: Teaching Large Language Models New Languages},
+ author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+ year={2024},
+ eprint={2404.05829},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
  }
  ```
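
As background for the "Tokenizer Details" line this diff leaves in place: extending a Llama vocabulary and resizing the embeddings is a standard `transformers` workflow. Below is a minimal sketch, assuming a hypothetical `new_arabic_tokens` list; it is illustrative only, not SambaNova's actual token-selection pipeline, which the README does not describe.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-hf"  # gated repo; requires an accepted license

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical stand-in for the tokens mined from the new language's corpus;
# the README only states that "up to 25,000 non-overlapping tokens" were added.
new_arabic_tokens = ["السلام", "عليكم"]

# add_tokens() skips strings already in the vocabulary, which is what keeps
# the added set non-overlapping with the base 32,000 tokens.
num_added = tokenizer.add_tokens(new_arabic_tokens)
print(f"vocab: 32000 -> {len(tokenizer)} (+{num_added})")

# Grow the embedding matrices to cover the new token ids; the new rows start
# randomly initialized and are learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
```

`resize_token_embeddings` leaves the original 32,000 rows untouched, so only the newly added embeddings need to be learned from scratch during continued pretraining.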