Update README.md
README.md CHANGED
@@ -29,6 +29,7 @@ SambaLingo-Arabic-Base is a pretrained Bi-lingual Arabic and English model that
 - **Language(s):** Arabic, English
 - **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
 - **Try the chat version of this model**: [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
+- **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
 - **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)
 
 ## Getting Started
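For orientation, a minimal sketch of loading the model with Hugging Face transformers. The repo id is assumed from the model card's name and org; the card's actual Getting Started snippet (elided from this diff) may differ:

```python
# Minimal sketch: load the base model and tokenizer with Hugging Face
# transformers. The repo id below is assumed from the model card's name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "sambanovasystems/SambaLingo-Arabic-Base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, device_map="auto", torch_dtype="auto"
)

# Generate a short continuation from an Arabic prompt.
inputs = tokenizer("مرحبا", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```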
@@ -52,19 +53,9 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
 
 ## Tokenizer Details
 We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
 
-## Evaluation
-| Perplexity (Lower Is Better) | **1.422** | 1.504 | 1.578 | 1.623 | 2.066 |
-| FLORES en->ar (8 shot, CHRF) | **0.501** | 0.493 | 0.259 | 0.415 | 0.138 |
-| FLORES ar->en (8 shot, CHRF) | **0.610** | 0.605 | 0.176 | 0.133 | 0.141 |
-| FLORES en->ar (8 shot, BLEU) | **0.169** | 0.160 | 0.011 | 0.009 | 0.003 |
-| FLORES ar->en (8 shot, BLEU) | **0.339** | 0.331 | 0.036 | 0.153 | 0.005 |
-| Belebele (3 shot) | **39.00%** | 34.40% | 29.00% | 21.89% | 23.67% |
-| SIB-200 (3 shot) | 71.57% | **76.47%** | 63.24% | 65.20% | 46.57% |
-| XNLI (0 shot) | 33.57% | **36.33%** | 33.79% | 33.37% | 33.43% |
-| XStoryCloze (0 shot) | **66.25%** | 63.34% | 58.50% | 56.19% | 51.62% |
+## Evaluation
+For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
 
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
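The vocabulary extension described in the Tokenizer Details paragraph above can be sketched with the Hugging Face transformers API. This is an illustrative outline with placeholder tokens, not SambaNova's actual training code:

```python
# Illustrative sketch of vocabulary extension: add non-overlapping
# new-language tokens to a Llama tokenizer and resize the embeddings.
# The token list is a placeholder; in practice ~25,000 tokens would be
# learned from an Arabic corpus (e.g. with a BPE trainer).
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["مرحبا", "العالم"]  # placeholder new-language tokens
# add_tokens skips strings already in the vocab, so only
# non-overlapping tokens are added; it returns how many were new.
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrices to cover the new token ids; the new
# rows are then trained during continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size now {len(tokenizer)}")
```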
@@ -107,12 +98,12 @@ We would like to give a special thanks to the following groups:
 
 ## Cite SambaLingo
 ```
-@
+@misc{csaki2024sambalingo,
+      title={SambaLingo: Teaching Large Language Models New Languages},
+      author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+      year={2024},
+      eprint={2404.05829},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
 }
 ```