|
--- |
|
language: |
|
- en |
|
- id |
|
- ta |
|
- th |
|
- vi |
|
license: llama3 |
|
--- |
|
# LLaMA3 8B SEA-LIONv2 |
|
|
|
SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
|
This model is continued pre-trained from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. |
|
This is the card for the LLaMA3 8B SEA-LIONv2 base model. |
|
|
|
SEA-LION stands for <i>Southeast Asian Languages In One Network</i>. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The LLaMA3 8B SEA-LIONv2 model is a significant leap forward in the field of Natural Language Processing,
|
specifically trained to understand the SEA regional context. |
|
|
|
For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct. |
|
|
|
The continued pre-training data for the LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.
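For convenience, here is a minimal, illustrative sketch of loading the base model and its Llama-3 tokenizer with the Hugging Face `transformers` library. The model ID follows the one used in the benchmark table below; the dtype, device and generation settings are examples only.

```python
# Minimal sketch: load the base model and its Llama-3 tokenizer with Hugging Face transformers.
# The model ID matches the one in the benchmark table; dtype/device settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-cpt-sealionv2-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; spreads layers across available GPUs
)

# This is the base (non-instruction-aligned) checkpoint, so prompt it with plain text.
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```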
|
|
|
- **Developed by:** Products Pillar, AI Singapore |
|
- **Funded by:** Singapore NRF |
|
- **Model type:** Decoder |
|
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil |
|
- **License:** [LLaMA3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE) |
|
|
|
### Benchmark Performance |
|
We evaluated the LLaMA3 8B SEA-LIONv2 base model on general language capabilities.
|
|
|
#### General Language Capabilities |
|
For the evaluation of general language capabilities, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks. |
|
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI). |
|
|
|
The evaluation was done **five-shot** with native prompts, using only a sample of 100-1000 instances per dataset, following the setting described in the paper.
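For illustration only, the sketch below shows one generic way a five-shot prompt can be assembled from labelled examples. The helper, template and example data are hypothetical and are not taken from the BHASA harness.

```python
# Hypothetical sketch of assembling a five-shot prompt for a classification task
# (e.g. sentiment analysis); the template and examples are illustrative, not BHASA's own.
def build_few_shot_prompt(demonstrations, query, n_shots=5):
    """Prepend n_shots labelled demonstrations to the test query."""
    blocks = [
        f"Teks: {text}\nSentimen: {label}"
        for text, label in demonstrations[:n_shots]
    ]
    blocks.append(f"Teks: {query}\nSentimen:")
    return "\n\n".join(blocks)

demos = [
    ("Makanannya enak sekali.", "Positif"),
    ("Pelayanannya sangat lambat.", "Negatif"),
    ("Harganya wajar untuk kualitasnya.", "Positif"),
    ("Saya kecewa dengan produk ini.", "Negatif"),
    ("Pengirimannya cepat dan rapi.", "Positif"),
]
prompt = build_few_shot_prompt(demos, "Tempatnya nyaman dan bersih.")
print(prompt)
```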
|
|
|
**BHASA** |
|
|
|
|
|
|
|
**English** |
|
|
|
| Model                                    | ARC   | BBH   | HellaSwag | MMLU  | GSM8k | Average |
| ---------------------------------------- | ----- | ----- | --------- | ----- | ----- | ------- |
| aisingapore/llama3-8b-cpt-sealionv2-base | 58.87 | 47.70 | 81.14     | 63.11 | 50.49 | 60.26   |
| google/gemma-2-9b                        | 68.00 | 53.53 | 82.73     | 70.26 | 63.53 | 67.61   |
| meta-llama/Meta-Llama-3-8B               | 57.85 | 46.09 | 81.89     | 65.10 | 45.34 | 59.25   |
| Qwen/Qwen2-7B                            | 61.86 | 53.10 | 80.63     | 70.45 | 78.09 | 68.83   |
| Sail/Sailor-7B                           | 50.34 | 35.65 | 76.11     | 52.80 | 33.81 | 49.74   |
| mistralai/Mistral-7B-v0.3                | 59.56 | 44.89 | 82.97     | 62.36 | 33.36 | 56.63   |
|
|
|
|
|
## Training Details |
|
|
|
### Data |
|
|
|
The LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:
|
|
|
| Data Source                | Unique Tokens | Multiplier | Total Tokens | Percentage |
|----------------------------|:-------------:|:----------:|:------------:|:----------:|
| Dolma RefinedWeb - English | 7.650B        | 1          | 7.650B       | 15.90%     |
| Dolma C4 - English         | 1.160B        | 1          | 1.160B       | 2.42%      |
| Dolma Reddit - English     | 1.339B        | 1          | 1.339B       | 2.79%      |
| Dolma Semantic Scholar     | 0.959B        | 1          | 0.959B       | 1.99%      |
| Dolma arXiv                | 0.469B        | 1          | 0.469B       | 0.98%      |
| Dolma StarCoder            | 4.422B        | 1          | 4.422B       | 9.21%      |
| SEA-LION Pile - Indonesian | 3.4B          | 2          | 6.8B         | 14.17%     |
| Wiki* - Indonesian         | 0.3B          | 4          | 1.2B         | 2.50%      |
| SEA-LION Pile - Tamil      | 5.6B          | 1          | 5.6B         | 11.67%     |
| Wiki* + News - Tamil       | 0.6B          | 4          | 2.4B         | 5.00%      |
| SEA-LION Pile - Thai       | 2.28B         | 1          | 2.28B        | 4.75%      |
| WangChanBERTa - Thai       | 5B            | 1          | 5B           | 10.42%     |
| Wiki* - Thai               | 0.18B         | 4          | 0.72B        | 1.50%      |
| SEA-LION Pile - Vietnamese | 6.76B         | 1          | 6.76B        | 14.08%     |
| Wiki* - Vietnamese         | 0.31B         | 4          | 1.24B        | 2.58%      |
|
|
|
Note: |
|
- All token counts are computed using the LLaMA3 tokenizer.

- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage.

- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/).
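As a quick sanity check on the mixture above, each source's total token count is its unique token count multiplied by its multiplier, and its percentage is that total over the roughly 48B-token budget. The short sketch below re-types a few rows by hand purely to illustrate the arithmetic.

```python
# Illustrative arithmetic for the data-mixture table: total = unique * multiplier,
# percentage = total / overall budget (~48B tokens). Rows are re-typed here for the example.
BUDGET_B = 48.0  # approximate continued pre-training budget, in billions of tokens

rows = {  # source: (unique tokens in B, multiplier)
    "Wiki* - Indonesian": (0.3, 4),
    "SEA-LION Pile - Indonesian": (3.4, 2),
    "WangChanBERTa - Thai": (5.0, 1),
}

for source, (unique_b, multiplier) in rows.items():
    total_b = unique_b * multiplier
    print(f"{source}: {total_b:.2f}B tokens ({100 * total_b / BUDGET_B:.2f}%)")
# Wiki* - Indonesian: 1.20B tokens (2.50%)
# SEA-LION Pile - Indonesian: 6.80B tokens (14.17%)
# WangChanBERTa - Thai: 5.00B tokens (10.42%)
```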
|
|
|
### Infrastructure |
|
|
|
LLaMA3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer) |
|
on the following hardware: |
|
|
|
| Training Details     | LLaMA3 8B SEA-LIONv2 |
|----------------------|:--------------------:|
| AWS EC2 p5d.24xlarge | 8 instances          |
| Nvidia H100 80GB GPU | 64                   |
| Training Duration    | 2 days               |
|
|
|
|
|
### Configuration |
|
|
|
| HyperParameter    | LLaMA3 8B SEA-LIONv2 |
|-------------------|:--------------------:|
| Precision         | bfloat16             |
| Optimizer         | decoupled_adamw      |
| Scheduler         | weight_stable_decay  |
| Learning Rate     | 1.0e-5               |
| Global Batch Size | 512                  |
| Micro Batch Size  | 2                    |
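One way to read these numbers together with the hardware table: with a micro batch size of 2 per GPU and 64 GPUs, a global batch size of 512 implies 4 gradient-accumulation steps per optimizer update. The sketch below is not the actual MosaicML Composer configuration; it is a plain-PyTorch illustration of how the optimizer, precision and accumulation settings fit together, using a tiny stand-in model so it runs end to end.

```python
# Plain-PyTorch sketch (not the actual MosaicML Composer setup) of how the listed hyperparameters
# relate: micro batch 2 x 64 GPUs x 4 accumulation steps = global batch 512.
import torch

GLOBAL_BATCH, MICRO_BATCH, NUM_GPUS = 512, 2, 64
GRAD_ACCUM_STEPS = GLOBAL_BATCH // (MICRO_BATCH * NUM_GPUS)
assert GRAD_ACCUM_STEPS == 4

# Tiny stand-in model so the sketch is runnable; the real model is the 8B LLaMA3 checkpoint.
model = torch.nn.Linear(16, 16)
# torch.optim.AdamW applies decoupled weight decay, in the spirit of "decoupled_adamw".
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0e-5)

# One optimizer update on a single device: accumulate gradients over GRAD_ACCUM_STEPS micro batches.
for step in range(GRAD_ACCUM_STEPS):
    x = torch.randn(MICRO_BATCH, 16)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # bfloat16 compute, as in training
        loss = model(x).pow(2).mean() / GRAD_ACCUM_STEPS  # scale loss across accumulation steps
    loss.backward()

optimizer.step()       # this update reflects MICRO_BATCH * GRAD_ACCUM_STEPS samples per device
optimizer.zero_grad()
```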
|
|
|
|
|
## The Team |
|
|
|
Brandon Ong<br> |
|
Bryan Siow<br> |
|
Esther Choa<br> |
|
Huang Yuli<br> |
|
Lee Chwan Ren<br> |
|
Leong Wai Yi<br> |
|
Leong Wei Qi<br> |
|
Li Yier<br> |
|
Liu Bing Jie Darius<br> |
|
Lovenia Holy<br> |
|
Montalan Jann Railey<br> |
|
Ng Boon Cheong Raymond<br> |
|
Ngui Jian Gang<br> |
|
Nguyen Thanh Ngan<br> |
|
Nicholas Cheng<br> |
|
Ong Tat-Wee David<br> |
|
Ong Zhi Hao<br> |
|
Rengarajan Hamsawardhini<br> |
|
Susanto Yosephine<br> |
|
Tai Ngee Chia<br> |
|
Tan Choon Meng<br> |
|
Teo Jin Howe<br> |
|
Teo Eng Sipp Leslie<br> |
|
Teo Wei Yi<br> |
|
Tjhi William<br> |
|
Walter Teng<br> |
|
Wayne Lau<br> |
|
Yeo Yeow Tong<br> |
|
Yong Xianbin<br> |
|
|
|
|
|
## Acknowledgements |
|
|
|
AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. |
|
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
|
|
|
|
|
## Contact |
|
|
|
For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).
|
|
|
[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion) |
|
|
|
|
|
## Disclaimer |
|
|
|
This is the repository for the base model.
|
The model has _not_ been aligned for safety. |
|
Developers and users should perform their own safety fine-tuning and related security measures. |
|
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and code.
|
|
|
|
|
## References |
|
|
|
```bibtex |
|
@misc{lowphansirikul2021wangchanberta, |
|
title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, |
|
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong}, |
|
year={2021}, |
|
eprint={2101.09635}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |