---
license: mit
language:
- en
- zh
- id
- ms
- th
- vi
- tl
- ta
- my
- km
- lo
inference: false
---

# SEA-LION-BERT

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

This is the card for the SEA-LION-BERT base model.

## How To Use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# trust_remote_code=True lets transformers load the custom model and tokenizer code from the model repository
tokenizer = AutoTokenizer.from_pretrained('aisingapore/sealion-bert-base', trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained('aisingapore/sealion-bert-base', trust_remote_code=True)

# prepare input; <|mask|> marks the position to predict
text = "Give me a <|mask|>!!!"
encoded_input = tokenizer(text, return_tensors='pt')
```

## Model Details

### Model Description

The SEA-LION-BERT model is built on the MosaicBERT architecture and has a vocabulary size of 256K.

For tokenization, the model uses our custom SEABPETokenizer, which is tailored to Southeast Asian (SEA) languages with the aim of improving model performance on them.

SEA-LION-BERT was trained on 790B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Encoder
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** MIT License

## Training Details

### Data

SEA-LION-BERT was trained on 790B tokens of the following data:

| Data Source               | Tokens | Percentage |
|---------------------------|-------:|:----------:|
| RefinedWeb - English      | 571.3B |   72.26%   |
| mC4 - Chinese             |  91.2B |   11.54%   |
| mC4 - Indonesian          |  14.7B |    1.86%   |
| mC4 - Malay               |   2.9B |    0.36%   |
| mC4 - Filipino            |   5.3B |    0.67%   |
| mC4 - Burmese             |   4.9B |    0.61%   |
| mC4 - Vietnamese          |  63.4B |    8.02%   |
| mC4 - Thai                |  21.6B |    2.74%   |
| mC4 - Lao                 |   1.1B |    0.14%   |
| mC4 - Khmer               |   3.9B |    0.50%   |
| mC4 - Tamil               |  10.2B |    1.29%   |
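
As a rough cross-check of the table, the sketch below recomputes each source's share from the rounded token counts. The dictionary simply transcribes the table, and the variable names are illustrative. Because the counts are rounded to 0.1B, the recomputed shares can differ from the listed percentages in the second decimal place.

```python
# Token counts in billions, transcribed from the table above.
token_counts_b = {
    "RefinedWeb - English": 571.3,
    "mC4 - Chinese": 91.2,
    "mC4 - Indonesian": 14.7,
    "mC4 - Malay": 2.9,
    "mC4 - Filipino": 5.3,
    "mC4 - Burmese": 4.9,
    "mC4 - Vietnamese": 63.4,
    "mC4 - Thai": 21.6,
    "mC4 - Lao": 1.1,
    "mC4 - Khmer": 3.9,
    "mC4 - Tamil": 10.2,
}

# Total corpus size and each source's share in percent.
total_b = sum(token_counts_b.values())
shares = {source: 100 * count / total_b for source, count in token_counts_b.items()}

# Combined share of the nine languages other than English and Chinese.
other_share = 100 - shares["RefinedWeb - English"] - shares["mC4 - Chinese"]

for source, share in shares.items():
    print(f"{source:<22} {share:6.2f}%")
print(f"Total: {total_b:.1f}B tokens; remaining nine languages: {other_share:.2f}%")
```

The rounded counts sum to about 790.5B tokens, and the nine languages other than English and Chinese together make up roughly 16% of the corpus.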

### Infrastructure

SEA-LION-BERT was trained using [MosaicML Composer](https://github.com/mosaicml/composer) on the following hardware:

| Training Details     | SEA-LION-BERT |
|----------------------|:-------------:|
| Nvidia A100 40GB GPU |       4       |
| Training Duration    |    14 days    |

### Configuration

| Hyperparameter    | SEA-LION-BERT            |
|-------------------|:------------------------:|
| Precision         | bfloat16                 |
| Optimizer         | decoupled_adamw          |
| Scheduler         | linear_decay_with_warmup |
| Learning Rate     | 5e-4                     |
| Global Batch Size | 448                      |
| Micro Batch Size  | 56                       |
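
The two batch sizes can be related with a short calculation. This is a sketch under assumptions the card does not state: that the micro batch size is per GPU, that all four A100s listed under Infrastructure were used for data-parallel training, and that no sequence exceeds the 128-token sequence length listed under Model Architecture.

```python
# Values taken from the tables in this card.
global_batch_size = 448
micro_batch_size = 56   # assumption: this is the per-GPU micro batch size
num_gpus = 4            # from the Infrastructure table
max_seq_len = 128       # from the Model Architecture table

# Sequences each GPU handles per optimizer step under data parallelism.
per_gpu_batch = global_batch_size // num_gpus

# Micro-batches each GPU accumulates before one optimizer step.
grad_accum_steps = per_gpu_batch // micro_batch_size

# Upper bound on tokens per optimizer step (reached only if every sequence is full length).
max_tokens_per_step = global_batch_size * max_seq_len

print(per_gpu_batch, grad_accum_steps, max_tokens_per_step)
```

Under these assumptions, each GPU would process two micro-batches of 56 sequences per optimizer step.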

## Technical Specifications

### Model Architecture and Objective

SEA-LION-BERT is an encoder model using the MosaicBERT architecture.

| Parameter       | SEA-LION-BERT |
|-----------------|:-------------:|
| Layers          |      12       |
| d_model         |      768      |
| Attention heads |      12       |
| Vocabulary      |    256000     |
| Sequence Length |      128      |
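
Two derived quantities follow from the table. The sketch assumes that the 12 in the table counts attention heads and that the token embedding is a single vocabulary-by-`d_model` matrix. Neither is spelled out in the card, and the variable names are illustrative.

```python
d_model = 768
num_heads = 12          # assumption: the 12 in the table counts attention heads
vocab_size = 256_000

# Width of each attention head when d_model is split evenly across heads.
head_width = d_model // num_heads

# Parameters in a vocab_size x d_model token embedding matrix.
embedding_params = vocab_size * d_model

print(head_width)        # 64
print(embedding_params)  # 196608000
```

With a 256K vocabulary, the token embedding alone would hold roughly 197M parameters under this assumption.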

### Tokenizer Details

We sampled 20M lines from the training data to train the tokenizer.<br>
The tokenizer was trained with [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).
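
The card does not say how the 20M lines were drawn. One common way to take a fixed-size uniform sample from a corpus too large to hold in memory is reservoir sampling. The sketch below illustrates that technique and is not necessarily the method used here; the function name and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot in the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy usage: sample 20 lines from a stream of 1000 synthetic lines.
sample = reservoir_sample((f"line {n}" for n in range(1000)), k=20, seed=1)
print(len(sample))
```

In a real pipeline the sampled lines could be written to a text file and passed to SentencePiece's trainer with `model_type="bpe"` and `vocab_size=256000`.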

## The Team

Montalan Jann Railey<br>
Nguyen Thanh Ngan<br>
Rengarajan Hamsawardhini<br>
Teo Eng Sipp Leslie<br>
Tjhi William<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).