RaymondAISG committed
Commit
e41fd97
1 Parent(s): b4220f3

Update README.md

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -7,10 +7,10 @@ language:
  - vi
  license: llama3
  ---
- # LLaMA3 8B SEA-LIONv2
+ # Llama3 8B SEA-LIONv2

  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
- This is the card for the LLaMA3 8B SEA-LIONv2 base model which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
+ This is the card for the Llama3 8B SEA-LIONv2 base model which has undergone continued pre-training from the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.

  SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

@@ -19,18 +19,18 @@ SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

  ### Model Description

- The continued pre-training data for LLaMA3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.
+ The continued pre-training data for Llama3 8B SEA-LIONv2 base model encompasses approximately 48B tokens.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
  - **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- - **License:** [LLaMA3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)
+ - **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

  For tokenization, the model employs the default tokenizer used in Meta-Llama-3-8B-Instruct.

  ### Benchmark Performance
- We evaluated LLaMA3 8B SEA-LIONv2 base model on general language capabilities.
+ We evaluated Llama3 8B SEA-LIONv2 base model on general language capabilities.

  #### General Language Capabilities
  For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
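The hunk above notes that the model reuses the default Meta-Llama-3-8B-Instruct tokenizer. As a minimal sketch (assuming the `transformers` library and gated access to the upstream Meta checkpoint; the SEA-LIONv2 repo id itself is not shown here), that tokenizer can be loaded directly to see how SEA-language text is segmented:

```python
# Sketch only: load the upstream tokenizer named in this card and count tokens
# for a short Indonesian sentence. Access is gated behind the Llama 3 license.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sample = "Selamat pagi, apa kabar?"  # Indonesian, one of the card's listed languages
ids = tokenizer(sample)["input_ids"]
print(len(ids), ids)
```

This is also the tokenizer behind the token counts in the Data section below ("counted using Llama3 tokenizer").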
@@ -60,7 +60,7 @@ We also evaluated the model on English capabilities using tasks from the Open LL

  ### Data

- LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:
+ Llama3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:

  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
  |---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
@@ -81,16 +81,16 @@ LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the f
  | Wiki* - Vietnamese | 0.31 | 4 | 1.24 | 2.58 |

  Note:
- - All token counts are counted using LLaMA3 tokenizer
+ - All token counts are counted using Llama3 tokenizer
  - wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

  ### Infrastructure

- LLaMA3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
+ Llama3 8B SEA-LIONv2 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
  on the following hardware:

- | Training Details | LLaMA3 8B SEA-LIONv2 |
+ | Training Details | Llama3 8B SEA-LIONv2 |
  |----------------------|:--------------------:|
  | AWS EC2 p5d.24xlarge | 8 instances |
  | Nvidia H100 80GB GPU | 64 |
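For reading the data table: each row's Total Tokens is its Unique Tokens times the Multiplier, and the Percentage is that total over the 48B-token mix. For example, the Wiki* - Vietnamese row gives 0.31 B × 4 = 1.24 B tokens, and 1.24 / 48 ≈ 2.58%.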
@@ -99,7 +99,7 @@ on the following hardware:

  ### Configuration

- | HyperParameter | LLaMA3 8B SEA-LIONv2 |
+ | HyperParameter | Llama3 8B SEA-LIONv2 |
  |-------------------|:--------------------:|
  | Precision | bfloat16 |
  | Optimizer | decoupled_adamw |
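To make the Infrastructure and Configuration rows concrete, here is an illustrative Composer-style sketch of continued pre-training with bfloat16 precision and decoupled AdamW. It is not AI Singapore's actual training script: the dataset, batch size, learning rate, and duration below are placeholders, and real runs stream the 48B-token corpus described under Data across the 64 H100 GPUs listed above.

```python
# Illustrative sketch only (placeholders, not the real SEA-LIONv2 recipe):
# continued pre-training of the Llama 3 8B Instruct checkpoint with
# MosaicML Composer, bfloat16 precision, and decoupled AdamW.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import HuggingFaceModel
from composer.optim import DecoupledAdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # starting checkpoint named in this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

# Placeholder data: random token ids standing in for the 48B-token corpus.
ids = torch.randint(0, tokenizer.vocab_size, (8, 512))

def collate(batch):
    input_ids = torch.stack([row[0] for row in batch])
    return {"input_ids": input_ids, "labels": input_ids}  # causal LM loss

train_loader = DataLoader(TensorDataset(ids), batch_size=2, collate_fn=collate)

optimizer = DecoupledAdamW(model.parameters(), lr=1e-5)  # placeholder LR

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    optimizers=optimizer,
    max_duration="10ba",   # placeholder; not the real token budget
    precision="amp_bf16",  # bfloat16, per the Configuration table
)
trainer.fit()
```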
 