p208p2002/llama-3-zhtw-8B


p208p2002 committed on May 27

Commit 470377c (1 parent: 34acfeb)

Create README.md

Files changed (1)

README.md +51 -0

README.md ADDED

	@@ -0,0 +1,51 @@

+---
+datasets:
+- HuggingFaceFW/fineweb
+- erhwenkuo/c4-chinese-zhtw
+- erhwenkuo/wikipedia-zhtw
+- p208p2002/wudao
+- p208p2002/NDLTD-T10-90-111
+- codeparrot/github-code-clean
+language:
+- en
+- zh
+---
+# Llama 3 zhtw
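+An experiment in Chinese continued pretraining (CP) on Llama 3, trained on 800M tokens in total.
+Because the quality of the Chinese pretraining corpus still has room for improvement, performance after CP did not surpass the original Llama 3. Several Chinese Llama 3 models trained by the open-source community, which we compared against, show a similar pattern.
+For English, Llama 3 zhtw uses FineWeb, which keeps its MMLU score above the other Chinese CP models and on par with the original Llama 3.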
+## Benchmarks
+| Models                        | Size | ↑ TMMLU+ (ACC) | CMMLU (ACC)   | MMLU (ACC)    |
+| ----------------------------- | ---- | -------------- | ------------- | ------------- |
+|                               |      | TC, Knowledge  | CN, Knowledge | EN, Knowledge |
+|                               |      | 5 shot         | 5 shot        | 5 shot        |
+| Yi-6B                         | 6B   | 49.63          | 75.53         | 65.35         |
+| Qwen-7B                       | 7B   | 42.84          | 73.1          | 61.00         |
+| Meta-Llama-3-8B               | 8B   | 41.97          | 50.8          | 65.17         |
+| **p208p2002/llama-3-zhtw-8B** | 8B   | 41.84          | 50.6          | 65.31         |
+| Breeze-7B-Base-v0_1           | 7B   | 40.35          | 44.05         | 61.63         |
+| hfl/llama-3-chinese-8b        | 8B   | 39.64          | 50.9          | 61.1          |
+## Recipe
+### Datasets
+| Dataset        | Lang        | Weight |
+|----------------|-------------|--------|
+| FineWeb        | en          | 0.35   |
+| Wudao          | zh-cn       | 0.1    |
+| C4Tw           | zh-tw       | 0.1    |
+| WikiZhTw       | zh-tw       | 0.15   |
+| NdltdT10       | zh-tw       | 0.1    |
+| GitHubMarkDown | code        | 0.1    |
+| GitHubPython   | code        | 0.1    |
+### Hyperparameters
+- Learning Rate: 1e-7
+- Global Batch Size: 45
+- Sequence Length: 8192
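
The recipe above can be turned into rough numbers with a little arithmetic. The sketch below is illustrative only and rests on assumptions the README does not state: that the dataset weights are token-level sampling proportions, that "800M tokens" means exactly 800,000,000, that the global batch size counts sequences, and that every sequence is packed to the full 8192 tokens. The function names are hypothetical and not part of any training code in this repository.

```python
# Back-of-the-envelope budget for the recipe in the README.
# ASSUMPTIONS (not confirmed by the README): weights are token-level
# sampling proportions, the batch size counts sequences, and each
# sequence is fully packed to SEQUENCE_LENGTH tokens.

TOTAL_TOKENS = 800_000_000   # "800M tokens", taken as an exact figure
GLOBAL_BATCH_SIZE = 45       # sequences per optimizer step (assumed unit)
SEQUENCE_LENGTH = 8192       # tokens per sequence

# Dataset name -> (language tag, weight), copied from the Datasets table.
MIXTURE = {
    "FineWeb": ("en", 0.35),
    "Wudao": ("zh-cn", 0.10),
    "C4Tw": ("zh-tw", 0.10),
    "WikiZhTw": ("zh-tw", 0.15),
    "NdltdT10": ("zh-tw", 0.10),
    "GitHubMarkDown": ("code", 0.10),
    "GitHubPython": ("code", 0.10),
}


def tokens_per_dataset(total_tokens, mixture):
    """Split the token budget across datasets in proportion to their weights."""
    return {name: round(total_tokens * weight)
            for name, (_lang, weight) in mixture.items()}


def tokens_per_language(total_tokens, mixture):
    """Aggregate the per-dataset token budget by language tag."""
    totals = {}
    for name, tokens in tokens_per_dataset(total_tokens, mixture).items():
        lang = mixture[name][0]
        totals[lang] = totals.get(lang, 0) + tokens
    return totals


def tokens_per_step(batch_size, sequence_length):
    """Tokens consumed by one optimizer step when sequences are fully packed."""
    return batch_size * sequence_length


def optimizer_steps(total_tokens, batch_size, sequence_length):
    """Number of steps needed to see total_tokens (integer ceiling division)."""
    per_step = tokens_per_step(batch_size, sequence_length)
    return -(-total_tokens // per_step)


if __name__ == "__main__":
    for name, tokens in tokens_per_dataset(TOTAL_TOKENS, MIXTURE).items():
        print(f"{name:<15} {tokens:>12,} tokens")
    print(tokens_per_language(TOTAL_TOKENS, MIXTURE))
    print("tokens per step:", tokens_per_step(GLOBAL_BATCH_SIZE, SEQUENCE_LENGTH))
    print("optimizer steps:", optimizer_steps(TOTAL_TOKENS, GLOBAL_BATCH_SIZE, SEQUENCE_LENGTH))
```

Under these assumptions, English and Traditional Chinese each receive 35% of the budget, code 20%, and Simplified Chinese 10%, and each step consumes 45 × 8192 = 368,640 tokens. If the real run used padding instead of packing, or a different definition of batch size, the step count would differ.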