Mxode committed on
Commit 8a22474
1 Parent(s): 363f39d

Update README.md

Files changed (1)
  1. README.md +45 -39
README.md CHANGED
---
license: gpl-3.0
language:
- en
datasets:
- HuggingFaceTB/cosmopedia-100k
- pleisto/wikipedia-cn-20230720-filtered
pipeline_tag: text-generation
tags:
- text-generation-inference
---
# NanoLM-365M-base

English | [简体中文](README_zh-CN.md)

## Introduction

NanoLM-365M-base is based on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), with the original tokenizer replaced by [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to reduce the number of parameters: the total parameter count drops from 0.5B to 365M.

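The saving comes almost entirely from the smaller embedding (and tied output) matrix. A minimal sketch of the idea, assuming Qwen2-0.5B's roughly 151.9K-entry vocabulary and 896-dimensional hidden size, and assuming the 8K tokenizer loads directly from the linked repository (the exact replacement procedure used for this model is not shown in this card):

```python
# Illustrative only: swap in the smaller tokenizer and shrink the embeddings to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
new_tok = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")  # assumed to expose an ~8K vocab

# Resizing replaces the ~151,936 x 896 embedding table (~136M weights, tied with the LM head)
# with an ~8K x 896 table (~7M weights), which roughly accounts for the drop from 0.5B to 365M.
model.resize_token_embeddings(len(new_tok))

total = sum(p.numel() for p in model.parameters())
print(f"total parameters: ~{total / 1e6:.0f}M")
```
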
## Details

To recover some performance and to make fine-tuning for downstream tasks easier, I froze the backbone parameters and trained only the embedding layer after replacing the tokenizer. Training ran for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).

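A minimal sketch of this freezing setup (illustrative, not the actual training script): gradients are disabled everywhere and re-enabled only for `model.embed_tokens`, the trainable part listed in the table below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")))  # as above

# Freeze the backbone: turn off gradients everywhere, then re-enable them
# only for the input embedding table (`model.embed_tokens` inside the base model).
for param in model.parameters():
    param.requires_grad = False
for param in model.model.embed_tokens.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: ~{trainable / 1e6:.1f}M")  # well under 10M, matching the table below
```
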
| Setting | Value |
| :-------------------------: | :----------------------------------------------------------: |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | `model.embed_tokens` |
| Training Steps | 40,000 |
| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |

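For reference, these hyperparameters map onto a Hugging Face `TrainingArguments` configuration roughly as follows (an illustrative sketch, not the actual training script; the output directory and logging interval are placeholders, and the 4096 sequence length is applied during tokenization rather than here):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table above onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="nanolm-365m-base",      # placeholder
    max_steps=40_000,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
    logging_steps=100,                  # placeholder
)
```
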
The specific training records are as follows:
![result](static/results.png)
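
Since the card declares `pipeline_tag: text-generation`, the checkpoint can presumably be loaded as a causal LM with transformers. A minimal sketch, assuming the repository id is `Mxode/NanoLM-365M-base` (the prompt is arbitrary and this is a base model, so it does plain text completion rather than chat):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Mxode/NanoLM-365M-base"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The history of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```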