uer committed on
Commit
ffee30c
1 Parent(s): 8b7ad07

Update README.md

Files changed (1)
  1. README.md +109 -19
README.md CHANGED
@@ -8,20 +8,32 @@ widget:
8
  ---
9
 
10
 
11
- # Chinese GPT2-medium Model
12
 
13
  ## Model description
14
 
15
- The model is used to generate Chinese texts. You can download the model either from the [GPT2-Chinese Github page](https://github.com/Morizeyao/GPT2-Chinese), or via HuggingFace from the link [gpt2-medium-chinese-cluecorpussmall](https://huggingface.co/uer/gpt2-medium-chinese-cluecorpussmall).
16
 
17
  ## How to use
18
 
19
- You can use the model directly with a pipeline for text generation:
20
 
21
  ```python
22
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
23
- >>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-medium-chinese-cluecorpussmall")
24
- >>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-medium-chinese-cluecorpussmall")
25
  >>> text_generator = TextGenerationPipeline(model, tokenizer)
26
  >>> text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
27
  [{'generated_text': '这是很久之前的事情了 。 我 现 在 想 起 来 就 让 自 己 很 伤 心 , 很 失 望 。 我 现 在 想 到 , 我 觉 得 大 多 数 人 的 生 活 比 我 的 生 命 还 要 重 要 , 对 一 些 事 情 的 看 法 , 对 一 些 人 的 看 法 , 都 是 在 发 泄 。 但 是 , 我 们 的 生 活 是 需 要 一 个 信 用 体 系 的 。 我 不 知'}]
@@ -33,7 +45,9 @@ You can use the model directly with a pipeline for text generation:
33
 
34
  ## Training procedure
35
 
36
- The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.
 
 
37
 
38
  Stage1:
39
 
@@ -44,14 +58,71 @@ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
44
  --seq_length 128 --processes_num 32 --data_processor lm
45
  ```
46
 
47
  ```
48
  deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
49
  --dataset_path corpora/cluecorpussmall_lm_seq128_dataset.pt \
50
  --vocab_path models/google_zh_vocab.txt \
51
- --config_path models/gpt2/medium_config.json \
52
- --output_model_path models/cluecorpussmall_gpt2_medium_seq128.bin \
53
  --world_size 8 --batch_size 64 \
54
- --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000
55
  ```
56
 
57
  Stage2:
@@ -60,28 +131,34 @@ Stage2:
60
  python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
61
  --vocab_path models/google_zh_vocab.txt \
62
  --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
63
- --seq_length 1024 --processes_num 32 --data_processor lm
64
  ```
65
 
66
  ```
67
  deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
68
  --dataset_path corpora/cluecorpussmall_lm_seq1024_dataset.pt \
69
  --vocab_path models/google_zh_vocab.txt \
70
- --config_path models/gpt2/medium_config.json \
71
- --pretrained_model_path models/cluecorpussmall_gpt2_medium_seq128_pt.bin \
72
- --output_model_path models/cluecorpussmall_gpt2_medium_seq1024_stage2 \
73
  --world_size 8 --batch_size 16 --learning_rate 5e-5 \
74
  --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
75
- --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 12
76
  ```
77
 
78
  Finally, we convert the pre-trained model into Huggingface's format:
79
 
80
  ```
81
- python3 models/cluecorpussmall_gpt2_medium_seq1024_stage2/zero_to_fp32.py models/cluecorpussmall_gpt2_medium_seq1024_stage2 cluecorpussmall_gpt2_medium_seq1024_model.bin
82
- python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_medium_seq1024_model.bin \
83
- --output_model_path pytorch_model.bin \
84
- --layers_num 24
85
  ```
86
 
87
  ### BibTeX entry and citation info
@@ -100,4 +177,17 @@ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluec
100
  pages={241},
101
  year={2019}
102
  }
103
- ```
 
8
  ---
9
 
10
 
11
+ # Chinese GPT2-distil Model
12
 
13
  ## Model description
14
 
15
+ All of the GPT2 models except GPT2-xlarge are pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658). The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), introduced in [this paper](https://arxiv.org/abs/2212.06385). TencentPretrain inherits UER-py, supports models with more than one billion parameters, and extends it to a multimodal pre-training framework. The other models can also be pre-trained by TencentPretrain.
16
+
17
+ The models are used to generate Chinese text. You can download the set of Chinese GPT2 models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via HuggingFace from the links below:
18
+
19
+ | | Link |
20
+ | ----------------- | :----------------------------: |
21
+ | **GPT2-distil** | [**L=6/H=768**][distil] |
22
+ | **GPT2** | [**L=12/H=768**][base] |
23
+ | **GPT2-medium** | [**L=24/H=1024**][medium] |
24
+ | **GPT2-large** | [**L=36/H=1280**][large] |
25
+ | **GPT2-xlarge** | [**L=48/H=1600**][xlarge] |
26
+
27
+ Note that the 6-layer model is called GPT2-distil only because it follows the configuration of [distilgpt2](https://huggingface.co/distilgpt2); its pre-training does not involve supervision from larger models.
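
As a rough sanity check on the sizes in the table, the sketch below estimates parameter counts from the `L` (layers) and `H` (hidden size) values using the standard GPT-2 block layout with tied input/output embeddings. The function name, the vocabulary size of 21128 (assumed for `google_zh_vocab.txt`), and the 1024 position embeddings are assumptions that are not stated in this README, so treat the numbers as estimates rather than official figures.

```python
# Assumption: standard GPT-2 block layout with tied input/output embeddings.
# Assumption: vocab_size=21128 and max_positions=1024 are NOT stated in the README.

def gpt2_param_count(layers: int, hidden: int,
                     vocab_size: int = 21128,
                     max_positions: int = 1024) -> int:
    """Estimate the number of parameters of a GPT-2 style decoder."""
    # Fused QKV projection (3H^2 + 3H) plus attention output projection (H^2 + H).
    attention = 4 * hidden * hidden + 4 * hidden
    # Feed-forward: H -> 4H (4H^2 + 4H) and 4H -> H (4H^2 + H).
    mlp = 8 * hidden * hidden + 5 * hidden
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 4 * hidden
    per_layer = attention + mlp + layer_norms
    # Token and position embeddings, plus the final LayerNorm.
    embeddings = (vocab_size + max_positions) * hidden
    final_layer_norm = 2 * hidden
    return layers * per_layer + embeddings + final_layer_norm


# (layers, hidden size) pairs taken from the table above.
CONFIGS = {
    "GPT2-distil": (6, 768),
    "GPT2": (12, 768),
    "GPT2-medium": (24, 1024),
    "GPT2-large": (36, 1280),
    "GPT2-xlarge": (48, 1600),
}

if __name__ == "__main__":
    for name, (layers, hidden) in CONFIGS.items():
        millions = gpt2_param_count(layers, hidden) / 1e6
        print(f"{name}: about {millions:.0f}M parameters")
```

Under these assumptions only GPT2-xlarge comes out above one billion parameters, which is consistent with it being the model trained with TencentPretrain.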
28
 
29
  ## How to use
30
 
31
+ You can use the model directly with a pipeline for text generation (using GPT2-distil as an example):
32
 
33
  ```python
34
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
35
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
36
+ >>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
37
  >>> text_generator = TextGenerationPipeline(model, tokenizer)
38
  >>> text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
39
  [{'generated_text': '这是很久之前的事情了 。 我 现 在 想 起 来 就 让 自 己 很 伤 心 , 很 失 望 。 我 现 在 想 到 , 我 觉 得 大 多 数 人 的 生 活 比 我 的 生 命 还 要 重 要 , 对 一 些 事 情 的 看 法 , 对 一 些 人 的 看 法 , 都 是 在 发 泄 。 但 是 , 我 们 的 生 活 是 需 要 一 个 信 用 体 系 的 。 我 不 知'}]
 
45
 
46
  ## Training procedure
47
 
48
+ The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), and the others are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.
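
To make the two-stage schedule concrete, the sketch below works out how many token positions each stage covers, using the step counts and sequence lengths from the paragraph above and the `--batch_size` values from the training commands. Whether `--batch_size` is per process or global is not stated here, so the hypothetical `world_size` argument defaults to 1 and the figures are per batch; padding is also ignored, so these are upper bounds.

```python
# Values taken from the README: steps, sequence lengths, and --batch_size flags.
# Assumption: whether --batch_size is per process or global is NOT stated, so
# world_size defaults to 1 and results are reported per batch.

STAGES = {
    "stage1": {"batch_size": 64, "seq_length": 128, "total_steps": 1_000_000},
    "stage2": {"batch_size": 16, "seq_length": 1024, "total_steps": 250_000},
}


def tokens_per_step(batch_size: int, seq_length: int, world_size: int = 1) -> int:
    """Token positions consumed by one optimization step."""
    return batch_size * seq_length * world_size


def tokens_per_stage(stage: dict, world_size: int = 1) -> int:
    """Token positions consumed over all steps of a stage (upper bound)."""
    per_step = tokens_per_step(stage["batch_size"], stage["seq_length"], world_size)
    return per_step * stage["total_steps"]


if __name__ == "__main__":
    for name, stage in STAGES.items():
        per_step = tokens_per_step(stage["batch_size"], stage["seq_length"])
        total = tokens_per_stage(stage)
        print(f"{name}: {per_step} tokens per batch, {total} tokens in total")
```

Stage 2 processes twice as many tokens per batch as stage 1 (16 × 1024 versus 64 × 128) but runs for a quarter of the steps, so it covers half as many token positions overall.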
49
+
50
+ For the models pre-trained by UER-py, we take GPT2-distil as an example.
51
 
52
  Stage1:
53
 
 
58
  --seq_length 128 --processes_num 32 --data_processor lm
59
  ```
60
 
61
+ ```
62
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
63
+ --vocab_path models/google_zh_vocab.txt \
64
+ --config_path models/gpt2/distil_config.json \
65
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin \
66
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
67
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
68
+ --learning_rate 1e-4 --batch_size 64
69
+ ```
70
+
71
+ Stage2:
72
+
73
+ ```
74
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
75
+ --vocab_path models/google_zh_vocab.txt \
76
+ --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
77
+ --seq_length 1024 --processes_num 32 --data_processor lm
78
+ ```
79
+
80
+ ```
81
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
82
+ --vocab_path models/google_zh_vocab.txt \
83
+ --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin-1000000 \
84
+ --config_path models/gpt2/distil_config.json \
85
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
86
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
87
+ --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
88
+ --learning_rate 5e-5 --batch_size 16
89
+ ```
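
The commands above refer to checkpoints such as `cluecorpussmall_gpt2_distil_seq128_model.bin-1000000`, which suggests that the step count is appended to `--output_model_path` at every save. The sketch below lists the checkpoint files one would expect under that naming assumption; the helper name is hypothetical and the naming rule is inferred from the commands, not taken from the UER-py documentation.

```python
# Assumption: checkpoints are named "<output_model_path>-<step>", inferred from
# the "-1000000" and "-250000" suffixes used in the commands above.

def expected_checkpoints(output_model_path: str,
                         total_steps: int,
                         save_checkpoint_steps: int) -> list[str]:
    """List the checkpoint paths expected after a full training run."""
    steps = range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps)
    return [f"{output_model_path}-{step}" for step in steps]


if __name__ == "__main__":
    stage1 = expected_checkpoints(
        "models/cluecorpussmall_gpt2_distil_seq128_model.bin", 1_000_000, 100_000)
    stage2 = expected_checkpoints(
        "models/cluecorpussmall_gpt2_distil_seq1024_model.bin", 250_000, 50_000)
    # The last stage-1 checkpoint is what stage 2 loads via --pretrained_model_path.
    print(stage1[-1])
    print(stage2[-1])
```

Under this assumption, the last entry of the stage 1 list is the file passed to `--pretrained_model_path` in stage 2.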
90
+
91
+ Finally, we convert the pre-trained model into Huggingface's format:
92
+
93
+ ```
94
+ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_distil_seq1024_model.bin-250000 \
95
+ --output_model_path pytorch_model.bin \
96
+ --layers_num 6
97
+ ```
98
+
99
+ For the GPT2-xlarge model, we use TencentPretrain.
100
+
101
+ Stage1:
102
+
103
+ ```
104
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
105
+ --vocab_path models/google_zh_vocab.txt \
106
+ --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
107
+ --seq_length 128 --processes_num 32 --data_processor lm
108
+ ```
109
+
110
  ```
111
  deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
112
  --dataset_path corpora/cluecorpussmall_lm_seq128_dataset.pt \
113
  --vocab_path models/google_zh_vocab.txt \
114
+ --config_path models/gpt2/xlarge_config.json \
115
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq128 \
116
  --world_size 8 --batch_size 64 \
117
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
118
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 24
119
+ ```
120
+
121
+ Before stage2, we extract the consolidated fp32 weights from the ZeRO stage 2/3 DeepSpeed checkpoint:
122
+
123
+ ```
124
+ python3 models/cluecorpussmall_gpt2_xlarge_seq128/zero_to_fp32.py models/cluecorpussmall_gpt2_xlarge_seq128/ \
125
+ models/cluecorpussmall_gpt2_xlarge_seq128.bin
126
  ```
127
 
128
  Stage2:
 
131
  python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
132
  --vocab_path models/google_zh_vocab.txt \
133
  --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
134
+ --seq_length 1024 --processes_num 32 --data_processor lm
135
  ```
136
 
137
  ```
138
  deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
139
  --dataset_path corpora/cluecorpussmall_lm_seq1024_dataset.pt \
140
  --vocab_path models/google_zh_vocab.txt \
141
+ --config_path models/gpt2/xlarge_config.json \
142
+ --pretrained_model_path models/cluecorpussmall_gpt2_xlarge_seq128.bin \
143
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2 \
144
  --world_size 8 --batch_size 16 --learning_rate 5e-5 \
145
  --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
146
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 6
147
+ ```
148
+
149
+ Then, we extract the consolidated fp32 weights from the ZeRO stage 2/3 DeepSpeed checkpoint:
150
+
151
+ ```
152
+ python3 models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/zero_to_fp32.py models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/ \
153
+ models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin
154
  ```
155
 
156
  Finally, we convert the pre-trained model into Huggingface's format:
157
 
158
  ```
159
+ python3 scripts/convert_gpt2_from_tencentpretrain_to_huggingface.py --input_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin \
160
+ --output_model_path pytorch_model.bin \
161
+ --layers_num 48
 
162
  ```
163
 
164
  ### BibTeX entry and citation info
 
177
  pages={241},
178
  year={2019}
179
  }
180
+
181
+ @article{zhao2023tencentpretrain,
182
+ title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
183
+ author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
184
+ journal={ACL 2023},
185
+ pages={217},
186
+ year={2023}
+ }
187
+ ```
188
+
189
+ [distil]:https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall
190
+ [base]:https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
191
+ [medium]:https://huggingface.co/uer/gpt2-medium-chinese-cluecorpussmall
192
+ [large]:https://huggingface.co/uer/gpt2-large-chinese-cluecorpussmall
193
+ [xlarge]:https://huggingface.co/uer/gpt2-xlarge-chinese-cluecorpussmall
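
For scripts that switch between model sizes, the link definitions above can be mirrored in a small lookup table. This is only a convenience sketch: the names `MODEL_REPOS` and `repo_id` are hypothetical, and the repository IDs are copied from the links above.

```python
# Repository IDs copied from the link definitions above.
# MODEL_REPOS and repo_id are hypothetical helper names, not part of any library.

MODEL_REPOS = {
    "distil": "uer/gpt2-distil-chinese-cluecorpussmall",
    "base": "uer/gpt2-chinese-cluecorpussmall",
    "medium": "uer/gpt2-medium-chinese-cluecorpussmall",
    "large": "uer/gpt2-large-chinese-cluecorpussmall",
    "xlarge": "uer/gpt2-xlarge-chinese-cluecorpussmall",
}


def repo_id(size: str) -> str:
    """Return the Hugging Face repository ID for a model size such as 'medium'."""
    try:
        return MODEL_REPOS[size]
    except KeyError:
        known = ", ".join(MODEL_REPOS)
        raise ValueError(f"unknown size {size!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for size in MODEL_REPOS:
        print(size, "->", repo_id(size))
```

The returned string can be passed to `BertTokenizer.from_pretrained` and `GPT2LMHeadModel.from_pretrained` in place of the hard-coded name in the usage example.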