junnyu committed
Commit 3327b18
1 Parent(s): f23fea6

Update README.md

README.md +67 -65
@@ -11,70 +11,72 @@
  2. Set the parameters and run convert.py
  3. Example:
  Suppose we want to convert the weights from https://huggingface.co/junnyu/roformer_chinese_base
- (1) First, download the pytorch_model.bin file from https://huggingface.co/junnyu/roformer_chinese_base/tree/main and save it to, say, `./roformer_chinese_base/pytorch_model.bin`
- (2) Run convert.py
- ```bash
- python convert.py \
-     --pytorch_checkpoint_path ./roformer_chinese_base/pytorch_model.bin \
-     --paddle_dump_path ./roformer_chinese_base/model_state.pdparams
- ```
- (3) This produces the converted weights at `./roformer_chinese_base/model_state.pdparams`

  ## Pretrained MLM test
- # test_mlm.py
- ```python
- import paddle
- import argparse
- from paddlenlp.transformers import RoFormerForPretraining, RoFormerTokenizer
-
- def test_mlm(text, model_name):
-     model = RoFormerForPretraining.from_pretrained(model_name)
-     model.eval()
-     tokenizer = RoFormerTokenizer.from_pretrained(model_name)
-     tokens = ["[CLS]"]
-     text_list = text.split("[MASK]")
-     for i,t in enumerate(text_list):
-         tokens.extend(tokenizer.tokenize(t))
-         if i==len(text_list)-1:
-             tokens.extend(["[SEP]"])
-         else:
-             tokens.extend(["[MASK]"])
-
-     input_ids_list = tokenizer.convert_tokens_to_ids(tokens)
-     input_ids = paddle.to_tensor([input_ids_list])
-
-     with paddle.no_grad():
-         pd_outputs = model(input_ids)[0][0]
-     pd_outputs_sentence = "paddle: "
-     for i, id in enumerate(input_ids_list):
-         if id == tokenizer.convert_tokens_to_ids(["[MASK]"])[0]:
-             tokens = tokenizer.convert_ids_to_tokens(pd_outputs[i].topk(5)[1].tolist())
-             pd_outputs_sentence += "[" + "||".join(tokens) + "]"
-         else:
-             pd_outputs_sentence += "".join(
-                 tokenizer.convert_ids_to_tokens([id], skip_special_tokens=True)
-             )
-     print(pd_outputs_sentence)
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser()
-     parser.add_argument(
-         "--model_name", default="roformer-chinese-base", type=str, help="Pretrained roformer name or path."
-     )
-     parser.add_argument(
-         "--text", default="今天[MASK]很好,我想去公园玩!", type=str, help="MLM text."
-     )
-     args = parser.parse_args()
-     test_mlm(text=args.text, model_name=args.model_name)
-
- ```
- ```bash
- python test_mlm.py --model_name roformer-chinese-base --text 今天[MASK]很好,我想去公园玩!
- # paddle: 今天[天气||天||阳光||太阳||空气]很好,我想去公园玩!
- python test_mlm.py --model_name roformer-chinese-base --text 北京是[MASK]的首都!
- # paddle: 北京是[中国||谁||中华人民共和国||我们||中华民族]的首都!
- python test_mlm.py --model_name roformer-chinese-char-base --text 今天[MASK]很好,我想去公园玩!
- # paddle: 今天[天||气||都||风||人]很好,我想去公园玩!
- python test_mlm.py --model_name roformer-chinese-char-base --text 北京是[MASK]的首都!
- # paddle: 北京是[谁||我||你||他||国]的首都!
- ```

  2. Set the parameters and run convert.py
  3. Example:
  Suppose we want to convert the weights from https://huggingface.co/junnyu/roformer_chinese_base
+ (1) First, download the pytorch_model.bin file from https://huggingface.co/junnyu/roformer_chinese_base/tree/main and save it to, say, `./roformer_chinese_base/pytorch_model.bin`
+ (2) Run convert.py
+ ```bash
+ python convert.py \
+     --pytorch_checkpoint_path ./roformer_chinese_base/pytorch_model.bin \
+     --paddle_dump_path ./roformer_chinese_base/model_state.pdparams
+ ```
+ (3) This produces the converted weights at `./roformer_chinese_base/model_state.pdparams`
+
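convert.py itself is not shown in this diff. As a rough sketch of the kind of work such a script usually has to do, the following assumes (the README does not confirm this) that the main step is transposing linear-layer weights, because PyTorch's `nn.Linear` stores its weight as `[out_features, in_features]` while Paddle's `nn.Linear` stores `[in_features, out_features]`. The function name `convert_state_dict`, the toy parameter names, and the use of nested lists in place of real tensors are all illustrative; a real script would load the checkpoint with `torch.load` and may also need to rename parameters.

```python
def transpose(matrix):
    """Transpose a 2-D nested list."""
    return [list(column) for column in zip(*matrix)]


def convert_state_dict(torch_state):
    """Assumed rule: transpose 2-D linear weights, copy everything else."""
    paddle_state = {}
    for name, value in torch_state.items():
        is_matrix = bool(value) and isinstance(value[0], list)
        # Embedding tables are also 2-D but use the same layout in both
        # frameworks, so they are copied unchanged.
        is_linear_weight = (
            name.endswith(".weight") and is_matrix and "embeddings" not in name
        )
        paddle_state[name] = transpose(value) if is_linear_weight else value
    return paddle_state


# Toy "tensors" as nested lists (hypothetical parameter names).
toy_state = {
    "encoder.dense.weight": [[1, 2, 3], [4, 5, 6]],  # shape [out=2, in=3]
    "encoder.dense.bias": [0.1, 0.2],
    "embeddings.word_embeddings.weight": [[7, 8], [9, 10]],
}
converted = convert_state_dict(toy_state)
print(converted["encoder.dense.weight"])  # [[1, 4], [2, 5], [3, 6]]
```

Only the linear weight changes shape here (from 2x3 to 3x2); the bias and the embedding table pass through untouched.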

  ## Pretrained MLM test
+ ### test_mlm.py
+ ```python
+ import paddle
+ import argparse
+ from paddlenlp.transformers import RoFormerForPretraining, RoFormerTokenizer
+
+ def test_mlm(text, model_name):
+     model = RoFormerForPretraining.from_pretrained(model_name)
+     model.eval()
+     tokenizer = RoFormerTokenizer.from_pretrained(model_name)
+     tokens = ["[CLS]"]
+     text_list = text.split("[MASK]")
+     for i,t in enumerate(text_list):
+         tokens.extend(tokenizer.tokenize(t))
+         if i==len(text_list)-1:
+             tokens.extend(["[SEP]"])
+         else:
+             tokens.extend(["[MASK]"])
+
+     input_ids_list = tokenizer.convert_tokens_to_ids(tokens)
+     input_ids = paddle.to_tensor([input_ids_list])
+
+     with paddle.no_grad():
+         pd_outputs = model(input_ids)[0][0]
+     pd_outputs_sentence = "paddle: "
+     for i, id in enumerate(input_ids_list):
+         if id == tokenizer.convert_tokens_to_ids(["[MASK]"])[0]:
+             tokens = tokenizer.convert_ids_to_tokens(pd_outputs[i].topk(5)[1].tolist())
+             pd_outputs_sentence += "[" + "||".join(tokens) + "]"
+         else:
+             pd_outputs_sentence += "".join(
+                 tokenizer.convert_ids_to_tokens([id], skip_special_tokens=True)
+             )
+     print(pd_outputs_sentence)
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--model_name", default="roformer-chinese-base", type=str, help="Pretrained roformer name or path."
+     )
+     parser.add_argument(
+         "--text", default="今天[MASK]很好,我想去公园玩!", type=str, help="MLM text."
+     )
+     args = parser.parse_args()
+     test_mlm(text=args.text, model_name=args.model_name)
+
+ ```
+ ### Output
+ ```bash
+ python test_mlm.py --model_name roformer-chinese-base --text 今天[MASK]很好,我想去公园玩!
+ # paddle: 今天[天气||天||阳光||太阳||空气]很好,我想去公园玩!
+ python test_mlm.py --model_name roformer-chinese-base --text 北京是[MASK]的首都!
+ # paddle: 北京是[中国||谁||中华人民共和国||我们||中华民族]的首都!
+ python test_mlm.py --model_name roformer-chinese-char-base --text 今天[MASK]很好,我想去公园玩!
+ # paddle: 今天[天||气||都||风||人]很好,我想去公园玩!
+ python test_mlm.py --model_name roformer-chinese-char-base --text 北京是[MASK]的首都!
+ # paddle: 北京是[谁||我||你||他||国]的首都!
+ ```
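
The token-building loop in test_mlm.py splits the input on `[MASK]` and re-inserts the special tokens by hand, presumably so that the tokenizer never has to handle the literal string `[MASK]`. The sketch below reproduces that step and the top-k selection in plain Python. The helpers `build_tokens` and `top_k_indices` are hypothetical names, the scores are made up, and `list` stands in for the real tokenizer as a character-level assumption, so the output only illustrates the sequence layout, not real model predictions.

```python
def build_tokens(text, tokenize):
    """Split on [MASK], tokenize each piece, and re-insert the special tokens."""
    tokens = ["[CLS]"]
    pieces = text.split("[MASK]")
    for index, piece in enumerate(pieces):
        tokens.extend(tokenize(piece))
        # Every piece except the last is followed by a mask; the last by [SEP].
        tokens.append("[SEP]" if index == len(pieces) - 1 else "[MASK]")
    return tokens


def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# `list` splits a string into characters; the real script uses RoFormerTokenizer.
tokens = build_tokens("北京是[MASK]的首都!", tokenize=list)
print(tokens)
# ['[CLS]', '北', '京', '是', '[MASK]', '的', '首', '都', '!', '[SEP]']

# Made-up scores over a 4-entry vocabulary, standing in for one row of model output.
print(top_k_indices([0.1, 0.7, 0.05, 0.15], k=2))  # [1, 3]
```

In the real script the same selection is done by `pd_outputs[i].topk(5)[1]`, and the resulting ids are mapped back to tokens with `tokenizer.convert_ids_to_tokens`.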