Tanrei committed on
Commit
a25fc7b
1 Parent(s): ac78298

Update README.md

Files changed (1)
  1. README.md +41 -11
README.md CHANGED
@@ -6,36 +6,46 @@ pipeline_tag: text-generation
---
# Model Card for Tanrei/GPTSAN-japanese

+ ![GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/logo-bk.png?raw=true)
+
General-purpose Switch Transformer based Japanese language model

+ GPTSAN has some unique features. It has the model structure of a Prefix-LM: it works as a shifted masked language model over prefixed input tokens, while un-prefixed input behaves like a normal generative model.
+ The Spout vector is a GPTSAN-specific input. Spout is pre-trained with random inputs, but during fine-tuning you can specify a class of text or an arbitrary vector, which lets you steer the tendency of the generated text.
+ GPTSAN also has a sparse feed-forward layer based on the Switch Transformer. You can add other layers and train them partially as well. See the original [GPTSAN repository](https://github.com/tanreinama/GPTSAN/) for details.
+
## Text Generation

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
- >>>
+
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
- >>> x_token = tokenizer.encode("織田信長は、", return_tensors="pt").to(device)
+ >>> x_token = tokenizer("織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
- >>> gen_token = model.generate(x_token, max_new_tokens=50)
+ >>> input_ids = x_token.input_ids.to(device)
+ >>> gen_token = model.generate(input_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・軍事の中枢まで掌握した政治家であり、日本史上類を見ない驚異的な軍事侵攻を続け..."
```


+
## Text Generation with Prefix-LM model

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
- >>>
+
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
- >>> x_token = tokenizer.encode("", prefix_text="織田信長は、", return_tensors="pt").to(device)
+ >>> x_token = tokenizer("", prefix_text="織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
- >>> gen_token = model.generate(x_token, max_new_tokens=50)
+ >>> input_ids = x_token.input_ids.to(device)
+ >>> token_type_ids = x_token.token_type_ids.to(device)
+ >>> gen_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・外交で数々の戦果を上げるが、1568年からは、いわゆる本能寺の変で細川晴元に暗殺される..."
```
@@ -45,17 +55,21 @@ General-purpose Switch Transformer based Japanese language model

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils
- >>>
+
>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
- >>> x_token = tokenizer.encode("", prefix_text="武田信玄は、<|inputmask|>時代ファンならぜひ押さえ<|inputmask|>きたい名将の一人。", return_tensors="pt").to(device)
+ >>> x_token = tokenizer(
+ ...     "", prefix_text="武田信玄は、<|inputmask|>時代ファンならぜひ押さえ<|inputmask|>きたい名将の一人。", return_tensors="pt"
+ ... )
>>> trainer_utils.set_seed(30)
- >>> out_lm_token = model.generate(x_token, max_new_tokens=50)
- >>> out_mlm_token = model(x_token).logits.argmax(axis=-1)
+ >>> input_ids = x_token.input_ids.to(device)
+ >>> token_type_ids = x_token.token_type_ids.to(device)
+ >>> out_lm_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
+ >>> out_mlm_token = model(input_ids, token_type_ids=token_type_ids).logits.argmax(axis=-1)
>>> tokenizer.decode(out_mlm_token[0])
"武田信玄は、戦国時代ファンならぜひ押さえておきたい名将の一人。"
- >>> tokenizer.decode(out_lm_token[0][x_token.shape[1]:])
+ >>> tokenizer.decode(out_lm_token[0][input_ids.shape[1] :])
"武田氏の三代に渡った武田家のひとり\n甲斐市に住む、日本史上最大の戦国大名。..."
```

@@ -74,6 +88,22 @@ It has the same structure as the model introduced as `Prefix LM` in the T5 paper
 
 
- **Language(s) (NLP):** Japanese
- **License:** MIT License

+ ### Prefix-LM Model
+
+ GPTSAN has the structure of the model named Prefix-LM in the `T5` paper (the original GPTSAN repository calls it `hybrid`).
+ In GPTSAN, the `Prefix` part of the Prefix-LM, that is, the input positions that every token can attend to, can be given any length.
+ A different length can also be specified for each entry in a batch.
+ The length applies to the text passed to the tokenizer as `prefix_text`.
+ The tokenizer returns the mask of the `Prefix` part as `token_type_ids`.
+ The model treats positions where `token_type_ids` is 1 as the `Prefix` part, that is, input that can attend to tokens both before and after it.
+
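A minimal sketch of inspecting that mask, reusing the `tokenizer` loaded in the examples above (the input strings here are arbitrary illustrations):

```python
>>> # Tokenize a prefix plus a continuation and look at the returned Prefix mask.
>>> x_token = tokenizer("戦国大名である。", prefix_text="織田信長は、", return_tensors="pt")
>>> x_token.token_type_ids  # expected: 1 over the "織田信長は、" prefix positions, 0 over the rest
```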
+ ### Spout Vector
+
+ A Spout Vector is a special vector for controlling text generation.
+ This vector is treated as the first embedding in self-attention, adding attention that is external to the generated tokens.
+ In this pre-trained model, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers inside the model and is projected into the space acting as external attention.
+ The Spout Vector projected by the fully connected layers is split and passed to every self-attention layer.
+
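The card itself never passes a Spout vector explicitly. The sketch below is a hypothetical illustration that assumes the model accepts a `(batch_size, 128)` tensor through a `spout` keyword, matching the description above; treat both the argument name and its pass-through via `generate` as assumptions.

```python
>>> import torch
>>> # Hypothetical: a random Spout vector, as used during pre-training.
>>> # Passing `spout` to generate() is an assumption, not shown in the card.
>>> spout = torch.rand(1, 128).to(device)
>>> gen_token = model.generate(input_ids, token_type_ids=token_type_ids, spout=spout, max_new_tokens=50)
```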
## Model Sources

<!-- Provide the basic links for the model. -->