Update README.md
---
# Model Card for Tanrei/GPTSAN-japanese

![GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/logo-bk.png?raw=true)

General-purpose Switch-Transformer-based Japanese language model

GPTSAN has some unique features. It has a Prefix-LM model structure: it works as a shifted Masked Language Model for prefix input tokens, while un-prefixed inputs behave like a normal generative model.
The Spout vector is a GPTSAN-specific input. Spout is pre-trained with random inputs, but you can specify a class of text or an arbitrary vector during fine-tuning. This allows you to indicate the tendency of the generated text.
GPTSAN has a sparse Feed Forward based on Switch-Transformer. You can also add other layers and train them partially. See the original [GPTSAN repository](https://github.com/tanreinama/GPTSAN/) for details.

## Text Generation

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils

>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> gen_token = model.generate(input_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・軍事の中枢まで掌握した政治家であり、日本史上類を見ない驚異的な軍事侵攻を続け..."
```

## Text Generation with Prefix-LM model

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils

>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("", prefix_text="織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> token_type_ids = x_token.token_type_ids.to(device)
>>> gen_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
"織田信長は、政治・外交で数々の戦果を上げるが、1568年からは、いわゆる本能寺の変で細川晴元に暗殺される..."
```

```python
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils

>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer(
...     "", prefix_text="武田信玄は、<|inputmask|>時代ファンならぜひ押さえ<|inputmask|>きたい名将の一人。", return_tensors="pt"
... )
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> token_type_ids = x_token.token_type_ids.to(device)
>>> out_lm_token = model.generate(input_ids, token_type_ids=token_type_ids, max_new_tokens=50)
>>> out_mlm_token = model(input_ids, token_type_ids=token_type_ids).logits.argmax(axis=-1)
>>> tokenizer.decode(out_mlm_token[0])
"武田信玄は、戦国時代ファンならぜひ押さえておきたい名将の一人。"
>>> tokenizer.decode(out_lm_token[0][input_ids.shape[1] :])
"武田氏の三代に渡った武田家のひとり\n甲斐市に住む、日本史上最大の戦国大名。..."
```

- **Language(s) (NLP):** Japanese
- **License:** MIT License

### Prefix-LM Model

GPTSAN has the structure of the model called Prefix-LM in the `T5` paper. (The original GPTSAN repository calls it `hybrid`.)
In GPTSAN, the `Prefix` part of Prefix-LM, that is, the input positions that can be referenced bidirectionally, can be given any length.
A different length can also be specified for each batch entry.
This length applies to the text entered in `prefix_text` for the tokenizer.
The tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`.
The model treats positions where `token_type_ids` is 1 as the `Prefix` part, that is, positions whose input can refer to tokens both before and after them.

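For illustration only (not from the original card), a minimal sketch of inspecting that mask; the input strings below are arbitrary examples, and the exact tensor contents depend on the tokenizer output:

```python
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> # Text passed as prefix_text becomes the bidirectional `Prefix` part.
>>> x_token = tokenizer("続きの文。", prefix_text="織田信長は、", return_tensors="pt")
>>> # token_type_ids is 1 at `Prefix` positions and 0 elsewhere, so each batch
>>> # entry can use a different prefix length.
>>> x_token.token_type_ids
```
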
### Spout Vector

A Spout Vector is a special vector for controlling text generation.
This vector is treated as the first embedding in self-attention to bring external attention to the generated tokens.
In this pre-trained model, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers in the model and is projected into the space acting as external attention.
The Spout Vector projected by the fully connected layers is split and passed to all self-attention layers.

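As an illustrative sketch rather than an official recipe, a Spout Vector could be supplied at generation time, assuming the model accepts a `spout` keyword argument of shape `(batch_size, 128)` matching the description above; the random vector below is only a placeholder for a vector you would normally learn or choose during fine-tuning:

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer, trainer_utils

>>> device = "cuda"
>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese")
>>> x_token = tokenizer("織田信長は、", return_tensors="pt")
>>> trainer_utils.set_seed(30)
>>> input_ids = x_token.input_ids.to(device)
>>> # Placeholder Spout Vector: one 128-dimensional vector per batch entry.
>>> spout = torch.rand(input_ids.shape[0], 128).to(device)
>>> gen_token = model.generate(input_ids, spout=spout, max_new_tokens=50)
>>> tokenizer.decode(gen_token[0])
```
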
## Model Sources

<!-- Provide the basic links for the model. -->