hotchpotch committed on
Commit 7620bf0 · verified · 1 Parent(s): efd90a1

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,153 @@
---
license: mit
datasets:
- hpprc/emb
- hotchpotch/hpprc_emb-scores
- microsoft/ms_marco
language:
- ja
base_model:
- tohoku-nlp/bert-base-japanese-v3
---

This is a high-performance Japanese [SPLADE](https://github.com/naver/splade) (Sparse Lexical and Expansion Model) model. The [text-to-sparse-vector demo](https://huggingface.co/spaces/hotchpotch/japanese-splade-demo-streamlit) lets you try out in a WebUI what kind of sparse vectors the model produces.

A technical report will be published at a later date.

# Usage

## [YASEM (Yet Another Splade|Sparse Embedder)](https://github.com/hotchpotch/yasem)

```bash
pip install yasem
```

```python
from yasem import SpladeEmbedder

model_name = "hotchpotch/japanese-splade-base-v1"
embedder = SpladeEmbedder(model_name)

sentences = [
    "車の燃費を向上させる方法は?",
    "急発進や急ブレーキを避け、一定速度で走行することで燃費が向上します。",
    "車を長持ちさせるには、消耗品を適切なタイミングで交換することが重要です。",
]

embeddings = embedder.encode(sentences)
similarity = embedder.similarity(embeddings, embeddings)

print(similarity)
# [[21.49299249 10.48868281  6.25582337]
#  [10.48868281 12.90587398  3.19429791]
#  [ 6.25582337  3.19429791 12.89678271]]
```

```python
token_values = embedder.get_token_values(embeddings[0])

print(token_values)

# {
#   '車': 2.1796875,
#   '燃費': 2.146484375,
#   '向上': 1.7353515625,
#   '方法': 1.55859375,
#   '燃料': 1.3291015625,
#   '効果': 1.1376953125,
#   '良い': 0.873046875,
#   '改善': 0.8466796875,
#   'アップ': 0.833984375,
#   'いう': 0.70849609375,
#   '理由': 0.64453125,
#   ...
```
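
For intuition: the similarity scores above are plain dot products between sparse vectors, so they can also be reproduced from the token-weight dictionaries by summing the products of weights over shared tokens. The snippet below is a minimal sketch that is not part of the original README; it assumes `get_token_values` returns every non-zero token weight, and `sparse_dot` is a hypothetical helper name.

```python
def sparse_dot(a: dict, b: dict) -> float:
    # Dot product of two sparse vectors given as token -> weight dicts:
    # only tokens present in both vectors contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[token] for token, weight in a.items() if token in b)

query_tokens = embedder.get_token_values(embeddings[0])
doc_tokens = embedder.get_token_values(embeddings[1])
print(sparse_dot(query_tokens, doc_tokens))
# ~10.49, matching similarity[0][1] above up to floating-point rounding
```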

## transformers

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Same model and example sentences as in the YASEM example above.
model_name = "hotchpotch/japanese-splade-base-v1"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentences = [
    "車の燃費を向上させる方法は?",
    "急発進や急ブレーキを避け、一定速度で走行することで燃費が向上します。",
    "車を長持ちさせるには、消耗品を適切なタイミングで交換することが重要です。",
]

def splade_max_pooling(logits, attention_mask):
    # SPLADE pooling: log-saturated ReLU over the MLM logits, masked by
    # attention and max-pooled over the sequence dimension.
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    return max_val

tokens = tokenizer(
    sentences, return_tensors="pt", padding=True, truncation=True, max_length=512
)
tokens = {k: v.to(model.device) for k, v in tokens.items()}

with torch.no_grad():
    outputs = model(**tokens)
embeddings = splade_max_pooling(outputs.logits, tokens["attention_mask"])

similarity = torch.matmul(embeddings.unsqueeze(0), embeddings.T).squeeze(0)
print(similarity)

# tensor([[21.4943, 10.4816,  6.2540],
#         [10.4816, 12.9024,  3.1939],
#         [ 6.2540,  3.1939, 12.8919]])
```
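
Since each dimension of the sparse embedding corresponds to a vocabulary id, the top-weighted tokens can be read back directly with the tokenizer, similar to yasem's `get_token_values`. The snippet below is a minimal sketch, not part of the original README; `token_weights` is a hypothetical helper name.

```python
def token_weights(embedding, tokenizer, top_k=10):
    # Sort one sparse embedding by weight and map vocabulary ids back to
    # tokens, keeping only dimensions with a positive weight.
    values, indices = torch.sort(embedding, descending=True)
    tokens = tokenizer.convert_ids_to_tokens(indices[:top_k].tolist())
    return {t: v.item() for t, v in zip(tokens, values[:top_k]) if v > 0}

print(token_weights(embeddings[0], tokenizer))
# e.g. {'車': 2.17..., '燃費': 2.14..., '向上': 1.73..., ...}
```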

# Benchmark scores

## Retrieval (JMTEB)

Evaluation results on [JMTEB](https://github.com/sbintuitions/JMTEB). japanese-splade-base-v1 was evaluated with [a modified version of JMTEB that supports evaluation with sparse vectors](https://github.com/hotchpotch/JMTEB/tree/add_splade).
Note that japanese-splade-base-v1 was trained on the jaqket and mrtydi domains (excluding their test data).

| model_name | Avg. | jagovfaqs_22k | jaqket | mrtydi | nlp_journal_abs_intro | nlp_journal_title_abs | nlp_journal_title_intro |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| [japanese-splade-base-v1](https://huggingface.co/hotchpotch/japanese-splade-base-v1) | **0.7465** | 0.6499 | **0.6992** | **0.4365** | 0.8967 | **0.9766** | 0.8203 |
| [text-embedding-3-large](https://huggingface.co/OpenAI/text-embedding-3-large) | 0.7448 | 0.7241 | 0.4821 | 0.3488 | **0.9933** | 0.9655 | **0.9547** |
| [GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 0.7336 | 0.6979 | 0.6729 | 0.4186 | 0.9029 | 0.9511 | 0.7580 |
| [ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.7302 | **0.7668** | 0.6174 | 0.3803 | 0.8712 | 0.9658 | 0.7797 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.7098 | 0.7030 | 0.5878 | 0.4363 | 0.8600 | 0.9470 | 0.7248 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 0.6727 | 0.6411 | 0.4997 | 0.3605 | 0.8521 | 0.9526 | 0.7299 |

## Reranking

### [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR)

Note that japanese-splade-base-v1 was **not trained on the JaCWIR domain**.

| model_names | map@10 | hit_rate@10 |
| :--- | ---: | ---: |
| [japanese-splade-base-v1](https://huggingface.co/hotchpotch/japanese-splade-base-v1) | **0.9122** | **0.9854** |
| [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) | 0.8168 | 0.9506 |
| [GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 0.8567 | 0.9676 |
| [bge-m3+dense](https://huggingface.co/BAAI/bge-m3) | 0.8642 | 0.9684 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.8759 | 0.9726 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 0.8690 | 0.9700 |
| [ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.8291 | 0.9594 |

### [JQaRA](https://github.com/hotchpotch/JQaRA)

Note that japanese-splade-base-v1 was trained on the JQaRA domain (excluding the test split).

| model_names | ndcg@10 | mrr@10 |
| :--- | ---: | ---: |
| [japanese-splade-base-v1](https://huggingface.co/hotchpotch/japanese-splade-base-v1) | **0.6441** | **0.8616** |
| [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) | 0.3881 | 0.6107 |
| [bge-m3+dense](https://huggingface.co/BAAI/bge-m3) | 0.5390 | 0.7854 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.5540 | 0.7988 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 0.4917 | 0.7291 |
| [GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 0.6060 | 0.8359 |
| [ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.6287 | 0.8418 |

## Training datasets

From [hpprc/emb](https://huggingface.co/datasets/hpprc/emb), the auto-wiki-qa, mmarco, jsquad, jaquad, auto-wiki-qa-nemotron, quiz-works, quiz-no-mori, miracl, jqara, mr-tydi, baobab-wiki-retrieval, and mkqa datasets are used.
MS MARCO is also used as an English dataset.

## Notes

A `tokenizer.json` file is bundled with this model, but it is only a dummy file needed to run the model with text-embeddings-inference. For details, see [Running Japanese tokenizer models with text-embeddings-inference](https://secon.dev/entry/2024/09/30/160000/) (in Japanese).
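
For serving, the dummy `tokenizer.json` only exists so that text-embeddings-inference can load the model; queries then go to TEI's sparse-embedding endpoint. The snippet below is an illustrative sketch, not part of the original README: it assumes a TEI server is already running locally with this model and SPLADE pooling (for example started with `--model-id hotchpotch/japanese-splade-base-v1 --pooling splade`) and that it exposes the `/embed_sparse` route on port 8080; verify the flags and route against your TEI version.

```python
import requests

# Ask a locally running text-embeddings-inference server for a sparse embedding.
resp = requests.post(
    "http://localhost:8080/embed_sparse",
    json={"inputs": "車の燃費を向上させる方法は?"},
    timeout=30,
)
resp.raise_for_status()
# Each result lists the non-zero dimensions as (token index, value) pairs.
print(resp.json())
```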
config.json ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "./",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.43.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32768
}
generation_config.json ADDED
@@ -0,0 +1,5 @@
{
  "_from_model_config": true,
  "pad_token_id": 0,
  "transformers_version": "4.43.0.dev0"
}
model_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bfbc21a3e0df40bbb2536a2ee9f50a688c70754b63af7c854de0f86ffbbfdd90
size 1004
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:79b9e3c40771abae754a234c6bef00ac531c0e7d903efb478090dd03f9b12ff1
size 222563762
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
@@ -0,0 +1,522 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "[PAD]",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 1,
17
+ "content": "[UNK]",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 2,
26
+ "content": "[CLS]",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ },
33
+ {
34
+ "id": 3,
35
+ "content": "[SEP]",
36
+ "single_word": false,
37
+ "lstrip": false,
38
+ "rstrip": false,
39
+ "normalized": false,
40
+ "special": true
41
+ },
42
+ {
43
+ "id": 4,
44
+ "content": "[MASK]",
45
+ "single_word": false,
46
+ "lstrip": false,
47
+ "rstrip": false,
48
+ "normalized": false,
49
+ "special": true
50
+ }
51
+ ],
52
+ "normalizer": null,
53
+ "pre_tokenizer": {
54
+ "type": "Whitespace"
55
+ },
56
+ "post_processor": {
57
+ "type": "TemplateProcessing",
58
+ "single": [
59
+ {
60
+ "SpecialToken": {
61
+ "id": "[CLS]",
62
+ "type_id": 0
63
+ }
64
+ },
65
+ {
66
+ "Sequence": {
67
+ "id": "A",
68
+ "type_id": 0
69
+ }
70
+ },
71
+ {
72
+ "SpecialToken": {
73
+ "id": "[SEP]",
74
+ "type_id": 0
75
+ }
76
+ }
77
+ ],
78
+ "pair": [
79
+ {
80
+ "SpecialToken": {
81
+ "id": "[CLS]",
82
+ "type_id": 0
83
+ }
84
+ },
85
+ {
86
+ "Sequence": {
87
+ "id": "A",
88
+ "type_id": 0
89
+ }
90
+ },
91
+ {
92
+ "SpecialToken": {
93
+ "id": "[SEP]",
94
+ "type_id": 0
95
+ }
96
+ },
97
+ {
98
+ "SpecialToken": {
99
+ "id": "[CLS]",
100
+ "type_id": 0
101
+ }
102
+ },
103
+ {
104
+ "Sequence": {
105
+ "id": "B",
106
+ "type_id": 0
107
+ }
108
+ },
109
+ {
110
+ "SpecialToken": {
111
+ "id": "[SEP]",
112
+ "type_id": 0
113
+ }
114
+ }
115
+ ],
116
+ "special_tokens": {
117
+ "[CLS]": {
118
+ "id": "[CLS]",
119
+ "ids": [
120
+ 2
121
+ ],
122
+ "tokens": [
123
+ "[CLS]"
124
+ ]
125
+ },
126
+ "[SEP]": {
127
+ "id": "[SEP]",
128
+ "ids": [
129
+ 3
130
+ ],
131
+ "tokens": [
132
+ "[SEP]"
133
+ ]
134
+ }
135
+ }
136
+ },
137
+ "decoder": {
138
+ "type": "BPEDecoder",
139
+ "suffix": "</w>"
140
+ },
141
+ "model": {
142
+ "type": "BPE",
143
+ "dropout": null,
144
+ "unk_token": null,
145
+ "continuing_subword_prefix": null,
146
+ "end_of_word_suffix": null,
147
+ "fuse_unk": false,
148
+ "byte_fallback": false,
149
+ "ignore_merges": false,
150
+ "vocab": {
151
+ "[PAD]": 0,
152
+ "[UNK]": 1,
153
+ "[CLS]": 2,
154
+ "[SEP]": 3,
155
+ "[MASK]": 4,
156
+ "!": 5,
157
+ "\"": 6,
158
+ "%": 7,
159
+ "&": 8,
160
+ "'": 9,
161
+ "(": 10,
162
+ ")": 11,
163
+ "*": 12,
164
+ "+": 13,
165
+ ",": 14,
166
+ "-": 15,
167
+ ".": 16,
168
+ "/": 17,
169
+ "0": 18,
170
+ "1": 19,
171
+ "2": 20,
172
+ "3": 21,
173
+ "4": 22,
174
+ "5": 23,
175
+ "6": 24,
176
+ "7": 25,
177
+ "8": 26,
178
+ "9": 27,
179
+ ":": 28,
180
+ ";": 29,
181
+ "?": 30,
182
+ "A": 31,
183
+ "B": 32,
184
+ "C": 33,
185
+ "D": 34,
186
+ "E": 35,
187
+ "F": 36,
188
+ "G": 37,
189
+ "H": 38,
190
+ "I": 39,
191
+ "J": 40,
192
+ "K": 41,
193
+ "L": 42,
194
+ "M": 43,
195
+ "N": 44,
196
+ "O": 45,
197
+ "P": 46,
198
+ "Q": 47,
199
+ "R": 48,
200
+ "S": 49,
201
+ "T": 50,
202
+ "U": 51,
203
+ "V": 52,
204
+ "W": 53,
205
+ "X": 54,
206
+ "Y": 55,
207
+ "Z": 56,
208
+ "[": 57,
209
+ "]": 58,
210
+ "_": 59,
211
+ "a": 60,
212
+ "b": 61,
213
+ "c": 62,
214
+ "d": 63,
215
+ "e": 64,
216
+ "f": 65,
217
+ "g": 66,
218
+ "h": 67,
219
+ "i": 68,
220
+ "j": 69,
221
+ "k": 70,
222
+ "l": 71,
223
+ "m": 72,
224
+ "n": 73,
225
+ "o": 74,
226
+ "p": 75,
227
+ "q": 76,
228
+ "r": 77,
229
+ "s": 78,
230
+ "t": 79,
231
+ "u": 80,
232
+ "v": 81,
233
+ "w": 82,
234
+ "x": 83,
235
+ "y": 84,
236
+ "z": 85,
237
+ "|": 86,
238
+ "§": 87,
239
+ "Á": 88,
240
+ "Æ": 89,
241
+ "á": 90,
242
+ "æ": 91,
243
+ "ç": 92,
244
+ "è": 93,
245
+ "é": 94,
246
+ "í": 95,
247
+ "ð": 96,
248
+ "ö": 97,
249
+ "ú": 98,
250
+ "ü": 99,
251
+ "þ": 100,
252
+ "ā": 101,
253
+ "ē": 102,
254
+ "ŋ": 103,
255
+ "ƿ": 104,
256
+ "ɑ": 105,
257
+ "ɒ": 106,
258
+ "ɔ": 107,
259
+ "ɖ": 108,
260
+ "ə": 109,
261
+ "ɚ": 110,
262
+ "ɛ": 111,
263
+ "ɜ": 112,
264
+ "ɡ": 113,
265
+ "ɪ": 114,
266
+ "ɫ": 115,
267
+ "ɹ": 116,
268
+ "ɾ": 117,
269
+ "ʃ": 118,
270
+ "ʈ": 119,
271
+ "ʊ": 120,
272
+ "ʌ": 121,
273
+ "ʍ": 122,
274
+ "ʒ": 123,
275
+ "ʔ": 124,
276
+ "ʰ": 125,
277
+ "ʱ": 126,
278
+ "ʲ": 127,
279
+ "ʷ": 128,
280
+ "ˈ": 129,
281
+ "ː": 130,
282
+ "ˑ": 131,
283
+ "̚": 132,
284
+ "̥": 133,
285
+ "̩": 134,
286
+ "̪": 135,
287
+ "̯": 136,
288
+ "͡": 137,
289
+ "θ": 138,
290
+ "‑": 139,
291
+ "–": 140,
292
+ "—": 141,
293
+ "∅": 142,
294
+ "⟨": 143,
295
+ "⟩": 144,
296
+ "an": 145,
297
+ "th": 146,
298
+ "in": 147,
299
+ "on": 148,
300
+ "er": 149,
301
+ "is": 150,
302
+ "es": 151,
303
+ "or": 152,
304
+ "the": 153,
305
+ "ti": 154,
306
+ "ar": 155,
307
+ "al": 156,
308
+ "en": 157,
309
+ "ed": 158,
310
+ "of": 159,
311
+ "and": 160,
312
+ "gl": 161,
313
+ "ish": 162,
314
+ "ngl": 163,
315
+ "Engl": 164,
316
+ "English": 165,
317
+ "as": 166,
318
+ "ic": 167,
319
+ "ou": 168,
320
+ "20": 169,
321
+ "tion": 170,
322
+ "ing": 171,
323
+ "ec": 172,
324
+ "om": 173,
325
+ "at": 174,
326
+ "st": 175,
327
+ "it": 176,
328
+ "le": 177,
329
+ "ge": 178,
330
+ "re": 179,
331
+ "gu": 180,
332
+ "angu": 181,
333
+ "angua": 182,
334
+ "ch": 183,
335
+ "ent": 184,
336
+ "ve": 185,
337
+ "to": 186,
338
+ ").": 187,
339
+ "ation": 188,
340
+ "ri": 189,
341
+ "ly": 190,
342
+ "am": 191,
343
+ "oun": 192,
344
+ "ers": 193,
345
+ "anguage": 194,
346
+ "for": 195,
347
+ "fr": 196,
348
+ "ll": 197,
349
+ "us": 198,
350
+ "200": 199,
351
+ "he": 200,
352
+ "tic": 201,
353
+ "pr": 202,
354
+ "di": 203,
355
+ "ow": 204,
356
+ "et": 205,
357
+ "ig": 206,
358
+ "19": 207,
359
+ "pe": 208,
360
+ "ac": 209,
361
+ ".[": 210,
362
+ "ur": 211,
363
+ "wi": 212,
364
+ "201": 213,
365
+ "ect": 214,
366
+ "iv": 215,
367
+ "ess": 216,
368
+ "The": 217,
369
+ "ol": 218,
370
+ "ter": 219,
371
+ "de": 220,
372
+ "language": 221,
373
+ "wor": 222,
374
+ "from": 223,
375
+ "un": 224,
376
+ "In": 225,
377
+ "ver": 226,
378
+ "ir": 227,
379
+ "are": 228,
380
+ "cl": 229,
381
+ "ther": 230,
382
+ "ad": 231,
383
+ "man": 232,
384
+ "con": 233,
385
+ "ab": 234,
386
+ "ex": 235,
387
+ "with": 236,
388
+ "pp": 237,
389
+ "wh": 238,
390
+ "el": 239,
391
+ "97": 240,
392
+ "ary": 241,
393
+ "10": 242,
394
+ "su": 243,
395
+ "ph": 244,
396
+ "ul": 245,
397
+ "po": 246,
398
+ "978": 247,
399
+ "ld": 248,
400
+ "ak": 249,
401
+ "si": 250,
402
+ "ru": 251,
403
+ "tive": 252,
404
+ "ds": 253,
405
+ "oc": 254,
406
+ "enc": 255
407
+ },
408
+ "merges": [
409
+ "a n",
410
+ "t h",
411
+ "i n",
412
+ "o n",
413
+ "e r",
414
+ "i s",
415
+ "e s",
416
+ "o r",
417
+ "th e",
418
+ "t i",
419
+ "a r",
420
+ "a l",
421
+ "e n",
422
+ "e d",
423
+ "o f",
424
+ "an d",
425
+ "g l",
426
+ "is h",
427
+ "n gl",
428
+ "E ngl",
429
+ "Engl ish",
430
+ "a s",
431
+ "i c",
432
+ "o u",
433
+ "2 0",
434
+ "ti on",
435
+ "in g",
436
+ "e c",
437
+ "o m",
438
+ "a t",
439
+ "s t",
440
+ "i t",
441
+ "l e",
442
+ "g e",
443
+ "r e",
444
+ "g u",
445
+ "an gu",
446
+ "angu a",
447
+ "c h",
448
+ "en t",
449
+ "v e",
450
+ "t o",
451
+ ") .",
452
+ "a tion",
453
+ "r i",
454
+ "l y",
455
+ "a m",
456
+ "ou n",
457
+ "er s",
458
+ "angua ge",
459
+ "f or",
460
+ "f r",
461
+ "l l",
462
+ "u s",
463
+ "20 0",
464
+ "h e",
465
+ "ti c",
466
+ "p r",
467
+ "d i",
468
+ "o w",
469
+ "e t",
470
+ "i g",
471
+ "1 9",
472
+ "p e",
473
+ "a c",
474
+ ". [",
475
+ "u r",
476
+ "w i",
477
+ "20 1",
478
+ "ec t",
479
+ "i v",
480
+ "es s",
481
+ "T he",
482
+ "o l",
483
+ "t er",
484
+ "d e",
485
+ "l anguage",
486
+ "w or",
487
+ "fr om",
488
+ "u n",
489
+ "I n",
490
+ "v er",
491
+ "i r",
492
+ "ar e",
493
+ "c l",
494
+ "th er",
495
+ "a d",
496
+ "m an",
497
+ "c on",
498
+ "a b",
499
+ "e x",
500
+ "wi th",
501
+ "p p",
502
+ "w h",
503
+ "e l",
504
+ "9 7",
505
+ "ar y",
506
+ "1 0",
507
+ "s u",
508
+ "p h",
509
+ "u l",
510
+ "p o",
511
+ "97 8",
512
+ "l d",
513
+ "a k",
514
+ "s i",
515
+ "r u",
516
+ "ti ve",
517
+ "d s",
518
+ "o c",
519
+ "en c"
520
+ ]
521
+ }
522
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "do_subword_tokenize": true,
  "do_word_tokenize": true,
  "jumanpp_kwargs": null,
  "mask_token": "[MASK]",
  "mecab_kwargs": {
    "mecab_dic": "unidic_lite"
  },
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "subword_tokenizer_type": "wordpiece",
  "sudachi_kwargs": null,
  "tokenizer_class": "BertJapaneseTokenizer",
  "unk_token": "[UNK]",
  "word_tokenizer_type": "mecab"
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e1e9d53471fe8d498617107aa56613e484338c9a103459729aceae427b5c8dc1
size 6776
vocab.txt ADDED
The diff for this file is too large to render. See raw diff