Bofandra committed
Commit 46d1967
Parent: a288521

Add new SentenceTransformer model.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,366 @@
+ ---
+ base_model: Bofandra/fine-tuning-use-cmlm-multilingual-quran
+ datasets: []
+ language: []
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:6225
+ - loss:MegaBatchMarginLoss
+ widget:
+ - source_sentence: يا أيها الذين آمنوا لا تتخذوا الكافرين أولياء من دون المؤمنين أتريدون
+     أن تجعلوا لله عليكم سلطانا مبينا
+   sentences:
+   - And when he attained his full strength and was [mentally] mature, We bestowed
+     upon him judgement and knowledge. And thus do We reward the doers of good.
+   - Then Moses threw his staff, and at once it devoured what they falsified.
+   - O you who have believed, do not take the disbelievers as allies instead of the
+     believers. Do you wish to give Allah against yourselves a clear case?
+ - source_sentence: قال لم أكن لأسجد لبشر خلقته من صلصال من حمإ مسنون
+   sentences:
+   - And We left it as a sign, so is there any who will remember?
+   - Gardens of perpetual residence, whose doors will be opened to them.
+   - He said, "Never would I prostrate to a human whom You created out of clay from
+     an altered black mud."
+ - source_sentence: وسخر لكم الشمس والقمر دائبين وسخر لكم الليل والنهار
+   sentences:
+   - And He subjected for you the sun and the moon, continuous [in orbit], and subjected
+     for you the night and the day.
+   - And We called him from the side of the mount at [his] right and brought him near,
+     confiding [to him].
+   - And We send not the messengers except as bringers of good tidings and warners.
+     And those who disbelieve dispute by [using] falsehood to [attempt to] invalidate
+     thereby the truth and have taken My verses, and that of which they are warned,
+     in ridicule.
+ - source_sentence: إذ دخلوا عليه فقالوا سلاما قال إنا منكم وجلون
+   sentences:
+   - Indeed, your Lord is most knowing of who strays from His way, and He is most knowing
+     of the [rightly] guided.
+   - Then they turned away from him and said, "[He was] taught [and is] a madman."
+   - When they entered upon him and said, "Peace." [Abraham] said, "Indeed, we are
+     fearful of you."
+ - source_sentence: فأما من أوتي كتابه بيمينه فيقول هاؤم اقرءوا كتابيه
+   sentences:
+   - So as for he who is given his record in his right hand, he will say, "Here, read
+     my record!
+   - And whoever is patient and forgives - indeed, that is of the matters [requiring]
+     determination.
+   - Indeed, he had [once] been among his people in happiness;
+ ---
+ 
+ # SentenceTransformer based on Bofandra/fine-tuning-use-cmlm-multilingual-quran
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Bofandra/fine-tuning-use-cmlm-multilingual-quran](https://huggingface.co/Bofandra/fine-tuning-use-cmlm-multilingual-quran). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Bofandra/fine-tuning-use-cmlm-multilingual-quran](https://huggingface.co/Bofandra/fine-tuning-use-cmlm-multilingual-quran) <!-- at revision 7a1271e8909e29e5840b034feeefe22e45dd7a97 -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
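+ 
+ If you want to rebuild or modify this stack (for example to change the pooling mode), the three modules can be assembled by hand. This is a minimal sketch, assuming the base checkpoint is available on the Hub; it is not part of the original training code:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, models
+ 
+ # Transformer encoder (BertModel), truncating inputs at 256 tokens
+ word_embedding_model = models.Transformer(
+     "Bofandra/fine-tuning-use-cmlm-multilingual-quran", max_seq_length=256
+ )
+ # Mean pooling over token embeddings -> one 768-dim vector per input
+ pooling_model = models.Pooling(
+     word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
+ )
+ # L2-normalize so that dot product equals cosine similarity
+ normalize_model = models.Normalize()
+ 
+ model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize_model])
+ ```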
86
+
87
+ ## Usage
88
+
89
+ ### Direct Usage (Sentence Transformers)
90
+
91
+ First install the Sentence Transformers library:
92
+
93
+ ```bash
94
+ pip install -U sentence-transformers
95
+ ```
96
+
97
+ Then you can load this model and run inference.
98
+ ```python
99
+ from sentence_transformers import SentenceTransformer
100
+
101
+ # Download from the 🤗 Hub
102
+ model = SentenceTransformer("Bofandra/fine-tuning-use-cmlm-multilingual-quran-translation")
103
+ # Run inference
104
+ sentences = [
105
+ 'فأما من أوتي كتابه بيمينه فيقول هاؤم اقرءوا كتابيه',
106
+ 'So as for he who is given his record in his right hand, he will say, "Here, read my record!',
107
+ 'Indeed, he had [once] been among his people in happiness;',
108
+ ]
109
+ embeddings = model.encode(sentences)
110
+ print(embeddings.shape)
111
+ # [3, 768]
112
+
113
+ # Get the similarity scores for the embeddings
114
+ similarities = model.similarity(embeddings, embeddings)
115
+ print(similarities.shape)
116
+ # [3, 3]
117
+ ```
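+ 
+ Since the embeddings are L2-normalized by the final `Normalize` module, they can be used directly for cross-lingual retrieval, e.g. finding the English translation closest to an Arabic verse. The following is an illustrative sketch (the candidate pool is just the widget examples from this card, not a real corpus):
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("Bofandra/fine-tuning-use-cmlm-multilingual-quran-translation")
+ 
+ query = "وسخر لكم الشمس والقمر دائبين وسخر لكم الليل والنهار"
+ corpus = [
+     "And He subjected for you the sun and the moon, continuous [in orbit], and subjected for you the night and the day.",
+     "And We left it as a sign, so is there any who will remember?",
+     "Gardens of perpetual residence, whose doors will be opened to them.",
+ ]
+ 
+ query_embedding = model.encode(query, convert_to_tensor=True)
+ corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
+ 
+ # Retrieve the single best match by cosine similarity
+ hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
+ print(corpus[hits[0][0]["corpus_id"]], hits[0][0]["score"])
+ ```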
+ 
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ #### Unnamed Dataset
+ 
+ * Size: 6,225 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0                                                                          | sentence_1                                                                          |
+   |:--------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
+   | type    | string                                                                                | string                                                                                |
+   | details | <ul><li>min: 3 tokens</li><li>mean: 23.11 tokens</li><li>max: 163 tokens</li></ul>   | <ul><li>min: 6 tokens</li><li>mean: 34.63 tokens</li><li>max: 180 tokens</li></ul>   |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>ومن آياته أنك ترى الأرض خاشعة فإذا أنزلنا عليها الماء اهتزت وربت إن الذي أحياها لمحيي الموتى إنه على كل شيء قدير</code> | <code>And of His signs is that you see the earth stilled, but when We send down upon it rain, it quivers and grows. Indeed, He who has given it life is the Giver of Life to the dead. Indeed, He is over all things competent.</code> |
+   | <code>من دون الله قالوا ضلوا عنا بل لم نكن ندعو من قبل شيئا كذلك يضل الله الكافرين</code> | <code>Other than Allah?" They will say, "They have departed from us; rather, we did not used to invoke previously anything." Thus does Allah put astray the disbelievers.</code> |
+   | <code>أرأيت الذي ينهى</code> | <code>Have you seen the one who forbids</code> |
+ * Loss: [<code>MegaBatchMarginLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#megabatchmarginloss) (see the training sketch below)
+ 
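+ The training pairs are Arabic ayah text (`sentence_0`) matched with an English translation (`sentence_1`). A minimal training sketch using the Sentence Transformers v3 trainer API is shown below; it is an approximation of the setup described in this card (same loss, epochs, and batch size), not the exact script that produced the model:
+ 
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
+ from sentence_transformers.training_args import SentenceTransformerTrainingArguments
+ 
+ # Toy stand-in for the 6,225 (Arabic, English) pairs; replace with the real data.
+ train_dataset = Dataset.from_dict({
+     "sentence_0": [
+         "أرأيت الذي ينهى",
+         "فأما من أوتي كتابه بيمينه فيقول هاؤم اقرءوا كتابيه",
+         "إذ دخلوا عليه فقالوا سلاما قال إنا منكم وجلون",
+     ],
+     "sentence_1": [
+         "Have you seen the one who forbids",
+         'So as for he who is given his record in his right hand, he will say, "Here, read my record!',
+         'When they entered upon him and said, "Peace." [Abraham] said, "Indeed, we are fearful of you."',
+     ],
+ })
+ 
+ model = SentenceTransformer("Bofandra/fine-tuning-use-cmlm-multilingual-quran")
+ loss = losses.MegaBatchMarginLoss(model=model)
+ 
+ args = SentenceTransformerTrainingArguments(
+     output_dir="outputs",
+     num_train_epochs=1,
+     per_device_train_batch_size=4,
+ )
+ 
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     loss=loss,
+ )
+ trainer.train()
+ ```
+ 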
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+ 
+ - `per_device_train_batch_size`: 4
+ - `per_device_eval_batch_size`: 4
+ - `num_train_epochs`: 1
+ - `multi_dataset_batch_sampler`: round_robin
+ 
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 4
+ - `per_device_eval_batch_size`: 4
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+ 
+ </details>
+ 
+ ### Training Logs
+ | Epoch  | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 0.3211 | 500  | 0.2393        |
+ | 0.6423 | 1000 | 0.1212        |
+ | 0.9634 | 1500 | 0.0715        |
+ 
+ 
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.3.0+cu121
+ - Accelerate: 0.31.0
+ - Datasets: 2.20.0
+ - Tokenizers: 0.19.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### MegaBatchMarginLoss
+ ```bibtex
+ @inproceedings{wieting-gimpel-2018-paranmt,
+     title = "{P}ara{NMT}-50{M}: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations",
+     author = "Wieting, John and Gimpel, Kevin",
+     editor = "Gurevych, Iryna and Miyao, Yusuke",
+     booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+     month = jul,
+     year = "2018",
+     address = "Melbourne, Australia",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/P18-1042",
+     doi = "10.18653/v1/P18-1042",
+     pages = "451--462",
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "Bofandra/fine-tuning-use-cmlm-multilingual-quran",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 501153
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.3.0+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a0132624d7a05a7ab685d73a2f7898c32196e0015b158c59ea9814c6fdff9d5
+ size 1883730160
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92262b29204f8fdc169a63f9005a0e311a16262cef4d96ecfe2a7ed638662ed3
+ size 13632172
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 256,
+   "model_max_length": 256,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff