Areeb-02 committed
Commit
d7f1727
1 Parent(s): 7e05342

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,465 @@
+ ---
+ base_model: BAAI/bge-large-en-v1.5
+ datasets: []
+ language: []
+ library_name: sentence-transformers
+ metrics:
+ - pearson_cosine
+ - spearman_cosine
+ - pearson_manhattan
+ - spearman_manhattan
+ - pearson_euclidean
+ - spearman_euclidean
+ - pearson_dot
+ - spearman_dot
+ - pearson_max
+ - spearman_max
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:132
+ - loss:CoSENTLoss
+ widget:
+ - source_sentence: A person shall have 3045 days after commencing business within
+     the City to apply for a registration certificate.
+   sentences:
+   - The new transportation plan replaces the previous one approved by San Francisco
+     voters in 2003. |
+   - The Department of Elections is revising sections of its definitions and deleting
+     a section to operate definitions for Article 12. |
+   - A newly-established business shall have 3045 days after commencing business within
+     the City to apply for a registration certificate, and the registration fee for
+     such businesses shall be prorated based on the estimated gross receipts for the
+     tax year in which the business commences.
+ - source_sentence: The homelessness gross receipts tax is a privilege tax imposed
+     upon persons engaging in business within the City for the privilege of engaging
+     in a business or occupation in the City. |
+   sentences:
+   - The City imposes an annual Homelessness Gross Receipts Tax on businesses with
+     more than $50,000,000 in total taxable gross receipts. |
+   - The tax on Administrative Office Business Activities imposed by Section 2804.9
+     is intended as a complementary tax to the homelessness gross receipts tax, and
+     shall be considered a homelessness gross receipts tax for purposes of this Article
+     28. |
+   - '"The 5YPPs shall at a minimum address the following factors: compatibility with
+     existing and planned land uses, and with adopted standards for urban design and
+     for the provision of pedestrian amenities; and supportiveness of planned growth
+     in transit-friendly housing, employment, and services." |'
+ - source_sentence: '"The total worldwide compensation paid by the person and all related
+     entities to the person is referred to as combined payroll." |'
+   sentences:
+   - '"A taxpayer is eligible to claim a credit against their immediately succeeding
+     payments due for tax years or periods ending on or before December 31, 2024, of
+     the respective tax type by applying all or part of an overpayment of the Homelessness
+     Gross Receipts Tax in Article 28 (including the homelessness administrative office
+     tax under Section 2804(d) of Article 28)." |'
+   - '"Receipts from the sale of real property are exempt from the gross receipts tax
+     if the Real Property Transfer Tax imposed by Article 12-C has been paid to the
+     City."'
+   - '"The total amount paid for compensation in the City by the person and by all
+     related entities to the person is referred to as payroll in the City." |'
+ - source_sentence: '"The gross receipts tax rates applicable to Category 6 Business
+     Activities are determined based on the amount of taxable gross receipts from these
+     activities." |'
+   sentences:
+   - '"The project meets the criteria outlined in Section 131051(d) of the Public Utilities
+     Code."'
+   - For the business activity of clean technology, a tax rate of 0.175% (e.g. $1.75
+     per $1,000) applies to taxable gross receipts between $0 and $1,000,000 for tax
+     years beginning on or after January 1, 2021 through and including 2024. |
+   - '"The tax rates for Category 7 Business Activities are also determined based on
+     the amount of taxable gross receipts." |'
+ - source_sentence: '"Compensation" refers to wages, salaries, commissions, bonuses,
+     and property issued or transferred in exchange for services, as well as compensation
+     for services to owners of pass-through entities, and any other form of remuneration
+     paid to employees for services.'
+   sentences:
+   - '"Every person engaging in business within the City as an administrative office,
+     as defined below, shall pay an annual administrative office tax measured by its
+     total payroll expense that is attributable to the City:" |'
+   - '"Remuneration" refers to any payment or reward, including but not limited to
+     wages, salaries, commissions, bonuses, and property issued or transferred in exchange
+     for services, as well as compensation for services to owners of pass-through entities,
+     and any other form of compensation paid to employees for services.'
+   - '"Construction of new Americans with Disabilities Act (ADA)-compliant curb ramps
+     and related roadway work to permit ease of movement." |'
+ model-index:
+ - name: SentenceTransformer based on BAAI/bge-large-en-v1.5
+   results:
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: pearson_cosine
+       value: 0.3338547038124495
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.41279297374061835
+       name: Spearman Cosine
+     - type: pearson_manhattan
+       value: 0.3102979152053135
+       name: Pearson Manhattan
+     - type: spearman_manhattan
+       value: 0.41673878893078603
+       name: Spearman Manhattan
+     - type: pearson_euclidean
+       value: 0.30953937257079917
+       name: Pearson Euclidean
+     - type: spearman_euclidean
+       value: 0.41279297374061835
+       name: Spearman Euclidean
+     - type: pearson_dot
+       value: 0.3338548430968143
+       name: Pearson Dot
+     - type: spearman_dot
+       value: 0.41279297374061835
+       name: Spearman Dot
+     - type: pearson_max
+       value: 0.3338548430968143
+       name: Pearson Max
+     - type: spearman_max
+       value: 0.41673878893078603
+       name: Spearman Max
+ ---
+ 
+ # SentenceTransformer based on BAAI/bge-large-en-v1.5
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) <!-- at revision d4aa6901d3a41ba39fb536a557fa166f842b0e09 -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+ 
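+ For reference, here is a minimal sketch of how the same three-module stack could be assembled by hand with `sentence_transformers.models`. Note that this builds the architecture from the base weights; loading the checkpoint by name, as in the Usage section below, restores the fine-tuned weights automatically.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, models
+ 
+ # Transformer -> CLS pooling -> Normalize, mirroring the modules above.
+ word_embedding_model = models.Transformer("BAAI/bge-large-en-v1.5", max_seq_length=512)
+ pooling_model = models.Pooling(
+     word_embedding_model.get_word_embedding_dimension(),  # 1024
+     pooling_mode="cls",  # CLS-token pooling, per 1_Pooling/config.json
+ )
+ normalize = models.Normalize()
+ model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])
+ ```
+ 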
+ ## Usage
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ First install the Sentence Transformers library:
+ 
+ ```bash
+ pip install -U sentence-transformers
+ ```
+ 
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("Areeb-02/bge-large-en-v1.5-CosentLoss")
+ # Run inference
+ sentences = [
+     '"Compensation" refers to wages, salaries, commissions, bonuses, and property issued or transferred in exchange for services, as well as compensation for services to owners of pass-through entities, and any other form of remuneration paid to employees for services.',
+     '"Remuneration" refers to any payment or reward, including but not limited to wages, salaries, commissions, bonuses, and property issued or transferred in exchange for services, as well as compensation for services to owners of pass-through entities, and any other form of compensation paid to employees for services.',
+     '"Every person engaging in business within the City as an administrative office, as defined below, shall pay an annual administrative office tax measured by its total payroll expense that is attributable to the City:" |',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
+ 
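+ Because the model L2-normalizes its embeddings, it can also be used directly for semantic search. A short sketch follows; the query and corpus strings are illustrative, not drawn from the training data:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("Areeb-02/bge-large-en-v1.5-CosentLoss")
+ 
+ # Illustrative corpus of gross-receipts-style sentences.
+ corpus = [
+     "Receipts from the sale of real property are exempt from the gross receipts tax.",
+     "The registration fee is prorated based on estimated gross receipts.",
+ ]
+ corpus_embeddings = model.encode(corpus)
+ query_embedding = model.encode("Which receipts are exempt from the gross receipts tax?")
+ 
+ # Top-1 nearest corpus entry by cosine similarity.
+ hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
+ print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}]
+ ```
+ 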
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ ## Evaluation
+ 
+ ### Metrics
+ 
+ #### Semantic Similarity
+ 
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | pearson_cosine      | 0.3339     |
+ | **spearman_cosine** | **0.4128** |
+ | pearson_manhattan   | 0.3103     |
+ | spearman_manhattan  | 0.4167     |
+ | pearson_euclidean   | 0.3095     |
+ | spearman_euclidean  | 0.4128     |
+ | pearson_dot         | 0.3339     |
+ | spearman_dot        | 0.4128     |
+ | pearson_max         | 0.3339     |
+ | spearman_max        | 0.4167     |
+ 
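+ A sketch of how a comparable evaluation could be reproduced with the evaluator linked above; the sentence pairs and gold scores here are placeholders, since the evaluation split is not published with this card:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+ 
+ model = SentenceTransformer("Areeb-02/bge-large-en-v1.5-CosentLoss")
+ 
+ # Placeholder pairs with gold similarity scores in [0, 1].
+ evaluator = EmbeddingSimilarityEvaluator(
+     sentences1=["Receipts from real property sales are exempt from the tax.",
+                 "The registration fee is prorated."],
+     sentences2=["Real property sale receipts are not subject to the tax.",
+                 "The homelessness tax is a privilege tax."],
+     scores=[0.95, 0.3],
+ )
+ results = evaluator(model)  # Pearson/Spearman over cosine, Euclidean, Manhattan, dot
+ print(results)
+ ```
+ 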
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ #### Unnamed Dataset
+ 
+ * Size: 132 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence1                                                                            | sentence2                                                                            | score                                                            |
+   |:--------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------|
+   | type    | string                                                                               | string                                                                               | float                                                            |
+   | details | <ul><li>min: 10 tokens</li><li>mean: 41.99 tokens</li><li>max: 126 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 42.72 tokens</li><li>max: 162 tokens</li></ul> | <ul><li>min: 0.25</li><li>mean: 0.93</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence1 | sentence2 | score |
+   |:----------|:----------|:------|
+   | <code>"Gross receipts as defined in Section 952.3 shall not include receipts from any sales of real property with respect to which the Real Property Transfer Tax imposed by Article 12-C has been paid to the City."</code> | <code>"Receipts from the sale of real property are exempt from the gross receipts tax if the Real Property Transfer Tax imposed by Article 12-C has been paid to the City."</code> | <code>1.0</code> |
+   | <code>For tax years beginning on or after January 1, 2025, any person or combined group, except for a lessor of residential real estate, whose gross receipts within the City did not exceed $5,000,000, adjusted annually in accordance with the increase in the Consumer Price Index: All Urban Consumers for the San Francisco/Oakland/Hayward Area for All Items as reported by the United States Bureau of Labor Statistics, or any successor to that index, as of December 31 of the calendar year two years prior to the tax year, beginning with tax year 2026, and rounded to the nearest $10,000.</code> | <code>For taxable years ending on or before December 31, 2024, using the rules set forth in Sections 956.1 and 956.2, in the manner directed in Sections 953.1 through 953.7, inclusive, and in Section 953.9 of this Article 12-A-1; and</code> | <code>0.95</code> |
+   | <code>"San Francisco Gross Receipts" refers to the revenue generated from sales and services within the city limits of San Francisco.</code> | <code>"Revenue generated from sales and services within the city limits of San Francisco"</code> | <code>1.0</code> |
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "pairwise_cos_sim"
+   }
+   ```
+ 
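+ A sketch of how this fine-tune could be reproduced with CoSENTLoss and the non-default hyperparameters listed in the next section; the example rows are illustrative stand-ins for the 132-sample dataset, which is not published:
+ 
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import CoSENTLoss
+ 
+ model = SentenceTransformer("BAAI/bge-large-en-v1.5")
+ 
+ # Illustrative (sentence1, sentence2, score) rows in the card's column format.
+ train_dataset = Dataset.from_dict({
+     "sentence1": ["Receipts from real property sales are exempt from the tax."],
+     "sentence2": ["Real property sale receipts are not subject to the tax."],
+     "score": [0.95],
+ })
+ 
+ loss = CoSENTLoss(model, scale=20.0)  # scale matches the parameters above
+ 
+ args = SentenceTransformerTrainingArguments(
+     output_dir="bge-large-en-v1.5-CosentLoss",
+     num_train_epochs=5,
+     per_device_train_batch_size=16,
+     warmup_ratio=0.1,
+     fp16=True,
+ )
+ trainer = SentenceTransformerTrainer(
+     model=model, args=args, train_dataset=train_dataset, loss=loss
+ )
+ trainer.train()
+ ```
+ 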
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+ 
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 5
+ - `warmup_ratio`: 0.1
+ - `fp16`: True
+ 
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 5
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ 
+ </details>
+ 
+ ### Training Logs
+ | Epoch | Step | spearman_cosine |
+ |:-----:|:----:|:---------------:|
+ | 3.0   | 51   | 0.4078          |
+ | 5.0   | 45   | 0.4128          |
+ 
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.42.0.dev0
+ - PyTorch: 2.3.0+cu121
+ - Accelerate: 0.31.0
+ - Datasets: 2.19.2
+ - Tokenizers: 0.19.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### CoSENTLoss
+ ```bibtex
+ @online{kexuefm-8847,
+     title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
+     author={Su Jianlin},
+     year={2022},
+     month={Jan},
+     url={https://kexue.fm/archives/8847},
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "BAAI/bge-large-en-v1.5",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.42.0.dev0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.42.0.dev0",
+     "pytorch": "2.3.0+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa897fc8821e685d7452d8577b3d10ad3c9af1b204eb30aa5910c77e191f8c38
+ size 1340612432
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff