
SentenceTransformer based on dunzhang/stella_en_400M_v5

This is a sentence-transformers model finetuned from dunzhang/stella_en_400M_v5. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: dunzhang/stella_en_400M_v5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
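
The three modules above can be confirmed programmatically once the model is loaded. The following is a minimal sketch; the printed values assume the configuration listed under Model Details:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kperkins411/stella_en_400M_v5_MultipleNegativesRankingLoss")

# Maximum sequence length and embedding size from the Model Details section
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024

# The module pipeline: Transformer -> Pooling (mean) -> Dense (1024 -> 1024)
for idx, module in enumerate(model):
    print(idx, module.__class__.__name__)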

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("kperkins411/stella_en_400M_v5_MultipleNegativesRankingLoss")
# Run inference
sentences = [
    'How might an individual residing in the western coastal state of the U.S. obtain a record of the entities to which a particular application has provided their personal data for marketing use within the last calendar year?',
    'Specific Location Practices: California, EU residents California Privacy Rights Residents of the State of California can request a list of all third-parties to which our App has disclosed certain personal information (as defined by California law) during the preceding year for those third-parties\' direct marketing purposes. If you are a California resident and want such a list, please contact us at CaliforniaRequest@viber.com. For all requests, please ensure you put the statement "Your California Privacy Rights" in the body of your request, as well as your name, street address, city, state, and zip code. In the body of your request, please provide enough information for us to determine if this applies to you. You need to attest to the fact that you are a California resident and provide a current California address for our response. Please note that we will not accept requests via the telephone, mail, or by facsimile, and we are not responsible for notices that are not labeled or sent properly, or that do not have complete information. Viber does not currently take actions to respond to Do Not Track signals because a uniform technological standard has not yet been developed. We continue to review new technologies and may adopt a standard once one is created.',
    'The term “Confidential Information” means any and all tangible and intangible information disclosed to Receiver in oral, written, graphic, recorded, photographic, any machine-readable or in any other medium or form relating to the intellectual property, management, operations, products, inventions, suppliers, customers, financials of VIDAR or any present or contemplated project, contract or relationship between VIDAR and Receiver, including without limitation, any and all plans, Intellectual Property (defined below), know-how, computer programs, software (source and object code), algorithms, computer processing systems, techniques, methodologies, formulae, compilations of information, designs, drawings, schematics, analyses, evaluations, formulations, ingredients, samples, processes, machines, prototypes, mock-ups, product performance data, proposals, job notes, reports, records, specifications, manuals, supplier and customer lists and information, licenses, the prices it obtains or has obtained for the licensing of its software products and services, purchase and sales records, marketing information or any other information concerning the business and goodwill of VIDAR and any information which is identified as being of a confidential or proprietary nature or should be considered confidential under the circumstances.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
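
The same embeddings also work for semantic search over a corpus. Below is a hedged sketch using sentence_transformers.util.semantic_search, continuing from the snippet above and reusing the loaded model; the corpus and query are illustrative stand-ins, not data from the training set:

from sentence_transformers import util

# Illustrative corpus of policy/contract clauses and a single query
corpus = [
    "Residents of California may request a list of third parties that received their data for marketing.",
    "Confidential Information includes trade secrets, designs, customer lists, and pricing.",
]
query = "How can a California resident find out who received their personal data?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns, per query, the top_k corpus indices with cosine similarity scores
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])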

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.5986
cosine_accuracy@3 0.752
cosine_accuracy@5 0.8008
cosine_accuracy@10 0.8527
cosine_precision@1 0.5986
cosine_precision@3 0.2507
cosine_precision@5 0.1602
cosine_precision@10 0.0853
cosine_recall@1 0.5986
cosine_recall@3 0.752
cosine_recall@5 0.8008
cosine_recall@10 0.8527
cosine_ndcg@10 0.7263
cosine_mrr@10 0.6858
cosine_map@100 0.6903
dot_accuracy@1 0.5937
dot_accuracy@3 0.7425
dot_accuracy@5 0.8008
dot_accuracy@10 0.8512
dot_precision@1 0.5937
dot_precision@3 0.2475
dot_precision@5 0.1602
dot_precision@10 0.0851
dot_recall@1 0.5937
dot_recall@3 0.7425
dot_recall@5 0.8008
dot_recall@10 0.8512
dot_ndcg@10 0.7219
dot_mrr@10 0.6805
dot_map@100 0.6852
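
These are the kind of metrics reported by sentence_transformers.evaluation.InformationRetrievalEvaluator. The sketch below shows the evaluation wiring on a toy query/corpus pair; the ids and texts are illustrative, and the actual evaluation corpus is not reproduced here:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("kperkins411/stella_en_400M_v5_MultipleNegativesRankingLoss")

# Toy example: query id -> text, corpus id -> text, and relevance judgments
queries = {"q1": "Who can request a list of third parties that received their personal data?"}
corpus = {
    "d1": "Residents of the State of California can request a list of all third-parties to which our App has disclosed certain personal information.",
    "d2": "Confidential Information means any data or information that is proprietary to the Parties and not generally known to the public.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy-ir")
results = evaluator(model)
print(results)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100 (returned as a dict in Sentence Transformers 3.x)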

Training Details

Training Dataset

Unnamed Dataset

  • Size: 491,850 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 7 tokens, mean 17.09 tokens, max 58 tokens
    • positive (string): min 7 tokens, mean 102.69 tokens, max 512 tokens
    • negative (string): min 6 tokens, mean 96.04 tokens, max 512 tokens
  • Samples:
    • anchor: What safeguards are in place to protect the information obtained from third-party sources?
      positive: Information We Collect From Other Sources We may also receive information from other sources and combine that with information we collect through our Services. For example: If you choose to link, create, or log in to your Uber account with a payment provider (e.g., Google Wallet) or social media service (e.g., Facebook), or if you engage with a separate app or website that uses our API (or whose API we use), we may receive information about you or your connections from that site or app.
      negative: We receive data from Public Resources (as defined under the Terms of Service) associated with users and user Contacts, including from social networks to which users or user Contacts are registered, such as Facebook, Google+, Linkedin, Twitter, and Foursquare.
    • anchor: What safeguards are in place to protect the information obtained from third-party sources?
      positive: Information We Collect From Other Sources We may also receive information from other sources and combine that with information we collect through our Services. For example: If you choose to link, create, or log in to your Uber account with a payment provider (e.g., Google Wallet) or social media service (e.g., Facebook), or if you engage with a separate app or website that uses our API (or whose API we use), we may receive information about you or your connections from that site or app.
      negative: You also may be able to link an account from a social networking service (e.g., Facebook, Google+, Yahoo!) to an account through our Services. This may allow you to use your credentials from the other site or service to sign in to certain features on our Services. If you link your account from a third-party site or service, we may collect information from those third-party accounts, and any information that we collect will be governed by this Privacy Policy.
    • anchor: What safeguards are in place to protect the information obtained from third-party sources?
      positive: Information We Collect From Other Sources We may also receive information from other sources and combine that with information we collect through our Services. For example: If you choose to link, create, or log in to your Uber account with a payment provider (e.g., Google Wallet) or social media service (e.g., Facebook), or if you engage with a separate app or website that uses our API (or whose API we use), we may receive information about you or your connections from that site or app.
      negative: Information We Collect Personal data ("Personal Information") may be required to use some services offered by PSafe, or users may have the option of providing it, including name, home address, email address and contact telephone number. We may collect Personal Information about you from third parties and add this information to the information we have already collected from you via our services. PSafe may confirm the provided Personal Information by consulting with public authorities, specialized companies or databases. The information that PSafe obtains from these entities will be treated confidentially.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
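
For reference, here is a minimal sketch of how triplets with these column names and this loss configuration are typically wired together. The in-memory triplet and the trust_remote_code flag are assumptions, not taken from the actual training script:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses, util

# The base model uses a custom architecture (NewModel), so remote code may need to be trusted
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

# Illustrative (anchor, positive, negative) triplets with the column names listed above
train_dataset = Dataset.from_dict({
    "anchor": ["What safeguards are in place to protect the information obtained from third-party sources?"],
    "positive": ["Information We Collect From Other Sources We may also receive information from other sources ..."],
    "negative": ["We receive data from Public Resources (as defined under the Terms of Service) ..."],
})

# Matches the parameters listed above: scale=20.0 with cosine similarity
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)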
    

Evaluation Dataset

Unnamed Dataset

  • Size: 6,000 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 8 tokens, mean 23.16 tokens, max 124 tokens
    • positive (string): min 7 tokens, mean 96.66 tokens, max 512 tokens
    • negative (string): min 6 tokens, mean 94.79 tokens, max 512 tokens
  • Samples:
    • anchor: What term is used to describe sensitive materials unique to the involved entities and not accessible by the general populace, regardless of its physical state or the manner of its revelation?
      positive: For purposes of this Agreement, "Confidential Information" means any data or information that is proprietary to the Parties and not generally known to the public, whether in tangible or intangible form, whenever and however disclosed, including but not limited to:
      negative: A. "Confidential Information" of a party shall mean any trade secrets, know-how, inventions, products, designs, methods, techniques, systems, processes, software programs, works of authorship, business plans, customer lists, projects, plans, pricing, proposals and any other information which a party discloses to the Recipient Party that: (i) if disclosed in writing is clearly marked as confidential or carries a similar legend; or (ii) if disclosed verbally or in tangible form is identified as confidential at the time of disclosure, then summarized in a writing so marked by the Disclosing Party and delivered to the Recipient Party with fifteen (15) days.
    • anchor: What term is used to describe sensitive materials unique to the involved entities and not accessible by the general populace, regardless of its physical state or the manner of its revelation?
      positive: For purposes of this Agreement, "Confidential Information" means any data or information that is proprietary to the Parties and not generally known to the public, whether in tangible or intangible form, whenever and however disclosed, including but not limited to:
      negative: 1. Disclosure: Recipient agrees not to disclose and the Discloser agrees to let the Recipient have the access to the Confidential Information as identified and reduced in writing or provided verbally or in any other way not reduced in writing at the time of such disclosure of the information.
    • anchor: What term is used to describe sensitive materials unique to the involved entities and not accessible by the general populace, regardless of its physical state or the manner of its revelation?
      positive: For purposes of this Agreement, "Confidential Information" means any data or information that is proprietary to the Parties and not generally known to the public, whether in tangible or intangible form, whenever and however disclosed, including but not limited to:
      negative: Confidential Information - information of whatever kind and in whatever form contained (and includes in particular but without prejudice to the generality of the foregoing, documents, drawings, computerized information, films, tapes, specifications, designs, models, equipment or data of any kind) which is clearly identified by the Disclosing Party as confidential by an appropriate legend or if orally disclosed then upon disclosure or within 30 days of such oral disclosure identified in writing by the Disclosing Party as confidential.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
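
Put together, these non-default values map onto the Sentence Transformers 3.x trainer roughly as follows. This is a hedged sketch continuing the dataset and loss sketch from the Training Dataset section; the output directory, the eval_dataset, and the explicit save_strategy (needed so that load_best_model_at_end can match the epoch-level evaluation) are assumptions:

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="stella_en_400M_v5_mnrl",  # placeholder output path
    eval_strategy="epoch",
    save_strategy="epoch",                # assumed, so the best checkpoint can be restored per epoch
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    num_train_epochs=2,
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,                  # SentenceTransformer from the loss sketch above
    args=args,
    train_dataset=train_dataset,  # (anchor, positive, negative) triplets
    eval_dataset=eval_dataset,    # held-out triplets built the same way (assumed)
    loss=loss,                    # MultipleNegativesRankingLoss(scale=20.0, cos_sim)
)
trainer.train()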

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch    Step    Training Loss    Validation Loss    stella_en_400M_v5_cosine_map@100
0 0 - - 0.5279
0.0260 100 1.5185 - -
0.0520 200 0.9779 - -
0.0781 300 0.828 - -
0.1041 400 0.7038 - -
0.1301 500 0.6537 - -
0.1561 600 0.5801 - -
0.1821 700 0.5588 - -
0.2082 800 0.5124 - -
0.2342 900 0.4827 - -
0.2602 1000 0.4672 - -
0.2862 1100 0.4285 - -
0.3123 1200 0.3965 - -
0.3383 1300 0.3759 - -
0.3643 1400 0.3612 - -
0.3903 1500 0.3209 - -
0.4163 1600 0.3108 - -
0.4424 1700 0.3012 - -
0.4684 1800 0.2837 - -
0.4944 1900 0.2801 - -
0.5204 2000 0.2581 - -
0.5464 2100 0.2502 - -
0.5725 2200 0.2502 - -
0.5985 2300 0.2271 - -
0.6245 2400 0.2265 - -
0.6505 2500 0.2144 - -
0.6766 2600 0.2161 - -
0.7026 2700 0.2071 - -
0.7286 2800 0.197 - -
0.7546 2900 0.1875 - -
0.7806 3000 0.1846 - -
0.8067 3100 0.1827 - -
0.8327 3200 0.1732 - -
0.8587 3300 0.1778 - -
0.8847 3400 0.1679 - -
0.9107 3500 0.1685 - -
0.9368 3600 0.165 - -
0.9628 3700 0.1716 - -
0.9888 3800 0.1593 - -
1.0 3843 - 0.9541 -
1.0148 3900 0.1463 - -
1.0409 4000 0.1482 - -
1.0669 4100 0.1446 - -
1.0929 4200 0.1481 - -
1.1189 4300 0.15 - -
1.1449 4400 0.1446 - -
1.1710 4500 0.1414 - -
1.1970 4600 0.1427 - -
1.2230 4700 0.1385 - -
1.2490 4800 0.134 - -
1.2750 4900 0.1343 - -
1.3011 5000 0.1462 - -
1.3271 5100 0.1343 - -
1.3531 5200 0.1324 - -
1.3791 5300 0.125 - -
1.4052 5400 0.1299 - -
1.4312 5500 0.1237 - -
1.4572 5600 0.1349 - -
1.4832 5700 0.1303 - -
1.5092 5800 0.1272 - -
1.5353 5900 0.1238 - -
1.5613 6000 0.1194 - -
1.5873 6100 0.1267 - -
1.6133 6200 0.1187 - -
1.6393 6300 0.123 - -
1.6654 6400 0.1183 - -
1.6914 6500 0.1245 - -
1.7174 6600 0.1173 - -
1.7434 6700 0.1164 - -
1.7695 6800 0.1169 - -
1.7955 6900 0.1181 - -
1.8215 7000 0.1188 - -
1.8475 7100 0.1166 - -
1.8735 7200 0.1144 - -
1.8996 7300 0.1116 - -
1.9256 7400 0.1149 - -
1.9516 7500 0.1137 - -
1.9776 7600 0.1113 - -
2.0 7686 - 1.0487 0.6903
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.9
  • Sentence Transformers: 3.1.0.dev0
  • Transformers: 4.41.2
  • PyTorch: 2.4.0+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}