metadata
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
- dot_accuracy@1
- dot_accuracy@3
- dot_accuracy@5
- dot_accuracy@10
- dot_precision@1
- dot_precision@3
- dot_precision@5
- dot_precision@10
- dot_recall@1
- dot_recall@3
- dot_recall@5
- dot_recall@10
- dot_ndcg@10
- dot_mrr@10
- dot_map@100
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:800
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
How have algorithms in hiring and credit decisions been shown to impact
existing inequities, according to the context?
sentences:
- >-
Shoshana Zuboff. The Age of Surveillance Capitalism: The Fight for a
Human Future at the New Frontier of
Power. Public Affairs. 2019.
64. Angela Chen. Why the Future of Life Insurance May Depend on Your
Online Presence. The Verge. Feb.
7, 2019.
https://www.theverge.com/2019/2/7/18211890/social-media-life-insurance-new-york-algorithms-big
data-discrimination-online-records
68
- >-
SECTION TITLE
FOREWORD
Among the great challenges posed to democracy today is the use of
technology, data, and automated systems in
ways that threaten the rights of the American public. Too often, these
tools are used to limit our opportunities and
prevent our access to critical resources or services. These problems are
well documented. In America and around
the world, systems supposed to help with patient care have proven
unsafe, ineffective, or biased. Algorithms used
in hiring and credit decisions have been found to reflect and reproduce
existing unwanted inequities or embed
new harmful bias and discrimination. Unchecked social media data
collection has been used to threaten people’s
- >-
ways and to the greatest extent possible; where not possible,
alternative privacy by design safeguards should be
used. Systems should not employ user experience and design decisions
that obfuscate user choice or burden
users with defaults that are privacy invasive. Consent should only be
used to justify collection of data in cases
where it can be appropriately and meaningfully given. Any consent
requests should be brief, be understandable
in plain language, and give you agency over data collection and the
specific context of use; current hard-to
understand notice-and-choice practices for broad uses of data should be
changed. Enhanced protections and
- source_sentence: >-
What factors should be considered when tailoring the extent of explanation
provided by a system based on risk level?
sentences:
- >-
ENDNOTES
96. National Science Foundation. NSF Program on Fairness in Artificial
Intelligence in Collaboration
with Amazon (FAI). Accessed July 20, 2022.
https://www.nsf.gov/pubs/2021/nsf21585/nsf21585.htm
97. Kyle Wiggers. Automatic signature verification software threatens to
disenfranchise U.S. voters.
VentureBeat. Oct. 25, 2020.
https://venturebeat.com/2020/10/25/automatic-signature-verification-software-threatens-to
disenfranchise-u-s-voters/
98. Ballotpedia. Cure period for absentee and mail-in ballots. Article
retrieved Apr 18, 2022.
https://ballotpedia.org/Cure_period_for_absentee_and_mail-in_ballots
99. Larry Buchanan and Alicia Parlapiano. Two of these Mail Ballot
Signatures are by the Same Person.
- >-
data. “Sensitive domains” are those in which activities being conducted
can cause material harms, including signifi
cant adverse effects on human rights such as autonomy and dignity, as
well as civil liberties and civil rights. Domains
that have historically been singled out as deserving of enhanced data
protections or where such enhanced protections
are reasonably expected by the public include, but are not limited to,
health, family planning and care, employment,
education, criminal justice, and personal finance. In the context of
this framework, such domains are considered
sensitive whether or not the specifics of a system context would
necessitate coverage under existing law, and domains
- >-
transparent models should be used), rather than as an after-the-decision
interpretation. In other settings, the
extent of explanation provided should be tailored to the risk level.
Valid. The explanation provided by a system should accurately reflect
the factors and the influences that led
to a particular decision, and should be meaningful for the particular
customization based on purpose, target,
and level of risk. While approximation and simplification may be
necessary for the system to succeed based on
the explanatory purpose and target of the explanation, or to account for
the risk of fraud or other concerns
related to revealing decision-making information, such simplifications
should be done in a scientifically
- source_sentence: >-
How do the five principles of the Blueprint for an AI Bill of Rights
function as backstops against potential harms?
sentences:
- >-
programs; or,
Access to critical resources or services, such as healthcare, financial
services, safety, social services,
non-deceptive information about goods and services, and government
benefits.
A list of examples of automated systems for which these principles
should be considered is provided in the
Appendix. The Technical Companion, which follows, offers supportive
guidance for any person or entity that
creates, deploys, or oversees automated systems.
Considered together, the five principles and associated practices of the
Blueprint for an AI Bill of
Rights form an overlapping set of backstops against potential harms.
This purposefully overlapping
- >-
those laws beyond providing them as examples, where appropriate, of
existing protective measures. This
framework instead shares a broad, forward-leaning vision of recommended
principles for automated system
development and use to inform private and public involvement with these
systems where they have the poten
tial to meaningfully impact rights, opportunities, or access.
Additionally, this framework does not analyze or
take a position on legislative and regulatory proposals in municipal,
state, and federal government, or those in
other countries.
We have seen modest progress in recent years, with some state and local
governments responding to these prob
- >-
HUMAN ALTERNATIVES,
CONSIDERATION, AND
FALLBACK
HOW THESE PRINCIPLES CAN MOVE INTO PRACTICE
Real-life examples of how these principles can become reality, through
laws, policies, and practical
technical and sociotechnical approaches to protecting rights,
opportunities, and access.
Healthcare “navigators” help people find their way through online signup
forms to choose
and obtain healthcare. A Navigator is “an individual or organization
that's trained and able to help
consumers, small businesses, and their employees as they look for health
coverage options through the
Marketplace (a government web site), including completing eligibility
and enrollment forms.”106 For
- source_sentence: >-
What should be documented to justify the use of each data attribute and
source in an automated system?
sentences:
- >-
hand and errors from data entry or other sources should be measured and
limited. Any data used as the target
of a prediction process should receive particular attention to the
quality and validity of the predicted outcome
or label to ensure the goal of the automated system is appropriately
identified and measured. Additionally,
justification should be documented for each data attribute and source to
explain why it is appropriate to use
that data to inform the results of the automated system and why such use
will not violate any applicable laws.
In cases of high-dimensional and/or derived attributes, such
justifications can be provided as overall
descriptions of the attribute generation process and appropriateness.
19
- >-
13. National Artificial Intelligence Initiative Office. Agency
Inventories of AI Use Cases. Accessed Sept. 8,
2022. https://www.ai.gov/ai-use-case-inventories/
14. National Highway Traffic Safety Administration.
https://www.nhtsa.gov/
15. See, e.g., Charles Pruitt. People Doing What They Do Best: The
Professional Engineers and NHTSA. Public
Administration Review. Vol. 39, No. 4. Jul.-Aug., 1979.
https://www.jstor.org/stable/976213?seq=1
16. The US Department of Transportation has publicly described the
health and other benefits of these
“traffic calming” measures. See, e.g.: U.S. Department of
Transportation. Traffic Calming to Slow Vehicle
- >-
target measure; unobservable targets may result in the inappropriate use
of proxies. Meeting these
standards may require instituting mitigation procedures and other
protective measures to address
algorithmic discrimination, avoid meaningful harm, and achieve equity
goals.
Ongoing monitoring and mitigation. Automated systems should be regularly
monitored to assess algo
rithmic discrimination that might arise from unforeseen interactions of
the system with inequities not
accounted for during the pre-deployment testing, changes to the system
after deployment, or changes to the
context of use or associated data. Monitoring and disparity assessment
should be performed by the entity
- source_sentence: >-
What are the implications of surveillance technologies on the rights and
opportunities of underserved communities?
sentences:
- >-
manage risks associated with activities or business processes common
across sectors, such as the use of
large language models (LLMs), cloud-based services, or acquisition.
This document defines risks that are novel to or exacerbated by the use
of GAI. After introducing and
describing these risks, the document provides a set of suggested actions
to help organizations govern,
map, measure, and manage these risks.
1 EO 14110 defines Generative AI as “the class of AI models that emulate
the structure and characteristics of input
data in order to generate derived synthetic content. This can include
images, videos, audio, text, and other digital
- >-
rights, and community health, safety and welfare, as well ensuring
better representation of all voices,
especially those traditionally marginalized by technological advances.
Some panelists also raised the issue of
power structures – providing examples of how strong transparency
requirements in smart city projects
helped to reshape power and give more voice to those lacking the
financial or political power to effect change.
In discussion of technical and governance interventions that that are
needed to protect against the harms
of these technologies, various panelists emphasized the need for
transparency, data collection, and
flexible and reactive policy development, analogous to how software is
continuously updated and deployed.
- >-
limits its focus to both government and commercial use of surveillance
technologies when juxtaposed with
real-time or subsequent automated analysis and when such systems have a
potential for meaningful impact
on individuals’ or communities’ rights, opportunities, or access.
UNDERSERVED COMMUNITIES: The term “underserved communities” refers to
communities that have
been systematically denied a full opportunity to participate in aspects
of economic, social, and civic life, as
exemplified by the list in the preceding definition of “equity.”
11
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.805
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.925
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.965
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.97
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.805
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.30833333333333335
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.193
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.09699999999999999
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.805
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.925
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.965
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.97
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.8920929944400894
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.8662916666666668
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.8680077838827839
name: Cosine Map@100
- type: dot_accuracy@1
value: 0.805
name: Dot Accuracy@1
- type: dot_accuracy@3
value: 0.925
name: Dot Accuracy@3
- type: dot_accuracy@5
value: 0.965
name: Dot Accuracy@5
- type: dot_accuracy@10
value: 0.97
name: Dot Accuracy@10
- type: dot_precision@1
value: 0.805
name: Dot Precision@1
- type: dot_precision@3
value: 0.30833333333333335
name: Dot Precision@3
- type: dot_precision@5
value: 0.193
name: Dot Precision@5
- type: dot_precision@10
value: 0.09699999999999999
name: Dot Precision@10
- type: dot_recall@1
value: 0.805
name: Dot Recall@1
- type: dot_recall@3
value: 0.925
name: Dot Recall@3
- type: dot_recall@5
value: 0.965
name: Dot Recall@5
- type: dot_recall@10
value: 0.97
name: Dot Recall@10
- type: dot_ndcg@10
value: 0.8920929944400894
name: Dot Ndcg@10
- type: dot_mrr@10
value: 0.8662916666666668
name: Dot Mrr@10
- type: dot_map@100
value: 0.8680077838827839
name: Dot Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
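Because the architecture ends with a Normalize() module, every embedding has unit length, which would explain why the cosine_* and dot_* metrics reported in this card are identical. The sketch below illustrates that property with NumPy; the random vectors are stand-ins for real embeddings and the helper names are hypothetical, not part of the library.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, mimicking what a Normalize() layer does."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_scores(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and b."""
    return l2_normalize(a) @ l2_normalize(b).T

def dot_scores(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise dot product between the rows of a and b."""
    return a @ b.T

# Random stand-ins for embeddings (assumption: real ones come from model.encode).
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 768))
emb = l2_normalize(raw)

# For unit-length rows the two scoring functions agree, so rankings agree too.
print(np.allclose(cosine_scores(emb, emb), dot_scores(emb, emb)))
```

For vectors that are not normalized, the two scores generally differ, so this equivalence should not be assumed for other models.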
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("checkthisout/finetuned_arctic")
# Run inference
sentences = [
'What are the implications of surveillance technologies on the rights and opportunities of underserved communities?',
'limits its focus to both government and commercial use of surveillance technologies when juxtaposed with \nreal-time or subsequent automated analysis and when such systems have a potential for meaningful impact \non individuals’ or communities’ rights, opportunities, or access. \nUNDERSERVED COMMUNITIES: The term “underserved communities” refers to communities that have \nbeen systematically denied a full opportunity to participate in aspects of economic, social, and civic life, as \nexemplified by the list in the preceding definition of “equity.” \n11',
'manage risks associated with activities or business processes common across sectors, such as the use of \nlarge language models (LLMs), cloud-based services, or acquisition. \nThis document defines risks that are novel to or exacerbated by the use of GAI. After introducing and \ndescribing these risks, the document provides a set of suggested actions to help organizations govern, \nmap, measure, and manage these risks. \n \n \n1 EO 14110 defines Generative AI as “the class of AI models that emulate the structure and characteristics of input \ndata in order to generate derived synthetic content. This can include images, videos, audio, text, and other digital',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
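Since the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128, and 64 (see Training Details), the leading components of each embedding are intended to remain useful on their own. A minimal sketch of truncating and re-normalizing embeddings follows; it uses random vectors in place of real model output, and the helper name is hypothetical.

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and rescale rows to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # guard against zero rows

# Stand-in for `model.encode(sentences)` (assumption: unit-length 768-d rows).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

for dim in (768, 512, 256, 128, 64):
    small = truncate_and_renormalize(full, dim)
    similarities = small @ small.T  # cosine similarity, since rows are unit length
    print(dim, small.shape, similarities.shape)
```

Recent Sentence Transformers releases also appear to accept a truncate_dim argument when loading a model, which may be more convenient than slicing by hand; check the documentation for the installed version.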
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.805 |
cosine_accuracy@3 | 0.925 |
cosine_accuracy@5 | 0.965 |
cosine_accuracy@10 | 0.97 |
cosine_precision@1 | 0.805 |
cosine_precision@3 | 0.3083 |
cosine_precision@5 | 0.193 |
cosine_precision@10 | 0.097 |
cosine_recall@1 | 0.805 |
cosine_recall@3 | 0.925 |
cosine_recall@5 | 0.965 |
cosine_recall@10 | 0.97 |
cosine_ndcg@10 | 0.8921 |
cosine_mrr@10 | 0.8663 |
cosine_map@100 | 0.868 |
dot_accuracy@1 | 0.805 |
dot_accuracy@3 | 0.925 |
dot_accuracy@5 | 0.965 |
dot_accuracy@10 | 0.97 |
dot_precision@1 | 0.805 |
dot_precision@3 | 0.3083 |
dot_precision@5 | 0.193 |
dot_precision@10 | 0.097 |
dot_recall@1 | 0.805 |
dot_recall@3 | 0.925 |
dot_recall@5 | 0.965 |
dot_recall@10 | 0.97 |
dot_ndcg@10 | 0.8921 |
dot_mrr@10 | 0.8663 |
dot_map@100 | 0.868 |
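In the table above, accuracy@k equals recall@k, and precision@k equals accuracy@k divided by k (for example, 0.925 / 3 ≈ 0.3083). That pattern is what one would expect if each query has exactly one relevant passage, which is an assumption here since the card does not describe the evaluation set. The sketch below computes the metrics under that assumption; the function and the example ranks are hypothetical and are not the evaluator's implementation.

```python
from typing import Dict, Optional, Sequence

def metrics_at_k(ranks: Sequence[Optional[int]], k: int) -> Dict[str, float]:
    """Compute retrieval metrics assuming one relevant passage per query.

    ranks[i] is the 1-based rank of the relevant passage for query i,
    or None when it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,          # share of queries with a hit in the top k
        "precision": hits / (n * k),   # at most 1 relevant item among k results
        "recall": hits / n,            # 1 relevant item per query, so same as accuracy
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
    }

# Toy ranks for five queries (made-up data, not taken from the evaluation).
example_ranks = [1, 1, 2, 4, None]
print(metrics_at_k(example_ranks, k=3))
```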
Training Details
Training Dataset
Unnamed Dataset
- Size: 800 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 800 samples:
Statistic | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
min | 11 tokens | 3 tokens |
mean | 20.1 tokens | 127.42 tokens |
max | 36 tokens | 512 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What groups are involved in the processes that require cooperation and collaboration? | processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, technologists, and the public. 14 |
Why is collaboration among different sectors important in these processes? | processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, technologists, and the public. 14 |
What did the panelists emphasize regarding the regulation of technology before it is built and instituted? | (before the technology is built and instituted). Various panelists also emphasized the importance of regulation that includes limits to the type and cost of such technologies. 56 |
- Loss: MatryoshkaLoss with these parameters: { "loss": "MultipleNegativesRankingLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
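To make the loss configuration concrete, here is a rough NumPy sketch of the idea: an in-batch-negatives ranking loss is evaluated on the first 768, 512, 256, 128, and 64 components of the embeddings, and the equally weighted results are summed. It is a simplified illustration, not the library's implementation; the scale of 20 and the use of cosine similarity are assumptions based on the usual MultipleNegativesRankingLoss defaults, and all function names are hypothetical.

```python
import numpy as np

MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]   # from the card
MATRYOSHKA_WEIGHTS = [1, 1, 1, 1, 1]         # from the card

def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def in_batch_ranking_loss(queries: np.ndarray, passages: np.ndarray,
                          scale: float = 20.0) -> float:
    """Cross-entropy where passages[i] is the positive for queries[i] and the
    other passages in the batch act as negatives. `scale` is an assumed default."""
    scores = scale * (_normalize(queries) @ _normalize(passages).T)  # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(queries: np.ndarray, passages: np.ndarray) -> float:
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    return sum(
        weight * in_batch_ranking_loss(queries[:, :dim], passages[:, :dim])
        for dim, weight in zip(MATRYOSHKA_DIMS, MATRYOSHKA_WEIGHTS)
    )

# Toy batch: each passage is a slightly perturbed copy of its query (made-up data).
rng = np.random.default_rng(0)
toy_queries = rng.normal(size=(4, 768))
toy_passages = toy_queries + 0.01 * rng.normal(size=(4, 768))

print("aligned pairs:", matryoshka_loss(toy_queries, toy_passages))
print("mismatched pairs:", matryoshka_loss(toy_queries, toy_passages[::-1]))
```

Well-aligned pairs should give a much lower loss than mismatched pairs at every prefix length, which is the pressure that keeps the truncated embeddings useful.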
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 20
- per_device_eval_batch_size: 20
- num_train_epochs: 5
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 20
- per_device_eval_batch_size: 20
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 5
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_map@100 |
---|---|---|
1.0 | 40 | 0.8449 |
1.25 | 50 | 0.8586 |
2.0 | 80 | 0.8693 |
2.5 | 100 | 0.8702 |
3.0 | 120 | 0.8703 |
3.75 | 150 | 0.8715 |
4.0 | 160 | 0.8659 |
5.0 | 200 | 0.8680 |
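The Epoch and Step columns are consistent with the reported setup: 800 training samples at a per-device batch size of 20 give 40 optimizer steps per epoch and 200 steps over 5 epochs. The short sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation (gradient_accumulation_steps is listed as 1 above); the helper names are hypothetical.

```python
import math

NUM_SAMPLES = 800   # training set size from the card
BATCH_SIZE = 20     # per_device_train_batch_size from the card
NUM_EPOCHS = 5      # num_train_epochs from the card

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch on one device without gradient accumulation."""
    return math.ceil(num_samples / batch_size)

def epoch_at_step(step: int, num_samples: int, batch_size: int) -> float:
    """Fractional epoch reached after a given number of steps."""
    return step / steps_per_epoch(num_samples, batch_size)

per_epoch = steps_per_epoch(NUM_SAMPLES, BATCH_SIZE)
total_steps = per_epoch * NUM_EPOCHS
print(per_epoch, total_steps)  # 40 200

# Steps that appear in the training log, mapped back to fractional epochs.
for step in (40, 50, 80, 100, 120, 150, 160, 200):
    print(step, epoch_at_step(step, NUM_SAMPLES, BATCH_SIZE))
```

The logged rows at steps 50, 100, and 150 fall between epoch boundaries, which suggests evaluation ran every 50 steps in addition to the end of each epoch.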
Framework Versions
- Python: 3.11.9
- Sentence Transformers: 3.1.1
- Transformers: 4.44.2
- PyTorch: 2.4.1
- Accelerate: 0.34.2
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}