---
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
  - dot_accuracy@1
  - dot_accuracy@3
  - dot_accuracy@5
  - dot_accuracy@10
  - dot_precision@1
  - dot_precision@3
  - dot_precision@5
  - dot_precision@10
  - dot_recall@1
  - dot_recall@3
  - dot_recall@5
  - dot_recall@10
  - dot_ndcg@10
  - dot_mrr@10
  - dot_map@100
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:800
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      How have algorithms in hiring and credit decisions been shown to impact
      existing inequities, according to the context?
    sentences:
      - >-
        Shoshana Zuboff. The Age of Surveillance Capitalism: The Fight for a
        Human Future at the New Frontier of

        Power. Public Affairs. 2019.

        64. Angela Chen. Why the Future of Life Insurance May Depend on Your
        Online Presence. The Verge. Feb.

        7, 2019.

        https://www.theverge.com/2019/2/7/18211890/social-media-life-insurance-new-york-algorithms-big­

        data-discrimination-online-records

        68
      - >-
        SECTION TITLE­

        FOREWORD

        Among the great challenges posed to democracy today is the use of
        technology, data, and automated systems in 

        ways that threaten the rights of the American public. Too often, these
        tools are used to limit our opportunities and 

        prevent our access to critical resources or services. These problems are
        well documented. In America and around 

        the world, systems supposed to help with patient care have proven
        unsafe, ineffective, or biased. Algorithms used 

        in hiring and credit decisions have been found to reflect and reproduce
        existing unwanted inequities or embed 

        new harmful bias and discrimination. Unchecked social media data
        collection has been used to threaten people’s
      - >-
        ways and to the greatest extent possible; where not possible,
        alternative privacy by design safeguards should be 

        used. Systems should not employ user experience and design decisions
        that obfuscate user choice or burden 

        users with defaults that are privacy invasive. Consent should only be
        used to justify collection of data in cases 

        where it can be appropriately and meaningfully given. Any consent
        requests should be brief, be understandable 

        in plain language, and give you agency over data collection and the
        specific context of use; current hard-to­

        understand notice-and-choice practices for broad uses of data should be
        changed. Enhanced protections and
  - source_sentence: >-
      What factors should be considered when tailoring the extent of explanation
      provided by a system based on risk level?
    sentences:
      - >-
        ENDNOTES

        96. National Science Foundation. NSF Program on Fairness in Artificial
        Intelligence in Collaboration

        with Amazon (FAI). Accessed July 20, 2022.

        https://www.nsf.gov/pubs/2021/nsf21585/nsf21585.htm

        97. Kyle Wiggers. Automatic signature verification software threatens to
        disenfranchise U.S. voters.

        VentureBeat. Oct. 25, 2020.

        https://venturebeat.com/2020/10/25/automatic-signature-verification-software-threatens-to­

        disenfranchise-u-s-voters/

        98. Ballotpedia. Cure period for absentee and mail-in ballots. Article
        retrieved Apr 18, 2022.

        https://ballotpedia.org/Cure_period_for_absentee_and_mail-in_ballots

        99. Larry Buchanan and Alicia Parlapiano. Two of these Mail Ballot
        Signatures are by the Same Person.
      - >-
        data. “Sensitive domains” are those in which activities being conducted
        can cause material harms, including signifi­

        cant adverse effects on human rights such as autonomy and dignity, as
        well as civil liberties and civil rights. Domains 

        that have historically been singled out as deserving of enhanced data
        protections or where such enhanced protections 

        are reasonably expected by the public include, but are not limited to,
        health, family planning and care, employment, 

        education, criminal justice, and personal finance. In the context of
        this framework, such domains are considered 

        sensitive whether or not the specifics of a system context would
        necessitate coverage under existing law, and domains
      - >-
        transparent models should be used), rather than as an after-the-decision
        interpretation. In other settings, the 

        extent of explanation provided should be tailored to the risk level. 

        Valid. The explanation provided by a system should accurately reflect
        the factors and the influences that led 

        to a particular decision, and should be meaningful for the particular
        customization based on purpose, target, 

        and level of risk. While approximation and simplification may be
        necessary for the system to succeed based on 

        the explanatory purpose and target of the explanation, or to account for
        the risk of fraud or other concerns 

        related to revealing decision-making information, such simplifications
        should be done in a scientifically
  - source_sentence: >-
      How do the five principles of the Blueprint for an AI Bill of Rights
      function as backstops against potential harms?
    sentences:
      - >-
        programs; or, 

        Access to critical resources or services, such as healthcare, financial
        services, safety, social services, 

        non-deceptive information about goods and services, and government
        benefits. 

        A list of examples of automated systems for which these principles
        should be considered is provided in the 

        Appendix. The Technical Companion, which follows, offers supportive
        guidance for any person or entity that 

        creates, deploys, or oversees automated systems. 

        Considered together, the five principles and associated practices of the
        Blueprint for an AI Bill of 

        Rights form an overlapping set of backstops against potential harms.
        This purposefully overlapping
      - >-
        those laws beyond providing them as examples, where appropriate, of
        existing protective measures. This 

        framework instead shares a broad, forward-leaning vision of recommended
        principles for automated system 

        development and use to inform private and public involvement with these
        systems where they have the poten­

        tial to meaningfully impact rights, opportunities, or access.
        Additionally, this framework does not analyze or 

        take a position on legislative and regulatory proposals in municipal,
        state, and federal government, or those in 

        other countries. 

        We have seen modest progress in recent years, with some state and local
        governments responding to these prob­
      - >-
        HUMAN ALTERNATIVES, 

        CONSIDERATION, AND 

        FALLBACK 

        HOW THESE PRINCIPLES CAN MOVE INTO PRACTICE

        Real-life examples of how these principles can become reality, through
        laws, policies, and practical 

        technical and sociotechnical approaches to protecting rights,
        opportunities, and access. 

        Healthcare “navigators” help people find their way through online signup
        forms to choose 

        and obtain healthcare. A Navigator is “an individual or organization
        that's trained and able to help 

        consumers, small businesses, and their employees as they look for health
        coverage options through the 

        Marketplace (a government web site), including completing eligibility
        and enrollment forms.”106 For
  - source_sentence: >-
      What should be documented to justify the use of each data attribute and
      source in an automated system?
    sentences:
      - >-
        hand and errors from data entry or other sources should be measured and
        limited. Any data used as the target 

        of a prediction process should receive particular attention to the
        quality and validity of the predicted outcome 

        or label to ensure the goal of the automated system is appropriately
        identified and measured. Additionally, 

        justification should be documented for each data attribute and source to
        explain why it is appropriate to use 

        that data to inform the results of the automated system and why such use
        will not violate any applicable laws. 

        In cases of high-dimensional and/or derived attributes, such
        justifications can be provided as overall 

        descriptions of the attribute generation process and appropriateness. 

        19
      - >-
        13. National Artificial Intelligence Initiative Office. Agency
        Inventories of AI Use Cases. Accessed Sept. 8,

        2022. https://www.ai.gov/ai-use-case-inventories/

        14. National Highway Traffic Safety Administration.
        https://www.nhtsa.gov/

        15. See, e.g., Charles Pruitt. People Doing What They Do Best: The
        Professional Engineers and NHTSA. Public

        Administration Review. Vol. 39, No. 4. Jul.-Aug., 1979.
        https://www.jstor.org/stable/976213?seq=1

        16. The US Department of Transportation has publicly described the
        health and other benefits of these

        “traffic calming” measures. See, e.g.: U.S. Department of
        Transportation. Traffic Calming to Slow Vehicle
      - >-
        target measure; unobservable targets may result in the inappropriate use
        of proxies. Meeting these 

        standards may require instituting mitigation procedures and other
        protective measures to address 

        algorithmic discrimination, avoid meaningful harm, and achieve equity
        goals. 

        Ongoing monitoring and mitigation. Automated systems should be regularly
        monitored to assess algo­

        rithmic discrimination that might arise from unforeseen interactions of
        the system with inequities not 

        accounted for during the pre-deployment testing, changes to the system
        after deployment, or changes to the 

        context of use or associated data. Monitoring and disparity assessment
        should be performed by the entity
  - source_sentence: >-
      What are the implications of surveillance technologies on the rights and
      opportunities of underserved communities?
    sentences:
      - >-
        manage risks associated with activities or business processes common
        across sectors, such as the use of 

        large language models (LLMs), cloud-based services, or acquisition. 

        This document defines risks that are novel to or exacerbated by the use
        of GAI. After introducing and 

        describing these risks, the document provides a set of suggested actions
        to help organizations govern, 

        map, measure, and manage these risks. 
         
         
        1 EO 14110 defines Generative AI as “the class of AI models that emulate
        the structure and characteristics of input 

        data in order to generate derived synthetic content. This can include
        images, videos, audio, text, and other digital
      - >-
        rights, and community health, safety and welfare, as well ensuring
        better representation of all voices, 

        especially those traditionally marginalized by technological advances.
        Some panelists also raised the issue of 

        power structures  providing examples of how strong transparency
        requirements in smart city projects 

        helped to reshape power and give more voice to those lacking the
        financial or political power to effect change. 

        In discussion of technical and governance interventions that that are
        needed to protect against the harms 

        of these technologies, various panelists emphasized the need for
        transparency, data collection, and 

        flexible and reactive policy development, analogous to how software is
        continuously updated and deployed.
      - >-
        limits its focus to both government and commercial use of surveillance
        technologies when juxtaposed with 

        real-time or subsequent automated analysis and when such systems have a
        potential for meaningful impact 

        on individuals’ or communities’ rights, opportunities, or access. 

        UNDERSERVED COMMUNITIES: The term “underserved communities” refers to
        communities that have 

        been systematically denied a full opportunity to participate in aspects
        of economic, social, and civic life, as 

        exemplified by the list in the preceding definition of “equity.” 

        11
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.805
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.925
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.965
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.97
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.805
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.30833333333333335
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.193
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09699999999999999
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.805
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.925
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.965
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.97
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8920929944400894
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.8662916666666668
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.8680077838827839
            name: Cosine Map@100
          - type: dot_accuracy@1
            value: 0.805
            name: Dot Accuracy@1
          - type: dot_accuracy@3
            value: 0.925
            name: Dot Accuracy@3
          - type: dot_accuracy@5
            value: 0.965
            name: Dot Accuracy@5
          - type: dot_accuracy@10
            value: 0.97
            name: Dot Accuracy@10
          - type: dot_precision@1
            value: 0.805
            name: Dot Precision@1
          - type: dot_precision@3
            value: 0.30833333333333335
            name: Dot Precision@3
          - type: dot_precision@5
            value: 0.193
            name: Dot Precision@5
          - type: dot_precision@10
            value: 0.09699999999999999
            name: Dot Precision@10
          - type: dot_recall@1
            value: 0.805
            name: Dot Recall@1
          - type: dot_recall@3
            value: 0.925
            name: Dot Recall@3
          - type: dot_recall@5
            value: 0.965
            name: Dot Recall@5
          - type: dot_recall@10
            value: 0.97
            name: Dot Recall@10
          - type: dot_ndcg@10
            value: 0.8920929944400894
            name: Dot Ndcg@10
          - type: dot_mrr@10
            value: 0.8662916666666668
            name: Dot Mrr@10
          - type: dot_map@100
            value: 0.8680077838827839
            name: Dot Map@100
---

SentenceTransformer based on Snowflake/snowflake-arctic-embed-m

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
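
The final Normalize() module scales every embedding to unit length, so dot-product and cosine scores coincide; this is also why the cosine_* and dot_* evaluation figures reported below are identical. A minimal sketch of checking this, assuming only the model id shown in the Usage section:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("checkthisout/finetuned_arctic")
emb = model.encode(["a short example sentence", "another example sentence"])

# The Normalize() module makes each embedding unit-length,
# so the dot product of two embeddings equals their cosine similarity.
print(np.linalg.norm(emb, axis=1))  # approximately [1. 1.]
print(emb[0] @ emb[1])              # dot product == cosine similarity here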

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("checkthisout/finetuned_arctic")
# Run inference
sentences = [
    'What are the implications of surveillance technologies on the rights and opportunities of underserved communities?',
    'limits its focus to both government and commercial use of surveillance technologies when juxtaposed with \nreal-time or subsequent automated analysis and when such systems have a potential for meaningful impact \non individuals’ or communities’ rights, opportunities, or access. \nUNDERSERVED COMMUNITIES: The term “underserved communities” refers to communities that have \nbeen systematically denied a full opportunity to participate in aspects of economic, social, and civic life, as \nexemplified by the list in the preceding definition of “equity.” \n11',
    'manage risks associated with activities or business processes common across sectors, such as the use of \nlarge language models (LLMs), cloud-based services, or acquisition. \nThis document defines risks that are novel to or exacerbated by the use of GAI. After introducing and \ndescribing these risks, the document provides a set of suggested actions to help organizations govern, \nmap, measure, and manage these risks. \n \n \n1 EO 14110 defines Generative AI as “the class of AI models that emulate the structure and characteristics of input \ndata in order to generate derived synthetic content. This can include images, videos, audio, text, and other digital',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
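
For retrieval-style use, the same API can embed a query and a set of candidate passages and rank the candidates by similarity. A minimal sketch; the query and passages below are made up for illustration:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("checkthisout/finetuned_arctic")

# Hypothetical query and passages, for illustration only
query = "Which domains are considered sensitive?"
passages = [
    "Domains such as health, employment, education, criminal justice, and personal finance are considered sensitive.",
    "Healthcare navigators help people complete eligibility and enrollment forms.",
]

query_emb = model.encode([query])
passage_embs = model.encode(passages)

# Rank passages by similarity to the query (cosine, since embeddings are normalized)
scores = model.similarity(query_emb, passage_embs)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.4f}  {passages[idx]}")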

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.805
cosine_accuracy@3 0.925
cosine_accuracy@5 0.965
cosine_accuracy@10 0.97
cosine_precision@1 0.805
cosine_precision@3 0.3083
cosine_precision@5 0.193
cosine_precision@10 0.097
cosine_recall@1 0.805
cosine_recall@3 0.925
cosine_recall@5 0.965
cosine_recall@10 0.97
cosine_ndcg@10 0.8921
cosine_mrr@10 0.8663
cosine_map@100 0.868
dot_accuracy@1 0.805
dot_accuracy@3 0.925
dot_accuracy@5 0.965
dot_accuracy@10 0.97
dot_precision@1 0.805
dot_precision@3 0.3083
dot_precision@5 0.193
dot_precision@10 0.097
dot_recall@1 0.805
dot_recall@3 0.925
dot_recall@5 0.965
dot_recall@10 0.97
dot_ndcg@10 0.8921
dot_mrr@10 0.8663
dot_map@100 0.868
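
Metrics in this form are what the Sentence Transformers InformationRetrievalEvaluator reports. A comparable evaluation can be reproduced roughly as sketched below; the queries, corpus, and relevance judgments here are placeholders, not the actual held-out set:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("checkthisout/finetuned_arctic")

# Placeholder evaluation data: ids -> texts, and query id -> set of relevant corpus ids
queries = {"q1": "What should be documented to justify the use of each data attribute?"}
corpus = {
    "d1": "Justification should be documented for each data attribute and source.",
    "d2": "Healthcare navigators help people complete enrollment forms.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="example")
results = evaluator(model)
print(results)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100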

Training Details

Training Dataset

Unnamed Dataset

  • Size: 800 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 800 samples:
    sentence_0: string; min: 11 tokens, mean: 20.1 tokens, max: 36 tokens
    sentence_1: string; min: 3 tokens, mean: 127.42 tokens, max: 512 tokens
  • Samples:
    sentence_0: What groups are involved in the processes that require cooperation and collaboration?
    sentence_1: processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, technologists, and the public. 14

    sentence_0: Why is collaboration among different sectors important in these processes?
    sentence_1: processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, technologists, and the public. 14

    sentence_0: What did the panelists emphasize regarding the regulation of technology before it is built and instituted?
    sentence_1: (before the technology is built and instituted). Various panelists also emphasized the importance of regulation that includes limits to the type and cost of such technologies. 56
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    
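
Because MatryoshkaLoss supervises the leading 768, 512, 256, 128, and 64 dimensions, embeddings from this model can be truncated to one of those prefixes and re-normalized, trading some retrieval quality for smaller vectors. A minimal sketch; the 256-dimension choice is only an example:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("checkthisout/finetuned_arctic")
full = model.encode(["an example sentence", "another example sentence"])  # shape (2, 768)

# Keep the leading 256 Matryoshka dimensions, then re-normalize to unit length
truncated = full[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (2, 256)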

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 20
  • per_device_eval_batch_size: 20
  • num_train_epochs: 5
  • multi_dataset_batch_sampler: round_robin
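
A minimal sketch of a training run that wires these non-default values into SentenceTransformerTrainer; the one-row dataset below is a placeholder for the 800 (sentence_0, sentence_1) pairs, and the real run also tracked cosine_map@100 via a separate evaluator:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# Placeholder pair dataset; the real training set holds 800 question/passage pairs
train_dataset = Dataset.from_dict({
    "sentence_0": ["What groups are involved in the processes that require cooperation and collaboration?"],
    "sentence_1": ["processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, technologists, and the public."],
})

# Same loss configuration as listed above
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_arctic",
    num_train_epochs=5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    eval_strategy="steps",
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder; use a held-out split or an evaluator in practice
    loss=loss,
)
trainer.train()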

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 20
  • per_device_eval_batch_size: 20
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step cosine_map@100
1.0 40 0.8449
1.25 50 0.8586
2.0 80 0.8693
2.5 100 0.8702
3.0 120 0.8703
3.75 150 0.8715
4.0 160 0.8659
5.0 200 0.8680

Framework Versions

  • Python: 3.11.9
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}