metadata
base_model: sentence-transformers/all-mpnet-base-v2
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:152151
  - loss:HardMultipleNegativesRankingLoss
  - loss:CachedMultipleNegativesSymmetricRankingLoss
widget:
  - source_sentence: >-
      Use arc welding techniques to make welds in conditions of very high
      pressure, usually in an underwater dry chamber such as a diving bell.
      Compensate for the negative consequences of high pressure on a weld, such
      as the shorter and less steady welding arc.
    sentences:
      - skill_skill
      - weld in hyperbaric conditions
      - human-robot collaboration
  - source_sentence: >-
      Carry out mineral processing operations, which aim to separate valuable
      minerals from waste rock or grout. Oversee and implement processes such as
      samping, analysis and most importantly the electrostatic separation
      process, which separates valuable materials from mineral ore.
    sentences:
      - internet governance
      - implement mineral processes
      - skill_skill
  - source_sentence: >-
      looking for a pest control technician with strong knowledge in
      preventative measures to minimize pest populations A successful candidate
      will have experience in cryopreservation techniques as well as laboratory
      protocols
    sentences:
      - cryopreservation
      - food preservation
      - skill_sentence
  - source_sentence: >-
      Candidates with experience using popular balance sheet software are
      encouraged to apply for our accounting position. We are looking for a
      cargo handling expert who can maximize efficiency on our shipping vessels.
    sentences:
      - skill_sentence
      - perform balance sheet operations
      - promote inclusion
  - source_sentence: >-
      Must have the ability to read and interpret schematics and effectively
      install and calibrate lift governors to ensure compliance with safety
      standards. The ideal candidate must have an ear for identifying music with
      commercial potential and understand the current market trends.
    sentences:
      - prepare credit reports
      - install lift governor
      - skill_sentence

SentenceTransformer based on sentence-transformers/all-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-mpnet-base-v2 on the skill_sentence and skill_skill datasets. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-mpnet-base-v2
  • Maximum Sequence Length: 96 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Datasets:
    • skill_sentence
    • skill_skill
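
The similarity function listed above is cosine similarity. As a minimal, dependency-free sketch of what that score measures, the snippet below computes it for toy vectors; the 3-dimensional inputs are placeholders for the model's 768-dimensional embeddings, and the helper name `cosine_similarity` is chosen here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors stand in for real embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Because the score depends only on direction, rescaling an embedding does not change its similarity to any other embedding.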

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 96, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): SmartTokenPooling({'word_embedding_dimension': 768, 'window_size': -1})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jensjorisdecorte/ConTeXT-Skill-Extraction-base")
# Run inference
sentences = [
    'Must have the ability to read and interpret schematics and effectively install and calibrate lift governors to ensure compliance with safety standards. The ideal candidate must have an ear for identifying music with commercial potential and understand the current market trends.',
    'install lift governor',
    'skill_sentence',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
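
For skill extraction, the similarity scores are typically used to rank candidate skill labels against a job-ad sentence. The sketch below shows only that ranking step and runs without the model. The scores are made up for illustration and are not real model output, and `rank_labels` is a hypothetical helper, not part of Sentence Transformers.

```python
def rank_labels(scores, labels, top_k=3):
    """Return the top_k (label, score) pairs, highest score first."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Illustrative scores only, NOT real model output. In practice, one row of
# model.similarity(query_embeddings, label_embeddings), converted with
# .tolist(), would be passed instead.
labels = ["install lift governor", "prepare credit reports", "promote inclusion"]
scores = [0.62, 0.08, 0.11]

for label, score in rank_labels(scores, labels, top_k=2):
    print(f"{score:.2f}  {label}")
```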

Training Details

Training Datasets

skill_sentence

  • Dataset: skill_sentence
  • Size: 138,260 training samples
  • Columns: anchor, positive, and type
  • Approximate statistics based on the first 1000 samples:

    |      | anchor       | positive    | type       |
    |:-----|:-------------|:------------|:-----------|
    | type | string       | string      | string     |
    | min  | 9 tokens     | 3 tokens    | 5 tokens   |
    | mean | 35.67 tokens | 6.12 tokens | 5.0 tokens |
    | max  | 63 tokens    | 15 tokens   | 5 tokens   |

  • Samples:

    | anchor | positive | type |
    |:-------|:---------|:-----|
    | duties for this role will include conducting water chemistry analysis and managing the laboratory. seeking a seasoned print manufacturing manager with knowledge of printing materials, processes and equipment. | water chemistry analysis | skill_sentence |
    | divers must understand how to calculate dive times and limits to ensure they return safely. We are searching for a multimedia software expert with experience in sound, lighting and recording software. | comply with the planned time for the depth of the dive | skill_sentence |
    | A successful candidate will possess the ability to calibrate laboratory equipment according to industry standards. we are seeking a candidate with experience in preparing government funding dossiers | prepare government funding dossiers | skill_sentence |

  • Loss: custom_losses.HardMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20,
        "similarity_fct": "<lambda>"
    }
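
`HardMultipleNegativesRankingLoss` is a custom loss whose implementation is not shown in this card. As background only, the sketch below implements the standard multiple-negatives ranking objective that the name suggests it builds on. Each anchor must pick out its own positive among all positives in the batch, with similarities multiplied by `scale` before the softmax. The use of cosine similarity is an assumption (the card lists the similarity function only as a lambda), and whatever hard-negative handling the custom loss adds is not reproduced here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the scaled similarities.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings": matching pairs point in the same direction.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)  # the aligned batch yields the much smaller loss
```

A larger `scale` sharpens the softmax, so a small gap in similarity produces a large gap in loss.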
    

skill_skill

  • Dataset: skill_skill
  • Size: 13,891 training samples
  • Columns: anchor, positive, and type
  • Approximate statistics based on the first 1000 samples:

    |      | anchor       | positive    | type       |
    |:-----|:-------------|:------------|:-----------|
    | type | string       | string      | string     |
    | min  | 6 tokens     | 3 tokens    | 5 tokens   |
    | mean | 29.09 tokens | 6.24 tokens | 5.0 tokens |
    | max  | 96 tokens    | 16 tokens   | 5 tokens   |

  • Samples:

    | anchor | positive | type |
    |:-------|:---------|:-----|
    | Adapt and move set pieces during rehearsals and live performances. | adapt sets | skill_skill |
    | Prepare bread and bread products such as sandwiches for consumption. | prepare bread products | skill_skill |
    | The strategies, methods and techniques that increase the organisation's capacity to protect and sustain the services and operations that fulfil the organisational mission and create lasting values by effectively addressing the combined issues of security, preparedness, risk and disaster recovery. | organisational resilience | skill_skill |

  • Loss: CachedMultipleNegativesSymmetricRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 64
    }
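
The symmetric variant scores each pair in both directions: every anchor must retrieve its positive, and every positive must retrieve its anchor, with the two cross-entropy terms averaged. The following is a simplified, dependency-free sketch of that idea, not the library's implementation. It omits the gradient caching that `mini_batch_size` controls, which processes a large batch in smaller chunks to bound memory use.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cross_entropy(logits, target):
    """Negative log-softmax of logits at index target (numerically stable)."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits)) - logits[target]


def symmetric_ranking_loss(anchors, positives, scale=20.0):
    n = len(anchors)
    sims = [[scale * cosine(a, p) for p in positives] for a in anchors]
    # Rows: each anchor ranks all positives.
    forward = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Columns: each positive ranks all anchors.
    backward = sum(cross_entropy([sims[r][j] for r in range(n)], j) for j in range(n)) / n
    return (forward + backward) / 2


anchors = [[1.0, 0.2], [0.1, 1.0]]
positives = [[0.9, 0.1], [0.3, 0.8]]
print(symmetric_ranking_loss(anchors, positives))
```

Swapping the roles of anchors and positives leaves the value unchanged, which is the property the "symmetric" name refers to.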
    

Training Hyperparameters

Non-Default Hyperparameters

  • overwrite_output_dir: True
  • eval_strategy: steps
  • per_device_train_batch_size: 4096
  • per_device_eval_batch_size: 4096
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
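
As a hedged sketch, the non-default values above could be passed to `SentenceTransformerTrainingArguments` as shown below. The helper `build_training_args` and its `output_dir` argument are hypothetical, since the card does not state the output directory used. The import is deferred so that the dictionary can be inspected without the library installed.

```python
# Non-default hyperparameters exactly as listed in this card.
NON_DEFAULT_ARGS = {
    "overwrite_output_dir": True,
    "eval_strategy": "steps",
    "per_device_train_batch_size": 4096,
    "per_device_eval_batch_size": 4096,
    "num_train_epochs": 1,
    "warmup_ratio": 0.1,
    "fp16": True,  # mixed precision; expects a CUDA-capable device
    "load_best_model_at_end": True,
}


def build_training_args(output_dir):
    """Hypothetical helper: build training arguments for a given output_dir."""
    # Deferred import keeps this module importable without sentence-transformers.
    from sentence_transformers import SentenceTransformerTrainingArguments

    return SentenceTransformerTrainingArguments(output_dir=output_dir, **NON_DEFAULT_ARGS)
```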

All Hyperparameters

  • overwrite_output_dir: True
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 4096
  • per_device_eval_batch_size: 4096
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch  | Step |
|:-------|:-----|
| 0.1053 | 4    |
| 0.2105 | 8    |
| 0.3158 | 12   |
| 0.4211 | 16   |
| 0.5263 | 20   |
| 0.6316 | 24   |
| 0.7368 | 28   |
| 0.8421 | 32   |
| 0.9474 | 36   |
  • The bold row denotes the saved checkpoint.
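
The epoch fractions in the log can be checked against the dataset sizes and batch size reported in this card. The arithmetic below assumes a single device, that each dataset is batched separately, and that the final partial batch of each dataset is kept (`dataloader_drop_last: False`). Under those assumptions one epoch has 38 steps, and a step logged every 4 steps reproduces the logged fractions.

```python
import math

# Sizes and batch size as reported in this card.
DATASET_SIZES = {"skill_sentence": 138_260, "skill_skill": 13_891}
BATCH_SIZE = 4096

# Assumption: batches never mix datasets, and partial batches are kept.
steps_per_dataset = {name: math.ceil(n / BATCH_SIZE) for name, n in DATASET_SIZES.items()}
steps_per_epoch = sum(steps_per_dataset.values())

# The log records steps 4, 8, ..., 36.
logged_steps = list(range(4, 37, 4))
epoch_fractions = [round(step / steps_per_epoch, 4) for step in logged_steps]

print(steps_per_dataset)  # batches contributed by each dataset
print(steps_per_epoch)    # total optimizer steps in one epoch
print(epoch_fractions)    # compare with the Epoch column of the log
```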

Framework Versions

  • Python: 3.9.19
  • Sentence Transformers: 3.1.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu118
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}