tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:21988
  - loss:MultipleNegativesRankingLoss
base_model: Lajavaness/bilingual-embedding-large
widget:
  - source_sentence: >-
      BEAS INTESTINES 2901 718935 wwwIsrael under heavy attack from Gaza There
      were more than 600 rockets launched against Israel. There are some
      civilians wounded and dead
    sentences:
      - Photo shows cloud of smoke after attack in Israel
      - Claudia López with a book thanking the FARC
      - Wife of Chinese official shot in US
  - source_sentence: >-
      People's Network people.cn People's Daily: Scientifically grasp the law of
      population development Balanced Population Development in the New Era -
      January 2022 From the 1st, the one-child policy will be completely
      abolished. Newlyweds must have at least two children Wang Peian April 1,
      2021 06:18 Source: People's Daily Online, People's Daily Executive
      summary: ■After the founding of New China, the implementation of family
      planning was based on the basic national conditions of my country's large
      population and relatively insufficient resources A major strategic
      decision, which makes the population's pressure on resources and the
      environment get a preliminary understanding: it creates a longer
      demographic dividend period, It has effectively promoted economic
      development, social progress and the improvement of people's living
      standards, and the country's capacity for sustainable development has been
      greatly enhanced. ■Since the beginning of the new century, my country's
      population situation has undergone major changes. Strive to achieve the
      level of active fertility, vigorously improve the quality and skills of
      workers, and implement the comprehensive two-child policy, which is the
      key to population development. Three issues that must be addressed in the
      field. ■ Attention should be paid to the research on population
      development strategies, comprehensively and profoundly understand and
      grasp the laws of population, and promote the coordination between
      population and economy and society. development, and promote the long-term
      balanced development of the population. choice of history my country has
      been a country with the largest population in the world since ancient
      times. In traditional society, if there is an entrance, there will be a
      license and tax, and the country will be strengthened. If there is a
      population, there will be soldiers. The rulers of successive dynasties
      have vigorously encouraged population reproduction. Once the society is
      stable and production develops, the total population will decrease. The
      threshold will increase greatly; when the dynasty is changed, the army
      will be in chaos, famine and flag epidemics will be intertwined, and the
      population will be sharp or small. Look, before the 17th century, my
      country's population grew slowly in a cyclical ups and downs. The
      introduction of high-yielding food crops such as corn, sweet potato and
      potato in the late Ming Dynasty, especially the century-long Kanggan in
      the early Qing Dynasty. The prosperous age made my country's population
      grow rapidly, breaking through the 200 million, 300 million mark
      successively, and the 400 million mark in the Daoguang years, which led
      to  Legal Migrant Workers People's Network people.cn People's Daily:
      Scientifically grasp the law of population development Balanced Population
      Development in the New Era - January 2022 From the 1st, the one-child
      policy will be completely abolished. Newlyweds must have at least two
      children Wang Peian April 1, 2021 06:18 Source: People's Daily Online,
      People's Daily Executive summary: ■After the founding of New China, the
      implementation of family planning was based on the basic national
      conditions of my country's large population and relatively insufficient
      resources A major strategic decision, which makes the population's
      pressure on resources and the environment get a preliminary understanding:
      it creates a longer demographic dividend period, It has effectively
      promoted economic development, social progress and the improvement of
      people's living standards, and the country's capacity for sustainable
      development has been greatly enhanced. ■Since the beginning of the new
      century, my country's population situation has undergone major changes.
      Strive to achieve the level of active fertility, vigorously improve the
      quality and skills of workers, and implement the comprehensive two-child
      policy, which is the key to population development. Three issues that must
      be addressed in the field. ■ Attention should be paid to the research on
      population development strategies, comprehensively and profoundly
      understand and grasp the laws of population, and promote the coordination
      between population and economy and society. development, and promote the
      long-term balanced development of the population. choice of history my
      country has been a country with the largest population in the world since
      ancient times. In traditional society, if there is an entrance, there will
      be a license and tax, and the country will be strengthened. If there is a
      population, there will be soldiers. The rulers of successive dynasties
      have vigorously encouraged population reproduction. Once the society is
      stable and production develops, the total population will decrease. The
      threshold will increase greatly; when the dynasty is changed, the army
      will be in chaos, famine and flag epidemics will be intertwined, and the
      population will be sharp or small. Look, before the 17th century, my
      country's population grew slowly in a cyclical ups and downs. The
      introduction of high-yielding food crops such as corn, sweet potato and
      potato in the late Ming Dynasty, especially the century-long Kanggan in
      the early Qing Dynasty. The prosperous age made my country's population
      grow rapidly, breaking through the 200 million, 300 million mark
      successively, and the 400 million mark in the Daoguang years, which led
      to  Legal Migrant WorkersA warning to those prosperous forces who often
      talk about human rights: China has human rights, and we have approved that
      Chinese people must get married, and they must have two children after
      they get married!
    sentences:
      - >-
        Hamad bin Jassim told the BBC In a new interview, we paid the defected
        Syrian officer $30,000 and the regular soldier $15,000.
      - >-
        State-run newspaper announces Chinese couples ‘must have two children’
        starting January 2022
      - >-
        This is the draw for judges for the case of former Ecuadorian President
        Rafael Correa
  - source_sentence: >-
      Part 1 Resignation sir jokowi JOKOWI REGISTERED COMPASS DKI DPRD HOLDS
      Plenary MEETING CARIS JAKARTA KOMPASTV Tik TokIs it true that the
      President of Indonesia, Joko Widodo, has resigned from his position?
    sentences:
      - BBC reports on release of 'Unabomber' Ted Kaczynski
      - Thai children flash three fingered salute to Thai PM Prayut
      - President Joko Widodo, alias Jokowi, resigns from his post
  - source_sentence: >-
      The organization 'Vegan Society' calls for a ban on animal-shaped
      children's cookies. They consider that these cookies "incite children to
      see animals as something inferior and at our disposal." This is the ,
      which is dangerous even for anti-bullfighting. It's not that they don't
      want bullfighting. It is that they want to impose even the shape of the
      cookies that your children eat. And it's not the first time. Barnum
      cookies have already "freed" the animals in their boxes to have a better
      brand image. They may seem like funny news. But they are not. They hide a
      prohibitionist ideology full of censorship. 𝗘𝗹 𝗮𝗻𝗶𝗺𝗮𝗹𝗶𝘀𝗺𝗼 𝗲𝘀
      𝗽𝗲𝗹𝗶𝗴𝗿𝗼 𝗽𝗮𝗿𝗮 𝗻𝘂𝗲𝘀𝘁𝗿𝗮 𝘀𝗼𝗰𝗶𝗲𝗱𝗮𝗱
    sentences:
      - >-
        Vegan NGO Vegan Society wants to ban the sale of animal-shaped cookies
        in France
      - Cans of food containing pork with a "halal" stamp
      - >-
        Pfizer announces Covid-19 vaccine update with Microsoft chip for symptom
        reduction
  - source_sentence: >-
      a . . . . . (177. FO Accident st THE LEADER IN ACCIDENT REPORTING Reckless
      driving by a minor Kuliapitiya Kanadulla after a defender collided with a
      motorcycle An accident occurred in front of Maha Vidyalaya today (01)
      afternoon A young man on a motorcycle and about 4 years old A young child
      (father and son) unfortunately Lost his life. Behaved provocatively with
      the accident Villagers set fire to the defender car that caused the
      accident had May that innocent father and little son rest in peace! 94
      site
    sentences:
      - The image of a Syrian child who sleeps next to the graves of his parents
      - Accident kills four-year-old in northwestern Sri Lanka
      - Masks are ineffective because some packaging says they don't protect
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on Lajavaness/bilingual-embedding-large

This is a sentence-transformers model fine-tuned from Lajavaness/bilingual-embedding-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Lajavaness/bilingual-embedding-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
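The Pooling and Normalize modules listed above can be illustrated with a small pure-Python sketch. It assumes mean pooling over non-padding tokens (pooling_mode_mean_tokens is True) followed by L2 normalization; the helper names `mean_pool` and `l2_normalize` and the toy 4-dimensional vectors are hypothetical stand-ins, not the library's internals.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, so padding is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Toy input: three token vectors of dimension 4 (the real model uses 1024).
# The last token is padding and is masked out.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Because the output has unit length, the cosine similarity of two such embeddings equals their dot product.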

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'a . . . . . (177. FO Accident st THE LEADER IN ACCIDENT REPORTING Reckless driving by a minor Kuliapitiya Kanadulla after a defender collided with a motorcycle An accident occurred in front of Maha Vidyalaya today (01) afternoon A young man on a motorcycle and about 4 years old A young child (father and son) unfortunately Lost his life. Behaved provocatively with the accident Villagers set fire to the defender car that caused the accident had May that innocent father and little son rest in peace! 94 site',
    'Accident kills four-year-old in northwestern Sri Lanka',
    'The image of a Syrian child who sleeps next to the graves of his parents',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
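For semantic search, one plausible use of the similarity scores is to rank candidate claims against a post. The sketch below uses hypothetical 3-dimensional vectors in place of real 1024-dimensional embeddings, and the helper `rank_candidates` is an illustrative name rather than part of the library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-ins for the embeddings of one post and three claims
post = [0.9, 0.1, 0.0]
claims = [[0.1, 0.9, 0.1], [0.8, 0.2, 0.1], [0.0, 0.0, 1.0]]

ranking = rank_candidates(post, claims)
best_index = ranking[0][0]  # index of the closest claim
```

With real embeddings, `post` and `claims` would come from `model.encode(...)`.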

Training Details

Training Dataset

Unnamed Dataset

  • Size: 21,988 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
               sentence_0      sentence_1
    type       string          string
    min        2 tokens        7 tokens
    mean       119.9 tokens    19.25 tokens
    max        512 tokens      128 tokens
  • Samples:
    Sample 1
      sentence_0: ANK DBS DBS IT department at ChangiThis is actually happening as confirmed by my brother who does contract work with DBS at Changi Business Park. Wonder if PAP knows this or turning a blind eye and pretending not to know.
      sentence_1: Photo shows foreign staff of the IT department at DBS Bank in Singapore
    Sample 2
      sentence_0: 29th 30th 31st 32nd 33rd 34th 35th 36th 37th 38th 39th 40th 41st 42nd 43rd 44th 45th 46th 47th 48th 49th 50th 51st 52nd 53rd 54th 55th Urban Planning Foreign Languages Animal Science Law Economics Political Science Education Advertising Journalism Finance Hospitality Criminology Accounting Anthropology Psychology History Geography Information Technology Sociology Sports Science Social Sciences Real Estate Liberal Arts Communications and Mass Media Business Marketing Public Relations 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th 14th 15th 16th 17th 18th 19th 20th 21st 22nd 23rd 24th 25th 26th 27th 28th Architecture Chemical Engineering Chemistry Electrical Engineering Physics Mechanical Engineering Civil Engineering Biochemistry Medicine Pharmacy Engineering Nursing Math Biology Philosophy Mathematics Statistics Music Microbiology Psychology Accounting Finance Environmental Science Creative Writing Hospitality International Relations Art History Ecology55 most difficult course...
      sentence_1: Harvard list of its 50 most difficult courses
    Sample 3
      sentence_0: The 30,000 sheep donated by Mongolia to China entered through the Erenhot port, which is very spectacular. [Qiang] Yesterday there were people who were worried about how to transport so many sheep. It turned out that they came by themselves, and they didn't even need transport tools.
      sentence_1: These videos show 30,000 sheep donated to China by Mongolia during the novel coronavirus epidemic
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
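As a rough illustration of how these parameters are used, the following pure-Python sketch computes a multiple-negatives ranking loss over a batch of (sentence_0, sentence_1) embedding pairs: each anchor's own positive is the target, and the other positives in the batch serve as negatives. It assumes no explicit hard negatives; the function name `mnr_loss` and the toy 2-dimensional vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    Row i of the score matrix compares anchor i with every positive in the
    batch; the correct "class" for row i is column i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])     # pairs match
misaligned = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
```

The loss is near zero when each anchor is closest to its own positive and grows toward the scale value when the pairs are swapped.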
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
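To make the learning-rate settings above concrete (learning_rate 5e-05, lr_scheduler_type linear, warmup_steps 0), here is a minimal sketch of a linear decay schedule. The function `linear_lr` is a hypothetical helper, and the total of 10,994 steps is an assumption: 21,988 samples at batch size 2 on a single device.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed: one device, so 21988 samples / batch size 2 = 10994 optimizer steps
total_steps = 10994
lr_start = linear_lr(0, total_steps)           # full learning rate
lr_mid = linear_lr(5497, total_steps)          # halfway through the epoch
lr_end = linear_lr(total_steps, total_steps)   # decayed to zero
```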

Training Logs

Epoch    Step    Training Loss
0.0455   500     0.0505
0.0910   1000    0.0637
0.1364   1500    0.039
0.1819   2000    0.0269
0.2274   2500    0.0527
0.2729   3000    0.0576
0.3184   3500    0.0278
0.3638   4000    0.0471
0.4093   4500    0.0486
0.4548   5000    0.025
0.5003   5500    0.0324
0.5458   6000    0.0169
0.5912   6500    0.0218
0.6367   7000    0.0476
0.6822   7500    0.0124
0.7277   8000    0.0247
0.7731   8500    0.0231
0.8186   9000    0.01
0.8641   9500    0.0145
0.9096   10000   0.0267
0.9551   10500   0.0111
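The Epoch column appears to follow from the step count, the dataset size, and the batch size. The sketch below reproduces it under the assumption of a single training device; `steps_per_epoch` and `epoch_fraction` are hypothetical helper names.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

def epoch_fraction(step, num_samples=21988, batch_size=2):
    """Fraction of one epoch completed after `step` optimizer steps."""
    return step / steps_per_epoch(num_samples, batch_size)

spe = steps_per_epoch(21988, 2)             # 10994
first = round(epoch_fraction(500), 4)       # 0.0455
last = round(epoch_fraction(10500), 4)      # 0.9551
```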

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}