--- language: [] library_name: sentence-transformers tags: - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - dataset_size:11354 - loss:BatchAllTripletLoss base_model: FacebookAI/roberta-base datasets: [] widget: - source_sentence: 'section: Uni cation-based explicit/implicit grammars, text: The grammar was modi ed so that the PCFG formed a backbone and each application of a rule involved uni cation of the residue of features in the manner speci ed in the original feature-based grammar.' sentences: - 'section: abstract, text: RNNs are good at modeling long-term dependencies over input texts, but preclude parallel computation.' - 'section: Results for Prompt Relevance, text: Table 2 (Top) shows the performance of the approaches over 11 prompts.' - 'section: Results, text: The interface between Check and ElasticSearch is filled by Alegre, an API part of the Check suite which is responsible for text and image processing, for example, similarity, classification, glossary and language identification.' - source_sentence: 'section: Languages Families Experiment, text: Finally, Japanese and Danish speakers slightly prefer the pattern NN PRP VBZ RB more than others.' sentences: - 'section: Experiments using Machine Induction, text: Position in phrase (P-P and I-P) uses numeric rather than symbolic values.' - 'section: Datasets, text: In addition, RWTH-BOSTON-50 (19) includes 483 samples of 50 different glosses, RWTH-BOSTON-104 (6) provides 200 continuous sentences encompassing 104 signs/words, and RWTH-BOSTON-400 (7), a sentence-level corpus, contains 843 sentences involving around 400 signs. ' - 'section: Convolutional Neural Networks for QA Joint Learning, text: (1) Here, H 0 is one real-value matrix after sentence semantic encoding by concatenating the word vectors with sliding windows.' - source_sentence: 'section: Comparing our Baselines and Models, text: [16] further improves on DSS for LM tasks by introducing a Gated State Space version called GSS, which performs better on PG19, arXiv and GitHub.' sentences: - 'section: Introduction, text: Current speech and natural language integration mainly relies on word-level n-best search techniques [1,2] as shown in figure 1.' - 'section: Experimental Results, text: The F M value of All system suggests that it is the most aggressive approach. ' - 'section: Introduction, text: Because inevitable relation holds at any time and the reliability of conclusions inferred from it doesn''t fall down and transitive relation can be described efficiently. ' - source_sentence: 'section: Conclusion, text: We presented a comparative evaluation of GPT-4, GPT-3.5 and Flan-PaLM 540B on medical competency examinations and benchmark datasets.' sentences: - 'section: Experiment 3: Recursive Structure, text: We surprisingly find that this non-recursive corpus induces the same amount of structural transfer as the recursive nesting parentheses, which emphasizes the importance of pairing, head-dependency type structure in the linguistic structural embeddings of LSTMs.' - 'section: Introduction, text: 2. try to analyze data by using the constructed rules and extract the exceptions that cannot be correctly handled, then return to the first step and focus on the exceptions. ' - 'section: Evaluation By Token, text: I repeated the experiment once with closed-class words and once without, and again averaged the results over the two directions of translation.' 
- source_sentence: 'section: Experimental Setup, text: The training data used for speech recognition -CSR -is different from the Treebank in two aspects: • the Treebank is only a subset of the usual CSR training data; • the Treebank tokenization is different from that of the CSR corpus; among other spurious small differences, the most frequent ones are of the type presented in'
  sentences:
  - 'section: Comparing with Previous Latent Semantic Models, text: 𝐹 𝑖 (or its translation candidate 𝐸), and 𝐲 be the projected feature vector, i.e., 𝐲 = 𝐖 T 𝐱.'
  - 'section: Out-of-domain MT, text: The improvement of DIPMT over the baseline is striking -we'
  - 'section: Multiple choice next sentence prediction (NSP), text: We have collected a new dataset with 54k multiple choice questions where the objective is to predict the correct continuation for a given context sentence from four possible answer choices.'
pipeline_tag: sentence-similarity
---

# SentenceTransformer based on FacebookAI/roberta-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python from sentence_transformers import SentenceTransformer # Download from the 🤗 Hub model = SentenceTransformer("SBERT-roberta_pf") # Run inference sentences = [ 'section: Experimental Setup, text: The training data used for speech recognition -CSR -is different from the Treebank in two aspects: • the Treebank is only a subset of the usual CSR training data; • the Treebank tokenization is different from that of the CSR corpus; among other spurious small differences, the most frequent ones are of the type presented in', 'section: Multiple choice next sentence prediction (NSP), text: We have collected a new dataset with 54k multiple choice questions where the objective is to predict the correct continuation for a given context sentence from four possible answer choices.', 'section: Comparing with Previous Latent Semantic Models, text: 𝐹 𝑖 (or its translation candidate 𝐸), and 𝐲 be the projected feature vector, i.e., 𝐲 = 𝐖 T 𝐱.', ] embeddings = model.encode(sentences) print(embeddings.shape) # [3, 768] # Get the similarity scores for the embeddings similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [3, 3] ``` ## Training Details ### Training Dataset #### Unnamed Dataset * Size: 11,354 training samples * Columns: text and label * Approximate statistics based on the first 1000 samples: | | text | label | |:--------|:------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | string | int | | details | | | * Samples: | text | label | |:--------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------| | section: INTRODUCTION, text: Arguments for the importance of prosody in language abound in the literature. | 0 | | section: Results, text: This overlap ensures that actions that might otherwise occur on clip boundaries will also occur as part of a clip. | 7 | | section: Introduction, text: In Section 4 the experimental setup and results are detailed. | 6 | * Loss: [BatchAllTripletLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#batchalltripletloss) ### Evaluation Dataset #### Unnamed Dataset * Size: 1,419 evaluation samples * Columns: text and label * Approximate statistics based on the first 1000 samples: | | text | label | |:--------|:------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | string | int | | details | | | * Samples: | text | label | |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------| | section: Introduction, text: It is common in Natural Language Processing (NLP) that the categories into which text is classified do not have fully objective definitions. 
| 0 | | section: Automatic Evaluation Results, text: With respect to the BLEU score, this difference is 1.58 points absolute for the word based evaluation (27% relative increase), and 2.47 points absolute for the morphemebased evaluation (21% relative increase). | 2 | | section: Neural Descriptor Fields, text: The sum of the maxpooled mask probabilities of all slots can be used for counting, and the loss can be back propagated to optimize NDF as well as the embeddings. | 7 | * Loss: [BatchAllTripletLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#batchalltripletloss) ### Training Hyperparameters #### Non-Default Hyperparameters - `eval_strategy`: steps - `learning_rate`: 1e-05 - `weight_decay`: 0.1 - `load_best_model_at_end`: True - `push_to_hub`: True #### All Hyperparameters
Click to expand - `overwrite_output_dir`: False - `do_predict`: False - `eval_strategy`: steps - `prediction_loss_only`: True - `per_device_train_batch_size`: 8 - `per_device_eval_batch_size`: 8 - `per_gpu_train_batch_size`: None - `per_gpu_eval_batch_size`: None - `gradient_accumulation_steps`: 1 - `eval_accumulation_steps`: None - `learning_rate`: 1e-05 - `weight_decay`: 0.1 - `adam_beta1`: 0.9 - `adam_beta2`: 0.999 - `adam_epsilon`: 1e-08 - `max_grad_norm`: 1.0 - `num_train_epochs`: 3 - `max_steps`: -1 - `lr_scheduler_type`: linear - `lr_scheduler_kwargs`: {} - `warmup_ratio`: 0.0 - `warmup_steps`: 0 - `log_level`: passive - `log_level_replica`: warning - `log_on_each_node`: True - `logging_nan_inf_filter`: True - `save_safetensors`: True - `save_on_each_node`: False - `save_only_model`: False - `restore_callback_states_from_checkpoint`: False - `no_cuda`: False - `use_cpu`: False - `use_mps_device`: False - `seed`: 42 - `data_seed`: None - `jit_mode_eval`: False - `use_ipex`: False - `bf16`: False - `fp16`: False - `fp16_opt_level`: O1 - `half_precision_backend`: auto - `bf16_full_eval`: False - `fp16_full_eval`: False - `tf32`: None - `local_rank`: 0 - `ddp_backend`: None - `tpu_num_cores`: None - `tpu_metrics_debug`: False - `debug`: [] - `dataloader_drop_last`: False - `dataloader_num_workers`: 0 - `dataloader_prefetch_factor`: None - `past_index`: -1 - `disable_tqdm`: False - `remove_unused_columns`: True - `label_names`: None - `load_best_model_at_end`: True - `ignore_data_skip`: False - `fsdp`: [] - `fsdp_min_num_params`: 0 - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} - `fsdp_transformer_layer_cls_to_wrap`: None - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} - `deepspeed`: None - `label_smoothing_factor`: 0.0 - `optim`: adamw_torch - `optim_args`: None - `adafactor`: False - `group_by_length`: False - `length_column_name`: length - `ddp_find_unused_parameters`: None - `ddp_bucket_cap_mb`: None - `ddp_broadcast_buffers`: False - `dataloader_pin_memory`: True - `dataloader_persistent_workers`: False - `skip_memory_metrics`: True - `use_legacy_prediction_loop`: False - `push_to_hub`: True - `resume_from_checkpoint`: None - `hub_model_id`: None - `hub_strategy`: every_save - `hub_private_repo`: False - `hub_always_push`: False - `gradient_checkpointing`: False - `gradient_checkpointing_kwargs`: None - `include_inputs_for_metrics`: False - `eval_do_concat_batches`: True - `fp16_backend`: auto - `push_to_hub_model_id`: None - `push_to_hub_organization`: None - `mp_parameters`: - `auto_find_batch_size`: False - `full_determinism`: False - `torchdynamo`: None - `ray_scope`: last - `ddp_timeout`: 1800 - `torch_compile`: False - `torch_compile_backend`: None - `torch_compile_mode`: None - `dispatch_batches`: None - `split_batches`: None - `include_tokens_per_second`: False - `include_num_input_tokens_seen`: False - `neftune_noise_alpha`: None - `optim_target_modules`: None - `batch_eval_metrics`: False - `batch_sampler`: batch_sampler - `multi_dataset_batch_sampler`: proportional
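Below is a minimal sketch of how a comparable run can be set up with the Sentence Transformers v3 trainer API using the non-default hyperparameters above. It is not the exact training script: the two example rows, the `eval_dataset` reuse, and the `output_dir` value are placeholders, since the underlying dataset is not published with this card.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import BatchAllTripletLoss

# Base model: a mean-pooling layer is added automatically when loading
# a plain Hugging Face checkpoint as a SentenceTransformer.
model = SentenceTransformer("FacebookAI/roberta-base")

# Placeholder data: the real dataset has ~11k "text"/"label" pairs of the form
# "section: <heading>, text: <sentence>" with an integer section label.
train_dataset = Dataset.from_dict({
    "text": [
        "section: Introduction, text: In Section 4 the experimental setup and results are detailed.",
        "section: Results, text: Table 2 (Top) shows the performance of the approaches over 11 prompts.",
    ],
    "label": [6, 7],
})
eval_dataset = train_dataset  # placeholder; the card uses a held-out split of 1,419 samples

# BatchAllTripletLoss builds all valid (anchor, positive, negative) triplets
# within each batch from the integer labels and averages the margin violations.
loss = BatchAllTripletLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="SBERT-roberta_pf",  # output path is an assumption
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.1,
    eval_strategy="steps",
    eval_steps=500,  # matches the 500-step cadence in the training logs below
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```

Note that `BatchAllTripletLoss` only yields triplets when a batch contains at least two texts sharing a label plus one text with a different label, so batches drawn from heavily imbalanced label sets may contribute little signal.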
### Training Logs

| Epoch      | Step     | Training Loss | Validation Loss |
|:----------:|:--------:|:-------------:|:---------------:|
| 0.3521     | 500      | 4.3466        | 4.0196          |
| 0.7042     | 1000     | 3.9809        | 3.3573          |
| 1.0563     | 1500     | 3.8231        | 3.7082          |
| 1.4085     | 2000     | 3.5722        | 3.6799          |
| 1.7606     | 2500     | 3.6224        | 3.4086          |
| 2.1127     | 3000     | 3.1266        | 3.2109          |
| 2.4648     | 3500     | 3.1252        | 3.3558          |
| **2.8169** | **4000** | **3.1115**    | **3.1682**      |

* The bold row denotes the saved checkpoint.

### Framework Versions
- Python: 3.9.2
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.19.2
- Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### BatchAllTripletLoss
```bibtex
@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```