metadata

base_model: BAAI/bge-small-en-v1.5
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - dot_accuracy
  - manhattan_accuracy
  - euclidean_accuracy
  - max_accuracy
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:60341
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      What is the focus of the research conducted by the MHCI x 99P Labs
      Capstone Team?
    sentences:
      - >-
        To determine the destination of a given car based on an initial start
        position in time, we developed a Markov Model. We then creatively
        combined DBScan, K-NN, and XGboost algorithms to achieve accurate dwell
        time forecasts.
      - >-
        Transportation networks touch all three pillars of sustainability. They
        shape our daily lives by connecting us to work, retail, and recreation;
        however, a system that does not connect everyone equitably reproduces
        social disparities.
      - >-
        Two weeks of digging deep into exploratory, generative research

        Written by the MHCI x 99P Labs Capstone TeamEdited by 99P Labs

        The MHCI x 99P Labs Capstone Team is part of the Master of
        Human-Computer Interaction (MHCI) program at Carnegie Mellon University.
  - source_sentence: What limits are being considered for data quality checks?
    sentences:
      - >-
        Unlike many other Agile teams, we don t do a Retro every sprint, mostly
        because we do one-week sprints.
      - >-
        Our team has been exploring implementing data quality checks into our
        data platform. We ve been trying to establish our goals, limits, and
        expectations, some of which were discussed in Part 1 of our Data Quality
        blog posts.
      - >-
        Literature and Topical ReviewEach team member performed a literature
        review on telematics research, identifying its applications,
        methodologies, and critical insights.
  - source_sentence: What are the potential consequences of not researching before coding?
    sentences:
      - >-
        This indicates a degree of variance in the model s accuracy across
        different times and conditions.
      - >-
        In order to objectively test ourselves on the knowledge we ve gained, we
        decide to take a quiz. The quiz contains 50 images of either dogs or
        cats and we have to determine which animal the image most closely
        resembles.
      - >-
        To reiterate, before even writing any code, it s important to do proper
        research into your team s documentation and online resources. A lot of
        time can be saved by reusing code that can adapt to your use case
        instead of starting from scratch every time.
  - source_sentence: What might be the implications of having a performance of 3%?
    sentences:
      - Then, I will highlight the top three winning projects from each track.
      - >-
        Channels can be used only by organizations that are invited to the
        channel and are invisible to other members of the network. Each channel
        has a separate blockchain ledger.
      - >-
        3%, only slightly better than the worst-performing model, K Nearest
        Neighbors.
  - source_sentence: In what context is traffic flow theory typically discussed?
    sentences:
      - >-
        As a result, I was familiar with many terms discussed conceptually but I
        discovered some of the more official terminology used when discussing
        traffic flow theory and network control.
      - >-
        We called it plus-deltas (+/ ). Seeing the output and outcomes we
        accomplished in each session was encouraging and allowed us to
        acknowledge things we did that made us successful so we could carry it
        on to the next session.
      - There are different types of projects within C.
model-index:
  - name: SentenceTransformer based on BAAI/bge-small-en-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: 99GPT Finetuning Embedding test 01
          type: 99GPT-Finetuning-Embedding-test-01
        metrics:
          - type: cosine_accuracy
            value: 0.9987405541561712
            name: Cosine Accuracy
          - type: dot_accuracy
            value: 0.0011931592204693093
            name: Dot Accuracy
          - type: manhattan_accuracy
            value: 0.9987405541561712
            name: Manhattan Accuracy
          - type: euclidean_accuracy
            value: 0.9987405541561712
            name: Euclidean Accuracy
          - type: max_accuracy
            value: 0.9987405541561712
            name: Max Accuracy

SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-small-en-v1.5
Maximum Sequence Length: 512 tokens
Output Dimensionality: 384 dimensions
Similarity Function: Cosine Similarity

Model Sources

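Documentation: Sentence Transformers Documentation (https://sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)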

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
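
The last two modules can be illustrated in isolation. The sketch below is a minimal pure-Python illustration of CLS-token pooling followed by L2 normalization, assuming the transformer has already produced one vector per token. The helper names `cls_pool` and `l2_normalize` and the 4-dimensional toy vectors are hypothetical; the real model works on 384-dimensional tensors.

```python
import math

def cls_pool(token_embeddings):
    """Keep only the first token's vector (the [CLS] position)."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Hypothetical token embeddings: 3 tokens, 4 dimensions instead of 384.
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.5, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because every output has unit length, the dot product of two embeddings equals their cosine similarity.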

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("marroyo777/bge-99GPT-v1")
# Run inference
sentences = [
    'In what context is traffic flow theory typically discussed?',
    'As a result, I was familiar with many terms discussed conceptually but I discovered some of the more official terminology used when discussing traffic flow theory and network control.',
    'There are different types of projects within C.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
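
For semantic search, the similarity scores are typically used to rank candidate passages against a query. The following is a minimal sketch of that ranking step in pure Python. The toy 3-dimensional vectors and the helper `rank_candidates` are hypothetical; in practice the vectors would come from `model.encode(...)`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for a query embedding and three passage embeddings.
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]

ranking = rank_candidates(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```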

Evaluation

Metrics

Triplet

Dataset: 99GPT-Finetuning-Embedding-test-01
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.9987
dot_accuracy	0.0012
manhattan_accuracy	0.9987
euclidean_accuracy	0.9987
max_accuracy	0.9987
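
Triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative. The sketch below computes it on hypothetical unit-length toy vectors, not with the TripletEvaluator implementation. It also shows one possible reading of the table: for normalized embeddings, dot product and cosine similarity coincide, so a dot accuracy near 0 next to a cosine accuracy near 1 is what results if the dot product is compared as a distance (smaller is better) instead of as a similarity. That explanation is an assumption; the card itself does not state it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_accuracy(triplets, distance):
    """Share of triplets where distance(anchor, positive) < distance(anchor, negative)."""
    correct = sum(1 for a, p, n in triplets if distance(a, p) < distance(a, n))
    return correct / len(triplets)

# Hypothetical unit-length embeddings; the third triplet is deliberately wrong.
triplets = [
    ([1.0, 0.0], [0.8, 0.6], [0.0, 1.0]),
    ([0.0, 1.0], [0.6, 0.8], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0], [0.6, 0.8]),
]

acc_cosine = triplet_accuracy(triplets, cosine_distance)
acc_manhattan = triplet_accuracy(triplets, manhattan_distance)
acc_euclidean = triplet_accuracy(triplets, euclidean_distance)
# Dot product used correctly as a similarity (negated to act as a distance).
acc_dot_similarity = triplet_accuracy(triplets, lambda a, b: -dot(a, b))
# Dot product mistakenly treated as a distance: the ranking is inverted.
acc_dot_as_distance = triplet_accuracy(triplets, dot)

print(acc_cosine, acc_manhattan, acc_euclidean, acc_dot_similarity, acc_dot_as_distance)
```

On these toy triplets the four consistent metrics agree, while the inverted dot comparison yields the complement of that accuracy.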

Training Details

Training Dataset

Unnamed Dataset

Size: 60,341 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 7 tokens, mean: 13.77 tokens, max: 24 tokens	min: 7 tokens, mean: 40.26 tokens, max: 123 tokens	min: 6 tokens, mean: 39.24 tokens, max: 139 tokens

Samples:

anchor	positive	negative
`Who is being invited to join the initiative?`	`Our belief is that the research community will be able to gain access to diverse and real-time data with minimal friction, build exciting innovations and make an impact to Data and AI technologies as well. This is just the first release and we are inviting the research community to join us to build exciting data-driven mobility & energy solutions together.`	`Burning it destroys the oil. Once you burn the oil, that particular oil ceases to exist.`
`What is the main focus of the research conducted for Orbit?`	`Orbit holds the culmination of almost a year of research with participants from a wide variety of backgrounds, needs, and jobs to be done.`	`So how do you win a hackathon mobility challenge? The SmartRoute team showed two of them.`
`What role do LLMs play in HRI's strategy?`	`We are excited about the potential of JournAI to transform mobility. By harnessing the power of LLMs and other AI technologies, HRI is driving towards a more connected, efficient, and sustainable future.`	`This simplified the process for users, who only had to pull and run the docker image to spawn a Jupyterlab app on their machine, open it in their browser, and create a new Pyspark notebook that automatically connected to our spark cluster. Our new workflow allows data science teams to configure their spark jobs and compute resources with options to request memory and CPU from the cluster and customize spark settings.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
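
To make the two parameters concrete, here is a minimal pure-Python sketch of the idea behind MultipleNegativesRankingLoss, not the library implementation. For each anchor, its own positive is the correct class among all positives and negatives in the batch, and the scaled cosine similarities are fed to a softmax cross-entropy. The function name `mnrl_loss` and the 2-dimensional toy vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = positives + negatives  # in-batch positives, then the explicit negatives
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

# Positives aligned with their anchors: loss close to zero.
aligned_loss = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]], negatives)
# Positives swapped between anchors: loss close to the scale value.
swapped_loss = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]], negatives)

print(aligned_loss, swapped_loss)
```

With `scale` at 20.0, a cosine gap of 1.0 between the correct and incorrect candidates becomes a logit gap of 20, which is why the aligned case is almost free and the swapped case is penalized heavily.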

Evaluation Dataset

Unnamed Dataset

Size: 15,086 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 6 tokens, mean: 13.73 tokens, max: 24 tokens	min: 6 tokens, mean: 39.51 tokens, max: 131 tokens	min: 6 tokens, mean: 36.9 tokens, max: 153 tokens

Samples:

anchor	positive	negative
`What does the text suggest about the balance between creating tools and their practical application?`	`From technology to healthcare, these examples underline the importance of the interplay between theory and practice, between creating advanced tools and applying them effectively.`	`We found success when leaving the later panels empty as opposed to earlier ones. If we established a clear context and pain point for participants, they were often able to fill in a solution and resolution themselves.`
`Who are the personas mentioned in the text?`	`Our derived data sets are created based on personas that we have identified and their data access needs.`	`However there still exists a need to connect the map matched nodes that are outputted from the libraries to specific data points from the V2X data, in order to get the rest of the V2X features in a specific time frame.`
`Is this the first or second hackathon mentioned?`	`Up next is the first of two hackathons we participated in at Ohio State University.`	`The team did a great job by targeting a pervasive issue in such an intuitive way.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
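
The `no_duplicates` setting matters for a loss that uses in-batch negatives: if the same text appears twice in a batch, a correct match is scored as a negative. The sketch below is a simplified greedy illustration of that constraint, not the library's sampler (which also shuffles and may treat incomplete batches differently). The function `no_duplicate_batches` and the sample texts are hypothetical.

```python
def no_duplicate_batches(samples, batch_size):
    """Greedily group sample indices so that no text repeats within a batch."""
    remaining = list(range(len(samples)))
    batches = []
    while remaining:
        batch, seen, leftover = [], set(), []
        for idx in remaining:
            texts = set(samples[idx])
            if len(batch) < batch_size and not (texts & seen):
                batch.append(idx)
                seen |= texts
            else:
                leftover.append(idx)  # retry in a later batch
        batches.append(batch)
        remaining = leftover
    return batches

# Hypothetical (anchor, positive, negative) triplets with repeated texts.
samples = [
    ("q1", "p1", "n1"),
    ("q1", "p2", "n2"),  # shares anchor "q1" with sample 0
    ("q2", "p3", "n3"),
    ("q3", "p1", "n4"),  # shares positive "p1" with sample 0
]

batches = no_duplicate_batches(samples, batch_size=2)
print(batches)  # [[0, 2], [1, 3]]
```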

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
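
The combination of `learning_rate: 5e-05`, `lr_scheduler_type: linear`, and `warmup_ratio: 0.1` implies a learning rate that rises linearly during the first tenth of training and then decays linearly to zero. The following sketch reproduces that shape in pure Python, assuming the usual linear warmup and decay formula and the 11316 total steps shown in the training logs. The function name `linear_schedule_lr` is hypothetical.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    decay_span = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_span)

total_steps = 11316  # final step reported in the training logs
warmup_steps = math.ceil(total_steps * 0.1)
print("warmup steps:", warmup_steps)  # warmup steps: 1132

# Start, mid-warmup, peak, mid-decay, end.
for step in (0, 566, 1132, 6224, 11316):
    print(step, linear_schedule_lr(step, total_steps))
```

Under these assumptions the peak rate of 5e-05 is reached at step 1132, about 0.3 epochs in.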

Training Logs

Epoch	Step	Training Loss	Validation Loss	99GPT-Finetuning-Embedding-test-01_max_accuracy
0.0265	100	0.7653	0.4309	-
0.0530	200	0.4795	0.2525	-
0.0795	300	0.3416	0.1996	-
0.1060	400	0.2713	0.1699	-
0.1326	500	0.2271	0.1558	-
0.1591	600	0.2427	0.1510	-
0.1856	700	0.2188	0.1414	-
0.2121	800	0.1936	0.1350	-
0.2386	900	0.2174	0.1370	-
0.2651	1000	0.2104	0.1265	-
0.2916	1100	0.2142	0.1324	-
0.3181	1200	0.2088	0.1297	-
0.3446	1300	0.1865	0.1240	-
0.3712	1400	0.177	0.1221	-
0.3977	1500	0.1735	0.1296	-
0.4242	1600	0.1746	0.1188	-
0.4507	1700	0.1639	0.1178	-
0.4772	1800	0.1958	0.1105	-
0.5037	1900	0.1874	0.1152	-
0.5302	2000	0.1676	0.1143	-
0.5567	2100	0.1671	0.1067	-
0.5832	2200	0.142	0.1154	-
0.6098	2300	0.1668	0.1150	-
0.6363	2400	0.1605	0.1091	-
0.6628	2500	0.1475	0.1096	-
0.6893	2600	0.1668	0.1066	-
0.7158	2700	0.166	0.1067	-
0.7423	2800	0.1611	0.0999	-
0.7688	2900	0.1747	0.1001	-
0.7953	3000	0.1436	0.1065	-
0.8218	3100	0.1579	0.0992	-
0.8484	3200	0.1718	0.1006	-
0.8749	3300	0.1567	0.0995	-
0.9014	3400	0.1634	0.0954	-
0.9279	3500	0.1441	0.0956	-
0.9544	3600	0.1433	0.0991	-
0.9809	3700	0.1562	0.0931	-
1.0074	3800	0.1421	0.0931	-
1.0339	3900	0.1424	0.0956	-
1.0604	4000	0.128	0.0900	-
1.0870	4100	0.1265	0.0921	-
1.1135	4200	0.1062	0.0944	-
1.1400	4300	0.1221	0.0900	-
1.1665	4400	0.1091	0.0944	-
1.1930	4500	0.091	0.0913	-
1.2195	4600	0.0823	0.0935	-
1.2460	4700	0.0946	0.0949	-
1.2725	4800	0.0803	0.0890	-
1.2990	4900	0.0796	0.0885	-
1.3256	5000	0.0699	0.0921	-
1.3521	5100	0.073	0.0909	-
1.3786	5200	0.0608	0.0934	-
1.4051	5300	0.07	0.0941	-
1.4316	5400	0.0732	0.0896	-
1.4581	5500	0.0639	0.0910	-
1.4846	5600	0.0722	0.0874	-
1.5111	5700	0.0635	0.0925	-
1.5376	5800	0.0631	0.0887	-
1.5642	5900	0.0589	0.0896	-
1.5907	6000	0.0636	0.0925	-
1.6172	6100	0.0702	0.0938	-
1.6437	6200	0.0572	0.0921	-
1.6702	6300	0.0516	0.0946	-
1.6967	6400	0.0695	0.0902	-
1.7232	6500	0.0632	0.0917	-
1.7497	6600	0.0697	0.0832	-
1.7762	6700	0.0747	0.0853	-
1.8028	6800	0.0615	0.0892	-
1.8293	6900	0.0747	0.0855	-
1.8558	7000	0.0668	0.0848	-
1.8823	7100	0.0747	0.0853	-
1.9088	7200	0.0774	0.0847	-
1.9353	7300	0.0546	0.0874	-
1.9618	7400	0.0708	0.0879	-
1.9883	7500	0.0632	0.0863	-
2.0148	7600	0.0601	0.0873	-
2.0414	7700	0.063	0.0870	-
2.0679	7800	0.0646	0.0819	-
2.0944	7900	0.0557	0.0825	-
2.1209	8000	0.0444	0.0841	-
2.1474	8100	0.049	0.0825	-
2.1739	8200	0.0441	0.0845	-
2.2004	8300	0.0451	0.0844	-
2.2269	8400	0.0346	0.0851	-
2.2534	8500	0.0398	0.0847	-
2.2800	8600	0.033	0.0855	-
2.3065	8700	0.0355	0.0851	-
2.3330	8800	0.0313	0.0867	-
2.3595	8900	0.0358	0.0870	-
2.3860	9000	0.0251	0.0867	-
2.4125	9100	0.0395	0.0854	-
2.4390	9200	0.0322	0.0838	-
2.4655	9300	0.0355	0.0847	-
2.4920	9400	0.034	0.0834	-
2.5186	9500	0.0345	0.0862	-
2.5451	9600	0.0272	0.0830	-
2.5716	9700	0.0275	0.0831	-
2.5981	9800	0.0345	0.0849	-
2.6246	9900	0.0289	0.0849	-
2.6511	10000	0.0282	0.0860	-
2.6776	10100	0.0279	0.0885	-
2.7041	10200	0.0344	0.0865	-
2.7306	10300	0.0326	0.0863	-
2.7572	10400	0.0383	0.0840	-
2.7837	10500	0.0338	0.0833	-
2.8102	10600	0.0298	0.0836	-
2.8367	10700	0.0402	0.0825	-
2.8632	10800	0.0361	0.0822	-
2.8897	10900	0.0388	0.0818	-
2.9162	11000	0.0347	0.0821	-
2.9427	11100	0.0341	0.0826	-
2.9692	11200	0.0373	0.0825	-
2.9958	11300	0.0354	0.0824	-
3.0	11316	-	-	0.9987
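
The Epoch and Step columns can be cross-checked against the dataset size and batch size reported above. This short sketch does the arithmetic, assuming a single training device (the card does not state the device count), `gradient_accumulation_steps: 1`, and `dataloader_drop_last: False`.

```python
import math

train_samples = 60341  # training set size
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch for a given optimizer step, rounded as in the log table."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)  # 3772
print(total_steps)      # 11316
print(epoch_at(100))    # 0.0265
print(epoch_at(3800))   # 1.0074
```

The computed total of 11316 steps matches the final row of the table, and the fractional epochs match the logged values at steps 100 and 3800.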

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.1.1
Transformers: 4.44.2
PyTorch: 2.4.1+cu121
Accelerate: 0.34.2
Datasets: 3.0.1
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}