BGE base ArgillaSDK Matryoshka
This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Language: en
- License: apache-2.0
Model Sources
- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
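The trailing `Normalize()` module L2-normalizes the pooled CLS embedding, so every vector has unit length and cosine similarity coincides with the dot product. A quick way to verify this, using the model as loaded in the Usage section below:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("plaguss/bge-base-argilla-sdk-matryoshka")
emb = model.encode(["How do I connect to an Argilla server?"])

# Normalize() makes every embedding unit-norm, so the norms print as ~1.0
# and cosine similarity reduces to a plain dot product.
print(np.linalg.norm(emb, axis=1))
```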
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("plaguss/bge-base-argilla-sdk-matryoshka")
# Run inference
sentences = [
'hide: footer\n\nrg.Argilla\n\nTo interact with the Argilla server from python you can use the Argilla class. The Argilla client is used to create, get, update, and delete all Argilla resources, such as workspaces, users, datasets, and records.\n\nUsage Examples\n\nConnecting to an Argilla server\n\nTo connect to an Argilla server, instantiate the Argilla class and pass the api_url of the server and the api_key to authenticate.\n\n```python\nimport argilla_sdk as rg',
'Can the Argilla class be employed to streamline dataset administration tasks in my Argilla server setup?',
'The Argilla flowers were blooming beautifully in the garden.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
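Because the model was trained with MatryoshkaLoss over dimensions 768 down to 64 (see Training Details), embeddings can be truncated to any of those sizes with a modest quality trade-off. A minimal sketch using the `truncate_dim` argument available in recent sentence-transformers releases, reusing the `sentences` list from the snippet above; 256 is one of the trained sizes:

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 256 dimensions of every embedding.
model_256 = SentenceTransformer(
    "plaguss/bge-base-argilla-sdk-matryoshka", truncate_dim=256
)

embeddings = model_256.encode(sentences)
print(embeddings.shape)
# [3, 256]
```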
Evaluation
Metrics
Information Retrieval
- Dataset: `dim_768`
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.1327 |
cosine_accuracy@3 | 0.2857 |
cosine_accuracy@5 | 0.3878 |
cosine_accuracy@10 | 0.5204 |
cosine_precision@1 | 0.1327 |
cosine_precision@3 | 0.0952 |
cosine_precision@5 | 0.0776 |
cosine_precision@10 | 0.052 |
cosine_recall@1 | 0.1327 |
cosine_recall@3 | 0.2857 |
cosine_recall@5 | 0.3878 |
cosine_recall@10 | 0.5204 |
cosine_ndcg@10 | 0.3086 |
cosine_mrr@10 | 0.2432 |
cosine_map@100 | 0.2604 |
Information Retrieval
- Dataset: `dim_512`
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.102 |
cosine_accuracy@3 | 0.2755 |
cosine_accuracy@5 | 0.3878 |
cosine_accuracy@10 | 0.5102 |
cosine_precision@1 | 0.102 |
cosine_precision@3 | 0.0918 |
cosine_precision@5 | 0.0776 |
cosine_precision@10 | 0.051 |
cosine_recall@1 | 0.102 |
cosine_recall@3 | 0.2755 |
cosine_recall@5 | 0.3878 |
cosine_recall@10 | 0.5102 |
cosine_ndcg@10 | 0.2942 |
cosine_mrr@10 | 0.2264 |
cosine_map@100 | 0.2426 |
Information Retrieval
- Dataset: `dim_256`
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.1224 |
cosine_accuracy@3 | 0.2755 |
cosine_accuracy@5 | 0.3878 |
cosine_accuracy@10 | 0.5 |
cosine_precision@1 | 0.1224 |
cosine_precision@3 | 0.0918 |
cosine_precision@5 | 0.0776 |
cosine_precision@10 | 0.05 |
cosine_recall@1 | 0.1224 |
cosine_recall@3 | 0.2755 |
cosine_recall@5 | 0.3878 |
cosine_recall@10 | 0.5 |
cosine_ndcg@10 | 0.2931 |
cosine_mrr@10 | 0.2291 |
cosine_map@100 | 0.2445 |
Information Retrieval
- Dataset: `dim_128`
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.0918 |
cosine_accuracy@3 | 0.2551 |
cosine_accuracy@5 | 0.3163 |
cosine_accuracy@10 | 0.4694 |
cosine_precision@1 | 0.0918 |
cosine_precision@3 | 0.085 |
cosine_precision@5 | 0.0633 |
cosine_precision@10 | 0.0469 |
cosine_recall@1 | 0.0918 |
cosine_recall@3 | 0.2551 |
cosine_recall@5 | 0.3163 |
cosine_recall@10 | 0.4694 |
cosine_ndcg@10 | 0.2629 |
cosine_mrr@10 | 0.1992 |
cosine_map@100 | 0.2165 |
Information Retrieval
- Dataset: `dim_64`
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.0816 |
cosine_accuracy@3 | 0.2551 |
cosine_accuracy@5 | 0.3163 |
cosine_accuracy@10 | 0.4796 |
cosine_precision@1 | 0.0816 |
cosine_precision@3 | 0.085 |
cosine_precision@5 | 0.0633 |
cosine_precision@10 | 0.048 |
cosine_recall@1 | 0.0816 |
cosine_recall@3 | 0.2551 |
cosine_recall@5 | 0.3163 |
cosine_recall@10 | 0.4796 |
cosine_ndcg@10 | 0.2611 |
cosine_mrr@10 | 0.194 |
cosine_map@100 | 0.2059 |
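Each table above corresponds to one `InformationRetrievalEvaluator` run at a different embedding size. A sketch of how such a family of evaluators is typically built; the `queries`, `corpus`, and `relevant_docs` mappings below are placeholders, not the actual evaluation data:

```python
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)

# Placeholder data: query id -> text, doc id -> text,
# and query id -> ids of the relevant docs.
queries = {"q1": "Can the Argilla class manage datasets on the server?"}
corpus = {"d1": "The Argilla client is used to create, get, update, and delete resources."}
relevant_docs = {"q1": {"d1"}}

evaluators = [
    InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        truncate_dim=dim,   # score using only the first `dim` dimensions
        name=f"dim_{dim}",  # matches the dataset names in the tables above
    )
    for dim in [768, 512, 256, 128, 64]
]

# SequentialEvaluator runs all five evaluators at each evaluation step.
evaluator = SequentialEvaluator(evaluators)
```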
Training Details
Training Dataset
Unnamed Dataset
- Size: 882 training samples
- Columns: `anchor`, `positive`, and `negative`
- Approximate statistics based on the first 1000 samples:
 | anchor | positive | negative |
---|---|---|---|
type | string | string | string |
details | min: 6 tokens<br>mean: 90.85 tokens<br>max: 198 tokens | min: 8 tokens<br>mean: 25.44 tokens<br>max: 91 tokens | min: 10 tokens<br>mean: 22.33 tokens<br>max: 61 tokens |
- Samples:
anchor | positive | negative |
---|---|---|
!!! note "Update the metadata"<br>The `metadata` of `Record` object is a python dictionary. So to update the metadata of a record, you can iterate over the records and update the metadata by key or using `metadata.update`. After that, you should update the records in the dataset. | Can I use Argilla to annotate the metadata of Record objects and update them in the dataset? | The beautiful scenery of the Argilla valley in Italy is perfect for a relaxing summer vacation. |
git checkout [branch-name]<br>git rebase [default-branch]<br><br>If everything is right, we need to commit and push the changes to your fork. For that, run the following commands:<br><br>sh<br># Add the changes to the staging area<br>git add filename<br><br># Commit the changes by writing a proper message<br>git commit -m "commit-message"<br><br># Push the changes to your fork | Can I commit Argilla's annotation changes and push them to a forked project repository after rebasing from the default branch? | The beautiful beach in Argilla, Spain, is a popular spot for surfers to catch a wave and enjoy the sunny weather. |
Accessing Record Attributes<br><br>The Record object has suggestions, responses, metadata, and vectors attributes that can be accessed directly whilst iterating over records in a dataset.<br><br>python<br>for record in dataset.records(<br>    with_suggestions=True,<br>    with_responses=True,<br>    with_metadata=True,<br>    with_vectors=True<br>):<br>    print(record.suggestions)<br>    print(record.responses)<br>    print(record.metadata)<br>    print(record.vectors) | Is it possible to retrieve the suggestions, responses, metadata, and vectors of a Record object at the same time when iterating over a dataset in Argilla? | The new hiking trail offered breathtaking suggestions for scenic views, responses to environmental concerns, and metadata about the surrounding ecosystem, but it lacked vectors for navigation. |
- Loss: `MatryoshkaLoss` with these parameters:
  {
      "loss": "TripletLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
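A sketch of how this training setup maps onto the sentence-transformers API; the dataset rows below are made-up stand-ins for the 882 real triplets, and the base model name comes from the Model Details section:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, TripletLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Placeholder triplets with the same three columns as the training set.
train_dataset = Dataset.from_dict({
    "anchor": ["The Argilla client is used to create, get, update, and delete resources."],
    "positive": ["Can the Argilla class manage resources on the server?"],
    "negative": ["The Argilla flowers were blooming beautifully in the garden."],
})

# MatryoshkaLoss applies the inner TripletLoss to every truncation of the
# embedding listed in matryoshka_dims, with equal weights.
loss = MatryoshkaLoss(
    model,
    TripletLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,  # -1 means train all dimensions at every step
)
```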
Training Hyperparameters
Non-Default Hyperparameters
- `eval_strategy`: epoch
- `per_device_eval_batch_size`: 4
- `gradient_accumulation_steps`: 4
- `learning_rate`: 2e-05
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `load_best_model_at_end`: True
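These values translate directly into `SentenceTransformerTrainingArguments`. A sketch, with a placeholder output directory and an assumed `save_strategy="epoch"` (required to match `eval_strategy` when `load_best_model_at_end` is set, and not listed in the dump below):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-argilla-sdk-matryoshka",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",  # assumption: must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)
```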
All Hyperparameters
Click to expand
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 8
- `per_device_eval_batch_size`: 4
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 4
- `eval_accumulation_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: cosine
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
Training Logs
Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
---|---|---|---|---|---|---|---|
0.1802 | 5 | 21.701 | - | - | - | - | - |
0.3604 | 10 | 21.7449 | - | - | - | - | - |
0.5405 | 15 | 21.7453 | - | - | - | - | - |
0.7207 | 20 | 21.7168 | - | - | - | - | - |
0.9009 | 25 | 21.6945 | - | - | - | - | - |
**0.973** | **27** | **-** | **0.2165** | **0.2445** | **0.2426** | **0.2059** | **0.2604** |
1.0811 | 30 | 21.7248 | - | - | - | - | - |
1.2613 | 35 | 21.7322 | - | - | - | - | - |
1.4414 | 40 | 21.7367 | - | - | - | - | - |
1.6216 | 45 | 21.6821 | - | - | - | - | - |
1.8018 | 50 | 21.8392 | - | - | - | - | - |
1.9820 | 55 | 21.6441 | 0.2165 | 0.2445 | 0.2426 | 0.2059 | 0.2604 |
2.1622 | 60 | 21.8154 | - | - | - | - | - |
2.3423 | 65 | 21.7098 | - | - | - | - | - |
2.5225 | 70 | 21.6447 | - | - | - | - | - |
2.7027 | 75 | 21.6033 | - | - | - | - | - |
2.8829 | 80 | 21.8271 | - | - | - | - | - |
2.9189 | 81 | - | 0.2165 | 0.2445 | 0.2426 | 0.2059 | 0.2604 |
- The bold row denotes the saved checkpoint.
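For completeness, a sketch of how the pieces above fit into the sentence-transformers v3 trainer. It deliberately reuses the `model`, `train_dataset`, `loss`, `args`, and `evaluator` objects from the sketches in the previous sections rather than repeating them:

```python
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,          # BAAI/bge-base-en-v1.5, from the loss sketch
    args=args,            # from the hyperparameter sketch
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,  # the SequentialEvaluator from the Metrics sketch
)
trainer.train()
trainer.model.save_pretrained("bge-base-argilla-sdk-matryoshka/final")  # placeholder path
```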
Framework Versions
- Python: 3.11.8
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.1.2
- Accelerate: 0.31.0
- Datasets: 2.19.2
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
TripletLoss
@misc{hermans2017defense,
title={In Defense of the Triplet Loss for Person Re-Identification},
author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
year={2017},
eprint={1703.07737},
archivePrefix={arXiv},
primaryClass={cs.CV}
}