Combination of Embedding Models: Arctic M (v1.5) & BGE Small (en; v1.5)
Acknowledgement
First of all, we want to acknowledge the original creators of the Snowflake/snowflake-arctic-embed-m-v1.5 and BAAI/bge-small-en-v1.5 models, which are used to create this model. Our model is just a combination of these two models, and we have not made any changes to the original models.
Furthermore, we want to acknowledge the Marqo team, who worked on the idea of combining two models through concatenation in parallel to us. Their initial effort allowed us to re-use existing pieces of code, in particular the modeling script for bringing the combined model to HuggingFace.
Combination of Embedding Models
Overview
Embedding models have become increasingly powerful and applicable across various use cases. However, the next significant challenge lies in enhancing their efficiency in terms of resource consumption. Our goal is to experiment with combining two embedding models to achieve better performance with fewer resources.
Key Insights
- Diversity Matters: Initial findings suggest that combining models with differing characteristics can complement each other, resulting in improved outcomes. To design an effective combination, the diversity of the models—evaluated by factors like MTEB performance, architecture, and training data—is crucial.
- Combination Technique:
  - We combine the embeddings of the two models using the most straightforward approach: concatenation (see the sketch after this list).
  - Prior to concatenation, we normalize the embeddings to ensure they are on the same scale. This step is vital for achieving coherent and meaningful results.
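A minimal sketch of the technique, using random tensors as stand-ins for the two models' outputs (the shapes match Arctic M and BGE Small; no actual model calls are made here):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings for a batch of 2 texts (not real model outputs):
# Arctic M (v1.5) emits 768-dim vectors, BGE Small (en; v1.5) emits 384-dim vectors.
emb_arctic = torch.randn(2, 768)
emb_bge = torch.randn(2, 384)

# L2-normalize each embedding so both are on the same scale,
# then concatenate along the feature dimension: 768 + 384 = 1152.
combined = torch.cat(
    [F.normalize(emb_arctic, p=2, dim=-1), F.normalize(emb_bge, p=2, dim=-1)],
    dim=-1,
)
print(combined.shape)  # torch.Size([2, 1152])
```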
Implementation
We combined the following models:
- Snowflake/snowflake-arctic-embed-m-v1.5
- BAAI/bge-small-en-v1.5
Model Details
- Output Embedding Dimensions: 1152 (768 + 384)
- Total Parameters: 142M (109M + 33M)
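As a quick sanity check, the stated dimensions can be read off the two base models directly. This sketch assumes both models load with the sentence-transformers library (which both model cards support):

```python
from sentence_transformers import SentenceTransformer

# Load the two base models and report their native embedding sizes.
arctic = SentenceTransformer('Snowflake/snowflake-arctic-embed-m-v1.5')
bge = SentenceTransformer('BAAI/bge-small-en-v1.5')
print(arctic.get_sentence_embedding_dimension())  # 768
print(bge.get_sentence_embedding_dimension())     # 384
```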
Results
This combination demonstrated notable performance on the MTEB Leaderboard, offering a promising foundation for further experimentation:
- Performance Improvement: The average nDCG@10 on the MTEB English Retrieval benchmark increased from 55.14 to 56.5, climbing several spots on the leaderboard—a feat often requiring extensive engineering efforts.
- Comparison with Chimera Model: Interestingly, the Chimera model, which employs more potent models individually, performs worse on the leaderboard. This raises questions about:
  - The role of parameter count.
  - Differences in training processes.
  - How effectively two models complement each other for specific benchmark tasks.
Future Directions
While the results are promising, we acknowledge the complexity of model combinations and the importance of looking beyond leaderboard rankings. The fact that simply concatenating embeddings yields tangible gains underscores the potential for further exploration in this area.
We look forward to conducting additional experiments and engaging in discussions to deepen our understanding of effective model combinations.
Usage
```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer, PreTrainedTokenizerFast, BatchEncoding, DataCollatorWithPadding
from functools import partial
from datasets import Dataset
from tqdm import tqdm
from typing import Dict, List, Mapping

NUM_WORKERS = 4
BATCH_SIZE = 32


def transform_func(tokenizer: PreTrainedTokenizerFast,
                   max_length: int,
                   examples: Dict[str, List]) -> BatchEncoding:
    return tokenizer(examples['contents'],
                     max_length=max_length,
                     padding=True,
                     return_token_type_ids=False,
                     truncation=True)


def move_to_cuda(sample):
    if len(sample) == 0:
        return {}

    def _move_to_cuda(maybe_tensor):
        if torch.is_tensor(maybe_tensor):
            return maybe_tensor.cuda(non_blocking=True)
        elif isinstance(maybe_tensor, dict):
            return {key: _move_to_cuda(value) for key, value in maybe_tensor.items()}
        elif isinstance(maybe_tensor, list):
            return [_move_to_cuda(x) for x in maybe_tensor]
        elif isinstance(maybe_tensor, tuple):
            return tuple([_move_to_cuda(x) for x in maybe_tensor])
        elif isinstance(maybe_tensor, Mapping):
            return type(maybe_tensor)({k: _move_to_cuda(v) for k, v in maybe_tensor.items()})
        else:
            return maybe_tensor

    return _move_to_cuda(sample)


class RetrievalModel():
    def __init__(self, pretrained_model_name: str, **kwargs):
        self.pretrained_model_name = pretrained_model_name
        # trust_remote_code is required: the combined model ships its own modeling script.
        self.encoder = AutoModel.from_pretrained(pretrained_model_name, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
        self.gpu_count = torch.cuda.device_count()
        self.batch_size = BATCH_SIZE

        # Queries get the BGE-style retrieval prefix; documents are encoded as-is.
        self.query_instruction = 'Represent this sentence for searching relevant passages: {}'
        self.document_instruction = '{}'
        self.pool_type = 'cls'
        self.max_length = 512

        self.encoder.cuda()
        self.encoder.eval()

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        input_texts = [self.query_instruction.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        # Prepend the title (when present) to the passage text.
        input_texts = [self.document_instruction.format('{} {}'.format(d.get('title', ''), d['text']).strip()) for d in corpus]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        dataset: Dataset = Dataset.from_dict({'contents': input_texts})
        dataset.set_transform(partial(transform_func, self.tokenizer, self.max_length))

        data_collator = DataCollatorWithPadding(self.tokenizer, pad_to_multiple_of=8)
        data_loader = DataLoader(
            dataset,
            batch_size=self.batch_size * self.gpu_count,
            shuffle=False,
            drop_last=False,
            num_workers=NUM_WORKERS,
            collate_fn=data_collator,
            pin_memory=True)

        encoded_embeds = []
        for batch_dict in tqdm(data_loader, desc='encoding', mininterval=10):
            batch_dict = move_to_cuda(batch_dict)

            with torch.amp.autocast('cuda'):
                outputs = self.encoder(**batch_dict)
                encoded_embeds.append(outputs.cpu().numpy())

        return np.concatenate(encoded_embeds, axis=0)


model = RetrievalModel('PaDaS-Lab/arctic-m-bge-small')
embeds_q = model.encode_queries(['What is the capital of France?'])
# [[-0.01099197 -0.08366653  0.0060241  ...  0.03182805 -0.00674182  0.058571  ]]
embeds_d = model.encode_corpus([{'title': 'Paris', 'text': 'Paris is the capital of France.'}])
# [[ 0.0391828  -0.02951912  0.10862264 ... -0.05373885 -0.00368348  0.02323797]]
```
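The returned arrays can be scored against each other directly. Since each sub-embedding is L2-normalized before concatenation, the dot product of a query and a document vector is the sum of the two models' cosine similarities, so it can serve as a retrieval score:

```python
# Score documents against queries; higher means more relevant.
# embeds_q: (num_queries, 1152), embeds_d: (num_docs, 1152)
scores = embeds_q @ embeds_d.T
print(scores)
```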
Libraries

```
torch==2.5.0
transformers==4.42.3
mteb==1.12.94
```
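With these versions, the retrieval scores reported below can in principle be reproduced with the mteb package. A minimal sketch, reusing the RetrievalModel class from the Usage section (the task selection is illustrative; ArguAna is one of the reported tasks):

```python
import mteb

# RetrievalModel exposes encode_queries/encode_corpus,
# which mteb's retrieval evaluator can consume directly.
model = RetrievalModel('PaDaS-Lab/arctic-m-bge-small')
tasks = mteb.get_tasks(tasks=['ArguAna'])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder='results')
```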
Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.2407.08275,
  doi       = {10.48550/ARXIV.2407.08275},
  url       = {https://arxiv.org/abs/2407.08275},
  author    = {Caspari, Laura and Dastidar, Kanishka Ghosh and Zerhoudi, Saber and Mitrovic, Jelena and Granitzer, Michael},
  title     = {Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems},
  year      = {2024},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
License
Note that Arctic M (v1.5) is licensed under Apache-2.0 and BGE Small (en; v1.5) is licensed under the MIT license. Please refer to the licenses of the original models for more details.
Evaluation results
| Metric | Dataset | Split | Value (self-reported) |
|--------|---------|-------|-----------------------|
| main_score | MTEB ArguAna (default) | test | 62.440 |
| map_at_1 | MTEB ArguAna (default) | test | 37.909 |
| map_at_10 | MTEB ArguAna (default) | test | 54.071 |
| map_at_100 | MTEB ArguAna (default) | test | 54.707 |
| map_at_1000 | MTEB ArguAna (default) | test | 54.710 |
| map_at_20 | MTEB ArguAna (default) | test | 54.610 |
| map_at_3 | MTEB ArguAna (default) | test | 49.787 |
| map_at_5 | MTEB ArguAna (default) | test | 52.472 |
| mrr_at_1 | MTEB ArguAna (default) | test | 38.549 |
| mrr_at_10 | MTEB ArguAna (default) | test | 54.308 |