README.md · shreyansh26/bert-base-1024-biencoder-64M-pairs at main

	---
	datasets:
	- sentence-transformers/embedding-training-data
	- flax-sentence-embeddings/stackexchange_xml
	- snli
	- eli5
	- search_qa
	- multi_nli
	- wikihow
	- natural_questions
	- trivia_qa
	- ms_marco
	- gooaq
	- yahoo_answers_topics
	language:
	- en
	inference: false
	pipeline_tag: sentence-similarity
	task_categories:
	- sentence-similarity
	- feature-extraction
	- text-retrieval
	tags:
	- information retrieval
	- ir
	- documents retrieval
	- passage retrieval
	- beir
	- benchmark
	- sts
	- semantic search
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- transformers
	---

	# bert-base-1024-biencoder-64M-pairs

	A long-context bi-encoder based on [MosaicML's BERT pretrained with a sequence length of 1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024). The model maps sentences and paragraphs to a 768-dimensional dense vector space
	and can be used for tasks such as clustering or semantic search.

	## Usage

	### Download the model and related scripts
	```bash
	git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs
	```

	### Inference
	```python
	import torch
	from torch import nn
	from transformers import AutoModel, AutoTokenizer
	from mosaic_bert import BertModel

	# pip install triton==2.0.0.dev20221202 --no-deps if using PyTorch 2.0

	class AutoModelForSentenceEmbedding(nn.Module):
	    def __init__(self, model, tokenizer, normalize=True):
	        super().__init__()

	        self.model = model.to("cuda")
	        self.normalize = normalize
	        self.tokenizer = tokenizer

	    def forward(self, **kwargs):
	        model_output = self.model(**kwargs)
	        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
	        if self.normalize:
	            # L2-normalize so that dot products equal cosine similarities
	            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

	        return embeddings

	    def mean_pooling(self, model_output, attention_mask):
	        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
	        # Expand the mask to the embedding shape so padding tokens contribute nothing
	        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
	        # Average over real tokens only; the clamp guards against division by zero
	        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

	# The tokenizer must be created before it is passed to the wrapper
	tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
	model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True)
	model = AutoModelForSentenceEmbedding(model, tokenizer)

	sentences = ["This is an example sentence", "Each sentence is converted"]

	encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
	# Gradients are not needed for inference
	with torch.no_grad():
	    embeddings = model(**encoded_input)

	print(embeddings)
	print(embeddings.shape)
	```
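
Because the wrapper L2-normalizes embeddings by default, cosine similarity between two embeddings reduces to a dot product, which is the usual basis for semantic search. The following is a minimal sketch of ranking documents against a query. The helper `rank_by_cosine` and the random placeholder tensors are assumptions made for illustration; they are not part of this repository, and real use would substitute embeddings produced by the model.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb, doc_embs, top_k=3):
    """Return (scores, indices) of the top_k documents for one query.

    query_emb: tensor of shape (dim,)
    doc_embs:  tensor of shape (num_docs, dim)
    (Hypothetical helper, not part of the model repository.)
    """
    # Normalizing again is harmless if the inputs are already unit length.
    query_emb = F.normalize(query_emb, p=2, dim=0)
    doc_embs = F.normalize(doc_embs, p=2, dim=1)
    scores = doc_embs @ query_emb  # one cosine similarity per document
    top_k = min(top_k, doc_embs.size(0))
    return torch.topk(scores, k=top_k)

# Placeholder embeddings (assumption): random vectors stand in for model output.
torch.manual_seed(0)
doc_embs = torch.randn(5, 768)
query_emb = doc_embs[2] + 0.01 * torch.randn(768)  # a query very close to document 2

scores, indices = rank_by_cosine(query_emb, doc_embs)
print(indices.tolist())
print(scores.tolist())
```

For large collections, an approximate nearest-neighbor index is typically used in place of the exhaustive matrix product shown here.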

	## Other details

	### Training

	This model was trained on 64M randomly sampled sentence/paragraph pairs drawn from the same training set used by the Sentence Transformers models. The
	training set is described [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#training-data).

	The training (including hyperparameters), inference, and data loading scripts are all available in [this GitHub repository](https://github.com/shreyansh26/Long-Context-Biencoder).
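
This README does not state the training objective, so the sketch below is only an illustration of one common way to train a bi-encoder on text pairs: a cross-entropy loss over in-batch negatives, where each anchor's paired text is the positive and the other texts in the batch serve as negatives. The function name `in_batch_negatives_loss`, the `scale` value of 20, and the random placeholder embeddings are all assumptions; the linked repository remains the authoritative source for the actual training code.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over cosine similarities of all anchor/positive combinations.

    anchor_emb, positive_emb: tensors of shape (batch, dim). Row i of each
    forms a true pair; every other row in the batch acts as a negative.
    (Hypothetical helper; `scale` is an assumed temperature-like factor.)
    """
    anchor_emb = F.normalize(anchor_emb, p=2, dim=1)
    positive_emb = F.normalize(positive_emb, p=2, dim=1)
    logits = anchor_emb @ positive_emb.T * scale  # shape (batch, batch)
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Placeholder embeddings (assumption): random vectors stand in for encoder output.
torch.manual_seed(0)
anchors = torch.randn(4, 768)
loss_matched = in_batch_negatives_loss(anchors, anchors.clone())
loss_random = in_batch_negatives_loss(anchors, torch.randn(4, 768))
print(loss_matched.item(), loss_random.item())
```

The loss is low when each anchor is most similar to its own pair and higher when the pairing carries no signal, which is the behavior such an objective is meant to reward.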

	### Evaluations

	We evaluated the model on a few retrieval benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval); the results are available [here](https://github.com/shreyansh26/Long-Context-Biencoder/tree/master/models/results/64M_results).