---
datasets:
- sentence-transformers/embedding-training-data
- flax-sentence-embeddings/stackexchange_xml
- snli
- eli5
- search_qa
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- ms_marco
- gooaq
- yahoo_answers_topics
language:
- en
inference: false
pipeline_tag: sentence-similarity
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
tags:
- information retrieval
- ir
- documents retrieval
- passage retrieval
- beir
- benchmark
- sts
- semantic search
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# bert-base-1024-biencoder-64M-pairs

A long-context biencoder based on [MosaicML's BERT pretrained with a sequence length of 1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024). The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

## Usage

### Download the model and related scripts

```bash
git clone https://huggingface.co/shreyansh26/bert-base-1024-biencoder-64M-pairs
```

### Inference

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from mosaic_bert import BertModel  # expected to come from the cloned model repository

# pip install triton==2.0.0.dev20221202 --no-deps if using PyTorch 2.0

class AutoModelForSentenceEmbedding(nn.Module):
    def __init__(self, model, tokenizer, normalize=True):
        super().__init__()

        self.model = model.to("cuda")
        self.normalize = normalize
        self.tokenizer = tokenizer

    def forward(self, **kwargs):
        model_output = self.model(**kwargs)
        embeddings = self.mean_pooling(model_output, kwargs['attention_mask'])
        if self.normalize:
            # L2-normalize so that a dot product equals cosine similarity
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        # Expand the mask to the embedding size so padding tokens contribute nothing to the mean
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load the tokenizer first: the wrapper below takes it as an argument
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained("<path-to-model>", trust_remote_code=True).to("cuda")
model = AutoModelForSentenceEmbedding(model, tokenizer)

sentences = ["This is an example sentence", "Each sentence is converted"]

encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors='pt').to("cuda")
# Gradients are not needed for inference
with torch.no_grad():
    embeddings = model(**encoded_input)

print(embeddings)
print(embeddings.shape)
```
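
For semantic search, the embeddings produced above can be compared with cosine similarity and the documents ranked by score. The following is a minimal sketch of that idea, not code from the repository: `cosine_similarity` and `rank_documents` are hypothetical helper names, and short plain-Python lists stand in for the 768-dimensional vectors so that the example runs without a GPU or the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_documents(query_embedding, doc_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scores = [
        (i, cosine_similarity(query_embedding, doc))
        for i, doc in enumerate(doc_embeddings)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy 3-dimensional vectors standing in for real model embeddings
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_documents(query, docs, top_k=2)
print(ranking)  # document 1 ranks first, then document 2
```

With the real model, the tensors can be converted with `embeddings.cpu().tolist()` before being passed to these helpers. When `normalize=True`, the embeddings already have unit length, so a plain dot product gives the same ranking.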

## Other details

### Training

This model was trained on 64M randomly sampled sentence/paragraph pairs drawn from the same training set used by the Sentence Transformers models. Details of the training set are available [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#training-data).

The training script (including hyperparameters), the inference script, and the data loading script can all be found in [this GitHub repository](https://github.com/shreyansh26/Long-Context-Biencoder).

### Evaluations

We ran the model on a few retrieval-based benchmarks (CQADupstackEnglishRetrieval, DBPedia, MSMARCO, QuoraRetrieval). The results are available [here](https://github.com/shreyansh26/Long-Context-Biencoder/tree/master/models/results/64M_results).
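
To give a feel for what a retrieval benchmark measures, here is a small illustrative sketch of recall@k, one common retrieval metric. It is an assumption for illustration only: the linked results may report different metrics, and `recall_at_k` and the document IDs below are hypothetical rather than taken from the repository.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant documents: define recall as 0
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)

# Hypothetical ranking returned for one query, best match first
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall_2 = recall_at_k(ranked, relevant, k=2)
recall_4 = recall_at_k(ranked, relevant, k=4)
print(recall_2)  # 0.5
print(recall_4)  # 1.0
```

A benchmark score is typically the average of such a per-query value over all queries in the dataset.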