---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: en
license: apache-2.0
---

# PubMedBERT Embeddings Matryoshka

This is a version of [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) with [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) applied. This enables dynamic embedding sizes of `64`, `128`, `256`, `384`, `512` and the full size of `768`. Note that while this method saves storage space, generating the embeddings uses the same computational resources regardless of the dimension size.

Sentence Transformers 2.4 added support for Matryoshka Embeddings. See [this blog post](https://huggingface.co/blog/matryoshka) for more details.
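
To make the space savings concrete, the following sketch estimates raw vector storage at each supported size. It assumes vectors are stored as 32-bit floats and uses a hypothetical corpus of one million vectors (`NUM_VECTORS`); index overhead is ignored.

```python
# Rough storage estimate for float32 vectors at each supported dimension.
# NUM_VECTORS is a hypothetical corpus size chosen only for illustration.
DIMENSIONS = [64, 128, 256, 384, 512, 768]
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 1_000_000


def index_size_mb(dimensions, count=NUM_VECTORS):
    """Return raw vector storage in megabytes (10^6 bytes), ignoring index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * count / 1_000_000


for dims in DIMENSIONS:
    print(f"{dims:>3} dims: {index_size_mb(dims):>7.1f} MB ({dims / 768:.1%} of full size)")
```

Under these assumptions, storage scales linearly with the number of dimensions kept, so `64` dimensions needs one twelfth of the space of the full `768`.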

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

# New embeddings with requested dimensionality
embeddings = txtai.Embeddings(
  path="neuml/pubmedbert-base-embeddings-matryoshka",
  content=True,
  dimensionality=256
)
# documents() is a placeholder for an iterable of text to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```

## Usage (Sentence-Transformers)

Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/pubmedbert-base-embeddings-matryoshka")
embeddings = model.encode(sentences)

# Requested dimensionality
dimensionality = 256

print(embeddings[:, :dimensionality])
```
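
Slicing a vector changes its length, so truncated vectors are generally no longer unit length even if the full vectors were. When the result is compared with a dot product, or stored in an index that expects normalized vectors, re-normalizing after truncation is usually needed. The sketch below shows the idea in plain Python on made-up 4-dimensional vectors; `truncate_and_normalize` and `cosine` are hypothetical helper names, not part of any library.

```python
import math


def truncate_and_normalize(vector, dimensions):
    """Keep the first `dimensions` values and rescale them to unit length."""
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    # Guard against an all-zero prefix
    return [x / norm for x in truncated] if norm > 0 else truncated


def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for model output (made-up values)
a = [3.0, 4.0, 1.0, 2.0]
b = [4.0, 3.0, 2.0, 1.0]

a2 = truncate_and_normalize(a, 2)
b2 = truncate_and_normalize(b, 2)
print(cosine(a2, b2))
```

With real model output, the same steps apply to each row of `embeddings[:, :dimensionality]`.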

## Usage (Hugging Face Transformers)

The model can also be used directly with Transformers. 

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0] # First element of model_output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-embeddings-matryoshka")
model = AutoModel.from_pretrained("neuml/pubmedbert-base-embeddings-matryoshka")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

# Requested dimensionality
dimensionality = 256

print("Sentence embeddings:")
print(embeddings[:, :dimensionality])
```

## Evaluation Results

Performance of this model compared to the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is shown below. A popular smaller model was also evaluated along with the most downloaded PubMed similarity model on the Hugging Face Hub.

The following datasets were used to evaluate model performance.

- [PubMed QA](https://huggingface.co/datasets/pubmed_qa)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/zxvix/pubmed_subset_new)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)

Evaluation results from the original model are shown below for reference. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.

| Model                                                                         | PubMed QA | PubMed Subset | PubMed Summary | Average   |
| ----------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- | 
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2)      | 90.40     | 95.86         | 94.07          | 93.44     | 
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5)                       | 91.02     | 95.60         | 94.49          | 93.70     | 
| [gte-base](https://hf.co/thenlper/gte-base)                                   | 92.97     | 96.83         | 96.24          | 95.35     | 
| [**pubmedbert-base-embeddings**](https://hf.co/neuml/pubmedbert-base-embeddings) | **93.27** | **97.07**     | **96.58**      | **95.64** |
| [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO)       | 90.86     | 93.33         | 93.54          | 92.58     | 
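
For readers unfamiliar with the metric, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The `predicted` and `expected` values are hypothetical, and scaling the result by 100 to match the tables is an assumption; the exact evaluation script is not shown in this card.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical values: model similarity scores vs. reference labels
predicted = [0.91, 0.15, 0.78, 0.05]
expected = [1.0, 0.0, 1.0, 0.0]

print(f"{pearson(predicted, expected) * 100:.2f}")
```

A value near 100 on this scale means the model's similarity scores rise and fall almost linearly with the reference labels.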

See the table below for evaluation results per dimension for `pubmedbert-base-embeddings-matryoshka`.

| Model               | PubMed QA | PubMed Subset | PubMed Summary | Average   |
| --------------------| --------- | ------------- | -------------- | --------- | 
| Dimensions =  64    | 92.16     | 95.85         | 95.67          | 94.56     | 
| Dimensions = 128    | 92.80     | 96.44         | 96.22          | 95.15     | 
| Dimensions = 256    | 93.11     | 96.68         | 96.53          | 95.44     | 
| Dimensions = 384    | 93.42     | 96.79         | 96.61          | 95.61     | 
| Dimensions = 512    | 93.37     | 96.87         | 96.61          | 95.62     | 
| **Dimensions = 768**    | **93.53**     | **96.95**         | **96.70**          | **95.73**     | 

This model performs slightly better overall compared to the original model.

The bigger takeaway is how competitive it is at lower dimensions. For example, `Dimensions = 256` scores higher on average than every model in the first table except the original `pubmedbert-base-embeddings`. Even `Dimensions = 64` scores higher on average than `all-MiniLM-L6-v2`, `bge-base-en-v1.5` and `S-PubMedBert-MS-MARCO`.
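
The trade-off can be quantified directly from the per-dimension table. This short sketch uses the Average column as reported above and computes how much of the full-size score each truncated size retains; `retention` is a hypothetical helper name.

```python
# Average scores copied from the per-dimension table above
AVERAGES = {64: 94.56, 128: 95.15, 256: 95.44, 384: 95.61, 512: 95.62, 768: 95.73}
FULL = 768


def retention(dimensions):
    """Fraction of the full-size average score kept at a given dimension."""
    return AVERAGES[dimensions] / AVERAGES[FULL]


for dims in AVERAGES:
    print(f"{dims:>3} dims: {dims / FULL:.1%} of the size, {retention(dims):.2%} of the score")
```

On these numbers, every truncated size keeps more than 98% of the full-size average score.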

## Training

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 20191 with parameters:
```
{'batch_size': 24, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MatryoshkaLoss.MatryoshkaLoss` with parameters:
  ```
  {'loss': 'MultipleNegativesRankingLoss', 'matryoshka_dims': [768, 512, 384, 256, 128, 64], 'matryoshka_weights': [1, 1, 1, 1, 1, 1]}
  ```
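
Conceptually, a Matryoshka loss applies a base loss to truncated copies of the same embeddings and adds the results together using the configured weights. The following is a simplified plain-Python sketch of that idea, not the sentence-transformers implementation. It assumes truncated vectors are re-normalized, uses an in-batch negatives cross-entropy as the base loss, and takes a similarity scale of 20 as an assumed value; all function names and toy vectors are hypothetical.

```python
import math


def normalize(vector):
    """Rescale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def ranking_loss(anchors, positives, scale=20.0):
    """In-batch negatives loss: positives[i] is the target for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled cosine similarity of this anchor against every candidate
        scores = [scale * sum(a * p for a, p in zip(anchor, positive)) for positive in positives]
        # Cross-entropy with the matching positive as the correct class
        total += math.log(sum(math.exp(s) for s in scores)) - scores[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated, re-normalized embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        a = [normalize(v[:dim]) for v in anchors]
        p = [normalize(v[:dim]) for v in positives]
        total += weight * ranking_loss(a, p)
    return total


# Toy 4-dimensional batch (made-up values); this model was trained with dims up to 768
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

print(matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1]))
```

Because every prefix contributes to the loss, the leading dimensions are pushed to carry useful information on their own, which is what makes truncation at inference time workable.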

**Parameters of the `fit()` method**:
```
{
    "epochs": 1,
    "evaluation_steps": 500,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
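
To illustrate what these settings imply for the learning rate, the sketch below assumes `WarmupLinear` behaves like a standard linear warmup followed by linear decay to zero, and that the total number of steps equals the DataLoader length times the number of epochs (20191 x 1). Both are assumptions, and `warmup_linear_lr` is a hypothetical helper.

```python
BASE_LR = 2e-05
WARMUP_STEPS = 10000
TOTAL_STEPS = 20191  # Assumed: DataLoader length (20191 batches) x 1 epoch


def warmup_linear_lr(step):
    """Linear ramp from 0 to BASE_LR, then linear decay back to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5000, 10000, 15000, 20191):
    print(f"step {step:>5}: lr = {warmup_linear_lr(step):.2e}")
```

Under these assumptions, roughly the first half of training is spent warming up, with the peak rate of `2e-05` reached at step 10000.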

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```