pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- static-embeddings
language: en
license: apache-2.0
PubMedBERT Embeddings 100K
This is a pruned version of PubMedBERT Embeddings 2M. The vocabulary is pruned to keep only the top 5% most frequently used tokens.
See Extremely Small BERT Models from Mixed-Vocabulary Training for background on pruning vocabularies to build smaller models.
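As a rough illustration of what "100K" and "top 5%" mean in practice, the sketch below works out how many token vectors fit in a 100,000-parameter budget at 64 dimensions (the values set in the training script further down). It assumes the source model is also 64-dimensional, and the source vocabulary size of 30,522 is an assumption based on the standard BERT WordPiece vocabulary; the actual model may differ.

```python
# Parameter budget and embedding width, as set in the training script
params, dims = 100_000, 64

# Number of token vectors that fit in the parameter budget
token_count = params // dims

# Assumption: a BERT-style source vocabulary of 30,522 tokens
source_vocab_size = 30_522

# Share of the source vocabulary that survives pruning
kept_fraction = token_count / source_vocab_size
print(f"{token_count} tokens kept ({kept_fraction:.1%} of the vocabulary)")
```

Special tokens are kept in addition to the most frequent tokens, so the final vocabulary can be slightly larger than this count.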
Usage (txtai)
This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
import txtai
# Create embeddings
embeddings = txtai.Embeddings(
    path="neuml/pubmedbert-base-embeddings-100K",
    content=True,
)
embeddings.index(documents())
# Run a query
embeddings.search("query to run")
Usage (Sentence-Transformers)
Alternatively, the model can be loaded with sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
# Initialize a StaticEmbedding module
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-100K")
model = SentenceTransformer(modules=[static])
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
Usage (Model2Vec)
The model can also be used directly with Model2Vec.
from model2vec import StaticModel
# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-100K")
# Compute text embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
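Once embeddings are computed, a common next step is to compare them with cosine similarity. The following is a minimal, dependency-free sketch of that comparison. The three short vectors are hypothetical stand-ins for the output of `model.encode`, and the `cosine_similarity` helper is illustrative rather than part of any of the libraries above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors standing in for model.encode() output
first = [0.2, 0.1, 0.7]
second = [0.4, 0.2, 1.4]    # points in the same direction as `first`
third = [0.7, -0.1, -0.2]   # points in a different direction

print(cosine_similarity(first, second))
print(cosine_similarity(first, third))
```

Vectors pointing in the same direction score close to 1, while unrelated vectors score close to 0 or below.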
Evaluation Results
The table below compares the performance of this model against the models previously compared with PubMedBERT Embeddings. The following datasets were used to evaluate model performance.
- PubMed QA
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- PubMed Subset
  - Split: test, Pair: (title, text)
  - Note: The previously used PubMed Subset dataset is no longer available but a similar dataset is used here
- PubMed Summary
  - Subset: pubmed, Split: validation, Pair: (article, abstract)
The Pearson correlation coefficient is used as the evaluation metric.
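For reference, the Pearson correlation coefficient can be computed as in the sketch below. This is not the evaluation code used for the results; the `predicted` and `reference` values are hypothetical, and reporting the coefficient multiplied by 100 is an assumption about how the table values are scaled.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical similarity scores from a model and matching reference labels
predicted = [0.91, 0.15, 0.78, 0.32]
reference = [1.0, 0.0, 1.0, 0.0]

score = pearson(predicted, reference)
print(round(score * 100, 2))
```

A value near 1 means the model's similarity scores rise and fall with the reference labels; a value near 0 means there is no linear relationship.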
Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
---|---|---|---|---|
pubmedbert-base-embeddings-8M-M2V (No training) | 69.84 | 70.77 | 71.30 | 70.64 |
pubmedbert-base-embeddings-100K | 74.56 | 84.65 | 81.84 | 80.35 |
pubmedbert-base-embeddings-500K | 86.03 | 91.71 | 91.25 | 89.66 |
pubmedbert-base-embeddings-1M | 87.87 | 92.80 | 92.87 | 91.18 |
pubmedbert-base-embeddings-2M | 88.62 | 93.08 | 93.24 | 91.65 |
There is a steep drop in accuracy compared to the original unpruned model (80.35 vs. 91.65 on average), although this model still scores higher than the naive distilled version without training (70.64).
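To make the comparison concrete, the snippet below recomputes the Average column for three rows of the table and the gaps between them. It only uses values from the table; the variable names are illustrative.

```python
# Per-dataset scores copied from the table (PubMed QA, PubMed Subset, PubMed Summary)
scores = {
    "pubmedbert-base-embeddings-8M-M2V (No training)": (69.84, 70.77, 71.30),
    "pubmedbert-base-embeddings-100K": (74.56, 84.65, 81.84),
    "pubmedbert-base-embeddings-2M": (88.62, 93.08, 93.24),
}

# Average across the three datasets, rounded as in the table
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

# Points lost relative to the unpruned 2M model
drop_vs_unpruned = (
    averages["pubmedbert-base-embeddings-2M"]
    - averages["pubmedbert-base-embeddings-100K"]
)

# Points gained relative to the distilled model without training
gain_vs_untrained = (
    averages["pubmedbert-base-embeddings-100K"]
    - averages["pubmedbert-base-embeddings-8M-M2V (No training)"]
)

print(averages)
print(round(drop_vs_unpruned, 2), round(gain_vs_untrained, 2))
```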
Runtime performance
As another test, let's see how long each model takes to index 120K article abstracts using the following code. All indexing is done with an RTX 3090 GPU.
from datasets import load_dataset
from tqdm import tqdm
from txtai import Embeddings
ds = load_dataset("ccdv/pubmed-summarization", split="train")
embeddings = Embeddings(path="path to model", content=True, backend="numpy")
embeddings.index(tqdm(ds["abstract"]))
Model | Model Size (MB) | Index time (s) |
---|---|---|
pubmedbert-base-embeddings-100K | 0.2 | 19 |
pubmedbert-base-embeddings-500K | 1.0 | 17 |
pubmedbert-base-embeddings-1M | 2.0 | 17 |
pubmedbert-base-embeddings-2M | 7.5 | 17 |
Vocabulary pruning leads to a slightly higher runtime, which is attributed to more tokens being needed to represent the same text. However, the model is much smaller, and vectors are stored at int16 precision. This can benefit smaller or lower-powered embedded devices and could lead to faster vectorization times.
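The effect of a smaller vocabulary on token counts can be illustrated with a toy greedy longest-match splitter. This sketch assumes a WordPiece-style tokenizer; the `wordpiece` helper and both vocabularies are hypothetical and are not taken from the model.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabularies: the pruned one lacks the rare full word
full_vocab = {"cardiomyopathy", "cardio", "##my", "##opathy"}
pruned_vocab = {"cardio", "##my", "##opathy"}

full_tokens = wordpiece("cardiomyopathy", full_vocab)
pruned_tokens = wordpiece("cardiomyopathy", pruned_vocab)
print(len(full_tokens), full_tokens)
print(len(pruned_tokens), pruned_tokens)
```

With the rare word removed from the vocabulary, the same text is split into more pieces, so more vectors have to be looked up and pooled per input.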
Training
This model was vocabulary pruned using the following script.
import json
import os
from collections import Counter
from pathlib import Path
import numpy as np
from model2vec import StaticModel
from more_itertools import batched
from sklearn.decomposition import PCA
from tokenlearn.train import collect_means_and_texts
from tokenizers import Tokenizer
from tqdm import tqdm
from txtai.scoring import ScoringFactory
def tokenize(tokenizer):
    # Tokenize into dataset
    dataset = []
    for t in tqdm(batched(texts, 1024)):
        encodings = tokenizer.encode_batch_fast(t, add_special_tokens=False)
        for e in encodings:
            dataset.append((None, e.ids, None))

    return dataset

def tokenweights(tokenizer):
    dataset = tokenize(tokenizer)

    # Build scoring index
    scoring = ScoringFactory.create({"method": "bm25", "terms": True})
    scoring.index(dataset)

    # Calculate mean value of weights array per token
    tokens = np.zeros(tokenizer.get_vocab_size())
    for x in scoring.idf:
        tokens[x] = np.mean(scoring.terms.weights(x)[1])

    return tokens
# See PubMedBERT Embeddings 2M model for details on this data
features = "features"
paths = sorted(Path(features).glob("*.json"))
texts, _ = collect_means_and_texts(paths)
# Output model parameters
output = "output path"
params, dims = 100000, 64
path = "pubmedbert-base-embeddings-2M_unweighted"
model = StaticModel.from_pretrained(path)
os.makedirs(output, exist_ok=True)
with open(f"{path}/tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)
# Calculate number of tokens to keep
tokencount = params // model.dim
# Calculate term frequency
freqs = Counter()
for _, ids, _ in tokenize(model.tokenizer):
    freqs.update(ids)
# Select top N most common tokens
uids = set(x for x, _ in freqs.most_common(tokencount))
uids = [uid for token, uid in config["model"]["vocab"].items() if uid in uids or token.startswith("[")]
# Get embeddings for uids
model.embedding = model.embedding[uids]
# Select pruned tokens
pairs, index = [], 0
for token, uid in config["model"]["vocab"].items():
    if uid in uids:
        pairs.append((token, index))
        index += 1
config["model"]["vocab"] = dict(pairs)
# Write new tokenizer
with open(f"{output}/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
model.tokenizer = Tokenizer.from_file(f"{output}/tokenizer.json")
# Re-weight tokens
weights = tokenweights(model.tokenizer)
# Remove NaNs from embedding, if any
embedding = np.nan_to_num(model.embedding)
# Apply PCA
embedding = PCA(n_components=dims).fit_transform(embedding)
# Apply weights
embedding *= weights[:, None]
# Update model embedding and normalize
model.embedding, model.normalize = embedding.astype(np.int16), True
model.save_pretrained(output)
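The token selection and re-indexing steps in the script above can be followed on a toy example. The sketch below mirrors that logic with plain Python structures. The vocabulary, corpus, and embedding rows are hypothetical, and no tokenizer or model files are involved.

```python
from collections import Counter

# Hypothetical toy vocabulary (token -> id), tokenized corpus and embedding rows
vocab = {"[PAD]": 0, "[UNK]": 1, "heart": 2, "rare": 3, "cell": 4, "gene": 5}
corpus = [[2, 4, 4], [4, 5, 2], [3, 4]]
embedding = [[float(uid)] * 2 for uid in range(len(vocab))]

# Count how often each token id occurs
freqs = Counter()
for ids in corpus:
    freqs.update(ids)

# Select the N most common token ids
keep = 2
top = {uid for uid, _ in freqs.most_common(keep)}

# Keep frequent tokens plus bracketed special tokens, in vocabulary order
uids = [uid for token, uid in vocab.items() if uid in top or token.startswith("[")]

# Slice the embedding rows to match the kept ids
pruned_embedding = [embedding[uid] for uid in uids]

# Reassign contiguous ids so they line up with the sliced rows
kept = set(uids)
pruned_vocab = {}
for token, uid in vocab.items():
    if uid in kept:
        pruned_vocab[token] = len(pruned_vocab)

print(pruned_vocab)
```

Iterating the vocabulary in the same order for both the embedding slice and the re-indexing keeps row `i` of the pruned embedding aligned with the token whose new id is `i`.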
Acknowledgement
This model is built on the great work from the Minish Lab team consisting of Stephan Tulkens and Thomas van Dongen.
Read more at the following links.