pszemraj's picture
Update README.md
0d46110
|
raw
history blame
5.03 kB
---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---
# flan-ul2-text-encoder
The encoder model extracted from [flan-ul2](https://huggingface.co/google/flan-ul2).
⚠️ This model is 17.44 GB in `bfloat16` precision ⚠️
## basic usage
> note: this is 'one way' to use the encoder, not 'the only way'. suggestions and ideas welcome.
Below is an example and a set of functions to compute the cosine similarity between the embeddings of different texts with this model
## Functions
### load_model_and_tokenizer
<details>
<summary><b>Details</b></summary>
loads the model and tokenizer based on `model_name`. It returns a tuple containing the loaded model and tokenizer.
```python
from typing import List, Tuple
import torch
from transformers import AutoModel, AutoTokenizer
from transformers import AutoModelForTextEncoding
def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
"""
Load the model and tokenizer based on the given model name.
Args:
model_name (str): The name of the model to be loaded.
Returns:
Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
"""
model = AutoModelForTextEncoding.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval() # Deactivate Dropout
return model, tokenizer
```
</details>
### get_embeddings
This computes the embeddings for the given texts given the model and tokenizer via weighted mean pooling across seq_len (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be))
<details>
<summary><b>Details</b></summary>
```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
"""
Get the embeddings via weighted mean pooling across seq_len
Args:
model (AutoModel): The model to be used for getting embeddings.
tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
texts (List[str]): The texts for which embeddings are to be calculated.
Returns:
torch.Tensor: The calculated embeddings.
"""
# Tokenize input texts
batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Get the embeddings
with torch.no_grad():
last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state
# Get weights
weights = (
torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
.unsqueeze(0)
.unsqueeze(-1)
.expand(last_hidden_state.size())
.float().to(last_hidden_state.device)
)
# Get attn mask
input_mask_expanded = (
batch_tokens["attention_mask"]
.unsqueeze(-1)
.expand(last_hidden_state.size())
.float()
)
# Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
embeddings = sum_embeddings / sum_mask
return embeddings
```
</details>
### calculate_cosine_similarity
Helper fn to compute and print out cosine similarity
<details>
<summary><b>click to expand</b></summary>
```python
from scipy.spatial.distance import cosine
def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
"""
Calculate and print the cosine similarity between the first text and all other texts.
Args:
embeddings (torch.Tensor): The embeddings for the texts.
texts (List[str]): The texts for which cosine similarity is to be calculated.
"""
# Calculate cosine similarities
for i in range(1, len(embeddings)):
cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[i], cosine_sim))
```
</details>
## Usage
Install packages:
```bash
pip install transformers accelerate sentencepiece scipy
```
Then, you can use the functions to compute embeddings and similarity scores:
```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)
texts = [
"deep learning",
"artificial intelligence",
"deep diving",
"artificial snow",
]
embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```
This will print the cosine similarity between the first text and all other texts in the `texts` list.
## References
This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).
```
@article{muennighoff2022sgpt,
title={SGPT: GPT Sentence Embeddings for Semantic Search},
author={Muennighoff, Niklas},
journal={arXiv preprint arXiv:2202.08904},
year={2022}
}
```