---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---

# flan-ul2-text-encoder

The encoder model extracted from [flan-ul2](https://huggingface.co/google/flan-ul2).

⚠️ This model is 17.44 GB in `bfloat16` precision ⚠️
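
If ~18 GB of memory is not available, it may also be possible to load the encoder in 8-bit via `bitsandbytes`, which should roughly halve the footprint. This is an untested sketch, not part of the original workflow; it assumes `bitsandbytes` is installed and uses the standard `transformers` `load_in_8bit` flag:

```python
from transformers import AutoModelForTextEncoding, AutoTokenizer

# assumption: bitsandbytes is installed (pip install bitsandbytes)
model = AutoModelForTextEncoding.from_pretrained(
    "pszemraj/flan-ul2-text-encoder",
    load_in_8bit=True,  # quantize the weights to 8-bit at load time
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
```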

## Basic usage

> Note: this is 'one way' to use the encoder, not 'the only way'. Suggestions and ideas are welcome.

Below is an example, with a set of helper functions, for computing the cosine similarity between the embeddings of different texts with this model.

## Functions

### load_model_and_tokenizer

<details>
<summary><b>Details</b></summary>

Loads the model and tokenizer for the given `model_name` and returns them as a tuple.

```python
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoModelForTextEncoding, AutoTokenizer


def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModelForTextEncoding, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModelForTextEncoding.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # deactivate dropout
    return model, tokenizer
```

</details>

### get_embeddings

Computes the embeddings for the given texts with the model and tokenizer via weighted mean pooling across `seq_len` (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be)): each token's hidden state is weighted by its position and by the attention mask before averaging, so later tokens contribute more.

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings via weighted mean pooling across seq_len.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.

    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts and move them to the model's device
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)

    # Get the last hidden state of the encoder
    with torch.no_grad():
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Position weights 1, 2, ..., seq_len, broadcast to (bs, seq_len, hidden_dim)
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Expand the attention mask to (bs, seq_len, hidden_dim) so padding tokens are ignored
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
    embeddings = sum_embeddings / sum_mask

    return embeddings
```

</details>

### calculate_cosine_similarity

Helper function to compute and print the cosine similarity between the first text's embedding and each of the others.

<details>
<summary><b>Details</b></summary>

```python
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """
    Calculate and print the cosine similarity between the first text and all other texts.

    Args:
        embeddings (torch.Tensor): The embeddings for the texts.
        texts (List[str]): The texts for which cosine similarity is to be calculated.
    """
    # scipy expects NumPy-compatible inputs, so cast the bfloat16 embeddings to float32 on CPU
    embeddings = embeddings.float().cpu()

    # Calculate cosine similarities against the first text
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
        print('Cosine similarity between "%s" and "%s" is: %.3f' % (texts[0], texts[i], cosine_sim))
```

</details>

## Usage

Install packages:

```bash
pip install transformers accelerate sentencepiece scipy
```

Then, you can use the functions to compute embeddings and similarity scores:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.
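
As an aside, the same comparison can also be done without `scipy` by staying in `torch`. A minimal sketch (one option, not the only one), reusing the `embeddings` tensor from above:

```python
import torch.nn.functional as F

# L2-normalize the embeddings; a matrix product then gives all pairwise cosine similarities
normed = F.normalize(embeddings.float(), p=2, dim=1)
similarity_matrix = normed @ normed.T  # shape: (len(texts), len(texts))
print(similarity_matrix[0, 1:])  # similarity of the first text to each of the others
```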

## References

This guide is based on the examples provided in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).

```bibtex
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```