metadata

language:
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - bo
  - bs
  - ca
  - ceb
  - co
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - haw
  - he
  - hi
  - hmn
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lb
  - lo
  - lt
  - lv
  - mg
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ny
  - or
  - pa
  - pl
  - pt
  - ro
  - ru
  - rw
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - tk
  - tl
  - tr
  - tt
  - ug
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zh
  - zu
tags:
  - bert
  - sentence_embedding
  - multilingual
  - google
  - sentence-similarity
  - lealla
  - labse
license: apache-2.0
datasets:
  - CommonCrawl
  - Wikipedia

LEALLA-small

Model description

LEALLA is a collection of lightweight language-agnostic sentence embedding models supporting 109 languages, distilled from LaBSE. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.

Model: HuggingFace's model hub.
Paper: arXiv.
Original model: TensorFlow Hub.
Conversion from TensorFlow to PyTorch: GitHub.

This is migrated from the v1 model on the TF Hub. The embeddings produced by both the versions of the model are equivalent. Though, for some of the languages (like Japanese), the LEALLA models appear to require higher tolerances when comparing embeddings and similarities.

Usage

Using the model:

import torch
from transformers import BertModel, BertTokenizerFast


tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-small")
model = BertModel.from_pretrained("setu4993/LEALLA-small")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)

To get the sentence embeddings, use the pooler output:

english_embeddings = english_outputs.pooler_output

Output for other languages:

italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output

For similarity between sentences, an L2-norm is recommended before calculating the similarity:

import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))

Details

Details about data, training, evaluation and performance metrics are available in the original paper.

BibTeX entry and citation info

@misc{mao2023lealla,
      title={LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation},
      author={Zhuoyuan Mao and Tetsuji Nakagawa},
      year={2023},
      eprint={2302.08387},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}