README.md · sergeyzh/LaBSE-ru-sts at 00c333ce29c9ad739f48baca9a578cd1e85094d4

language:
  - ru
pipeline_tag: sentence-similarity
tags:
  - russian
  - pretraining
  - embeddings
  - feature-extraction
  - sentence-similarity
  - sentence-transformers
  - transformers
license: mit
base_model: cointegrated/LaBSE-en-ru

Base BERT model for semantic text similarity (STS) on GPU

A high-quality BERT model for computing sentence embeddings in Russian. The model is based on cointegrated/LaBSE-en-ru and has the same context length (512 tokens), embedding dimension (768), and inference speed.

Using the model with the `transformers` library:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment it if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
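
The pooling step inside `embed_bert_cls` can be illustrated without downloading the model. The following is a minimal pure-Python sketch, assuming a toy `last_hidden_state` laid out as `[batch][tokens][dim]`. The helper names `cls_pool` and `l2_normalize` and all numbers are hypothetical and only show the two operations involved: taking the first ([CLS]) token vector of each sequence and scaling it to unit length.

```python
import math

def cls_pool(last_hidden_state):
    """Take the first token vector ([CLS]) of every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy stand-in for model_output.last_hidden_state: 2 sequences, 3 tokens, dim 2.
# These values are made up for illustration, not real model output.
toy_hidden = [
    [[3.0, 4.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [5.0, 5.0], [1.0, 1.0]],
]

cls_vectors = cls_pool(toy_hidden)                      # [[3.0, 4.0], [0.0, 2.0]]
unit_vectors = [l2_normalize(v) for v in cls_vectors]   # each vector now has length 1
print(unit_vectors)
```

Because the vectors come out with unit length, a plain dot product between two of them can be read as their cosine similarity.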

Using the model with `sentence_transformers`:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
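
`util.dot_score` returns the matrix of pairwise dot products. A dot product equals cosine similarity only for unit-length vectors, and this card does not state whether `model.encode` returns normalized embeddings for this model, so the sketch below normalizes explicitly. It is a pure-Python illustration on made-up 3-dimensional vectors; the helper names `normalize`, `dot`, `similarity_matrix`, and `most_similar` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(vectors):
    """Pairwise cosine similarity: dot products of unit-normalized vectors."""
    unit = [normalize(v) for v in vectors]
    return [[dot(a, b) for b in unit] for a in unit]

def most_similar(matrix, index):
    """Index of the highest-scoring *other* row for row `index`."""
    candidates = [j for j in range(len(matrix)) if j != index]
    return max(candidates, key=lambda j: matrix[index][j])

# Toy embeddings (not real model output): rows 0 and 1 point in similar directions.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 2.0],
]

scores = similarity_matrix(toy_embeddings)
nearest = most_similar(scores, 0)
print(nearest)
```

With real embeddings, passing `normalize_embeddings=True` to `model.encode` or calling `util.cos_sim` instead of `util.dot_score` gives the same effect inside `sentence_transformers`.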

Metrics

Model scores on the encodechka benchmark:

Model	STS	PI	NLI	SA	TI
intfloat/multilingual-e5-large	0.862	0.727	0.473	0.810	0.979
sergeyzh/LaBSE-ru-sts	0.845	0.737	0.481	0.805	0.957
sergeyzh/rubert-mini-sts	0.815	0.723	0.477	0.791	0.949
sergeyzh/rubert-tiny-sts	0.797	0.702	0.453	0.778	0.946
Tochka-AI/ruRoPEBert-e5-base-512	0.793	0.704	0.457	0.803	0.970
cointegrated/LaBSE-en-ru	0.794	0.659	0.431	0.761	0.946
cointegrated/rubert-tiny2	0.750	0.651	0.417	0.737	0.937

Tasks:

Semantic text similarity (STS);
Paraphrase identification (PI);
Natural language inference (NLI);
Sentiment analysis (SA);
Toxicity identification (TI).

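Speed and size

Speed and size measurements from the encodechka benchmark (CPU and GPU are average inference times in milliseconds per sentence; size is the model size in MB):

Model	CPU	GPU	size	dim	n_ctx	n_vocab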
intfloat/multilingual-e5-large	149.026	15.629	2136	1024	514	250002
sergeyzh/LaBSE-ru-sts	42.835	8.561	490	768	512	55083
sergeyzh/rubert-mini-sts	6.417	5.517	123	312	2048	83828
sergeyzh/rubert-tiny-sts	3.208	3.379	111	312	2048	83828
Tochka-AI/ruRoPEBert-e5-base-512	43.314	9.338	532	768	512	69382
cointegrated/LaBSE-en-ru	42.867	8.549	490	768	512	55083
cointegrated/rubert-tiny2	3.212	3.384	111	312	2048	83828
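
The CPU and GPU columns can be turned into rough speedup and throughput figures. The sketch below is simple arithmetic on three rows copied from the table, assuming (as in encodechka's own reporting) that both columns are per-sentence latencies in milliseconds. The helper names `gpu_speedup` and `sentences_per_second` are hypothetical.

```python
# (CPU ms, GPU ms) per sentence, copied from the table above.
# Assumption: both columns are average latencies in milliseconds per sentence.
timings_ms = {
    "sergeyzh/LaBSE-ru-sts": (42.835, 8.561),
    "sergeyzh/rubert-mini-sts": (6.417, 5.517),
    "sergeyzh/rubert-tiny-sts": (3.208, 3.379),
}

def gpu_speedup(cpu_ms, gpu_ms):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_ms / gpu_ms

def sentences_per_second(ms_per_sentence):
    """Convert a per-sentence latency in milliseconds into throughput."""
    return 1000.0 / ms_per_sentence

for name, (cpu_ms, gpu_ms) in timings_ms.items():
    print(
        f"{name}: GPU speedup x{gpu_speedup(cpu_ms, gpu_ms):.2f}, "
        f"~{sentences_per_second(gpu_ms):.0f} sentences/s on GPU"
    )
```

On these numbers the base-sized model gains roughly fivefold from a GPU, while the tiny model shows no gain, which matches the positioning of this model as the GPU-oriented option.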

Model scores on the ruMTEB benchmark:

Model Name	Metric	sbert_large_mt_nlu_ru	sbert_large_nlu_ru	LaBSE-ru-sts	LaBSE-ru-turbo	multilingual-e5-small	multilingual-e5-base	multilingual-e5-large
CEDRClassification	Accuracy	0.368	0.358	0.418	0.451	0.401	0.423	0.448
GeoreviewClassification	Accuracy	0.397	0.400	0.406	0.438	0.447	0.461	0.497
GeoreviewClusteringP2P	V-measure	0.584	0.590	0.626	0.644	0.586	0.545	0.605
HeadlineClassification	Accuracy	0.772	0.793	0.633	0.688	0.732	0.757	0.758
InappropriatenessClassification	Accuracy	0.646	0.625	0.599	0.615	0.592	0.588	0.616
KinopoiskClassification	Accuracy	0.503	0.495	0.496	0.521	0.500	0.509	0.566
RiaNewsRetrieval	NDCG@10	0.214	0.111	0.651	0.694	0.700	0.702	0.807
RuBQReranking	MAP@10	0.561	0.468	0.688	0.687	0.715	0.720	0.756
RuBQRetrieval	NDCG@10	0.298	0.124	0.622	0.657	0.685	0.696	0.741
RuReviewsClassification	Accuracy	0.589	0.583	0.599	0.632	0.612	0.630	0.653
RuSTSBenchmarkSTS	Pearson correlation	0.712	0.588	0.788	0.822	0.781	0.796	0.831
RuSciBenchGRNTIClassification	Accuracy	0.542	0.539	0.529	0.569	0.550	0.563	0.582
RuSciBenchGRNTIClusteringP2P	V-measure	0.522	0.504	0.486	0.517	0.511	0.516	0.520
RuSciBenchOECDClassification	Accuracy	0.438	0.430	0.406	0.440	0.427	0.423	0.445
RuSciBenchOECDClusteringP2P	V-measure	0.473	0.464	0.426	0.452	0.443	0.448	0.450
SensitiveTopicsClassification	Accuracy	0.285	0.280	0.262	0.272	0.228	0.234	0.257
TERRaClassification	Average Precision	0.520	0.502	0.587	0.585	0.551	0.550	0.584

Model Name	Metric	sbert_large_mt_nlu_ru	sbert_large_nlu_ru	LaBSE-ru-sts	LaBSE-ru-turbo	multilingual-e5-small	multilingual-e5-base	multilingual-e5-large
Classification	Accuracy	0.554	0.552	0.524	0.558	0.551	0.561	0.588
Clustering	V-measure	0.526	0.519	0.513	0.538	0.513	0.503	0.525
MultiLabelClassification	Accuracy	0.326	0.319	0.340	0.361	0.314	0.329	0.353
PairClassification	Average Precision	0.520	0.502	0.587	0.585	0.551	0.550	0.584
Reranking	MAP@10	0.561	0.468	0.688	0.687	0.715	0.720	0.756
Retrieval	NDCG@10	0.256	0.118	0.637	0.675	0.697	0.699	0.774
STS	Pearson correlation	0.712	0.588	0.788	0.822	0.781	0.796	0.831
Average	Average	0.494	0.438	0.582	0.604	0.588	0.594	0.630
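
The Average row appears to be the unweighted mean of the seven category rows above it. The card does not state this explicitly, so the sketch below only checks the arithmetic for two columns, using values copied from the table; the helper name `unweighted_mean` is hypothetical.

```python
# Category scores copied from the summary table, in row order:
# Classification, Clustering, MultiLabelClassification, PairClassification,
# Reranking, Retrieval, STS.
category_scores = {
    "LaBSE-ru-sts": [0.524, 0.513, 0.340, 0.587, 0.688, 0.637, 0.788],
    "multilingual-e5-large": [0.588, 0.525, 0.353, 0.584, 0.756, 0.774, 0.831],
}

# Values from the "Average" row of the same table.
reported_average = {
    "LaBSE-ru-sts": 0.582,
    "multilingual-e5-large": 0.630,
}

def unweighted_mean(values):
    """Arithmetic mean with every category weighted equally."""
    return sum(values) / len(values)

for name, scores in category_scores.items():
    computed = unweighted_mean(scores)
    print(f"{name}: computed {computed:.4f}, reported {reported_average[name]:.3f}")
```

Since every category carries the same weight, the two-task Retrieval category influences the average as much as the nine-task Classification category.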
