Base BERT for semantic text similarity (STS) on GPU
A high-quality BERT model for computing sentence embeddings in Russian. The model is based on cointegrated/LaBSE-en-ru and has the same context length (512), embedding dimension (768), and inference speed.
Using the model with the transformers library:
```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize and run the model, moving inputs to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the [CLS] token embedding and L2-normalize it
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
```
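Because the returned vector is L2-normalized, the dot product of two embeddings equals their cosine similarity. A minimal sketch of scoring a sentence pair with the function above (the example sentences are illustrative):

```python
import numpy as np

# Unit-length vectors: the dot product is the cosine similarity
emb1 = embed_bert_cls('привет мир', model, tokenizer)
emb2 = embed_bert_cls('здравствуй вселенная', model, tokenizer)
print(float(np.dot(emb1, emb2)))  # values near 1.0 mean high similarity
```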
Usage with sentence_transformers:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
```
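Note that `util.dot_score` matches cosine similarity only when the model returns unit-normalized embeddings; `util.cos_sim` computes cosine similarity explicitly and is a safe default. A small sketch of ranking candidates against a query (the sentences are illustrative):

```python
# Rank candidate sentences against a query by explicit cosine similarity
query_emb = model.encode("привет мир")
cand_embs = model.encode(["hello world", "здравствуй вселенная"])
print(util.cos_sim(query_emb, cand_embs))  # tensor of shape (1, 2)
```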
Metrics
Model scores on the encodechka benchmark:
Model | STS | PI | NLI | SA | TI |
---|---|---|---|---|---|
intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
sergeyzh/LaBSE-ru-sts | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
sergeyzh/rubert-mini-sts | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
sergeyzh/rubert-tiny-sts | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
Tochka-AI/ruRoPEBert-e5-base-512 | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
cointegrated/LaBSE-en-ru | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
cointegrated/rubert-tiny2 | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
Tasks:
- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).
Speed and size
Model scores on the encodechka benchmark:
Model | CPU (ms) | GPU (ms) | Size (MB) | dim | n_ctx | n_vocab |
---|---|---|---|---|---|---|
intfloat/multilingual-e5-large | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
sergeyzh/LaBSE-ru-sts | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
sergeyzh/rubert-mini-sts | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
sergeyzh/rubert-tiny-sts | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
Tochka-AI/ruRoPEBert-e5-base-512 | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
cointegrated/LaBSE-en-ru | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
cointegrated/rubert-tiny2 | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
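The CPU and GPU columns are per-text encoding times. A minimal sketch of how such timings can be measured with sentence_transformers; this loop is an illustrative assumption, not encodechka's exact protocol:

```python
import time
from sentence_transformers import SentenceTransformer

# Illustrative timing loop; encodechka's exact methodology may differ
model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')  # pass device='cuda' to time on GPU
texts = ["привет мир"] * 100

model.encode(texts)  # warm-up run
start = time.perf_counter()
model.encode(texts)
elapsed_ms = (time.perf_counter() - start) / len(texts) * 1000
print(f"{elapsed_ms:.3f} ms per text")
```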
Model scores on the ruMTEB benchmark:
Model Name | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
---|---|---|---|---|---|---|---|---|
CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | 0.448 |
GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | 0.497 |
GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | 0.644 | 0.586 | 0.545 | 0.605 |
HeadlineClassification | Accuracy | 0.772 | 0.793 | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
InappropriatenessClassification | Accuracy | 0.646 | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | 0.566 |
RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | 0.807 |
RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | 0.741 |
RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | 0.653 |
RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | 0.582 |
RuSciBenchGRNTIClusteringP2P | V-measure | 0.522 | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | 0.445 |
RuSciBenchOECDClusteringP2P | V-measure | 0.473 | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
SensitiveTopicsClassification | Accuracy | 0.285 | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
TERRaClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
Averaged scores by task type:
Model Name | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
---|---|---|---|---|---|---|---|---|
Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | 0.588 |
Clustering | V-measure | 0.526 | 0.519 | 0.513 | 0.538 | 0.513 | 0.503 | 0.525 |
MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | 0.361 | 0.314 | 0.329 | 0.353 |
PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | 0.774 |
STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | 0.630 |