SentenceTransformer
This is a sentence-transformers model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
The model is based on GLuCoSE and additionally fine-tuned. Fine-tuning consists of the following steps.
Step 1: Ensemble distillation
- The embedded representation was distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
Step 2: Contrastive learning
- Triples were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
- This training aimed to improve the overall performance as a sentence embedding model.
Step 3: Search-specific contrastive learning
- In order to make the model more robust to the retrieval task, additional two-stage training with QA and question-answer data was conducted.
- In the first stage, the synthetic dataset auto-wiki-qa was used for training, while in the second stage, Japanese Wikipedia Human Retrieval , Mr.TyDi, MIRACL, JQaRA, MQA, Quiz Works and Quiz No Mori were used.
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 tokens
- Similarity Function: Cosine Similarity
Usage
Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformers with the following code:
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences,convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def mean_pooling(last_hidden_states: Tensor,attention_mask: Tensor) -> Tensor:
emb = last_hidden_states * attention_mask.unsqueeze(-1)
emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
return emb
# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
Benchmarks
Retieval
Evaluated with MIRACL-ja, JQARA , JaCWIR and MLDR-ja.
Model | Size | MIRACL Recall@5 |
JQaRA nDCG@10 |
JaCWIR MAP@10 |
MLDR nDCG@10 |
---|---|---|---|---|---|
OpenAI/text-embedding-3-small | - | processing... | 38.8 | 81.6 | processing... |
OpenAI/text-embedding-3-large | - | processing... | processing... | processing... | processing... |
intfloat/multilingual-e5-large | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
cl-nagoya/ruri-large | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
intfloat/multilingual-e5-base | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
cl-nagoya/ruri-base | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
pkshatech/GLuCoSE-base-ja | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |
Note: Results for OpenAI small embeddings in JQARA and JaCWIR are quoted from the JQARA and JaCWIR.
JMTEB
Evaluated with JMTEB.
Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
---|---|---|---|---|---|---|---|---|
OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
intfloat/multilingual-e5-large | 0.6B | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
cl-nagoya/ruri-large | 0.3B | 73.45 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
intfloat/multilingual-e5-base | 0.3B | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
cl-nagoya/ruri-base | 0.1B | 72.95 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
pkshatech/GLuCoSE-base-ja | 0.1B | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
GLuCoSE v2 | 0.1B | 72.39 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the cl-nagoya/ruri-base model card.
9/11 correction: Some values were initially micro-averaged; I've now standardized all metrics to macro-averaging for consistency.
Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
License
This model is published under the Apache License, Version 2.0.
- Downloads last month
- 772