
SentenceTransformer

This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

The model is based on GLuCoSE and was additionally fine-tuned in the following steps.

Step 1: Ensemble distillation

Step 2: Contrastive learning

  • Triples were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
  • This training aimed to improve the overall performance as a sentence embedding model.
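
The card does not state which loss was used in Step 2. As a minimal sketch, assuming an InfoNCE-style objective over (anchor, positive, negative) triples with cosine similarity, one training step's loss might be computed as follows. The function name `triplet_info_nce` and the temperature value are hypothetical and are not taken from the authors' training code.

```python
import torch
import torch.nn.functional as F


def triplet_info_nce(anchor, positive, negative, temperature=0.05):
    """InfoNCE-style loss over (anchor, positive, negative) embedding triples.

    Hypothetical illustration, not the authors' confirmed objective.
    Each argument has shape [batch, dim].
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    # Cosine similarity of each anchor to its own positive and negative
    pos_sim = (a * p).sum(dim=-1)
    neg_sim = (a * n).sum(dim=-1)
    logits = torch.stack([pos_sim, neg_sim], dim=1) / temperature  # [batch, 2]
    # The positive always sits at index 0 of each row
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

The loss is small when each anchor is closer to its positive than to its negative, and grows when the order is reversed.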

Step 3: Search-specific contrastive learning

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Usage

Direct Usage (Sentence Transformers)

You can perform inference using SentenceTransformers with the following code:

from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
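
For retrieval, only the query-to-passage scores are needed rather than the full similarity matrix. The following is a minimal sketch, assuming the embeddings have already been computed as tensors; the helper names `add_prefix` and `rank_passages` are hypothetical and are not part of sentence-transformers.

```python
import torch
import torch.nn.functional as F


def add_prefix(texts, kind):
    """Prepend the "query: " or "passage: " prefix the model expects."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]


def rank_passages(query_emb, passage_embs, top_k=3):
    """Return (scores, indices) of the top_k passages by cosine similarity.

    query_emb: [dim], passage_embs: [num_passages, dim].
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_embs, dim=-1)
    scores = p @ q  # cosine similarity per passage, shape [num_passages]
    k = min(top_k, scores.numel())
    return torch.topk(scores, k)
```

With a loaded model, the inputs would come from `model.encode(add_prefix(texts, "query"), convert_to_tensor=True)` and the corresponding call with `"passage"`.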

Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb

# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
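
The `mean_pooling` function above averages over real tokens only: multiplying by the attention mask zeroes out padded positions, and dividing by the mask sum counts only the unpadded ones. The sketch below checks this on toy tensors (the values are made up for illustration); the function is repeated so that the block runs on its own.

```python
import torch
from torch import Tensor


def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Same logic as above: masked sum divided by the number of real tokens
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    return emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)


# One sequence with 3 positions and hidden size 2; the last position is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling(hidden, mask)
# Plain mean over the two real tokens, for comparison
expected = hidden[0, :2].mean(dim=0)
```

Here `pooled[0]` matches `expected`, so the padded vector has no effect on the result.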

Benchmarks

Retrieval

Evaluated with MIRACL-ja, JQaRA, JaCWIR, and MLDR-ja.

| Model | Size | MIRACL Recall@5 | JQaRA nDCG@10 | JaCWIR MAP@10 | MLDR nDCG@10 |
|---|---|---|---|---|---|
| OpenAI/text-embedding-3-small | - | processing... | 38.8 | 81.6 | processing... |
| OpenAI/text-embedding-3-large | - | processing... | processing... | processing... | processing... |
| intfloat/multilingual-e5-large | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
| cl-nagoya/ruri-large | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
| intfloat/multilingual-e5-base | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
| cl-nagoya/ruri-base | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |
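
The table reports Recall@5, nDCG@10 and MAP@10. As a rough sketch of what these measure, the functions below implement common textbook definitions with binary relevance for a single query. The benchmark suites may differ in details (for example graded relevance or the choice of normaliser), so this is not their exact evaluation code, and all function names are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k.

    Assumes relevant_ids is non-empty.
    """
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, d in enumerate(ranked_ids[:k])
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids[:k], start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / min(len(relevant_ids), k)


# Toy ranking for one query; document ids are made up
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

MAP@10 is then the mean of `average_precision_at_k(..., k=10)` over all queries, and the other two metrics are averaged over queries in the same way.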

Note: The JQaRA and JaCWIR results for OpenAI/text-embedding-3-small are quoted from the JQaRA and JaCWIR sources, respectively.

JMTEB

Evaluated with JMTEB.

| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| intfloat/multilingual-e5-large | 0.6B | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| cl-nagoya/ruri-large | 0.3B | 73.45 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| intfloat/multilingual-e5-base | 0.3B | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| cl-nagoya/ruri-base | 0.1B | 72.95 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| GLuCoSE v2 | 0.1B | 72.39 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |

Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the cl-nagoya/ruri-base model card.

9/11 correction: Some values were initially micro-averaged; I've now standardized all metrics to macro-averaging for consistency.
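
The card does not spell out what is being averaged over. Under one common reading, a macro average takes the mean of per-group means so that every group counts equally, while a micro average pools all individual scores so that larger groups weigh more. A minimal sketch with made-up numbers (the category names and scores below are hypothetical and are not JMTEB values):

```python
def macro_average(groups):
    """Mean of per-group means: every group counts equally."""
    means = [sum(v) / len(v) for v in groups.values()]
    return sum(means) / len(means)


def micro_average(groups):
    """Mean over all pooled scores: larger groups weigh more."""
    pooled = [s for v in groups.values() for s in v]
    return sum(pooled) / len(pooled)


# Hypothetical scores, grouped by a made-up task category
groups = {
    "category_a": [90.0],
    "category_b": [60.0, 60.0, 60.0],
}

macro = macro_average(groups)  # 75.0
micro = micro_average(groups)  # 67.5
```

The two averages diverge whenever groups differ in size, which is why mixing them in one table makes scores hard to compare.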

Authors

Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

License

This model is published under the Apache License, Version 2.0.
