---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---
# SimCSE(sup)

## Model List
The evaluation datasets are in Chinese, and the same language model (RoBERTa-large) is used across the different methods.
| Model                    | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|--------------------------|--------------|------|----|-------|-------|------|
| BAAI/bge-large-zh        | 78.61        | -    | -  | -     | -     | -    |
| BAAI/bge-large-zh-v1.5   | 79.07        | -    | -  | -     | -     | -    |
| hellonlp/simcse-large-zh | 81.32        | -    | -  | -     | -     | -    |
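The numbers above are typically reported as the Spearman correlation between the model's cosine similarities and the human-annotated scores. Below is a minimal evaluation sketch under that assumption; `evaluate_sts`, its arguments, and the usage line are placeholders rather than part of this repository.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_sts(sentences1, sentences2, gold_scores, encode_fn):
    """Spearman correlation between predicted cosine similarities and gold scores."""
    vecs1 = np.array([encode_fn(s).tolist() for s in sentences1])
    vecs2 = np.array([encode_fn(s).tolist() for s in sentences2])
    # Cosine similarity of each aligned sentence pair.
    preds = [cosine_similarity([v1], [v2])[0][0] for v1, v2 in zip(vecs1, vecs2)]
    return spearmanr(preds, gold_scores).correlation

# Hypothetical usage with the get_vector_simcse function defined in the Uses section:
# score = evaluate_sts(s1_list, s2_list, labels, get_vector_simcse)
```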
## Data List
The following datasets are all in Chinese.
| Data  | Link  | size(train) | size(valid) | size(test) |
|-------|-------|-------------|-------------|------------|
| STS-B | STS-B | 5231        | 1458        | 1361       |
| ATEC  | ATEC  | 62477       | 20000       | 20000      |
| BQ    | BQ    | 100000      | 10000       | 10000      |
| LCQMC | LCQMC | 238766      | 8802        | 12500      |
| PAWSX | PAWSX | 49401       | 2000        | 2000       |
| SNLI  | SNLI  | 146828      | 2699        | 2618       |
| MNLI  | MNLI  | 122547      | 2932        | 2397       |
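Supervised SimCSE typically builds its training examples from NLI data such as SNLI and MNLI, using the entailment hypothesis as the positive and the contradiction hypothesis as a hard negative. A minimal sketch of such a record; the field names and sentences are illustrative, not the actual file format.

```python
# One supervised SimCSE training example built from an NLI triple.
# Field names and sentences are illustrative only.
nli_example = {
    "sent0": "一个男人在公园里弹吉他。",   # anchor (premise): "A man is playing guitar in the park."
    "sent1": "有人正在演奏乐器。",          # positive (entailment): "Someone is playing an instrument."
    "hard_neg": "公园里空无一人。",         # hard negative (contradiction): "The park is empty."
}
```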
## Uses
You can use our model to encode sentences into embeddings:
```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the tokenizer and model.
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    Predict the SimCSE semantic vector of a sentence.
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    output = MODEL(input_ids)
    # Use the [CLS] token representation as the sentence embedding.
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")  # "Wuhan is a beautiful city."
print(embeddings.shape)
# torch.Size([768])
```
You can also compute the cosine similarity between two sentences:
```python
def get_similarity_two(sentence1, sentence2):
    # Encode both sentences and compute their cosine similarity.
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result)
# 0.848331
```
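For more than two sentences, the embeddings can be compared all at once. A minimal sketch that reuses `get_vector_simcse` and the `cosine_similarity` import from above; the example sentences are illustrative.

```python
import numpy as np

# Illustrative sentences; any Chinese text works with this model.
sentences = ['你好吗', '你还好吗', '武汉是一个美丽的城市。']

# Stack the embeddings into a matrix and compute all pairwise cosine similarities.
vectors = np.array([get_vector_simcse(s).tolist() for s in sentences])
similarity_matrix = cosine_similarity(vectors)
print(similarity_matrix.shape)
# (3, 3); entry [i, j] is the similarity between sentences[i] and sentences[j]
```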