---
language:
  - zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)

## Model List

The evaluation datasets are in Chinese, and the same base language model (RoBERTa-large) is used across the different methods.

| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
| :---- | :----: | :----: | :----: | :----: | :----: | :----: |
| BAAI/bge-large-zh | 78.61 | - | - | - | - | - |
| BAAI/bge-large-zh-v1.5 | 79.07 | - | - | - | - | - |
| hellonlp/simcse-large-zh | 81.32 | - | - | - | - | - |
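
These scores measure how well the model's cosine similarities agree with human-annotated similarity labels on each test set; such benchmarks typically report Spearman rank correlation. As a rough, illustrative sketch of that kind of evaluation (not the exact script used for the table above), the following assumes `sentences1`, `sentences2`, and `gold_scores` are taken from a test set such as STS-B, requires `scipy`, and reuses the `get_vector_simcse` helper defined in the Uses section below:

```python
# Illustrative sketch of a Spearman-correlation evaluation.
# sentences1/sentences2/gold_scores are hypothetical placeholders for a test set.
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_spearman(sentences1, sentences2, gold_scores):
    predictions = []
    for s1, s2 in zip(sentences1, sentences2):
        v1 = get_vector_simcse(s1).tolist()
        v2 = get_vector_simcse(s2).tolist()
        predictions.append(cosine_similarity([v1], [v2])[0][0])
    # Spearman rank correlation between predicted similarities and gold labels
    return spearmanr(predictions, gold_scores).correlation
```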

## Data List

The following datasets are all in Chinese.

| Data | Link | size(train) | size(valid) | size(test) |
| :---- | :----: | :----: | :----: | :----: |
| STS-B | STS-B | 5231 | 1458 | 1361 |
| ATEC | ATEC | 62477 | 20000 | 20000 |
| BQ | BQ | 100000 | 10000 | 10000 |
| LCQMC | LCQMC | 238766 | 8802 | 12500 |
| PAWSX | PAWSX | 49401 | 2000 | 2000 |
| SNLI | SNLI | 146828 | 2699 | 2618 |
| MNLI | MNLI | 122547 | 2932 | 2397 |
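
SNLI and MNLI provide the entailment/contradiction pairs that supervised SimCSE trains on: each premise serves as an anchor, its entailment sentence as a positive, and its contradiction sentence as a hard negative, optimized with an in-batch contrastive loss over cosine similarities. The sketch below only illustrates that loss under these assumptions and is not the training code of this repository:

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """
    Illustrative supervised-SimCSE contrastive loss.
    anchor, positive, negative: [batch, hidden] embeddings of the premise,
    its entailment sentence, and its contradiction sentence, respectively.
    temperature=0.05 follows the value used in the original SimCSE paper.
    """
    # Cosine similarity of each anchor against all positives and all hard negatives
    sim_pos = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), negative.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature  # [batch, 2*batch]
    # The matching positive for anchor i sits at column i
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```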

## Uses

You can use our model to encode sentences into embeddings:

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the tokenizer and model
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    Compute the SimCSE sentence embedding ([CLS] vector) for one sentence.
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():
        output = MODEL(input_ids)
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
# torch.Size([768])
```
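
The helper above encodes one sentence at a time using the [CLS] vector. If you need to embed many sentences, batching them through the tokenizer is usually faster; here is a minimal sketch, assuming the same `tokenizer` and `MODEL` objects (the `get_vectors_simcse` name is just illustrative):

```python
def get_vectors_simcse(sentences):
    """
    Encode a batch of sentences into [CLS] embeddings (illustrative sketch).
    """
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = MODEL(**inputs)
    return output.last_hidden_state[:, 0]

vectors = get_vectors_simcse(["武汉是一个美丽的城市。", "你好吗"])
print(vectors.shape)  # torch.Size([2, hidden_size])
```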

You can also compute the cosine similarity between two sentences:

```python
def get_similarity_two(sentence1, sentence2):
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    # Cosine similarity between the two sentence embeddings
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result)
# 0.848331
```
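
The same embeddings also work for ranking several candidate sentences against one query; below is a minimal sketch reusing `get_vector_simcse` and scikit-learn's `cosine_similarity` (the `rank_by_similarity` helper is just illustrative):

```python
import numpy as np

def rank_by_similarity(query, candidates):
    """
    Rank candidate sentences by cosine similarity to a query (illustrative sketch).
    """
    query_vec = np.array([get_vector_simcse(query).tolist()])
    cand_vecs = np.array([get_vector_simcse(c).tolist() for c in candidates])
    scores = cosine_similarity(query_vec, cand_vecs)[0]
    order = scores.argsort()[::-1]  # highest similarity first
    return [(candidates[i], float(scores[i])) for i in order]

print(rank_by_similarity("你好吗", ["你还好吗", "武汉是一个美丽的城市。"]))
```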