	---
	language:
	- zh
	license: mit
	---

	# SimCSE(sup)



	## Model List
	The evaluation datasets are Chinese, and every method is compared on the same RoBERTa-large language model.

	| Model | STS-B (w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
	|:-----------------------:|:------------:|:-----------:|:----------:|:----------:|:----------:|:----------:|
	| [hellonlp/simcse-roberta-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh) | 81.32 | - | - | - | - | - |





	## Uses
	You can use our model to encode sentences into embeddings:
	```python
	import torch
	from transformers import BertTokenizer
	from transformers import BertModel
	from sklearn.metrics.pairwise import cosine_similarity

	# Load the tokenizer and encoder from the Hugging Face Hub.
	simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
	tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
	MODEL = BertModel.from_pretrained(simcse_sup_path)

	def get_vector_simcse(sentence):
	    """
	    Compute the SimCSE semantic embedding of a sentence.
	    """
	    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
	    with torch.no_grad():  # inference only, no gradients needed
	        output = MODEL(input_ids)
	    # Use the hidden state of the first ([CLS]) token as the sentence vector.
	    return output.last_hidden_state[:, 0].squeeze(0)

	# "Wuhan is a beautiful city."
	embeddings = get_vector_simcse("武汉是一个美丽的城市。")
	print(embeddings.shape)
	# torch.Size([1024]), the hidden size of a RoBERTa-large encoder
	```

	You can also compute the cosine similarity between two sentences:
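	```python
	# Reuses get_vector_simcse and cosine_similarity from the block above.
	def get_similarity_two(sentence1, sentence2):
	    # Convert the embeddings to plain lists for scikit-learn.
	    vec1 = get_vector_simcse(sentence1).tolist()
	    vec2 = get_vector_simcse(sentence2).tolist()
	    # cosine_similarity returns a 1x1 matrix; extract the single score.
	    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
	    return similarity

	sentence1 = '你好吗'    # "How are you?"
	sentence2 = '你还好吗'  # "Are you doing OK?"
	result = get_similarity_two(sentence1, sentence2)
	print(result)
	# 0.848331
	```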
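
For reference, the cosine score used above is the dot product of two vectors divided by the product of their lengths. The following is a minimal, dependency-free sketch of that formula, extended to rank several candidates against a query. The names `cosine` and `rank_by_similarity` are hypothetical helpers introduced for illustration, and the 3-dimensional vectors are toy stand-ins for real sentence embeddings, not model output.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors: u.v / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): stand-ins for embeddings, chosen so the ranking is easy to check.
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With real sentences, one would presumably pass `get_vector_simcse(sentence).tolist()` for the query and for each candidate. Because cosine similarity ignores vector length, the second candidate scores highest even though it is twice as long as the query.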