---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE (sup)



## Model List
All evaluation datasets are in Chinese, and every method is evaluated with the same language model, RoBERTa large.

| Model | STS-B (w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------:|:----------:|:----------:|:----------:|
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | 78.61 | - | - | - | - | - |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | 79.07 | - | - | - | - | - |
| [hellonlp/simcse-roberta-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh) | 81.32 | - | - | - | - | - |
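
The README does not say how the scores above are computed. As an assumption, STS-style benchmarks are commonly scored with the Spearman rank correlation (multiplied by 100) between predicted cosine similarities and gold labels. The following is a minimal pure-Python sketch of that metric; the helper names `average_ranks` and `spearman` and the toy data are hypothetical and are not taken from the authors' evaluation code.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with order[i]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(gold)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: model similarities vs. human labels (not real results)
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [5.0, 2.0, 3.0, 0.0]
print(round(spearman(predicted, gold) * 100, 2))  # 100.0, since both rankings agree
```

Only the ordering of the predictions matters for this metric, so the raw cosine values do not need to be rescaled to the label range.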



## Data List
The following datasets are all in Chinese.

| Data | Link | Size (train) | Size (valid) | Size (test) |
|:-----------------------:|:------------:|:------------:|:------------:|:------------:|
| STS-B | [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231 | 1458 | 1361 |
| ATEC | [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477 | 20000 | 20000 |
| BQ | [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000 | 10000 | 10000 |
| LCQMC | [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w) | 238766 | 8802 | 12500 |
| PAWSX | [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401 | 2000 | 2000 |
| SNLI | [SNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828 | 2699 | 2618 |
| MNLI | [MNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547 | 2932 | 2397 |



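## Uses
You can use the model to encode sentences into embeddings:
```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the tokenizer and model from the Hugging Face Hub
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    Compute the SimCSE semantic embedding of a sentence.
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():  # inference only, no gradients needed
        output = MODEL(input_ids)
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")  # "Wuhan is a beautiful city."
print(embeddings.shape)
# torch.Size([768]) as originally reported; the dimension equals
# MODEL.config.hidden_size, which is normally 1024 for a large backbone
```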
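
Encoding one sentence at a time with `get_vector_simcse` is simple but slow when there are many inputs. The following is a minimal sketch of a batched variant, assuming the standard tokenizer call interface with `padding` and `truncation`. The function name `encode_batch` and the `max_length` default of 128 are hypothetical choices, not part of the original model card.

```python
from typing import List

import torch

def encode_batch(sentences: List[str], tokenizer, model, max_length: int = 128) -> torch.Tensor:
    """Return one [CLS] embedding per sentence, shape (len(sentences), hidden_size)."""
    # Pad to the longest sentence in the batch so the inputs form one tensor;
    # the attention mask returned by the tokenizer tells the model to ignore padding.
    inputs = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    model.eval()  # make sure dropout is disabled
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)
    # Same pooling as get_vector_simcse: hidden state of the first ([CLS]) token
    return output.last_hidden_state[:, 0]
```

With the objects loaded above, `encode_batch(["你好吗", "你还好吗"], tokenizer, MODEL)` would return a tensor with two rows, one per sentence.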

You can also compute the cosine similarity between two sentences:
```python
# Reuses get_vector_simcse and the cosine_similarity import from the block above
def get_similarity_two(sentence1, sentence2):
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    # cosine_similarity expects 2-D inputs, so wrap each vector in a list
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'  # "How are you?"
sentence2 = '你还好吗'  # "Are you still doing okay?"
result = get_similarity_two(sentence1, sentence2)
print(result)
# 0.848331
```
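
Cosine similarity is the dot product of two vectors divided by the product of their lengths, which is what `cosine_similarity` computes for each pair. As a sketch of how the pairwise score extends to simple retrieval, the pure-Python example below ranks candidates against a query. The names `cosine` and `rank_by_similarity` and the toy vectors are hypothetical illustrations; in practice `embed` could be something like `lambda s: get_vector_simcse(s).tolist()`.

```python
import math
from typing import Callable, List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def rank_by_similarity(
    query: str,
    candidates: Sequence[str],
    embed: Callable[[str], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least similar."""
    query_vec = embed(query)
    scored = [(c, cosine(query_vec, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 2-D "embeddings" standing in for real model output (made-up values)
toy_vectors = {"q": [1.0, 0.0], "a": [1.0, 1.0], "b": [0.0, 1.0], "c": [2.0, 0.0]}
ranking = rank_by_similarity("q", ["a", "b", "c"], toy_vectors.__getitem__)
print([name for name, _ in ranking])  # ['c', 'a', 'b']
```

Passing `embed` as a parameter keeps the ranking logic independent of the model, so it can be checked with toy vectors before the real encoder is plugged in.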