File size: 3,324 Bytes
7b25245 4582162 7b25245 68d882b 7b25245 4582162 076004f b4f3d09 61fab68 f4feec0 a34056b 73210b3 b53f5fb b4f3d09 076004f 20cc90e 4582162 d568520 4582162 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---
# SimCSE(sup)
## Data List
The following datasets are all in Chinese.
| Data | size(train) | size(valid) | size(test) |
|:----------------------:|:----------:|:----------:|:----------:|
| [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477| 20000| 20000|
| [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000| 10000| 10000|
| [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w ) | 238766| 8802| 12500|
| [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401| 2000| 2000|
| [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231| 1458| 1361|
| [*SNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828| 2699| 2618|
| [*MNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547| 2932| 2397|
## Model List
The evaluation dataset is in Chinese, and we used the same language model **RoBERTa base** on different methods. In addition, considering that the test set of some datasets is small, which may lead to a large deviation in evaluation accuracy, the evaluation data here uses train, valid and test at the same time, and the final evaluation result adopts the **weighted average (w-avg)** method.
| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------|:----------|:----------:|:----------:|
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | 78.61| -| -| -| -| -|
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | 79.07| -| -| -| -| -|
| [hellonlp/simcse-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh) | 81.32| -| -| -| -| -|
## Uses
You can use our model for encoding sentences into embeddings
```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity
# model
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)
def get_vector_simcse(sentence):
"""
预测simcse的语义向量。
"""
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
output = MODEL(input_ids)
return output.last_hidden_state[:, 0].squeeze(0)
embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
#torch.Size([1024])
```
You can also compute the cosine similarities between two sentences
```python
def get_similarity_two(sentence1, sentence2):
vec1 = get_vector_simcse(sentence1).tolist()
vec2 = get_vector_simcse(sentence2).tolist()
similarity_list = cosine_similarity([vec1], [vec2]).tolist()[0][0]
return similarity_list
sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1,sentence2)
print(result)
#0.848331
``` |