Jinkin committed
Commit d0e8a3f
1 Parent(s): 315e6b7

Update README.md

Files changed (1):
  1. README.md +60 -1
README.md CHANGED
@@ -1056,4 +1056,63 @@ model-index:
  ---


- ## piccolo-large-zh
+ ## piccolo-large-zh
+
+ piccolo is a general Chinese text embedding model, trained by the General Model Group at SenseTime Research.
+ Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collected and crawled 400 million weakly supervised Chinese text pairs from the Internet,
+ and trained the model with a pairwise (text, text_pos) softmax contrastive loss.
+ In the second stage, we collected 20 million human-labeled Chinese text pairs, and fine-tuned the model with a triplet (text, text_pos, text_neg) softmax contrastive loss that includes hard negatives.
+ Currently we offer two model sizes: piccolo-base-zh and piccolo-large-zh.
+
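+ For illustration, here is a minimal sketch of the two objectives described above. This is hypothetical code, not the released training script; the temperature value, tensor shapes, and function names are assumptions.
+
+ ```python
+ # Hypothetical sketch of the two training objectives; not the actual piccolo code.
+ import torch
+ import torch.nn.functional as F
+
+ def pair_contrastive_loss(q_emb, p_emb, temperature=0.05):
+     """Stage 1: in-batch softmax contrastive loss over (text, text_pos) pairs.
+     q_emb, p_emb: (batch, dim) L2-normalized embeddings."""
+     logits = q_emb @ p_emb.T / temperature             # (batch, batch) similarities
+     labels = torch.arange(q_emb.size(0), device=q_emb.device)
+     return F.cross_entropy(logits, labels)             # diagonal entries are the positives
+
+ def triplet_contrastive_loss(q_emb, p_emb, n_emb, temperature=0.05):
+     """Stage 2: softmax contrastive loss with hard negatives for
+     (text, text_pos, text_neg) triplets; n_emb holds the hard negatives."""
+     pos = (q_emb * p_emb).sum(-1, keepdim=True)        # (batch, 1) positive scores
+     neg = q_emb @ n_emb.T                              # (batch, batch) negative scores
+     logits = torch.cat([pos, neg], dim=1) / temperature
+     labels = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
+     return F.cross_entropy(logits, labels)             # positive sits at index 0
+ ```
+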
+ ## Metric
+ We compared piccolo with other open-source embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard.
+ We provide scripts for reproducing the results in the ["eval" folder](https://huggingface.co/sensenova/piccolo-base-zh/tree/main/eval).
+
+ | Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
+ |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | **piccolo-large-zh** | 0.65 | 1024 | 512 | **64.11** | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
+ | bge-large-zh | 1.3 | 1024 | 512 | 63.96 | 68.32 | 48.39 | 78.94 | 65.11 | 71.52 | 54.98 |
+ | **piccolo-base-zh** | 0.2 | 768 | 512 | **63.66** | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
+ | bge-large-zh-no-instruct | 1.3 | 1024 | 512 | 63.4 | 68.58 | 50.01 | 76.77 | 64.9 | 70.54 | 53 |
+ | bge-base-zh | 0.41 | 768 | 512 | 62.8 | 67.07 | 47.64 | 77.5 | 64.91 | 69.53 | 54.12 |
+
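+ As a rough illustration of how such an evaluation can be run, the sketch below uses the open-source `mteb` package rather than the scripts in the eval folder; the task selection by language is an assumption, so treat it as a starting point only.
+
+ ```python
+ # Generic C-MTEB-style evaluation sketch using the mteb package;
+ # the actual reproduction scripts live in the "eval" folder linked above.
+ from mteb import MTEB
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer('sensenova/piccolo-base-zh')
+ # Select the Chinese tasks; this filter is illustrative, not the official setup.
+ evaluation = MTEB(task_langs=["zh"])
+ evaluation.run(model, output_folder="results/piccolo-base-zh")
+ ```
+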
+ ## Usage
+ piccolo can easily be used through the sentence-transformers package:
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # For s2s (short-to-short) datasets, you can use piccolo as below.
+ sentences = ["数据1", "数据2"]
+ model = SentenceTransformer('sensenova/piccolo-base-zh')
+ embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+ embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+ similarity = embeddings_1 @ embeddings_2.T  # cosine similarities, since embeddings are normalized
+ print(similarity)
+
+ # For s2p (short-to-long) datasets, we recommend adding an instruction prefix
+ # ("查询:" for queries, "结果:" for passages) to help the model retrieve better.
+ queries = ['query_1', 'query_2']
+ passages = ["doc_1", "doc_2"]
+ q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
+ p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
+ scores = q_embeddings @ p_embeddings.T
+ print(scores)
+ ```
+
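+ If you prefer to avoid the sentence-transformers dependency, a rough equivalent with the transformers library might look as follows. This sketch assumes mean pooling over token embeddings; verify the pooling strategy in the model's sentence-transformers config before relying on it.
+
+ ```python
+ # Hypothetical raw-transformers usage; assumes mean pooling, which may differ
+ # from the pooling configured in the released sentence-transformers checkpoint.
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('sensenova/piccolo-base-zh')
+ model = AutoModel.from_pretrained('sensenova/piccolo-base-zh')
+
+ sentences = ["数据1", "数据2"]
+ inputs = tokenizer(sentences, padding=True, truncation=True,
+                    max_length=512, return_tensors='pt')
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Mean-pool token embeddings, masking out padding tokens.
+ mask = inputs['attention_mask'].unsqueeze(-1).float()
+ embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ print(embeddings @ embeddings.T)
+ ```
+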
+ ## Training Detail
+ TODO
+
+ ## Acknowledgement
+
+ piccolo is powered by the General Model Group at SenseTime Research.
+ [Jinkin](https://huggingface.co/Jinkin) completed the code implementation and model training.
+ [Jinkin](https://huggingface.co/Jinkin) and [CCCCxxx](https://huggingface.co/CCCCxxx) completed the data collection, processing, and model evaluation together.
+ The project is led by [Gaomengya](https://huggingface.co/gaomengya) and [chaorenwu111](https://huggingface.co/chaorenwu111).