Jinkin committed
Commit f476313
1 parent: 736658f

Update README.md

Files changed (1):
README.md +4 -2
README.md CHANGED
@@ -1063,14 +1063,16 @@ piccolo是一个通用embedding模型, 由来自商汤科技的通用模型组
目前,我们提供了piccolo-base-zh和piccolo-large-zh两个模型。

piccolo is a general text embedding model, powered by the General Model Group from SenseTime Research.
- Based on the BERT framework, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
+ Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
and train the model with a pair (text, text_pos) softmax contrastive loss.
In the second stage, we collect 20 million human-labeled Chinese text pairs from open-source datasets, and fine-tune the model with a triplet (text, text_pos, text_neg) contrastive loss.
Currently, we offer two sizes of the model: piccolo-base-zh and piccolo-large-zh.

## Metric
我们将piccolo与其他的开源embedding模型在CMTEB榜单上进行了比较,请参考CMTEB榜单。我们在eval文件夹中提供了复现结果的脚本。
- We compared the performance of the piccolo with other embedding models on the C-MTEB benchmark. please refer to the C-MTEB leaderboard. wo provide scripts in "eval" folder for results reproducing.
+
+ We compared the performance of piccolo with other open-source embedding models on the C-MTEB benchmark; please refer to the C-MTEB leaderboard.
+ We provide scripts in the "eval" folder for reproducing the results.


  | Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
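
The diff above names the two training objectives (a pair softmax contrastive loss for stage one, a triplet contrastive loss for stage two) but the README gives no formulas. Below is a minimal PyTorch sketch of both, assuming in-batch negatives for stage one and cosine similarity with a temperature; the function names, temperature value, and normalization are illustrative assumptions, not piccolo's actual training code.

```python
# Hedged sketch only: piccolo's real loss code is not shown in this README.
import torch
import torch.nn.functional as F


def pair_softmax_contrastive_loss(text_emb, pos_emb, temperature=0.05):
    """Stage 1: (text, text_pos) pairs with in-batch negatives.

    Each anchor's positive is the matching row of pos_emb; every other
    row in the batch serves as a negative.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    logits = text_emb @ pos_emb.T / temperature  # (B, B) cosine-similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


def triplet_contrastive_loss(text_emb, pos_emb, neg_emb, temperature=0.05):
    """Stage 2: (text, text_pos, text_neg) triplets.

    Softmax over the positive and the mined hard negative for each anchor.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    neg_emb = F.normalize(neg_emb, dim=-1)
    pos_sim = (text_emb * pos_emb).sum(-1, keepdim=True)  # (B, 1)
    neg_sim = (text_emb * neg_emb).sum(-1, keepdim=True)  # (B, 1)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Both functions take batches of encoder outputs of shape (batch, dim) and return a scalar loss; using in-batch negatives in stage one is the usual way to make weakly supervised web-scale pairs cheap to train on before hard-negative fine-tuning.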