Jinkin committed
Commit 3e9652a
1 parent: 7d8ba10

Update README.md

Files changed (1):
README.md +7 -6
README.md CHANGED
@@ -1057,15 +1057,15 @@ model-index:
 
 ## piccolo-base-zh
 
-piccolo是一个通用embedding模型, 由来自商汤科技的通用模型组完成训练。piccolo借鉴了E5以及GTE的训练流程,采用了两阶段的训练方式。
+piccolo是一个通用embedding模型(中文), 由来自商汤科技的通用模型组完成训练。piccolo借鉴了E5以及GTE的训练流程,采用了两阶段的训练方式。
 在第一阶段中,我们搜集和爬取了4亿的中文文本对(可视为弱监督文本对数据),并采用二元组的softmax对比学习损失来优化模型。
-在第二阶段中,我们从互联网搜集了2000万人工标注的中文文本对(精标数据),并采用带有难负样本的三元组的softmax对比学习损失来帮助模型更好地优化。
+在第二阶段中,我们搜集整理了2000万人工标注的中文文本对(精标数据),并采用带有难负样本的三元组的softmax对比学习损失来帮助模型更好地优化。
 目前,我们提供了piccolo-base-zh和piccolo-large-zh两个模型。
 
-piccolo is a general text embedding model, powered by the General Model Group from SenseTime Research.
+piccolo is a general text embedding model (Chinese), powered by the General Model Group from SenseTime Research.
 Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
 and train the model with a pair (text, text_pos) softmax contrastive loss.
-In the second stage, we collect 20 million human-labeled Chinese text pairs from open-source datasets, and finetune the model with a triplet (text, text_pos, text_neg) contrastive loss.
+In the second stage, we collect and curate 20 million human-labeled Chinese text pairs, and finetune the model with a triplet (text, text_pos, text_neg) contrastive loss.
 Currently we offer two model sizes: piccolo-base-zh and piccolo-large-zh.
 
 ## Metric
@@ -1106,10 +1106,11 @@ model = SentenceTransformer('sensenova/piccolo-base-zh')
 q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
 p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)
 scores = q_embeddings @ p_embeddings.T
-
-
 ```
 
+## Training Detail
+TODO
+
 ## acknowledgement
 
 piccolo is powered by the General Model group from SenseTime Research.
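To make the two-stage recipe described in the diff concrete, here is a minimal PyTorch sketch of the two losses it names: an in-batch softmax contrastive loss over (text, text_pos) pairs for stage one, and the same loss extended with mined hard negatives for the (text, text_pos, text_neg) triplets of stage two. The function names and the temperature value are illustrative assumptions, not piccolo's released training code.

```python
# Illustrative sketch only: piccolo's actual training code is not part of this
# repo, so the function names and the temperature below are assumptions.
import torch
import torch.nn.functional as F

def pair_contrastive_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Stage 1: in-batch softmax contrastive loss over (text, text_pos) pairs.

    q_emb, pos_emb: (batch, dim) L2-normalized embeddings; row i of pos_emb
    is the positive for row i of q_emb, all other rows act as negatives.
    """
    logits = q_emb @ pos_emb.T / temperature            # (batch, batch) similarities
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)              # diagonal entries are positives

def triplet_contrastive_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor,
                             neg_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """Stage 2: the same softmax loss, with mined hard negatives appended
    to the candidate pool, i.e. (text, text_pos, text_neg) triplets.
    """
    candidates = torch.cat([pos_emb, neg_emb], dim=0)   # (2*batch, dim)
    logits = q_emb @ candidates.T / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```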
 
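For context, the encode snippet touched by the second hunk slots into an end-to-end retrieval example like the one below. The queries and passages are made-up placeholders; only the prefixes ("查询:" for queries, "结果:" for passages) and normalize_embeddings=True come from the README itself.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('sensenova/piccolo-base-zh')

# Placeholder data for illustration.
queries = ["什么是文本嵌入模型?"]
passages = ["文本嵌入模型将句子映射为稠密向量。", "今天的天气很好。"]

# The README prepends "查询:" to queries and "结果:" to passages.
q_embeddings = model.encode(["查询:" + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(["结果:" + p for p in passages], normalize_embeddings=True)

# With normalized embeddings, the dot product is cosine similarity.
scores = q_embeddings @ p_embeddings.T

# Rank passages per query, best match first.
ranking = np.argsort(-scores, axis=1)
print(ranking[0])  # passage indices sorted by relevance to the first query
```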