thenlper committed on
Commit d64520f · 1 Parent(s): 438f4ae

Upload 6 files

README.md CHANGED
@@ -1,3 +1,1134 @@
1
  ---
2
+ tags:
3
+ - mteb
4
+ - sentence-similarity
5
+ - sentence-transformers
6
+ - Sentence Transformers
7
+ model-index:
8
+ - name: gte-large-zh
9
+ results:
10
+ - task:
11
+ type: STS
12
+ dataset:
13
+ type: C-MTEB/AFQMC
14
+ name: MTEB AFQMC
15
+ config: default
16
+ split: validation
17
+ revision: None
18
+ metrics:
19
+ - type: cos_sim_pearson
20
+ value: 48.94131905219026
21
+ - type: cos_sim_spearman
22
+ value: 54.58261199731436
23
+ - type: euclidean_pearson
24
+ value: 52.73929210805982
25
+ - type: euclidean_spearman
26
+ value: 54.582632097533676
27
+ - type: manhattan_pearson
28
+ value: 52.73123295724949
29
+ - type: manhattan_spearman
30
+ value: 54.572941830465794
31
+ - task:
32
+ type: STS
33
+ dataset:
34
+ type: C-MTEB/ATEC
35
+ name: MTEB ATEC
36
+ config: default
37
+ split: test
38
+ revision: None
39
+ metrics:
40
+ - type: cos_sim_pearson
41
+ value: 47.292931669579005
42
+ - type: cos_sim_spearman
43
+ value: 54.601019783506466
44
+ - type: euclidean_pearson
45
+ value: 54.61393532658173
46
+ - type: euclidean_spearman
47
+ value: 54.60101865708542
48
+ - type: manhattan_pearson
49
+ value: 54.59369555606305
50
+ - type: manhattan_spearman
51
+ value: 54.601098593646036
52
+ - task:
53
+ type: Classification
54
+ dataset:
55
+ type: mteb/amazon_reviews_multi
56
+ name: MTEB AmazonReviewsClassification (zh)
57
+ config: zh
58
+ split: test
59
+ revision: 1399c76144fd37290681b995c656ef9b2e06e26d
60
+ metrics:
61
+ - type: accuracy
62
+ value: 47.233999999999995
63
+ - type: f1
64
+ value: 45.68998446563349
65
+ - task:
66
+ type: STS
67
+ dataset:
68
+ type: C-MTEB/BQ
69
+ name: MTEB BQ
70
+ config: default
71
+ split: test
72
+ revision: None
73
+ metrics:
74
+ - type: cos_sim_pearson
75
+ value: 62.55033151404683
76
+ - type: cos_sim_spearman
77
+ value: 64.40573802644984
78
+ - type: euclidean_pearson
79
+ value: 62.93453281081951
80
+ - type: euclidean_spearman
81
+ value: 64.40574149035828
82
+ - type: manhattan_pearson
83
+ value: 62.839969210895816
84
+ - type: manhattan_spearman
85
+ value: 64.30837945045283
86
+ - task:
87
+ type: Clustering
88
+ dataset:
89
+ type: C-MTEB/CLSClusteringP2P
90
+ name: MTEB CLSClusteringP2P
91
+ config: default
92
+ split: test
93
+ revision: None
94
+ metrics:
95
+ - type: v_measure
96
+ value: 42.098169316685045
97
+ - task:
98
+ type: Clustering
99
+ dataset:
100
+ type: C-MTEB/CLSClusteringS2S
101
+ name: MTEB CLSClusteringS2S
102
+ config: default
103
+ split: test
104
+ revision: None
105
+ metrics:
106
+ - type: v_measure
107
+ value: 38.90716707051822
108
+ - task:
109
+ type: Reranking
110
+ dataset:
111
+ type: C-MTEB/CMedQAv1-reranking
112
+ name: MTEB CMedQAv1
113
+ config: default
114
+ split: test
115
+ revision: None
116
+ metrics:
117
+ - type: map
118
+ value: 86.09191911031553
119
+ - type: mrr
120
+ value: 88.6747619047619
121
+ - task:
122
+ type: Reranking
123
+ dataset:
124
+ type: C-MTEB/CMedQAv2-reranking
125
+ name: MTEB CMedQAv2
126
+ config: default
127
+ split: test
128
+ revision: None
129
+ metrics:
130
+ - type: map
131
+ value: 86.45781885502122
132
+ - type: mrr
133
+ value: 89.01591269841269
134
+ - task:
135
+ type: Retrieval
136
+ dataset:
137
+ type: C-MTEB/CmedqaRetrieval
138
+ name: MTEB CmedqaRetrieval
139
+ config: default
140
+ split: dev
141
+ revision: None
142
+ metrics:
143
+ - type: map_at_1
144
+ value: 24.215
145
+ - type: map_at_10
146
+ value: 36.498000000000005
147
+ - type: map_at_100
148
+ value: 38.409
149
+ - type: map_at_1000
150
+ value: 38.524
151
+ - type: map_at_3
152
+ value: 32.428000000000004
153
+ - type: map_at_5
154
+ value: 34.664
155
+ - type: mrr_at_1
156
+ value: 36.834
157
+ - type: mrr_at_10
158
+ value: 45.196
159
+ - type: mrr_at_100
160
+ value: 46.214
161
+ - type: mrr_at_1000
162
+ value: 46.259
163
+ - type: mrr_at_3
164
+ value: 42.631
165
+ - type: mrr_at_5
166
+ value: 44.044
167
+ - type: ndcg_at_1
168
+ value: 36.834
169
+ - type: ndcg_at_10
170
+ value: 43.146
171
+ - type: ndcg_at_100
172
+ value: 50.632999999999996
173
+ - type: ndcg_at_1000
174
+ value: 52.608999999999995
175
+ - type: ndcg_at_3
176
+ value: 37.851
177
+ - type: ndcg_at_5
178
+ value: 40.005
179
+ - type: precision_at_1
180
+ value: 36.834
181
+ - type: precision_at_10
182
+ value: 9.647
183
+ - type: precision_at_100
184
+ value: 1.574
185
+ - type: precision_at_1000
186
+ value: 0.183
187
+ - type: precision_at_3
188
+ value: 21.48
189
+ - type: precision_at_5
190
+ value: 15.649
191
+ - type: recall_at_1
192
+ value: 24.215
193
+ - type: recall_at_10
194
+ value: 54.079
195
+ - type: recall_at_100
196
+ value: 84.943
197
+ - type: recall_at_1000
198
+ value: 98.098
199
+ - type: recall_at_3
200
+ value: 38.117000000000004
201
+ - type: recall_at_5
202
+ value: 44.775999999999996
203
+ - task:
204
+ type: PairClassification
205
+ dataset:
206
+ type: C-MTEB/CMNLI
207
+ name: MTEB Cmnli
208
+ config: default
209
+ split: validation
210
+ revision: None
211
+ metrics:
212
+ - type: cos_sim_accuracy
213
+ value: 82.51352976548407
214
+ - type: cos_sim_ap
215
+ value: 89.49905141462749
216
+ - type: cos_sim_f1
217
+ value: 83.89334489486234
218
+ - type: cos_sim_precision
219
+ value: 78.19761567993534
220
+ - type: cos_sim_recall
221
+ value: 90.48398410100538
222
+ - type: dot_accuracy
223
+ value: 82.51352976548407
224
+ - type: dot_ap
225
+ value: 89.49108293121158
226
+ - type: dot_f1
227
+ value: 83.89334489486234
228
+ - type: dot_precision
229
+ value: 78.19761567993534
230
+ - type: dot_recall
231
+ value: 90.48398410100538
232
+ - type: euclidean_accuracy
233
+ value: 82.51352976548407
234
+ - type: euclidean_ap
235
+ value: 89.49904709975154
236
+ - type: euclidean_f1
237
+ value: 83.89334489486234
238
+ - type: euclidean_precision
239
+ value: 78.19761567993534
240
+ - type: euclidean_recall
241
+ value: 90.48398410100538
242
+ - type: manhattan_accuracy
243
+ value: 82.48947684906794
244
+ - type: manhattan_ap
245
+ value: 89.49231995962901
246
+ - type: manhattan_f1
247
+ value: 83.84681215233205
248
+ - type: manhattan_precision
249
+ value: 77.28258726089528
250
+ - type: manhattan_recall
251
+ value: 91.62964694879588
252
+ - type: max_accuracy
253
+ value: 82.51352976548407
254
+ - type: max_ap
255
+ value: 89.49905141462749
256
+ - type: max_f1
257
+ value: 83.89334489486234
258
+ - task:
259
+ type: Retrieval
260
+ dataset:
261
+ type: C-MTEB/CovidRetrieval
262
+ name: MTEB CovidRetrieval
263
+ config: default
264
+ split: dev
265
+ revision: None
266
+ metrics:
267
+ - type: map_at_1
268
+ value: 78.583
269
+ - type: map_at_10
270
+ value: 85.613
271
+ - type: map_at_100
272
+ value: 85.777
273
+ - type: map_at_1000
274
+ value: 85.77900000000001
275
+ - type: map_at_3
276
+ value: 84.58
277
+ - type: map_at_5
278
+ value: 85.22800000000001
279
+ - type: mrr_at_1
280
+ value: 78.925
281
+ - type: mrr_at_10
282
+ value: 85.667
283
+ - type: mrr_at_100
284
+ value: 85.822
285
+ - type: mrr_at_1000
286
+ value: 85.824
287
+ - type: mrr_at_3
288
+ value: 84.651
289
+ - type: mrr_at_5
290
+ value: 85.299
291
+ - type: ndcg_at_1
292
+ value: 78.925
293
+ - type: ndcg_at_10
294
+ value: 88.405
295
+ - type: ndcg_at_100
296
+ value: 89.02799999999999
297
+ - type: ndcg_at_1000
298
+ value: 89.093
299
+ - type: ndcg_at_3
300
+ value: 86.393
301
+ - type: ndcg_at_5
302
+ value: 87.5
303
+ - type: precision_at_1
304
+ value: 78.925
305
+ - type: precision_at_10
306
+ value: 9.789
307
+ - type: precision_at_100
308
+ value: 1.005
309
+ - type: precision_at_1000
310
+ value: 0.101
311
+ - type: precision_at_3
312
+ value: 30.769000000000002
313
+ - type: precision_at_5
314
+ value: 19.031000000000002
315
+ - type: recall_at_1
316
+ value: 78.583
317
+ - type: recall_at_10
318
+ value: 96.891
319
+ - type: recall_at_100
320
+ value: 99.473
321
+ - type: recall_at_1000
322
+ value: 100.0
323
+ - type: recall_at_3
324
+ value: 91.438
325
+ - type: recall_at_5
326
+ value: 94.152
327
+ - task:
328
+ type: Retrieval
329
+ dataset:
330
+ type: C-MTEB/DuRetrieval
331
+ name: MTEB DuRetrieval
332
+ config: default
333
+ split: dev
334
+ revision: None
335
+ metrics:
336
+ - type: map_at_1
337
+ value: 25.604
338
+ - type: map_at_10
339
+ value: 77.171
340
+ - type: map_at_100
341
+ value: 80.033
342
+ - type: map_at_1000
343
+ value: 80.099
344
+ - type: map_at_3
345
+ value: 54.364000000000004
346
+ - type: map_at_5
347
+ value: 68.024
348
+ - type: mrr_at_1
349
+ value: 89.85
350
+ - type: mrr_at_10
351
+ value: 93.009
352
+ - type: mrr_at_100
353
+ value: 93.065
354
+ - type: mrr_at_1000
355
+ value: 93.068
356
+ - type: mrr_at_3
357
+ value: 92.72500000000001
358
+ - type: mrr_at_5
359
+ value: 92.915
360
+ - type: ndcg_at_1
361
+ value: 89.85
362
+ - type: ndcg_at_10
363
+ value: 85.038
364
+ - type: ndcg_at_100
365
+ value: 88.247
366
+ - type: ndcg_at_1000
367
+ value: 88.837
368
+ - type: ndcg_at_3
369
+ value: 85.20299999999999
370
+ - type: ndcg_at_5
371
+ value: 83.47
372
+ - type: precision_at_1
373
+ value: 89.85
374
+ - type: precision_at_10
375
+ value: 40.275
376
+ - type: precision_at_100
377
+ value: 4.709
378
+ - type: precision_at_1000
379
+ value: 0.486
380
+ - type: precision_at_3
381
+ value: 76.36699999999999
382
+ - type: precision_at_5
383
+ value: 63.75999999999999
384
+ - type: recall_at_1
385
+ value: 25.604
386
+ - type: recall_at_10
387
+ value: 85.423
388
+ - type: recall_at_100
389
+ value: 95.695
390
+ - type: recall_at_1000
391
+ value: 98.669
392
+ - type: recall_at_3
393
+ value: 56.737
394
+ - type: recall_at_5
395
+ value: 72.646
396
+ - task:
397
+ type: Retrieval
398
+ dataset:
399
+ type: C-MTEB/EcomRetrieval
400
+ name: MTEB EcomRetrieval
401
+ config: default
402
+ split: dev
403
+ revision: None
404
+ metrics:
405
+ - type: map_at_1
406
+ value: 51.800000000000004
407
+ - type: map_at_10
408
+ value: 62.17
409
+ - type: map_at_100
410
+ value: 62.649
411
+ - type: map_at_1000
412
+ value: 62.663000000000004
413
+ - type: map_at_3
414
+ value: 59.699999999999996
415
+ - type: map_at_5
416
+ value: 61.23499999999999
417
+ - type: mrr_at_1
418
+ value: 51.800000000000004
419
+ - type: mrr_at_10
420
+ value: 62.17
421
+ - type: mrr_at_100
422
+ value: 62.649
423
+ - type: mrr_at_1000
424
+ value: 62.663000000000004
425
+ - type: mrr_at_3
426
+ value: 59.699999999999996
427
+ - type: mrr_at_5
428
+ value: 61.23499999999999
429
+ - type: ndcg_at_1
430
+ value: 51.800000000000004
431
+ - type: ndcg_at_10
432
+ value: 67.246
433
+ - type: ndcg_at_100
434
+ value: 69.58
435
+ - type: ndcg_at_1000
436
+ value: 69.925
437
+ - type: ndcg_at_3
438
+ value: 62.197
439
+ - type: ndcg_at_5
440
+ value: 64.981
441
+ - type: precision_at_1
442
+ value: 51.800000000000004
443
+ - type: precision_at_10
444
+ value: 8.32
445
+ - type: precision_at_100
446
+ value: 0.941
447
+ - type: precision_at_1000
448
+ value: 0.097
449
+ - type: precision_at_3
450
+ value: 23.133
451
+ - type: precision_at_5
452
+ value: 15.24
453
+ - type: recall_at_1
454
+ value: 51.800000000000004
455
+ - type: recall_at_10
456
+ value: 83.2
457
+ - type: recall_at_100
458
+ value: 94.1
459
+ - type: recall_at_1000
460
+ value: 96.8
461
+ - type: recall_at_3
462
+ value: 69.39999999999999
463
+ - type: recall_at_5
464
+ value: 76.2
465
+ - task:
466
+ type: Classification
467
+ dataset:
468
+ type: C-MTEB/IFlyTek-classification
469
+ name: MTEB IFlyTek
470
+ config: default
471
+ split: validation
472
+ revision: None
473
+ metrics:
474
+ - type: accuracy
475
+ value: 49.60369372835706
476
+ - type: f1
477
+ value: 38.24016248875209
478
+ - task:
479
+ type: Classification
480
+ dataset:
481
+ type: C-MTEB/JDReview-classification
482
+ name: MTEB JDReview
483
+ config: default
484
+ split: test
485
+ revision: None
486
+ metrics:
487
+ - type: accuracy
488
+ value: 86.71669793621012
489
+ - type: ap
490
+ value: 55.75807094995178
491
+ - type: f1
492
+ value: 81.59033162805417
493
+ - task:
494
+ type: STS
495
+ dataset:
496
+ type: C-MTEB/LCQMC
497
+ name: MTEB LCQMC
498
+ config: default
499
+ split: test
500
+ revision: None
501
+ metrics:
502
+ - type: cos_sim_pearson
503
+ value: 69.50947272908907
504
+ - type: cos_sim_spearman
505
+ value: 74.40054474949213
506
+ - type: euclidean_pearson
507
+ value: 73.53007373987617
508
+ - type: euclidean_spearman
509
+ value: 74.40054474732082
510
+ - type: manhattan_pearson
511
+ value: 73.51396571849736
512
+ - type: manhattan_spearman
513
+ value: 74.38395696630835
514
+ - task:
515
+ type: Reranking
516
+ dataset:
517
+ type: C-MTEB/Mmarco-reranking
518
+ name: MTEB MMarcoReranking
519
+ config: default
520
+ split: dev
521
+ revision: None
522
+ metrics:
523
+ - type: map
524
+ value: 31.188333827724108
525
+ - type: mrr
526
+ value: 29.84801587301587
527
+ - task:
528
+ type: Retrieval
529
+ dataset:
530
+ type: C-MTEB/MMarcoRetrieval
531
+ name: MTEB MMarcoRetrieval
532
+ config: default
533
+ split: dev
534
+ revision: None
535
+ metrics:
536
+ - type: map_at_1
537
+ value: 64.685
538
+ - type: map_at_10
539
+ value: 73.803
540
+ - type: map_at_100
541
+ value: 74.153
542
+ - type: map_at_1000
543
+ value: 74.167
544
+ - type: map_at_3
545
+ value: 71.98
546
+ - type: map_at_5
547
+ value: 73.21600000000001
548
+ - type: mrr_at_1
549
+ value: 66.891
550
+ - type: mrr_at_10
551
+ value: 74.48700000000001
552
+ - type: mrr_at_100
553
+ value: 74.788
554
+ - type: mrr_at_1000
555
+ value: 74.801
556
+ - type: mrr_at_3
557
+ value: 72.918
558
+ - type: mrr_at_5
559
+ value: 73.965
560
+ - type: ndcg_at_1
561
+ value: 66.891
562
+ - type: ndcg_at_10
563
+ value: 77.534
564
+ - type: ndcg_at_100
565
+ value: 79.106
566
+ - type: ndcg_at_1000
567
+ value: 79.494
568
+ - type: ndcg_at_3
569
+ value: 74.13499999999999
570
+ - type: ndcg_at_5
571
+ value: 76.20700000000001
572
+ - type: precision_at_1
573
+ value: 66.891
574
+ - type: precision_at_10
575
+ value: 9.375
576
+ - type: precision_at_100
577
+ value: 1.0170000000000001
578
+ - type: precision_at_1000
579
+ value: 0.105
580
+ - type: precision_at_3
581
+ value: 27.932000000000002
582
+ - type: precision_at_5
583
+ value: 17.86
584
+ - type: recall_at_1
585
+ value: 64.685
586
+ - type: recall_at_10
587
+ value: 88.298
588
+ - type: recall_at_100
589
+ value: 95.426
590
+ - type: recall_at_1000
591
+ value: 98.48700000000001
592
+ - type: recall_at_3
593
+ value: 79.44200000000001
594
+ - type: recall_at_5
595
+ value: 84.358
596
+ - task:
597
+ type: Classification
598
+ dataset:
599
+ type: mteb/amazon_massive_intent
600
+ name: MTEB MassiveIntentClassification (zh-CN)
601
+ config: zh-CN
602
+ split: test
603
+ revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
604
+ metrics:
605
+ - type: accuracy
606
+ value: 73.30531271015468
607
+ - type: f1
608
+ value: 70.88091430578575
609
+ - task:
610
+ type: Classification
611
+ dataset:
612
+ type: mteb/amazon_massive_scenario
613
+ name: MTEB MassiveScenarioClassification (zh-CN)
614
+ config: zh-CN
615
+ split: test
616
+ revision: 7d571f92784cd94a019292a1f45445077d0ef634
617
+ metrics:
618
+ - type: accuracy
619
+ value: 75.7128446536651
620
+ - type: f1
621
+ value: 75.06125593532262
622
+ - task:
623
+ type: Retrieval
624
+ dataset:
625
+ type: C-MTEB/MedicalRetrieval
626
+ name: MTEB MedicalRetrieval
627
+ config: default
628
+ split: dev
629
+ revision: None
630
+ metrics:
631
+ - type: map_at_1
632
+ value: 52.7
633
+ - type: map_at_10
634
+ value: 59.532
635
+ - type: map_at_100
636
+ value: 60.085
637
+ - type: map_at_1000
638
+ value: 60.126000000000005
639
+ - type: map_at_3
640
+ value: 57.767
641
+ - type: map_at_5
642
+ value: 58.952000000000005
643
+ - type: mrr_at_1
644
+ value: 52.900000000000006
645
+ - type: mrr_at_10
646
+ value: 59.648999999999994
647
+ - type: mrr_at_100
648
+ value: 60.20100000000001
649
+ - type: mrr_at_1000
650
+ value: 60.242
651
+ - type: mrr_at_3
652
+ value: 57.882999999999996
653
+ - type: mrr_at_5
654
+ value: 59.068
655
+ - type: ndcg_at_1
656
+ value: 52.7
657
+ - type: ndcg_at_10
658
+ value: 62.883
659
+ - type: ndcg_at_100
660
+ value: 65.714
661
+ - type: ndcg_at_1000
662
+ value: 66.932
663
+ - type: ndcg_at_3
664
+ value: 59.34700000000001
665
+ - type: ndcg_at_5
666
+ value: 61.486
667
+ - type: precision_at_1
668
+ value: 52.7
669
+ - type: precision_at_10
670
+ value: 7.340000000000001
671
+ - type: precision_at_100
672
+ value: 0.8699999999999999
673
+ - type: precision_at_1000
674
+ value: 0.097
675
+ - type: precision_at_3
676
+ value: 21.3
677
+ - type: precision_at_5
678
+ value: 13.819999999999999
679
+ - type: recall_at_1
680
+ value: 52.7
681
+ - type: recall_at_10
682
+ value: 73.4
683
+ - type: recall_at_100
684
+ value: 87.0
685
+ - type: recall_at_1000
686
+ value: 96.8
687
+ - type: recall_at_3
688
+ value: 63.9
689
+ - type: recall_at_5
690
+ value: 69.1
691
+ - task:
692
+ type: Classification
693
+ dataset:
694
+ type: C-MTEB/MultilingualSentiment-classification
695
+ name: MTEB MultilingualSentiment
696
+ config: default
697
+ split: validation
698
+ revision: None
699
+ metrics:
700
+ - type: accuracy
701
+ value: 76.47666666666667
702
+ - type: f1
703
+ value: 76.4808576632057
704
+ - task:
705
+ type: PairClassification
706
+ dataset:
707
+ type: C-MTEB/OCNLI
708
+ name: MTEB Ocnli
709
+ config: default
710
+ split: validation
711
+ revision: None
712
+ metrics:
713
+ - type: cos_sim_accuracy
714
+ value: 77.58527341635084
715
+ - type: cos_sim_ap
716
+ value: 79.32131557636497
717
+ - type: cos_sim_f1
718
+ value: 80.51948051948052
719
+ - type: cos_sim_precision
720
+ value: 71.7948717948718
721
+ - type: cos_sim_recall
722
+ value: 91.65786694825766
723
+ - type: dot_accuracy
724
+ value: 77.58527341635084
725
+ - type: dot_ap
726
+ value: 79.32131557636497
727
+ - type: dot_f1
728
+ value: 80.51948051948052
729
+ - type: dot_precision
730
+ value: 71.7948717948718
731
+ - type: dot_recall
732
+ value: 91.65786694825766
733
+ - type: euclidean_accuracy
734
+ value: 77.58527341635084
735
+ - type: euclidean_ap
736
+ value: 79.32131557636497
737
+ - type: euclidean_f1
738
+ value: 80.51948051948052
739
+ - type: euclidean_precision
740
+ value: 71.7948717948718
741
+ - type: euclidean_recall
742
+ value: 91.65786694825766
743
+ - type: manhattan_accuracy
744
+ value: 77.15213860314023
745
+ - type: manhattan_ap
746
+ value: 79.26178519246496
747
+ - type: manhattan_f1
748
+ value: 80.22028453418999
749
+ - type: manhattan_precision
750
+ value: 70.94155844155844
751
+ - type: manhattan_recall
752
+ value: 92.29144667370645
753
+ - type: max_accuracy
754
+ value: 77.58527341635084
755
+ - type: max_ap
756
+ value: 79.32131557636497
757
+ - type: max_f1
758
+ value: 80.51948051948052
759
+ - task:
760
+ type: Classification
761
+ dataset:
762
+ type: C-MTEB/OnlineShopping-classification
763
+ name: MTEB OnlineShopping
764
+ config: default
765
+ split: test
766
+ revision: None
767
+ metrics:
768
+ - type: accuracy
769
+ value: 92.68
770
+ - type: ap
771
+ value: 90.78652757815115
772
+ - type: f1
773
+ value: 92.67153098230253
774
+ - task:
775
+ type: STS
776
+ dataset:
777
+ type: C-MTEB/PAWSX
778
+ name: MTEB PAWSX
779
+ config: default
780
+ split: test
781
+ revision: None
782
+ metrics:
783
+ - type: cos_sim_pearson
784
+ value: 35.301730226895955
785
+ - type: cos_sim_spearman
786
+ value: 38.54612530948101
787
+ - type: euclidean_pearson
788
+ value: 39.02831131230217
789
+ - type: euclidean_spearman
790
+ value: 38.54612530948101
791
+ - type: manhattan_pearson
792
+ value: 39.04765584936325
793
+ - type: manhattan_spearman
794
+ value: 38.54455759013173
795
+ - task:
796
+ type: STS
797
+ dataset:
798
+ type: C-MTEB/QBQTC
799
+ name: MTEB QBQTC
800
+ config: default
801
+ split: test
802
+ revision: None
803
+ metrics:
804
+ - type: cos_sim_pearson
805
+ value: 32.27907454729754
806
+ - type: cos_sim_spearman
807
+ value: 33.35945567162729
808
+ - type: euclidean_pearson
809
+ value: 31.997628193815725
810
+ - type: euclidean_spearman
811
+ value: 33.3592386340529
812
+ - type: manhattan_pearson
813
+ value: 31.97117833750544
814
+ - type: manhattan_spearman
815
+ value: 33.30857326127779
816
+ - task:
817
+ type: STS
818
+ dataset:
819
+ type: mteb/sts22-crosslingual-sts
820
+ name: MTEB STS22 (zh)
821
+ config: zh
822
+ split: test
823
+ revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
824
+ metrics:
825
+ - type: cos_sim_pearson
826
+ value: 62.53712784446981
827
+ - type: cos_sim_spearman
828
+ value: 62.975074386224286
829
+ - type: euclidean_pearson
830
+ value: 61.791207731290854
831
+ - type: euclidean_spearman
832
+ value: 62.975073716988064
833
+ - type: manhattan_pearson
834
+ value: 62.63850653150875
835
+ - type: manhattan_spearman
836
+ value: 63.56640346497343
837
+ - task:
838
+ type: STS
839
+ dataset:
840
+ type: C-MTEB/STSB
841
+ name: MTEB STSB
842
+ config: default
843
+ split: test
844
+ revision: None
845
+ metrics:
846
+ - type: cos_sim_pearson
847
+ value: 79.52067424748047
848
+ - type: cos_sim_spearman
849
+ value: 79.68425102631514
850
+ - type: euclidean_pearson
851
+ value: 79.27553959329275
852
+ - type: euclidean_spearman
853
+ value: 79.68450427089856
854
+ - type: manhattan_pearson
855
+ value: 79.21584650471131
856
+ - type: manhattan_spearman
857
+ value: 79.6419242840243
858
+ - task:
859
+ type: Reranking
860
+ dataset:
861
+ type: C-MTEB/T2Reranking
862
+ name: MTEB T2Reranking
863
+ config: default
864
+ split: dev
865
+ revision: None
866
+ metrics:
867
+ - type: map
868
+ value: 65.8563449629786
869
+ - type: mrr
870
+ value: 75.82550832339254
871
+ - task:
872
+ type: Retrieval
873
+ dataset:
874
+ type: C-MTEB/T2Retrieval
875
+ name: MTEB T2Retrieval
876
+ config: default
877
+ split: dev
878
+ revision: None
879
+ metrics:
880
+ - type: map_at_1
881
+ value: 27.889999999999997
882
+ - type: map_at_10
883
+ value: 72.878
884
+ - type: map_at_100
885
+ value: 76.737
886
+ - type: map_at_1000
887
+ value: 76.836
888
+ - type: map_at_3
889
+ value: 52.738
890
+ - type: map_at_5
891
+ value: 63.726000000000006
892
+ - type: mrr_at_1
893
+ value: 89.35600000000001
894
+ - type: mrr_at_10
895
+ value: 92.622
896
+ - type: mrr_at_100
897
+ value: 92.692
898
+ - type: mrr_at_1000
899
+ value: 92.694
900
+ - type: mrr_at_3
901
+ value: 92.13799999999999
902
+ - type: mrr_at_5
903
+ value: 92.452
904
+ - type: ndcg_at_1
905
+ value: 89.35600000000001
906
+ - type: ndcg_at_10
907
+ value: 81.932
908
+ - type: ndcg_at_100
909
+ value: 86.351
910
+ - type: ndcg_at_1000
911
+ value: 87.221
912
+ - type: ndcg_at_3
913
+ value: 84.29100000000001
914
+ - type: ndcg_at_5
915
+ value: 82.279
916
+ - type: precision_at_1
917
+ value: 89.35600000000001
918
+ - type: precision_at_10
919
+ value: 39.511
920
+ - type: precision_at_100
921
+ value: 4.901
922
+ - type: precision_at_1000
923
+ value: 0.513
924
+ - type: precision_at_3
925
+ value: 72.62100000000001
926
+ - type: precision_at_5
927
+ value: 59.918000000000006
928
+ - type: recall_at_1
929
+ value: 27.889999999999997
930
+ - type: recall_at_10
931
+ value: 80.636
932
+ - type: recall_at_100
933
+ value: 94.333
934
+ - type: recall_at_1000
935
+ value: 98.39099999999999
936
+ - type: recall_at_3
937
+ value: 54.797
938
+ - type: recall_at_5
939
+ value: 67.824
940
+ - task:
941
+ type: Classification
942
+ dataset:
943
+ type: C-MTEB/TNews-classification
944
+ name: MTEB TNews
945
+ config: default
946
+ split: validation
947
+ revision: None
948
+ metrics:
949
+ - type: accuracy
950
+ value: 51.979000000000006
951
+ - type: f1
952
+ value: 50.35658238894168
953
+ - task:
954
+ type: Clustering
955
+ dataset:
956
+ type: C-MTEB/ThuNewsClusteringP2P
957
+ name: MTEB ThuNewsClusteringP2P
958
+ config: default
959
+ split: test
960
+ revision: None
961
+ metrics:
962
+ - type: v_measure
963
+ value: 68.36477832710159
964
+ - task:
965
+ type: Clustering
966
+ dataset:
967
+ type: C-MTEB/ThuNewsClusteringS2S
968
+ name: MTEB ThuNewsClusteringS2S
969
+ config: default
970
+ split: test
971
+ revision: None
972
+ metrics:
973
+ - type: v_measure
974
+ value: 62.92080622759053
975
+ - task:
976
+ type: Retrieval
977
+ dataset:
978
+ type: C-MTEB/VideoRetrieval
979
+ name: MTEB VideoRetrieval
980
+ config: default
981
+ split: dev
982
+ revision: None
983
+ metrics:
984
+ - type: map_at_1
985
+ value: 59.3
986
+ - type: map_at_10
987
+ value: 69.299
988
+ - type: map_at_100
989
+ value: 69.669
990
+ - type: map_at_1000
991
+ value: 69.682
992
+ - type: map_at_3
993
+ value: 67.583
994
+ - type: map_at_5
995
+ value: 68.57799999999999
996
+ - type: mrr_at_1
997
+ value: 59.3
998
+ - type: mrr_at_10
999
+ value: 69.299
1000
+ - type: mrr_at_100
1001
+ value: 69.669
1002
+ - type: mrr_at_1000
1003
+ value: 69.682
1004
+ - type: mrr_at_3
1005
+ value: 67.583
1006
+ - type: mrr_at_5
1007
+ value: 68.57799999999999
1008
+ - type: ndcg_at_1
1009
+ value: 59.3
1010
+ - type: ndcg_at_10
1011
+ value: 73.699
1012
+ - type: ndcg_at_100
1013
+ value: 75.626
1014
+ - type: ndcg_at_1000
1015
+ value: 75.949
1016
+ - type: ndcg_at_3
1017
+ value: 70.18900000000001
1018
+ - type: ndcg_at_5
1019
+ value: 71.992
1020
+ - type: precision_at_1
1021
+ value: 59.3
1022
+ - type: precision_at_10
1023
+ value: 8.73
1024
+ - type: precision_at_100
1025
+ value: 0.9650000000000001
1026
+ - type: precision_at_1000
1027
+ value: 0.099
1028
+ - type: precision_at_3
1029
+ value: 25.900000000000002
1030
+ - type: precision_at_5
1031
+ value: 16.42
1032
+ - type: recall_at_1
1033
+ value: 59.3
1034
+ - type: recall_at_10
1035
+ value: 87.3
1036
+ - type: recall_at_100
1037
+ value: 96.5
1038
+ - type: recall_at_1000
1039
+ value: 99.0
1040
+ - type: recall_at_3
1041
+ value: 77.7
1042
+ - type: recall_at_5
1043
+ value: 82.1
1044
+ - task:
1045
+ type: Classification
1046
+ dataset:
1047
+ type: C-MTEB/waimai-classification
1048
+ name: MTEB Waimai
1049
+ config: default
1050
+ split: test
1051
+ revision: None
1052
+ metrics:
1053
+ - type: accuracy
1054
+ value: 88.36999999999999
1055
+ - type: ap
1056
+ value: 73.29590829222836
1057
+ - type: f1
1058
+ value: 86.74250506247606
1059
+ language:
1060
+ - zh
1061
  license: mit
1062
  ---
+
+ # gte-large-zh
+
+ General Text Embeddings (GTE) model, introduced in [Towards General Text Embeddings with Multi-stage Contrastive Learning](https://arxiv.org/abs/2308.03281).
+
+ The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently offered in three sizes: [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh), [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh), and [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh). The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. This allows them to be applied to various downstream text embedding tasks, including **information retrieval**, **semantic textual similarity**, and **text reranking**.
+
+ ## Metrics
+
+ We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+
+ ## Usage
+
+ Code example:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ input_texts = [
+     "what is the capital of China?",
+     "how to implement quick sort in python?",
+     "Beijing",
+     "sorting algorithms"
+ ]
+
+ tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
+ model = AutoModel.from_pretrained("thenlper/gte-large-zh")
+
+ # Tokenize the input texts
+ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
+
+ # Inference only, so skip gradient tracking
+ with torch.no_grad():
+     outputs = model(**batch_dict)
+ # Use the final hidden state of the first ([CLS]) token as the sentence embedding
+ embeddings = outputs.last_hidden_state[:, 0]
+
+ # (Optionally) normalize embeddings
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ # Score the first text against the remaining three, scaled by 100
+ scores = (embeddings[:1] @ embeddings[1:].T) * 100
+ print(scores.tolist())
+ ```
+
+ Use with sentence-transformers:
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ sentences = ['That is a happy person', 'That is a very happy person']
+
+ model = SentenceTransformer('thenlper/gte-large-zh')
+ embeddings = model.encode(sentences)
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ ### Limitation
+
+ This model primarily targets Chinese text, and inputs longer than 512 tokens are truncated.
+
+ ### Citation
+
+ If you find our paper or models helpful, please consider citing them as follows:
+
+ ```
+ @misc{li2023general,
+       title={Towards General Text Embeddings with Multi-stage Contrastive Learning},
+       author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
+       year={2023},
+       eprint={2308.03281},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
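The pooling and scoring steps in the README's first usage example can be illustrated without downloading the model. The following is a minimal pure-Python sketch, assuming a made-up `last_hidden_state` given as nested lists (batch × tokens × hidden size). The helper names `cls_pool`, `l2_normalize`, and `scaled_scores` are hypothetical; they stand in for `outputs.last_hidden_state[:, 0]`, `F.normalize(embeddings, p=2, dim=1)`, and the `@ ... * 100` product respectively.

```python
import math


def cls_pool(last_hidden_state):
    """Take the first ([CLS]) token vector of each sequence."""
    return [tokens[0] for tokens in last_hidden_state]


def l2_normalize(vectors, eps=1e-12):
    """Scale each vector to unit L2 norm (eps guards against division by zero)."""
    normalized = []
    for v in vectors:
        norm = max(math.sqrt(sum(x * x for x in v)), eps)
        normalized.append([x / norm for x in v])
    return normalized


def scaled_scores(query, candidates, scale=100.0):
    """Dot product of the query with each candidate, multiplied by `scale`."""
    return [scale * sum(q * c for q, c in zip(query, cand)) for cand in candidates]


# Toy data (made up for illustration): 3 sequences, 2 tokens each, hidden size 2.
toy_hidden = [
    [[3.0, 4.0], [9.0, 9.0]],
    [[6.0, 8.0], [0.0, 1.0]],
    [[4.0, -3.0], [5.0, 5.0]],
]

embeddings = l2_normalize(cls_pool(toy_hidden))
# Score the first embedding against the rest, as the README example does.
scores = scaled_scores(embeddings[0], embeddings[1:])
print([round(s, 3) for s in scores])
```

Because the vectors are unit length after normalization, each score is the cosine similarity times 100: parallel vectors score near 100 and orthogonal ones near 0.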
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
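Two of the config files added in this commit look inconsistent at first glance: `tokenizer_config.json` sets `model_max_length` to 1000000000000000019884624838656, while `sentence_bert_config.json` sets `max_seq_length` to 512. That large value equals `int(1e30)` in Python, which appears to be a placeholder meaning "no limit set" rather than a real length. The sketch below is only an illustration under that assumption: it parses the relevant values from both files, treats the placeholder as unset, and lists keys on which the two files disagree. The helper names `effective_max_length` and `conflicting_keys` are hypothetical.

```python
import json

# Values copied from the two config files in this commit.
sentence_bert_config = json.loads('{"max_seq_length": 512, "do_lower_case": false}')
tokenizer_config = json.loads(
    '{"do_lower_case": true, "model_max_length": 1000000000000000019884624838656}'
)

# Assumption: int(1e30) acts as an "unset" sentinel for model_max_length.
UNSET_SENTINEL = int(1e30)


def effective_max_length(st_cfg, tok_cfg):
    """Smallest real limit across both configs, ignoring the sentinel."""
    limits = [st_cfg["max_seq_length"]]
    tok_limit = tok_cfg.get("model_max_length")
    if tok_limit is not None and tok_limit < UNSET_SENTINEL:
        limits.append(tok_limit)
    return min(limits)


def conflicting_keys(a, b):
    """Keys present in both configs whose values differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


print(effective_max_length(sentence_bert_config, tokenizer_config))
print(conflicting_keys(sentence_bert_config, tokenizer_config))
```

The second call surfaces `do_lower_case`, which is `false` in one file and `true` in the other. Which value takes effect depends on how the model is loaded, so it is worth checking against the library in use.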
vocab.txt ADDED
The diff for this file is too large to render. See raw diff