File size: 15,057 Bytes
6b4f3f1 821a02f eef9885 821a02f eef9885 2491537 a748263 6b4f3f1 6b6c466 6b4f3f1 6b6c466 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 57ffbc3 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 143fca0 8c02c7a 6b6c466 8c02c7a 143fca0 8c02c7a 143fca0 8c02c7a 143fca0 8c02c7a 6b6c466 8c02c7a 0d1b94a 8c02c7a 6b6c466 8c02c7a 6b6c466 8c02c7a 6b6c466 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 |
---
language:
- zh
- en
base_model: openbmb/MiniCPM-2B-sft-bf16
model-index:
- name: MiniCPM-Embedding
results:
- task:
type: Retrieval
dataset:
type: mteb/arguana
name: MTEB ArguAna
config: default
split: test
revision: c22ab2a51041ffd869aaddef7af8d8215647e41a
metrics:
- type: ndcg_at_10
value: 64.65
- task:
type: Retrieval
dataset:
type: BeIR/cqadupstack
name: MTEB CQADupstackRetrieval
config: default
split: test
revision: 4ffe81d471b1924886b33c7567bfb200e9eec5c4
metrics:
- type: ndcg_at_10
value: 46.53
- task:
type: Retrieval
dataset:
type: mteb/climate-fever
name: MTEB ClimateFEVER
config: default
split: test
revision: 47f2ac6acb640fc46020b02a5b59fdda04d39380
metrics:
- type: ndcg_at_10
value: 35.55
- task:
type: Retrieval
dataset:
type: mteb/dbpedia
name: MTEB DBPedia
config: default
split: test
revision: c0f706b76e590d620bd6618b3ca8efdd34e2d659
metrics:
- type: ndcg_at_10
value: 47.82
- task:
type: Retrieval
dataset:
type: mteb/fever
name: MTEB FEVER
config: default
split: test
revision: bea83ef9e8fb933d90a2f1d5515737465d613e12
metrics:
- type: ndcg_at_10
value: 90.76
- task:
type: Retrieval
dataset:
type: mteb/fiqa
name: MTEB FiQA2018
config: default
split: test
revision: 27a168819829fe9bcd655c2df245fb19452e8e06
metrics:
- type: ndcg_at_10
value: 56.64
- task:
type: Retrieval
dataset:
type: mteb/hotpotqa
name: MTEB HotpotQA
config: default
split: test
revision: ab518f4d6fcca38d87c25209f94beba119d02014
metrics:
- type: ndcg_at_10
value: 78.11
- task:
type: Retrieval
dataset:
type: mteb/msmarco
name: MTEB MSMARCO
config: default
split: dev
revision: c5a29a104738b98a9e76336939199e264163d4a0
metrics:
- type: ndcg_at_10
value: 43.93
- task:
type: Retrieval
dataset:
type: mteb/nfcorpus
name: MTEB NFCorpus
config: default
split: test
revision: ec0fa4fe99da2ff19ca1214b7966684033a58814
metrics:
- type: ndcg_at_10
value: 39.77
- task:
type: Retrieval
dataset:
type: mteb/nq
name: MTEB NQ
config: default
split: test
revision: b774495ed302d8c44a3a7ea25c90dbce03968f31
metrics:
- type: ndcg_at_10
value: 69.29
- task:
type: Retrieval
dataset:
type: mteb/quora
name: MTEB QuoraRetrieval
config: default
split: test
revision: None
metrics:
- type: ndcg_at_10
value: 89.97
- task:
type: Retrieval
dataset:
type: mteb/scidocs
name: MTEB SCIDOCS
config: default
split: test
revision: None
metrics:
- type: ndcg_at_10
value: 22.38
- task:
type: Retrieval
dataset:
type: mteb/scifact
name: MTEB SciFact
config: default
split: test
revision: 0228b52cf27578f30900b9e5271d331663a030d7
metrics:
- type: ndcg_at_10
value: 86.6
- task:
type: Retrieval
dataset:
type: mteb/trec-covid
name: MTEB TRECCOVID
config: default
split: test
revision: None
metrics:
- type: ndcg_at_10
value: 81.32
- task:
type: Retrieval
dataset:
type: mteb/touche2020
name: MTEB Touche2020
config: default
split: test
revision: a34f9a33db75fa0cbb21bb5cfc3dae8dc8bec93f
metrics:
- type: ndcg_at_10
value: 25.08
- task:
type: Retrieval
dataset:
type: C-MTEB/CmedqaRetrieval
name: MTEB CmedqaRetrieval
config: default
split: dev
revision: cd540c506dae1cf9e9a59c3e06f42030d54e7301
metrics:
- type: ndcg_at_10
value: 46.05
- task:
type: Retrieval
dataset:
type: C-MTEB/CovidRetrieval
name: MTEB CovidRetrieval
config: default
split: dev
revision: 1271c7809071a13532e05f25fb53511ffce77117
metrics:
- type: ndcg_at_10
value: 92.01
- task:
type: Retrieval
dataset:
type: C-MTEB/DuRetrieval
name: MTEB DuRetrieval
config: default
split: dev
revision: a1a333e290fe30b10f3f56498e3a0d911a693ced
metrics:
- type: ndcg_at_10
value: 90.98
- task:
type: Retrieval
dataset:
type: C-MTEB/EcomRetrieval
name: MTEB EcomRetrieval
config: default
split: dev
revision: 687de13dc7294d6fd9be10c6945f9e8fec8166b9
metrics:
- type: ndcg_at_10
value: 70.21
- task:
type: Retrieval
dataset:
type: C-MTEB/MMarcoRetrieval
name: MTEB MMarcoRetrieval
config: default
split: dev
revision: 539bbde593d947e2a124ba72651aafc09eb33fc2
metrics:
- type: ndcg_at_10
value: 85.55
- task:
type: Retrieval
dataset:
type: C-MTEB/MedicalRetrieval
name: MTEB MedicalRetrieval
config: default
split: dev
revision: 2039188fb5800a9803ba5048df7b76e6fb151fc6
metrics:
- type: ndcg_at_10
value: 63.91
- task:
type: Retrieval
dataset:
type: C-MTEB/T2Retrieval
name: MTEB T2Retrieval
config: default
split: dev
revision: 8731a845f1bf500a4f111cf1070785c793d10e64
metrics:
- type: ndcg_at_10
value: 87.33
- task:
type: Retrieval
dataset:
type: C-MTEB/VideoRetrieval
name: MTEB VideoRetrieval
config: default
split: dev
revision: 58c2597a5943a2ba48f4668c3b90d796283c5639
metrics:
- type: ndcg_at_10
value: 78.05
pipeline_tag: feature-extraction
tags:
- mteb
library_name: transformers
---
## MiniCPM-Embedding
**MiniCPM-Embedding** 是面壁智能与清华大学自然语言处理实验室(THUNLP)共同开发的中英双语言文本嵌入模型,有如下特点:
- 出色的中文、英文检索能力。
- 出色的中英跨语言检索能力。
MiniCPM-Embedding 基于 [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) 训练,结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 600 万条训练数据。
欢迎关注 RAG 套件系列:
- 检索模型:[MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding)
- 重排模型:[MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker)
- 面向 RAG 场景的 LoRA 插件:[MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
**MiniCPM-Embedding** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. and THUNLP, featuring:
- Exceptional Chinese and English retrieval capabilities.
- Outstanding cross-lingual retrieval capabilities between Chinese and English.
MiniCPM-Embedding is trained based on [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 6 million training examples, including open-source, synthetic, and proprietary data.
We also invite you to explore the RAG toolkit series:
- Retrieval Model: [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding)
- Re-ranking Model: [MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker)
- LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
[1] Muennighoff, N. (2022). Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
## 模型信息 Model Information
- 模型大小:2.4B
- 嵌入维度:2304
- 最大输入token数:512
- Model Size: 2.4B
- Embedding Dimension: 2304
- Max Input Tokens: 512
## 使用方法 Usage
### 输入格式 Input Format
本模型支持 query 侧指令,格式如下:
MiniCPM-Embedding supports query-side instructions in the following format:
```
Instruction: {{ instruction }} Query: {{ query }}
```
例如:
For example:
```
Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
```
```
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.
```
也可以不提供指令,即采取如下格式:
MiniCPM-Embedding also works in instruction-free mode in the following format:
```
Query: {{ query }}
```
我们在 BEIR 与 C-MTEB/Retrieval 上测试时使用的指令见 `instructions.json`,其他测试不使用指令。文档侧直接输入文档原文。
When running evaluation on BEIR and C-MTEB/Retrieval, we use instructions in `instructions.json`. For other evaluations, we do not use instructions. On the document side, we directly use the bare document as the input.
### 环境要求 Requirements
```
transformers==4.37.2
flash-attn>2.3.5
```
### 示例脚本 Demo
#### Huggingface Transformers
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()
# 事实上我们用的是weighted mean pooling,但为了部署方便,我们将一部分pooling步骤集成在model.forward中
# In fact, we will use weighted mean pooling, but we will integrate some pooling steps into model.forward for deployment convenience
def mean_pooling(hidden,attention_mask):
s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
d = attention_mask.sum(dim=1, keepdim=True).float()
reps = s / d
return reps
@torch.no_grad()
def encode(input_texts):
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to("cuda")
outputs = model(**batch_dict)
attention_mask = batch_dict["attention_mask"]
hidden = outputs.last_hidden_state
reps = mean_pooling(hidden, attention_mask)
embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
return embeddings
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]
embeddings_query = encode(queries)
embeddings_doc = encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist()) # [[0.3535913825035095, 0.18596848845481873]]
```
#### Sentence Transformers
```python
import torch
from sentence_transformers import SentenceTransformer
model_name = "openbmb/MiniCPM-Embedding"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"attn_implementation":"flash_attention_2", "torch_dtype":torch.float16})
model.max_seq_length = 512
model.tokenizer.padding_side="right"
queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]
INSTRUCTION = "Query: "
embeddings_query = model.encode(queries, prompt=INSTRUCTION, normalize_embeddings=True)
embeddings_doc = model.encode(passages, normalize_embeddings=True)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist()) # [[0.3535913825035095, 0.18596848845481873]]
```
## 实验结果 Evaluation Results
### 中文与英文检索结果 CN/EN Retrieval Results
| 模型 Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|------------------------------|-------------------|---------------|
| bge-large-zh-v1.5 | 70.46 | - |
| gte-large-zh | 72.49 | - |
| Zhihui_LLM_Embedding | 76.74 | |
| bge-large-en-v1.5 | - | 54.29 |
| gte-en-large-v1.5 | - | 57.91 |
| NV-Retriever-v1 | - | 60.9 |
| bge-en-icl | - | 62.16 |
| NV-Embed-v2 | - | 62.65 |
| me5-large | 63.66 | 51.43 |
| bge-m3(Dense) | 65.43 | 48.82 |
| gte-multilingual-base(Dense) | 71.95 | 51.08 |
| gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
| gte-Qwen2-7B-instruct | 76.03 | 60.25 |
| bge-multilingual-gemma2 | 73.73 | 59.24 |
| MiniCPM-Embedding | **76.76** | 58.56 |
| MiniCPM-Embedding+MiniCPM-Reranker | 77.08 | 61.61 |
### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
| 模型 Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|------------------------------|--------------------|--------------------|--------------------|
| me5-large | 44.3 | 9.01 | 25.33 |
| bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
| gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
| gte-Qwen2-1.5B-instruct | 68.52 | 49.11 | 45.05 |
| gte-Qwen2-7B-instruct | 68.27 | 49.14 | 49.6 |
| MiniCPM-Embedding | **72.95** | **52.65** | **49.95** |
| MiniCPM-Embedding+MiniCPM-Reranker | 74.33 | 53.21 | 54.12 |
## 许可证 License
- 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
- MiniCPM-Embedding 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
- MiniCPM-Embedding 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-Embedding model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM-Embedding are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-Embedding weights are also available for free commercial use. |