maidalun1020 committed on
Commit
0ca1bc5
1 Parent(s): a1140bc

Update README.md

Files changed (1)
  1. README.md +26 -37
README.md CHANGED
@@ -27,24 +27,24 @@ license: apache-2.0
27
  <details open="open">
28
  <summary>Click to Open Contents</summary>
29
 
30
- - <a href="#t1">🌐 Bilingual and Crosslingual Superiority</a>
31
- - <a href="#t2">💡 Key Features</a>
32
- - <a href="#t3">🚀 Latest Updates</a>
33
- - <a href="#t4">🍎 Model List</a>
34
- - <a href="#t5">📖 Manual</a>
35
- - <a href="#installation">Installation</a>
36
- - <a href="#quick-start">Quick Start</a>
37
- - <a href="#t6">⚙️ Evaluation</a>
38
- - <a href="#evaluate-semantic-representation-by-mteb">Evaluate Semantic Representation by MTEB</a>
39
- - <a href="#evaluate-rag-by-llamaindex">Evaluate RAG by LlamaIndex</a>
40
- - <a href="#t7">📈 Leaderboard</a>
41
- - <a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>
42
- - <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>
43
- - <a href="#t8">🛠 Youdao's BCEmbedding API</a>
44
- - <a href="#t9">🧲 WeChat Group</a>
45
- - <a href="#t10">✏️ Citation</a>
46
- - <a href="#t11">🔐 License</a>
47
- - <a href="#t12">🔗 Related Links</a>
48
 
49
  </details>
50
  <br>
@@ -54,18 +54,17 @@ license: apache-2.0
54
  `BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)], an open-source implementation widely integrated in various Youdao products like [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
55
 
56
  Distinguished for its bilingual and crosslingual proficiency, `BCEmbedding` excels in bridging Chinese and English linguistic gaps, which achieves
57
- - **A high performance on <a href=#semantic-representation-evaluations-in-mteb>Semantic Representation Evaluations in MTEB</a>**;
58
- - **A new benchmark in the realm of <a href=#rag-evaluations-in-llamaindex>RAG Evaluations in LlamaIndex</a>**.
59
 
60
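  `BCEmbedding` is a library of bilingual and crosslingual semantic representation models developed by NetEase Youdao. It contains two kinds of base models, `EmbeddingModel` and `RerankerModel`. `EmbeddingModel` is dedicated to generating semantic vectors and plays a key role in semantic search and question answering, while `RerankerModel` specializes in refining semantic search results and reranking them by semantic relevance.
61
 
62
  `BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) applications and plays an especially important role in [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)]. QAnything is an open-source NetEase Youdao project that is used in many Youdao products, such as [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
63
 
64
  `BCEmbedding` is known for its strong bilingual and crosslingual capability. It removes the gap between Chinese and English in semantic retrieval, which yields:
65
- - **Strong bilingual and crosslingual semantic representation capability [<a href=#t7-1>Semantic Representation Evaluations in MTEB</a>].**
66
- - **SOTA performance in the LlamaIndex-based RAG evaluation [<a href=#t7-2>RAG Evaluations in LlamaIndex</a>].**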
67
 
68
- <t id="t1"></t>
69
  ## 🌐 Bilingual and Crosslingual Superiority
70
 
71
  Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
@@ -76,7 +75,6 @@ Existing embedding models often encounter performance challenges in bilingual an
76
 
77
  `EmbeddingModel` supports ***Chinese and English*** (more languages will be supported later); `RerankerModel` supports ***Chinese, English, Japanese and Korean***.
78
 
79
- <t id="t2"></t>
80
  ## 💡 Key Features
81
 
82
  - **Bilingual and Crosslingual Proficiency**: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval task, with upcoming support for additional languages.
@@ -95,7 +93,7 @@ Existing embedding models often encounter performance challenges in bilingual an
95
 
96
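  - **Bilingual and Crosslingual Proficiency**: Built on the strength of Youdao's translation engine, `BCEmbedding` delivers strong Chinese-English bilingual and crosslingual semantic representation.
97
 
98
- - **RAG-Optimized**: Tuned specifically for RAG and adaptable to most related tasks, such as **translation, summarization, and question answering**. It is also tuned for **query understanding**; see <a href=#t7-2>RAG Evaluations in LlamaIndex</a>.
99
 
100
  - **Efficient and Precise Retrieval**: `EmbeddingModel` uses a dual encoder for efficient semantic retrieval in the first stage. `RerankerModel` uses a cross encoder for higher-precision semantic reranking in the second stage.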
101
 
@@ -107,7 +105,6 @@ Existing embedding models often encounter performance challenges in bilingual an
107
 
108
  - **Production Validation**: `BCEmbedding` has been validated in many real Youdao products.
109
 
110
- <t id="t3"></t>
111
  ## 🚀 Latest Updates
112
 
113
  - ***2024-01-03***: **Model Releases** - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
@@ -118,7 +115,6 @@ Existing embedding models often encounter performance challenges in bilingual an
118
  - ***2024-01-03***: **RAG Evaluation Dataset** [[CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset)] - The RAG evaluation data based on [LlamaIndex](https://github.com/run-llama/llama_index) has been released.
119
  - ***2024-01-03***: **Crosslingual Semantic Representation Evaluation Datasets** [[Details](https://github.com/netease-youdao/BCEmbedding/BCEmbedding/evaluation/c_mteb/Retrieval.py)] - The crosslingual evaluation data based on [MTEB](https://github.com/embeddings-benchmark/mteb) has been released.
120
 
121
- <t id="t4"></t>
122
  ## 🍎 Model List
123
 
124
  | Model Name | Model Type | Languages | Parameters | Weights |
@@ -126,7 +122,6 @@ Existing embedding models often encounter performance challenges in bilingual an
126
  | bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
127
  | bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
128
 
129
- <t id="t5"></t>
130
  ## 📖 Manual
131
 
132
  ### Installation
@@ -188,7 +183,6 @@ scores = model.compute_score(sentence_pairs)
188
  rerank_results = model.rerank(query, passages)
189
  ```
190
 
191
- <t id="t6"></t>
192
  ## ⚙️ Evaluation
193
 
194
  ### Evaluate Semantic Representation by MTEB
@@ -323,7 +317,6 @@ python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir result
323
 
324
  The summary of multiple domains evaluations can be seen in <a href=#1-multiple-domains-scenarios>Multiple Domains Scenarios</a>.
325
 
326
- <t id="t7"></t>
327
  ## 📈 Leaderboard
328
 
329
  ### Semantic Representation Evaluations in MTEB
@@ -381,8 +374,9 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
381
  | bge-large-en-v1.5 | 52.67/34.69 | 64.59/52.11 | 64.71/52.05 | **65.36/55.50** |
382
  | bge-large-zh-v1.5 | 69.81/47.38 | 79.37/62.13 | 80.11/63.95 | **81.19/68.50** |
383
  | llm-embedder | 50.85/33.26 | 63.62/51.45 | 63.54/51.32 | **64.47/54.98** |
384
- | CohereV3 | 53.10/35.39 | 65.75/52.80 | 66.29/53.31 | **66.91/56.93** |
385
- | JinaAI-Base | 50.27/32.31 | 63.97/51.10 | 64.28/51.83 | **64.82/54.98** |
 
386
  | ***bce-embedding-base_v1*** | **85.91/62.36** | **91.25/69.38** | **91.80/71.13** | ***93.46/77.02*** |
387
 
388
  ***NOTE:***
@@ -395,14 +389,12 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
395
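  - With the embedding model fixed, comparing rerankers (**row-wise comparison**), `bce-reranker-base_v1` outperforms all other reranker models, both open-source and proprietary.
396
  - ***The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` achieves SOTA performance.***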
397
 
398
- <t id="t8"></t>
399
  ## 🛠 Youdao's BCEmbedding API
400
 
401
  For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, `BCEmbedding` is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html). Here, you'll find all the necessary guidance to easily implement `BCEmbedding` across a variety of use cases, ensuring a smooth and effective integration for optimal results.
402
 
403
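  For users who prefer to call an API directly, Youdao provides a convenient `BCEmbedding` API. It is a simple and efficient way to integrate `BCEmbedding` into your project without the complexity of manual setup and system maintenance. See [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html) for detailed API documentation.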
404
 
405
- <t id="t9"></t>
406
  ## 🧲 WeChat Group
407
 
408
  Welcome to scan the QR code below and join the WeChat group.
@@ -411,7 +403,6 @@ Welcome to scan the QR code below and join the WeChat group.
411
 
412
  <img src="https://github.com/netease-youdao/BCEmbedding/Docs/assets/Wechat.jpg" width="20%" height="auto">
413
 
414
- <t id="t10"></t>
415
  ## ✏️ Citation
416
 
417
  If you use `BCEmbedding` in your research or project, please feel free to cite and star it:
@@ -427,12 +418,10 @@ If you use `BCEmbedding` in your research or project, please feel free to cite a
427
  }
428
  ```
429
 
430
- <t id="t11"></t>
431
  ## 🔐 License
432
 
433
  `BCEmbedding` is licensed under [Apache 2.0 License](https://github.com/netease-youdao/BCEmbedding/LICENSE)
434
 
435
- <t id="t12"></t>
436
  ## 🔗 Related Links
437
 
438
  [Netease Youdao - QAnything](https://github.com/netease-youdao/qanything)
 
27
  <details open="open">
28
  <summary>Click to Open Contents</summary>
29
 
30
+ - <a href="#-bilingual-and-crosslingual-superiority" target="_Self">🌐 Bilingual and Crosslingual Superiority</a>
31
+ - <a href="#-key-features" target="_Self">💡 Key Features</a>
32
+ - <a href="#-latest-updates" target="_Self">🚀 Latest Updates</a>
33
+ - <a href="#-model-list" target="_Self">🍎 Model List</a>
34
+ - <a href="#-manual" target="_Self">📖 Manual</a>
35
+ - <a href="#installation" target="_Self">Installation</a>
36
+ - <a href="#quick-start" target="_Self">Quick Start</a>
37
+ - <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
38
+ - <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
39
+ - <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
40
+ - <a href="#-leaderboard" target="_Self">📈 Leaderboard</a>
41
+ - <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>
42
+ - <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>
43
+ - <a href="#-youdaos-bcembedding-api" target="_Self">🛠 Youdao's BCEmbedding API</a>
44
+ - <a href="#-wechat-group" target="_Self">🧲 WeChat Group</a>
45
+ - <a href="#%EF%B8%8F-citation" target="_Self">✏️ Citation</a>
46
+ - <a href="#-license" target="_Self">🔐 License</a>
47
+ - <a href="#-related-links" target="_Self">🔗 Related Links</a>
48
 
49
  </details>
50
  <br>
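
The table-of-contents change above replaces the hand-written `#t1` ... `#t12` anchors with anchors derived from the heading text itself, which is why `⚙️ Evaluation` now links to `#%EF%B8%8F-evaluation`. The sketch below approximates GitHub-style heading slugs to show where those strings come from. The function name `heading_anchor` and the exact filtering rules are assumptions for illustration; the renderer that actually serves this README may differ in edge cases.

```python
import unicodedata
from urllib.parse import quote


def heading_anchor(heading: str) -> str:
    """Approximate a GitHub-style anchor for a Markdown heading (assumed rules)."""
    kept = []
    for ch in heading.strip().lower():
        # Keep letters, digits, hyphens, underscores, spaces, and combining marks.
        # Emoji (category "So") and punctuation such as apostrophes are dropped,
        # but the variation selector U+FE0F (category "Mn") survives.
        if ch.isalnum() or ch in "-_ " or unicodedata.category(ch).startswith("M"):
            kept.append(ch)
    slug = "".join(kept).replace(" ", "-")
    # Percent-encode whatever non-ASCII characters remain (UTF-8 bytes).
    return "#" + quote(slug)


if __name__ == "__main__":
    # Emoji are written as escapes so the code points are unambiguous.
    print(heading_anchor("\U0001F310 Bilingual and Crosslingual Superiority"))
    print(heading_anchor("\u2699\ufe0f Evaluation"))
    print(heading_anchor("\U0001F6E0 Youdao's BCEmbedding API"))
```

Under these rules the emoji is removed but the space after it becomes a leading hyphen, and headings whose emoji carries a U+FE0F variation selector keep that selector as the percent-encoded prefix `%EF%B8%8F`.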
 
54
  `BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)], an open-source implementation widely integrated in various Youdao products like [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
55
 
56
  Distinguished for its bilingual and crosslingual proficiency, `BCEmbedding` excels in bridging Chinese and English linguistic gaps, which achieves
57
+ - **A high performance on <a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>**;
58
+ - **A new benchmark in the realm of <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>**.
59
 
60
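  `BCEmbedding` is a library of bilingual and crosslingual semantic representation models developed by NetEase Youdao. It contains two kinds of base models, `EmbeddingModel` and `RerankerModel`. `EmbeddingModel` is dedicated to generating semantic vectors and plays a key role in semantic search and question answering, while `RerankerModel` specializes in refining semantic search results and reranking them by semantic relevance.
61
 
62
  `BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) applications and plays an especially important role in [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)]. QAnything is an open-source NetEase Youdao project that is used in many Youdao products, such as [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
63
 
64
  `BCEmbedding` is known for its strong bilingual and crosslingual capability. It removes the gap between Chinese and English in semantic retrieval, which yields:
65
+ - **Strong bilingual and crosslingual semantic representation capability [<a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>].**
66
+ - **SOTA performance in the LlamaIndex-based RAG evaluation [<a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>].**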
67
 
 
68
  ## 🌐 Bilingual and Crosslingual Superiority
69
 
70
  Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
 
75
 
76
  `EmbeddingModel` supports ***Chinese and English*** (more languages will be supported later); `RerankerModel` supports ***Chinese, English, Japanese and Korean***.
77
 
 
78
  ## 💡 Key Features
79
 
80
  - **Bilingual and Crosslingual Proficiency**: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval task, with upcoming support for additional languages.
 
93
 
94
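  - **Bilingual and Crosslingual Proficiency**: Built on the strength of Youdao's translation engine, `BCEmbedding` delivers strong Chinese-English bilingual and crosslingual semantic representation.
95
 
96
+ - **RAG-Optimized**: Tuned specifically for RAG and adaptable to most related tasks, such as **translation, summarization, and question answering**. It is also tuned for **query understanding**; see <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>.
97
 
98
  - **Efficient and Precise Retrieval**: `EmbeddingModel` uses a dual encoder for efficient semantic retrieval in the first stage. `RerankerModel` uses a cross encoder for higher-precision semantic reranking in the second stage.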
99
 
 
105
 
106
  - **Production Validation**: `BCEmbedding` has been validated in many real Youdao products.
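
The feature list above describes a two-stage design: a dual encoder retrieves candidates cheaply, then a cross encoder reranks the survivors more precisely. The following sketch shows only the shape of that pipeline. `toy_embed` and `toy_cross_score` are hypothetical stand-ins written for this illustration (a bag-of-words vector and a token-overlap score); they are not the BCEmbedding models or its API, where the second stage corresponds to the `model.rerank(query, passages)` call shown elsewhere in this README.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a dual encoder: one bag-of-words vector per text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_cross_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross encoder: scores the pair jointly."""
    q_tokens = query.lower().split()
    p_tokens = set(passage.lower().split())
    hits = sum(1 for t in q_tokens if t in p_tokens)
    return hits / len(q_tokens) if q_tokens else 0.0


def retrieve_then_rerank(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    # Stage 1: passages are embedded independently of the query, so these
    # vectors could be precomputed and stored in an index.
    passage_vecs = [toy_embed(p) for p in passages]
    query_vec = toy_embed(query)
    candidates = sorted(
        range(len(passages)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )[:top_k]
    # Stage 2: each surviving (query, passage) pair is scored together,
    # which costs more per pair but only runs on top_k candidates.
    reranked = sorted(
        candidates,
        key=lambda i: toy_cross_score(query, passages[i]),
        reverse=True,
    )
    return [passages[i] for i in reranked]
```

The design point is the cost split: stage 1 touches every passage but only with precomputable vectors, while stage 2 runs the expensive pairwise scorer on a short candidate list.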
107
 
 
108
  ## 🚀 Latest Updates
109
 
110
  - ***2024-01-03***: **Model Releases** - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
 
115
  - ***2024-01-03***: **RAG Evaluation Dataset** [[CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset)] - The RAG evaluation data based on [LlamaIndex](https://github.com/run-llama/llama_index) has been released.
116
  - ***2024-01-03***: **Crosslingual Semantic Representation Evaluation Datasets** [[Details](https://github.com/netease-youdao/BCEmbedding/BCEmbedding/evaluation/c_mteb/Retrieval.py)] - The crosslingual evaluation data based on [MTEB](https://github.com/embeddings-benchmark/mteb) has been released.
117
 
 
118
  ## 🍎 Model List
119
 
120
  | Model Name | Model Type | Languages | Parameters | Weights |
 
122
  | bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
123
  | bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
124
 
 
125
  ## 📖 Manual
126
 
127
  ### Installation
 
183
  rerank_results = model.rerank(query, passages)
184
  ```
185
 
 
186
  ## ⚙️ Evaluation
187
 
188
  ### Evaluate Semantic Representation by MTEB
 
317
 
318
  The summary of multiple domains evaluations can be seen in <a href=#1-multiple-domains-scenarios>Multiple Domains Scenarios</a>.
319
 
 
320
  ## 📈 Leaderboard
321
 
322
  ### Semantic Representation Evaluations in MTEB
 
374
  | bge-large-en-v1.5 | 52.67/34.69 | 64.59/52.11 | 64.71/52.05 | **65.36/55.50** |
375
  | bge-large-zh-v1.5 | 69.81/47.38 | 79.37/62.13 | 80.11/63.95 | **81.19/68.50** |
376
  | llm-embedder | 50.85/33.26 | 63.62/51.45 | 63.54/51.32 | **64.47/54.98** |
377
+ | CohereV3-en | 53.10/35.39 | 65.75/52.80 | 66.29/53.31 | **66.91/56.93** |
378
+ | CohereV3-multilingual | 79.80/57.22 | 86.34/66.62 | 86.76/68.56 | **88.35/73.73** |
379
+ | JinaAI-v2-Base-en | 50.27/32.31 | 63.97/51.10 | 64.28/51.83 | **64.82/54.98** |
380
  | ***bce-embedding-base_v1*** | **85.91/62.36** | **91.25/69.38** | **91.80/71.13** | ***93.46/77.02*** |
381
 
382
  ***NOTE:***
 
389
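  - With the embedding model fixed, comparing rerankers (**row-wise comparison**), `bce-reranker-base_v1` outperforms all other reranker models, both open-source and proprietary.
390
  - ***The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` achieves SOTA performance.***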
391
 
 
392
  ## 🛠 Youdao's BCEmbedding API
393
 
394
  For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, `BCEmbedding` is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html). Here, you'll find all the necessary guidance to easily implement `BCEmbedding` across a variety of use cases, ensuring a smooth and effective integration for optimal results.
395
 
396
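  For users who prefer to call an API directly, Youdao provides a convenient `BCEmbedding` API. It is a simple and efficient way to integrate `BCEmbedding` into your project without the complexity of manual setup and system maintenance. See [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html) for detailed API documentation.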
397
 
 
398
  ## 🧲 WeChat Group
399
 
400
  Welcome to scan the QR code below and join the WeChat group.
 
403
 
404
  <img src="https://github.com/netease-youdao/BCEmbedding/Docs/assets/Wechat.jpg" width="20%" height="auto">
405
 
 
406
  ## ✏️ Citation
407
 
408
  If you use `BCEmbedding` in your research or project, please feel free to cite and star it:
 
418
  }
419
  ```
420
 
 
421
  ## 🔐 License
422
 
423
  `BCEmbedding` is licensed under [Apache 2.0 License](https://github.com/netease-youdao/BCEmbedding/LICENSE)
424
 
 
425
  ## 🔗 Related Links
426
 
427
  [Netease Youdao - QAnything](https://github.com/netease-youdao/qanything)