Commit 0ca1bc5 by maidalun1020 • 1 parent: a1140bc
Update README.md
README.md
CHANGED
@@ -27,24 +27,24 @@ license: apache-2.0
|
|
27 |
<details open="open">
|
28 |
<summary>Click to Open Contents</summary>
|
29 |
|
30 |
-
- <a href="
|
31 |
-
- <a href="
|
32 |
-
- <a href="
|
33 |
-
- <a href="
|
34 |
-
- <a href="
|
35 |
-
- <a href="#installation">Installation</a>
|
36 |
-
- <a href="#quick-start">Quick Start</a>
|
37 |
-
- <a href="
|
38 |
-
- <a href="#evaluate-semantic-representation-by-mteb">Evaluate Semantic Representation by MTEB</a>
|
39 |
-
- <a href="#evaluate-rag-by-llamaindex">Evaluate RAG by LlamaIndex</a>
|
40 |
-
- <a href="
|
41 |
-
- <a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>
|
42 |
-
- <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>
|
43 |
-
- <a href="
|
44 |
-
- <a href="
|
45 |
-
- <a href="
|
46 |
-
- <a href="
|
47 |
-
- <a href="
|
48 |
|
49 |
</details>
|
50 |
<br>
|
@@ -54,18 +54,17 @@ license: apache-2.0
|
|
54 |
`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)], an open-source implementation widely integrated in various Youdao products like [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
|
55 |
|
56 |
Distinguished by its bilingual and crosslingual proficiency, `BCEmbedding` excels at bridging Chinese and English linguistic gaps, achieving:
|
57 |
-
- **High performance on <a href
|
58 |
-
- **A new benchmark in the realm of <a href
|
59 |
|
60 |
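`BCEmbedding` is a library of bilingual and crosslingual semantic representation models developed by NetEase Youdao. It includes two kinds of base models, `EmbeddingModel` and `RerankerModel`. `EmbeddingModel` is dedicated to generating semantic vectors and plays a key role in semantic search and question answering, while `RerankerModel` excels at refining semantic search results and reranking them by semantic relevance.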
|
61 |
|
62 |
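`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) applications and plays an especially important role in [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)]. QAnything is an open-source NetEase Youdao project that is well established in many Youdao products, such as [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).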
|
63 |
|
64 |
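`BCEmbedding` is known for its strong bilingual and crosslingual capability. It bridges the gap between Chinese and English in semantic retrieval, thereby achieving: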
|
65 |
-
- **Strong bilingual and crosslingual semantic representation capability [<a href
|
66 |
-
- **SOTA performance in the LlamaIndex-based RAG evaluation [<a href
|
67 |
|
68 |
-
<t id="t1"></t>
|
69 |
## 🌐 Bilingual and Crosslingual Superiority
|
70 |
|
71 |
Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
|
@@ -76,7 +75,6 @@ Existing embedding models often encounter performance challenges in bilingual an
|
|
76 |
|
77 |
`EmbeddingModel` supports ***Chinese and English*** (more languages will be supported later); `RerankerModel` supports ***Chinese, English, Japanese and Korean***.
|
78 |
|
79 |
-
<t id="t2"></t>
|
80 |
## 💡 Key Features
|
81 |
|
82 |
- **Bilingual and Crosslingual Proficiency**: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.
|
@@ -95,7 +93,7 @@ Existing embedding models often encounter performance challenges in bilingual an
|
|
95 |
|
96 |
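- **Bilingual and Crosslingual Proficiency**: Built on the strength of Youdao's translation engine, `BCEmbedding` delivers strong Chinese-English bilingual and crosslingual semantic representation.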
|
97 |
|
98 |
-
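- **RAG-Optimized**: Specifically optimized for RAG and adaptable to most related tasks, such as **translation, summarization, and question answering**. **Query understanding** has also been specifically optimized; see <a href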
|
99 |
|
100 |
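- **Efficient and Precise Retrieval**: `EmbeddingModel` uses a dual encoder for efficient semantic retrieval in the first stage. `RerankerModel` uses a cross-encoder for higher-precision semantic reranking in the second stage.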
|
101 |
|
@@ -107,7 +105,6 @@ Existing embedding models often encounter performance challenges in bilingual an
|
|
107 |
|
108 |
- **Production Proven**: `BCEmbedding` has been validated in many real Youdao products.
|
109 |
|
110 |
-
<t id="t3"></t>
|
111 |
## 🚀 Latest Updates
|
112 |
|
113 |
- ***2024-01-03***: **Model Releases** - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
|
@@ -118,7 +115,6 @@ Existing embedding models often encounter performance challenges in bilingual an
|
|
118 |
- ***2024-01-03***: **RAG Evaluation Data** [[CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset)] - The RAG evaluation data based on [LlamaIndex](https://github.com/run-llama/llama_index) has been released.
|
119 |
- ***2024-01-03***: **Crosslingual Semantic Representation Evaluation Data** [[Details](https://github.com/netease-youdao/BCEmbedding/BCEmbedding/evaluation/c_mteb/Retrieval.py)] - The crosslingual evaluation data based on [MTEB](https://github.com/embeddings-benchmark/mteb) has been released.
|
120 |
|
121 |
-
<t id="t4"></t>
|
122 |
## 🍎 Model List
|
123 |
|
124 |
| Model Name | Model Type | Languages | Parameters | Weights |
|
@@ -126,7 +122,6 @@ Existing embedding models often encounter performance challenges in bilingual an
|
|
126 |
| bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
|
127 |
| bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
|
128 |
|
129 |
-
<t id="t5"></t>
|
130 |
## 📖 Manual
|
131 |
|
132 |
### Installation
|
@@ -188,7 +183,6 @@ scores = model.compute_score(sentence_pairs)
|
|
188 |
rerank_results = model.rerank(query, passages)
|
189 |
```
|
190 |
|
191 |
-
<t id="t6"></t>
|
192 |
## ⚙️ Evaluation
|
193 |
|
194 |
### Evaluate Semantic Representation by MTEB
|
@@ -323,7 +317,6 @@ python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir result
|
|
323 |
|
324 |
The summary of multiple domains evaluations can be seen in <a href=#1-multiple-domains-scenarios>Multiple Domains Scenarios</a>.
|
325 |
|
326 |
-
<t id="t7"></t>
|
327 |
## 📈 Leaderboard
|
328 |
|
329 |
### Semantic Representation Evaluations in MTEB
|
@@ -381,8 +374,9 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
|
|
381 |
| bge-large-en-v1.5 | 52.67/34.69 | 64.59/52.11 | 64.71/52.05 | **65.36/55.50** |
|
382 |
| bge-large-zh-v1.5 | 69.81/47.38 | 79.37/62.13 | 80.11/63.95 | **81.19/68.50** |
|
383 |
| llm-embedder | 50.85/33.26 | 63.62/51.45 | 63.54/51.32 | **64.47/54.98** |
|
384 |
-
| CohereV3 | 53.10/35.39 | 65.75/52.80 | 66.29/53.31 | **66.91/56.93** |
|
385 |
-
|
|
|
|
386 |
| ***bce-embedding-base_v1*** | **85.91/62.36** | **91.25/69.38** | **91.80/71.13** | ***93.46/77.02*** |
|
387 |
|
388 |
***NOTE:***
|
@@ -395,14 +389,12 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
|
|
395 |
- With the embedding model fixed, comparing different rerankers (**compare by row**), `bce-reranker-base_v1` outperforms all other reranker models, both open-source and proprietary.
|
396 |
- ***The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` achieves SOTA performance.***
|
397 |
|
398 |
-
<t id="t8"></t>
|
399 |
## 🛠 Youdao's BCEmbedding API
|
400 |
|
401 |
For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, `BCEmbedding` is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html). Here, you'll find all the necessary guidance to easily implement `BCEmbedding` across a variety of use cases, ensuring a smooth and effective integration for optimal results.
|
402 |
|
403 |
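For users who prefer to call an API directly, Youdao provides a convenient `BCEmbedding` API. It is a simple and efficient way to integrate `BCEmbedding` into your projects while avoiding the complexity of manual setup and system maintenance. See [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html) for detailed API documentation.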
|
404 |
|
405 |
-
<t id="t9"></t>
|
406 |
## 🧲 WeChat Group
|
407 |
|
408 |
Scan the QR code below to join the WeChat group.
|
@@ -411,7 +403,6 @@ Welcome to scan the QR code below and join the WeChat group.
|
|
411 |
|
412 |
<img src="https://github.com/netease-youdao/BCEmbedding/Docs/assets/Wechat.jpg" width="20%" height="auto">
|
413 |
|
414 |
-
<t id="t10"></t>
|
415 |
## ✏️ Citation
|
416 |
|
417 |
If you use `BCEmbedding` in your research or project, please feel free to cite and star it:
|
@@ -427,12 +418,10 @@ If you use `BCEmbedding` in your research or project, please feel free to cite a
|
|
427 |
}
|
428 |
```
|
429 |
|
430 |
-
<t id="t11"></t>
|
431 |
## 🔐 License
|
432 |
|
433 |
`BCEmbedding` is licensed under the [Apache 2.0 License](https://github.com/netease-youdao/BCEmbedding/LICENSE).
|
434 |
|
435 |
-
<t id="t12"></t>
|
436 |
## 🔗 Related Links
|
437 |
|
438 |
[Netease Youdao - QAnything](https://github.com/netease-youdao/qanything)
|
|
|
27 |
<details open="open">
|
28 |
<summary>Click to Open Contents</summary>
|
29 |
|
30 |
+
- <a href="#-bilingual-and-crosslingual-superiority" target="_Self">🌐 Bilingual and Crosslingual Superiority</a>
|
31 |
+
- <a href="#-key-features" target="_Self">💡 Key Features</a>
|
32 |
+
- <a href="#-latest-updates" target="_Self">🚀 Latest Updates</a>
|
33 |
+
- <a href="#-model-list" target="_Self">🍎 Model List</a>
|
34 |
+
- <a href="#-manual" target="_Self">📖 Manual</a>
|
35 |
+
- <a href="#installation" target="_Self">Installation</a>
|
36 |
+
- <a href="#quick-start" target="_Self">Quick Start</a>
|
37 |
+
- <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
|
38 |
+
- <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
|
39 |
+
- <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
|
40 |
+
- <a href="#-leaderboard" target="_Self">📈 Leaderboard</a>
|
41 |
+
- <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>
|
42 |
+
- <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>
|
43 |
+
- <a href="#-youdaos-bcembedding-api" target="_Self">🛠 Youdao's BCEmbedding API</a>
|
44 |
+
- <a href="#-wechat-group" target="_Self">🧲 WeChat Group</a>
|
45 |
+
- <a href="#%EF%B8%8F-citation" target="_Self">✏️ Citation</a>
|
46 |
+
- <a href="#-license" target="_Self">🔐 License</a>
|
47 |
+
- <a href="#-related-links" target="_Self">🔗 Related Links</a>
|
48 |
|
49 |
</details>
|
50 |
<br>
|
|
|
54 |
`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)], an open-source implementation widely integrated in various Youdao products like [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).
|
55 |
|
56 |
Distinguished by its bilingual and crosslingual proficiency, `BCEmbedding` excels at bridging Chinese and English linguistic gaps, achieving:
|
57 |
+
- **High performance on <a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>**;
|
58 |
+
- **A new benchmark in the realm of <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>**.
|
59 |
|
60 |
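`BCEmbedding` is a library of bilingual and crosslingual semantic representation models developed by NetEase Youdao. It includes two kinds of base models, `EmbeddingModel` and `RerankerModel`. `EmbeddingModel` is dedicated to generating semantic vectors and plays a key role in semantic search and question answering, while `RerankerModel` excels at refining semantic search results and reranking them by semantic relevance.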
|
61 |
|
62 |
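`BCEmbedding` serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) applications and plays an especially important role in [QAnything](http://qanything.ai) [[github](https://github.com/netease-youdao/qanything)]. QAnything is an open-source NetEase Youdao project that is well established in many Youdao products, such as [Youdao Speed Reading](https://read.youdao.com/#/home) and [Youdao Translation](https://fanyi.youdao.com/download-Mac?keyfrom=fanyiweb_navigation).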
|
63 |
|
64 |
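`BCEmbedding` is known for its strong bilingual and crosslingual capability. It bridges the gap between Chinese and English in semantic retrieval, thereby achieving: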
|
65 |
+
- **Strong bilingual and crosslingual semantic representation capability [<a href="#semantic-representation-evaluations-in-mteb">Semantic Representation Evaluations in MTEB</a>].**
|
66 |
+
- **SOTA performance in the LlamaIndex-based RAG evaluation [<a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>].**
|
67 |
|
|
|
68 |
## 🌐 Bilingual and Crosslingual Superiority
|
69 |
|
70 |
Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. `BCEmbedding`, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
|
|
|
75 |
|
76 |
`EmbeddingModel` supports ***Chinese and English*** (more languages will be supported later); `RerankerModel` supports ***Chinese, English, Japanese and Korean***.
|
77 |
|
|
|
78 |
## 💡 Key Features
|
79 |
|
80 |
- **Bilingual and Crosslingual Proficiency**: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.
|
|
|
93 |
|
94 |
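- **Bilingual and Crosslingual Proficiency**: Built on the strength of Youdao's translation engine, `BCEmbedding` delivers strong Chinese-English bilingual and crosslingual semantic representation.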
|
95 |
|
96 |
+
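- **RAG-Optimized**: Specifically optimized for RAG and adaptable to most related tasks, such as **translation, summarization, and question answering**. **Query understanding** has also been specifically optimized; see <a href="#rag-evaluations-in-llamaindex">RAG Evaluations in LlamaIndex</a>.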
|
97 |
|
98 |
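- **Efficient and Precise Retrieval**: `EmbeddingModel` uses a dual encoder for efficient semantic retrieval in the first stage. `RerankerModel` uses a cross-encoder for higher-precision semantic reranking in the second stage.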
|
99 |
|
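The two-stage design named in the feature list (dual-encoder retrieval first, cross-encoder reranking second) can be sketched with toy scoring functions. This is only an illustration of the control flow, not the `BCEmbedding` API: `embed`, `cosine`, `retrieve`, `pair_score`, and `rerank` below are hypothetical stand-ins, and the bag-of-words and bigram scores are assumptions chosen so the example runs without any model.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy stand-in for a dual encoder: each text is encoded on its own."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k):
    """Stage 1: cheap vector similarity over all passages, keep the top_k."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def pair_score(query, passage):
    """Toy stand-in for a cross-encoder: scores the (query, passage) pair jointly."""
    q_tokens = query.lower().split()
    p_tokens = passage.lower().split()
    shared_bigrams = set(zip(q_tokens, q_tokens[1:])) & set(zip(p_tokens, p_tokens[1:]))
    shared_tokens = set(q_tokens) & set(p_tokens)
    # Word order matters here, which the bag-of-words stage cannot see.
    return len(shared_bigrams) + 0.1 * len(shared_tokens)


def rerank(query, candidates):
    """Stage 2: more expensive pairwise scoring, applied only to the candidates."""
    return sorted(candidates, key=lambda p: pair_score(query, p), reverse=True)


PASSAGES = [
    "english translation machine for text",
    "weather report for today",
    "machine translation of english text",
    "cooking recipes",
]

candidates = retrieve("machine translation", PASSAGES, top_k=2)
print(rerank("machine translation", candidates))
```

The point of the split is cost: stage 1 can score every passage because passage vectors do not depend on the query, while stage 2 runs the pairwise scorer only on the short candidate list.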
|
|
105 |
|
106 |
- **Production Proven**: `BCEmbedding` has been validated in many real Youdao products.
|
107 |
|
|
|
108 |
## 🚀 Latest Updates
|
109 |
|
110 |
- ***2024-01-03***: **Model Releases** - [bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1) and [bce-reranker-base_v1](https://huggingface.co/maidalun1020/bce-reranker-base_v1) are available.
|
|
|
115 |
- ***2024-01-03***: **RAG Evaluation Data** [[CrosslingualMultiDomainsDataset](https://huggingface.co/datasets/maidalun1020/CrosslingualMultiDomainsDataset)] - The RAG evaluation data based on [LlamaIndex](https://github.com/run-llama/llama_index) has been released.
|
116 |
- ***2024-01-03***: **Crosslingual Semantic Representation Evaluation Data** [[Details](https://github.com/netease-youdao/BCEmbedding/BCEmbedding/evaluation/c_mteb/Retrieval.py)] - The crosslingual evaluation data based on [MTEB](https://github.com/embeddings-benchmark/mteb) has been released.
|
117 |
|
|
|
118 |
## 🍎 Model List
|
119 |
|
120 |
| Model Name | Model Type | Languages | Parameters | Weights |
|
|
|
122 |
| bce-embedding-base_v1 | `EmbeddingModel` | ch, en | 279M | [download](https://huggingface.co/maidalun1020/bce-embedding-base_v1) |
|
123 |
| bce-reranker-base_v1 | `RerankerModel` | ch, en, ja, ko | 279M | [download](https://huggingface.co/maidalun1020/bce-reranker-base_v1) |
|
124 |
|
|
|
125 |
## 📖 Manual
|
126 |
|
127 |
### Installation
|
|
|
183 |
rerank_results = model.rerank(query, passages)
|
184 |
```
|
185 |
|
|
|
186 |
## ⚙️ Evaluation
|
187 |
|
188 |
### Evaluate Semantic Representation by MTEB
|
|
|
317 |
|
318 |
The summary of multiple domains evaluations can be seen in <a href=#1-multiple-domains-scenarios>Multiple Domains Scenarios</a>.
|
319 |
|
|
|
320 |
## 📈 Leaderboard
|
321 |
|
322 |
### Semantic Representation Evaluations in MTEB
|
|
|
374 |
| bge-large-en-v1.5 | 52.67/34.69 | 64.59/52.11 | 64.71/52.05 | **65.36/55.50** |
|
375 |
| bge-large-zh-v1.5 | 69.81/47.38 | 79.37/62.13 | 80.11/63.95 | **81.19/68.50** |
|
376 |
| llm-embedder | 50.85/33.26 | 63.62/51.45 | 63.54/51.32 | **64.47/54.98** |
|
377 |
+
| CohereV3-en | 53.10/35.39 | 65.75/52.80 | 66.29/53.31 | **66.91/56.93** |
|
378 |
+
| CohereV3-multilingual | 79.80/57.22 | 86.34/66.62 | 86.76/68.56 | **88.35/73.73** |
|
379 |
+
| JinaAI-v2-Base-en | 50.27/32.31 | 63.97/51.10 | 64.28/51.83 | **64.82/54.98** |
|
380 |
| ***bce-embedding-base_v1*** | **85.91/62.36** | **91.25/69.38** | **91.80/71.13** | ***93.46/77.02*** |
|
381 |
|
382 |
***NOTE:***
|
|
|
389 |
- With the embedding model fixed, comparing different rerankers (**compare by row**), `bce-reranker-base_v1` outperforms all other reranker models, both open-source and proprietary.
|
390 |
- ***The combination of `bce-embedding-base_v1` and `bce-reranker-base_v1` achieves SOTA performance.***
|
391 |
|
|
|
392 |
## 🛠 Youdao's BCEmbedding API
|
393 |
|
394 |
For users who prefer a hassle-free experience without the need to download and configure the model on their own systems, `BCEmbedding` is readily accessible through Youdao's API. This option offers a streamlined and efficient way to integrate BCEmbedding into your projects, bypassing the complexities of manual setup and maintenance. Detailed instructions and comprehensive API documentation are available at [Youdao BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html). Here, you'll find all the necessary guidance to easily implement `BCEmbedding` across a variety of use cases, ensuring a smooth and effective integration for optimal results.
|
395 |
|
396 |
对于那些更喜欢直接调用api的用户,有道提供方便的`BCEmbedding`调用api。该方式是一种简化和高效的方式,将`BCEmbedding`集成到您的项目中,避开了手动设置和系统维护的复杂性。更详细的api调用接口说明详见[有道BCEmbedding API](https://ai.youdao.com/DOCSIRMA/html/aigc/api/embedding/index.html)。
|
397 |
|
|
|
398 |
## 🧲 WeChat Group
|
399 |
|
400 |
Scan the QR code below to join the WeChat group.
|
|
|
403 |
|
404 |
<img src="https://github.com/netease-youdao/BCEmbedding/Docs/assets/Wechat.jpg" width="20%" height="auto">
|
405 |
|
|
|
406 |
## ✏️ Citation
|
407 |
|
408 |
If you use `BCEmbedding` in your research or project, please feel free to cite and star it:
|
|
|
418 |
}
|
419 |
```
|
420 |
|
|
|
421 |
## 🔐 License
|
422 |
|
423 |
`BCEmbedding` is licensed under the [Apache 2.0 License](https://github.com/netease-youdao/BCEmbedding/LICENSE).
|
424 |
|
|
|
425 |
## 🔗 Related Links
|
426 |
|
427 |
[Netease Youdao - QAnything](https://github.com/netease-youdao/qanything)
|