feat: update README

README.md

@@ -1,3 +1,287 @@

Removed front matter:

---
license: cc-by-4.0
---

Added:

---
license: cc-by-nc-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Jina AI logo" width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>

<p align="center">
<b>JinaColBERT V2: your multilingual late interaction retriever!</b>
</p>

JinaColBERT V2 (`jina-colbert-v2`) is a new model based on [JinaColBERT V1](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) that expands the capabilities and performance of [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en). Like the previous release, it has Jina AI's 8192-token input context and the [improved efficiency and performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction.

This release adds new functionality and improves performance:

- Multilingual support for dozens of languages, with strong performance on major global languages.
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which let users flexibly trade efficiency against precision (see the sketch after this list).
- Superior retrieval performance compared to the English-only [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en).

JinaColBERT V2 is offered in three versions with different embedding dimensions:

- [`jinaai/jina-colbert-v2`](https://huggingface.co/jinaai/jina-colbert-v2): 128-dimensional embeddings
- [`jinaai/jina-colbert-v2-96`](https://huggingface.co/jinaai/jina-colbert-v2-96): 96-dimensional embeddings
- [`jinaai/jina-colbert-v2-64`](https://huggingface.co/jinaai/jina-colbert-v2-64): 64-dimensional embeddings
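
Matryoshka training concentrates the most useful signal in the leading components of each token embedding, so lower-dimensional vectors can in principle be obtained by truncating and re-normalizing the full 128-dimensional ones. A minimal sketch of that idea, using a random stand-in tensor rather than real model output:

```python
import torch

# Random stand-in for real token embeddings: 8 tokens, 128 dims each.
token_embeddings = torch.randn(8, 128)

# Matryoshka truncation: keep only the leading 64 dimensions ...
truncated = token_embeddings[:, :64]

# ... and re-normalize so similarity scores stay on a comparable scale.
truncated = torch.nn.functional.normalize(truncated, p=2, dim=-1)

print(truncated.shape)  # torch.Size([8, 64])
```

In practice, you can simply load the `-96` or `-64` variant listed above instead of truncating by hand.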

## Usage

### Installation

`jina-colbert-v2` is trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.

To use the model, use either the Stanford ColBERT library or the `ragatouille` package:

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```

### RAGatouille

```python
from ragatouille import RAGPretrainedModel

# Load the pretrained checkpoint from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

# Build an index over the documents, then query it.
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
```
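
RAGatouille can also score a candidate list directly, without building an index. A short sketch continuing the example above; the `rerank` method and its keyword arguments are assumptions about the RAGatouille API, so check the version you have installed:

```python
# Rerank existing candidates against a query (no index required).
# NOTE: `rerank(query=..., documents=..., k=...)` is assumed here;
# verify the signature in your installed RAGatouille version.
reranked = RAG.rerank(query="What does ColBERT do?", documents=docs, k=2)
```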

### Stanford ColBERT

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint with a default ColBERT configuration.
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

# Documents and queries are encoded with their respective encoders.
doc_vectors = ckpt.docFromText(docs, bsize=2)
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
```
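
To make the late-interaction step concrete: a query-document score is the sum, over query tokens, of each token's maximum similarity to any document token (MaxSim). A minimal sketch with random stand-in tensors in place of the `queryFromText`/`docFromText` outputs:

```python
import torch

# Random stand-ins for token-level embeddings from the model:
# one query of 8 tokens and one document of 40 tokens, 128 dims each.
query_emb = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
doc_emb = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)

# All pairwise token similarities: shape (8, 40).
sim = query_emb @ doc_emb.T

# MaxSim: each query token keeps its best-matching document token,
# and the per-token maxima are summed into one relevance score.
score = sim.max(dim=-1).values.sum()
print(score.item())
```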

## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **NDCG@10**        | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg**            | 0.531               | 0.502               | 0.496           | 0.440    |
| **nfcorpus**       | 0.346               | 0.338               | 0.337           | 0.325    |
| **fiqa**           | 0.408               | 0.368               | 0.354           | 0.236    |
| **trec-covid**     | 0.834               | 0.750               | 0.726           | 0.656    |
| **arguana**        | 0.366               | 0.494               | 0.465           | 0.315    |
| **quora**          | 0.887               | 0.823               | 0.855           | 0.789    |
| **scidocs**        | 0.186               | 0.169               | 0.154           | 0.158    |
| **scifact**        | 0.678               | 0.701               | 0.689           | 0.665    |
| **webis-touche**   | 0.274               | 0.270               | 0.260           | 0.367    |
| **dbpedia-entity** | 0.471               | 0.413               | 0.452           | 0.313    |
| **fever**          | 0.805               | 0.795               | 0.785           | 0.753    |
| **climate-fever**  | 0.239               | 0.196               | 0.176           | 0.213    |
| **hotpotqa**       | 0.766               | 0.656               | 0.675           | 0.603    |
| **nq**             | 0.640               | 0.549               | 0.524           | 0.329    |

#### MS MARCO Passage Retrieval

| **MRR@10**  | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396               | 0.390               | 0.397           | 0.187    |

### Multilingual Benchmarks

#### MIRACL

| **NDCG@10** | **jina-colbert-v2** | **mDPR (zero shot)** |
|-------------|---------------------|----------------------|
| **avg**     | 0.627               | 0.427                |
| **ar**      | 0.753               | 0.499                |
| **bn**      | 0.750               | 0.443                |
| **de**      | 0.504               | 0.490                |
| **es**      | 0.538               | 0.478                |
| **en**      | 0.570               | 0.394                |
| **fa**      | 0.563               | 0.480                |
| **fi**      | 0.740               | 0.472                |
| **fr**      | 0.541               | 0.435                |
| **hi**      | 0.600               | 0.383                |
| **id**      | 0.547               | 0.272                |
| **ja**      | 0.632               | 0.439                |
| **ko**      | 0.671               | 0.419                |
| **ru**      | 0.643               | 0.407                |
| **sw**      | 0.499               | 0.299                |
| **te**      | 0.742               | 0.356                |
| **th**      | 0.772               | 0.358                |
| **yo**      | 0.623               | 0.396                |
| **zh**      | 0.523               | 0.512                |

#### mMARCO

| **MRR@10** | **jina-colbert-v2** | **BM25** | **ColBERT-XM** |
|------------|---------------------|----------|----------------|
| **avg**    | 0.313               | 0.141    | 0.254          |
| **ar**     | 0.272               | 0.111    | 0.195          |
| **de**     | 0.331               | 0.136    | 0.270          |
| **nl**     | 0.330               | 0.140    | 0.275          |
| **es**     | 0.341               | 0.158    | 0.285          |
| **fr**     | 0.335               | 0.155    | 0.269          |
| **hi**     | 0.309               | 0.134    | 0.238          |
| **id**     | 0.319               | 0.149    | 0.263          |
| **it**     | 0.337               | 0.153    | 0.265          |
| **ja**     | 0.276               | 0.141    | 0.241          |
| **pt**     | 0.337               | 0.152    | 0.276          |
| **ru**     | 0.298               | 0.124    | 0.251          |
| **vi**     | 0.287               | 0.136    | 0.226          |
| **zh**     | 0.302               | 0.116    | 0.246          |

### Matryoshka Representation Benchmarks

#### BEIR

| **NDCG@10**    | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg**        | 0.599       | 0.591      | 0.589      |
| **nfcorpus**   | 0.346       | 0.340      | 0.347      |
| **fiqa**       | 0.408       | 0.404      | 0.404      |
| **trec-covid** | 0.834       | 0.808      | 0.805      |
| **hotpotqa**   | 0.766       | 0.764      | 0.756      |
| **nq**         | 0.640       | 0.640      | 0.635      |

#### MSMARCO

| **MRR@10**  | **dim=128** | **dim=96** | **dim=64** |
|-------------|-------------|------------|------------|
| **msmarco** | 0.396       | 0.391      | 0.388      |

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval:

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.
- [`jina-reranker-v2`](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual): multilingual reranker model.
- [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1): English multimodal (text-image) embedding model.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.