feat: update README

README.md

@@ -1,3 +1,287 @@

Removed front matter:

---
license: cc-by-4.0
---

Added:

---
license: cc-by-nc-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Jina AI logo" width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>

<p align="center">
<b>JinaColBERT V2: your multilingual late interaction retriever!</b>
</p>

JinaColBERT V2 (`jina-colbert-v2`) is a new model based on [JinaColBERT V1](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) that expands the capabilities and performance of [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en). Like the previous release, it has Jina AI's 8192-token input context and the [improved efficiency and performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction.

This release adds new functionality and improves performance:

- Multilingual support for dozens of languages, with strong performance on major global languages.
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which let users flexibly trade efficiency against precision (see the sketch after this list).
- Superior retrieval performance compared to the English-only [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en).

JinaColBERT V2 is offered in three versions with different embedding dimensions:

- [`jinaai/jina-colbert-v2`](https://huggingface.co/jinaai/jina-colbert-v2): 128-dimensional embeddings
- [`jinaai/jina-colbert-v2-96`](https://huggingface.co/jinaai/jina-colbert-v2-96): 96-dimensional embeddings
- [`jinaai/jina-colbert-v2-64`](https://huggingface.co/jinaai/jina-colbert-v2-64): 64-dimensional embeddings
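
Matryoshka training concentrates the most useful signal in the leading components of each token embedding, so lower-dimensional vectors can in principle be obtained by truncating and re-normalizing the full 128-dimensional ones. A minimal sketch of that idea, using a random stand-in tensor rather than real model output:

```python
import torch

# Random stand-in for real token embeddings: 8 tokens, 128 dims each.
token_embeddings = torch.randn(8, 128)

# Matryoshka truncation: keep only the leading 64 dimensions ...
truncated = token_embeddings[:, :64]

# ... and re-normalize so similarity scores stay on a comparable scale.
truncated = torch.nn.functional.normalize(truncated, p=2, dim=-1)

print(truncated.shape)  # torch.Size([8, 64])
```

In practice, you can simply load the `-96` or `-64` variant listed above instead of truncating by hand.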

## Usage

### Installation

`jina-colbert-v2` is trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.

To use the model, use either the Stanford ColBERT library or the `ragatouille` package:

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```

### RAGatouille

```python
from ragatouille import RAGPretrainedModel

# Load the pretrained checkpoint from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

# Build an index over the documents, then query it.
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
```
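
RAGatouille can also score a candidate list directly, without building an index. A short sketch continuing the example above; the `rerank` method and its keyword arguments are assumptions about the RAGatouille API, so check the version you have installed:

```python
# Rerank existing candidates against a query (no index required).
# NOTE: `rerank(query=..., documents=..., k=...)` is assumed here;
# verify the signature in your installed RAGatouille version.
reranked = RAG.rerank(query="What does ColBERT do?", documents=docs, k=2)
```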

### Stanford ColBERT

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint with a default ColBERT configuration.
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

# Documents and queries are encoded with their respective encoders.
doc_vectors = ckpt.docFromText(docs, bsize=2)
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
```
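
To make the late-interaction step concrete: a query-document score is the sum, over query tokens, of each token's maximum similarity to any document token (MaxSim). A minimal sketch with random stand-in tensors in place of the `queryFromText`/`docFromText` outputs:

```python
import torch

# Random stand-ins for token-level embeddings from the model:
# one query of 8 tokens and one document of 40 tokens, 128 dims each.
query_emb = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
doc_emb = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)

# All pairwise token similarities: shape (8, 40).
sim = query_emb @ doc_emb.T

# MaxSim: each query token keeps its best-matching document token,
# and the per-token maxima are summed into one relevance score.
score = sim.max(dim=-1).values.sum()
print(score.item())
```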

## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **NDCG@10**        | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg**            | 0.531               | 0.502               | 0.496           | 0.440    |
| **nfcorpus**       | 0.346               | 0.338               | 0.337           | 0.325    |
| **fiqa**           | 0.408               | 0.368               | 0.354           | 0.236    |
| **trec-covid**     | 0.834               | 0.750               | 0.726           | 0.656    |
| **arguana**        | 0.366               | 0.494               | 0.465           | 0.315    |
| **quora**          | 0.887               | 0.823               | 0.855           | 0.789    |
| **scidocs**        | 0.186               | 0.169               | 0.154           | 0.158    |
| **scifact**        | 0.678               | 0.701               | 0.689           | 0.665    |
| **webis-touche**   | 0.274               | 0.270               | 0.260           | 0.367    |
| **dbpedia-entity** | 0.471               | 0.413               | 0.452           | 0.313    |
| **fever**          | 0.805               | 0.795               | 0.785           | 0.753    |
| **climate-fever**  | 0.239               | 0.196               | 0.176           | 0.213    |
| **hotpotqa**       | 0.766               | 0.656               | 0.675           | 0.603    |
| **nq**             | 0.640               | 0.549               | 0.524           | 0.329    |

#### MS MARCO Passage Retrieval

| **MRR@10**  | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396               | 0.390               | 0.397           | 0.187    |

### Multilingual Benchmarks

#### MIRACL

| **NDCG@10** | **jina-colbert-v2** | **mDPR (zero shot)** |
|-------------|---------------------|----------------------|
| **avg**     | 0.627               | 0.427                |
| **ar**      | 0.753               | 0.499                |
| **bn**      | 0.750               | 0.443                |
| **de**      | 0.504               | 0.490                |
| **es**      | 0.538               | 0.478                |
| **en**      | 0.570               | 0.394                |
| **fa**      | 0.563               | 0.480                |
| **fi**      | 0.740               | 0.472                |
| **fr**      | 0.541               | 0.435                |
| **hi**      | 0.600               | 0.383                |
| **id**      | 0.547               | 0.272                |
| **ja**      | 0.632               | 0.439                |
| **ko**      | 0.671               | 0.419                |
| **ru**      | 0.643               | 0.407                |
| **sw**      | 0.499               | 0.299                |
| **te**      | 0.742               | 0.356                |
| **th**      | 0.772               | 0.358                |
| **yo**      | 0.623               | 0.396                |
| **zh**      | 0.523               | 0.512                |

#### mMARCO

| **MRR@10** | **jina-colbert-v2** | **BM25** | **ColBERT-XM** |
|------------|---------------------|----------|----------------|
| **avg**    | 0.313               | 0.141    | 0.254          |
| **ar**     | 0.272               | 0.111    | 0.195          |
| **de**     | 0.331               | 0.136    | 0.270          |
| **nl**     | 0.330               | 0.140    | 0.275          |
| **es**     | 0.341               | 0.158    | 0.285          |
| **fr**     | 0.335               | 0.155    | 0.269          |
| **hi**     | 0.309               | 0.134    | 0.238          |
| **id**     | 0.319               | 0.149    | 0.263          |
| **it**     | 0.337               | 0.153    | 0.265          |
| **ja**     | 0.276               | 0.141    | 0.241          |
| **pt**     | 0.337               | 0.152    | 0.276          |
| **ru**     | 0.298               | 0.124    | 0.251          |
| **vi**     | 0.287               | 0.136    | 0.226          |
| **zh**     | 0.302               | 0.116    | 0.246          |

### Matryoshka Representation Benchmarks

#### BEIR

| **NDCG@10**    | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg**        | 0.599       | 0.591      | 0.589      |
| **nfcorpus**   | 0.346       | 0.340      | 0.347      |
| **fiqa**       | 0.408       | 0.404      | 0.404      |
| **trec-covid** | 0.834       | 0.808      | 0.805      |
| **hotpotqa**   | 0.766       | 0.764      | 0.756      |
| **nq**         | 0.640       | 0.640      | 0.635      |

#### MSMARCO

| **MRR@10**  | **dim=128** | **dim=96** | **dim=64** |
|-------------|-------------|------------|------------|
| **msmarco** | 0.396       | 0.391      | 0.388      |

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval:

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.
- [`jina-reranker-v2`](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual): multilingual reranker model.
- [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1): English multimodal (text-image) embedding model.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.