---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
- colbert
base_model: camembert-base
library_name: RAGatouille
inference: false
---

# 🇫🇷 colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
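
For intuition, the snippet below sketches the MaxSim late-interaction scoring between one query and one passage. The random tensors and the `maxsim_score` helper are illustrative only and are not part of this model's actual inference code.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_embeddings: torch.Tensor, passage_embeddings: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score: for each query token, take its maximum
    similarity over all passage tokens, then sum over the query tokens.

    query_embeddings:   (num_query_tokens, dim) L2-normalized token embeddings
    passage_embeddings: (num_passage_tokens, dim) L2-normalized token embeddings
    """
    # (num_query_tokens, num_passage_tokens) token-level similarity matrix
    similarities = query_embeddings @ passage_embeddings.T
    # Best-matching passage token per query token, summed over the query
    return similarities.max(dim=1).values.sum()

# Illustrative example with random 128-dimensional embeddings (the model's embedding dimension)
query = F.normalize(torch.randn(32, 128), dim=-1)
passage = F.normalize(torch.randn(256, 128), dim=-1)
print(maxsim_score(query, passage))
```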

## Usage

Here are some examples of how to use the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).

### Using ColBERT-AI

First, you will need to install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

Then, you can use the model like this:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify the checkpoint again; the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```

### Using RAGatouille

First, you will need to install the following libraries:

```bash
pip install -U ragatouille
```

Then, you can use the model like this:

```python
from ragatouille import RAGPretrainedModel

index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)

# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```

***

## Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with that of a single-vector representation model fine-tuned on the same dataset. We report mean reciprocal rank (MRR) and recall at various cut-offs (R@k); a short sketch of how these metrics can be computed follows the table.

| Model                                                                                                       | Vocab. | #Param. |  Size | MRR@10 |  R@10 | R@100 (↑) | R@500 |
|:------------------------------------------------------------------------------------------------------------|:------:|--------:|------:|-------:|------:|----------:|------:|
| **colbertv1-camembert-base-mmarcoFR**                                                                       |   🇫🇷   |    110M | 443MB |  29.51 | 54.21 |     80.00 | 88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)  |   🇫🇷   |    110M | 443MB |  28.53 | 51.46 |     77.82 | 89.13 |
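
As referenced above, here is a minimal sketch of how MRR@k and R@k can be computed from ranked results. The inputs `ranked_pids` (one ranked list of passage ids per query) and `relevant_pids` (the set of relevant ids per query) are hypothetical; this is not the official evaluation script used for the numbers reported here.

```python
def mrr_at_k(ranked_pids: list[list[int]], relevant_pids: list[set[int]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage found in the top-k results."""
    total = 0.0
    for ranking, relevant in zip(ranked_pids, relevant_pids):
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_pids)

def recall_at_k(ranked_pids: list[list[int]], relevant_pids: list[set[int]], k: int) -> float:
    """Fraction of a query's relevant passages retrieved in the top-k results, averaged over queries."""
    total = 0.0
    for ranking, relevant in zip(ranked_pids, relevant_pids):
        total += len(set(ranking[:k]) & relevant) / len(relevant)
    return total / len(ranked_pids)
```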

***

## Training

#### Data

We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, 
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. 
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
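
For illustration, a minimal sketch of that sampling step is shown below. The `triples.train.small.tsv` file name and the tab-separated (query, positive, negative) layout are assumptions for the example, not necessarily the exact preprocessing used for this model.

```python
import random

def sample_triples(triples_path: str, num_samples: int, seed: int = 42) -> list[tuple[str, ...]]:
    """Randomly sample (query, positive, negative) triples from a tab-separated file
    with one triple per line. Loads the whole file into memory for simplicity."""
    random.seed(seed)
    with open(triples_path, encoding="utf-8") as f:
        lines = f.readlines()
    sampled = random.sample(lines, min(num_samples, len(lines)))
    return [tuple(line.rstrip("\n").split("\t")) for line in sampled]

# e.g. triples = sample_triples("triples.train.small.tsv", 12_800_000)
```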

#### Implementation

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax 
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832)) 
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU 
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set 
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
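
To make the objective concrete, here is a minimal sketch of that combined loss with illustrative tensor shapes; the exact score layout, weighting, and negative handling in the actual training code may differ.

```python
import torch
import torch.nn.functional as F

def colbert_training_loss(pos_scores: torch.Tensor,
                          neg_scores: torch.Tensor,
                          in_batch_scores: torch.Tensor) -> torch.Tensor:
    """Illustrative combination of the two objectives described above.

    pos_scores:      (batch_size,) MaxSim scores of each query with its positive passage
    neg_scores:      (batch_size,) MaxSim scores of each query with its hard negative
    in_batch_scores: (batch_size, batch_size) MaxSim scores of each query with every
                     positive passage in the batch; the diagonal holds the true positives
    """
    # Pairwise softmax cross-entropy over (positive, hard negative), as in ColBERTv1
    pairwise_logits = torch.stack([pos_scores, neg_scores], dim=1)        # (B, 2)
    pairwise_targets = torch.zeros(pos_scores.size(0), dtype=torch.long)  # the positive is index 0
    pairwise_loss = F.cross_entropy(pairwise_logits, pairwise_targets)

    # In-batch sampled softmax cross-entropy over all positives in the batch, as in ColBERTv2
    in_batch_targets = torch.arange(in_batch_scores.size(0))              # diagonal entries are the positives
    in_batch_loss = F.cross_entropy(in_batch_scores, in_batch_targets)

    return pairwise_loss + in_batch_loss
```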

## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model for French},
   publisher = {Hugging Face},
   month     = dec,
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```