File size: 5,773 Bytes
faa6aeb 97d37d7 faa6aeb b463025 faa6aeb 7627c1f defef1e faa6aeb 7627c1f faa6aeb 7627c1f faa6aeb e307298 97d37d7 e307298 b0c553d 97d37d7 faa6aeb e307298 faa6aeb e307298 97d37d7 e307298 faa6aeb e307298 97d37d7 faa6aeb e307298 faa6aeb e307298 faa6aeb e307298 faa6aeb e307298 faa6aeb e307298 97d37d7 faa6aeb 7627c1f faa6aeb 4a94c57 a280fde 200a637 faa6aeb 7627c1f faa6aeb b463025 4a94c57 b463025 faa6aeb 7627c1f faa6aeb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 |
---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
- colbert
base_model: camembert-base
library_name: RAGatouille
inference: false
---
# 🇫🇷 colbertv1-camembert-base-mmarcoFR
This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
## Usage
Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
### Using ColBERT-AI
First, you will need to install the following libraries:
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```
Then, you can use the model like this:
```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig
n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
indexer.index(name=index_name, collection=documents)
# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
# results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
### Using RAGatouille
First, you will need to install the following libraries:
```bash
pip install -U ragatouille
```
Then, you can use the model like this:
```python
from ragatouille import RAGPretrainedModel
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)
# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```
***
## Evaluation
The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
***
## Training
#### Data
We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
#### Implementation
The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
## Citation
```bibtex
@online{louis2023,
author = 'Antoine Louis',
title = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model for French',
publisher = 'Hugging Face',
month = 'dec',
year = '2023',
url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
}
``` |