antoinelouis's picture
Update README.md
7627c1f verified
|
raw
history blame
5.77 kB
---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
- colbert
base_model: camembert-base
library_name: RAGatouille
inference: false
---
# 🇫🇷 colbertv1-camembert-base-mmarcoFR
This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
## Usage
Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
### Using ColBERT-AI
First, you will need to install the following libraries:
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```
Then, you can use the model like this:
```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig
n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
indexer.index(name=index_name, collection=documents)
# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
# results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
### Using RAGatouille
First, you will need to install the following libraries:
```bash
pip install -U ragatouille
```
Then, you can use the model like this:
```python
from ragatouille import RAGPretrainedModel
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)
# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```
***
## Evaluation
The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
***
## Training
#### Data
We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
#### Implementation
The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
## Citation
```bibtex
@online{louis2023,
author = 'Antoine Louis',
title = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model for French',
publisher = 'Hugging Face',
month = 'dec',
year = '2023',
url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
}
```