---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
inference: false
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
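
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes dot-product similarity over embeddings that are already normalized. The function names and toy vectors are illustrative and not part of the ColBERT library.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_emb: List[Vector], passage_emb: List[Vector]) -> float:
    """For each query token, take the maximum similarity over all passage
    tokens, then sum these maxima over the query tokens."""
    return sum(max(dot(q, p) for p in passage_emb) for q in query_emb)


# Toy 2-dimensional token embeddings (hypothetical values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
passage_b = [[0.5, 0.5]]

score_a = maxsim_score(query, passage_a)  # each query token finds an exact match
score_b = maxsim_score(query, passage_b)  # each query token only partially matches
```

In this toy setting `passage_a` scores higher than `passage_b` because every query token has a closely matching passage token.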

## Installation

To use this model, install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

## Usage

**Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        # ...
    ]
    indexer.index(name=index_name, collection=documents)
```

**Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # Number of results to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # No need to specify the checkpoint again: the model name is stored in the index
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
```

## Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

| Model                                                                                                                   | Vocab. | #Param. |  Size |   MRR@10 |   R@10 |   R@100(↑) |   R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR**                                                                                   | 🇫🇷     |    110M | 443MB |    29.51 |  54.21 |      80.00 |   88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)             | 🇫🇷     |    110M | 443MB |    28.53 |  51.46 |      77.82 |   89.13 |
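
For readers unfamiliar with the two metrics, the following sketch shows one common way to compute MRR@k and R@k from ranked lists. It is not the evaluation script behind the table: the function names, data layout, and toy ids are assumptions, and recall is taken here as the per-query fraction of relevant passages retrieved, averaged over queries. The table reports these values as percentages.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[int]], relevant: Dict[str, Set[int]], k: int) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k
    (a query contributes 0 if no relevant passage appears in the top k)."""
    total = 0.0
    for qid, passages in ranked.items():
        for rank, pid in enumerate(passages[:k], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked)


def recall_at_k(ranked: Dict[str, List[int]], relevant: Dict[str, Set[int]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    total = 0.0
    for qid, passages in ranked.items():
        found = len(set(passages[:k]) & relevant[qid])
        total += found / len(relevant[qid])
    return total / len(ranked)


# Toy run with two queries (hypothetical ids, not mMARCO data).
ranked = {"q1": [3, 1, 2], "q2": [5, 4, 6]}
relevant = {"q1": {1}, "q2": {6, 9}}

mrr = mrr_at_k(ranked, relevant, k=10)       # (1/2 + 1/3) / 2
recall = recall_at_k(ranked, relevant, k=2)  # (1/1 + 0/2) / 2
```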

## Training

#### Details

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla V100 GPU with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
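
The pairwise softmax cross-entropy objective can be sketched as follows. This is an illustrative pure-Python version that assumes a single positive and a single negative score per triple. The function names are hypothetical, and the training code itself would operate on batched tensors.

```python
import math
from typing import List, Tuple


def pairwise_softmax_ce(pos_score: float, neg_score: float) -> float:
    """Cross-entropy of a 2-way softmax over (positive, negative) scores,
    with the positive passage as the target class."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score


def batch_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Average loss over a batch of (positive_score, negative_score) pairs."""
    return sum(pairwise_softmax_ce(p, n) for p, n in score_pairs) / len(score_pairs)


# Toy scores (hypothetical values, not model outputs).
equal = pairwise_softmax_ce(1.0, 1.0)   # model cannot tell the passages apart
well = pairwise_softmax_ce(10.0, 2.0)   # positive ranked well above negative
badly = pairwise_softmax_ce(2.0, 10.0)  # negative ranked above positive
```

The loss shrinks toward zero as the positive score pulls ahead of the negative one, and grows roughly linearly with the margin when the ordering is wrong.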

#### Data

The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of the MS MARCO dataset, which comprises:
- a corpus of 8.8M passages;
- a training set of ~533k unique queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) dataset, which contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.

## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```