metadata

pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - feature-extraction
  - sentence-similarity
library_name: colbert
inference: false

colbertv1-camembert-base-mmarcoFR

This is a ColBERTv1 model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match a query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the French portion of the mMARCO dataset.
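To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes cosine similarity between L2-normalized token embeddings; the function names and the toy 2-dimensional vectors are illustrative only and are not taken from the ColBERT library or from this model's outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_embs, passage_embs):
    """Late-interaction score: for each query token, keep the similarity of its
    best-matching passage token, then sum these maxima over the query tokens."""
    score = 0.0
    for q in query_embs:
        score += max(sum(a * b for a, b in zip(q, p)) for p in passage_embs)
    return score

# Toy token embeddings (hypothetical values, not produced by the model).
query = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
passage_a = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]]
passage_b = [l2_normalize(v) for v in [[1.0, 1.0]]]

score_a = maxsim_score(query, passage_a)  # every query token has an exact match
score_b = maxsim_score(query, passage_b)  # every query token only partially matches
```

In this toy case `passage_a` scores higher than `passage_b`, because each query token finds a closer counterpart among its token embeddings.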

Installation

To use this model, you will need to install the following libraries:

pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2

Usage

Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!

from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        # ... (remaining passages of your collection)
    ]
    indexer.index(name=index_name, collection=documents)

Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
k: int = 10 # Number of results to retrieve per query

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # No need to specify the checkpoint again: the model name is stored in the index.
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
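The retrieved ids still have to be mapped back to passage text. The sketch below shows one way to do that, under two assumptions that may not hold for every ColBERT version: that `searcher.search` returns three parallel sequences (passage ids, ranks, scores), and that passage ids are positions in the `documents` list used at indexing time. The helper `attach_passages` and the mock result values are hypothetical.

```python
def attach_passages(results, documents):
    """Pair each retrieved passage id with its rank, score, and text.

    Assumes `results` unpacks into three parallel sequences
    (passage_ids, ranks, scores) and that ids index into `documents`.
    """
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "score": score, "pid": pid, "text": documents[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]

documents = ["Ceci est un premier document.", "Voici un second document."]
# Mock search output standing in for searcher.search(query, k=2).
mock_results = ([1, 0], [1, 2], [21.5, 17.25])
hits = attach_passages(mock_results, documents)
for hit in hits:
    print(hit["rank"], round(hit["score"], 2), hit["text"])
```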

Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with that of a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

model	Vocab.	#Param.	Size	MRR@10	R@10	R@100(↑)	R@500
colbertv1-camembert-base-mmarcoFR	🇫🇷	110M	443MB	29.51	54.21	80.00	88.40
biencoder-camembert-base-mmarcoFR	🇫🇷	110M	443MB	28.53	51.46	77.82	89.13
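For readers unfamiliar with the reported metrics, the following sketch shows how MRR@k and R@k are commonly computed from ranked passage ids and relevance labels. It is not the evaluation script used for the table above; the function names, the `runs` and `qrels` dictionaries, and their contents are made up for illustration. The table reports these values as percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical run: ranked passage ids per query and their relevant passages.
runs = {"q1": [7, 3, 9, 1], "q2": [4, 8, 2, 6]}
qrels = {"q1": {3}, "q2": {6, 5}}

# Metrics are averaged over all queries.
mean_mrr = sum(mrr_at_k(runs[q], qrels[q], k=10) for q in runs) / len(runs)
mean_recall = sum(recall_at_k(runs[q], qrels[q], k=3) for q in runs) / len(runs)
```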

Training

Details

The model is initialized from the camembert-base checkpoint and fine-tuned on 12.8M triples via a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla V100 GPU with 32 GB of memory for 200k steps, using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
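The pairwise softmax cross-entropy objective can be sketched as follows. This is a plain-Python illustration of the loss for a single (query, positive, negative) triple and its batch average, not the actual training code; the function names and score values are hypothetical.

```python
import math

def pairwise_softmax_ce(pos_score, neg_score):
    """Cross-entropy of a 2-way softmax over [pos_score, neg_score],
    with the positive passage as the target class."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_norm = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_norm - pos_score

def batch_loss(score_pairs):
    """Average the pairwise loss over a batch of (pos_score, neg_score) pairs."""
    return sum(pairwise_softmax_ce(p, n) for p, n in score_pairs) / len(score_pairs)

# Hypothetical MaxSim scores for three triples.
loss_equal = pairwise_softmax_ce(5.0, 5.0)   # model cannot tell the passages apart
loss_good = pairwise_softmax_ce(10.0, 2.0)   # positive clearly ranked higher
loss_bad = pairwise_softmax_ce(2.0, 10.0)    # negative wrongly ranked higher
mean_loss = batch_loss([(5.0, 5.0), (10.0, 2.0), (2.0, 10.0)])
```

The loss approaches zero as the positive score exceeds the negative score by a growing margin, and it grows roughly linearly with the margin when the order is reversed.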

Data

The model is fine-tuned on the French version of the mMARCO dataset, a multilingual machine-translated version of the MS MARCO dataset, which comprises:

a corpus of 8.8M passages;
a training set of ~533k unique queries (with at least one relevant passage);
a development set of ~101k queries;
a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

The triples are sampled from the ~39.8M triples of triples.train.small.tsv. In the future, better negatives could be selected by exploiting the msmarco-hard-negatives dataset, which contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.

Citation

@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}