---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
inference: false
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently retrieves passages that contextually match the query using the scalable MaxSim vector-similarity operator. It can be used for tasks like semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
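For intuition, the MaxSim score between a query and a passage reduces to a few lines of tensor code. Below is a minimal, illustrative sketch (not the library's optimized implementation), where `Q` and `D` stand for the L2-normalized token-embedding matrices produced by the encoder:

```python
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    Q: (num_query_tokens, dim) L2-normalized query token embeddings.
    D: (num_doc_tokens, dim)   L2-normalized passage token embeddings.
    """
    sim = Q @ D.T                       # cosine similarity of every query/passage token pair
    return sim.max(dim=1).values.sum()  # best passage match per query token, summed
```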

## Installation

To use this model, you will need to install the following libraries:
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```


## Usage

**Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

**Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices are stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # No need to specify the checkpoint again: the model name is stored in the index
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
```
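The returned ids can then be mapped back to the passages of the indexed collection. A small usage sketch (assuming the `documents` list from Step 1; ColBERT passage ids are integer offsets into the collection):

```python
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"[{passage_rank}] pid={passage_id} score={passage_score:.2f} | {documents[passage_id]}")
```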

## Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

| Model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100 | R@500 |
|:------|:------:|--------:|-----:|-------:|-----:|------:|------:|
| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
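For reference, these metrics can be computed from a ranked run in a few lines. A minimal sketch under assumed data structures (`run[qid]` is the ranked list of retrieved passage ids for a query, `qrels[qid]` the set of its relevant passage ids; both names are illustrative, not from an official evaluation script):

```python
def mrr_at_k(run: dict, qrels: dict, k: int = 10) -> float:
    # Mean reciprocal rank of the first relevant passage within the top k.
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in qrels[qid]:
                total += 1.0 / rank
                break
    return total / len(run)

def recall_at_k(run: dict, qrels: dict, k: int) -> float:
    # Fraction of relevant passages retrieved within the top k, averaged over queries.
    total = 0.0
    for qid, ranking in run.items():
        total += len(set(ranking[:k]) & qrels[qid]) / len(qrels[qid])
    return total / len(run)
```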

## Training

#### Details

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples via a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained for 200k steps on a single 32GB Tesla V100 GPU using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. Passages were truncated to 256 tokens and queries to 32 tokens.
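Concretely, the pairwise softmax cross-entropy objective treats each (query, positive, negative) triple as a two-way classification over the two MaxSim scores. An illustrative sketch (not the actual training script):

```python
import torch
import torch.nn.functional as F

def pairwise_softmax_ce(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """pos_scores, neg_scores: (batch,) MaxSim scores of each query's
    positive and negative passage."""
    logits = torch.stack([pos_scores, neg_scores], dim=1)  # (batch, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive = class 0
    return F.cross_entropy(logits, labels)
```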

#### Data

The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multi-lingual machine-translated version of the MS MARCO dataset which comprises:
- a corpus of 8.8M passages;
- a training set of ~533k unique queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) dataset that contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.

## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
   publisher = {Hugging Face},
   month     = dec,
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```