	---
	pipeline_tag: sentence-similarity
	language: fr
	license: apache-2.0
	datasets:
	- unicamp-dl/mmarco
	metrics:
	- recall
	tags:
	- feature-extraction
	- sentence-similarity
	library_name: colbert
	inference: false
	---

	# colbertv1-camembert-base-mmarcoFR

	This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the French portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
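
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes the token embeddings are L2-normalized, so the dot product equals cosine similarity. The helper names `l2_normalize` and `maxsim_score` and the toy 2-dimensional vectors are illustrative only and are not part of the ColBERT library, which runs this computation on batched GPU tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_emb, passage_emb):
    """Sum, over query tokens, of the best similarity with any passage token."""
    score = 0.0
    for q in query_emb:
        # Best-matching passage token for this query token.
        score += max(sum(a * b for a, b in zip(q, p)) for p in passage_emb)
    return score


# Toy token embeddings (hypothetical values, 2 dimensions for readability).
query = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
passage_a = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
passage_b = [l2_normalize(v) for v in ([1.0, 1.0],)]

score_a = maxsim_score(query, passage_a)  # every query token has an exact match
score_b = maxsim_score(query, passage_b)  # every query token only partially matches
print(score_a, score_b)
```

Because each query token picks its own best passage token, a passage is rewarded for covering every part of the query rather than for being similar on average.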

	## Installation

	To use this model, you will need to install the following libraries:
	```
	pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
	```


	## Usage

	Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
	```
	from colbert import Indexer
	from colbert.infra import Run, RunConfig

	n_gpu: int = 1 # Set your number of available GPUs
	experiment: str = "" # Name of the folder where the logs and created indices will be stored
	index_name: str = "" # The name of your index, i.e. the name of your vector database

	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
	    documents = [
	        "Ceci est un premier document.",
	        "Voici un second document.",
	        ...
	    ]
	    indexer.index(name=index_name, collection=documents)

	```

	Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
	```
	from colbert import Searcher
	from colbert.infra import Run, RunConfig

	n_gpu: int = 0
	experiment: str = "" # Name of the folder where the logs and created indices will be stored
	index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
	k: int = 10 # how many results you want to retrieve

	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    searcher = Searcher(index=index_name) # You don't need to specify the checkpoint again; the model name is stored in the index.
	    query = "Comment effectuer une recherche avec ColBERT ?"
	    results = searcher.search(query, k=k)
	    # results: three parallel lists (passage_ids, passage_ranks, passage_scores);
	    # zip(*results) yields one (passage_id, passage_rank, passage_score) triple per retrieved passage.

	```
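
As a small follow-up, the sketch below shows one way to turn search output into readable hits. It assumes the layout used by the upstream ColBERT code base, where `search` returns three parallel sequences (passage ids, ranks, scores), and that passage ids are positions in the list passed to `indexer.index`. The `format_results` helper and the stand-in `fake_results` values are hypothetical, so the snippet runs without a GPU or an index.

```python
def format_results(results, collection):
    """Pair each retrieved passage id with its rank, score, and original text."""
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "passage_id": pid, "score": score, "text": collection[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Stand-in data mimicking the assumed output layout (not real model scores).
documents = ["Ceci est un premier document.", "Voici un second document."]
fake_results = ([1, 0], [1, 2], [21.5, 17.25])

hits = format_results(fake_results, documents)
for hit in hits:
    print(f"{hit['rank']}. [{hit['passage_id']}] {hit['score']:.2f} {hit['text']}")
```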

	## Evaluation

	The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with that of a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

	| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
	|:------|:-------|--------:|-----:|-------:|-----:|---------:|------:|
	| colbertv1-camembert-base-mmarcoFR | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
	| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
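
For readers unfamiliar with the reported metrics, the following sketch computes MRR@k and R@k from ranked lists. It uses the standard definitions and is not the evaluation script behind the numbers above. The function names and the toy `run` and `qrels` dictionaries are hypothetical, and the values are fractions in [0, 1], whereas the table reports percentages.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage in the top k, or 0 if none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    found = set(ranked_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


def evaluate(run, qrels, mrr_k=10, recall_ks=(10, 100, 500)):
    """Average both metrics over every query that has relevance labels."""
    n = len(qrels)
    metrics = {
        f"MRR@{mrr_k}": sum(
            reciprocal_rank_at_k(run.get(q, []), rel, mrr_k) for q, rel in qrels.items()
        ) / n
    }
    for k in recall_ks:
        metrics[f"R@{k}"] = sum(
            recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()
        ) / n
    return metrics


# Toy example: query id -> ranked passage ids, and query id -> relevant passage ids.
run = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
qrels = {"q1": {7}, "q2": {9, 4}}

metrics = evaluate(run, qrels, mrr_k=10, recall_ks=(2, 3))
print(metrics)
```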

	## Training

	#### Details

	The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and fine-tuned on 12.8M triples with a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla V100 GPU with 32 GB of memory for 200k steps, using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.
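
The training objective can be illustrated with a short pure-Python sketch: for each triple, the positive and negative scores are passed through a two-way softmax, and the loss is the negative log-probability assigned to the positive passage. This is an assumed, framework-free rendering of the loss named above, not the actual training code. The function names and the example scores are hypothetical.

```python
import math


def pairwise_softmax_ce(pos_score, neg_score):
    """Cross-entropy of a 2-way softmax over (positive, negative) scores,
    with the positive passage as the target class."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score


def batch_loss(score_pairs):
    """Mean loss over a batch of (positive_score, negative_score) pairs."""
    return sum(pairwise_softmax_ce(p, n) for p, n in score_pairs) / len(score_pairs)


tied = pairwise_softmax_ce(5.0, 5.0)    # model cannot separate the pair: log(2)
good = pairwise_softmax_ce(12.0, 4.0)   # positive ranked well above negative: near 0
bad = pairwise_softmax_ce(4.0, 12.0)    # negative ranked above positive: large loss

# 200k steps at batch size 64 matches the 12.8M training triples.
triples_seen = 200_000 * 64
print(tied, good, bad, triples_seen)
```

The loss depends only on the score difference, so it pushes the positive passage's score above the negative's without constraining absolute values.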

	#### Data

	The model is fine-tuned on the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of the MS MARCO dataset, which comprises:
	- a corpus of 8.8M passages;
	- a training set of ~533k unique queries (with at least one relevant passage);
	- a development set of ~101k queries;
	- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

	The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset). In the future, better negatives could be selected by exploiting the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) dataset that contains 50 hard negatives mined from BM25 and 12 dense retrievers for each training query.

	## Citation

	```bibtex
	@online{louis2023,
	    author    = {Antoine Louis},
	    title     = {colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO},
	    publisher = {Hugging Face},
	    month     = {dec},
	    year      = {2023},
	    url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
	}
	```