antoinelouis committed on
Commit 97d37d7 · 1 Parent(s): defef1e

Update README.md

  1. README.md +34 -27
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- pipeline_tag: feature-extraction
3
  language: fr
4
  license: apache-2.0
5
  datasets:
@@ -17,46 +17,53 @@ inference: false
17
 
18
  This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
19
 
20
- ## Usage
 
 
 
 
 
21
 
22
- Using ColBERT on a dataset typically involves the following steps:
23
 
24
- **Step 1: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., `collection.tsv`) will contain all passages and another (e.g., `queries.tsv`) will contain a set of queries for searching the collection.
25
 
26
- **Step 2: Index your collection.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
27
  ```
28
- from colbert.infra import Run, RunConfig, ColBERTConfig
29
  from colbert import Indexer
 
 
 
 
 
30
 
31
- if __name__=='__main__':
32
-     with Run().context(RunConfig(nranks=1, experiment="msmarco")):
 
 
 
 
 
 
33
 
34
-         config = ColBERTConfig(
35
-             nbits=2,
36
-             root="/path/to/experiments",
37
-         )
38
-         indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
39
-         indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
40
  ```
41
 
42
- **Step 3: Search the collection with your queries.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
43
  ```
44
- from colbert.data import Queries
45
- from colbert.infra import Run, RunConfig, ColBERTConfig
46
  from colbert import Searcher
 
47
 
48
- if __name__=='__main__':
49
-     with Run().context(RunConfig(nranks=1, experiment="msmarco")):
 
 
50
 
51
-         config = ColBERTConfig(
52
-             root="/path/to/experiments",
53
-         )
54
-         searcher = Searcher(index="msmarco.nbits=2", config=config)
55
-         queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
56
-         ranking = searcher.search_all(queries, k=100)
57
-         ranking.save("msmarco.nbits=2.ranking.tsv")
58
- ```
59
 
 
60
 
61
  ## Evaluation
62
 
 
1
  ---
2
+ pipeline_tag: sentence-similarity
3
  language: fr
4
  license: apache-2.0
5
  datasets:
 
17
 
18
  This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
19
 
20
+ ## Installation
21
+
22
+ To use this model, you will need to install the following libraries:
23
+ ```
24
+ pip install colbert-ir[faiss-gpu] faiss torch
25
+ ```
26
 
 
27
 
28
+ ## Usage
29
 
30
+ **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
31
  ```
 
32
  from colbert import Indexer
33
+ from colbert.infra import Run, RunConfig
34
+
35
+ n_gpu: int = 1 # Set your number of available GPUs
36
+ experiment: str = "" # Name of the folder where the logs and created indices will be stored
37
+ index_name: str = "" # The name of your index, i.e. the name of your vector database
38
 
39
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
40
+     indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
41
+     documents = [
42
+         "Ceci est un premier document.",
43
+         "Voici un second document.",
44
+         ...
45
+     ]
46
+     indexer.index(name=index_name, collection=documents)
47
 
 
 
 
 
 
 
48
  ```
49
 
50
+ **Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
51
  ```
 
 
52
  from colbert import Searcher
53
+ from colbert.infra import Run, RunConfig
54
 
55
+ n_gpu: int = 0
56
+ experiment: str = "" # Name of the folder where the logs and created indices will be stored
57
+ index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
58
+ k: int = 10 # how many results you want to retrieve
59
 
60
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
61
+     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
62
+     query = "Comment effectuer une recherche avec ColBERT ?"
63
+     results = searcher.search(query, k=k)
64
+     # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
 
 
 
65
 
66
+ ```
67
 
68
  ## Evaluation
69
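
Note on the diff above: the README's description says the model scores passages with MaxSim operators over token-level embedding matrices. As a minimal sketch of that idea only (not the ColBERT library's implementation), the following plain-Python example sums, for each query token embedding, its highest dot product with any passage token embedding. The names `dot`, `maxsim_score`, and `rank_passages`, and the tiny 2-dimensional embeddings, are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: Sequence[Vector], passage_embs: Sequence[Vector]) -> float:
    """Late-interaction score: for every query token, take the best-matching
    passage token (max dot product), then sum those maxima."""
    return sum(max(dot(q, p) for p in passage_embs) for q in query_embs)


def rank_passages(
    query_embs: Sequence[Vector], passages: Sequence[Sequence[Vector]]
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted by decreasing MaxSim score."""
    scored = [(i, maxsim_score(query_embs, p)) for i, p in enumerate(passages)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration): two query tokens, two candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens
passage_b = [[1.0, 0.0]]              # matches only the first query token
ranking = rank_passages(query, [passage_a, passage_b])
```

In this toy case `passage_a` ranks first because each query token finds an exact match in it. A real index avoids this exhaustive comparison by using approximate nearest-neighbour search to select candidates first.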