jinhyuklee committed on
Commit
47af12f
1 Parent(s): 1f87897

Update README.md

  1. README.md +53 -3
README.md CHANGED
@@ -1,3 +1,53 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # XTR: Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
+
+ This README shows how to run [XTR](https://arxiv.org/abs/2304.01982) in PyTorch.
+
+ We thank Mujeen Sung for providing this functionality in [xtr-pytorch](https://github.com/mjeensung/xtr-pytorch).
+
+ ## Installation
+
+ ```
+ $ git clone git@github.com:mjeensung/xtr-pytorch.git
+ $ cd xtr-pytorch
+ $ pip install -e .
+ ```
+
+ ## Usage
+ ```
+ # sent_tokenize needs NLTK's Punkt tokenizer data.
+ # The XtrRetriever import path is an assumption based on the xtr-pytorch repository layout.
+ from nltk.tokenize import sent_tokenize
+ from xtr.modeling_xtr import XtrRetriever
+
+ # Create the dataset: one lowercased chunk per sentence
+ sample_doc = "Google LLC (/ˈɡuːɡəl/ (listen)) is an American multinational technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence..."
+ chunks = [chunk.lower() for chunk in sent_tokenize(sample_doc)]
+
+ # Load the XTR retriever
+ xtr = XtrRetriever(model_name_or_path="google/xtr-base-multilingual", use_faiss=False, device="cuda")
+
+ # Build the index
+ xtr.build_index(chunks)
+
+ # Retrieve the top-3 documents for the query
+ query = "Who founded google"
+ retrieved_docs, metadata = xtr.retrieve_docs([query], document_top_k=3)
+ for rank, (did, score, doc) in enumerate(retrieved_docs[0]):
+     print(f"[{rank}] doc={did} ({score:.3f}): {doc}")
+
+ """
+ >> [0] doc=0 (0.925): google llc (/ˈɡuːɡəl/ (listen)) is an american multinational technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics.
+ >> [1] doc=1 (0.903): it has been referred to as "the most powerful company in the world" and one of the world's most valuable brands due to its market dominance, data collection, and technological advantages in the area of artificial intelligence.
+ >> [2] doc=2 (0.900): its parent company alphabet is considered one of the big five american information technology companies, alongside amazon, apple, meta, and microsoft.
+ """
+ ```
+
+ ## Citing this work
+
+ ```bibtex
+ @article{lee2024rethinking,
+   title={Rethinking the role of token retrieval in multi-vector retrieval},
+   author={Lee, Jinhyuk and Dai, Zhuyun and Duddu, Sai Meher Karthik and Lei, Tao and Naim, Iftekhar and Chang, Ming-Wei and Zhao, Vincent},
+   journal={Advances in Neural Information Processing Systems},
+   volume={36},
+   year={2024}
+ }
+ ```
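
The Usage snippet in this README prints hits for a single query by indexing `retrieved_docs[0]`. The same loop could be extended to several queries, as in the minimal sketch below. It assumes `retrieve_docs` returns one list of `(did, score, doc)` tuples per query, as the snippet's unpacking suggests. The `format_hits` helper and the sample data are hypothetical stand-ins and are not part of xtr-pytorch.

```python
from typing import List, Sequence, Tuple

# One hit as unpacked in the README loop: (document id, score, document text).
Hit = Tuple[int, float, str]


def format_hits(queries: Sequence[str], retrieved_docs: Sequence[Sequence[Hit]]) -> List[str]:
    """Render ranked hits for each query (hypothetical helper, not part of xtr-pytorch)."""
    lines = []
    for query, hits in zip(queries, retrieved_docs):
        lines.append(f"query: {query}")
        for rank, (did, score, doc) in enumerate(hits):
            lines.append(f"[{rank}] doc={did} ({score:.3f}): {doc}")
    return lines


# Made-up data standing in for the first value returned by retrieve_docs.
queries = ["first query", "second query"]
fake_retrieved_docs = [
    [(0, 0.9, "doc a"), (2, 0.8, "doc c")],
    [(1, 0.7, "doc b")],
]
print("\n".join(format_hits(queries, fake_retrieved_docs)))
```

With a real retriever, the second argument would be the first value returned by `xtr.retrieve_docs(queries, document_top_k=3)`, provided the return shape matches the assumption above.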