|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- google-research-datasets/natural_questions |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
--- |
|
|
|
# svdr-nq |
|
|
|
Semi-Parametric Retrieval via Binary Token Index. Jiawei Zhou, Li Dong, Furu Wei, Lei Chen, arXiv 2024 |
|
|
|
The model is BERT-based with 12 layers and an embedding size of 29,523, derived from the BERT vocabulary of 30,522 with 999 unused tokens excluded. |
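As a quick sanity check, the embedding size follows directly from the two vocabulary figures quoted in this card. The constant names below are illustrative only and are not identifiers from `vsearch`.

```python
# Sizes stated in this model card (names are illustrative, not vsearch identifiers).
BERT_VOCAB_SIZE = 30_522     # bert-base-uncased vocabulary size
NUM_EXCLUDED_TOKENS = 999    # unused tokens dropped from the output space

embedding_size = BERT_VOCAB_SIZE - NUM_EXCLUDED_TOKENS
print(embedding_size)  # 29523
```

This is the same dimension that appears as the second axis of the index shapes printed in the sections below.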
|
|
|
|
|
## Quick Start |
|
|
|
Clone and install the `vsearch` repository: |
|
|
|
``` |
|
git clone git@github.com:jzhoubu/vsearch.git |
|
cd vsearch |
|
poetry install |
|
poetry shell |
|
``` |
|
|
|
The example below encodes a query and passages, then computes their similarity. |
|
|
|
```python |
|
import torch |
|
from src.ir import Retriever |
|
|
|
query = "Who first proposed the theory of relativity?" |
|
passages = [ |
|
"Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.", |
|
"Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.", |
|
"Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system." |
|
] |
|
|
|
ir = Retriever.from_pretrained("vsearch/svdr-nq") |
|
ir = ir.to("cuda") |
|
|
|
# Embed the query and passages |
|
q_emb = ir.encoder_q.embed(query) # Shape: [1, V] |
|
p_emb = ir.encoder_p.embed(passages) # Shape: [3, V] |
|
|
|
scores = q_emb @ p_emb.t() |
|
print(scores) |
|
|
|
# Output: |
|
# tensor([[61.5432, 10.3108, 8.6709]], device='cuda:0') |
|
``` |
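The score matrix `q_emb @ p_emb.t()` is a dot product between vocabulary-sized vectors. The sketch below illustrates the same computation on toy sparse vectors stored as plain `{dimension: weight}` dictionaries. The dimensions and weights are made up for illustration, and the helper `sparse_dot` is hypothetical rather than part of `vsearch`.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser vector.
    if len(u) > len(v):
        u, v = v, u
    return sum((w * v[i] for i, w in u.items() if i in v), 0.0)

# Made-up sparse vectors (illustrative values only, not model outputs).
q_vec = {7: 2.0, 42: 1.5}
p_vecs = [
    {7: 3.0, 42: 2.0, 99: 0.5},  # shares both query dimensions
    {7: 1.0, 13: 4.0},           # shares one query dimension
    {13: 2.0, 99: 1.0},          # shares none
]

scores = [sparse_dot(q_vec, p) for p in p_vecs]
# Passage indices ordered from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)   # [9.0, 2.0, 0.0]
print(ranking)  # [0, 1, 2]
```

Only dimensions that are non-zero in both vectors contribute to a score, which is why passages sharing more weighted vocabulary entries with the query rank higher.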
|
|
|
|
|
## Building Embedding-based Index for Search |
|
|
|
The examples below build an index for large-scale retrieval. |
|
|
|
```python |
|
# Build the sparse index for the passages |
|
ir.build_index(passages, index_type="sparse") |
|
print(ir.index) |
|
|
|
# Output: |
|
# Index Type : SparseIndex |
|
# Vector Type : torch.sparse_csr |
|
# Vector Shape : torch.Size([3, 29523]) |
|
# Vector Device : cuda:0 |
|
# Number of Texts : 3 |
|
|
|
# Save the index to disk |
|
index_file = "/path/to/index.npz" |
|
ir.save_index(index_file) |
|
|
|
# Load the index from disk |
|
index_file = "/path/to/index.npz" |
|
data_file = "/path/to/texts.jsonl" |
|
ir.load_index(index_file=index_file, data_file=data_file) |
|
|
|
# Search top-k results for queries |
|
queries = [query] |
|
results = ir.retrieve(queries, k=3) |
|
print(results) |
|
|
|
# Output: |
|
# SearchResults( |
|
# ids=tensor([[0, 1, 2]], device='cuda:0'), |
|
# scores=tensor([[97.2458, 39.7507, 37.6407]], device='cuda:0') |
|
# ) |
|
|
|
query_id = 0 |
|
top1_psg_id = results.ids[query_id][0] |
|
top1_psg = ir.index.get_sample(top1_psg_id) |
|
print(top1_psg) |
|
# Output: |
|
|
|
# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity. |
|
|
|
``` |
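To give a rough intuition for what top-k search over a sparse index involves, here is a conceptual sketch using an inverted index in plain Python. It is not the `vsearch` implementation (the output above indicates that the library stores vectors as `torch.sparse_csr` tensors). The functions `build_inverted_index` and `search` and all weights are hypothetical.

```python
import heapq
from collections import defaultdict

def build_inverted_index(passage_vecs):
    """Map each dimension to the (passage_id, weight) pairs that use it."""
    index = defaultdict(list)
    for pid, vec in enumerate(passage_vecs):
        for dim, weight in vec.items():
            index[dim].append((pid, weight))
    return index

def search(index, query_vec, k):
    """Return the top-k (passage_id, score) pairs by dot product."""
    scores = defaultdict(float)
    for dim, q_weight in query_vec.items():
        # Only passages that are non-zero on this dimension are touched.
        for pid, p_weight in index.get(dim, []):
            scores[pid] += q_weight * p_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up sparse passage vectors (illustrative values only).
passage_vecs = [{7: 3.0, 42: 2.0}, {7: 1.0, 13: 4.0}, {13: 2.0}]
index = build_inverted_index(passage_vecs)
top = search(index, {7: 2.0, 42: 1.5}, k=2)
print(top)  # [(0, 9.0), (1, 2.0)]
```

The returned passage IDs play the same role as `results.ids` above: they are positions in the indexed collection that can be mapped back to the original texts.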
|
|
|
## Building Bag-of-token Index for Search |
|
|
|
Our framework supports using tokenization as an index (i.e., a bag-of-token index), which operates on CPU and reduces indexing time and storage requirements by over 90% compared to an embedding-based index. |
|
|
|
```python |
|
# Build the bag-of-token index for the passages |
|
ir.build_index(passages, index_type="bag_of_token") |
|
print(ir.index) |
|
|
|
# Output: |
|
# Index Type : BoTIndex |
|
# Vector Type : torch.sparse_csr |
|
# Vector Shape : torch.Size([3, 29523]) |
|
# Vector Device : cuda:0 |
|
# Number of Texts : 3 |
|
|
|
# Search the top-k results in the bag-of-token index, then embed and rerank them on the fly |
|
queries = [query] |
|
results = ir.retrieve(queries, k=3, rerank=True) |
|
print(results) |
|
|
|
# Output: |
|
# SearchResults( |
|
# ids=tensor([0, 2, 1], device='cuda:0'), |
|
# scores=tensor([97.2964, 39.7844, 37.6955], device='cuda:0') |
|
# ) |
|
``` |
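The two-stage flow above (search a bag-of-token index, then rerank with embeddings) can be sketched conceptually as follows. This is a minimal illustration under stated assumptions, not the `vsearch` implementation: the whitespace tokenizer stands in for BERT WordPiece, all weights are made up, and the functions `build_bot_index`, `bot_search`, and `rerank` are hypothetical.

```python
def tokenize(text):
    """Stand-in tokenizer (lowercase whitespace split), not BERT WordPiece."""
    return text.lower().split()

def build_bot_index(passages):
    """Bag-of-token index: each passage becomes the set of tokens it contains."""
    return [set(tokenize(p)) for p in passages]

def bot_search(bot_index, query_weights, k):
    """Stage 1: score = sum of query weights of tokens present in the passage."""
    scored = [
        (pid, sum(w for tok, w in query_weights.items() if tok in toks))
        for pid, toks in enumerate(bot_index)
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

def rerank(candidates, query_weights, passage_weights):
    """Stage 2: rescore candidates with weighted passage vectors."""
    rescored = [
        (pid, sum(w * passage_weights[pid].get(tok, 0.0)
                  for tok, w in query_weights.items()))
        for pid, _ in candidates
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

passages = [
    "einstein developed relativity",
    "newton studied gravity",
    "tesla studied electricity",
]
# Made-up weights standing in for model embeddings (illustrative only).
query_weights = {"relativity": 2.0, "einstein": 1.0, "studied": 0.5}
passage_weights = [
    {"einstein": 1.5, "relativity": 2.0},
    {"newton": 1.0, "studied": 0.2},
    {"tesla": 1.0, "studied": 0.4},
]

bot_index = build_bot_index(passages)  # needs tokenization only, no model call
candidates = bot_search(bot_index, query_weights, k=3)
final = rerank(candidates, query_weights, passage_weights)
print([pid for pid, _ in candidates])  # [0, 1, 2]
print([pid for pid, _ in final])       # [0, 2, 1]
```

In this toy setup the binary index cannot separate the second and third passages, and the reranking step reorders them once weighted passage vectors are available. Only the retrieved candidates need to be embedded, which is what makes the index itself cheap to build and store.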
|
|
|
|
|
## Training Details |
|
|
|
Please refer to our paper at [https://arxiv.org/pdf/2405.01924](https://arxiv.org/pdf/2405.01924). |
|
|
|
|
|
|
|
## Citation |
|
If you find our paper or models helpful, please consider citing it as follows: |
|
``` |
|
@article{zhou2024semi, |
|
title={Semi-Parametric Retrieval via Binary Token Index}, |
|
author={Zhou, Jiawei and Dong, Li and Wei, Furu and Chen, Lei}, |
|
journal={arXiv preprint arXiv:2405.01924}, |
|
year={2024} |
|
} |
|
``` |