	---
	library_name: transformers
	license: mit
	datasets:
	- google-research-datasets/natural_questions
	base_model:
	- google-bert/bert-base-uncased
	---

	# svdr-nq

	Semi-Parametric Retrieval via Binary Token Index. Jiawei Zhou, Li Dong, Furu Wei, Lei Chen, arXiv 2024

	The model is BERT-based with 12 layers and an embedding size of 29,523, derived from the BERT vocabulary of 30,522 with 999 unused tokens excluded.
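
The embedding size follows from simple arithmetic on the vocabulary, and it matches the second dimension of the `Vector Shape` printed by the index examples in this README. A quick check in plain Python (the variable names are illustrative and not part of `vsearch`):

```python
# Vocabulary arithmetic behind the embedding dimension.
# Variable names are illustrative; they are not part of the vsearch API.
bert_vocab_size = 30_522   # size of the bert-base-uncased vocabulary
excluded_tokens = 999      # tokens excluded from the embedding space

embedding_dim = bert_vocab_size - excluded_tokens
print(embedding_dim)  # 29523
```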


	## Quick Start

	Clone the `vsearch` repository and install its dependencies:

	```
	git clone git@github.com:jzhoubu/vsearch.git
	cd vsearch
	poetry install
	poetry shell
	```

	The example below encodes a query and three passages, then computes their similarity scores.

	```python
	import torch
	from src.ir import Retriever

	query = "Who first proposed the theory of relativity?"
	passages = [
	"Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.",
	"Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
	"Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system."
	]

	ir = Retriever.from_pretrained("vsearch/svdr-nq")
	ir = ir.to("cuda")

	# Embed the query and passages
	q_emb = ir.encoder_q.embed(query) # Shape: [1, V]
	p_emb = ir.encoder_p.embed(passages) # Shape: [3, V]

	scores = q_emb @ p_emb.t()
	print(scores)

	# Output:
	# tensor([[61.5432, 10.3108, 8.6709]], device='cuda:0')
	```
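
To make the scoring step concrete, the sketch below reproduces the `q_emb @ p_emb.t()` computation in plain Python on toy data. Sparse vectors are stored as `{token: weight}` dictionaries. The tokens and weights are invented for illustration and are not produced by the model, and the helper `dot` is hypothetical.

```python
# Toy illustration of inner-product scoring in a vocabulary-sized space.
# All weights are made up; real embeddings come from ir.encoder_q / ir.encoder_p.

def dot(u: dict[str, float], v: dict[str, float]) -> float:
    """Inner product of two sparse vectors stored as {token: weight}."""
    # Iterate over the smaller vector so fewer lookups are needed.
    if len(u) > len(v):
        u, v = v, u
    return sum((w * v[t] for t, w in u.items() if t in v), 0.0)

q_vec = {"relativity": 2.0, "theory": 1.0, "proposed": 0.5}
p_vecs = [
    {"einstein": 3.0, "relativity": 2.5, "theory": 1.5},  # passage 0
    {"newton": 3.0, "physicist": 1.0, "theory": 0.5},     # passage 1
    {"tesla": 3.0, "electricity": 2.0},                   # passage 2
]

# One score per passage, analogous to the single row of the [1, 3] score matrix.
scores = [dot(q_vec, p) for p in p_vecs]
# Passage ids ordered from most to least similar.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

print(scores)   # [6.5, 0.5, 0.0]
print(ranking)  # [0, 1, 2]
```

Only tokens that appear in both vectors contribute to a score, which is why a passage sharing no weighted tokens with the query scores zero.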


	## Building Embedding-based Index for Search

	The example below builds a sparse index for large-scale retrieval, saves and reloads it, and searches it.

	```python
	# Build the sparse index for the passages
	ir.build_index(passages, index_type="sparse")
	print(ir.index)

	# Output:
	# Index Type : SparseIndex
	# Vector Type : torch.sparse_csr
	# Vector Shape : torch.Size([3, 29523])
	# Vector Device : cuda:0
	# Number of Texts : 3

	# Save the index to disk
	index_file = "/path/to/index.npz"
	ir.save_index(index_file)

	# Load the index from disk
	index_file = "/path/to/index.npz"
	data_file = "/path/to/texts.jsonl"
	ir.load_index(index_file=index_file, data_file=data_file)

	# Search top-k results for queries
	queries = [query]
	results = ir.retrieve(queries, k=3)
	print(results)

	# Output:
	# SearchResults(
	# ids=tensor([[0, 1, 2]], device='cuda:0'),
	# scores=tensor([[97.2458, 39.7507, 37.6407]], device='cuda:0')
	# )

	query_id = 0
	top1_psg_id = results.ids[query_id][0]
	top1_psg = ir.index.get_sample(top1_psg_id)
	print(top1_psg)
	# Output:
	# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.

	```
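
For intuition about what a top-k search over sparse vectors involves, here is a minimal sketch using an inverted index and `heapq`. It is not how `vsearch` implements `ir.retrieve` (the index printed above is a `torch.sparse_csr` matrix). The function names `build_inverted_index` and `search` and the toy weights are hypothetical.

```python
import heapq
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index

def search(index, query, k):
    """Return the k highest-scoring (doc_id, score) pairs by inner product."""
    acc = defaultdict(float)
    for token, q_weight in query.items():
        # Only documents containing the token are touched.
        for doc_id, d_weight in index.get(token, []):
            acc[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

# Toy sparse vectors; the weights are invented for illustration.
docs = [
    {"einstein": 3.0, "relativity": 2.5, "theory": 1.5},
    {"newton": 3.0, "physicist": 1.0, "theory": 0.5},
    {"tesla": 3.0, "electricity": 2.0},
]
index = build_inverted_index(docs)
results = search(index, {"relativity": 2.0, "theory": 1.0}, k=2)
print(results)  # [(0, 6.5), (1, 0.5)]
```

Documents that share no tokens with the query never enter the accumulator, so this sketch can return fewer than `k` results.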

	## Building Bag-of-token Index for Search

	Our framework supports using tokenization as an index (i.e., a bag-of-token index), which operates on CPU and reduces indexing time and storage requirements by over 90% compared to an embedding-based index.

	```python
	# Build the bag-of-token index for the passages
	ir.build_index(passages, index_type="bag_of_token")
	print(ir.index)

	# Output:
	# Index Type : BoTIndex
	# Vector Type : torch.sparse_csr
	# Vector Shape : torch.Size([3, 29523])
	# Vector Device : cuda:0
	# Number of Texts : 3

	# Search top-k results from bag-of-token index, and embed and rerank them on-the-fly
	queries = [query]
	results = ir.retrieve(queries, k=3, rerank=True)
	print(results)

	# Output:
	# SearchResults(
	# ids=tensor([0, 2, 1], device='cuda:3'),
	# scores=tensor([97.2964, 39.7844, 37.6955], device='cuda:0')
	# )
	```
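
The two-stage idea (a cheap lookup in a bag-of-token index, then on-the-fly reranking of the candidates) can be sketched in plain Python. This is an illustration under loud assumptions, not the `vsearch` implementation: whitespace splitting stands in for the BERT tokenizer, `toy_embed` is a hypothetical stand-in for the neural encoder, and first-stage scoring by token overlap is a simplification chosen for this example.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace split stands in for the BERT tokenizer.
    return text.lower().replace("?", "").replace(".", "").split()

def toy_embed(text):
    # Hypothetical stand-in for a neural encoder: token counts as weights.
    return Counter(tokenize(text))

passages = [
    "Einstein developed the theory of relativity",
    "Newton was a mathematician and physicist",
    "Tesla designed the alternating current supply system",
]

# Indexing needs no model: each passage is stored as its set of tokens.
bot_index = [set(tokenize(p)) for p in passages]

def retrieve(query, k, rerank=False):
    q_tokens = set(tokenize(query))
    # Stage 1: rank passages by the number of tokens shared with the query.
    overlap = [(len(q_tokens & toks), i) for i, toks in enumerate(bot_index)]
    ordered = sorted(overlap, key=lambda pair: (-pair[0], pair[1]))
    candidates = [i for _, i in ordered[:k]]
    if not rerank:
        return candidates
    # Stage 2: embed only the k candidates and rerank them by inner product.
    q_vec = toy_embed(query)

    def score(i):
        p_vec = toy_embed(passages[i])
        return sum(w * p_vec[t] for t, w in q_vec.items())

    return sorted(candidates, key=score, reverse=True)

print(retrieve("Who proposed the theory of relativity?", k=2, rerank=True))  # [0, 2]
```

The point of the design is that the expensive embedding step runs only on the `k` candidates at query time instead of on every passage at indexing time.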


	## Training Details

	Please refer to our paper at [https://arxiv.org/pdf/2405.01924](https://arxiv.org/pdf/2405.01924).



	## Citation
	If you find our paper or models helpful, please consider citing them as follows:
	```
	@article{zhou2024semi,
	title={Semi-Parametric Retrieval via Binary Token Index},
	author={Zhou, Jiawei and Dong, Li and Wei, Furu and Chen, Lei},
	journal={arXiv preprint arXiv:2405.01924},
	year={2024}
	}
	```