macavaney committed on
Commit
5b8a68e
1 Parent(s): 4533ea5

Create README.md

  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ language:
+ - en
+ tags:
+ - retrieval
+ - document_expansion
+ datasets:
+ - irds:msmarco-passage
+ library_name: pyterrier
+ ---
+
+ A Doc2Query model based on `t5-base` and trained on MS MARCO. This is a version of [the checkpoint released by the original authors](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip), converted to PyTorch format and ready for use in [`pyterrier_doc2query`](https://github.com/terrierteam/pyterrier_doc2query).
+
+ **Creating a transformer**
+
+ ```python
+ import pyterrier as pt
+ pt.init()
+ from pyterrier_doc2query import Doc2Query
+ # Load this checkpoint as a PyTerrier transformer
+ doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
+ ```
+
+ **Transforming documents**
+
+ ```python
+ import pandas as pd
+ doc2query(pd.DataFrame([
+   {'docno': '0', 'text': 'Hello Terrier!'},
+   {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
+ ]))
+ # docno                                               text                                           querygen
+ #     0                                     Hello Terrier!  hello terrier what kind of dog is a terrier wh...
+ #     1  Doc2Query expands queries with potentially rel...  can dodoc2query extend query query? what is do...
+ ```
+
+ **Indexing transformed documents**
+
+ ```python
+ doc2query.append = True  # append querygen to text
+ indexer = pt.IterDictIndexer('./my_index', fields=['text'])
+ pipeline = doc2query >> indexer
+ pipeline.index([
+   {'docno': '0', 'text': 'Hello Terrier!'},
+   {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
+ ])
+ ```
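
To make the effect of `append = True` concrete, here is a minimal plain-Python sketch of what appending the generated queries to the text amounts to, and why indexing only the `text` field can then be enough. The helper `append_queries` and the newline separator are hypothetical, chosen for illustration; the library may join the two fields differently.

```python
def append_queries(doc, text_field="text", query_field="querygen"):
    """Return a copy of `doc` with the generated queries appended to its text.

    Hypothetical helper for illustration only; it is not part of
    pyterrier_doc2query, and the newline separator is an assumption.
    """
    expanded = dict(doc)  # copy so the input document is left unchanged
    queries = expanded.get(query_field, "")
    if queries:
        expanded[text_field] = f"{expanded[text_field]}\n{queries}"
    return expanded


doc = {
    "docno": "0",
    "text": "Hello Terrier!",
    "querygen": "hello terrier what kind of dog is a terrier",
}
expanded = append_queries(doc)
print(expanded["text"])
```

After this step the expanded text carries both the original passage and the generated queries, so a single indexed field covers both.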
+
+ **Expanding and indexing a dataset**
+
+ ```python
+ dataset = pt.get_dataset('irds:vaswani')
+ pipeline.index(dataset.get_corpus_iter())
+ ```
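
For a corpus that is not available as a ready-made dataset, a small generator can produce documents in the same shape as the examples above, that is, dicts with `docno` and `text` keys. The sketch below is an assumption-laden illustration: `iter_corpus` and the sample `rows` are hypothetical names, and skipping empty documents is a design choice rather than something the library requires.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_corpus(rows: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Lazily yield one {'docno', 'text'} dict per (id, text) pair.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    for doc_id, text in rows:
        text = text.strip()
        if not text:
            continue  # skip empty documents; there is nothing to expand
        yield {"docno": str(doc_id), "text": text}


rows = [
    ("0", "Hello Terrier!"),
    ("1", "   "),  # whitespace only, so it is skipped
    ("2", "Doc2Query generates queries for documents."),
]
docs = list(iter_corpus(rows))
print(docs)
```

Because the generator is lazy, documents are produced one at a time rather than held in memory together. Assuming the indexing pipeline accepts any iterable of such dicts, as the list and iterator examples suggest, `pipeline.index(iter_corpus(rows))` would be the corresponding call.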
+
+ ## References
+
+ - [Nogueira20]: Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
+ - [Macdonald20]: Craig Macdonald and Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271