
Putting it all together

When you use the document encoder in an indexing pipeline, the rewritten document contents are indexed:

[Pipeline diagram: SPLADE >> Indexer (IDX)]

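import pyterrier as pt
pt.init(version='snapshot')
import pyt_splade

dataset = pt.get_dataset('irds:msmarco-passage')
splade = pyt_splade.SpladeFactory()

# pretokenised=True: index the term weights produced by SPLADE directly,
# instead of tokenising the raw text again
indexer = pt.IterDictIndexer('./msmarco_psg', pretokenised=True)

# encode each document with SPLADE, then pass the result to the indexer
indexer_pipe = splade.indexing() >> indexer
indexer_pipe.index(dataset.get_corpus_iter())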
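
To make "rewritten document contents" more concrete, here is a minimal pure-Python sketch of the kind of record a pretokenised indexer consumes: a mapping from terms to integer frequencies. It does not call SPLADE. The helper `to_pretokenised`, the `toks` field name, the example weights and the `scale` factor of 100 are all assumptions made for illustration, not a description of what `splade.indexing()` does internally.

```python
# Illustrative sketch only: turn (hypothetical) float term weights into the
# integer "term frequencies" that a pretokenised index stores.

def to_pretokenised(docno, term_weights, scale=100):
    """Scale float weights to integers and drop terms that round to zero.

    `scale` is an assumed quantisation factor, chosen here for illustration.
    """
    toks = {term: int(round(weight * scale)) for term, weight in term_weights.items()}
    toks = {term: freq for term, freq in toks.items() if freq > 0}
    return {"docno": docno, "toks": toks}


# Hypothetical encoder output for one passage: original terms plus an
# expansion term ("molecule") with a negligible weight.
doc = to_pretokenised("d1", {"chemical": 1.52, "reaction": 0.87, "molecule": 0.004})
print(doc)
```

Terms with very small weights vanish after rounding, which keeps the sketched record sparse.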

Once you have built an index, you can build a retrieval pipeline that first encodes the query and then performs retrieval:

[Pipeline diagram: SPLADE >> TF Retriever (IDX)]

splade_retr = splade.query() >> pt.BatchRetrieve('./msmarco_psg', wmodel='Tf')
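
The reason a plain term-frequency weighting model is enough here can be sketched in a few lines: if the index stores SPLADE-derived weights as term frequencies and the encoded query carries a weight per term, then summing query weight times stored frequency gives a dot product between the two sparse vectors. The sketch below is pure Python with made-up numbers; `tf_score`, the toy index and the query weights are hypothetical, and it assumes the retriever multiplies each query term weight by the stored frequency.

```python
# Illustrative sketch only: scoring with a term-frequency model over a
# pretokenised index behaves like a sparse dot product.

def tf_score(query_weights, doc_toks):
    """Sum of query weight * stored term frequency over the query terms."""
    return sum(weight * doc_toks.get(term, 0) for term, weight in query_weights.items())


# Toy index: docno -> {term: stored frequency} (made-up values)
index = {
    "d1": {"chemical": 152, "reaction": 87},
    "d2": {"reaction": 120, "enzyme": 90},
}

# Hypothetical encoded query: term -> weight
query = {"chemical": 2.0, "reaction": 0.5}

scores = {docno: tf_score(query, toks) for docno, toks in index.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy setting, a document that shares more highly weighted terms with the query ranks first, with no additional length normalisation or IDF component.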

References & Credits

This package uses Naver's SPLADE repository.

Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021.
Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval. CIKM 2021.