prithivida committed on
Commit 3f35db0 • 1 Parent(s): dc7a083

Update README.md

  1. README.md +26 -24
README.md CHANGED
@@ -23,39 +23,41 @@ pipeline_tag: fill-mask
23
  This work stands on the shoulders of two robust research works: [Naver's From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective paper](https://arxiv.org/pdf/2205.04733.pdf) and [Google's SparseEmbed](https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/79f16d3b3b948706d191a7fe6dd02abe516f5564.pdf).
24
  Props to both teams for such robust work.
25
 
26
- ## 1 What is a Sparse Representation and Why learn one?
27
 
28
  **Experts in Sparse & Dense representations, feel free to skip to section 2.**
29
 
30
  <details>
31
 
32
- 1. Lexical search:
33
 
34
  Lexical search with BOW-based sparse vectors is a strong baseline, but it famously suffers from the vocabulary mismatch problem, as it can only do exact term matching. Here are the pros and cons:
35
 
36
- ✅ Efficient and Cheap.
37
- ✅ No need to fine-tune models.
38
- ✅ Interpretable.
39
- ✅ Exact Term Matches.
40
- ❌ Vocabulary mismatch (Need to remember exact terms)
41
 
42
- 2. Semantic Search:
 
 
 
 
 
 
 
43
 
44
  Learned Neural / Dense retrievers (DPR, Sentence transformers*, BGE* models) with approximate nearest neighbors search have shown impressive results. Here are the pros and cons:
45
 
46
- ✅ Search how humans innately think.
47
- ✅ When finetuned, beats sparse by a long way.
48
- ✅ Easily works with multiple modalities.
49
- ❌ Suffers token amnesia (misses term matching).
50
- ❌ Resource intensive (both index & retrieval).
51
- ❌ Famously hard to interpret.
52
- ❌ Needs fine-tuning for OOD data.
53
 
54
- 3. The big idea:
55
 
56
  Getting the pros of both searches made sense, and that gave rise to interest in learning sparse representations for queries and documents with some interpretability. The sparse representations also double as implicit or explicit (latent, contextualized) expansion mechanisms for both queries and documents. If you are new to query expansion, learn more from the master himself, Daniel Tunkelang (link below).
57
 
58
- 4. What a Sparse model learns?
59
 
60
  The model learns to project its learned dense representations over an MLM head to give a vocabulary distribution, which is just to say the model can do automatic token expansion. (Image courtesy of Pinecone)
61
 
@@ -64,7 +66,7 @@ The model learns to project it's learned dense representations over a MLM head t
64
  </details>
65
 
66
 
67
- ## 2 Motivation:
68
  SPLADE models are a fine balance between retrieval effectiveness (quality) and retrieval efficiency (latency and $). With that in mind, we made **very minor retrieval efficiency tweaks** to make them more suitable for an industry setting.
69
  *(Pure MLE folks should not conflate efficiency with model inference efficiency. Our main focus is on retrieval efficiency. Hereinafter, efficiency is shorthand for retrieval efficiency unless explicitly qualified otherwise. Not that inference efficiency is unimportant; we will address that subsequently.)*
70
 
@@ -82,7 +84,7 @@ SPLADE models are a fine balance between retrieval effectiveness (quality) and r
82
 
83
  <br/>
84
 
85
- ## 3 Why is FLOPS one of the key metrics for an industry setting?
86
 
87
  <details>
88
 
@@ -119,7 +121,7 @@ SPLADE BOW rep:
119
 
120
  </details>
121
 
122
- ## 4 How does it translate into Empirical metrics?
123
 
124
  Our models are token sparse and yet effective. This translates to faster retrieval (user experience) and smaller index size ($). The respective metrics, mean retrieval time on the standard MS-MARCO small dev set and scaled total FLOPS loss, are below.
125
  This is why Google's SparseEmbed is interesting as they also achieve SPLADE quality retrieval effectiveness with much lower FLOPs. Compared to ColBERT, SPLADE and SparseEmbed match query and
@@ -140,7 +142,7 @@ The full [anserini evaluation log](https://huggingface.co/prithivida/Splade_PP_e
140
  - **Same size models:** Official SPLADE++, SparseEmbed and ours all finetune a base model of the same size, that of `bert-base-uncased`.
141
  </details>
142
 
143
- ## 5 Roadmap and future directions for Industry Suitability.
144
 
145
  - **Custom/Domain Finetuning**: OOD Zeroshot performance of SPLADE models is great but unimportant in the industry setting as we need the ability to finetune on custom datasets or domains. Finetuning SPLADE on a new dataset is not cheap and needs labelling of queries and passages.
146
  So we will continue to explore how to make finetuning our recipe on custom datasets economical, without expensive labelling.
@@ -148,7 +150,7 @@ The full [anserini evaluation log](https://huggingface.co/prithivida/Splade_PP_e
148
  120K and 250K vocab as opposed to 30K as in bert-base-uncased. We will continue to research to see how best we can extend our recipe to the multilingual world.
149
 
150
 
151
- ## 6 Usage
152
 
153
  To enable a lightweight inference solution with **no heavy Torch dependency**, we will also release a library: **SPLADERunner**.
154
  Of course, if that doesn't matter, you could always use these models with the Huggingface transformers library.
@@ -157,7 +159,7 @@ Ofcourse if it doesnt matter you could always use these models Huggingface trans
157
  <h1 id="htu">How to use? </h1>
158
 
159
 
160
- ## 7 With SPLADERunner Library
161
 
162
  [SPLADERunner Library](https://github.com/PrithivirajDamodaran/SPLADERunner)
163
 
@@ -175,7 +177,7 @@ sparse_rep = expander.expand(
175
  ```
176
 
177
 
178
- ## 8 With HuggingFace
179
 
180
  ```python
181
  import torch
 
23
  This work stands on the shoulders of two robust research works: [Naver's From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective paper](https://arxiv.org/pdf/2205.04733.pdf) and [Google's SparseEmbed](https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/79f16d3b3b948706d191a7fe6dd02abe516f5564.pdf).
24
  Props to both teams for such robust work.
25
 
26
+ ## 1. What is a Sparse Representation and Why learn one?
27
 
28
  **Experts in Sparse & Dense representations, feel free to skip to section 2.**
29
 
30
  <details>
31
 
32
+ **1. Lexical search:**
33
 
34
  Lexical search with BOW-based sparse vectors is a strong baseline, but it famously suffers from the vocabulary mismatch problem, as it can only do exact term matching. Here are the pros and cons:
35
 
 
 
 
 
 
36
 
37
+ - ✅ Efficient and Cheap.
38
+ - ✅ No need to fine-tune models.
39
+ - ✅ Interpretable.
40
+ - ✅ Exact Term Matches.
41
+ - ❌ Vocabulary mismatch (Need to remember exact terms)
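
To make the vocabulary mismatch concrete, here is a minimal sketch of exact-term matching over bag-of-words vectors. The function names, the whitespace tokenizer and the example texts are illustrative assumptions, not part of this repository or of any retrieval library.

```python
from collections import Counter


def bow(text):
    """Bag-of-words vector as a term -> count mapping (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def overlap_score(query, document):
    """Dot product of two bag-of-words vectors: only identical terms contribute."""
    q, d = bow(query), bow(document)
    return sum(count * d[term] for term, count in q.items())


# Hypothetical document and queries, chosen only to show the effect.
doc = "cheap laptop for students"
print(overlap_score("cheap laptop", doc))         # both query terms appear in the document
print(overlap_score("affordable notebook", doc))  # similar intent, but no shared terms
```

The second query scores zero even though it asks for roughly the same thing, which is the mismatch that learned expansion tries to address.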
42
+
43
+
44
+ **2. Semantic Search:**
45
 
46
  Learned Neural / Dense retrievers (DPR, Sentence transformers*, BGE* models) with approximate nearest neighbors search have shown impressive results. Here are the pros and cons:
47
 
48
+ - ✅ Search how humans innately think.
49
+ - ✅ When finetuned, beats sparse by a long way.
50
+ - ✅ Easily works with multiple modalities.
51
+ - ❌ Suffers token amnesia (misses term matching).
52
+ - ❌ Resource intensive (both index & retrieval).
53
+ - ❌ Famously hard to interpret.
54
+ - ❌ Needs fine-tuning for OOD data.
55
 
56
+ **3. The big idea:**
57
 
58
  Getting the pros of both searches made sense, and that gave rise to interest in learning sparse representations for queries and documents with some interpretability. The sparse representations also double as implicit or explicit (latent, contextualized) expansion mechanisms for both queries and documents. If you are new to query expansion, learn more from the master himself, Daniel Tunkelang (link below).
59
 
60
+ **4. What a Sparse model learns?**
61
 
62
  The model learns to project its learned dense representations over an MLM head to give a vocabulary distribution, which is just to say the model can do automatic token expansion. (Image courtesy of Pinecone)
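
As a rough illustration of that projection, the sketch below pools per-token MLM logits into one vocabulary-sized sparse vector using one common SPLADE-style formulation (log-saturated ReLU, then max over input tokens). The tiny vocabulary, the logit values and the function name are made-up assumptions for illustration; this is not the exact code used to train or run this model.

```python
import math

# Toy vocabulary and per-token MLM logits (made-up numbers, for illustration only).
VOCAB = ["cat", "feline", "dog", "car", "pet"]
MLM_LOGITS = [
    [-1.0, -2.0, -1.5, -3.0, 0.5],  # logits over VOCAB for the input token "my"
    [4.0, 2.5, -0.5, -2.0, 1.5],    # logits over VOCAB for the input token "cat"
]


def splade_style_pool(logits_per_token):
    """Collapse per-token vocabulary logits into one vocabulary-sized vector.

    Each vocabulary entry gets the max over input tokens of log(1 + relu(logit)),
    so entries whose logits are never positive end up exactly zero.
    """
    vocab_size = len(logits_per_token[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in logits_per_token)
        for j in range(vocab_size)
    ]


weights = splade_style_pool(MLM_LOGITS)
# Keep only non-zero entries: this is the sparse, interpretable representation.
sparse_rep = {term: round(w, 3) for term, w in zip(VOCAB, weights) if w > 0.0}
print(sparse_rep)
```

In this toy input, "feline" and "pet" receive non-zero weights although neither appears in the text, which is the token expansion described above.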
63
 
 
66
  </details>
67
 
68
 
69
+ ## 2. Motivation:
70
  SPLADE models are a fine balance between retrieval effectiveness (quality) and retrieval efficiency (latency and $). With that in mind, we made **very minor retrieval efficiency tweaks** to make them more suitable for an industry setting.
71
  *(Pure MLE folks should not conflate efficiency with model inference efficiency. Our main focus is on retrieval efficiency. Hereinafter, efficiency is shorthand for retrieval efficiency unless explicitly qualified otherwise. Not that inference efficiency is unimportant; we will address that subsequently.)*
72
 
 
84
 
85
  <br/>
86
 
87
+ ## 3. Why is FLOPS one of the key metrics for an industry setting?
88
 
89
  <details>
90
 
 
121
 
122
  </details>
123
 
124
+ ## 4. How does it translate into Empirical metrics?
125
 
126
  Our models are token sparse and yet effective. This translates to faster retrieval (user experience) and smaller index size ($). The respective metrics, mean retrieval time on the standard MS-MARCO small dev set and scaled total FLOPS loss, are below.
127
  This is why Google's SparseEmbed is interesting as they also achieve SPLADE quality retrieval effectiveness with much lower FLOPs. Compared to ColBERT, SPLADE and SparseEmbed match query and
 
142
  - **Same size models:** Official SPLADE++, SparseEmbed and ours all finetune a base model of the same size, that of `bert-base-uncased`.
143
  </details>
144
 
145
+ ## 5. Roadmap and future directions for Industry Suitability.
146
 
147
  - **Custom/Domain Finetuning**: OOD Zeroshot performance of SPLADE models is great but unimportant in the industry setting as we need the ability to finetune on custom datasets or domains. Finetuning SPLADE on a new dataset is not cheap and needs labelling of queries and passages.
148
  So we will continue to explore how to make finetuning our recipe on custom datasets economical, without expensive labelling.
 
150
  120K and 250K vocab as opposed to 30K as in bert-base-uncased. We will continue to research to see how best we can extend our recipe to the multilingual world.
151
 
152
 
153
+ ## 6. Usage
154
 
155
  To enable a lightweight inference solution with **no heavy Torch dependency**, we will also release a library: **SPLADERunner**.
156
  Of course, if that doesn't matter, you could always use these models with the Huggingface transformers library.
 
159
  <h1 id="htu">How to use? </h1>
160
 
161
 
162
+ ## 7. With SPLADERunner Library
163
 
164
  [SPLADERunner Library](https://github.com/PrithivirajDamodaran/SPLADERunner)
165
 
 
177
  ```
178
 
179
 
180
+ ## 8. With HuggingFace
181
 
182
  ```python
183
  import torch