prithivida committed
Commit 6fdd93b • Parent(s): 4ed18bf
Update README.md

README.md CHANGED
@@ -37,20 +37,25 @@ pipeline_tag: sentence-similarity
 </center>
 
 
-
-
-
-
+- [License and Terms:](#license-and-terms)
+- [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
+- [ONNX & GGUF Variants:](#detailed-comparison--our-contribution)
+- [Usage:](#usage)
 - [With Sentence Transformers:](#with-sentence-transformers)
 - [With Huggingface Transformers:](#with-huggingface-transformers)
+- [FAQs](#faqs)
+- [How can we run these models without the heavy torch dependency?](#how-can-we-run-these-models-without-the-heavy-torch-dependency)
 - [How do I optimise vector index cost?](#how-do-i-optimise-vector-index-cost)
 - [How do I offer hybrid search to address Vocabulary Mismatch Problem?](#how-do-i-offer)
+- [Why not run MTEB?](#why-not-run-mteb)
+- [Roadmap](#roadmap)
 - [Notes on Reproducing:](#notes-on-reproducing)
 - [Reference:](#reference)
 - [Note on model bias](#note-on-model-bias)
+
 
 
-
+# License and Terms:
 
 <center>
 <img src="./terms.png" width=200%/>
@@ -81,7 +86,7 @@ Full set of evaluation numbers for our model
 
 <br/>
 
-
+# Usage:
 
 #### With Sentence Transformers:
 
@@ -137,6 +142,12 @@ for query, query_embedding in zip(queries, query_embeddings):
 #### With Huggingface Transformers:
 - T.B.A
 
+# FAQs:
+
+#### How can we run these models without the heavy torch dependency?
+- You can use the ONNX flavours of these models via the [FlashRetrieve](https://github.com/PrithivirajDamodaran/FlashRetrieve) library.
+
+
 #### How do I optimise vector index cost?
 [Use Binary and Scalar Quantisation](https://huggingface.co/blog/embedding-quantization)
 
@@ -147,8 +158,23 @@ MIRACL paper shows simply combining BM25 is a good starting point for a Hybrid o
 |-----------|-----|--------------|--------------|----------------|
 | **Hindi** | **hi** | **0.458** | **0.383** | **0.616** |
 
+#### Why not run MTEB?
+MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, currently available only for English, Chinese, French and a few other languages, but not for Indic languages. Besides, like BGE-M3, the miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR use cases.
+At the moment MIRACL is the gold standard for a subset of Indic languages.
+
+
+
+# Roadmap
+We will add miniMiracle series models for all popular languages, as we see fit or based on community requests, in phases. Some of the languages on our list are:
+
+- Spanish
+- Tamil
+- Arabic
+- German
+- English ?
+
 
-
+# Notes on reproducing:
 
 We welcome everyone to reproduce our results. Here are some tips and observations:
 
@@ -165,7 +191,7 @@ Here are our numbers for the full hindi run on BGE-M3
 {'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
 ```
 
-Fair warning BGE-M3 is $ expensive to evaluate, probably that's why it's not part of any of the MTEB benchmarks.
+Fair warning: BGE-M3 is expensive ($) to evaluate, which is probably why it is not part of the retrieval slice of the MTEB benchmarks.
 
 
 # Reference:
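The "How do I optimise vector index cost?" FAQ in the diff points at binary and scalar quantisation. As a rough illustration of the binary case only (a NumPy sketch with illustrative helper names, not the API from the linked blog post): each float dimension is reduced to its sign bit and 8 bits are packed per byte, so a float32 vector shrinks 32x, and similarity search falls back to Hamming distance.

```python
import numpy as np

def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    """Binarise float embeddings (keep only the sign of each dimension)
    and pack 8 dimensions per byte: float32 -> 32x smaller index."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# 384-dim float32 vectors become 48-byte binary codes
rng = np.random.default_rng(0)
vecs = rng.standard_normal((2, 384)).astype(np.float32)
codes = binary_quantize(vecs)
assert codes.shape == (2, 48)
```

In practice binary codes are used for a fast first pass, and a small candidate set is re-scored with the original float vectors to recover accuracy.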
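The hybrid-search FAQ cites MIRACL's observation that combining BM25 with a dense retriever is a good starting point. One common, simple way to combine the two result lists (reciprocal rank fusion, shown here as a generic sketch, not necessarily what the model card's authors used) is:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked doc-id lists: each doc earns 1/(k + rank)
    from every list it appears in; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # lexical (BM25) ranking
dense_hits = ["d1", "d9", "d3"]  # dense-embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# d1 appears high in both lists, so it tops the fused ranking
```

Rank-based fusion like this sidesteps the need to normalise BM25 and cosine scores onto a common scale, which is why it is a popular default for hybrid setups.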
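The reproduction notes report MRR@10/100/1000 for the Hindi BGE-M3 run. For readers re-running the evaluation, this is what the metric computes — a minimal sketch on toy data (not MIRACL's official scorer or qrels):

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean reciprocal rank: for each query, 1/rank of the first relevant
    doc within the top-k (0 if none appears), averaged over all queries."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# two toy queries: first relevant doc at rank 2 and rank 1
rankings = [["d5", "d2", "d9"], ["d1", "d4"]]
relevant = [{"d2"}, {"d1", "d4"}]
print(mrr_at_k(rankings, relevant, k=10))  # (1/2 + 1/1) / 2 = 0.75
```

Only the first relevant hit per query contributes, which is why MRR@10, MRR@100 and MRR@1000 differ only for queries whose first hit falls beyond the smaller cutoff.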