---
license: cc-by-nc-nd-4.0
language:
- hi
datasets:
- MIRACL
tags:
- miniMiracle
- passage-retrieval
- knowledge-distillation
- middle-training
pretty_name: >-
  miniMiracle is a family of high-quality, lightweight and easy-to-deploy
  multilingual embedders / retrievers, primarily focused on Indo-Aryan and
  Dravidian languages.
library_name: transformers
pipeline_tag: sentence-similarity
---


<center>
<img src="./logo.png" width=150/>
  <img src="./hi_intro.png" width=120%/>
</center>

<center>
<img src="./hi_metrics_1.png" width=110%/>
  <b><p>Table 1: Hindi retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
</center>

## Architecture:

- Model: BERT.
- Tokenizer: XLM-RoBERTa's tokenizer.
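You can sanity-check this locally; the snippet below is a minimal sketch that assumes the checkpoint loads through the standard `AutoConfig` / `AutoTokenizer` interfaces.

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "prithivida/miniMiracle_hi_v1"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.model_type)  # a BERT encoder is expected
print(len(tokenizer))     # XLM-RoBERTa-style vocabulary, roughly 250K tokens
```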


<br/>

<center>
  <h1> Table Of Contents </h1>
</center>


- [License and Terms:](#license-and-terms)
- [Detailed comparison & Our Contribution:](#detailed-comparison--our-contribution)
- [ONNX & GGUF Variants:](#how-can-i-reduce-overall-inference-cost)
- [Usage:](#usage)
    - [With Sentence Transformers:](#with-sentence-transformers)
    - [With Huggingface Transformers:](#with-huggingface-transformers)
- [FAQs](#faqs)
    - [How can I reduce overall inference cost?](#how-can-i-reduce-overall-inference-cost)
    - [How do I reduce vector storage cost?](#how-do-i-reduce-vector-storage-cost)
    - [How do I offer hybrid search to improve accuracy?](#how-do-i-offer-hybrid-search-to-improve-accuracy)
    - [Why not run MTEB?](#why-not-run-mteb)
- [Roadmap](#roadmap)
- [Notes on Reproducing:](#notes-on-reproducing)
- [Reference:](#reference)
- [Note on model bias](#note-on-model-bias)

  

# License and Terms:

<center>
  <img src="./terms.png" width=200%/>
</center>


## Detailed comparison & Our Contribution:

English famously has the **all-minilm** series of models, which are great for quick experimentation and for certain production workloads. The idea is to offer the same for other popular languages, starting with the Indo-Aryan and Dravidian languages. Our innovation is in bringing high-quality models that are easy to serve and whose embeddings are cheap to store, without ANY pretraining or expensive finetuning. For instance, the **all-minilm** models are finetuned on 1 billion pairs. We offer a very lean model, but with a huge vocabulary of around 250K tokens.
We will add more details here.


<center>
  <img src="./hi_metrics_2.png" width=120%/>
  <b><p>Table 2: Detailed Hindi retrieval performance on the MIRACL dev set (measured by nDCG@10)</p></b>
  
</center>

The full set of evaluation numbers for our model:

```python
{'NDCG@1': 0.42571, 'NDCG@3': 0.42062, 'NDCG@5': 0.44842, 'NDCG@10': 0.5039, 'NDCG@100': 0.56175, 'NDCG@1000': 0.57772}
{'MAP@1': 0.22683, 'MAP@3': 0.33514, 'MAP@5': 0.37345, 'MAP@10': 0.40861, 'MAP@100': 0.42833, 'MAP@1000': 0.42916}
{'Recall@10': 0.63964, 'Recall@50': 0.80537, 'Recall@100': 0.87136, 'Recall@200': 0.9211, 'Recall@500': 0.96851, 'Recall@1000': 0.97987}
{'P@1': 0.42571, 'P@3': 0.27429, 'P@5': 0.212, 'P@10': 0.13943, 'P@100': 0.01911, 'P@1000': 0.00211}
{'MRR@10': 0.53057, 'MRR@100': 0.53736, 'MRR@1000': 0.5377}
```

<br/>

# Usage:

#### With Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer
import scipy.spatial


model = SentenceTransformer('prithivida/miniMiracle_hi_v1')

corpus = [
    'एक आदमी खाना खा रहा है।',
    'लोग ब्रेड का एक टुकड़ा खा रहे हैं।',
    'लड़की एक बच्चे को उठाए हुए है।',
    'एक आदमी घोड़े पर सवार है।',
    'एक महिला वायलिन बजा रही है।',
    'दो आदमी जंगल में गाड़ी धकेल रहे हैं।',
    'एक आदमी एक सफेद घोड़े पर एक बंद मैदान में सवारी कर रहा है।',
    'एक बंदर ड्रम बजा रहा है।',
    'एक चीता अपने शिकार के पीछे दौड़ रहा है।',
    'एक बड़ा डिनर है।'
]

corpus_embeddings = model.encode(corpus)

queries = [
    'एक आदमी पास्ता खा रहा है।',
    'एक गोरिल्ला सूट पहने व्यक्ति ड्रम बजा रहा है।'
]

query_embeddings = model.encode(queries)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

# Optional: quantize the embeddings to cut storage cost
# from sentence_transformers.quantization import quantize_embeddings
# binary_embeddings = quantize_embeddings(corpus_embeddings, precision="ubinary")

```

#### With Huggingface Transformers:
- T.B.A
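
Until the official snippet is added, here is a minimal sketch with plain `transformers`, using CLS pooling and inner product scoring as recommended in [Notes on reproducing](#notes-on-reproducing); treat it as an assumption-laden example rather than the reference implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "prithivida/miniMiracle_hi_v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(texts):
    # Tokenise the batch and take the CLS token's hidden state as the sentence embedding
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return outputs.last_hidden_state[:, 0]  # CLS pooling

queries = ["एक आदमी पास्ता खा रहा है।"]        # "A man is eating pasta."
passages = [
    "एक आदमी खाना खा रहा है।",                  # "A man is eating food."
    "एक महिला वायलिन बजा रही है।",              # "A woman is playing the violin."
]

scores = embed(queries) @ embed(passages).T  # inner product scoring
print(scores)
```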

# FAQs:

#### How can I reduce overall inference cost?
- You can host these models without a heavy torch dependency by using their ONNX flavours via the [FlashRetrieve](https://github.com/PrithivirajDamodaran/FlashRetrieve) library.


#### How do I reduce vector storage cost?
[Use binary and scalar quantisation](https://huggingface.co/blog/embedding-quantization); a minimal sketch follows below.
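
For example, with `sentence-transformers` the embeddings can be quantised in a single call. This sketch assumes a recent `sentence-transformers` release that ships `quantize_embeddings`.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("prithivida/miniMiracle_hi_v1")
corpus_embeddings = model.encode(["एक आदमी खाना खा रहा है।", "एक महिला वायलिन बजा रही है।"])

# Binary quantisation packs each dimension into a single bit (~32x smaller vectors);
# precision="int8" gives scalar quantisation (~4x smaller) instead.
binary_embeddings = quantize_embeddings(corpus_embeddings, precision="ubinary")
print(corpus_embeddings.nbytes, binary_embeddings.nbytes)
```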

#### How do I offer hybrid search to improve accuracy?
The MIRACL paper shows that simply combining with BM25 is a good starting point for a hybrid option. The numbers below are with the mDPR model, but miniMiracle_hi_v1 should give even better hybrid performance; a minimal fusion sketch follows the table.

| Language  | ISO | nDCG@10 BM25 | nDCG@10 mDPR | nDCG@10 Hybrid |
|-----------|-----|--------------|--------------|----------------|
| **Hindi**     | **hi**  | **0.458**        | **0.383**        | **0.616**          |
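
As an illustration of what the fusion could look like, here is a minimal sketch that min-max normalises BM25 and dense scores and blends them with an equal weight; the `rank_bm25` package, the whitespace tokenisation, and the 0.5/0.5 weighting are illustrative assumptions, not the MIRACL setup.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["एक आदमी खाना खा रहा है।", "एक महिला वायलिन बजा रही है।", "एक बंदर ड्रम बजा रहा है।"]
query = "एक आदमी पास्ता खा रहा है।"

# Sparse scores: BM25 over whitespace-tokenised passages
bm25 = BM25Okapi([doc.split() for doc in corpus])
sparse = list(bm25.get_scores(query.split()))

# Dense scores: cosine similarity from the embedder
model = SentenceTransformer("prithivida/miniMiracle_hi_v1")
dense = util.cos_sim(model.encode([query]), model.encode(corpus))[0].tolist()

def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + 1e-9) for s in scores]

# Blend the normalised score lists with an (assumed) equal weight
hybrid = [0.5 * s + 0.5 * d for s, d in zip(minmax(sparse), minmax(dense))]
for score, doc in sorted(zip(hybrid, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```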

#### Why not run MTEB?
MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but it is currently available only for English, Chinese, French and a few other languages, not for Indic languages. Besides, like BGE-M3, the miniMiracle models are predominantly tuned for retrieval tasks aimed at search & IR use cases.
At the moment MIRACL is the gold standard for a subset of Indic languages.



# Roadmap
We will add miniMiracle models for all popular languages in phases, as we see fit or based on community requests. Some of the languages on our list are:

- Spanish
- Tamil
- Arabic
- German
- English ?


# Notes on reproducing:

We welcome everyone to reproduce our results. Here are some tips and observations:

- Use CLS Pooling and Inner Product.
- There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi while we observed only 58.9.

Here are our numbers for the full Hindi run on BGE-M3:

```python
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
```

Fair warning: BGE-M3 is expensive ($) to evaluate, which is probably why it is not part of any of the retrieval slices of the MTEB benchmarks.


# Reference:
- [All Cohere numbers are copied from here](https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12)
- [BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation](https://arxiv.org/pdf/2402.03216.pdf)
- [Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages](https://arxiv.org/pdf/2210.09984.pdf)
- [IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages](https://arxiv.org/pdf/2312.09508)
  

# Note on model bias:
- Like any model, this one might carry inherent biases from the base models and the datasets it was pretrained and finetuned on. Please use it responsibly.