File size: 10,058 Bytes
79ea6bf
 
19789cd
 
 
 
 
 
 
2a668b3
 
a7b92e2
 
ced2aea
 
 
a7b92e2
 
 
 
 
 
ced2aea
 
 
a7b92e2
 
ad94f78
927a453
a7b92e2
279bd09
 
ced2aea
37f957a
ced2aea
37f957a
279bd09
 
d2b8c74
279bd09
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37f957a
279bd09
ced2aea
279bd09
 
 
 
 
 
 
 
 
 
 
 
ced2aea
279bd09
 
 
 
 
 
 
 
 
 
 
ced2aea
279bd09
 
f93ddb0
 
 
 
ced2aea
f93ddb0
 
 
 
311c6e8
 
05e1d3f
311c6e8
 
babbc6b
1ee5a9c
19bb940
 
d2db686
5b26c69
19bb940
d9a8992
19bb940
babbc6b
d9a8992
311c6e8
 
1ee5a9c
311c6e8
babbc6b
 
d9a8992
babbc6b
 
 
 
f93ddb0
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
license: bigscience-bloom-rail-1.0
datasets:
- unicamp-dl/mmarco
- rajpurkar/squad
language:
- fr
- en
pipeline_tag: sentence-similarity
---

# Bloomz-3b Reranking

This reranking model is built from [cmarkea/bloomz-3b-dpo-chat](https://huggingface.co/cmarkea/bloomz-3b-dpo-chat) model and aims to measure the semantic correspondence between
a question (query) and a context. With its normalized scoring, it helps to filter the query/context matchings outputted by a retriever in an ODQA (Open-Domain Question Answering)context.
Moreover, it allows to reorder the results using a more efficient modeling approach than the retriever one. However, this modeling type is not conducive to direct
database searching due to its high computational cost.

Developed to be language-agnostic, this model supports both French and English. Consequently, it can effectively score in a cross-language context without being
influenced by its behavior in a monolingual context (English or French).

## Dataset
The training dataset is composed of the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco), consisting of query/positive/hard negative triplets. Additionally,
we have included [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) data from the "train" split, forming query/positive/hard negative triplets. In order to generate hard
negative data for SQuAD, we considered contexts from the same theme as the query but from a different set of queries. Hence, the negative observations belong to the same
themes as the queries but presumably do not contain the answer to the question.

Finally, the triplets are flattened to obtain pairs of query/context sentences with a label 1 if query/positive and a label 0 if query/negative. In each element of the
pair (query and context), the language, French or English, is randomly and uniformly chosen.

## Evaluation

To assess the performance of the reranker, we will make use of the "validation" split of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset. We will select
the first question from each paragraph, along with the paragraph constituting the context that should be ranked Top-1 for an Oracle modeling. What's intriguing is that
the number of themes is limited, and each context from a corresponding theme that does not match the query is considered as a hard negative (other contexts outside the theme are
simple negatives). Thus, we can construct the following table, with each theme showing the number of contexts and associated query:

| Theme name                                   | Context number |
|---------------------------------------------:|:---------------|
| Normans                                      | 39             |
| Computational_complexity_theory              | 48             |
| Southern_California                          | 39             |
| Sky_(United_Kingdom)                         | 22             |
| Victoria_(Australia)                         | 25             |
| Huguenot                                     | 44             |
| Steam_engine                                 | 46             |
| Oxygen                                       | 43             |
| 1973_oil_crisis                              | 24             |
| European_Union_law                           | 40             |
| Amazon_rainforest                            | 21             |
| Ctenophora                                   | 31             |
| Fresno,_California                           | 28             |
| Packet_switching                             | 23             |
| Black_Death                                  | 23             |
| Geology                                      | 25             |
| Pharmacy                                     | 26             |
| Civil_disobedience                           | 26             |
| Construction                                 | 22             |
| Private_school                               | 26             |
| Harvard_University                           | 30             |
| Jacksonville,_Florida                        | 21             |
| Economic_inequality                          | 44             |
| University_of_Chicago                        | 37             |
| Yuan_dynasty                                 | 47             |
| Immune_system                                | 49             |
| Intergovernmental_Panel_on_Climate_Change    | 24             |
| Prime_number                                 | 31             |
| Rhine                                        | 44             |
| Scottish_Parliament                          | 39             |
| Islamism                                     | 39             |
| Imperialism                                  | 39             |
| Warsaw                                       | 49             |
| French_and_Indian_War                        | 46             |
| Force                                        | 44             |

The evaluation corpus consists of 1204 pairs of query/context to be ranked.

Firstly, the evaluation scores were computed in cases where both the query and the context are in the same language (French/French).

|     Model (French/French)     |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |    14.47   |   92.19   |   69.77   |    92.03   |    98.09    |    77.74   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) |    5.72    |   36.88   |   69.35   |    95.51   |    98.92    |    79.51   |       0.83       |       0.37      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) |    5.54    |   25.90   |   66.11   |    92.77   |    99.17    |    76.00   |       0.80       |       0.39      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    4.43    |   30.27   |   71.51   |    95.68   |    99.42    |    80.17   |       0.78       |       0.38      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |   15.13   |   60.39    |    57.23   |    83.87    |   96.18   |   66.21   |  0.53   |  0.11  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.49    |    2.58   |   83.55   |    99.17   |     100     |    89.98   |       0.93       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    1.06   |   89.37   |    99.75   |     100     |    93.79   |       0.94       |       0.10      |


Then, we evaluated the model in a cross-language context, with queries in French and contexts in English.

|     Model (French/English)    |  Top-mean  |  Top-std  | Top-1 (%) | Top-10 (%) | Top-100 (%) | MRR (x100) |  mean score Top  |  std score Top  |
|:-----------------------------:|:----------:|:---------:|:---------:|:----------:|:-----------:|:----------:|:----------------:|:---------------:|
|              BM25             |   288.04   |   371.46  |   21.93   |    41.93   |    55.15    |    28.41   |        NA        |        NA       |
| [CamemBERT](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)           |    12.20   |   61.39   |   59.55   |    89.71   |    97.42    |    70.38   |       0.65       |       0.47      |
| [DistilCamemBERT](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR)        |    40.97   |   104.78  |   25.66   |    64.78   |    88.62    |    38.83   |       0.53       |       0.49      |
| [mMiniLMv2-L12](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) |    6.91    |   32.16   |   59.88   |    89.95   |    99.09    |    70.39   |       0.61       |       0.46      |
| [RoBERTa (multilingual)](https://huggingface.co/abbasgolestani/ag-nli-DeTS-sentence-similarity-v2) |    79.32   |   153.62    |   27.91   |    49.50    |    78.16    |   35.41   |   0.40    |  0.12  |
| [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking) |    1.51    |    1.92   |   81.89   |    99.09   |     100     |    88.64   |       0.92       |       0.15      |
| [cmarkea/bloomz-3b-reranking](https://huggingface.co/cmarkea/bloomz-3b-reranking) |    1.22    |    0.98   |   89.20   |    99.84   |     100     |    93.63   |       0.94       |       0.10      |

As observed, the cross-language context does not significantly impact the behavior of our models. If the model were used in a context of reranking and filtering the
Top-K results from a search, a threshold of 0.8 could be applied to filter the contexts outputted by the retriever, thereby reducing noise issues present in the contexts
for RAG-type applications.

How to Use Bloomz-3b-reranking
------------------------------

The following example is based on the API Pipeline of the Transformers library.

```python
from transformers import pipeline

reranker = pipeline(
    task='feature-extraction',
    model='cmarkea/bloomz-3b-reranking',
    top_k=None
)

similarities = reranker(
    [
        dict(
            text=context, # the model was trained with context in `text`
            text_pair=query # and query in `text_pair` argument.
        )
        for context in contexts
    ]
)
contexts_reranked = sorted(
    filter(
        lambda x: x[0]['label'] == "LABEL_1",
        zip(similarities, contexts)
    ),
    key=lambda x: x[0]
)
score, contexts_cleaned = zip(
    *filter(
        lambda x: x[0] >= 0.8
    )
)
```

Citation
--------

```bibtex
@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```