Dang Phuong Nam
committed on
Update README.md
README.md
CHANGED
---
license: apache-2.0
language:
- vi
library_name: transformers
pipeline_tag: text-classification
tags:
- transformers
- cross-encoder
- rerank
datasets:
- unicamp-dl/mmarco
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam.
  output:
  - label: >-
      nghệ an có diện tích lớn nhất việt nam
  - label: >-
      bắc ninh có diện tích nhỏ nhất việt nam
    score: 0.05
---

# Reranker

* [Usage](#usage)
    * [Using FlagEmbedding](#using-flagembedding)
    * [Using Huggingface transformers](#using-huggingface-transformers)
* [Fine-tune](#fine-tune)
    * [Data format](#data-format)

Unlike an embedding model, a reranker takes a query and a document together as input and directly outputs a
similarity score instead of an embedding.
Pass a query and a passage to the reranker to get a relevance score.
Applying the sigmoid function maps that score to a float value in [0, 1].

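As a minimal sketch of the sigmoid mapping described above, the snippet below converts raw relevance scores into values in [0, 1] using only the standard library. The helper name `sigmoid` is illustrative and is not part of any library API. The sample inputs are the raw scores printed in the FlagEmbedding example in this README.

```python
import math


def sigmoid(score: float) -> float:
    """Map a raw relevance score (a logit) to a float in [0, 1]."""
    # Numerically stable form: never calls exp() on a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_s = math.exp(score)
    return exp_s / (1.0 + exp_s)


raw_scores = [-5.65234375, 0.0, 5.26171875]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

Because the sigmoid is monotonic, normalizing scores this way changes their scale but not the ranking order of the passages.
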
## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation with a slight performance degradation.
reranker = FlagReranker('namdp/bge-reranker-vietnamese', use_fp16=True)

# Score a single (query, passage) pair.
score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# Setting normalize=True applies the sigmoid function and maps the score into [0, 1].
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

# Score several pairs in one call.
pairs = [['what is panda?', 'hi'], ['what is panda?',
         'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
scores = reranker.compute_score(pairs)
print(scores)  # [-8.1875, 5.26171875]

# The same pairs with scores mapped into [0, 1].
scores = reranker.compute_score(pairs, normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]
```

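The scores are typically used to reorder a list of candidate passages for one query. The sketch below shows that step in isolation. The names `rerank`, `score_pairs`, and `toy_scorer` are hypothetical and not part of FlagEmbedding; the toy scorer only counts shared words so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (passage, score) tuples, highest score first."""
    pairs = [[query, passage] for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_scorer(pairs: List[List[str]]) -> List[float]:
    """Stand-in scorer (an assumption for illustration): count shared words."""
    return [float(len(set(q.split()) & set(p.split()))) for q, p in pairs]


results = rerank(
    "giant panda",
    ["hi", "the giant panda is a bear species", "a panda eats bamboo"],
    toy_scorer,
    top_k=2,
)
print(results)
```

With a loaded model, a function such as `reranker.compute_score` could presumably be passed as `score_pairs`, since it accepts a list of `[query, passage]` pairs and returns one score per pair.
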
### Using Huggingface transformers

```
pip install -U transformers torch
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()  # inference mode: disables dropout

pairs = [['what is panda?', 'hi'], ['what is panda?',
         'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    # Tokenize each (query, passage) pair together, truncating to 512 tokens.
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Flatten the logits to a 1-D tensor of raw relevance scores, one per pair.
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```

## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a JSON object like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. A `prompt` field,
where one is used, indicates the relationship between the query and the texts; it does not appear in the format above.
If you have no negative texts for a query, you can randomly sample some from the entire corpus as the negatives.
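
A minimal sketch of producing one such training line, with negatives sampled at random from a corpus as suggested above. The helper name `build_example`, the fixed seed, and the two "unrelated passage" strings are assumptions made for illustration; the query and the other texts come from the widget example in this model card.

```python
import json
import random
from typing import Dict, List


def build_example(query: str, positives: List[str], corpus: List[str],
                  num_negatives: int, rng: random.Random) -> Dict[str, object]:
    """Build one training record, sampling negatives at random from the corpus."""
    # Exclude the positives so a relevant text is never used as a negative.
    candidates = [text for text in corpus if text not in positives]
    negatives = rng.sample(candidates, k=min(num_negatives, len(candidates)))
    return {"query": query, "pos": positives, "neg": negatives}


corpus = [
    "nghệ an có diện tích lớn nhất việt nam",
    "bắc ninh có diện tích nhỏ nhất việt nam",
    "unrelated passage A",  # toy placeholder text
    "unrelated passage B",  # toy placeholder text
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
record = build_example(
    query="tỉnh nào có diện tích lớn nhất việt nam.",
    positives=["nghệ an có diện tích lớn nhất việt nam"],
    corpus=corpus,
    num_negatives=2,
    rng=rng,
)

# One JSON object per line; ensure_ascii=False keeps Vietnamese text readable.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing each `line` followed by a newline to a file yields the one-object-per-line layout described above. Random negatives are the simplest option; harder negatives, if available, can be placed in `neg` the same way.
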