Zongxia committed on
Commit
aab8fb8
1 Parent(s): 3b931a0

Roberta large answer equivalence

Files changed (7)
  1. README.md +263 -3
  2. config.json +25 -0
  3. merges.txt +0 -0
  4. model.safetensors +3 -0
  5. special_tokens_map.json +51 -0
  6. tokenizer_config.json +56 -0
  7. vocab.json +0 -0
README.md CHANGED
@@ -1,3 +1,263 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ inference: false
3
+ license: mit
4
+ language:
5
+ - en
6
+ metrics:
7
+ - exact_match
8
+ - f1
9
+ - bertscore
10
+ pipeline_tag: text-classification
11
+ ---
12
+ # QA-Evaluation-Metrics
13
+
14
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
+
17
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides a set of basic, efficient metrics for assessing the performance of QA models.
18
+
19
+ ### Updates
20
+ - Updated to version 0.2.8
21
+ - Now supports prompting OpenAI GPT-series models and Claude-series models (assuming `openai` version > 1.0).
22
+ - Supports prompting various open-source models such as LLaMA-2-70B-chat, LLaVA-1.5, etc. by calling the API from [deepinfra](https://deepinfra.com/models).
23
+
24
+
25
+ ## Installation
26
+ * Python version >= 3.6
27
+ * openai version >= 1.0
28
+
29
+
30
+ To install the package, run the following command:
31
+
32
+ ```bash
33
+ pip install qa-metrics
34
+ ```
35
+
36
+ ## Usage/Logistics
37
+
38
+ The Python package currently provides six QA evaluation methods.
39
+ - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise.
40
+ - The evaluation methods differ in how strictly they judge the correctness of a candidate answer. Some correlate more closely with human judgments than others.
41
+ - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, Hotpot QA, TriviaQA, SQuAD, etc.
42
+ - Question/Answer Type Evaluation and Transformer Neural Evaluation are cost-free and suitable for both short-form and longer-form QA datasets. They correlate more closely with human judgments than exact match and F1 score when the gold and candidate answers become long.
43
+ - Black-box LLM evaluations are the closest to human evaluations, but they are not cost-free.
44
+
45
+ ## Normalized Exact Match
46
+ #### `em_match`
47
+
48
+ Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
49
+
50
+ **Parameters**
51
+
52
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
53
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
54
+
55
+ **Returns**
56
+
57
+ - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.
58
+
59
+ ```python
60
+ from qa_metrics.em import em_match
61
+
62
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
63
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
64
+ match_result = em_match(reference_answer, candidate_answer)
65
+ print("Exact Match: ", match_result)
66
+ '''
67
+ Exact Match: False
68
+ '''
69
+ ```
70
+
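To make "normalized" concrete, here is a minimal sketch of an exact-match check. It assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the package's actual normalization is not shown in this README and may differ. The names `normalize_answer` and `em_match_sketch` are hypothetical and are not part of `qa_metrics`.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization (an assumption, not the package's code)."""
    text = text.lower()
    # Drop ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove English articles as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


def em_match_sketch(reference_answers, candidate_answer) -> bool:
    """True if the normalized candidate equals any normalized reference."""
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


references = ["The Frog Prince", "The Princess and the Frog"]
print(em_match_sketch(references, "the princess and the frog!"))
print(em_match_sketch(references, 'The movie "The Princess and the Frog" is loosely based off "Iron Henry"'))
```

Under this sketch, a full-sentence candidate does not match even when it contains a gold answer, which is why exact match suits short-form answers.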
71
+ ## F1 Score
72
+ #### `f1_score_with_precision_recall`
73
+
74
+ Calculates F1 score, precision, and recall between a reference and a candidate answer.
75
+
76
+ **Parameters**
77
+
78
+ - `reference_answer` (str): A gold (correct) answer to the question.
79
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
80
+
81
+ **Returns**
82
+
83
+ - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
84
+
85
+ ```python
86
+ from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
87
+
88
+ f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
89
+ print("F1 stats: ", f1_stats)
90
+ '''
91
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
92
+ '''
93
+
94
+ match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
95
+ print("F1 Match: ", match_result)
96
+ '''
97
+ F1 Match: False
98
+ '''
99
+ ```
100
+
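For readers unfamiliar with token-level F1, the following sketch shows the textbook bag-of-tokens computation and a thresholded match over several references. It assumes plain lowercased whitespace tokenization, precision measured against the candidate and recall against the reference, and a `>=` comparison with the threshold. The package's tokenization and conventions may differ, so this sketch is not expected to reproduce the figures above. `token_f1_sketch` and `f1_match_sketch` are hypothetical names.

```python
from collections import Counter


def token_f1_sketch(reference: str, candidate: str) -> dict:
    """Bag-of-tokens F1 between one reference and one candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def f1_match_sketch(reference_answers, candidate: str, threshold: float = 0.5) -> bool:
    """True if any reference reaches the F1 threshold (assumed inclusive)."""
    return any(
        token_f1_sketch(ref, candidate)["f1"] >= threshold
        for ref in reference_answers
    )


print(token_f1_sketch("Paris", "The capital is Paris"))
print(f1_match_sketch(["Paris"], "Paris France", threshold=0.5))
```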
101
+ ## Transformer Neural Evaluation
102
+ Our fine-tuned BERT model is on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our Package also supports downloading and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
103
+
104
+ #### `transformer_match`
105
+
106
+ Returns True if the candidate answer is a match of any of the gold answers.
107
+
108
+ **Parameters**
109
+
110
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
111
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
112
+ - `question` (str): The question for which the answers are being evaluated.
113
+
114
+ **Returns**
115
+
116
+ - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.
117
+
118
+ ```python
119
+ from qa_metrics.transformerMatcher import TransformerMatcher
120
+
121
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
122
+ tm = TransformerMatcher("roberta-large")
123
+ scores = tm.get_scores(reference_answer, candidate_answer, question)
124
+ match_result = tm.transformer_match(reference_answer, candidate_answer, question)
125
+ print("Score: %s; TM Match: %s" % (scores, match_result))
126
+ '''
127
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
128
+ '''
129
+ ```
130
+
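The nested dictionary printed above maps each gold answer to the candidate and its score. As a rough illustration of working with that shape, the sketch below picks the best-scoring pair and applies a cutoff. The 0.5 threshold is an assumption made for illustration only (the matcher's real decision rule is not documented here), and `best_pair` and `is_match` are hypothetical helpers rather than package functions.

```python
def best_pair(scores: dict) -> tuple:
    """Return (reference, candidate, score) for the highest-scoring pair.

    Raises ValueError if `scores` is empty.
    """
    return max(
        (
            (ref, cand, score)
            for ref, by_candidate in scores.items()
            for cand, score in by_candidate.items()
        ),
        key=lambda item: item[2],
    )


def is_match(scores: dict, threshold: float = 0.5) -> bool:
    """Treat the candidate as correct if its best score reaches the threshold."""
    return best_pair(scores)[2] >= threshold


# Same shape as the get_scores output, with the candidate text shortened.
candidate = "The movie is The Princess and the Frog"
scores = {
    "The Frog Prince": {candidate: 0.6934309},
    "The Princess and the Frog": {candidate: 0.7400551},
}
print(best_pair(scores))
print(is_match(scores))
```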
131
+ ## Efficient and Robust Question/Answer Type Evaluation
132
+ #### 1. `get_highest_score`
133
+
134
+ Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
135
+
136
+ **Parameters**
137
+
138
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
139
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
140
+ - `question` (str): The question for which the answers are being evaluated.
141
+
142
+ **Returns**
143
+
144
+ - `tuple`: The (gold answer, candidate answer) pair with the highest matching score, together with that score.
145
+
146
+ #### 2. `get_scores`
147
+
148
+ Returns all the gold answer and candidate answer pairs' matching scores.
149
+
150
+ **Parameters**
151
+
152
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
153
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
154
+ - `question` (str): The question for which the answers are being evaluated.
155
+
156
+ **Returns**
157
+
158
+ - `dictionary`: A nested dictionary mapping each gold answer to the candidate answer and its matching score.
159
+
160
+ #### 3. `evaluate`
161
+
162
+ Returns True if the candidate answer is a match of any of the gold answers.
163
+
164
+ **Parameters**
165
+
166
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
167
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
168
+ - `question` (str): The question for which the answers are being evaluated.
169
+
170
+ **Returns**
171
+
172
+ - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.
173
+
174
+
175
+ ```python
176
+ from qa_metrics.pedant import PEDANT
177
+
178
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
179
+ pedant = PEDANT()
180
+ scores = pedant.get_scores(reference_answer, candidate_answer, question)
181
+ max_pair, highest_scores = pedant.get_highest_score(reference_answer, candidate_answer, question)
182
+ match_result = pedant.evaluate(reference_answer, candidate_answer, question)
183
+ print("Max Pair: %s; Highest Score: %s" % (max_pair, highest_scores))
184
+ print("Score: %s; PANDA Match: %s" % (scores, match_result))
185
+ '''
186
+ Max Pair: ('the princess and the frog', 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'); Highest Score: 0.854451712151719
187
+ Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PANDA Match: True
188
+ '''
189
+ ```
190
+
191
+ ```python
192
+ print(pedant.get_score(reference_answer[1], candidate_answer, question))
193
+ '''
194
+ 0.7122460127464126
195
+ '''
196
+ ```
197
+
198
+
199
+ ## Prompting LLM For Evaluation
200
+
201
+ Note: The prompting function can be used for any prompting purposes.
202
+
203
+ ###### OpenAI
204
+ ```python
205
+ from qa_metrics.prompt_llm import CloseLLM
206
+ model = CloseLLM()
207
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
208
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
209
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
210
+
211
+ '''
212
+ 'correct'
213
+ '''
214
+ ```
215
+
216
+ ###### Anthropic
217
+ ```python
218
+ model = CloseLLM()
219
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
220
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
221
+
222
+ '''
223
+ 'correct'
224
+ '''
225
+ ```
226
+
227
+ ###### deepinfra (See below for descriptions of more models)
228
+ ```python
229
+ from qa_metrics.prompt_open_llm import OpenLLM
230
+ model = OpenLLM()
231
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
232
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
233
+
234
+ '''
235
+ 'correct'
236
+ '''
237
+ ```
238
+
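When using an LLM as a judge, it can help to build the prompt from its parts and to turn the free-text reply into a boolean. The sketch below follows the prompt wording used in the OpenAI example above. `build_judge_prompt` and `parse_verdict` are hypothetical helpers (not part of `qa_metrics`), and the parsing rule is an assumption about how a model might phrase its reply.

```python
from typing import Optional


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble a judge prompt in the same layout as the example above."""
    return (
        f"question: {question}\n"
        f"reference: {reference}\n"
        f"candidate: {candidate}\n"
        "Is the candidate answer correct based on the question and reference answer? "
        "Please only output correct or incorrect."
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is not recognizable."""
    text = reply.strip().strip("'\".").lower()
    # Check "incorrect" first, since it is the longer, more specific label.
    if text.startswith("incorrect"):
        return False
    if text.startswith("correct"):
        return True
    return None


prompt = build_judge_prompt("What is the Capital of France?", "Paris", "The capital is Paris")
print(prompt)
print(parse_verdict("Correct."))
```

Returning `None` for unrecognized replies keeps malformed outputs from being silently counted as wrong answers.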
239
+ If you find this repo useful, please cite our paper:
240
+ ```bibtex
241
+ @misc{li2024panda,
242
+ title={PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation},
243
+ author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
244
+ year={2024},
245
+ eprint={2402.11161},
246
+ archivePrefix={arXiv},
247
+ primaryClass={cs.CL}
248
+ }
249
+ ```
250
+
251
+
252
+ ## Updates
253
+ - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset has been expanded and the leaderboard has been updated.
254
+ - Our training dataset is adapted and augmented from [Bulian et al.](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and the QA evaluation test sets discussed in our paper.
255
+ - Our package now supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta) and [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), smaller and more robust matching models than BERT!
256
+
257
+ ## License
258
+
259
+ This project is licensed under the [MIT License](LICENSE.md) - see the LICENSE file for details.
260
+
261
+ ## Contact
262
+
263
+ For any additional questions or comments, please contact [zli12321@umd.edu].
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "architectures": [
3
+ "RobertaForMaskedLM"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 1024,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 4096,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "roberta",
17
+ "num_attention_heads": 16,
18
+ "num_hidden_layers": 24,
19
+ "pad_token_id": 1,
20
+ "position_embedding_type": "absolute",
21
+ "transformers_version": "4.37.2",
22
+ "type_vocab_size": 1,
23
+ "use_cache": true,
24
+ "vocab_size": 50265
25
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d2b043d81bf1338ec2253ddab50e86a2b2bc6a39e9e8ba3fbb5f63816299abcd
3
+ size 135
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": true,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": true,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": true,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<s>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<pad>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "50264": {
37
+ "content": "<mask>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ }
44
+ },
45
+ "bos_token": "<s>",
46
+ "clean_up_tokenization_spaces": true,
47
+ "cls_token": "<s>",
48
+ "eos_token": "</s>",
49
+ "errors": "replace",
50
+ "mask_token": "<mask>",
51
+ "model_max_length": 1000000000000000019884624838656,
52
+ "pad_token": "<pad>",
53
+ "sep_token": "</s>",
54
+ "tokenizer_class": "RobertaTokenizer",
55
+ "unk_token": "<unk>"
56
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff