Update Space (evaluate main: 828c6327)
Files changed:
- README.md +118 -4
- app.py +6 -0
- record_evaluation.py +111 -0
- requirements.txt +4 -0
- super_glue.py +237 -0
README.md
CHANGED
```diff
@@ -1,12 +1,126 @@
 ---
-title:
-emoji:
-colorFrom:
+title: SuperGLUE
+emoji: 🤗
+colorFrom: blue
 colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---
```

The rest of the hunk adds the metric card content:

# Metric Card for SuperGLUE

## Metric description
This metric is used to compute the SuperGLUE evaluation metric associated with each of the subsets of the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).

SuperGLUE is a benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.

## How to use

There are two steps: (1) loading the SuperGLUE metric relevant to the subset of the dataset being used for evaluation; and (2) calculating the metric.

1. **Loading the relevant SuperGLUE metric**: the subsets of SuperGLUE are the following: `boolq`, `cb`, `copa`, `multirc`, `record`, `rte`, `wic`, `wsc`, `wsc.fixed`, `axb`, `axg`.

More information about the different subsets of the SuperGLUE dataset can be found on the [SuperGLUE dataset page](https://huggingface.co/datasets/super_glue) and on the [official dataset website](https://super.gluebenchmark.com/).

2. **Calculating the metric**: the metric takes two inputs: one list with the predictions of the model to score and one list of reference labels. The structure of both inputs depends on the SuperGLUE subset being used:

Format of `predictions`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `prediction_text`: the predicted answer text
- for `multirc`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question-answer pair as specified by the dataset
    - `prediction`: the predicted answer label
- otherwise: list of predicted labels

Format of `references`:
- for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `answers`: list of possible answers
- otherwise: list of reference labels

```python
from evaluate import load
super_glue_metric = load('super_glue', 'copa')
predictions = [0, 1]
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
```
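For subsets that take structured inputs, predictions and references are lists of dictionaries rather than plain labels. The sketch below mirrors the `record` example from the module docstring; the `idx` and answer values are purely illustrative:

```python
from evaluate import load
super_glue_metric = load('super_glue', 'record')
predictions = [{'idx': {'passage': 0, 'query': 0}, 'prediction_text': 'answer'}]
references = [{'idx': {'passage': 0, 'query': 0}, 'answers': ['answer', 'another_answer']}]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'exact_match': 1.0, 'f1': 1.0}
```

For `multirc`, the same pattern applies, with a `prediction` label in place of `prediction_text` (see the MultiRC example under Examples below).
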
## Output values

The output of the metric depends on the SuperGLUE subset chosen: it is a dictionary containing one or several of the following metrics:

`accuracy`: the proportion of predictions that exactly match the reference labels. Its range is 0-1, where 1.0 means all predictions are correct. (See [Accuracy](https://huggingface.co/metrics/accuracy) for more information.)

`exact_match`: a given predicted string's exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise. (See [Exact Match](https://huggingface.co/metrics/exact_match) for more information.)

`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.

### Values from popular papers
The [original SuperGLUE paper](https://arxiv.org/pdf/1905.00537.pdf) reported average scores ranging from 47 to 71.5%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).

For more recent model performance, see the [dataset leaderboard](https://super.gluebenchmark.com/leaderboard).

## Examples

Maximal values for the COPA subset (which outputs `accuracy`):

```python
from evaluate import load
super_glue_metric = load('super_glue', 'copa')  # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
predictions = [0, 1]
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'accuracy': 1.0}
```

Minimal values for the MultiRC subset (which outputs `exact_match`, `f1_m` and `f1_a`):

```python
from evaluate import load
super_glue_metric = load('super_glue', 'multirc')
predictions = [{'idx': {'answer': 0, 'paragraph': 0, 'question': 0}, 'prediction': 0}, {'idx': {'answer': 1, 'paragraph': 2, 'question': 3}, 'prediction': 1}]
references = [1, 0]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'exact_match': 0.0, 'f1_m': 0.0, 'f1_a': 0.0}
```

Partial match for the AX-b subset (which outputs `matthews_correlation`):

```python
from evaluate import load
super_glue_metric = load('super_glue', 'axb')
references = [0, 1]
predictions = [1, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'matthews_correlation': 0.0}
```

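Maximal values for the CB subset (which outputs both `accuracy` and `f1`), mirroring the `cb` example in the module docstring:

```python
from evaluate import load
super_glue_metric = load('super_glue', 'cb')
predictions = [0, 1]
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'accuracy': 1.0, 'f1': 1.0}
```
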
## Limitations and bias
This metric works only with datasets that have the same format as the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).

The dataset also includes Winogender, a subset of the dataset that is designed to measure gender bias in coreference resolution systems. However, as noted in the SuperGLUE paper, this subset has its limitations: *"It offers only positive predictive value: A poor bias score is clear evidence that a model exhibits gender bias, but a good score does not mean that the model is unbiased. [...] Also, Winogender does not cover all forms of social bias, or even all forms of gender. For instance, the version of the data used here offers no coverage of gender-neutral they or non-binary pronouns."*

## Citation

```bibtex
@article{wang2019superglue,
  title={Super{GLUE}: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}
```

## Further References

- [SuperGLUE benchmark homepage](https://super.gluebenchmark.com/)
app.py
ADDED
@@ -0,0 +1,6 @@
```python
import evaluate
from evaluate.utils import launch_gradio_widget


# Load the SuperGLUE metric module and expose it as an interactive Gradio widget.
module = evaluate.load("super_glue")
launch_gradio_widget(module)
```
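The widget wraps the same module that can be loaded programmatically; a minimal sketch, with `copa` chosen only as an illustrative config:

```python
import evaluate

# Load one SuperGLUE config and compute its metric directly (no Gradio widget).
module = evaluate.load("super_glue", "copa")
print(module.compute(predictions=[0, 1], references=[0, 1]))  # {'accuracy': 1.0}
```
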
record_evaluation.py
ADDED
@@ -0,0 +1,111 @@
```python
"""
Official evaluation script for ReCoRD v1.0.
(Some functions are adopted from the SQuAD evaluation script.)
"""


import argparse
import json
import re
import string
import sys
from collections import Counter


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def evaluate(dataset, predictions):
    f1 = exact_match = total = 0
    correct_ids = []
    for passage in dataset:
        for qa in passage["qas"]:
            total += 1
            if qa["id"] not in predictions:
                message = f'Unanswered question {qa["id"]} will receive score 0.'
                print(message, file=sys.stderr)
                continue

            ground_truths = list(map(lambda x: x["text"], qa["answers"]))
            prediction = predictions[qa["id"]]

            _exact_match = metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
            if int(_exact_match) == 1:
                correct_ids.append(qa["id"])
            exact_match += _exact_match

            f1 += metric_max_over_ground_truths(f1_score, prediction, ground_truths)

    exact_match = exact_match / total
    f1 = f1 / total

    return {"exact_match": exact_match, "f1": f1}, correct_ids


if __name__ == "__main__":
    expected_version = "1.0"
    parser = argparse.ArgumentParser("Official evaluation script for ReCoRD v1.0.")
    parser.add_argument("data_file", help="The dataset file in JSON format.")
    parser.add_argument("pred_file", help="The model prediction file in JSON format.")
    parser.add_argument("--output_correct_ids", action="store_true", help="Output the correctly answered query IDs.")
    args = parser.parse_args()

    with open(args.data_file) as data_file:
        dataset_json = json.load(data_file)
        if dataset_json["version"] != expected_version:
            print(
                f'Evaluation expects v-{expected_version}, but got dataset with v-{dataset_json["version"]}',
                file=sys.stderr,
            )
        dataset = dataset_json["data"]

    with open(args.pred_file) as pred_file:
        predictions = json.load(pred_file)

    metrics, correct_ids = evaluate(dataset, predictions)

    if args.output_correct_ids:
        print(f"Output {len(correct_ids)} correctly answered question IDs.")
        with open("correct_ids.json", "w") as f:
            json.dump(correct_ids, f)
```
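The script can also be imported rather than run from the command line. A minimal sketch of calling `evaluate()` directly, assuming a ReCoRD-format data file and a prediction file that maps query IDs to answer strings (both file names below are hypothetical):

```python
# Minimal sketch: score ReCoRD predictions without going through the CLI.
# "record_dev.json" and "record_predictions.json" are hypothetical file names.
import json

from record_evaluation import evaluate

with open("record_dev.json") as f:
    dataset = json.load(f)["data"]  # list of passages, each with a "qas" list
with open("record_predictions.json") as f:
    predictions = json.load(f)  # mapping: query id -> predicted answer text

metrics, correct_ids = evaluate(dataset, predictions)
print(metrics)  # {'exact_match': ..., 'f1': ...}
```
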
requirements.txt
ADDED
@@ -0,0 +1,4 @@
```
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
sklearn
```
super_glue.py
ADDED
@@ -0,0 +1,237 @@
```python
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The SuperGLUE benchmark metric."""

import datasets
from sklearn.metrics import f1_score, matthews_corrcoef

import evaluate

from .record_evaluation import evaluate as evaluate_record


_CITATION = """\
@article{wang2019superglue,
  title={SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}
"""

_DESCRIPTION = """\
SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after
GLUE with a new set of more difficult language understanding tasks, improved
resources, and a new public leaderboard.
"""

_KWARGS_DESCRIPTION = """
Compute SuperGLUE evaluation metric associated with each SuperGLUE dataset.
Args:
    predictions: list of predictions to score. Depending on the SuperGLUE subset:
        - for 'record': list of question-answer dictionaries with the following keys:
            - 'idx': index of the question as specified by the dataset
            - 'prediction_text': the predicted answer text
        - for 'multirc': list of question-answer dictionaries with the following keys:
            - 'idx': index of the question-answer pair as specified by the dataset
            - 'prediction': the predicted answer label
        - otherwise: list of predicted labels
    references: list of reference labels. Depending on the SuperGLUE subset:
        - for 'record': list of question-answer dictionaries with the following keys:
            - 'idx': index of the question as specified by the dataset
            - 'answers': list of possible answers
        - otherwise: list of reference labels
Returns: depending on the SuperGLUE subset:
    - for 'record':
        - 'exact_match': Exact match between answer and gold answer
        - 'f1': F1 score
    - for 'multirc':
        - 'exact_match': Exact match between answer and gold answer
        - 'f1_m': Per-question macro-F1 score
        - 'f1_a': Average F1 score over all answers
    - for 'axb':
        - 'matthews_correlation': Matthews Correlation
    - for 'cb':
        - 'accuracy': Accuracy
        - 'f1': F1 score
    - for all others:
        - 'accuracy': Accuracy
Examples:

    >>> super_glue_metric = evaluate.load('super_glue', 'copa')  # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
    >>> predictions = [0, 1]
    >>> references = [0, 1]
    >>> results = super_glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> super_glue_metric = evaluate.load('super_glue', 'cb')
    >>> predictions = [0, 1]
    >>> references = [0, 1]
    >>> results = super_glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> super_glue_metric = evaluate.load('super_glue', 'record')
    >>> predictions = [{'idx': {'passage': 0, 'query': 0}, 'prediction_text': 'answer'}]
    >>> references = [{'idx': {'passage': 0, 'query': 0}, 'answers': ['answer', 'another_answer']}]
    >>> results = super_glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'exact_match': 1.0, 'f1': 1.0}

    >>> super_glue_metric = evaluate.load('super_glue', 'multirc')
    >>> predictions = [{'idx': {'answer': 0, 'paragraph': 0, 'question': 0}, 'prediction': 0}, {'idx': {'answer': 1, 'paragraph': 2, 'question': 3}, 'prediction': 1}]
    >>> references = [0, 1]
    >>> results = super_glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'exact_match': 1.0, 'f1_m': 1.0, 'f1_a': 1.0}

    >>> super_glue_metric = evaluate.load('super_glue', 'axb')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = super_glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
"""


def simple_accuracy(preds, labels):
    return float((preds == labels).mean())


def acc_and_f1(preds, labels, f1_avg="binary"):
    acc = simple_accuracy(preds, labels)
    f1 = float(f1_score(y_true=labels, y_pred=preds, average=f1_avg))
    return {
        "accuracy": acc,
        "f1": f1,
    }


def evaluate_multirc(ids_preds, labels):
    """
    Computes F1 score and Exact Match for MultiRC predictions.
    """
    question_map = {}
    for id_pred, label in zip(ids_preds, labels):
        question_id = f'{id_pred["idx"]["paragraph"]}-{id_pred["idx"]["question"]}'
        pred = id_pred["prediction"]
        if question_id in question_map:
            question_map[question_id].append((pred, label))
        else:
            question_map[question_id] = [(pred, label)]
    f1s, ems = [], []
    for question, preds_labels in question_map.items():
        question_preds, question_labels = zip(*preds_labels)
        f1 = f1_score(y_true=question_labels, y_pred=question_preds, average="macro")
        f1s.append(f1)
        em = int(sum(p == l for p, l in preds_labels) == len(preds_labels))
        ems.append(em)
    f1_m = float(sum(f1s) / len(f1s))
    em = sum(ems) / len(ems)
    f1_a = float(f1_score(y_true=labels, y_pred=[id_pred["prediction"] for id_pred in ids_preds]))
    return {"exact_match": em, "f1_m": f1_m, "f1_a": f1_a}


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class SuperGlue(evaluate.EvaluationModule):
    def _info(self):
        if self.config_name not in [
            "boolq",
            "cb",
            "copa",
            "multirc",
            "record",
            "rte",
            "wic",
            "wsc",
            "wsc.fixed",
            "axb",
            "axg",
        ]:
            raise KeyError(
                "You should supply a configuration name selected in "
                '["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc", "wsc.fixed", "axb", "axg",]'
            )
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(self._get_feature_types()),
            codebase_urls=[],
            reference_urls=[],
            format="numpy" if not self.config_name == "record" and not self.config_name == "multirc" else None,
        )

    def _get_feature_types(self):
        if self.config_name == "record":
            return {
                "predictions": {
                    "idx": {
                        "passage": datasets.Value("int64"),
                        "query": datasets.Value("int64"),
                    },
                    "prediction_text": datasets.Value("string"),
                },
                "references": {
                    "idx": {
                        "passage": datasets.Value("int64"),
                        "query": datasets.Value("int64"),
                    },
                    "answers": datasets.Sequence(datasets.Value("string")),
                },
            }
        elif self.config_name == "multirc":
            return {
                "predictions": {
                    "idx": {
                        "answer": datasets.Value("int64"),
                        "paragraph": datasets.Value("int64"),
                        "question": datasets.Value("int64"),
                    },
                    "prediction": datasets.Value("int64"),
                },
                "references": datasets.Value("int64"),
            }
        else:
            return {
                "predictions": datasets.Value("int64"),
                "references": datasets.Value("int64"),
            }

    def _compute(self, predictions, references):
        if self.config_name == "axb":
            return {"matthews_correlation": matthews_corrcoef(references, predictions)}
        elif self.config_name == "cb":
            return acc_and_f1(predictions, references, f1_avg="macro")
        elif self.config_name == "record":
            dataset = [
                {
                    "qas": [
                        {"id": ref["idx"]["query"], "answers": [{"text": ans} for ans in ref["answers"]]}
                        for ref in references
                    ]
                }
            ]
            predictions = {pred["idx"]["query"]: pred["prediction_text"] for pred in predictions}
            return evaluate_record(dataset, predictions)[0]
        elif self.config_name == "multirc":
            return evaluate_multirc(predictions, references)
        elif self.config_name in ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]:
            return {"accuracy": simple_accuracy(predictions, references)}
        else:
            raise KeyError(
                "You should supply a configuration name selected in "
                '["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc", "wsc.fixed", "axb", "axg",]'
            )
```