lvwerra HF staff committed on
Commit 8ce6b12 · 1 Parent(s): 1e356d4

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +106 -5
  2. app.py +6 -0
  3. requirements.txt +3 -0
  4. xnli.py +88 -0
README.md CHANGED
@@ -1,12 +1,113 @@
  ---
- title: Xnli
- emoji: 😻
- colorFrom: gray
- colorTo: pink
+ title: XNLI
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for XNLI
+
+ ## Metric description
+
+ The XNLI metric evaluates a model's score on the [XNLI dataset](https://huggingface.co/datasets/xnli), a subset of a few thousand examples from the [MNLI dataset](https://huggingface.co/datasets/glue/viewer/mnli) that have been translated into 14 different languages, some of which are relatively low-resource, such as Swahili and Urdu.
+
+ As with MNLI, the task is to predict textual entailment: given two sentences, decide whether sentence A entails sentence B, contradicts it, or neither. It is therefore a three-way classification task.
+
+ ## How to use
+
+ The XNLI metric is computed based on the `predictions` (a list of predicted labels) and the `references` (a list of ground truth labels).
+
+ ```python
+ from evaluate import load
+ xnli_metric = load("xnli")
+ predictions = [0, 1]
+ references = [0, 1]
+ results = xnli_metric.compute(predictions=predictions, references=references)
+ ```
+
+ ## Output values
+
+ The output of the XNLI metric is simply the `accuracy`, i.e. the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
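As a rough illustration of what this value means, the sketch below recomputes accuracy by hand in plain Python. The helper name `accuracy` is made up for this note and is not part of the `evaluate` API; the metric itself works on NumPy arrays.

```python
def accuracy(predictions, references):
    """Fraction of positions where the predicted label equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


print(accuracy([0, 1], [0, 1]))        # every prediction matches
print(accuracy([1, 0, 1], [1, 0, 0]))  # two of three predictions match
```

The two calls mirror the "Maximal values" and "Partial match" examples from the metric card.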
+
+ ### Values from popular papers
+ The [original XNLI paper](https://arxiv.org/pdf/1809.05053.pdf) reported accuracies ranging from 59.3 (for `ur`) to 73.7 (for `en`) for the BiLSTM-max model.
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/xnli).
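The paper reports one accuracy per language, whereas the metric returns a single number for whatever inputs it receives. A minimal sketch of per-language scoring follows, assuming each example carries a language code alongside its labels. The `per_language_accuracy` helper and the sample data are hypothetical and only show the shape of the computation.

```python
from collections import defaultdict


def per_language_accuracy(languages, predictions, references):
    """Return {language code: accuracy} for parallel lists of equal length."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, ref in zip(languages, predictions, references):
        total[lang] += 1
        correct[lang] += int(pred == ref)
    return {lang: correct[lang] / total[lang] for lang in total}


# Made-up toy data, only to show the shape of the inputs and the output.
languages = ["en", "en", "ur", "ur"]
predictions = [0, 2, 1, 1]
references = [0, 2, 1, 0]
scores = per_language_accuracy(languages, predictions, references)
print(scores)
```

The same breakdown can be obtained by calling `xnli_metric.compute` once per language subset.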
+
+ ## Examples
+
+ Maximal values:
+
+ ```python
+ >>> from evaluate import load
+ >>> xnli_metric = load("xnli")
+ >>> predictions = [0, 1]
+ >>> references = [0, 1]
+ >>> results = xnli_metric.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'accuracy': 1.0}
+ ```
+
+ Minimal values:
+
+ ```python
+ >>> from evaluate import load
+ >>> xnli_metric = load("xnli")
+ >>> predictions = [1, 0]
+ >>> references = [0, 1]
+ >>> results = xnli_metric.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'accuracy': 0.0}
+ ```
+
+ Partial match:
+
+ ```python
+ >>> from evaluate import load
+ >>> xnli_metric = load("xnli")
+ >>> predictions = [1, 0, 1]
+ >>> references = [1, 0, 0]
+ >>> results = xnli_metric.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'accuracy': 0.6666666666666666}
+ ```
+
+ ## Limitations and bias
+
+ While accuracy alone does give a certain indication of performance, it can be supplemented by error analysis and a better understanding of the model's mistakes on each of the categories represented in the dataset, especially if they are unbalanced.
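One possible shape for such an error analysis is a per-class breakdown. The sketch below assumes the label order 0 = entailment, 1 = neutral, 2 = contradiction, which should be checked against the dataset actually being evaluated. The `per_class_report` helper and the toy data are hypothetical.

```python
from collections import Counter

# Assumed label order; check it against the dataset you are evaluating on.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def per_class_report(predictions, references):
    """Return per-class recall and a Counter of (reference, prediction) pairs."""
    confusion = Counter(zip(references, predictions))
    support = Counter(references)
    recall = {
        LABEL_NAMES[label]: confusion[(label, label)] / count
        for label, count in support.items()
    }
    return recall, confusion


# Made-up toy data.
predictions = [0, 0, 1, 2, 0, 2]
references = [0, 1, 1, 2, 2, 2]
recall, confusion = per_class_report(predictions, references)
print(recall)
print(confusion[(2, 0)])  # contradiction examples predicted as entailment
```

Per-class recall makes an unbalanced label distribution visible in a way that a single accuracy figure does not.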
+
+ While the XNLI dataset is multilingual and represents a diversity of languages, in reality, cross-lingual sentence understanding goes beyond translation, given that there are many cultural differences that have an impact on human sentiment annotations. Since the XNLI dataset was obtained by translation based on English sentences, it does not capture these cultural differences.
+
+
+
+ ## Citation
+
+ ```bibtex
+ @InProceedings{conneau2018xnli,
+   author = "Conneau, Alexis
+         and Rinott, Ruty
+         and Lample, Guillaume
+         and Williams, Adina
+         and Bowman, Samuel R.
+         and Schwenk, Holger
+         and Stoyanov, Veselin",
+   title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+   booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+         in Natural Language Processing",
+   year = "2018",
+   publisher = "Association for Computational Linguistics",
+   location = "Brussels, Belgium",
+ }
+ ```
+
+ ## Further References
+
+ - [XNLI Dataset GitHub](https://github.com/facebookresearch/XNLI)
+ - [HuggingFace Tasks -- Text Classification](https://huggingface.co/tasks/text-classification)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("xnli")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
xnli.py ADDED
@@ -0,0 +1,88 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ XNLI benchmark metric. """
+
+ import datasets
+
+ import evaluate
+
+
+ _CITATION = """\
+ @InProceedings{conneau2018xnli,
+   author = "Conneau, Alexis
+         and Rinott, Ruty
+         and Lample, Guillaume
+         and Williams, Adina
+         and Bowman, Samuel R.
+         and Schwenk, Holger
+         and Stoyanov, Veselin",
+   title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+   booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+         in Natural Language Processing",
+   year = "2018",
+   publisher = "Association for Computational Linguistics",
+   location = "Brussels, Belgium",
+ }
+ """
+
+ _DESCRIPTION = """\
+ XNLI is a subset of a few thousand examples from MNLI which has been translated
+ into 14 different languages (some relatively low-resource). As with MNLI, the goal is
+ to predict textual entailment (does sentence A imply/contradict/neither sentence
+ B) and is a classification task (given two sentences, predict one of three
+ labels).
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes XNLI score which is just simple accuracy.
+ Args:
+     predictions: Predicted labels.
+     references: Ground truth labels.
+ Returns:
+     'accuracy': accuracy
+ Examples:
+
+     >>> predictions = [0, 1]
+     >>> references = [0, 1]
+     >>> xnli_metric = evaluate.load("xnli")
+     >>> results = xnli_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'accuracy': 1.0}
+ """
+
+
+ def simple_accuracy(preds, labels):
+     return (preds == labels).mean()
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Xnli(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("int64" if self.config_name != "sts-b" else "float32"),
+                     "references": datasets.Value("int64" if self.config_name != "sts-b" else "float32"),
+                 }
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+             format="numpy",
+         )
+
+     def _compute(self, predictions, references):
+         return {"accuracy": simple_accuracy(predictions, references)}
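A note on `simple_accuracy`: as far as can be told from this diff, it relies on `format="numpy"` in `_info`, so that `preds == labels` is an element-wise comparison on arrays. On plain Python lists the same expression yields a single `bool`, which has no `.mean()`. The sketch below shows the difference using only the standard library. `simple_accuracy_lists` is a hypothetical list-based stand-in, not code from this commit.

```python
preds = [1, 0, 1]
labels = [1, 0, 0]

# On lists, == compares the whole sequences and returns one bool.
whole_list_equal = preds == labels
print(whole_list_equal, hasattr(whole_list_equal, "mean"))


def simple_accuracy_lists(preds, labels):
    """List-based stand-in for (preds == labels).mean() on NumPy arrays."""
    matches = [p == y for p, y in zip(preds, labels)]
    return sum(matches) / len(matches)


print(simple_accuracy_lists(preds, labels))
```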