Update Space (evaluate main: 828c6327)
README.md
CHANGED
@@ -1,12 +1,113 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
 ---

-
 ---
+title: XNLI
+emoji: 🤗
+colorFrom: blue
+colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---

+# Metric Card for XNLI
+
+## Metric description
+
+The XNLI metric evaluates a model's score on the [XNLI dataset](https://huggingface.co/datasets/xnli), a subset of a few thousand examples from the [MNLI dataset](https://huggingface.co/datasets/glue/viewer/mnli) that have been translated into 14 different languages, some of which are relatively low-resource, such as Swahili and Urdu.
+
+As with MNLI, the task is to predict textual entailment (does sentence A imply sentence B, contradict it, or neither?), which makes it a classification task (given two sentences, predict one of three labels).
+
+## How to use
+
+The XNLI metric is computed based on the `predictions` (a list of predicted labels) and the `references` (a list of ground truth labels).
+
+```python
+from evaluate import load
+xnli_metric = load("xnli")
+predictions = [0, 1]
+references = [0, 1]
+results = xnli_metric.compute(predictions=predictions, references=references)
+```
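The `predictions` list holds integer class ids rather than raw model scores. As a minimal sketch, assuming a hypothetical array of per-class logits with shape `(num_examples, 3)`, the ids can be obtained with an argmax. The logit values and the helper name `logits_to_predictions` are illustrative and are not part of the `evaluate` API.

```python
import numpy as np


def logits_to_predictions(logits):
    """Map per-class scores of shape (n_examples, n_classes) to integer label ids."""
    return np.argmax(np.asarray(logits), axis=-1).tolist()


# Hypothetical logits for two premise/hypothesis pairs over three classes.
logits = [
    [2.1, 0.3, -1.0],  # highest score at index 0
    [0.2, 1.7, 0.1],   # highest score at index 1
]
predictions = logits_to_predictions(logits)
print(predictions)  # [0, 1]
```

The resulting list can then be passed as `predictions` to `xnli_metric.compute` together with the gold `references`.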
+
+## Output values
+
+The output of the XNLI metric is simply the `accuracy`, i.e. the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
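To make the definition concrete, the following sketch computes the same proportion by hand in plain Python. The function name `accuracy_by_hand` and the label values are illustrative assumptions, not part of the metric module.

```python
def accuracy_by_hand(predictions, references):
    """Return the fraction of positions where prediction and reference agree."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Three of the four hypothetical predictions match their reference.
print(accuracy_by_hand([0, 2, 1, 1], [0, 2, 1, 0]))  # 0.75
```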
+
+### Values from popular papers
+The [original XNLI paper](https://arxiv.org/pdf/1809.05053.pdf) reported accuracies, expressed as percentages, ranging from 59.3 (for `ur`) to 73.7 (for `en`) for the BiLSTM-max model.
+
+For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/xnli).
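Because results are reported per language, it can help to compute one accuracy per language code rather than a single pooled score. The sketch below assumes each example carries a language tag alongside its labels. The `languages` list, the helper name `accuracy_per_language`, and the toy labels are hypothetical and are not inputs of the XNLI metric.

```python
from collections import defaultdict


def accuracy_per_language(predictions, references, languages):
    """Group examples by language code and compute accuracy within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ref, lang in zip(predictions, references, languages):
        total[lang] += 1
        correct[lang] += int(pred == ref)
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy data: two English and two Urdu examples.
scores = accuracy_per_language(
    predictions=[0, 1, 2, 0],
    references=[0, 1, 2, 1],
    languages=["en", "en", "ur", "ur"],
)
print(scores)  # {'en': 1.0, 'ur': 0.5}
```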
+
+## Examples
+
+Maximal values:
+
+```python
+>>> from evaluate import load
+>>> xnli_metric = load("xnli")
+>>> predictions = [0, 1]
+>>> references = [0, 1]
+>>> results = xnli_metric.compute(predictions=predictions, references=references)
+>>> print(results)
+{'accuracy': 1.0}
+```
+
+Minimal values:
+
+```python
+>>> from evaluate import load
+>>> xnli_metric = load("xnli")
+>>> predictions = [1, 0]
+>>> references = [0, 1]
+>>> results = xnli_metric.compute(predictions=predictions, references=references)
+>>> print(results)
+{'accuracy': 0.0}
+```
+
+Partial match:
+
+```python
+>>> from evaluate import load
+>>> xnli_metric = load("xnli")
+>>> predictions = [1, 0, 1]
+>>> references = [1, 0, 0]
+>>> results = xnli_metric.compute(predictions=predictions, references=references)
+>>> print(results)
+{'accuracy': 0.6666666666666666}
+```
+
+## Limitations and bias
+
+While accuracy alone does give a certain indication of performance, it can be supplemented by error analysis and a better understanding of the model's mistakes on each of the categories represented in the dataset, especially if they are unbalanced.
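One way to carry out such an error analysis is to tabulate predictions against references and read off a per-class accuracy. The following is a minimal sketch with NumPy, assuming three integer label ids (0, 1, 2). The helper names and toy labels are illustrative.

```python
import numpy as np


def confusion_matrix(predictions, references, num_classes=3):
    """Count examples with rows indexed by reference label and columns by prediction."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for pred, ref in zip(predictions, references):
        matrix[ref, pred] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each reference class predicted correctly (NaN for absent classes)."""
    totals = matrix.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(totals > 0, np.diag(matrix) / totals, np.nan)


references = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 2, 2, 0]
matrix = confusion_matrix(predictions, references)
print(matrix)
print(per_class_accuracy(matrix))  # class 0: 1.0, classes 1 and 2: 0.5
```

Off-diagonal cells show which classes are confused with each other, which a single pooled accuracy hides.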
+
+While the XNLI dataset is multilingual and represents a diversity of languages, in reality, cross-lingual sentence understanding goes beyond translation, given that there are many cultural differences that have an impact on human sentiment annotations. Since the XNLI dataset was obtained by translation based on English sentences, it does not capture these cultural differences.
+
+
+
+## Citation
+
+```bibtex
+@InProceedings{conneau2018xnli,
+  author = "Conneau, Alexis
+        and Rinott, Ruty
+        and Lample, Guillaume
+        and Williams, Adina
+        and Bowman, Samuel R.
+        and Schwenk, Holger
+        and Stoyanov, Veselin",
+  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+        in Natural Language Processing",
+  year = "2018",
+  publisher = "Association for Computational Linguistics",
+  location = "Brussels, Belgium",
+}
+```
+
+## Further References
+
+- [XNLI Dataset GitHub](https://github.com/facebookresearch/XNLI)
+- [HuggingFace Tasks -- Text Classification](https://huggingface.co/tasks/text-classification)
app.py
ADDED
@@ -0,0 +1,6 @@
+import evaluate
+from evaluate.utils import launch_gradio_widget
+
+
+module = evaluate.load("xnli")
+launch_gradio_widget(module)
requirements.txt
ADDED
@@ -0,0 +1,3 @@
+# TODO: fix github to release
+git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+datasets~=2.0
xnli.py
ADDED
@@ -0,0 +1,88 @@
+# Copyright 2020 The HuggingFace Evaluate Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" XNLI benchmark metric. """
+
+import datasets
+
+import evaluate
+
+
+_CITATION = """\
+@InProceedings{conneau2018xnli,
+  author = "Conneau, Alexis
+        and Rinott, Ruty
+        and Lample, Guillaume
+        and Williams, Adina
+        and Bowman, Samuel R.
+        and Schwenk, Holger
+        and Stoyanov, Veselin",
+  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+        in Natural Language Processing",
+  year = "2018",
+  publisher = "Association for Computational Linguistics",
+  location = "Brussels, Belgium",
+}
+"""
+
+_DESCRIPTION = """\
+XNLI is a subset of a few thousand examples from MNLI which have been translated
+into 14 different languages (some low-ish resource). As with MNLI, the goal is
+to predict textual entailment (does sentence A imply/contradict/neither sentence
+B), framed as a classification task (given two sentences, predict one of three
+labels).
+"""
+
+_KWARGS_DESCRIPTION = """
+Computes XNLI score which is just simple accuracy.
+Args:
+    predictions: Predicted labels.
+    references: Ground truth labels.
+Returns:
+    'accuracy': accuracy
+Examples:
+
+    >>> predictions = [0, 1]
+    >>> references = [0, 1]
+    >>> xnli_metric = evaluate.load("xnli")
+    >>> results = xnli_metric.compute(predictions=predictions, references=references)
+    >>> print(results)
+    {'accuracy': 1.0}
+"""
+
+
+def simple_accuracy(preds, labels):
+    return (preds == labels).mean()
+
+
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Xnli(evaluate.EvaluationModule):
+    def _info(self):
+        return evaluate.EvaluationModuleInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("int64" if self.config_name != "sts-b" else "float32"),
+                    "references": datasets.Value("int64" if self.config_name != "sts-b" else "float32"),
+                }
+            ),
+            codebase_urls=[],
+            reference_urls=[],
+            format="numpy",
+        )
+
+    def _compute(self, predictions, references):
+        return {"accuracy": simple_accuracy(predictions, references)}
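A note on `simple_accuracy` in `xnli.py`: the expression `(preds == labels).mean()` relies on element-wise comparison, which presumably works here because the module declares `format="numpy"` and so receives arrays. The sketch below re-declares the function for illustration and shows what happens with plain lists. The wrapper `simple_accuracy_from_lists` is a hypothetical helper, not part of the module.

```python
import numpy as np


def simple_accuracy(preds, labels):
    # Same expression as in xnli.py; expects NumPy arrays.
    return (preds == labels).mean()


def simple_accuracy_from_lists(preds, labels):
    """Hypothetical wrapper that converts list inputs before comparing."""
    return float(simple_accuracy(np.asarray(preds), np.asarray(labels)))


print(simple_accuracy_from_lists([1, 0, 1], [1, 0, 0]))  # 0.6666666666666666

# With plain lists, `==` compares the lists as a whole and yields a single bool,
# which has no `.mean()` method.
try:
    simple_accuracy([1, 0, 1], [1, 0, 0])
except AttributeError as error:
    print("plain lists fail:", error)
```

Calling the function directly outside the module is therefore only safe after converting the inputs to arrays.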