lvwerra HF staff committed on
Commit d64af6b
1 Parent(s): 0c5cd2d

Update Space (evaluate main: 3cd38e2b)

Files changed (4)
  1. README.md +63 -5
  2. app.py +6 -0
  3. requirements.txt +3 -0
  4. wilcoxon.py +78 -0
README.md CHANGED
@@ -1,12 +1,70 @@
  ---
  title: Wilcoxon
- emoji: 💻
- colorFrom: yellow
- colorTo: gray
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
- sdk_version: 3.1.4
+ sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - comparison
+ description: >-
+   Wilcoxon's test is a signed-rank test for comparing paired samples.
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Comparison Card for Wilcoxon
+
+ ## Comparison description
+
+ Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
+
+ ## How to use
+
+ The Wilcoxon comparison is used to analyze paired ordinal data.
+
+ ## Inputs
+
+ Its arguments are:
+
+ `predictions1`: a list of predictions from the first model.
+
+ `predictions2`: a list of predictions from the second model.
+
+ ## Output values
+
+ The Wilcoxon comparison outputs two things:
+
+ `stat`: The Wilcoxon statistic.
+
+ `p`: The p value.
+
+ ## Examples
+
+ Example comparison:
+
+ ```python
+ wilcoxon = evaluate.load("wilcoxon")
+ results = wilcoxon.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
+ print(results)
+ {'stat': 5.0, 'p': 0.625}
+ ```
+
+ ## Limitations and bias
+
+ The Wilcoxon test is a non-parametric test, so it has relatively few assumptions (basically only that the observations are independent). It should be used to analyze paired ordinal data only.
+
+ ## Citations
+
+ ```bibtex
+ @incollection{wilcoxon1992individual,
+   title={Individual comparisons by ranking methods},
+   author={Wilcoxon, Frank},
+   booktitle={Breakthroughs in statistics},
+   pages={196--202},
+   year={1992},
+   publisher={Springer}
+ }
+ ```
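
The `{'stat': 5.0, 'p': 0.625}` output in the README example can be checked by hand. The following is a minimal from-scratch sketch, not the code path the module uses (the commit calls `scipy.stats.wilcoxon`). The function name `signed_rank_exact` is hypothetical, and the sketch assumes a small sample with no zero or tied differences, so that the exact null distribution can be enumerated over all sign patterns.

```python
from itertools import product


def signed_rank_exact(predictions1, predictions2):
    """Exact two-sided signed-rank test (illustrative helper, not from the commit)."""
    d = [p1 - p2 for p1, p2 in zip(predictions1, predictions2)]
    if any(x == 0 for x in d) or len({abs(x) for x in d}) != len(d):
        raise ValueError("this sketch does not handle zero or tied differences")

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    stat = float(min(w_plus, w_minus))

    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2**n possible positive-rank sums.
    n = len(d)
    sums = [
        sum(r for r, keep in zip(range(1, n + 1), signs) if keep)
        for signs in product((0, 1), repeat=n)
    ]
    p_low = sum(s <= w_plus for s in sums) / len(sums)
    p_high = sum(s >= w_plus for s in sums) / len(sums)
    p = min(1.0, 2 * min(p_low, p_high))
    return {"stat": stat, "p": p}


result = signed_rank_exact(
    [-7, 123.45, 43, 4.91, 5],
    [1337.12, -9.74, 1, 2, 3.21],
)
print(result)  # {'stat': 5.0, 'p': 0.625}
```

Here the only negative difference has the largest magnitude (rank 5), the positive ranks sum to 10, and 10 of the 32 sign patterns are at least as extreme in one tail, giving 2 × 10/32 = 0.625.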
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("wilcoxon", module_type="comparison")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ git+https://github.com/huggingface/evaluate@a45df1eb9996eec64ec3282ebe554061cb366388
+ datasets~=2.0
+ scipy
wilcoxon.py ADDED
@@ -0,0 +1,78 @@
+ # Copyright 2022 The HuggingFace Evaluate Authors
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Wilcoxon test for model comparison."""
+
+ import datasets
+ from scipy.stats import wilcoxon
+
+ import evaluate
+
+
+ _DESCRIPTION = """
+ Wilcoxon's test is a non-parametric signed-rank test that tests whether the distribution of the differences is symmetric about zero. It can be used to compare the predictions of two models.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Args:
+     predictions1 (`list` of `float`): Predictions for model 1.
+     predictions2 (`list` of `float`): Predictions for model 2.
+
+ Returns:
+     stat (`float`): Wilcoxon test score.
+     p (`float`): The p value. Minimum possible value is 0. Maximum possible value is 1.0. A lower p value means a more significant difference.
+
+ Examples:
+     >>> wilcoxon = evaluate.load("wilcoxon")
+     >>> results = wilcoxon.compute(predictions1=[-7, 123.45, 43, 4.91, 5], predictions2=[1337.12, -9.74, 1, 2, 3.21])
+     >>> print(results)
+     {'stat': 5.0, 'p': 0.625}
+ """
+
+
+ _CITATION = """
+ @incollection{wilcoxon1992individual,
+   title={Individual comparisons by ranking methods},
+   author={Wilcoxon, Frank},
+   booktitle={Breakthroughs in statistics},
+   pages={196--202},
+   year={1992},
+   publisher={Springer}
+ }
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Wilcoxon(evaluate.Comparison):
+     def _info(self):
+         return evaluate.ComparisonInfo(
+             module_type="comparison",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions1": datasets.Value("float"),
+                     "predictions2": datasets.Value("float"),
+                 }
+             ),
+         )
+
+     def _compute(self, predictions1, predictions2):
+         # calculate difference
+         d = [p1 - p2 for (p1, p2) in zip(predictions1, predictions2)]
+
+         # compute statistic
+         res = wilcoxon(d)
+         return {"stat": res.statistic, "p": res.pvalue}
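
One detail of `_compute` worth noting: the differences are built with `zip`, which stops at the shorter input. This diff does not show whether the surrounding framework already rejects inputs of unequal length, so the sketch below only illustrates the behavior of the list comprehension itself. The helper `paired_differences` is hypothetical and not part of the commit.

```python
def paired_differences(predictions1, predictions2):
    """Return element-wise differences, refusing inputs of unequal length.

    Hypothetical helper for illustration; it is not part of the commit.
    """
    if len(predictions1) != len(predictions2):
        raise ValueError(
            f"expected paired samples, got lengths "
            f"{len(predictions1)} and {len(predictions2)}"
        )
    return [p1 - p2 for p1, p2 in zip(predictions1, predictions2)]


# Plain zip() stops at the shorter input, so the unmatched 3.0 is dropped silently.
truncated = [p1 - p2 for p1, p2 in zip([1.0, 2.0, 3.0], [0.5, 0.5])]
print(truncated)  # [0.5, 1.5]
```

An explicit length check of this kind would turn a silent truncation into an error, which may matter for a paired test where every observation needs a partner.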