saattrupdan committed
Commit 79f114e (parent: 5d40291)

feat: Add citation

Files changed (1): app.py (+22, −7)
app.py CHANGED
@@ -42,13 +42,12 @@ available](https://scandeval.com).
 The generative models are evaluated using in-context learning with few-shot prompts.
 The few-shot examples are sampled randomly from the training split, and we benchmark
 the models 10 times with bootstrapped test sets and different few-shot examples in each
-iteration. This allows us to better measure the uncertainty of the results.
-
-We use the uncertainty in the radial plot when we compute the win ratios (i.e., the
-percentage of other models that a model beats on a task). Namely, we compute the win
-ratio as the percentage of other models that a model _significantly_ beats on a task,
-where we use a paired t-test with a significance level of 0.05 to determine whether a
-model significantly beats another model.
+iteration. This allows us to better measure the uncertainty of the results. We use the
+uncertainty in the radial plot when we compute the win ratios (i.e., the percentage of
+other models that a model beats on a task). Namely, we compute the win ratio as the
+percentage of other models that a model _significantly_ beats on a task, where we use a
+paired t-test with a significance level of 0.05 to determine whether a model
+significantly beats another model.
 
 ## The Benchmark Datasets
 
@@ -104,6 +103,22 @@ classification, we use the probabilities of the answer letter (a, b, c or d) to
 the answer. The datasets in this task are machine translated versions of the
 [HellaSwag](https://rowanzellers.com/hellaswag/) dataset. We use the Matthews
 Correlation Coefficient (MCC) as the evaluation metric.
+
+
+## Citation
+
+If you use the ScandEval benchmark in your work, please cite [the
+paper](https://aclanthology.org/2023.nodalida-1.20):
+
+```
+@inproceedings{nielsen2023scandeval,
+  title={ScandEval: A Benchmark for Scandinavian Natural Language Processing},
+  author={Nielsen, Dan},
+  booktitle={Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
+  pages={185--201},
+  year={2023}
+}
+```
 """
 
 
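The win-ratio computation described in the first hunk can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from this repository: the model names and scores are invented, and using `scipy.stats.ttest_rel` with a two-sided p-value plus a sign check is one plausible reading of "paired t-test with a significance level of 0.05".

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired scores: one array per model, with one score per
# bootstrapped iteration (10 iterations, as described above).
scores = {
    "model-a": np.array([0.61, 0.63, 0.60, 0.62, 0.64, 0.59, 0.62, 0.61, 0.63, 0.60]),
    "model-b": np.array([0.55, 0.58, 0.54, 0.57, 0.56, 0.53, 0.58, 0.55, 0.56, 0.54]),
    "model-c": np.array([0.60, 0.62, 0.59, 0.61, 0.63, 0.58, 0.61, 0.60, 0.62, 0.59]),
}

def win_ratio(model: str, alpha: float = 0.05) -> float:
    """Share of the other models that `model` beats significantly."""
    others = [m for m in scores if m != model]
    wins = 0
    for other in others:
        # A win requires a higher mean (positive t-statistic) AND a
        # significant paired difference at the chosen level (assumption:
        # two-sided test combined with a sign check).
        stat, p_value = ttest_rel(scores[model], scores[other])
        if stat > 0 and p_value < alpha:
            wins += 1
    return wins / len(others)

for model in scores:
    print(f"{model}: win ratio = {win_ratio(model):.0%}")
```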
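Likewise, the MCC metric mentioned in the second hunk can be computed with scikit-learn's `matthews_corrcoef`; the gold labels and predictions below are made up for illustration:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical gold labels and predictions for a four-way
# multiple-choice task with answer letters a, b, c and d.
y_true = ["a", "c", "b", "d", "a", "b", "c", "d"]
y_pred = ["a", "c", "b", "a", "a", "b", "d", "d"]

# MCC is 1.0 for perfect agreement, around 0.0 at chance level
# and -1.0 for total disagreement.
print(matthews_corrcoef(y_true, y_pred))
```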