mfajcik committed on
Commit feb5abd • 1 Parent(s): da8b87f

Update content.py

Files changed (1)
  1. content.py +1 -1
content.py CHANGED
@@ -11,7 +11,7 @@ Here, you can compare models on tasks in the Czech language or submit your own m
  - Check out the **About** page for a brief overview of our evaluation protocol, win score mechanism, citation details, and future plans for this benchmark.
  - __How scoring works__:
  - On each task, we score every model using one of our metrics (Accuracy for multiple choice tasks, Word Perplexity for language modeling, AUROC for classification).
- - On each task, for each model pair, we evaluate a __duel__: a statistical significant test (with alpha 5%) that the model's improvement in metric is significant.
+ - On each task for each model pair, we perform a _duel_: a statistical significance test (with a 5% alpha level) to determine if the model's improvement in the metric is significant.
  - For each task, the __Duel Win Score__ reflects the proportion of duels a model has won.
  - Category scores are calculated by averaging scores across all tasks within that category. When viewing a specific category (other than Overall), the "Average" column displays the Category Duel Win Scores.
  - The __Overall__ Duel Win Score is the average across all category scores. When selecting the Overall category, the "Average" column shows the Overall Duel Win Score.
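
Reviewer note: the aggregation described in the scoring bullets can be sketched roughly as below. This is not the leaderboard's actual code. The function names, the `WINS` set, and the toy models, tasks, and categories are all hypothetical, and the significance test itself is replaced by a precomputed set of duel outcomes because the text here does not say which test is used.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

# Hypothetical precomputed duel outcomes: (winner, loser, task) means the winner's
# improvement over the loser on that task was significant at alpha = 5%.
# The real test is not specified in the text, so the outcomes are given as data.
WINS: Set[Tuple[str, str, str]] = {
    ("A", "B", "t1"), ("A", "C", "t1"),
    ("A", "B", "t2"),
    ("B", "C", "t3"),
}
MODELS: List[str] = ["A", "B", "C"]
CATEGORIES: Dict[str, List[str]] = {"cat1": ["t1", "t2"], "cat2": ["t3"]}


def duel_win_score(model: str, task: str) -> float:
    """Proportion of duels the model won on one task."""
    opponents = [m for m in MODELS if m != model]
    if not opponents:
        return 0.0  # assumption: no opponents means no duels won
    won = sum((model, other, task) in WINS for other in opponents)
    return won / len(opponents)


def category_score(model: str, category: str) -> float:
    """Average of the per-task Duel Win Scores within one category."""
    return mean(duel_win_score(model, task) for task in CATEGORIES[category])


def overall_score(model: str) -> float:
    """Average across category scores (not across all tasks)."""
    return mean(category_score(model, category) for category in CATEGORIES)


if __name__ == "__main__":
    for m in MODELS:
        print(m, overall_score(m))
```

Note that the Overall score averages category scores, so a category with few tasks weighs as much as one with many. That follows the wording of the bullets and is worth confirming against the About page.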