jbnayahu commited on
Commit
70b49b3
·
1 Parent(s): fa55545

Updated to latest benchbench

Browse files

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>

app.py CHANGED
@@ -112,7 +112,7 @@ with st.sidebar:
112
 
113
  n_models_taken_list = [n_models_taken_list]
114
 
115
- n_exps = 5
116
 
117
  submitted = st.form_submit_button(label="Run BAT")
118
 
@@ -393,7 +393,7 @@ st.dataframe(
393
  column_order=cols_used,
394
  hide_index=True,
395
  use_container_width=True,
396
- height=300,
397
  column_config={col: {"alignment": "center"} for col in cols_used},
398
  )
399
 
@@ -448,6 +448,36 @@ with right:
448
  primaryClass={cs.CL},
449
  url={https://arxiv.org/abs/2407.13696},
450
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
451
 
452
 
453
  @misc{berkeley-function-calling-leaderboard,
 
112
 
113
  n_models_taken_list = [n_models_taken_list]
114
 
115
+ n_exps = 3
116
 
117
  submitted = st.form_submit_button(label="Run BAT")
118
 
 
393
  column_order=cols_used,
394
  hide_index=True,
395
  use_container_width=True,
396
+ height=500,
397
  column_config={col: {"alignment": "center"} for col in cols_used},
398
  )
399
 
 
448
  primaryClass={cs.CL},
449
  url={https://arxiv.org/abs/2407.13696},
450
  }
451
+
452
+ @misc{decentralized2024,
453
+ title = {Decentralized Arena via Collective LLM Intelligence: Building Automated, Robust, and Transparent LLM Evaluation for Numerous Dimensions},
454
+ author = {Yanbin Yin AND Zhen Wang AND Kun Zhou AND Xiangdong Zhang AND Shibo Hao AND Yi Gu AND Jieyuan Liu AND Somanshu Singla AND Tianyang Liu AND Xing, Eric P. AND Zhengzhong Liu AND Haojian Jin AND Zhiting Hu},
455
+ year = 2024,
456
+ month = 10,
457
+ url = {https://de-arena.maitrix.org/}
458
+ }
459
+
460
+ @techreport{balachandran2024eureka,
461
+ author = {Balachandran, Vidhisha and Chen, Jingya and Joshi, Neel and Nushi, Besmira and Palangi, Hamid and Salinas, Eduardo and Vineet, Vibhav and Woffinden-Luey, James and Yousefi, Safoora},
462
+ title = {EUREKA: Evaluating and Understanding Large Foundation Models},
463
+ institution = {Microsoft},
464
+ year = {2024},
465
+ month = {September},
466
+ abstract = {Rigorous and reproducible evaluation of large foundation models is critical for assessing the state of the art, informing next steps in model improvement, and for guiding scientific advances in Artificial Intelligence (AI). Evaluation is also important for informing the increasing number of application developers that build services on foundation models. The evaluation process has however become challenging in practice due to several reasons that require immediate attention from the community, including benchmark saturation, lack of transparency in the methods being deployed for measurement, development challenges in extracting the right measurements for generative tasks, and, more generally, the extensive number of capabilities that need to be considered for showing a well-rounded comparison across models. In addition, despite the overwhelming numbers of side-by-side capability evaluations available, we still lack a deeper understanding about when and how different models fail for a given capability and whether the nature of failures is similar across different models being released over time.
467
+
468
+ We make three contributions to alleviate the above challenges. First, we present Eureka, a reusable and open evaluation framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art foundation models and (ii) represent fundamental but overlooked capabilities for completing tasks in both language and vision modalities. The available space for improvement that comes inherently from non-saturated benchmarks, enables us to discover meaningful differences between models at a capability level. Third, using the framework and Eureka-Bench, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights for failure understanding and model comparison by disaggregating the measurements across important subcategories of data. Such insights uncover granular weaknesses of models for a given capability and can then be further leveraged to plan more precisely on what areas are most promising for improvement. Eureka is available as open-source to foster transparent and reproducible evaluation practices.
469
+
470
+ In contrast to recent trends in evaluation reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for several capabilities. Despite the many observed improvements, it also becomes obvious that current models still struggle with a number of fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.},
471
+ url = {https://www.microsoft.com/en-us/research/publication/eureka-evaluating-and-understanding-large-foundation-models/},
472
+ number = {MSR-TR-2024-33},
473
+ }
474
+
475
+ @article{hsieh2024ruler,
476
+ title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
477
+ author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
478
+ year={2024},
479
+ journal={arXiv preprint arXiv:2404.06654},
480
+ }
481
 
482
 
483
  @misc{berkeley-function-calling-leaderboard,
cache_old/aggregate_scoress_cache_151f5bfbf87ac7384c2759731c72ec0c.csv DELETED
@@ -1,122 +0,0 @@
1
- model,score
2
- gpt_4o_2024_05_13,0.9847612958226769
3
- claude_3_5_sonnet_20240620,0.982905982905983
4
- gpt_4o_2024_08_06,0.9575873827791986
5
- gpt_4_turbo_2024_04_09,0.9428463693169576
6
- gpt_4_0125_preview,0.9171132221004344
7
- mistral_large_2407,0.8868286445012787
8
- llama3_1_405b_instruct,0.8672150411280846
9
- yi_large_preview,0.8641553641553642
10
- hermes_3_llama3_1_70b,0.8626160990712074
11
- smaug_qwen2_72b_instruct,0.8593911248710011
12
- claude_3_opus_20240229,0.8573567665639277
13
- llama3_1_70b_instruct,0.8528408270971201
14
- athene_70b,0.8493788819875776
15
- deepseek_coder_v2,0.8444160272804775
16
- qwen2_72b_instruct,0.8354710666091739
17
- yi_large,0.8346273291925466
18
- gpt_4_0613,0.8146763722211293
19
- llama3_70b_instruct,0.8127546753337573
20
- llama3_70b,0.8105600539811066
21
- gemma_2_27b_it,0.8045273029120115
22
- gpt_4o_mini_2024_07_18,0.8032033326150972
23
- gemma_2_9b_it_dpo,0.790057915057915
24
- llama3_instruct_8b_simpo,0.7884068278805121
25
- phi_3_5_moe_instruct,0.7808307533539731
26
- qwen1_5_110b_chat,0.776004448721167
27
- qwen1_5_32b,0.7658569500674763
28
- yi_1_5_34b_chat,0.7553884711779449
29
- llama_2_70b,0.7303193882141251
30
- mixtral_8x22b_instruct_v0_1,0.7256023690940907
31
- gemma_2_9b_it_simpo,0.7199248120300753
32
- qwen1_5_32b_chat,0.7149122807017544
33
- mixtral_8x22b_v0_1,0.7135490753911806
34
- yi_34b,0.7128879892037787
35
- internlm2_5_20b_chat,0.6842105263157895
36
- phi_3_small_128k_instruct,0.66937564499484
37
- phi_3_medium_4k_instruct,0.6675079642841117
38
- claude_3_sonnet_20240229,0.653911731916847
39
- gemma_2_9b_it,0.6422797189051059
40
- infinity_instruct_3m_0625_llama3_8b,0.6273115220483642
41
- mistral_v0_1_7b,0.6239316239316239
42
- phi_3_5_mini_instruct,0.6202270381836945
43
- mistral_medium,0.6122209165687427
44
- mistral_large_2402,0.6058211467418628
45
- claude_instant_1_2,0.6049896049896051
46
- claude_2_0,0.6020066889632107
47
- yi_1_5_9b_chat,0.5881787802840435
48
- qwen1_5_14b,0.5770917678812416
49
- command_r_plus,0.5761033510394125
50
- llama_65b,0.5736992052781527
51
- gpt_3_5_turbo_0613,0.5724018332713985
52
- qwen1_5_72b_chat,0.5668371367348349
53
- phi_3_mini_4k_instruct,0.5548245614035088
54
- deepseek_llm_67b_chat,0.5506756756756757
55
- claude_3_haiku_20240307,0.549424005945745
56
- yi_34b_chat,0.5455449728905107
57
- dbrx_instructruct,0.5344129554655871
58
- jurassic_2_jumbo_178b,0.532051282051282
59
- llama3_1_8b_instruct,0.5175232440678665
60
- claude_2_1,0.5110980545763154
61
- qwen2_7b_instruct,0.5034227726178191
62
- mistral_small_2402,0.49924585218702866
63
- mixtral_8x7b_v0_1,0.49324324324324326
64
- glm_4_9b_chat,0.46499582289055974
65
- qwen1_5_14b_chat,0.4621068436857911
66
- phi_3_small_8k_instruct,0.45481670929241264
67
- gpt_3_5_turbo_0301,0.4528985507246377
68
- snorkel_mistral_pairrm_dpo,0.4521151586368978
69
- gemma_7b,0.4471997300944669
70
- gpt_3_5_turbo_0125,0.4401920188365201
71
- llama3_8b,0.43302968960863697
72
- dbrx_instruct,0.4266409266409266
73
- llama3_8b_instruct,0.420135922511747
74
- phi_3_mini_128k_instruct,0.4153205904787544
75
- llama_2_13b,0.41490478332583597
76
- jurassic_2_grande_17b,0.39529914529914534
77
- openhermes_2_5_mistral_7b,0.3832617447168531
78
- mistral_7b_v0_3,0.3737553342816501
79
- mixtral_8x7b_instruct_v0_1,0.3713078251895724
80
- qwen1_5_7b,0.3508771929824561
81
- yi_1_5_6b_chat,0.3354636591478697
82
- falcon_40b,0.32812265707002547
83
- command_r,0.32386140074759
84
- internlm2_chat_20b,0.32252252252252256
85
- mistral_7b_v0_2,0.31970128022759603
86
- luminous_supreme_70b,0.30128205128205127
87
- starling_lm_7b_alpha,0.29823530624445954
88
- yi_6b,0.29234143049932526
89
- mistral_7b_instruct_v0_2,0.28609513981031004
90
- zephyr_7b_alpha,0.2838442157327606
91
- zephyr_7b_beta,0.2666234345800909
92
- gemma_1_1_7b_it,0.26226051061156724
93
- mistral_7b_instruct_v0_3,0.2537839697282422
94
- starling_lm_7b_beta,0.25234441602728047
95
- llama_2_7b,0.2391288049182786
96
- luminous_extended_30b,0.2329059829059829
97
- alpaca_7b,0.22072072072072071
98
- vicuna_33b_v1_3,0.2056404230317274
99
- phi_2,0.20087901666849037
100
- qwen2_1_5b_instruct,0.19711042311661506
101
- yi_6b_chat,0.1938854489164087
102
- qwen1_5_7b_chat,0.1916569245052217
103
- tulu_2_dpo_70b,0.17624223602484473
104
- qwen1_5_4b_chat,0.1674406604747162
105
- llama_2_70b_chat,0.15527950310559005
106
- gpt_neox_20b,0.14400584795321636
107
- vicuna_7b_v1_5,0.13619501854795973
108
- falcon_40b_instruct,0.13264580369843526
109
- gemma_7b_it,0.12136319058515854
110
- falcon_7b,0.11407257459889038
111
- gpt_j_6b,0.10160818713450293
112
- luminous_base_13b,0.08333333333333333
113
- llama_2_7b_chat,0.08304448781801049
114
- gemma_1_1_2b_it,0.07665903890160183
115
- olmo_7b,0.06545209176788123
116
- gemma_2b_it,0.05921052631578947
117
- qwen1_5_1_8b_chat,0.059167526659786716
118
- qwen2_0_5b_instruct,0.059081527347781215
119
- pythia_12b,0.054093567251461985
120
- pythia_6_9b,0.019736842105263157
121
- falcon_7b_instruct,0.013513513513513514
122
- qwen1_5_0_5b_chat,0.013157894736842105
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/aggregate_scoress_cache_1b58bbc4e0d124b0a524da1001369741.csv DELETED
@@ -1,122 +0,0 @@
1
- model,score
2
- gpt_4o_2024_05_13,0.9847612958226769
3
- claude_3_5_sonnet_20240620,0.982905982905983
4
- gpt_4o_2024_08_06,0.9575873827791986
5
- gpt_4_turbo_2024_04_09,0.9428463693169576
6
- gpt_4_0125_preview,0.9171132221004344
7
- mistral_large_2407,0.8868286445012787
8
- llama3_1_405b_instruct,0.8672150411280846
9
- yi_large_preview,0.8641553641553642
10
- hermes_3_llama3_1_70b,0.8626160990712074
11
- smaug_qwen2_72b_instruct,0.8593911248710011
12
- claude_3_opus_20240229,0.8573567665639277
13
- llama3_1_70b_instruct,0.8528408270971201
14
- athene_70b,0.8493788819875776
15
- deepseek_coder_v2,0.8444160272804775
16
- qwen2_72b_instruct,0.8354710666091739
17
- yi_large,0.8346273291925466
18
- gpt_4_0613,0.8146763722211293
19
- llama3_70b_instruct,0.8127546753337573
20
- llama3_70b,0.8105600539811066
21
- gemma_2_27b_it,0.8045273029120115
22
- gpt_4o_mini_2024_07_18,0.8032033326150972
23
- gemma_2_9b_it_dpo,0.790057915057915
24
- llama3_instruct_8b_simpo,0.7884068278805121
25
- phi_3_5_moe_instruct,0.7808307533539731
26
- qwen1_5_110b_chat,0.776004448721167
27
- qwen1_5_32b,0.7658569500674763
28
- yi_1_5_34b_chat,0.7553884711779449
29
- llama_2_70b,0.7303193882141251
30
- mixtral_8x22b_instruct_v0_1,0.7256023690940907
31
- gemma_2_9b_it_simpo,0.7199248120300753
32
- qwen1_5_32b_chat,0.7149122807017544
33
- mixtral_8x22b_v0_1,0.7135490753911806
34
- yi_34b,0.7128879892037787
35
- internlm2_5_20b_chat,0.6842105263157895
36
- phi_3_small_128k_instruct,0.66937564499484
37
- phi_3_medium_4k_instruct,0.6675079642841117
38
- claude_3_sonnet_20240229,0.653911731916847
39
- gemma_2_9b_it,0.6422797189051059
40
- infinity_instruct_3m_0625_llama3_8b,0.6273115220483642
41
- mistral_v0_1_7b,0.6239316239316239
42
- phi_3_5_mini_instruct,0.6202270381836945
43
- mistral_medium,0.6122209165687427
44
- mistral_large_2402,0.6058211467418628
45
- claude_instant_1_2,0.6049896049896051
46
- claude_2_0,0.6020066889632107
47
- yi_1_5_9b_chat,0.5881787802840435
48
- qwen1_5_14b,0.5770917678812416
49
- command_r_plus,0.5761033510394125
50
- llama_65b,0.5736992052781527
51
- gpt_3_5_turbo_0613,0.5724018332713985
52
- qwen1_5_72b_chat,0.5668371367348349
53
- phi_3_mini_4k_instruct,0.5548245614035088
54
- deepseek_llm_67b_chat,0.5506756756756757
55
- claude_3_haiku_20240307,0.549424005945745
56
- yi_34b_chat,0.5455449728905107
57
- dbrx_instructruct,0.5344129554655871
58
- jurassic_2_jumbo_178b,0.532051282051282
59
- llama3_1_8b_instruct,0.5175232440678665
60
- claude_2_1,0.5110980545763154
61
- qwen2_7b_instruct,0.5034227726178191
62
- mistral_small_2402,0.49924585218702866
63
- mixtral_8x7b_v0_1,0.49324324324324326
64
- glm_4_9b_chat,0.46499582289055974
65
- qwen1_5_14b_chat,0.4621068436857911
66
- phi_3_small_8k_instruct,0.45481670929241264
67
- gpt_3_5_turbo_0301,0.4528985507246377
68
- snorkel_mistral_pairrm_dpo,0.4521151586368978
69
- gemma_7b,0.4471997300944669
70
- gpt_3_5_turbo_0125,0.4401920188365201
71
- llama3_8b,0.43302968960863697
72
- dbrx_instruct,0.4266409266409266
73
- llama3_8b_instruct,0.420135922511747
74
- phi_3_mini_128k_instruct,0.4153205904787544
75
- llama_2_13b,0.41490478332583597
76
- jurassic_2_grande_17b,0.39529914529914534
77
- openhermes_2_5_mistral_7b,0.3832617447168531
78
- mistral_7b_v0_3,0.3737553342816501
79
- mixtral_8x7b_instruct_v0_1,0.3713078251895724
80
- qwen1_5_7b,0.3508771929824561
81
- yi_1_5_6b_chat,0.3354636591478697
82
- falcon_40b,0.32812265707002547
83
- command_r,0.32386140074759
84
- internlm2_chat_20b,0.32252252252252256
85
- mistral_7b_v0_2,0.31970128022759603
86
- luminous_supreme_70b,0.30128205128205127
87
- starling_lm_7b_alpha,0.29823530624445954
88
- yi_6b,0.29234143049932526
89
- mistral_7b_instruct_v0_2,0.28609513981031004
90
- zephyr_7b_alpha,0.2838442157327606
91
- zephyr_7b_beta,0.2666234345800909
92
- gemma_1_1_7b_it,0.26226051061156724
93
- mistral_7b_instruct_v0_3,0.2537839697282422
94
- starling_lm_7b_beta,0.25234441602728047
95
- llama_2_7b,0.2391288049182786
96
- luminous_extended_30b,0.2329059829059829
97
- alpaca_7b,0.22072072072072071
98
- vicuna_33b_v1_3,0.2056404230317274
99
- phi_2,0.20087901666849037
100
- qwen2_1_5b_instruct,0.19711042311661506
101
- yi_6b_chat,0.1938854489164087
102
- qwen1_5_7b_chat,0.1916569245052217
103
- tulu_2_dpo_70b,0.17624223602484473
104
- qwen1_5_4b_chat,0.1674406604747162
105
- llama_2_70b_chat,0.15527950310559005
106
- gpt_neox_20b,0.14400584795321636
107
- vicuna_7b_v1_5,0.13619501854795973
108
- falcon_40b_instruct,0.13264580369843526
109
- gemma_7b_it,0.12136319058515854
110
- falcon_7b,0.11407257459889038
111
- gpt_j_6b,0.10160818713450293
112
- luminous_base_13b,0.08333333333333333
113
- llama_2_7b_chat,0.08304448781801049
114
- gemma_1_1_2b_it,0.07665903890160183
115
- olmo_7b,0.06545209176788123
116
- gemma_2b_it,0.05921052631578947
117
- qwen1_5_1_8b_chat,0.059167526659786716
118
- qwen2_0_5b_instruct,0.059081527347781215
119
- pythia_12b,0.054093567251461985
120
- pythia_6_9b,0.019736842105263157
121
- falcon_7b_instruct,0.013513513513513514
122
- qwen1_5_0_5b_chat,0.013157894736842105
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/aggregate_scoress_cache_741f08262e15cba4bd6c8b25f2b138ca.csv DELETED
@@ -1,62 +0,0 @@
1
- model,score
2
- claude_3_5_sonnet_20240620,1.0
3
- gpt_4o_2024_05_13,0.9833333333333333
4
- gpt_4_0125_preview,0.9666666666666667
5
- gpt_4o_2024_08_06,0.95
6
- athene_70b,0.9333333333333333
7
- gpt_4o_mini,0.9166666666666666
8
- gemini_1_5_pro_api_preview,0.9
9
- mistral_large_2407,0.8833333333333333
10
- llama3_1_405b_instruct,0.8666666666666667
11
- glm_4_0520,0.85
12
- yi_large,0.8333333333333334
13
- deepseek_coder_v2,0.8166666666666667
14
- claude_3_opus_20240229,0.8
15
- gemma_2_27b_it,0.7833333333333333
16
- llama3_1_70b_instruct,0.75
17
- glm_4_0116,0.75
18
- glm_4_air,0.7333333333333333
19
- gpt_4_0314,0.7166666666666667
20
- gemini_1_5_flash_api_preview,0.7
21
- qwen2_72b_instruct,0.6833333333333333
22
- claude_3_sonnet_20240229,0.6666666666666666
23
- llama3_70b_instruct,0.65
24
- claude_3_haiku_20240307,0.6333333333333333
25
- gpt_4_0613,0.6166666666666667
26
- mistral_large_2402,0.6
27
- mixtral_8x22b_instruct_v0_1,0.5833333333333334
28
- qwen1_5_72b_chat,0.5666666666666667
29
- phi_3_medium_4k_instruct,0.55
30
- command_r_plus,0.5333333333333333
31
- mistral_medium,0.5166666666666667
32
- internlm2_5_20b_chat,0.5
33
- phi_3_small_8k_instruct,0.48333333333333334
34
- mistral_next,0.4666666666666667
35
- gpt_3_5_turbo_0613,0.45
36
- dbrx_instructruct_preview,0.43333333333333335
37
- internlm2_20b_chat,0.4166666666666667
38
- claude_2_0,0.4
39
- mixtral_8x7b_instruct_v0_1,0.38333333333333336
40
- gpt_3_5_turbo_0125,0.36666666666666664
41
- yi_34b_chat,0.35
42
- starling_lm_7b_beta,0.3333333333333333
43
- claude_2_1,0.31666666666666665
44
- llama3_1_8b_instruct,0.3
45
- snorkel_mistral_pairrm_dpo,0.2833333333333333
46
- llama3_8b_instruct,0.26666666666666666
47
- gpt_3_5_turbo_1106,0.25
48
- gpt_3_5_turbo_0301,0.23333333333333334
49
- gemini_1_0_pro,0.21666666666666667
50
- snowflake_arctic_instruct,0.2
51
- command_r,0.18333333333333332
52
- phi_3_mini_128k_instruct,0.16666666666666666
53
- tulu_2_dpo_70b,0.15
54
- starling_lm_7b_alpha,0.13333333333333333
55
- mistral_7b_instruct,0.11666666666666667
56
- gemma_1_1_7b_it,0.1
57
- llama_2_70b_chat,0.08333333333333333
58
- vicuna_33b_v1_3,0.06666666666666667
59
- gemma_7b_it,0.05
60
- llama_2_7b_chat,0.03333333333333333
61
- gemma_1_1_2b_it,0.016666666666666666
62
- gemma_2b_it,0.0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/aggregate_scoress_cache_dcbcd453e19427bcbf89a901d3f2a925.csv DELETED
@@ -1,62 +0,0 @@
1
- model,score
2
- claude_3_5_sonnet_20240620,1.0
3
- gpt_4o_2024_05_13,0.9833333333333333
4
- gpt_4_0125_preview,0.9666666666666667
5
- gpt_4o_2024_08_06,0.95
6
- athene_70b,0.9333333333333333
7
- gpt_4o_mini,0.9166666666666666
8
- gemini_1_5_pro_api_preview,0.9
9
- mistral_large_2407,0.8833333333333333
10
- llama3_1_405b_instruct,0.8666666666666667
11
- glm_4_0520,0.85
12
- yi_large,0.8333333333333334
13
- deepseek_coder_v2,0.8166666666666667
14
- claude_3_opus_20240229,0.8
15
- gemma_2_27b_it,0.7833333333333333
16
- llama3_1_70b_instruct,0.75
17
- glm_4_0116,0.75
18
- glm_4_air,0.7333333333333333
19
- gpt_4_0314,0.7166666666666667
20
- gemini_1_5_flash_api_preview,0.7
21
- qwen2_72b_instruct,0.6833333333333333
22
- claude_3_sonnet_20240229,0.6666666666666666
23
- llama3_70b_instruct,0.65
24
- claude_3_haiku_20240307,0.6333333333333333
25
- gpt_4_0613,0.6166666666666667
26
- mistral_large_2402,0.6
27
- mixtral_8x22b_instruct_v0_1,0.5833333333333334
28
- qwen1_5_72b_chat,0.5666666666666667
29
- phi_3_medium_4k_instruct,0.55
30
- command_r_plus,0.5333333333333333
31
- mistral_medium,0.5166666666666667
32
- internlm2_5_20b_chat,0.5
33
- phi_3_small_8k_instruct,0.48333333333333334
34
- mistral_next,0.4666666666666667
35
- gpt_3_5_turbo_0613,0.45
36
- dbrx_instructruct_preview,0.43333333333333335
37
- internlm2_20b_chat,0.4166666666666667
38
- claude_2_0,0.4
39
- mixtral_8x7b_instruct_v0_1,0.38333333333333336
40
- gpt_3_5_turbo_0125,0.36666666666666664
41
- yi_34b_chat,0.35
42
- starling_lm_7b_beta,0.3333333333333333
43
- claude_2_1,0.31666666666666665
44
- llama3_1_8b_instruct,0.3
45
- snorkel_mistral_pairrm_dpo,0.2833333333333333
46
- llama3_8b_instruct,0.26666666666666666
47
- gpt_3_5_turbo_1106,0.25
48
- gpt_3_5_turbo_0301,0.23333333333333334
49
- gemini_1_0_pro,0.21666666666666667
50
- snowflake_arctic_instruct,0.2
51
- command_r,0.18333333333333332
52
- phi_3_mini_128k_instruct,0.16666666666666666
53
- tulu_2_dpo_70b,0.15
54
- starling_lm_7b_alpha,0.13333333333333333
55
- mistral_7b_instruct,0.11666666666666667
56
- gemma_1_1_7b_it,0.1
57
- llama_2_70b_chat,0.08333333333333333
58
- vicuna_33b_v1_3,0.06666666666666667
59
- gemma_7b_it,0.05
60
- llama_2_7b_chat,0.03333333333333333
61
- gemma_1_1_2b_it,0.016666666666666666
62
- gemma_2b_it,0.0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/agreements_cache_151f5bfbf87ac7384c2759731c72ec0c.csv DELETED
The diff for this file is too large to render. See raw diff
 
cache_old/agreements_cache_1b58bbc4e0d124b0a524da1001369741.csv DELETED
The diff for this file is too large to render. See raw diff
 
cache_old/agreements_cache_741f08262e15cba4bd6c8b25f2b138ca.csv DELETED
@@ -1,711 +0,0 @@
1
- scenario,scenario_source,ref_scenario,ref_source,corr_type,model_select_strategy,model_subset_size_requested,exp_n,correlation,p_value
2
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.2778254199662385,0.2400384567875128
3
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.40368671387966554,0.08581278065055217
4
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.42599897728156577,0.07162425926742408
5
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.2778254199662385,0.2400384567875128
6
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.36698792170878686,0.11834981273562825
7
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,-0.018181818181818184,1.0
8
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,-0.018181818181818184,1.0
9
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,-0.05454545454545454,0.8792698312489979
10
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,-0.018181818181818184,1.0
11
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,-0.1272727272727273,0.6480954385121052
12
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.05454545454545454,0.8792698312489979
13
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,-0.018181818181818184,1.0
14
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,-0.018181818181818184,1.0
15
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.05454545454545454,0.8792698312489979
16
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,-0.05454545454545454,0.8792698312489979
17
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.05454545454545454,0.8792698312489979
18
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.23636363636363636,0.3587114698573032
19
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.2,0.4453821448613115
20
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.05454545454545454,0.8792698312489979
21
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.1272727272727273,0.6480954385121052
22
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.587180674734059,0.01246215829454031
23
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
24
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.697277051246695,0.003004262239398284
25
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.587180674734059,0.01246215829454031
26
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6605782590758164,0.004936818556325077
27
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6000000000000001,0.00994553671637005
28
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
29
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
30
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6000000000000001,0.00994553671637005
31
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6363636363636364,0.005707170915504249
32
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.2727272727272727,0.2829668209876543
33
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.34545454545454546,0.16457331248997917
34
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.38181818181818183,0.12097096961680295
35
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.2727272727272727,0.2829668209876543
36
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.34545454545454546,0.16457331248997917
37
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6363636363636364,0.005707170915504249
38
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
39
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6363636363636364,0.005707170915504249
40
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6363636363636364,0.005707170915504249
41
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
42
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.18349396085439343,0.43487965849578336
43
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.2935903373670295,0.21152242941072896
44
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.2727272727272727,0.2829668209876543
45
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.18349396085439343,0.43487965849578336
46
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.2568915451961508,0.27429882739587574
47
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.4909090909090909,0.04053235730319064
48
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6000000000000001,0.00994553671637005
49
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.5636363636363636,0.016540504248837583
50
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.4909090909090909,0.04053235730319064
51
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5636363636363636,0.016540504248837583
52
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,0,0.34545454545454546,0.16457331248997917
53
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,1,0.41818181818181815,0.08656124739458072
54
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,2,0.4909090909090909,0.04053235730319064
55
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,3,0.34545454545454546,0.16457331248997917
56
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,11,4,0.34545454545454546,0.16457331248997917
57
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.9272727272727274,3.2567740901074234e-06
58
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8545454545454545,4.624619207952541e-05
59
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,0.9272727272727274,3.2567740901074234e-06
60
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
61
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
62
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
63
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
64
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,1.0,5.010421677088344e-08
65
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6727272727272727,0.0031063111271444604
66
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
67
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.45454545454545453,0.06017015392015392
68
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.41818181818181815,0.08656124739458072
69
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,0.45454545454545453,0.06017015392015392
70
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
71
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5272727272727272,0.02638447971781305
72
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
73
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.5636363636363636,0.016540504248837583
74
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7454545454545454,0.000759529822029822
75
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
76
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
77
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
78
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.9272727272727274,3.2567740901074234e-06
79
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,1.0,5.010421677088344e-08
80
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
81
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
82
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
83
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.5272727272727272,0.02638447971781305
84
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
85
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.4403855060505442,0.06091869077971648
86
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.587180674734059,0.01246215829454031
87
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7454545454545454,0.000759529822029822
88
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,1,0.45454545454545453,0.06017015392015392
89
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,2,0.38181818181818183,0.12097096961680295
90
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6363636363636364,0.005707170915504249
91
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,11,4,0.587180674734059,0.01246215829454031
92
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
93
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
94
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
95
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
96
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
97
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,11,0,1.0,5.010421677088344e-08
98
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,11,1,1.0,5.010421677088344e-08
99
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,11,2,1.0,5.010421677088344e-08
100
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,11,3,1.0,5.010421677088344e-08
101
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,11,4,1.0,5.010421677088344e-08
102
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8545454545454545,4.624619207952541e-05
103
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
104
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
105
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
106
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
107
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7818181818181819,0.0003334435626102293
108
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7454545454545454,0.000759529822029822
109
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
110
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.4909090909090909,0.04053235730319064
111
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
112
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6238794669049377,0.007931923532795268
113
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6605782590758164,0.004936818556325077
114
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.4403855060505442,0.06091869077971648
115
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
116
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.697277051246695,0.003004262239398284
117
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8545454545454545,4.624619207952541e-05
118
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
119
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
120
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
121
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
122
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6238794669049377,0.007931923532795268
123
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.4403855060505442,0.06091869077971648
124
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.5636363636363636,0.016540504248837583
125
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6363636363636364,0.005707170915504249
126
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5636363636363636,0.016540504248837583
127
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6000000000000001,0.00994553671637005
128
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.5636363636363636,0.016540504248837583
129
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6000000000000001,0.00994553671637005
130
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6000000000000001,0.00994553671637005
131
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5636363636363636,0.016540504248837583
132
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7339758434175737,0.0017872890369872653
133
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.587180674734059,0.01246215829454031
134
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6482593132545567,0.006117582447622459
135
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.759389481241052,0.0013210471654040124
136
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.759389481241052,0.0013210471654040124
137
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6363636363636364,0.005707170915504249
138
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
139
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
140
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.2727272727272727,0.2829668209876543
141
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5272727272727272,0.02638447971781305
142
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.38181818181818183,0.12097096961680295
143
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.4909090909090909,0.04053235730319064
144
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.38895558795273394,0.10000137830747906
145
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.38181818181818183,0.12097096961680295
146
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.697277051246695,0.003004262239398284
147
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,0,0.4909090909090909,0.04053235730319064
148
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,1,0.41818181818181815,0.08656124739458072
149
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6000000000000001,0.00994553671637005
150
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,3,0.6000000000000001,0.00994553671637005
151
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6000000000000001,0.00994553671637005
152
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
153
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8807710121010884,0.00017812930545546289
154
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8807710121010884,0.00017812930545546289
155
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8807710121010884,0.00017812930545546289
156
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
157
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7454545454545454,0.000759529822029822
158
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
159
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
160
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
161
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
162
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.4770842982214229,0.042330229121360724
163
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
164
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6605782590758164,0.004936818556325077
165
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8440722199302099,0.0003281542287518694
166
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7339758434175737,0.0017872890369872653
167
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.9636363636363636,5.511463844797178e-07
168
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
169
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
170
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
171
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
172
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
173
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7454545454545454,0.000759529822029822
174
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
175
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
176
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
177
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.4909090909090909,0.04053235730319064
178
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6363636363636364,0.005707170915504249
179
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7454545454545454,0.000759529822029822
180
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
181
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
182
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.45454545454545453,0.06017015392015392
183
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
184
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6363636363636364,0.005707170915504249
185
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.9272727272727274,3.2567740901074234e-06
186
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
187
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6000000000000001,0.00994553671637005
188
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
189
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
190
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.4909090909090909,0.04053235730319064
191
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
192
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
193
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
194
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
195
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
196
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
197
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7454545454545454,0.000759529822029822
198
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
199
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
200
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
201
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
202
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7454545454545454,0.000759529822029822
203
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
204
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
205
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
206
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
207
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,1.0,5.010421677088344e-08
208
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
209
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
210
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.9636363636363636,5.511463844797178e-07
211
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,1.0,5.010421677088344e-08
212
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,1.0,5.010421677088344e-08
213
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,1.0,5.010421677088344e-08
214
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.9272727272727274,3.2567740901074234e-06
215
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.9636363636363636,5.511463844797178e-07
216
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,1.0,5.010421677088344e-08
217
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.9272727272727274,3.2567740901074234e-06
218
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
219
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
220
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
221
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.9272727272727274,3.2567740901074234e-06
222
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,0,0.9636363636363636,5.511463844797178e-07
223
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
224
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
225
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
226
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,11,4,0.9636363636363636,5.511463844797178e-07
227
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,11,0,1.0,5.010421677088344e-08
228
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,11,1,1.0,5.010421677088344e-08
229
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,11,2,1.0,5.010421677088344e-08
230
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,11,3,1.0,5.010421677088344e-08
231
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,11,4,1.0,5.010421677088344e-08
232
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.45454545454545453,0.06017015392015392
233
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.6000000000000001,0.00994553671637005
234
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.3090909090909091,0.21834651074234407
235
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.5272727272727272,0.02638447971781305
236
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.38181818181818183,0.12097096961680295
237
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.6000000000000001,0.00994553671637005
238
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.5636363636363636,0.016540504248837583
239
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.5272727272727272,0.02638447971781305
240
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.6000000000000001,0.00994553671637005
241
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.4909090909090909,0.04053235730319064
242
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.45454545454545453,0.06017015392015392
243
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.6000000000000001,0.00994553671637005
244
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.34545454545454546,0.16457331248997917
245
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.41818181818181815,0.08656124739458072
246
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.2727272727272727,0.2829668209876543
247
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.5741725345968929,0.015177848122929492
248
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.3519121986239021,0.1366995137219537
249
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.42599897728156577,0.07162425926742408
250
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.38181818181818183,0.12097096961680295
251
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.4403855060505442,0.06091869077971648
252
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
253
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
254
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
255
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
256
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
257
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.6000000000000001,0.00994553671637005
258
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.5272727272727272,0.02638447971781305
259
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.3090909090909091,0.21834651074234407
260
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.45454545454545453,0.06017015392015392
261
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.2727272727272727,0.2829668209876543
262
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,0,0.4403855060505442,0.06091869077971648
263
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,1,0.38181818181818183,0.12097096961680295
264
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,2,0.4403855060505442,0.06091869077971648
265
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,3,0.45454545454545453,0.06017015392015392
266
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,11,4,0.36698792170878686,0.11834981273562825
267
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,11,0,0.2,0.4453821448613115
268
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,11,1,0.38181818181818183,0.12097096961680295
269
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,11,2,0.41818181818181815,0.08656124739458072
270
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,11,3,0.5272727272727272,0.02638447971781305
271
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,11,4,0.5272727272727272,0.02638447971781305
272
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
273
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
274
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
275
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
276
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
277
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7454545454545454,0.000759529822029822
278
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
279
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
280
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.8073734277593311,0.0005907573118657002
281
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
282
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.587180674734059,0.01246215829454031
283
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6482593132545567,0.006117582447622459
284
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
285
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7706746355884524,0.0010393630991335228
286
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5371291452680612,0.02311942970946668
287
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.6238794669049377,0.007931923532795268
288
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.4909090909090909,0.04053235730319064
289
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8440722199302099,0.0003281542287518694
290
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
291
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.36698792170878686,0.11834981273562825
292
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
293
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.9272727272727274,3.2567740901074234e-06
294
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,1.0,5.010421677088344e-08
295
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
296
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
297
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
298
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8073734277593311,0.0005907573118657002
299
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
300
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
301
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.6605782590758164,0.004936818556325077
302
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,-0.0909090909090909,0.7611503928170594
303
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.07339758434175737,0.7547764265871044
304
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.4403855060505442,0.06091869077971648
305
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.3302891295379082,0.15985367483762747
306
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.1272727272727273,0.6480954385121052
307
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
308
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
309
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.8705196492275474,0.00023202582506637044
310
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7479575920067658,0.001637274718449882
311
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5983660736054126,0.01175728488671479
312
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
313
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.6000000000000001,0.00994553671637005
314
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
315
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
316
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.5272727272727272,0.02638447971781305
317
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,11,0,0.7706746355884524,0.0010393630991335228
318
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,11,1,0.8909090909090909,1.3728555395222063e-05
319
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,11,2,0.5272727272727272,0.02638447971781305
320
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,11,3,0.7454545454545454,0.000759529822029822
321
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
322
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.7818181818181819,0.0003334435626102293
323
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.8545454545454545,4.624619207952541e-05
324
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
325
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
326
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
327
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.6731618328060892,0.004677734981047257
328
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.759389481241052,0.0013210471654040124
329
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.7339758434175737,0.0017872890369872653
330
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.6238794669049377,0.007931923532795268
331
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
332
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.6363636363636364,0.005707170915504249
333
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.7454545454545454,0.000759529822029822
334
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.7706746355884524,0.0010393630991335228
335
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
336
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.7706746355884524,0.0010393630991335228
337
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
338
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.8909090909090909,1.3728555395222063e-05
339
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
340
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
341
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
342
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.5636363636363636,0.016540504248837583
343
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.7454545454545454,0.000759529822029822
344
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.6363636363636364,0.005707170915504249
345
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
346
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.6363636363636364,0.005707170915504249
347
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
348
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
349
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.5636363636363636,0.016540504248837583
350
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
351
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
352
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,0,0.6727272727272727,0.0031063111271444604
353
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,1,0.7454545454545454,0.000759529822029822
354
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
355
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,3,0.6000000000000001,0.00994553671637005
356
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,11,4,0.7454545454545454,0.000759529822029822
357
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,11,0,0.2778254199662385,0.2400384567875128
358
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,11,1,0.40368671387966554,0.08581278065055217
359
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,11,2,0.42599897728156577,0.07162425926742408
360
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,11,3,0.2778254199662385,0.2400384567875128
361
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,11,4,0.36698792170878686,0.11834981273562825
362
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,11,0,-0.018181818181818184,1.0
363
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,11,1,-0.018181818181818184,1.0
364
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,11,2,-0.05454545454545454,0.8792698312489979
365
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,11,3,-0.018181818181818184,1.0
366
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,11,4,-0.1272727272727273,0.6480954385121052
367
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,11,0,0.05454545454545454,0.8792698312489979
368
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,11,1,-0.018181818181818184,1.0
369
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,11,2,-0.018181818181818184,1.0
370
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,11,3,0.05454545454545454,0.8792698312489979
371
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,11,4,-0.05454545454545454,0.8792698312489979
372
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,11,0,0.05454545454545454,0.8792698312489979
373
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,11,1,0.23636363636363636,0.3587114698573032
374
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,11,2,0.2,0.4453821448613115
375
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,11,3,0.05454545454545454,0.8792698312489979
376
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,11,4,0.1272727272727273,0.6480954385121052
377
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,11,0,0.587180674734059,0.01246215829454031
378
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
379
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,11,2,0.697277051246695,0.003004262239398284
380
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,11,3,0.587180674734059,0.01246215829454031
381
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,11,4,0.6605782590758164,0.004936818556325077
382
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,11,0,0.6000000000000001,0.00994553671637005
383
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
384
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
385
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,11,3,0.6000000000000001,0.00994553671637005
386
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,11,4,0.6363636363636364,0.005707170915504249
387
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,11,0,0.2727272727272727,0.2829668209876543
388
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,11,1,0.34545454545454546,0.16457331248997917
389
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,11,2,0.38181818181818183,0.12097096961680295
390
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,11,3,0.2727272727272727,0.2829668209876543
391
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,11,4,0.34545454545454546,0.16457331248997917
392
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,11,0,0.6363636363636364,0.005707170915504249
393
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
394
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,11,2,0.6363636363636364,0.005707170915504249
395
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,11,3,0.6363636363636364,0.005707170915504249
396
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
397
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,11,0,0.18349396085439343,0.43487965849578336
398
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,11,1,0.2935903373670295,0.21152242941072896
399
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,11,2,0.2727272727272727,0.2829668209876543
400
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,11,3,0.18349396085439343,0.43487965849578336
401
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,11,4,0.2568915451961508,0.27429882739587574
402
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,11,0,0.4909090909090909,0.04053235730319064
403
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,11,1,0.6000000000000001,0.00994553671637005
404
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,11,2,0.5636363636363636,0.016540504248837583
405
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,11,3,0.4909090909090909,0.04053235730319064
406
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,11,4,0.5636363636363636,0.016540504248837583
407
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,11,0,0.34545454545454546,0.16457331248997917
408
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,11,1,0.41818181818181815,0.08656124739458072
409
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,11,2,0.4909090909090909,0.04053235730319064
410
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,11,3,0.34545454545454546,0.16457331248997917
411
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,11,4,0.34545454545454546,0.16457331248997917
412
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.9272727272727274,3.2567740901074234e-06
413
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.8545454545454545,4.624619207952541e-05
414
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,11,2,0.9272727272727274,3.2567740901074234e-06
415
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
416
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
417
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
418
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
419
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,11,2,1.0,5.010421677088344e-08
420
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.6727272727272727,0.0031063111271444604
421
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
422
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.45454545454545453,0.06017015392015392
423
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.41818181818181815,0.08656124739458072
424
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,11,2,0.45454545454545453,0.06017015392015392
425
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
426
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.5272727272727272,0.02638447971781305
427
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
428
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.5636363636363636,0.016540504248837583
429
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,11,2,0.7454545454545454,0.000759529822029822
430
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
431
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
432
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
433
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.9272727272727274,3.2567740901074234e-06
434
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,11,2,1.0,5.010421677088344e-08
435
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
436
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
437
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
438
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.5272727272727272,0.02638447971781305
439
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
440
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.4403855060505442,0.06091869077971648
441
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.587180674734059,0.01246215829454031
442
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,11,0,0.7454545454545454,0.000759529822029822
443
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,11,1,0.45454545454545453,0.06017015392015392
444
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,11,2,0.38181818181818183,0.12097096961680295
445
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,11,3,0.6363636363636364,0.005707170915504249
446
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,11,4,0.587180674734059,0.01246215829454031
447
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
448
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
449
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
450
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
451
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
452
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,11,0,1.0,5.010421677088344e-08
453
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,11,1,1.0,5.010421677088344e-08
454
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,11,2,1.0,5.010421677088344e-08
455
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,11,3,1.0,5.010421677088344e-08
456
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,11,4,1.0,5.010421677088344e-08
457
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,11,0,0.8545454545454545,4.624619207952541e-05
458
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
459
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
460
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
461
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
462
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,11,0,0.7818181818181819,0.0003334435626102293
463
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,11,1,0.7454545454545454,0.000759529822029822
464
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
465
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,11,3,0.4909090909090909,0.04053235730319064
466
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
467
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,11,0,0.6238794669049377,0.007931923532795268
468
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,11,1,0.6605782590758164,0.004936818556325077
469
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,11,2,0.4403855060505442,0.06091869077971648
470
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
471
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,11,4,0.697277051246695,0.003004262239398284
472
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,11,0,0.8545454545454545,4.624619207952541e-05
473
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
474
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
475
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
476
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
477
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,11,0,0.6238794669049377,0.007931923532795268
478
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,11,1,0.4403855060505442,0.06091869077971648
479
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,11,2,0.5636363636363636,0.016540504248837583
480
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,11,3,0.6363636363636364,0.005707170915504249
481
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,11,4,0.5636363636363636,0.016540504248837583
482
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,11,0,0.6000000000000001,0.00994553671637005
483
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,11,1,0.5636363636363636,0.016540504248837583
484
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,11,2,0.6000000000000001,0.00994553671637005
485
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,11,3,0.6000000000000001,0.00994553671637005
486
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,11,4,0.5636363636363636,0.016540504248837583
487
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,11,0,0.7339758434175737,0.0017872890369872653
488
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,11,1,0.587180674734059,0.01246215829454031
489
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,11,2,0.6482593132545567,0.006117582447622459
490
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,11,3,0.759389481241052,0.0013210471654040124
491
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,11,4,0.759389481241052,0.0013210471654040124
492
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,11,0,0.6363636363636364,0.005707170915504249
493
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
494
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
495
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,11,3,0.2727272727272727,0.2829668209876543
496
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,11,4,0.5272727272727272,0.02638447971781305
497
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,11,0,0.38181818181818183,0.12097096961680295
498
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,11,1,0.4909090909090909,0.04053235730319064
499
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,11,2,0.38895558795273394,0.10000137830747906
500
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,11,3,0.38181818181818183,0.12097096961680295
501
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,11,4,0.697277051246695,0.003004262239398284
502
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,11,0,0.4909090909090909,0.04053235730319064
503
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,11,1,0.41818181818181815,0.08656124739458072
504
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,11,2,0.6000000000000001,0.00994553671637005
505
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,11,3,0.6000000000000001,0.00994553671637005
506
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,11,4,0.6000000000000001,0.00994553671637005
507
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
508
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,11,1,0.8807710121010884,0.00017812930545546289
509
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,11,2,0.8807710121010884,0.00017812930545546289
510
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,11,3,0.8807710121010884,0.00017812930545546289
511
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
512
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,11,0,0.7454545454545454,0.000759529822029822
513
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
514
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
515
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
516
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
517
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,11,0,0.4770842982214229,0.042330229121360724
518
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
519
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,11,2,0.6605782590758164,0.004936818556325077
520
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,11,3,0.8440722199302099,0.0003281542287518694
521
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,11,4,0.7339758434175737,0.0017872890369872653
522
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,11,0,0.9636363636363636,5.511463844797178e-07
523
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,11,1,0.6727272727272727,0.0031063111271444604
524
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
525
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
526
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
527
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
528
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,11,1,0.7454545454545454,0.000759529822029822
529
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
530
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
531
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
532
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,11,0,0.4909090909090909,0.04053235730319064
533
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,11,1,0.6363636363636364,0.005707170915504249
534
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,11,2,0.7454545454545454,0.000759529822029822
535
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
536
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
537
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,11,0,0.45454545454545453,0.06017015392015392
538
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
539
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,11,2,0.6363636363636364,0.005707170915504249
540
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,11,3,0.9272727272727274,3.2567740901074234e-06
541
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
542
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,11,0,0.6000000000000001,0.00994553671637005
543
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,11,1,0.7090909090909091,0.0015912097162097162
544
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
545
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,11,3,0.4909090909090909,0.04053235730319064
546
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
547
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
548
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
549
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
550
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
551
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,11,4,0.8181818181818182,0.00013227513227513228
552
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,11,0,0.7454545454545454,0.000759529822029822
553
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
554
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
555
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
556
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
557
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,11,0,0.7454545454545454,0.000759529822029822
558
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
559
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
560
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
561
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
562
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,11,0,1.0,5.010421677088344e-08
563
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
564
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
565
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,11,3,0.9636363636363636,5.511463844797178e-07
566
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,11,4,1.0,5.010421677088344e-08
567
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,11,0,1.0,5.010421677088344e-08
568
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,11,1,1.0,5.010421677088344e-08
569
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,11,2,0.9272727272727274,3.2567740901074234e-06
570
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,11,3,0.9636363636363636,5.511463844797178e-07
571
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,11,4,1.0,5.010421677088344e-08
572
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,11,0,0.9272727272727274,3.2567740901074234e-06
573
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
574
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
575
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
576
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,11,4,0.9272727272727274,3.2567740901074234e-06
577
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,11,0,0.9636363636363636,5.511463844797178e-07
578
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,11,1,0.9636363636363636,5.511463844797178e-07
579
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
580
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,11,3,0.8909090909090909,1.3728555395222063e-05
581
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,11,4,0.9636363636363636,5.511463844797178e-07
582
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,11,0,1.0,5.010421677088344e-08
583
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,11,1,1.0,5.010421677088344e-08
584
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,11,2,1.0,5.010421677088344e-08
585
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,11,3,1.0,5.010421677088344e-08
586
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,11,4,1.0,5.010421677088344e-08
587
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.45454545454545453,0.06017015392015392
588
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.6000000000000001,0.00994553671637005
589
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.3090909090909091,0.21834651074234407
590
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.5272727272727272,0.02638447971781305
591
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.38181818181818183,0.12097096961680295
592
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.6000000000000001,0.00994553671637005
593
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.5636363636363636,0.016540504248837583
594
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.5272727272727272,0.02638447971781305
595
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.6000000000000001,0.00994553671637005
596
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.4909090909090909,0.04053235730319064
597
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.45454545454545453,0.06017015392015392
598
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.6000000000000001,0.00994553671637005
599
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.34545454545454546,0.16457331248997917
600
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.41818181818181815,0.08656124739458072
601
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.2727272727272727,0.2829668209876543
602
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.5741725345968929,0.015177848122929492
603
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.3519121986239021,0.1366995137219537
604
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.42599897728156577,0.07162425926742408
605
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.38181818181818183,0.12097096961680295
606
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.4403855060505442,0.06091869077971648
607
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
608
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
609
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.6727272727272727,0.0031063111271444604
610
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
611
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
612
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.6000000000000001,0.00994553671637005
613
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.5272727272727272,0.02638447971781305
614
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.3090909090909091,0.21834651074234407
615
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.45454545454545453,0.06017015392015392
616
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.2727272727272727,0.2829668209876543
617
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,0,0.4403855060505442,0.06091869077971648
618
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,1,0.38181818181818183,0.12097096961680295
619
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,2,0.4403855060505442,0.06091869077971648
620
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,3,0.45454545454545453,0.06017015392015392
621
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,11,4,0.36698792170878686,0.11834981273562825
622
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,11,0,0.2,0.4453821448613115
623
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,11,1,0.38181818181818183,0.12097096961680295
624
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,11,2,0.41818181818181815,0.08656124739458072
625
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,11,3,0.5272727272727272,0.02638447971781305
626
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,11,4,0.5272727272727272,0.02638447971781305
627
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
628
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
629
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
630
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
631
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
632
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,11,0,0.7454545454545454,0.000759529822029822
633
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
634
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,11,2,0.8181818181818182,0.00013227513227513228
635
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,11,3,0.8073734277593311,0.0005907573118657002
636
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
637
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,11,0,0.587180674734059,0.01246215829454031
638
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,11,1,0.6482593132545567,0.006117582447622459
639
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,11,2,0.8545454545454545,4.624619207952541e-05
640
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,11,3,0.7706746355884524,0.0010393630991335228
641
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,11,4,0.5371291452680612,0.02311942970946668
642
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,11,0,0.6238794669049377,0.007931923532795268
643
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,11,1,0.4909090909090909,0.04053235730319064
644
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,11,2,0.8440722199302099,0.0003281542287518694
645
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
646
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,11,4,0.36698792170878686,0.11834981273562825
647
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,11,0,0.8909090909090909,1.3728555395222063e-05
648
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,11,1,0.9272727272727274,3.2567740901074234e-06
649
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,11,2,1.0,5.010421677088344e-08
650
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,11,3,0.7090909090909091,0.0015912097162097162
651
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,11,4,0.6727272727272727,0.0031063111271444604
652
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,11,0,0.8181818181818182,0.00013227513227513228
653
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,11,1,0.8073734277593311,0.0005907573118657002
654
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,11,2,0.8909090909090909,1.3728555395222063e-05
655
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
656
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,11,4,0.6605782590758164,0.004936818556325077
657
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,11,0,-0.0909090909090909,0.7611503928170594
658
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,11,1,0.07339758434175737,0.7547764265871044
659
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,11,2,0.4403855060505442,0.06091869077971648
660
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,11,3,0.3302891295379082,0.15985367483762747
661
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,11,4,0.1272727272727273,0.6480954385121052
662
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
663
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,11,1,0.7818181818181819,0.0003334435626102293
664
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,11,2,0.8705196492275474,0.00023202582506637044
665
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,11,3,0.7479575920067658,0.001637274718449882
666
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,11,4,0.5983660736054126,0.01175728488671479
667
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
668
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,11,1,0.6000000000000001,0.00994553671637005
669
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
670
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
671
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,11,4,0.5272727272727272,0.02638447971781305
672
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,11,0,0.7706746355884524,0.0010393630991335228
673
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,11,1,0.8909090909090909,1.3728555395222063e-05
674
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,11,2,0.5272727272727272,0.02638447971781305
675
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,11,3,0.7454545454545454,0.000759529822029822
676
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
677
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,11,0,0.7818181818181819,0.0003334435626102293
678
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,11,1,0.8545454545454545,4.624619207952541e-05
679
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
680
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
681
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,11,4,0.7818181818181819,0.0003334435626102293
682
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,11,0,0.6731618328060892,0.004677734981047257
683
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,11,1,0.759389481241052,0.0013210471654040124
684
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,11,2,0.7339758434175737,0.0017872890369872653
685
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,11,3,0.6238794669049377,0.007931923532795268
686
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,11,4,0.7090909090909091,0.0015912097162097162
687
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,11,0,0.6363636363636364,0.005707170915504249
688
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,11,1,0.7454545454545454,0.000759529822029822
689
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,11,2,0.7706746355884524,0.0010393630991335228
690
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,11,3,0.8181818181818182,0.00013227513227513228
691
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,11,4,0.7706746355884524,0.0010393630991335228
692
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
693
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,11,1,0.8909090909090909,1.3728555395222063e-05
694
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,11,2,0.7818181818181819,0.0003334435626102293
695
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
696
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,11,4,0.8545454545454545,4.624619207952541e-05
697
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,11,0,0.5636363636363636,0.016540504248837583
698
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,11,1,0.7454545454545454,0.000759529822029822
699
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,11,2,0.6363636363636364,0.005707170915504249
700
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,11,3,0.7818181818181819,0.0003334435626102293
701
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,11,4,0.6363636363636364,0.005707170915504249
702
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,11,0,0.7090909090909091,0.0015912097162097162
703
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,11,1,0.8181818181818182,0.00013227513227513228
704
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,11,2,0.5636363636363636,0.016540504248837583
705
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,11,3,0.8545454545454545,4.624619207952541e-05
706
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
707
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,11,0,0.6727272727272727,0.0031063111271444604
708
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,11,1,0.7454545454545454,0.000759529822029822
709
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,11,2,0.7090909090909091,0.0015912097162097162
710
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,11,3,0.6000000000000001,0.00994553671637005
711
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,11,4,0.7454545454545454,0.000759529822029822
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/agreements_cache_dcbcd453e19427bcbf89a901d3f2a925.csv DELETED
@@ -1,731 +0,0 @@
1
- scenario,scenario_source,ref_scenario,ref_source,corr_type,model_select_strategy,model_subset_size_requested,exp_n,correlation,p_value
2
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4447495899966607,0.1315867602811863
3
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.3571428571428571,0.27509920634920637
4
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.47280542884465016,0.10506382347888965
5
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.2545875386086578,0.38281014365989596
6
- Helm Lite,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.40006613209931935,0.17023995462900499
7
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.07142857142857142,0.9048611111111111
8
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,-0.2857142857142857,0.39875992063492066
9
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.07142857142857142,0.9048611111111111
10
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,-0.21428571428571427,0.5484126984126985
11
- Helm Lite NarrativeQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,-0.3571428571428571,0.27509920634920637
12
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.07142857142857142,0.9048611111111111
13
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,-0.2857142857142857,0.39875992063492066
14
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.14285714285714285,0.7195436507936508
15
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,-0.21428571428571427,0.5484126984126985
16
- Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,-0.2857142857142857,0.39875992063492066
17
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.21428571428571427,0.5484126984126985
18
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.0,1.0
19
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.3571428571428571,0.27509920634920637
20
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.0,1.0
21
- Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.0,1.0
22
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7637626158259734,0.008839740160738534
23
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
24
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
25
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
26
- Helm Lite OpenBookQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7637626158259734,0.008839740160738534
27
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
28
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
29
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.6428571428571428,0.03115079365079365
30
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
31
- Helm Lite MMLU,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
32
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.21428571428571427,0.5484126984126985
33
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.2857142857142857,0.39875992063492066
34
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
35
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.14285714285714285,0.7195436507936508
36
- Helm Lite MathEquivalentCOT,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
37
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
38
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
39
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
40
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
41
- Helm Lite GSM8K,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
42
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.2857142857142857,0.39875992063492066
43
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.21428571428571427,0.5484126984126985
44
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.3571428571428571,0.27509920634920637
45
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.036369648372665396,0.9007802600472398
46
- Helm Lite LegalBench,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.21428571428571427,0.5484126984126985
47
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
48
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6428571428571428,0.03115079365079365
49
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
50
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
51
- Helm Lite MedQA,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
52
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
53
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,1,0.3571428571428571,0.27509920634920637
54
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,2,0.42857142857142855,0.17886904761904762
55
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,3,0.3571428571428571,0.27509920634920637
56
- Helm Lite WMT2014,helm_lite_240829.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
57
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
58
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
59
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
60
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
61
- HF OpenLLM v2,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
62
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
63
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
64
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
65
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7142857142857142,0.014136904761904762
66
- HFv2 BBH,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
67
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
68
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.4999999999999999,0.10868055555555556
69
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
70
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
71
- HFv2 GPQA,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
72
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.8571428571428571,0.001736111111111111
73
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.4999999999999999,0.10868055555555556
74
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
75
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
76
- HFv2 IFEval,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
77
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
78
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
79
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
80
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
81
- HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
82
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.8571428571428571,0.001736111111111111
83
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.5714285714285714,0.06101190476190476
84
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
85
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.4999999999999999,0.10868055555555556
86
- HFv2 Math Level 5,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
87
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
88
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,1,0.4999999999999999,0.10868055555555556
89
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,2,0.5714285714285714,0.06101190476190476
90
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
91
- HFv2 MuSR,hf_open_llm_v2_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
92
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
93
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
94
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
95
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
96
- Helm MMLU,helm_mmlu_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
97
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
98
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
99
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
100
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
101
- LMSys Arena,chatbot_arena_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
102
- MMLU Pro,mmlu_pro_240829.csv,aggregate,aggregate,kendall,random,8,0,0.8571428571428571,0.001736111111111111
103
- MMLU Pro,mmlu_pro_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
104
- MMLU Pro,mmlu_pro_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
105
- MMLU Pro,mmlu_pro_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
106
- MMLU Pro,mmlu_pro_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
107
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
108
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
109
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
110
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
111
- MixEval,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
112
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
113
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
114
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.6428571428571428,0.03115079365079365
115
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
116
- MixEval Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
117
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
118
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
119
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
120
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
121
- MixEval TriviaQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
122
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
123
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
124
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
125
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
126
- MixEval MMLU,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
127
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6182840223353117,0.0340492747686748
128
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.18184824186332696,0.5330356744917513
129
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.3571428571428571,0.27509920634920637
130
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
131
- MixEval DROP,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.42857142857142855,0.17886904761904762
132
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
133
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.5714285714285714,0.06101190476190476
134
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
135
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
136
- MixEval HellaSwag,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.42857142857142855,0.17886904761904762
137
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6910233190806425,0.017844011512848347
138
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.5455447255899809,0.0614649096074132
139
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.5455447255899809,0.0614649096074132
140
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.836501912571304,0.004136737098676645
141
- MixEval CommonsenseQA,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6910233190806425,0.017844011512848347
142
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
143
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
144
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
145
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.21428571428571427,0.5484126984126985
146
- MixEval TriviaQA Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.42857142857142855,0.17886904761904762
147
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
148
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6428571428571428,0.03115079365079365
149
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.22237479499833035,0.45088703102517036
150
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.3571428571428571,0.27509920634920637
151
- MixEval MMLU Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.5714285714285714,0.06101190476190476
152
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,0,0.14285714285714285,0.7195436507936508
153
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
154
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
155
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7142857142857142,0.014136904761904762
156
- MixEval DROP Hard,mixeval_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
157
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
158
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,8,1,0.836501912571304,0.004136737098676645
159
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
160
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
161
- AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
162
- OpenCompass Arena,opencompass_arena_240829.csv,aggregate,aggregate,kendall,random,8,0,0.2857142857142857,0.39875992063492066
163
- OpenCompass Arena,opencompass_arena_240829.csv,aggregate,aggregate,kendall,random,8,1,0.5714285714285714,0.06101190476190476
164
- OpenCompass Arena,opencompass_arena_240829.csv,aggregate,aggregate,kendall,random,8,2,0.5714285714285714,0.06101190476190476
165
- OpenCompass Arena,opencompass_arena_240829.csv,aggregate,aggregate,kendall,random,8,3,0.3571428571428571,0.27509920634920637
166
- OpenCompass Arena,opencompass_arena_240829.csv,aggregate,aggregate,kendall,random,8,4,0.4999999999999999,0.10868055555555556
167
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
168
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
169
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
170
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
171
- LiveBench 240725,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
172
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
173
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.3571428571428571,0.27509920634920637
174
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.47280542884465016,0.10506382347888965
175
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
176
- LiveBench Reasoning,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.5455447255899809,0.0614649096074132
177
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
178
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
179
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
180
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
181
- LiveBench Coding,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
182
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
183
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.5714285714285714,0.06101190476190476
184
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
185
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
186
- LiveBench Mathematics,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
187
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
188
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6428571428571428,0.03115079365079365
189
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
190
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
191
- LiveBench Data Analysis,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
192
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
193
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
194
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.4999999999999999,0.10868055555555556
195
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
196
- LiveBench Language,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
197
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
198
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6428571428571428,0.03115079365079365
199
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.6428571428571428,0.03115079365079365
200
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.3571428571428571,0.27509920634920637
201
- LiveBench Instruction Following,livebench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.42857142857142855,0.17886904761904762
202
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.8571428571428571,0.001736111111111111
203
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
204
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
205
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
206
- WildBench Elo LC,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
207
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
208
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
209
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
210
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
211
- WildBench Information Seeking,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
212
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
213
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7142857142857142,0.014136904761904762
214
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
215
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
216
- WildBench Creative,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
217
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
218
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
219
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
220
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
221
- WildBench Code Debugging,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
222
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
223
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
224
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
225
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
226
- WildBench Math & Data,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
227
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
228
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
229
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
230
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
231
- WildBench Reasoning & Planning,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
232
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
233
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
234
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
235
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
236
- WildBench Score,wildbench_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
237
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
238
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
239
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
240
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
241
- Arena Hard,arena_hard_240829.csv,aggregate,aggregate,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
242
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.14285714285714285,0.7195436507936508
243
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.4999999999999999,0.10868055555555556
244
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.6428571428571428,0.03115079365079365
245
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
246
- HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.5714285714285714,0.06101190476190476
247
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
248
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.42857142857142855,0.17886904761904762
249
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
250
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
251
- HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
252
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.0,1.0
253
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.5714285714285714,0.06101190476190476
254
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.5714285714285714,0.06101190476190476
255
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
256
- HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
257
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
258
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.10910894511799618,0.7083840532183997
259
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.6182840223353117,0.0340492747686748
260
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.2857142857142857,0.39875992063492066
261
- HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.40006613209931935,0.17023995462900499
262
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
263
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
264
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
265
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
266
- HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
267
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
268
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.2857142857142857,0.39875992063492066
269
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.6428571428571428,0.03115079365079365
270
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.42857142857142855,0.17886904761904762
271
- HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
272
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,0,0.42857142857142855,0.17886904761904762
273
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,1,0.2857142857142857,0.39875992063492066
274
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,2,0.6182840223353117,0.0340492747686748
275
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,3,0.42857142857142855,0.17886904761904762
276
- HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,aggregate,aggregate,kendall,random,8,4,0.2857142857142857,0.39875992063492066
277
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,8,0,0.21428571428571427,0.5484126984126985
278
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,8,1,0.14285714285714285,0.7195436507936508
279
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,8,2,0.42857142857142855,0.17886904761904762
280
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,8,3,0.4999999999999999,0.10868055555555556
281
- BFCL,bfcl_240906.csv,aggregate,aggregate,kendall,random,8,4,0.3571428571428571,0.27509920634920637
282
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
283
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7857142857142856,0.005505952380952381
284
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
285
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7142857142857142,0.014136904761904762
286
- BIGGEN,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7142857142857142,0.014136904761904762
287
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
288
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
289
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
290
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
291
- BIGGEN Grounding,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
292
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.40006613209931935,0.17023995462900499
293
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6910233190806425,0.017844011512848347
294
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
295
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.7142857142857142,0.014136904761904762
296
- BIGGEN Instruction Following,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.47280542884465016,0.10506382347888965
297
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.47280542884465016,0.10506382347888965
298
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.42857142857142855,0.17886904761904762
299
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.7637626158259734,0.008839740160738534
300
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
301
- BIGGEN Planning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.10910894511799618,0.7083840532183997
302
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.8571428571428571,0.001736111111111111
303
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
304
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
305
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
306
- BIGGEN Reasoning,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
307
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7857142857142856,0.005505952380952381
308
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.7637626158259734,0.008839740160738534
309
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
310
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
311
- BIGGEN Refinement,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.7637626158259734,0.008839740160738534
312
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,-0.2857142857142857,0.39875992063492066
313
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.2545875386086578,0.38281014365989596
314
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.6910233190806425,0.017844011512848347
315
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.10910894511799618,0.7083840532183997
316
- BIGGEN Safety,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.07142857142857142,0.9048611111111111
317
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
318
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.6428571428571428,0.03115079365079365
319
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.836501912571304,0.004136737098676645
320
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.5669467095138409,0.05611472402809984
321
- BIGGEN Theory of Mind,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.6182840223353117,0.0340492747686748
322
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
323
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.4999999999999999,0.10868055555555556
324
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.5714285714285714,0.06101190476190476
325
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
326
- BIGGEN Tool Usage,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.42857142857142855,0.17886904761904762
327
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
328
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
329
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,8,2,0.42857142857142855,0.17886904761904762
330
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,8,3,0.6428571428571428,0.03115079365079365
331
- BIGGEN Multilingual,biggen_240829.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
332
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
333
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
334
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.8571428571428571,0.001736111111111111
335
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
336
- LiveBench 240624,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
337
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.7412493166611012,0.011966745157436277
338
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
339
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7637626158259734,0.008839740160738534
340
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.7142857142857142,0.014136904761904762
341
- LiveBench Reasoning Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
342
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
343
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
344
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
345
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
346
- LiveBench Coding Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.7637626158259734,0.008839740160738534
347
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.7142857142857142,0.014136904761904762
348
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
349
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
350
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.8571428571428571,0.001736111111111111
351
- LiveBench Mathematics Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
352
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.4999999999999999,0.10868055555555556
353
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
354
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
355
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
356
- LiveBench Data Analysis Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.6428571428571428,0.03115079365079365
357
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.6428571428571428,0.03115079365079365
358
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
359
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7142857142857142,0.014136904761904762
360
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.7857142857142856,0.005505952380952381
361
- LiveBench Language Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.8571428571428571,0.001736111111111111
362
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,0,0.5714285714285714,0.06101190476190476
363
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,1,0.8571428571428571,0.001736111111111111
364
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,2,0.7857142857142856,0.005505952380952381
365
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,3,0.5714285714285714,0.06101190476190476
366
- LiveBench Instruction Following Average,livebench_240701.csv,aggregate,aggregate,kendall,random,8,4,0.7857142857142856,0.005505952380952381
367
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,8,0,0.4447495899966607,0.1315867602811863
368
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,8,1,0.3571428571428571,0.27509920634920637
369
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,8,2,0.47280542884465016,0.10506382347888965
370
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,8,3,0.2545875386086578,0.38281014365989596
371
- aggregate,aggregate,Helm Lite,helm_lite_240829.csv,kendall,random,8,4,0.40006613209931935,0.17023995462900499
372
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,8,0,0.07142857142857142,0.9048611111111111
373
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,8,1,-0.2857142857142857,0.39875992063492066
374
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,8,2,0.07142857142857142,0.9048611111111111
375
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,8,3,-0.21428571428571427,0.5484126984126985
376
- aggregate,aggregate,Helm Lite NarrativeQA,helm_lite_240829.csv,kendall,random,8,4,-0.3571428571428571,0.27509920634920637
377
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,8,0,0.07142857142857142,0.9048611111111111
378
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,8,1,-0.2857142857142857,0.39875992063492066
379
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,8,2,0.14285714285714285,0.7195436507936508
380
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,8,3,-0.21428571428571427,0.5484126984126985
381
- aggregate,aggregate,Helm Lite NaturalQuestionsOpen,helm_lite_240829.csv,kendall,random,8,4,-0.2857142857142857,0.39875992063492066
382
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,8,0,0.21428571428571427,0.5484126984126985
383
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,8,1,0.0,1.0
384
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,8,2,0.3571428571428571,0.27509920634920637
385
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,8,3,0.0,1.0
386
- aggregate,aggregate,Helm Lite NaturalQuestionsClosed,helm_lite_240829.csv,kendall,random,8,4,0.0,1.0
387
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,8,0,0.7637626158259734,0.008839740160738534
388
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
389
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
390
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
391
- aggregate,aggregate,Helm Lite OpenBookQA,helm_lite_240829.csv,kendall,random,8,4,0.7637626158259734,0.008839740160738534
392
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
393
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
394
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,8,2,0.6428571428571428,0.03115079365079365
395
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
396
- aggregate,aggregate,Helm Lite MMLU,helm_lite_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
397
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,8,0,0.21428571428571427,0.5484126984126985
398
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,8,1,0.2857142857142857,0.39875992063492066
399
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
400
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,8,3,0.14285714285714285,0.7195436507936508
401
- aggregate,aggregate,Helm Lite MathEquivalentCOT,helm_lite_240829.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
402
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
403
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
404
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
405
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
406
- aggregate,aggregate,Helm Lite GSM8K,helm_lite_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
407
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,8,0,0.2857142857142857,0.39875992063492066
408
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,8,1,0.21428571428571427,0.5484126984126985
409
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,8,2,0.3571428571428571,0.27509920634920637
410
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,8,3,0.036369648372665396,0.9007802600472398
411
- aggregate,aggregate,Helm Lite LegalBench,helm_lite_240829.csv,kendall,random,8,4,0.21428571428571427,0.5484126984126985
412
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
413
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,8,1,0.6428571428571428,0.03115079365079365
414
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
415
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
416
- aggregate,aggregate,Helm Lite MedQA,helm_lite_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
417
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
418
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,8,1,0.3571428571428571,0.27509920634920637
419
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,8,2,0.42857142857142855,0.17886904761904762
420
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,8,3,0.3571428571428571,0.27509920634920637
421
- aggregate,aggregate,Helm Lite WMT2014,helm_lite_240829.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
422
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
423
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
424
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
425
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
426
- aggregate,aggregate,HF OpenLLM v2,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
427
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
428
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
429
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
430
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.7142857142857142,0.014136904761904762
431
- aggregate,aggregate,HFv2 BBH,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
432
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
433
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.4999999999999999,0.10868055555555556
434
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
435
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
436
- aggregate,aggregate,HFv2 GPQA,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
437
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.8571428571428571,0.001736111111111111
438
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.4999999999999999,0.10868055555555556
439
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
440
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
441
- aggregate,aggregate,HFv2 IFEval,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
442
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
443
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
444
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
445
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
446
- aggregate,aggregate,HFv2 MMLU Pro,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
447
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.8571428571428571,0.001736111111111111
448
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.5714285714285714,0.06101190476190476
449
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
450
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.4999999999999999,0.10868055555555556
451
- aggregate,aggregate,HFv2 Math Level 5,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
452
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
453
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,8,1,0.4999999999999999,0.10868055555555556
454
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,8,2,0.5714285714285714,0.06101190476190476
455
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
456
- aggregate,aggregate,HFv2 MuSR,hf_open_llm_v2_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
457
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
458
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
459
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
460
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
461
- aggregate,aggregate,Helm MMLU,helm_mmlu_240829.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
462
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
463
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
464
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
465
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
466
- aggregate,aggregate,LMSys Arena,chatbot_arena_240829.csv,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
467
- aggregate,aggregate,MMLU Pro,mmlu_pro_240829.csv,kendall,random,8,0,0.8571428571428571,0.001736111111111111
468
- aggregate,aggregate,MMLU Pro,mmlu_pro_240829.csv,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
469
- aggregate,aggregate,MMLU Pro,mmlu_pro_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
470
- aggregate,aggregate,MMLU Pro,mmlu_pro_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
471
- aggregate,aggregate,MMLU Pro,mmlu_pro_240829.csv,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
472
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
473
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
474
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
475
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
476
- aggregate,aggregate,MixEval,mixeval_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
477
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
478
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
479
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,8,2,0.6428571428571428,0.03115079365079365
480
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
481
- aggregate,aggregate,MixEval Hard,mixeval_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
482
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
483
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
484
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
485
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
486
- aggregate,aggregate,MixEval TriviaQA,mixeval_240829.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
487
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
488
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
489
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
490
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
491
- aggregate,aggregate,MixEval MMLU,mixeval_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
492
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,8,0,0.6182840223353117,0.0340492747686748
493
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,8,1,0.18184824186332696,0.5330356744917513
494
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,8,2,0.3571428571428571,0.27509920634920637
495
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
496
- aggregate,aggregate,MixEval DROP,mixeval_240829.csv,kendall,random,8,4,0.42857142857142855,0.17886904761904762
497
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
498
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,8,1,0.5714285714285714,0.06101190476190476
499
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
500
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
501
- aggregate,aggregate,MixEval HellaSwag,mixeval_240829.csv,kendall,random,8,4,0.42857142857142855,0.17886904761904762
502
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,8,0,0.6910233190806425,0.017844011512848347
503
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,8,1,0.5455447255899809,0.0614649096074132
504
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,8,2,0.5455447255899809,0.0614649096074132
505
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,8,3,0.836501912571304,0.004136737098676645
506
- aggregate,aggregate,MixEval CommonsenseQA,mixeval_240829.csv,kendall,random,8,4,0.6910233190806425,0.017844011512848347
507
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
508
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
509
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
510
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,8,3,0.21428571428571427,0.5484126984126985
511
- aggregate,aggregate,MixEval TriviaQA Hard,mixeval_240829.csv,kendall,random,8,4,0.42857142857142855,0.17886904761904762
512
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
513
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,8,1,0.6428571428571428,0.03115079365079365
514
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,8,2,0.22237479499833035,0.45088703102517036
515
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,8,3,0.3571428571428571,0.27509920634920637
516
- aggregate,aggregate,MixEval MMLU Hard,mixeval_240829.csv,kendall,random,8,4,0.5714285714285714,0.06101190476190476
517
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,8,0,0.14285714285714285,0.7195436507936508
518
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
519
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
520
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,8,3,0.7142857142857142,0.014136904761904762
521
- aggregate,aggregate,MixEval DROP Hard,mixeval_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
522
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
523
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,8,1,0.836501912571304,0.004136737098676645
524
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
525
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
526
- aggregate,aggregate,AlphacaEval v2lc,alphacaeval_v2lc_240829.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
527
- aggregate,aggregate,OpenCompass Arena,opencompass_arena_240829.csv,kendall,random,8,0,0.2857142857142857,0.39875992063492066
528
- aggregate,aggregate,OpenCompass Arena,opencompass_arena_240829.csv,kendall,random,8,1,0.5714285714285714,0.06101190476190476
529
- aggregate,aggregate,OpenCompass Arena,opencompass_arena_240829.csv,kendall,random,8,2,0.5714285714285714,0.06101190476190476
530
- aggregate,aggregate,OpenCompass Arena,opencompass_arena_240829.csv,kendall,random,8,3,0.3571428571428571,0.27509920634920637
531
- aggregate,aggregate,OpenCompass Arena,opencompass_arena_240829.csv,kendall,random,8,4,0.4999999999999999,0.10868055555555556
532
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
533
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
534
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
535
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
536
- aggregate,aggregate,LiveBench 240725,livebench_240829.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
537
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
538
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,8,1,0.3571428571428571,0.27509920634920637
539
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,8,2,0.47280542884465016,0.10506382347888965
540
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
541
- aggregate,aggregate,LiveBench Reasoning,livebench_240829.csv,kendall,random,8,4,0.5455447255899809,0.0614649096074132
542
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
543
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
544
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
545
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
546
- aggregate,aggregate,LiveBench Coding,livebench_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
547
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
548
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,8,1,0.5714285714285714,0.06101190476190476
549
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
550
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
551
- aggregate,aggregate,LiveBench Mathematics,livebench_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
552
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
553
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,8,1,0.6428571428571428,0.03115079365079365
554
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
555
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
556
- aggregate,aggregate,LiveBench Data Analysis,livebench_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
557
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
558
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
559
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,8,2,0.4999999999999999,0.10868055555555556
560
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
561
- aggregate,aggregate,LiveBench Language,livebench_240829.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
562
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
563
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,8,1,0.6428571428571428,0.03115079365079365
564
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,8,2,0.6428571428571428,0.03115079365079365
565
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,8,3,0.3571428571428571,0.27509920634920637
566
- aggregate,aggregate,LiveBench Instruction Following,livebench_240829.csv,kendall,random,8,4,0.42857142857142855,0.17886904761904762
567
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,8,0,0.8571428571428571,0.001736111111111111
568
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
569
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
570
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
571
- aggregate,aggregate,WildBench Elo LC,wildbench_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
572
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
573
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
574
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
575
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
576
- aggregate,aggregate,WildBench Information Seeking,wildbench_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
577
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
578
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,8,1,0.7142857142857142,0.014136904761904762
579
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
580
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
581
- aggregate,aggregate,WildBench Creative,wildbench_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
582
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
583
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
584
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
585
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
586
- aggregate,aggregate,WildBench Code Debugging,wildbench_240829.csv,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
587
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
588
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
589
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
590
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
591
- aggregate,aggregate,WildBench Math & Data,wildbench_240829.csv,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
592
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,8,0,0.9285714285714285,0.0003968253968253968
593
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
594
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
595
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
596
- aggregate,aggregate,WildBench Reasoning & Planning,wildbench_240829.csv,kendall,random,8,4,0.9285714285714285,0.0003968253968253968
597
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
598
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
599
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
600
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
601
- aggregate,aggregate,WildBench Score,wildbench_240829.csv,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
602
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,8,0,0.9999999999999998,4.96031746031746e-05
603
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
604
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
605
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
606
- aggregate,aggregate,Arena Hard,arena_hard_240829.csv,kendall,random,8,4,0.9999999999999998,4.96031746031746e-05
607
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.14285714285714285,0.7195436507936508
608
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.4999999999999999,0.10868055555555556
609
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.6428571428571428,0.03115079365079365
610
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
611
- aggregate,aggregate,HF OpenLLM v1,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.5714285714285714,0.06101190476190476
612
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
613
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.42857142857142855,0.17886904761904762
614
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
615
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
616
- aggregate,aggregate,HFv1 ARC,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
617
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.0,1.0
618
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.5714285714285714,0.06101190476190476
619
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.5714285714285714,0.06101190476190476
620
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
621
- aggregate,aggregate,HFv1 GSM8K,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
622
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
623
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.10910894511799618,0.7083840532183997
624
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.6182840223353117,0.0340492747686748
625
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.2857142857142857,0.39875992063492066
626
- aggregate,aggregate,HFv1 HellaSwag,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.40006613209931935,0.17023995462900499
627
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
628
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
629
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
630
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.9285714285714285,0.0003968253968253968
631
- aggregate,aggregate,HFv1 MMLU,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
632
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
633
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.2857142857142857,0.39875992063492066
634
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.6428571428571428,0.03115079365079365
635
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.42857142857142855,0.17886904761904762
636
- aggregate,aggregate,HFv1 TruthfulQA,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
637
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,0,0.42857142857142855,0.17886904761904762
638
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,1,0.2857142857142857,0.39875992063492066
639
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,2,0.6182840223353117,0.0340492747686748
640
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,3,0.42857142857142855,0.17886904761904762
641
- aggregate,aggregate,HFv1 Winogrande,hf_open_llm_v1_240829_frozen.csv,kendall,random,8,4,0.2857142857142857,0.39875992063492066
642
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,8,0,0.21428571428571427,0.5484126984126985
643
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,8,1,0.14285714285714285,0.7195436507936508
644
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,8,2,0.42857142857142855,0.17886904761904762
645
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,8,3,0.4999999999999999,0.10868055555555556
646
- aggregate,aggregate,BFCL,bfcl_240906.csv,kendall,random,8,4,0.3571428571428571,0.27509920634920637
647
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
648
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,8,1,0.7857142857142856,0.005505952380952381
649
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
650
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,8,3,0.7142857142857142,0.014136904761904762
651
- aggregate,aggregate,BIGGEN,biggen_240829.csv,kendall,random,8,4,0.7142857142857142,0.014136904761904762
652
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
653
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
654
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
655
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
656
- aggregate,aggregate,BIGGEN Grounding,biggen_240829.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
657
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,8,0,0.40006613209931935,0.17023995462900499
658
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,8,1,0.6910233190806425,0.017844011512848347
659
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
660
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,8,3,0.7142857142857142,0.014136904761904762
661
- aggregate,aggregate,BIGGEN Instruction Following,biggen_240829.csv,kendall,random,8,4,0.47280542884465016,0.10506382347888965
662
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,8,0,0.47280542884465016,0.10506382347888965
663
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,8,1,0.42857142857142855,0.17886904761904762
664
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,8,2,0.7637626158259734,0.008839740160738534
665
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
666
- aggregate,aggregate,BIGGEN Planning,biggen_240829.csv,kendall,random,8,4,0.10910894511799618,0.7083840532183997
667
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,8,0,0.8571428571428571,0.001736111111111111
668
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
669
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,8,2,0.9999999999999998,4.96031746031746e-05
670
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
671
- aggregate,aggregate,BIGGEN Reasoning,biggen_240829.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
672
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,8,0,0.7857142857142856,0.005505952380952381
673
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,8,1,0.7637626158259734,0.008839740160738534
674
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,8,2,0.9285714285714285,0.0003968253968253968
675
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
676
- aggregate,aggregate,BIGGEN Refinement,biggen_240829.csv,kendall,random,8,4,0.7637626158259734,0.008839740160738534
677
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,8,0,-0.2857142857142857,0.39875992063492066
678
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,8,1,0.2545875386086578,0.38281014365989596
679
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,8,2,0.6910233190806425,0.017844011512848347
680
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,8,3,0.10910894511799618,0.7083840532183997
681
- aggregate,aggregate,BIGGEN Safety,biggen_240829.csv,kendall,random,8,4,0.07142857142857142,0.9048611111111111
682
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
683
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,8,1,0.6428571428571428,0.03115079365079365
684
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,8,2,0.836501912571304,0.004136737098676645
685
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,8,3,0.5669467095138409,0.05611472402809984
686
- aggregate,aggregate,BIGGEN Theory of Mind,biggen_240829.csv,kendall,random,8,4,0.6182840223353117,0.0340492747686748
687
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
688
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,8,1,0.4999999999999999,0.10868055555555556
689
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,8,2,0.5714285714285714,0.06101190476190476
690
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
691
- aggregate,aggregate,BIGGEN Tool Usage,biggen_240829.csv,kendall,random,8,4,0.42857142857142855,0.17886904761904762
692
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
693
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
694
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,8,2,0.42857142857142855,0.17886904761904762
695
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,8,3,0.6428571428571428,0.03115079365079365
696
- aggregate,aggregate,BIGGEN Multilingual,biggen_240829.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
697
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
698
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
699
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,8,2,0.8571428571428571,0.001736111111111111
700
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
701
- aggregate,aggregate,LiveBench 240624,livebench_240701.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
702
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,8,0,0.7412493166611012,0.011966745157436277
703
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
704
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,8,2,0.7637626158259734,0.008839740160738534
705
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,8,3,0.7142857142857142,0.014136904761904762
706
- aggregate,aggregate,LiveBench Reasoning Average,livebench_240701.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
707
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
708
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,8,1,0.9285714285714285,0.0003968253968253968
709
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
710
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,8,3,0.9999999999999998,4.96031746031746e-05
711
- aggregate,aggregate,LiveBench Coding Average,livebench_240701.csv,kendall,random,8,4,0.7637626158259734,0.008839740160738534
712
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,8,0,0.7142857142857142,0.014136904761904762
713
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
714
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
715
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,8,3,0.8571428571428571,0.001736111111111111
716
- aggregate,aggregate,LiveBench Mathematics Average,livebench_240701.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
717
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,8,0,0.4999999999999999,0.10868055555555556
718
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
719
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
720
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
721
- aggregate,aggregate,LiveBench Data Analysis Average,livebench_240701.csv,kendall,random,8,4,0.6428571428571428,0.03115079365079365
722
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,8,0,0.6428571428571428,0.03115079365079365
723
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,8,1,0.9999999999999998,4.96031746031746e-05
724
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,8,2,0.7142857142857142,0.014136904761904762
725
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,8,3,0.7857142857142856,0.005505952380952381
726
- aggregate,aggregate,LiveBench Language Average,livebench_240701.csv,kendall,random,8,4,0.8571428571428571,0.001736111111111111
727
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,8,0,0.5714285714285714,0.06101190476190476
728
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,8,1,0.8571428571428571,0.001736111111111111
729
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,8,2,0.7857142857142856,0.005505952380952381
730
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,8,3,0.5714285714285714,0.06101190476190476
731
- aggregate,aggregate,LiveBench Instruction Following Average,livebench_240701.csv,kendall,random,8,4,0.7857142857142856,0.005505952380952381
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cache_old/allbenchs_cache_151f5bfbf87ac7384c2759731c72ec0c.csv DELETED
The diff for this file is too large to render. See raw diff
 
cache_old/allbenchs_cache_1b58bbc4e0d124b0a524da1001369741.csv DELETED
The diff for this file is too large to render. See raw diff
 
cache_old/allbenchs_cache_741f08262e15cba4bd6c8b25f2b138ca.csv DELETED
The diff for this file is too large to render. See raw diff
 
cache_old/allbenchs_cache_dcbcd453e19427bcbf89a901d3f2a925.csv DELETED
The diff for this file is too large to render. See raw diff