ArkaAbacus committed
Commit a4e1166
Parent(s): 670b7fa

Update README.md

README.md CHANGED
@@ -13,7 +13,7 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evalu

| Model | Groups | Version | Filter | n-shot | Metric | Value |   | Stderr |
|----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
-| Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
+| **Smaug-Qwen2-72B-Instruct** | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
| Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |

#### Breakdown:
@@ -84,6 +84,14 @@ Qwen2-72B-Instruct:
| - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
| - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |

+## LiveCodeBench
+
+| Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
+|--------------------------|--------|-------------|---------------|-------------|
+| **Smaug-Qwen2-72B-Instruct** | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
+| Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |
+
+
## Arena-Hard

Score vs selected others (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, and we produced those numbers from a local run using the same methodology.
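For context on the BBH rows in this diff: the README attributes them to Eleuther's lm-evaluation-harness with the get-answer filter, 3-shot, exact_match. Below is a minimal sketch of how such a run could be reproduced through the harness's Python API. The task group name `bbh_cot_fewshot` is inferred from the breakdown rows, and the `abacusai/Smaug-Qwen2-72B-Instruct` repo id and the result-key format are assumptions, not something this commit specifies.

```python
# Sketch only (not part of this commit): a 3-shot BBH exact_match run with
# EleutherAI's lm-evaluation-harness (v0.4+) Python API.
# Assumptions: the task group "bbh_cot_fewshot" (inferred from breakdown rows
# such as bbh_cot_fewshot_word_sorting), the HF repo id of the checkpoint,
# and the "exact_match,get-answer" key emitted by the get-answer filter.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                       # HuggingFace transformers backend
    model_args="pretrained=abacusai/Smaug-Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],        # BBH chain-of-thought few-shot group
    num_fewshot=3,                    # matches the n-shot column in the table
)

# Per-task and group scores; the filter name is appended to the metric key.
for task, metrics in results["results"].items():
    print(task, metrics.get("exact_match,get-answer"))
```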