shenzhi-wang committed on
Commit
46b6c6f
·
verified ·
1 Parent(s): 8f69a30

Update README.md

Files changed (1)
  1. README.md +40 -16
README.md CHANGED
@@ -87,22 +87,46 @@ print(response)
  🔒: Proprietary

  ### 3.1 Arena-Hard-Auto
- | | Score | 95% CIs |
- | --------------------------------- | -------- | ----------- |
- | **Xwen-72B-Chat** 🔑 | **86.1** | (-1.5, 1.7) |
- | Qwen2.5-72B-Chat 🔑 | 63.3 | (-2.5, 2.3) |
- | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
- | Claude-3-5-Sonnet-20241022 🔒 | **86.4** | (-1.3, 1.3) |
- | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
- | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
- | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
- | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
- | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
- | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
- | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
+
+ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+
+ #### 3.1.1 No Style Control
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
+ | Qwen2.5-72B-Chat 🔑 | 78.0 | (-1.8, 1.8) |
+ | Athene-v2-Chat 🔑 | 85.0 | (-1.4, 1.7) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 84.9 | (-1.7, 1.8) |
+ | Llama-3.1-405B-Instruct-FP8 🔑 | 69.3 | (-2.4, 2.2) |
+ | Claude-3-5-Sonnet-20241022 🔒 | 85.2 | (-1.4, 1.6) |
+ | O1-Preview-2024-09-12 🔒 | **92.0** (Top-1 among 🔒) | (-1.2, 1.0) |
+ | O1-Mini-2024-09-12 🔒 | 90.4 | (-1.1, 1.3) |
+ | GPT-4-Turbo-2024-04-09 🔒 | 82.6 | (-1.8, 1.5) |
+ | GPT-4-0125-Preview 🔒 | 78.0 | (-2.1, 2.4) |
+ | GPT-4o-2024-08-06 🔒 | 77.9 | (-2.0, 2.1) |
+ | Yi-Lightning 🔒 | 81.5 | (-1.6, 1.6) |
+ | Yi-Large🔒 | 63.7 | (-2.6, 2.4) |
+ | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |
+
+ #### 3.1.2 Style Control
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 Among 🔑) | (-4.3, 4.1) |
+ | Qwen2.5-72B-Chat 🔑 | 63.3 | (-2.5, 2.3) |
+ | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
+ | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
+ | Claude-3-5-Sonnet-20241022 🔒 | **86.4** (Top-1 Among 🔒) | (-1.3, 1.3) |
+ | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
+ | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
+ | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
+ | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
+ | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
+ | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
+ | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
+ | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
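
Reading the tables: the "95% CIs" column appears to list offsets relative to the score rather than absolute bounds. Under that assumption (not stated in the diff itself), a minimal sketch for turning a row into an absolute interval and checking whether two rows overlap could look like the following. The `Entry` class and `intervals_overlap` helper are hypothetical names introduced for illustration, and the three rows are copied from the Style Control table in this commit.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entry:
    """One leaderboard row. ci_low/ci_high are ASSUMED to be offsets from score."""
    name: str
    score: float
    ci_low: float   # negative offset, e.g. -4.3
    ci_high: float  # positive offset, e.g. 4.1

    def interval(self) -> Tuple[float, float]:
        # Convert the (offset_low, offset_high) pair into absolute bounds.
        return (round(self.score + self.ci_low, 1),
                round(self.score + self.ci_high, 1))


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True if the two absolute intervals share at least one point."""
    a_lo, a_hi = a.interval()
    b_lo, b_hi = b.interval()
    return a_lo <= b_hi and b_lo <= a_hi


# Rows taken from the "3.1.2 Style Control" table.
xwen = Entry("Xwen-72B-Chat", 72.4, -4.3, 4.1)
athene = Entry("Athene-v2-Chat", 72.1, -2.5, 2.5)
claude = Entry("Claude-3-5-Sonnet-20241022", 86.4, -1.3, 1.3)

print(xwen.interval())                  # (68.1, 76.5)
print(intervals_overlap(xwen, athene))  # True
print(intervals_overlap(xwen, claude))  # False
```

Overlapping intervals are only a rough indication that two scores may not be clearly separated; this check is not a formal significance test.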