Update space
app.py
CHANGED
@@ -137,6 +137,9 @@ with demo:
 
     DESCRIPTION_TEXT = """
     Total #models: 52 (Last updated: 2024-10-08)
+
+    This page provides a comprehensive overview of model ranks across various dimensions. Models are sorted by their average rank across all dimensions.
+    (Some missing values are due to slow or problematic model responses; we will update the leaderboard once we have the complete results.)
     """
     gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -160,8 +163,8 @@ with demo:
     with gr.TabItem("🎯 Overall", elem_id="llm-benchmark-tab-table", id=1):
         DESCRIPTION_TEXT = """
         Overall dimension measures the comprehensive performance of LLMs across diverse tasks.
-        We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
-
+        We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
+        covering a wide range of domains, including writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
         """
         gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -192,6 +195,7 @@ with demo:
         [MathQA](https://arxiv.org/abs/1905.13319),
         [MathBench](https://arxiv.org/abs/2405.12209),
         [SciBench](https://arxiv.org/abs/2307.10635), and more!
+
         We plan to include more math domains, such as calculus, number theory, and more in the future.
         """
         gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
@@ -250,22 +254,24 @@ with demo:
 
     with gr.TabItem("🧠 Reasoning", elem_id="reasonong-tab-table", id=3):
         DESCRIPTION_TEXT = """
-        Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective
-
+        Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective in differentiating modern LLMs.
+        We now present two challenging types of reasoning: logical reasoning and social reasoning, both of which offer more meaningful and sophisticated ways to assess LLM performance.
 
-        For logical reasoning, we
-        [
+        For logical reasoning, we leverage datasets from sources such as
+        [BIG-Bench Hard (BBH)](https://arxiv.org/abs/2210.09261),
         [FOLIO](https://arxiv.org/abs/2209.00840),
         [LogiQA2.0](https://github.com/csitfun/LogiQA2.0),
         [PrOntoQA](https://arxiv.org/abs/2210.01240),
-        [ReClor](https://arxiv.org/abs/2002.04326)
-
+        [ReClor](https://arxiv.org/abs/2002.04326).
+        These cover a range of tasks including deductive reasoning, object counting and tracking, pattern recognition,
+        temporal reasoning, first-order logic reasoning, etc.
         For social reasoning, we collect datasets from
-        [MMToM-QA](https://arxiv.org/abs/2401.08743),
+        [MMToM-QA (Text-only)](https://arxiv.org/abs/2401.08743),
         [BigToM](https://arxiv.org/abs/2306.15448),
         [Adv-CSFB](https://arxiv.org/abs/2305.14763),
         [SocialIQA](https://arxiv.org/abs/1904.09728),
-        [NormBank](https://arxiv.org/abs/2305.17008)
+        [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks
+        such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
 
         """
         gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
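All four hunks edit the same Gradio pattern: a triple-quoted DESCRIPTION_TEXT string rendered as Markdown inside one tab of the leaderboard. For context, a minimal, self-contained sketch of that pattern follows; the tab label, elem_id, and description text are illustrative stand-ins rather than the Space's actual code, and it assumes a recent Gradio release.

    # Minimal sketch of the pattern edited above (illustrative names only):
    # a Markdown description string rendered inside a leaderboard tab.
    import gradio as gr

    with gr.Blocks() as demo:
        # A gr.TabItem placed directly under gr.Blocks is automatically
        # wrapped in a Tabs container; `id` lets other code select this tab.
        with gr.TabItem("🎯 Overall", elem_id="llm-benchmark-tab-table", id=1):
            DESCRIPTION_TEXT = """
            Example description; Markdown links such as
            [MT-Bench](https://arxiv.org/abs/2306.05685) render as hyperlinks.
            """
            gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")

    if __name__ == "__main__":
        demo.launch()

The elem_classes="markdown-text" hook suggests these blocks are styled by custom CSS passed to gr.Blocks elsewhere in the app; that CSS is outside this diff.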