yzabc007 committed on
Commit
9f4c149
·
1 Parent(s): 1062c17

Update space

Files changed (1)
  1. app.py +16 -10
app.py CHANGED
@@ -137,6 +137,9 @@ with demo:
 
  DESCRIPTION_TEXT = """
  Total #models: 52 (Last updated: 2024-10-08)
+
+ This page provides a comprehensive overview of model ranks across various dimensions. Models are sorted based on their averaged rank across all dimensions.
+ (Some values are missing because of slow or problematic model responses; we will update the leaderboard once we have the complete results.)
  """
  gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -160,8 +163,8 @@ with demo:
  with gr.TabItem("🎯 Overall", elem_id="llm-benchmark-tab-table", id=1):
  DESCRIPTION_TEXT = """
  Overall dimension measures the comprehensive performance of LLMs across diverse tasks.
- We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685), covering a wide range of domains, including writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
-
+ We start with diverse questions from the widely-used [MT-Bench](https://arxiv.org/abs/2306.05685),
+ covering a wide range of domains, including writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
  """
  gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
 
@@ -192,6 +195,7 @@ with demo:
  [MathQA](https://arxiv.org/abs/1905.13319),
  [MathBench](https://arxiv.org/abs/2405.12209),
  [SciBench](https://arxiv.org/abs/2307.10635), and more!
+
  We plan to include more math domains, such as calculus, number theory, and more in the future.
  """
  gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
@@ -250,22 +254,24 @@ with demo:
 
  with gr.TabItem("🧠 Reasoning", elem_id="reasonong-tab-table", id=3):
  DESCRIPTION_TEXT = """
- Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective at distinguishing between modern LLMs.
- Our current focus is on two challenging types of reasoning: logical reasoning and social reasoning, both of which present more meaningful and sophisticated ways to assess LLM performance.
+ Reasoning is a broad domain for evaluating LLMs, but traditional tasks like commonsense reasoning have become less effective at differentiating modern LLMs.
+ We now present two challenging types of reasoning: logical reasoning and social reasoning, both of which offer more meaningful and sophisticated ways to assess LLM performance.
 
- For logical reasoning, we collect datasets from
- [BigBench Hard (BBH)](https://arxiv.org/abs/2210.09261),
+ For logical reasoning, we leverage datasets from sources such as
+ [BIG-Bench Hard (BBH)](https://arxiv.org/abs/2210.09261),
  [FOLIO](https://arxiv.org/abs/2209.00840),
  [LogiQA2.0](https://github.com/csitfun/LogiQA2.0),
  [PrOntoQA](https://arxiv.org/abs/2210.01240),
- [ReClor](https://arxiv.org/abs/2002.04326).
-
+ [ReClor](https://arxiv.org/abs/2002.04326).
+ These cover a range of tasks including deductive reasoning, object counting and tracking, pattern recognition,
+ temporal reasoning, first-order logic reasoning, etc.
  For social reasoning, we collect datasets from
- [MMToM-QA](https://arxiv.org/abs/2401.08743),
+ [MMToM-QA (Text-only)](https://arxiv.org/abs/2401.08743),
  [BigToM](https://arxiv.org/abs/2306.15448),
  [Adv-CSFB](https://arxiv.org/abs/2305.14763),
  [SocialIQA](https://arxiv.org/abs/1904.09728),
- [NormBank](https://arxiv.org/abs/2305.17008).
+ [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks,
+ such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
 
  """
  gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
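
Note on the new overview text: it says models are sorted by their averaged rank across all dimensions, and that some values are missing. The diff does not show how the ranking is computed, so the following is only a minimal sketch of one plausible reading, not the code used by this Space. The function names, the input shape, the tie handling, and the choice to skip missing dimensions when averaging are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical input shape: dimension name -> {model name -> score, or None if missing}.
ScoreTable = Dict[str, Dict[str, Optional[float]]]


def rank_dimension(scores: Dict[str, Optional[float]]) -> Dict[str, int]:
    """Rank models within one dimension; rank 1 is the highest score.

    Models without a score are left out, and tied scores share the best rank.
    """
    valid = {model: s for model, s in scores.items() if s is not None}
    ordered = sorted(valid.values(), reverse=True)
    return {model: ordered.index(s) + 1 for model, s in valid.items()}


def sort_by_average_rank(table: ScoreTable) -> List[Tuple[str, float]]:
    """Return (model, averaged rank) pairs, best averaged rank first."""
    ranks: Dict[str, List[int]] = {}
    for scores in table.values():
        for model, rank in rank_dimension(scores).items():
            ranks.setdefault(model, []).append(rank)
    # Assumption: a missing dimension is skipped rather than penalized.
    averaged = [(model, float(mean(rs))) for model, rs in ranks.items()]
    # Break ties on the model name so the order is deterministic.
    return sorted(averaged, key=lambda pair: (pair[1], pair[0]))


# Made-up scores for three placeholder models; not leaderboard data.
table: ScoreTable = {
    "overall": {"model-a": 8.1, "model-b": 7.4, "model-c": 6.0},
    "math": {"model-a": 55.0, "model-b": 61.0, "model-c": None},
    "reasoning": {"model-a": 70.0, "model-b": 64.0, "model-c": 66.0},
}

leaderboard = sort_by_average_rank(table)
for model, avg_rank in leaderboard:
    print(f"{model}: {avg_rank:.2f}")
```

Skipping missing dimensions keeps a model with incomplete results on the board, but it can flatter a model whose missing dimension would have been a weak one. Assigning the worst rank to missing values is an equally reasonable alternative.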