kennymckormick committed · Commit e2f94e9 · 1 Parent(s): 907ba25

update sorting logic

Files changed:
- gen_table.py +2 -1
- meta_data.py +7 -1
gen_table.py CHANGED
@@ -88,7 +88,8 @@ def BUILD_L1_DF(results, fields):
     res['Avg Rank'].append(round(np.mean(ranks), 2))
 
     df = pd.DataFrame(res)
-    df = df.sort_values('Avg
+    df = df.sort_values('Avg Score')
+    df = df.iloc[::-1]
 
     check_box = {}
     check_box['essential'] = ['Method', 'Parameters (B)', 'Language Model', 'Vision Model']
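Note on the new sorting logic: below is a minimal, self-contained sketch of what the two added lines do, assuming a DataFrame with an 'Avg Score' column like the one BUILD_L1_DF builds (the rows are placeholders, not real leaderboard data). Sorting ascending and then reversing with iloc[::-1] leaves the highest Avg Score on top; df.sort_values('Avg Score', ascending=False) behaves the same apart from the relative order of tied rows.

import pandas as pd

# Placeholder leaderboard rows (hypothetical, for illustration only).
res = {
    'Method': ['model_a', 'model_b', 'model_c'],
    'Avg Score': [61.3, 72.8, 55.0],
    'Avg Rank': [2.5, 1.2, 3.1],
}
df = pd.DataFrame(res)

df = df.sort_values('Avg Score')  # ascending: lowest score first
df = df.iloc[::-1]                # reverse the rows: highest score first

print(df['Method'].tolist())      # ['model_b', 'model_a', 'model_c']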
meta_data.py CHANGED
@@ -37,7 +37,7 @@ LEADERBOARD_MD['MAIN'] = f"""
 - Metrics:
   - Avg Score: The average score on all VLM Benchmarks (normalized to 0 - 100, the higher the better).
   - Avg Rank: The average rank on all VLM Benchmarks (the lower the better).
-- The overall evaluation results on {len(MAIN_FIELDS)} VLM benchmarks, sorted by the
+- The overall evaluation results on {len(MAIN_FIELDS)} VLM benchmarks, sorted by the descending order of Avg Score.
 - The following datasets are included in the main results: {', '.join(MAIN_FIELDS)}.
 - Detailed evaluation results for each dataset (included or not included in main) are provided in the consequent tabs.
 """
@@ -151,4 +151,10 @@ LEADERBOARD_MD['MMStar'] = """
 
 - MMStar is an elite vision-indispensable multi-modal benchmark, including 1,500 challenging samples meticulously selected by humans.
 - During the evaluation of MMStar, we find that some API models may reject to answer some of the questions. Currently, we treat such cases as wrong answers when reporting the results.
+"""
+
+LEADERBOARD_MD['RealWorldQA'] = """
+## RealWorldQA Evaluation Results
+
+- RealWorldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal AI models, contributed by XAI. It assesses how well these models comprehend physical environments. The benchmark consists of 700+ images, each accompanied by a question and a verifiable answer. These images are drawn from real-world scenarios, including those captured from vehicles. The goal is to advance AI models' understanding of our physical world.
 """
|