lioushz

Shz

AI & ML interests

None yet

Shz's activity

liked a dataset about 1 month ago

opencompass/LiveMathBench

Viewer • Updated 15 days ago • 283 • 216 • 4

upvoted a paper about 2 months ago

Are Your LLMs Capable of Stable Reasoning?

Paper • 2412.13147 • Published Dec 17, 2024 • 91

updated a dataset 3 months ago

opencompass/mmmlu_lite

Viewer • Updated Nov 1, 2024 • 20k • 50 • 2

liked a dataset 3 months ago

opencompass/mmmlu_lite

Viewer • Updated Nov 1, 2024 • 20k • 50 • 2

upvoted a paper 4 months ago

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Paper • 2410.16256 • Published Oct 21, 2024 • 60

liked a Space 4 months ago

Open VLM Video Leaderboard

🌎

VLMEvalKit Eval Results in video understanding benchmark

upvoted a paper 5 months ago

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Paper • 2409.16191 • Published Sep 24, 2024 • 42

liked a dataset 6 months ago

MU-NLPC/Calc-gsm8k

Viewer • Updated Oct 30, 2023 • 17.6k • 202 • 5

upvoted a paper 7 months ago

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Paper • 2407.11963 • Published Jul 16, 2024 • 44

liked a Space 7 months ago

4.35k

OpenGPT 4o

🔥

GPT 4o like bot.

upvoted 2 papers 8 months ago

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Paper • 2406.14515 • Published Jun 20, 2024 • 33

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Paper • 2406.14544 • Published Jun 20, 2024 • 35

liked 4 models about 2 years ago