Open-LLM performances are plateauing, let’s make the leaderboard steep again 🏔 Update leaderboard for fair model evaluation