---
title: Multilingual MMLU Benchmark Leaderboard
emoji: π
colorFrom: pink
colorTo: purple
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
description: >-
  Compare multilingual models on the MMLU benchmark leaderboard across languages
  and tasks
tags:
  - multilingual
  - benchmark
  - MMMLU
  - leaderboard
  - machine learning
---
# Multilingual MMLU Benchmark Leaderboard
Welcome to the Multilingual MMLU Benchmark Leaderboard! This leaderboard is designed to assess the performance of both open-source and closed-source language models on the Multilingual MMLU (Massive Multitask Language Understanding) benchmark. The benchmark evaluates the memorization, reasoning, and linguistic capabilities of models across a wide range of languages, making it a crucial tool for comparing multilingual AI performance.
## Overview
The Multilingual MMLU Benchmark is a comprehensive evaluation platform for AI models, assessing their general knowledge and ability to handle diverse tasks across 57 distinct domains. These tasks range from elementary-level knowledge to more advanced subjects such as law, physics, history, and computer science. With this leaderboard, we aim to provide an accessible and reliable way for developers, researchers, and organizations to compare language models' multilingual understanding and reasoning abilities.
## Evaluation Scope

### Key Features
- Multilingual Coverage: The leaderboard evaluates language models in 14 languages, including widely spoken languages such as Spanish, Arabic, and Chinese, as well as lower-resource languages such as Swahili and Yoruba.
- Diverse Domains: The benchmark includes tasks across 57 domains, ensuring that models are tested on a wide range of topics, from elementary knowledge to complex professional fields.
- Comprehensive QA Tasks: The evaluation focuses on Question Answering (QA) tasks, where models answer questions based on general knowledge in various domains. This helps assess both the depth and breadth of the model's knowledge.
### Languages Covered
The leaderboard includes models evaluated on the following languages:
- `AR_XY`: Arabic
- `BN_BD`: Bengali
- `DE_DE`: German
- `ES_LA`: Spanish
- `FR_FR`: French
- `HI_IN`: Hindi
- `ID_ID`: Indonesian
- `IT_IT`: Italian
- `JA_JP`: Japanese
- `KO_KR`: Korean
- `PT_BR`: Brazilian Portuguese
- `SW_KE`: Swahili
- `YO_NG`: Yoruba
- `ZH_CN`: Simplified Chinese
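If you process leaderboard results programmatically, the same codes can be expressed as a plain Python mapping. This is only a convenience sketch that mirrors the list above; it is not taken from the app's source.

```python
# Language codes used on the leaderboard, mapped to human-readable names.
# This simply mirrors the list above.
LANGUAGES = {
    "AR_XY": "Arabic",
    "BN_BD": "Bengali",
    "DE_DE": "German",
    "ES_LA": "Spanish",
    "FR_FR": "French",
    "HI_IN": "Hindi",
    "ID_ID": "Indonesian",
    "IT_IT": "Italian",
    "JA_JP": "Japanese",
    "KO_KR": "Korean",
    "PT_BR": "Brazilian Portuguese",
    "SW_KE": "Swahili",
    "YO_NG": "Yoruba",
    "ZH_CN": "Simplified Chinese",
}
```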
## How It Works

### Submitting Models
To evaluate your model on the Multilingual MMLU Benchmark, you can submit it through the "Submit here" tab on the leaderboard. The evaluation process will run your model through a series of tests across the 57 domains in the 14 supported languages. Results will be provided on the leaderboard, with detailed scores for each language and domain.
### Evaluation Process
We use the OpenCompass framework to automate the evaluation process. The framework lets us run many tests across different languages and domains efficiently, keeping the evaluation scalable and reproducible. The evaluation setup is as follows:

- Evaluation Method: All tasks are evaluated in a 5-shot setting (a minimal prompt sketch appears right after this list).
- Normalization: Results are normalized using the following formula:

  `normalized_value = (raw_value - random_baseline) / (max_value - random_baseline)`
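To make the 5-shot setting concrete, here is a minimal sketch of how such a prompt could be assembled for a multiple-choice question. It is illustrative only; the exact template, few-shot example selection, and answer extraction are handled by the evaluation framework and may differ.

```python
# Illustrative 5-shot prompt construction for a multiple-choice QA item.
# The field names ("question", "choices", "answer") are hypothetical.

def format_item(question: str, choices: list[str], answer: str | None = None) -> str:
    """Render one question with lettered choices; append the answer if known."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(few_shot_examples: list[dict], test_item: dict) -> str:
    """Concatenate five solved examples followed by the unanswered test item."""
    shots = [
        format_item(ex["question"], ex["choices"], ex["answer"])
        for ex in few_shot_examples[:5]
    ]
    shots.append(format_item(test_item["question"], test_item["choices"]))
    return "\n\n".join(shots)
```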
#### Evaluation Details

- For generative tasks: `random_baseline = 0`.
- For multiple-choice QA tasks: `random_baseline = 1/n`, where `n` is the number of choices.

#### Aggregated Results

Scores for each language are averaged across all tasks to provide a comprehensive model evaluation.
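As a sketch of the normalization and per-language aggregation described above (the field names are illustrative, not the leaderboard's actual result schema):

```python
# Sketch of the normalization and per-language aggregation described above.

def task_baseline(num_choices: int | None) -> float:
    """Generative tasks have baseline 0; n-way multiple-choice QA has baseline 1/n."""
    return 0.0 if num_choices is None else 1.0 / num_choices

def normalize(raw_value: float, random_baseline: float, max_value: float = 1.0) -> float:
    """normalized_value = (raw_value - random_baseline) / (max_value - random_baseline)."""
    return (raw_value - random_baseline) / (max_value - random_baseline)

def language_score(task_results: list[dict]) -> float:
    """Average the normalized scores of all tasks for one language.

    Each entry is assumed to look like {"raw": 0.62, "num_choices": 4};
    these field names are hypothetical.
    """
    normalized = [
        normalize(r["raw"], task_baseline(r.get("num_choices")))
        for r in task_results
    ]
    return sum(normalized) / len(normalized)
```

For example, a raw accuracy of 0.62 on a 4-choice task normalizes to (0.62 - 0.25) / (1 - 0.25) ≈ 0.49.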
## Results
- Numerical Results: You can access the detailed evaluation results in the results dataset.
- Community Queries: Track ongoing tasks and request status in the requests dataset.
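If you want to inspect the raw numbers yourself, the results dataset can be loaded with the Hugging Face `datasets` library. The repository id below is a placeholder; use the actual results dataset linked from the leaderboard.

```python
from datasets import load_dataset

# "your-org/multilingual-mmlu-results" is a placeholder; substitute the
# results dataset repository linked from the leaderboard.
results = load_dataset("your-org/multilingual-mmlu-results")
print(results)
```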
## Reproducibility
To ensure reproducibility, we provide a fork of the lm_eval framework, which allows you to recreate the evaluation setup and results on your own models. While not all contributions are integrated into the main repository, our fork contains all the necessary updates for evaluation.
For detailed setup instructions and to reproduce results on your local machine, please refer to our lm_eval fork.
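As a rough sketch, a reproduction run through the lm-evaluation-harness Python API might look like the snippet below. The task name is a placeholder; the fork's README documents the actual multilingual MMLU task identifiers and any fork-specific options, and its interface may differ from upstream.

```python
import lm_eval

# Sketch of a 5-shot run via the upstream lm-evaluation-harness Python API.
# "m_mmlu_fr" is a placeholder task name; use the identifiers documented
# in the fork. The model below is only an example.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["m_mmlu_fr"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```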
## Acknowledgements
This leaderboard was developed as part of the #ProjectName, led by [OrganizationName], and has benefited from the support of high-quality evaluation datasets donated by the following institutions:
- [Institution 1]
- [Institution 2]
- [Institution 3]
- [Institution 4]
- [Institution 5]
- [Institution 6]
- [Institution 7]
We also thank [Institution1], [Institution2], and [Institution3] for sponsoring inference GPUs, which were crucial for running the large-scale evaluations.
### Special Thanks to the Contributors
- Task Implementation: [Name1], [Name2], [Name3]
- Leaderboard Implementation: [Name4], [Name5]
- Model Evaluation: [Name6], [Name7]
- Communications: [Name8], [Name9]
- Organization & Collaboration: [Name10], [Name11], [Name12]
For more information on the datasets, please refer to the Dataset Cards available in the "Tasks" tab and the Citations section below.
## Collaborate
We are always looking for collaborators to expand the range of evaluated models and datasets. If you would like us to include your evaluation dataset or contribute in any other way, please feel free to get in touch!
Your feedback, suggestions, and contributions are more than welcome! Visit our [Community Page] to share your thoughts, or feel free to open a pull request (PR).
Thank you for your interest in the Multilingual MMLU Benchmark Leaderboard. Let's work together to advance multilingual AI capabilities!
For more details on dataset authors and dataset cards, please refer to the "Tasks" tab.