StarscreamDeceptions committed
Commit: 70a352d
1 Parent(s): f359f0f
Update src/about.py

src/about.py CHANGED (+79 -57)
@@ -15,27 +15,28 @@ class Tasks(Enum):
     # task0 = Task("mmmlu", "acc", "MMMLU")
     # task1 = Task("mmlu", "acc", "MMLU")
     # task2 = Task("cmmlu", "acc", "CMMLU")
-    mmmlu_ar = Task("mmmlu_ar", "acc", "
-    mmmlu_bn = Task("mmmlu_bn", "acc", "
-    mmmlu_de = Task("mmmlu_de", "acc", "
-    mmmlu_es = Task("mmmlu_es", "acc", "
-    mmmlu_fr = Task("mmmlu_fr", "acc", "
-    mmmlu_hi = Task("mmmlu_hi", "acc", "
-    mmmlu_id = Task("mmmlu_id", "acc", "
-    mmmlu_it = Task("mmmlu_it", "acc", "
-    mmmlu_ja = Task("mmmlu_ja", "acc", "
-    mmmlu_ko = Task("mmmlu_ko", "acc", "
-    mmmlu_pt = Task("mmmlu_pt", "acc", "
-    mmmlu_sw = Task("mmmlu_sw", "acc", "
-    mmmlu_yo = Task("mmmlu_yo", "acc", "
-    mmmlu_zh = Task("mmmlu_zh", "acc", "
+    mmmlu_ar = Task("mmmlu_ar", "acc", "AR")
+    mmmlu_bn = Task("mmmlu_bn", "acc", "BN")
+    mmmlu_de = Task("mmmlu_de", "acc", "DE")
+    mmmlu_es = Task("mmmlu_es", "acc", "ES")
+    mmmlu_fr = Task("mmmlu_fr", "acc", "FR")
+    mmmlu_hi = Task("mmmlu_hi", "acc", "HI")
+    mmmlu_id = Task("mmmlu_id", "acc", "ID")
+    mmmlu_it = Task("mmmlu_it", "acc", "IT")
+    mmmlu_ja = Task("mmmlu_ja", "acc", "JA")
+    mmmlu_ko = Task("mmmlu_ko", "acc", "KO")
+    mmmlu_pt = Task("mmmlu_pt", "acc", "PT")
+    mmmlu_sw = Task("mmmlu_sw", "acc", "SW")
+    mmmlu_yo = Task("mmmlu_yo", "acc", "YO")
+    mmmlu_zh = Task("mmmlu_zh", "acc", "ZH")
 NUM_FEWSHOT = 5 # Change with your few shot
 # ---------------------------------------------------



 # Your leaderboard name
-TITLE = """<
+TITLE = """<img src="https://raw.githubusercontent.com/BobTsang1995/Multilingual-MMLU-Benchmark-Leaderboard/main/static/title/title.png" style="width:30%;display:block;margin-left:auto;margin-right:auto;border-radius:15px;">"""
+

 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
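For context, the `Task` type used by the hunk above is not shown in this commit. In the Hugging Face leaderboard template this file appears to follow, `Task` is typically a small dataclass whose fields name the results key, the metric, and the display column. A minimal sketch, assuming those field names (they are not confirmed by this diff):

```python
# Assumed shape of Task. Field names follow the common HF leaderboard
# template (benchmark key, metric name, display column); they are an
# assumption, not something this commit shows.
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str  # key of the task in the results files, e.g. "mmmlu_ar"
    metric: str     # metric to read for that task, e.g. "acc"
    col_name: str   # column header shown in the leaderboard UI, e.g. "AR"

class Tasks(Enum):
    mmmlu_ar = Task("mmmlu_ar", "acc", "AR")
    mmmlu_zh = Task("mmmlu_zh", "acc", "ZH")

# The UI then derives one score column per enum member:
print([t.value.col_name for t in Tasks])  # ['AR', 'ZH']
```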
@@ -49,9 +50,6 @@ INTRODUCTION_TEXT_ZH = """
 LLM_BENCHMARKS_TEXT = """
 ## 💡 About "Multilingual Benchmark MMLU Leaderboard"

-- Press release: [TBD - XXX](#), [TBD - XXX](#), [TBD - XXX](#), [TBD - XXX](#)
-- YouTube: [TBD - XXX](#)
-
 ### Overview
 The **Multilingual Massive Multitask Language Understanding (MMMLU)** benchmark is a comprehensive evaluation platform designed to assess the general knowledge capabilities of AI models across a wide range of domains. It includes a series of **Question Answering (QA)** tasks across **57 distinct domains**, ranging from elementary-level knowledge to advanced professional subjects such as law, physics, history, and computer science.
@@ -107,48 +105,59 @@ Notes:

 You can find:

-- Detailed numerical results in the [results dataset](
-- Community queries and running status in the [requests dataset](
+- Detailed numerical results in the [results dataset](https://huggingface.co/datasets/StarscreamDeceptions/results)
+- Community queries and running status in the [requests dataset](https://huggingface.co/datasets/StarscreamDeceptions/requests)

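For readers who want to inspect those scores programmatically, a minimal sketch is shown below. It assumes only that the linked results repo is a dataset repo of per-model JSON files; the diff itself does not document the layout.

```python
# Hypothetical: download the results dataset linked above and peek at the
# JSON files inside. The per-model file layout is an assumption, not
# something this commit documents.
import json
import pathlib

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="StarscreamDeceptions/results",
    repo_type="dataset",
)
for path in sorted(pathlib.Path(local_dir).rglob("*.json")):
    with open(path) as f:
        record = json.load(f)
    # File name plus a peek at the top-level keys (assuming JSON objects).
    print(path.name, list(record)[:5])
```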
 ### ✅ Reproducibility

-To reproduce the results, you can use [
+To reproduce the results, you can use [opencompass](https://github.com/BobTsang1995/opencompass). Since many open-source models cannot fully adhere to instructions on QA tasks, we post-process the results with Qwen2.5-7B-Instruct to extract the answer (option A, B, C, or D) from each model's output; this is a relatively simple task, so the model's intended answer can generally be recovered. As not all of our PRs are integrated into the main opencompass repository yet, use our fork:
+```
+git clone git@github.com:BobTsang1995/opencompass.git
+cd opencompass
+pip install -e .
+pip install lmdeploy
+python run.py --models lmdeploy_qwen2_7b_instruct --datasets mmmlu_gen_5_shot -a lmdeploy
+```

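The Qwen-based answer extraction described above lives in that fork, not in this diff. A minimal sketch of what such a post-processing step could look like (the prompt wording and parsing here are assumptions):

```python
# Hypothetical sketch of the described post-processing: ask
# Qwen2.5-7B-Instruct to reduce a free-form model response to one of the
# options A/B/C/D. The real implementation is in the opencompass fork.
import re

from transformers import pipeline

extractor = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def extract_choice(question: str, response: str) -> str | None:
    prompt = (
        "Given a multiple-choice question and a model's free-form answer, "
        "reply with only the letter of the chosen option (A, B, C, or D).\n\n"
        f"Question:\n{question}\n\nModel answer:\n{response}\n\nLetter:"
    )
    out = extractor(prompt, max_new_tokens=4, do_sample=False)
    # generated_text echoes the prompt; keep only the completion.
    completion = out[0]["generated_text"][len(prompt):]
    match = re.search(r"[ABCD]", completion)
    return match.group(0) if match else None
```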
 ## 🙌 Acknowledgements

-This leaderboard was developed as
+This leaderboard was independently developed as a non-profit initiative with the support of several academic institutions, whose assistance made this effort possible. We extend our heartfelt gratitude to them.

-
-- [
-- [
-- [
-- [
-- [
-- [
-- [
-- [
+
+- [Technische Universität München (TUM)](https://www.tum.de/)
+- [Tsinghua University](https://www.tsinghua.edu.cn/en/)
+- [Universiteit van Amsterdam](https://uva.nl/)
+- [Mohamed Bin Zayed University of Artificial Intelligence](https://mbzuai.ac.ae/)
+- [University of Macau](https://www.um.edu.mo/)
+- [Cardiff University](https://www.cardiff.ac.uk/)
+- [Nara Institute of Science and Technology](https://www.naist.jp/en/)
+- [Shanghai Jiao Tong University](https://en.sjtu.edu.cn/)
+- [Dublin City University](https://www.dcu.ie/)
+- [Université Grenoble Alpes](https://www.univ-grenoble-alpes.fr/)
+- [Universidade de Coimbra](https://www.uc.pt/)
+- [The Ohio State University](https://www.osu.edu/)
+- [RMIT University](https://www.rmit.edu.au/)

 The entities above are ordered chronologically by the date they joined the project. However, the logos in the footer are ordered by the number of datasets donated.

 Thank you in particular to:
-- Task implementation:
-- Leaderboard implementation:
-- Model evaluation:
-- Communication:
-- Organization & colab leads:
+- Task implementation: Bo Zeng, Yue Zhao, Chengyang Lyu, Huifeng Yin
+- Leaderboard implementation: Bo Zeng, Longyue Wang
+- Model evaluation: Bo Zeng, Tianqi Shi, Fengye Liu, Lingfeng Ming, Xue Yang, Yiyu Wang
+- Communication: Longyue Wang, Weihua Luo, Kaifu Zhang
+- Organization & collab leads: Yi Zhou (Cardiff University), Yusuke Sakai (Nara Institute of Science and Technology), Yongxin Zhou (Université Grenoble Alpes), Haonan Li (MBZUAI), Jiahui Geng (MBZUAI), Qing Li (MBZUAI), Wenxi Li (Tsinghua University/Shanghai Jiao Tong University), Yuanyu Lin (University of Macau), Andy Way (Dublin City University), Zhuang Li (RMIT University), Zhongwei Wan (The Ohio State University), Di Wu (University of Amsterdam), Wen Lai (Technical University of Munich)

 For information about the dataset authors please check the corresponding Dataset Cards (linked in the "Tasks" tab) and papers (included in the "Citation" section below). We would like to specially thank the teams that created or open-sourced their datasets specifically for the leaderboard (in chronological order):
-- [
-
-- [Dataset6 Placeholder]: [Team members placeholder]
+- [MMMLU](https://huggingface.co/datasets/openai/MMMLU): OpenAI
+

-We also thank
+We also thank the MacroPolo Team at Alibaba International Digital Commerce for sponsoring the inference GPUs.

 ## 🚀 Collaborate!

 We would like to create a leaderboard as diverse as possible; reach out if you would like us to include your evaluation dataset!

-Comments and suggestions are more than welcome! Visit the [👏
+Comments and suggestions are more than welcome! Visit the [👏 Multilingual-MMLU-Benchmark-Leaderboard discussions](https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard/discussions) page, tell us what you think about the MMMLU Leaderboard and how we can improve it, or go ahead and open a PR!

 Thank you very much! 💛

@@ -292,7 +301,19 @@ If everything is done, check you can launch the EleutherAIHarness on your model
 """

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
-CITATION_BUTTON_TEXT = r"""
+CITATION_BUTTON_TEXT = r"""@misc{multilingual-mmlu-benchmark-leaderboard-2024,
+    author = {Bo Zeng and Tianqi Shi and Yefeng Liu and Lingfeng Ming and Xue Yang and Yiyu Wang and Yue Zhao and Chengyang Lyu and Huifeng Yin and Longyue Wang},
+    title = {Multilingual MMLU Benchmark Leaderboard},
+    year = {2024},
+    publisher = {Hugging Face},
+    howpublished = "\url{https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard}"
+}
+
+@article{hendrycks2020measuring,
+    title={Measuring massive multitask language understanding},
+    author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
+    journal={arXiv preprint arXiv:2009.03300},
+    year={2020}
+}
 """
 EVALUATION_QUEUE_TEXT_ZH = """
 ## 提交模型前的一些良好实践

@@ -320,20 +341,21 @@ tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
 模型失败时的处理
 如果你的模型出现在 FAILED 分类中,表示其执行停止。首先确保你已经遵循了上述步骤。如果一切都完成,检查你是否可以使用上面的命令在本地启动 EleutherAIHarness 来测试你的模型(你可以添加 --limit 来限制每个任务的示例数)。 """

-CITATION_BUTTON_LABEL = "复制以下代码引用这些结果"
-CITATION_BUTTON_TEXT = r"""
-"""
+# CITATION_BUTTON_LABEL = "复制以下代码引用这些结果"
+# CITATION_BUTTON_TEXT = r"""
+# """
 LOGOS = [
-    "logo/amsterdam-logo.png",
-    "logo/cardiff-logo.png",
-    "logo/coimbra-logo.png",
-    "logo/dcu-logo.png",
-    "logo/MBZU-logo.png",
-    "logo/NAIST-logo.png",
-    "logo/OSU-logo.png",
-    "logo/rmit.png",
-    "logo/sjtu-logo.png",
-    "logo/tsinghua-logo.png",
-    "logo/UGA-logo.png",
-    "logo/um-logo.png"
+    # "logo/amsterdam-logo.png",
+    # "logo/cardiff-logo.png",
+    # "logo/coimbra-logo.png",
+    # "logo/dcu-logo.png",
+    # "logo/MBZU-logo.png",
+    # "logo/NAIST-logo.png",
+    # "logo/OSU-logo.png",
+    # "logo/rmit.png",
+    # "logo/sjtu-logo.png",
+    # "logo/tsinghua-logo.png",
+    # "logo/UGA-logo.png",
+    # "logo/um-logo.png"
+    "logo/all.png"
 ]
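One practical note on the evaluation-queue text above (the hunk context mentions loading the model with `AutoTokenizer.from_pretrained(..., revision=revision)`): before submitting, it is worth confirming the model loads from the Hub at a pinned revision. A minimal sketch, with the placeholder model name taken from the queue text itself:

```python
# Hypothetical pre-submission sanity check, mirroring the advice in the
# evaluation-queue text: make sure the model and tokenizer load cleanly
# from the Hub at the exact revision you plan to submit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your model name"  # placeholder from the queue text
revision = "main"               # placeholder; pin a specific commit hash

model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

# A quick local harness run with a small --limit (as the FAILED note
# suggests) is then a cheap way to catch execution errors early.
```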