from dataclasses import dataclass from enum import Enum @dataclass class Task: benchmark: str metric: str col_name: str # Init: to update with your specific keys class Tasks(Enum): # task_key in the json file, metric_key in the json file, name to display in the leaderboard task0 = Task("BBH", "metric_name", "BBH") task1 = Task("GPQA", "metric_name", "GPQA") task2 = Task("IFEval", "metric_name", "IFEval") task3 = Task("MUSR", "metric_name", "MUSR") task4 = Task("GSM8K", "metric_name", "GSM8K") task5 = Task("MMMLU-fr", "metric_name", "MMMLU-fr") # Your leaderboard name TITLE = """

OpenLLM French leaderboard 🇫🇷

""" # What does your leaderboard evaluate? INTRODUCTION_TEXT = """ Bienvenue sur le Leaderboard des LLM en français, une plateforme pionnière dédiée à l'évaluation des grands modèles de langage (LLM) en français. Alors que les LLM multilingues progressent, ma mission est de mettre en lumière spécifiquement les modèles qui excellent en langue française, en fournissant des benchmarks qui stimulent les avancées dans les LLM en français et l'IA générative pour la langue française. Le Leaderboard utilise ce lien (https://huggingface.co/collections/le-leadboard/openllmfrenchleadboard-jeu-de-donnees-67126437539a23c65554fd88) pour ses benchmarks soigneusement sélectionnés. Les évaluations sont générées et vérifiées à la fois par GPT-4 et par annotation humaine, rendant ainsi ce Leaderboard l'outil le plus précieux et le plus précis pour l'évaluation des LLM en français. 🚀 Soumettez votre Modèle 🚀 Vous avez un LLM en français ? Soumettez-le pour évaluation (Actuellement manuelle, faute de ressources ! En espérant automatiser ce processus avec le soutien de la communauté !), en utilisant le Eleuther AI Language Model Evaluation Harness pour une analyse approfondie des performances. Apprenez-en plus et contribuez aux avancées de l'IA en français sur la page "À propos". Rejoignez l'avant-garde de la technologie linguistique en français. Soumettez votre modèle et faisons progresser ensemble les LLM en français ! """ # Which evaluations are you running? how can people reproduce what you have? LLM_BENCHMARKS_TEXT = f""" ## How it works ## Reproducibility I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adapted for Turkish datasets, to ensure our leaderboard results are both reliable and replicable. Please see https://github.com/malhajar17/lm-evaluation-harness_turkish for more information ## How to Reproduce Results: 1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions. 2) Run Evaluations: To get the results as on the leaderboard (Some tests might show small variations), use the following command, adjusting for your model. For example, with the Trendyol model: ```python lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr-v0.2 --output /workspace/Orbina/Orbita-v0.1 ``` 3) Report Results: The results file generated is then uploaded to the OpenLLM Turkish Leaderboard. ## Notes: - I currently use "vllm" which might differ slightly as per the LM Evaluation Harness. - All the tests are using the same configuration used in the original OpenLLMLeadboard preciesly The tasks and few shots parameters are: - ARC: 25-shot, *arc-challenge* (`acc_norm`) - HellaSwag: 10-shot, *hellaswag* (`acc_norm`) - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`) - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`) - Winogrande: 5-shot, *winogrande* (`acc`) - GSM8k: 5-shot, *gsm8k* (`acc`) """ EVALUATION_QUEUE_TEXT = """ ## Some good practices before submitting a model ### 1) Make sure you can load your model and tokenizer using AutoClasses: ```python from transformers import AutoConfig, AutoModel, AutoTokenizer config = AutoConfig.from_pretrained("your model name", revision=revision) model = AutoModel.from_pretrained("your model name", revision=revision) tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision) ``` If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded. Note: make sure your model is public! Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted! ### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index) It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`! ### 3) Make sure your model has an open license! This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗 ### 4) Fill up your model card When we add extra information about models to the leaderboard, it will be automatically taken from the model card ## In case of model failure If your model is displayed in the `FAILED` category, its execution stopped. Make sure you have followed the above steps first. If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task). """ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" CITATION_BUTTON_TEXT = r""" @misc{openllm-French-leaderboard, author = {Mohamad Alhajar}, title = {Open LLM French Leaderboard v0.2}, year = {2024}, publisher = {Mohamad Alhajar}, howpublished = "\url{https://huggingface.co/spaces/le-leadboard/OpenLLMFrenchLeaderboard}" } """