---
title: LeaderboardFinder
emoji: 🐢
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.22.0
app_file: app.py
pinned: false
---

If you want your leaderboard to appear, feel free to add the relevant information to its metadata, and it will be displayed here (a sketch of such metadata is shown at the end of this page).

# Categories

## Submission type

Arenas are not concerned by this category.

- `submission:automatic`: users can submit their models to the leaderboard as such, and evaluation is run automatically without human intervention
- `submission:semiautomatic`: the leaderboard requires the model owner to run evaluations on their side and submit the results
- `submission:manual`: the leaderboard requires the leaderboard owner to run evaluations for new submissions
- `submission:closed`: the leaderboard does not accept submissions at the moment

## Test set status

Arenas are not concerned by this category.

- `test:public`: all the test sets used are public, so the evaluations are fully reproducible
- `test:mix`: some test sets are public and some are private
- `test:private`: all the test sets used are private, making the evaluations hard to game
- `test:rolling`: the test sets change regularly over time, and evaluation scores are refreshed accordingly

## Judges

- `judge:auto`: evaluations are run automatically, using an evaluation suite such as `lm_eval` or `lighteval`
- `judge:model`: evaluations are run using a model-as-a-judge approach to rate answers
- `judge:humans`: evaluations are done by humans rating answers - this is an arena
- `judge:vibe_check`: evaluations are done manually by one human

## Modalities

Can be any (or several) of the following:

- `modality:text`
- `modality:image`
- `modality:video`
- `modality:audio`

A bit outside of the usual modalities:

- `modality:tools`: requires added tool usage - mostly for assistant models
- `modality:artefacts`: the leaderboard concerns itself with machine learning artefacts themselves, for example, quality evaluation of text embeddings

## Evaluation categories

Can be any (or several) of the following:

- `eval:generation`: the evaluation looks at generation capabilities specifically (image generation, text generation, ...)
- `eval:math`
- `eval:code`
- `eval:performance`: model performance (speed, energy consumption, ...)
- `eval:safety`: safety, toxicity, and bias evaluations

## Language

You can indicate the languages covered by your benchmark like so: `language:mylanguage`. At the moment, we do not support language codes; please use the language name in English.
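
## Example

As a minimal sketch only, here is what a leaderboard Space's own README metadata could look like, assuming the categories above are declared as entries of the `tags` field (the title, emoji, and tag choices below are illustrative assumptions, not a real leaderboard's configuration):

```yaml
---
title: My Leaderboard        # hypothetical Space name
emoji: 🥇
sdk: gradio
app_file: app.py
pinned: false
tags:                        # category tags described on this page
  - submission:automatic
  - test:public
  - judge:auto
  - modality:text
  - eval:code
  - eval:math
  - language:english
---
```

Combine as many tags as apply to your leaderboard (for example, several `eval:` or `language:` tags at once).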