llm-leaderboard / README.md
Ludwig Stumpp
metadata
title: LLM-Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.37.1
app_file: streamlit_app.py
pinned: true
fullWidth: true
python_version: 3.10.10

🏆 LLM-Leaderboard

A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!
We call a model "open" if it can be deployed locally and used for commercial purposes.

Interactive Dashboard

https://llm-leaderboard.streamlit.app/
https://huggingface.co/spaces/ludwigstumpp/llm-leaderboard

Leaderboard

Model Name Publisher Open? Chatbot Arena Elo HellaSwag (few-shot) HellaSwag (zero-shot) HellaSwag (one-shot) HumanEval-Python (pass@1) LAMBADA (zero-shot) LAMBADA (one-shot) MMLU (zero-shot) MMLU (few-shot) TriviaQA (zero-shot) TriviaQA (one-shot) WinoGrande (zero-shot) WinoGrande (one-shot) WinoGrande (few-shot)
alpaca-7b Stanford no 0.739 0.661
alpaca-13b Stanford no 1008
bloom-176b BigScience yes 0.744 0.155 0.299
cerebras-gpt-7b Cerebras yes 0.636 0.636 0.259 0.141
cerebras-gpt-13b Cerebras yes 0.635 0.635 0.258 0.146
chatglm-6b ChatGLM yes 985
chinchilla-70b DeepMind no 0.808 0.774 0.675 0.749
codex-12b / code-cushman-001 OpenAI no 0.317
codegen-16B-mono Salesforce yes 0.293
codegen-16B-multi Salesforce yes 0.183
codegx-13b Tsinghua University no 0.229
dolly-v2-12b Databricks yes 944 0.710 0.622
eleuther-pythia-7b EleutherAI yes 0.667 0.667 0.265 0.198 0.661
eleuther-pythia-12b EleutherAI yes 0.704 0.704 0.253 0.233 0.638
falcon-7b TII yes 0.781 0.350
falcon-40b TII yes 0.853 0.527
fastchat-t5-3b Lmsys.org yes 951
gal-120b Meta AI no 0.526
gpt-3-7b / curie OpenAI no 0.682 0.243
gpt-3-175b / davinci OpenAI no 0.793 0.789 0.439 0.702
gpt-3.5-175b / text-davinci-003 OpenAI no 0.822 0.834 0.481 0.762 0.569 0.758 0.816
gpt-3.5-175b / code-davinci-002 OpenAI no 0.463
gpt-4 OpenAI no 0.953 0.670 0.864 0.875
gpt4all-13b-snoozy Nomic AI yes 0.750 0.713
gpt-neox-20b EleutherAI yes 0.718 0.719 0.719 0.269 0.276 0.347
gpt-j-6b EleutherAI yes 0.663 0.683 0.683 0.261 0.249 0.234
koala-13b Berkeley BAIR no 1082 0.726 0.688
llama-7b Meta AI no 0.738 0.105 0.738 0.302 0.443 0.701
llama-13b Meta AI no 932 0.792 0.158 0.730
llama-33b Meta AI no 0.828 0.217 0.760
llama-65b Meta AI no 0.842 0.237 0.634 0.770
llama-2-70b Meta AI yes 0.873 0.698
mpt-7b MosaicML yes 0.761 0.702 0.296 0.343
oasst-pythia-12b Open Assistant yes 1065 0.681 0.650
opt-7b Meta AI no 0.677 0.677 0.251 0.227
opt-13b Meta AI no 0.692 0.692 0.257 0.282
opt-66b Meta AI no 0.745 0.276
opt-175b Meta AI no 0.791 0.318
palm-62b Google Research no 0.770
palm-540b Google Research no 0.838 0.834 0.836 0.262 0.779 0.818 0.693 0.814 0.811 0.837 0.851
palm-coder-540b Google Research no 0.359
palm-2-s Google Research no 0.820 0.807 0.752 0.779
palm-2-s* Google Research no 0.376
palm-2-m Google Research no 0.840 0.837 0.817 0.792
palm-2-l Google Research no 0.868 0.869 0.861 0.830
palm-2-l-instruct Google Research no 0.909
replit-code-v1-3b Replit yes 0.219
stablelm-base-alpha-7b Stability AI yes 0.412 0.533 0.251 0.049 0.501
stablelm-tuned-alpha-7b Stability AI no 858 0.536 0.548
starcoder-base-16b BigCode yes 0.304
starcoder-16b BigCode yes 0.336
vicuna-13b Lmsys.org no 1169
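
The HumanEval-Python column reports pass@1. As a hedged illustration, the sketch below shows the unbiased pass@k estimator described in the HumanEval paper (Chen et al.). The function names and the sample counts are illustrative assumptions, and the individual sources behind the table may have computed their scores differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of samples a user is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over all problems.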

Benchmarks

| Benchmark Name | Author | Link | Description |
| --- | --- | --- | --- |
| Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
| HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
| HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
| LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
| MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
| TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
| WinoGrande | Sakaguchi et al. | https://arxiv.org/abs/1907.10641v2 | "A large-scale dataset of 44k [expert-crafted pronoun resolution] problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset." (Source: https://arxiv.org/abs/1907.10641v2) |
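
The Chatbot Arena description above mentions the Elo rating system. The following is a minimal sketch of the standard Elo expected-score and update formulas, to show how a rating gap translates into a win probability. The function names and the K-factor of 32 are assumptions for illustration; they are not taken from the Chatbot Arena implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; real systems tune this value
) -> tuple[float, float]:
    """Return the new ratings after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Example using two ratings from the leaderboard table:
# vicuna-13b (1169) against koala-13b (1082).
p_vicuna = expected_score(1169, 1082)
```

With these two ratings, the model with 1169 is expected to win about 62% of battles against the model with 1082.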

How to Contribute

Contributions are always welcome! You can contribute in the following ways:

  • table work (don't forget the links):
    • filling in missing entries
    • adding a new model as a new row in the leaderboard. Please keep the rows in alphabetical order.
    • adding a new benchmark as a new column in the leaderboard and adding it to the benchmarks table. Please keep the columns in alphabetical order.
  • code work:
    • improving the existing code
    • requesting and implementing new features
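
As a rough sketch of the "keep alphabetical order" rule for new rows, the snippet below finds the insertion point for a new model by name. It assumes the existing rows are already sorted and represents each row as a plain dict; `insert_model_row` and this row layout are hypothetical and not part of the repository's code.

```python
import bisect


def insert_model_row(rows: list[dict], new_row: dict) -> list[dict]:
    """Return a copy of rows with new_row placed alphabetically by model name.

    Assumes rows is already sorted by "Model Name" (case-insensitive).
    """
    names = [row["Model Name"].lower() for row in rows]
    index = bisect.bisect_left(names, new_row["Model Name"].lower())
    return rows[:index] + [new_row] + rows[index:]


# Illustrative subset of the leaderboard (name, publisher, and "Open?" only).
rows = [
    {"Model Name": "falcon-7b", "Publisher": "TII", "Open?": "yes"},
    {"Model Name": "llama-7b", "Publisher": "Meta AI", "Open?": "no"},
    {"Model Name": "mpt-7b", "Publisher": "MosaicML", "Open?": "yes"},
]
new_row = {"Model Name": "gpt-j-6b", "Publisher": "EleutherAI", "Open?": "yes"}
updated = insert_model_row(rows, new_row)
```

Here the new row lands between falcon-7b and llama-7b, and the original list is left unchanged.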

Future Ideas

  • (TBD) add model year
  • (TBD) add model details:
    • #params
    • #tokens seen during training
    • context window length
    • architecture type (transformer-decoder, transformer-encoder, transformer-encoder-decoder, ...)

More Open LLMs

If you are interested in an overview of open LLMs for commercial use and finetuning, check out the open-llms repository.

Sources

The results of this leaderboard are collected from the individual papers and published results of the model authors. For each reported value, the source is added as a link.

Special thanks to the following pages:

Disclaimer

The information above may be wrong. If you want to use a published model commercially, please consult a lawyer.