Ludwig Stumpp committed
Commit a60d3ed
1 Parent(s): a3504d1

Rearrange and link to open-llms repo

Files changed (1)
  1. README.md +16 -12
README.md CHANGED
@@ -6,18 +6,6 @@ A joint community effort to create one central leaderboard for LLMs. Contributio
 
 https://llm-leaderboard.streamlit.app/
 
- ## How to Contribute
-
- We are always happy for contributions! You can contribute by the following:
-
- - table work (don't forget the links):
-     - filling missing entries
-     - adding a new model as a new row to the leaderboard. Please keep alphabetic order.
-     - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table. Please keep alphabetic order.
- - code work:
-     - improving the existing code
-     - requesting and implementing new features
-
 ## Leaderboard
 
 | Model Name | Commercial Use? | Chatbot Arena Elo | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | TriviaQA (zero-shot) |
@@ -68,6 +56,22 @@ We are always happy for contributions! You can contribute by the following:
 | MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
 | TriviaQA | Joshi et al. | https://arxiv.org/abs/1705.03551v2 | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
 
+ ## How to Contribute
+
+ We are always happy for contributions! You can contribute by the following:
+
+ - table work (don't forget the links):
+     - filling missing entries
+     - adding a new model as a new row to the leaderboard. Please keep alphabetic order.
+     - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table. Please keep alphabetic order.
+ - code work:
+     - improving the existing code
+     - requesting and implementing new features
+
+ ## More Open LLMs
+
+ If you are interested in an overview of open LLMs for commercial use and finetuning, check out the [open-llms](https://github.com/eugeneyan/open-llms) repository.
+
  ## Sources
 
  The results of this leaderboard are collected from the individual papers and published results of the model authors. For each reported value, the source is added as a link.
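For orientation, the "adding a new model as a new row" item in the How to Contribute section amounts to appending one line to the leaderboard table in README.md, keeping the rows in alphabetical order. The sketch below uses a hypothetical model name and placeholder cells rather than real benchmark results; it only illustrates the row format implied by the table header shown in the diff above, and is not part of this commit.

```markdown
<!-- Hypothetical example: "Example-Model-7B", its link, and every cell value are placeholders, not measured results. -->
| Model Name                                    | Commercial Use? | Chatbot Arena Elo | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | TriviaQA (zero-shot) |
| --------------------------------------------- | --------------- | ----------------- | ------------------------- | ------------------- | ---------------- | -------------------- |
| [Example-Model-7B](https://example.com/model) | no              |                   |                           |                     |                  |                      |
```

Empty cells are the "missing entries" that contributors can fill in later; as noted under Sources, each reported value should link to the paper or published result it was taken from.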