NimaBoscarino committed on
Commit • 5601a63
1 Parent(s): ce824ba
Adjust description for TruthfulQA

content.py CHANGED (+6, -2)
@@ -1,4 +1,7 @@
 CHANGELOG_TEXT = f"""
+## [2023-06-13]
+- Adjust description for TruthfulQA
+
 ## [2023-06-12]
 - Add Human & GPT-4 Evaluations
 
@@ -34,7 +37,8 @@ CHANGELOG_TEXT = f"""
 - Display different queues for jobs that are RUNNING, PENDING, FINISHED status
 
 ## [2023-05-15]
-- Fix a typo: from "TruthQA" to "
+- Fix a typo: from "TruthQA" to "
+QA"
 
 ## [2023-05-10]
 - Fix a bug that prevented auto-refresh
@@ -58,7 +62,7 @@ Evaluation is performed against 4 popular benchmarks:
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
-- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a
+- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
 
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 """
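
The benchmark list in the last hunk effectively documents the evaluation recipe: four benchmarks, each run with a fixed number of few-shot examples. Below is a minimal sketch of how such a run could look with EleutherAI's lm-evaluation-harness; the task names, the v0.4 lm_eval API, and the gpt2 checkpoint are illustrative assumptions and are not specified anywhere in this commit.

import lm_eval

# (task, num_fewshot) pairs mirroring the description above:
# ARC 25-shot, HellaSwag 10-shot, MMLU 5-shot, TruthfulQA 0-shot.
BENCHMARKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
]

for task, shots in BENCHMARKS:
    out = lm_eval.simple_evaluate(
        model="hf",                    # Hugging Face transformers backend
        model_args="pretrained=gpt2",  # hypothetical checkpoint; substitute any model
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, out["results"])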
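
Since content.py only defines display strings, a plausible (hypothetical) consumer is the Space's app rendering them with Gradio. The import path and the CHANGELOG_TEXT name come from the diff above; the Blocks layout itself is an assumption, not part of this commit.

import gradio as gr

# Hypothetical usage: render the changelog string defined in content.py.
from content import CHANGELOG_TEXT

with gr.Blocks() as demo:
    gr.Markdown("# Changelog")
    gr.Markdown(CHANGELOG_TEXT)

demo.launch()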