pavlichenko committed · commit 8ebb2ea · 1 parent: 34bef94
Update app.py

app.py CHANGED
@@ -24,6 +24,21 @@ We find it’s tricky to use open-source datasets of prompts due to the following

To mitigate these issues, we collected our own dataset of prompts, consisting of prompts Toloka employees sent to ChatGPT and paraphrased real-world conversations with ChatGPT that we found on the internet. This way we ensure that the prompts represent real-world use cases and are not leaked into LLM training sets. For the same reasons, we decided not to release the full evaluation set.

+Distribution of prompts by category:
+
+* Brainstorming: 15.48%
+* Chat: 1.59%
+* Classification: 0.2%
+* Closed QA: 3.77%
+* Extraction: 0.6%
+* Generation: 38.29%
+* Open QA: 32.94%
+* Rewrite: 5.16%
+* Summarization: 1.98%
+
+We report win rates only for categories where the number of prompts is large enough to make the comparison fair.
+
+
#### How Did We Set Up Human Evaluation

Annotators on the Toloka crowdsourcing platform are given a prompt and responses to that prompt from two different models: the reference model and the model under evaluation. Annotators then choose the better response according to harmlessness, truthfulness, and helpfulness. In simple terms, we follow the AlpacaEval scheme, but instead of GPT-4, we use real humans as annotators.
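The added text above reports per-category win rates and skips categories with too few prompts. As a rough illustration only (this code is not from the commit; the function name, the annotation format, and the minimum-count threshold are all hypothetical), per-category win rates for the evaluated model could be computed from pairwise annotations along these lines:

```python
# Illustrative sketch, not part of app.py: per-category win rates from
# pairwise human annotations, dropping categories with too few prompts.
from collections import defaultdict

MIN_PROMPTS_PER_CATEGORY = 20  # assumed threshold; the actual cutoff is not stated


def category_win_rates(annotations, min_prompts=MIN_PROMPTS_PER_CATEGORY):
    """annotations: iterable of dicts such as
    {"category": "Open QA", "winner": "candidate"}  # or "reference"
    Returns {category: share of comparisons won by the candidate model}
    for categories with at least `min_prompts` comparisons."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for item in annotations:
        totals[item["category"]] += 1
        if item["winner"] == "candidate":
            wins[item["category"]] += 1
    return {
        cat: wins[cat] / totals[cat]
        for cat in totals
        if totals[cat] >= min_prompts
    }


# Toy example: the candidate wins 2 of 3 "Open QA" comparisons -> ~0.67 win rate
# (min_prompts=1 here so the toy category is not filtered out).
toy = [
    {"category": "Open QA", "winner": "candidate"},
    {"category": "Open QA", "winner": "reference"},
    {"category": "Open QA", "winner": "candidate"},
]
print(category_win_rates(toy, min_prompts=1))
```

Under this reading, the win rate is simply the share of pairwise comparisons in which annotators preferred the evaluated model’s response over the reference model’s.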