Update src/md.py
natolambert committed • Commit 702ff77 • Parent(s): 7b96731

src/md.py CHANGED
@@ -6,22 +6,27 @@ A win is when the score for the chosen response is higher than the score for the
 
 | Subset                 | Num. Samples (Pre-filtering, post-filtering) | Description                                                        |
 | :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
-| alpacaeval-easy        | 805                                          | Great model vs poor model                                          |
-| alpacaeval-length      | 805                                          | Good model vs low model, equal length                              |
-| alpacaeval-hard        | 805                                          | Great model vs baseline model                                      |
+| alpacaeval-easy        | 805, 100                                     | Great model vs poor model                                          |
+| alpacaeval-length      | 805, 95                                      | Good model vs low model, equal length                              |
+| alpacaeval-hard        | 805, 95                                      | Great model vs baseline model                                      |
 | mt-bench-easy          | 28, 28                                       | MT Bench 10s vs 1s                                                 |
 | mt-bench-medium        | 45, 40                                       | MT Bench 9s vs 2-5s                                                |
 | mt-bench-hard          | 45, 37                                       | MT Bench 7-8 vs 5-6                                                |
-| refusals-dangerous     | 505                                          | Dangerous response vs no response                                  |
-| refusals-offensive     | 704                                          | Offensive response vs no response                                  |
+| refusals-dangerous     | 505, 100                                     | Dangerous response vs no response                                  |
+| refusals-offensive     | 704, 100                                     | Offensive response vs no response                                  |
 | llmbar-natural         | 100                                          | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
 | llmbar-adver-neighbor  | 134                                          | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
 | llmbar-adver-GPTInst   | 92                                           | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
 | llmbar-adver-GPTOut    | 47                                           | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
 | llmbar-adver-manual    | 46                                           | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
-| XSTest
-|
-| (
+| XSTest                 | 450, 404                                     | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
+| do not answer          | 939, 136                                     | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
+| hep-cpp                | 164                                          | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
+| hep-go                 | 164                                          | Go code                                                            |
+| hep-java               | 164                                          | Java code                                                          |
+| hep-js                 | 164                                          | Javascript code                                                    |
+| hep-python             | 164                                          | Python code                                                        |
+| hep-rust               | 164                                          | Rust code                                                          |
 
 
 For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
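The hunk context above states the benchmark's decision rule: a win is when the score for the chosen response is higher than the score for the rejected response. As a minimal sketch of that rule, assuming hypothetical field names (`subset`, `score_chosen`, `score_rejected`), since the evaluation code is not part of this commit, per-subset win rates could be computed like this:

```python
# Sketch of the "win" rule: a win is score(chosen) > score(rejected).
# The field names below are illustrative assumptions, not the repo's schema.
from collections import defaultdict

def per_subset_win_rate(results):
    """results: iterable of dicts with 'subset', 'score_chosen', 'score_rejected'."""
    wins, totals = defaultdict(int), defaultdict(int)
    for row in results:
        totals[row["subset"]] += 1
        if row["score_chosen"] > row["score_rejected"]:
            wins[row["subset"]] += 1
    return {s: wins[s] / totals[s] for s in totals}

# Example with two fake rows:
rows = [
    {"subset": "alpacaeval-easy", "score_chosen": 1.7, "score_rejected": 0.2},
    {"subset": "alpacaeval-easy", "score_chosen": -0.3, "score_rejected": 0.9},
]
print(per_subset_win_rate(rows))  # {'alpacaeval-easy': 0.5}
```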
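And a hedged sketch of pulling the benchmark itself with the `datasets` library; the repo id comes from the dataset link above, while the split name is an assumption:

```python
# Sketch only: repo id from the dataset link above; the "train" split
# name is an assumption, not confirmed by this commit.
from datasets import load_dataset

ds = load_dataset("ai2-rlhf-collab/rm-benchmark-dev", split="train")
print(len(ds), ds.column_names)  # inspect sample counts and columns
```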