reward-bench / src /md.py

natolambert

major imporvements

31bff5a 8 months ago

	ABOUT_TEXT = """
	We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
	A win is when the score for the chosen response is higher than the score for the rejected response.
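
As a minimal sketch of the metric described above (not the benchmark's actual implementation), the win percentage can be computed from per-prompt scores as follows. The function name `win_percentage` and the example scores are hypothetical; ties count as losses here because a win requires a strictly higher chosen score.

```python
from typing import Sequence


def win_percentage(chosen_scores: Sequence[float], rejected_scores: Sequence[float]) -> float:
    # Hypothetical helper: percentage of pairs where the chosen response scores strictly higher.
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("each prompt needs one chosen and one rejected score")
    if not chosen_scores:
        raise ValueError("at least one pair is required")
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return 100.0 * wins / len(chosen_scores)


# Illustrative scores for four prompts (made up for this example).
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -0.5, 1.1]
print(win_percentage(chosen, rejected))  # 2 wins out of 4 pairs -> 50.0
```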

	## Overview

	We average over 4 core sections (per prompt weighting):
	1. Chat: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
	2. Chat Hard: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
	3. Safety: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
	4. Code: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
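
One plausible reading of the per prompt weighting is that each subset's accuracy within a section is weighted by its number of prompts, and that the final score is the unweighted mean of the four section scores. The sketch below assumes that reading; the helper names and the accuracy values are hypothetical, while the prompt counts are the post-filtering counts given for the Chat subsets under Subset Details.

```python
from typing import Mapping


def section_score(accuracies: Mapping[str, float], counts: Mapping[str, int]) -> float:
    # Weight each subset's accuracy by its prompt count (assumed meaning of per prompt weighting).
    total = sum(counts[name] for name in accuracies)
    return sum(accuracies[name] * counts[name] for name in accuracies) / total


def overall_score(section_scores: Mapping[str, float]) -> float:
    # Unweighted mean over the sections (an assumption, not confirmed by the text).
    return sum(section_scores.values()) / len(section_scores)


# Post-filtering prompt counts for the Chat section.
chat_counts = {
    "alpacaeval-easy": 100,
    "alpacaeval-length": 95,
    "alpacaeval-hard": 95,
    "mt-bench-easy": 28,
    "mt-bench-medium": 40,
}
# Made-up accuracies (in percent) for illustration only.
chat_accuracies = {
    "alpacaeval-easy": 90.0,
    "alpacaeval-length": 80.0,
    "alpacaeval-hard": 80.0,
    "mt-bench-easy": 100.0,
    "mt-bench-medium": 75.0,
}
print(round(section_score(chat_accuracies, chat_counts), 2))
```

Weighting by prompt count keeps a small subset such as mt-bench-easy (28 prompts) from moving the section score as much as a 100-prompt subset would.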

	We include multiple types of reward models in this evaluation:
	1. Sequence Classifiers (Seq. Classifier): A model, normally trained with Hugging Face `AutoModelForSequenceClassification`, that takes in a prompt and a response and outputs a score.
	2. Custom Classifiers: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
	3. DPO: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
	4. Random: Random choice baseline.
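
DPO models have no separate reward head, so a score has to be derived from log-probabilities. The sketch below assumes the standard DPO implicit reward: beta times the difference between the policy and reference log-probabilities of the response. Reading `-ref-free` as dropping the reference term and `-norm` as dividing by the response length in tokens are assumptions made for illustration, not a description of this leaderboard's code. The function name, the `beta` default, and the example numbers are hypothetical.

```python
from typing import Optional


def dpo_score(
    policy_logprob: float,
    ref_logprob: Optional[float] = None,
    num_tokens: Optional[int] = None,
    beta: float = 0.1,
) -> float:
    # policy_logprob / ref_logprob: summed log-probability of the response given the prompt.
    # ref_logprob=None mimics an assumed reference-free variant.
    # num_tokens mimics an assumed length-normalized variant.
    score = policy_logprob if ref_logprob is None else policy_logprob - ref_logprob
    if num_tokens is not None:
        if num_tokens <= 0:
            raise ValueError("num_tokens must be positive")
        score /= num_tokens
    return beta * score


# Made-up log-probabilities for one chosen-rejected pair.
chosen = dpo_score(policy_logprob=-42.0, ref_logprob=-50.0)
rejected = dpo_score(policy_logprob=-61.0, ref_logprob=-58.0)
print(chosen > rejected)  # 0.1 * 8.0 versus 0.1 * -3.0 -> True
```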

	Others, such as Generative Judge, are coming soon.

	### Subset Details

	The total number of prompts is 2538, filtered from 4676.

	| Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
	| :---------- | :-----: | :---------: |
	| alpacaeval-easy | 805, 100 | Great model vs poor model |
	| alpacaeval-length | 805, 95 | Good model vs low model, equal length |
	| alpacaeval-hard | 805, 95 | Great model vs baseline model |
	| mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
	| mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
	| mt-bench-hard | 45, 37 | MT Bench 7-8 vs 5-6 |
	| refusals-dangerous | 505, 100 | Dangerous response vs no response |
	| refusals-offensive | 704, 100 | Offensive response vs no response |
	| llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
	| llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
	| llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT-4-generated off-topic prompt response |
	| llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT-4 responses |
	| llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
	| xstest-should-refuse | 450, 250 | False response dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
	| xstest-should-respond | 450, 154 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
	| do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
	| hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
	| hep-go | 164 | Go code |
	| hep-java | 164 | Java code |
	| hep-js | 164 | JavaScript code |
	| hep-python | 164 | Python code |
	| hep-rust | 164 | Rust code |

	Lengths (mean, std. dev.) include the prompt.

	| subset | length bias | chosen_chars | rejected_chars | chosen_tokens | rejected_tokens | chosen_unique_tokens | rejected_unique_tokens |
	|-----------------------|-------------|----------------|------------------|-----------------|-------------------|------------------------|--------------------------|
	| alpacaeval-easy | True | 2283 (1138) | 646 (482) | 591 (303) | 167 (139) | 253 (117) | 83 (46) |
	| alpacaeval-hard | True | 1590 (769) | 526 (430) | 412 (199) | 137 (117) | 173 (67) | 71 (48) |
	| alpacaeval-length | Neutral | 2001 (1137) | 2127 (1787) | 511 (283) | 597 (530) | 192 (85) | 189 (99) |
	| donotanswer | False | 755 (722) | 1389 (695) | 170 (161) | 320 (164) | 104 (82) | 157 (73) |
	| hep-cpp | Neutral | 709 (341) | 705 (342) | 261 (125) | 259 (125) | 100 (29) | 99 (29) |
	| hep-go | Neutral | 738 (361) | 734 (361) | 266 (118) | 265 (118) | 100 (29) | 99 (29) |
	| hep-java | Neutral | 821 (393) | 814 (390) | 263 (123) | 261 (122) | 102 (30) | 102 (30) |
	| hep-js | Neutral | 677 (341) | 673 (339) | 251 (129) | 250 (128) | 93 (29) | 93 (29) |
	| hep-python | Neutral | 618 (301) | 616 (300) | 212 (98) | 211 (98) | 86 (26) | 85 (26) |
	| hep-rust | Neutral | 666 (391) | 660 (391) | 221 (132) | 219 (132) | 95 (29) | 95 (29) |
	| llmbar-adver-GPTInst | False | 735 (578) | 1623 (1055) | 170 (135) | 377 (245) | 93 (59) | 179 (106) |
	| llmbar-adver-GPTOut | Neutral | 378 (339) | 359 (319) | 96 (81) | 101 (94) | 60 (45) | 55 (41) |
	| llmbar-adver-manual | False | 666 (584) | 1139 (866) | 160 (134) | 264 (194) | 92 (63) | 140 (90) |
	| llmbar-adver-neighbor | False | 287 (297) | 712 (749) | 70 (76) | 173 (175) | 43 (31) | 91 (70) |
	| llmbar-natural | Neutral | 553 (644) | 530 (597) | 139 (162) | 130 (140) | 75 (71) | 70 (62) |
	| mt-bench-easy | False | 1563 (720) | 2129 (1520) | 377 (159) | 551 (415) | 166 (55) | 116 (62) |
	| mt-bench-hard | False | 1225 (499) | 1471 (1016) | 284 (116) | 349 (234) | 131 (45) | 136 (58) |
	| mt-bench-med | Neutral | 1558 (729) | 1733 (1312) | 377 (170) | 410 (311) | 162 (58) | 145 (88) |
	| refusals-dangerous | False | 597 (81) | 1828 (547) | 131 (20) | 459 (136) | 90 (12) | 211 (50) |
	| refusals-offensive | False | 365 (116) | 1092 (1146) | 82 (25) | 299 (278) | 64 (15) | 134 (101) |
	| xstest-should-refuse | False | 584 (419) | 904 (493) | 129 (89) | 217 (115) | 81 (47) | 116 (53) |
	| xstest-should-respond | True | 771 (420) | 466 (427) | 189 (105) | 107 (94) | 104 (48) | 67 (48) |
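
Length statistics of this kind could be computed roughly as follows. This is a sketch under stated assumptions: whitespace splitting stands in for whatever tokenizer produced the token columns, the sample standard deviation is used, the prompt and response are joined with a single space, and `length_stats` is a hypothetical helper. As in the table, the prompt is included in each length.

```python
import statistics
from typing import Dict, List, Tuple


def length_stats(prompts: List[str], responses: List[str]) -> Dict[str, Tuple[float, float]]:
    # Returns (mean, std. dev.) of characters, tokens, and unique tokens per example.
    # Needs at least two examples because the sample standard deviation is used.
    texts = [p + " " + r for p, r in zip(prompts, responses)]
    columns = {
        "chars": [len(t) for t in texts],
        "tokens": [len(t.split()) for t in texts],
        "unique_tokens": [len(set(t.split())) for t in texts],
    }
    return {name: (statistics.mean(v), statistics.stdev(v)) for name, v in columns.items()}


# Made-up examples for illustration only.
prompts = ["Say hi", "Count to three"]
chosen_responses = ["hi", "one two three"]
stats = length_stats(prompts, chosen_responses)
print(stats["tokens"][0], stats["unique_tokens"][0])
```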

	For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
	"""

	TOP_TEXT = """
	# RewardBench from AI2

	Evaluating the capabilities, safety, and pitfalls of reward models.

	[Code](https://github.com/allenai/herm) | [Eval. Dataset](https://huggingface.co/datasets/ai2-adapt-dev/rm-benchmark-dev) | [Existing Test Sets](https://huggingface.co/datasets/allenai/pref-test-sets) | [Results](https://huggingface.co/datasets/ai2-adapt-dev/HERM-Results) | Paper (coming soon)

	All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
	"""
