sethuiyer/Nandine-7b


Nous Benchmark

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Nandine-7b | 43.54 | 76.41 | 61.73 | 45.27 | 56.74 |

AGIEval

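| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 |
| | | acc_norm | 22.05 | ± | 2.61 |
| agieval_logiqa_en | 0 | acc | 37.94 | ± | 1.90 |
| | | acc_norm | 38.71 | ± | 1.91 |
| agieval_lsat_ar | 0 | acc | 26.09 | ± | 2.90 |
| | | acc_norm | 22.61 | ± | 2.76 |
| agieval_lsat_lr | 0 | acc | 47.45 | ± | 2.21 |
| | | acc_norm | 50.00 | ± | 2.22 |
| agieval_lsat_rc | 0 | acc | 60.97 | ± | 2.98 |
| | | acc_norm | 59.85 | ± | 2.99 |
| agieval_sat_en | 0 | acc | 77.18 | ± | 2.93 |
| | | acc_norm | 77.67 | ± | 2.91 |
| agieval_sat_en_without_passage | 0 | acc | 45.63 | ± | 3.48 |
| | | acc_norm | 45.15 | ± | 3.48 |
| agieval_sat_math | 0 | acc | 35.91 | ± | 3.24 |
| | | acc_norm | 32.27 | ± | 3.16 |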

Average: 43.54%
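The source does not say how this average is derived. As a minimal sketch, assuming it is the unweighted mean of the eight `acc_norm` values in the AGIEval table (an inference from the numbers, not a documented rule), the figure can be reproduced like this:

```python
# acc_norm values copied from the AGIEval table.
# Assumption: the section average is the plain mean of acc_norm across tasks.
agieval_acc_norm = {
    "agieval_aqua_rat": 22.05,
    "agieval_logiqa_en": 38.71,
    "agieval_lsat_ar": 22.61,
    "agieval_lsat_lr": 50.00,
    "agieval_lsat_rc": 59.85,
    "agieval_sat_en": 77.67,
    "agieval_sat_en_without_passage": 45.15,
    "agieval_sat_math": 32.27,
}

agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"AGIEval average: {agieval_average:.2f}%")
```

Averaging the `acc` column instead gives a different number, so the choice of metric matters when comparing against other models.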

GPT4All

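| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 63.74 | ± | 1.40 |
| | | acc_norm | 63.99 | ± | 1.40 |
| arc_easy | 0 | acc | 85.94 | ± | 0.71 |
| | | acc_norm | 83.50 | ± | 0.76 |
| boolq | 1 | acc | 87.80 | ± | 0.57 |
| hellaswag | 0 | acc | 67.50 | ± | 0.47 |
| | | acc_norm | 85.31 | ± | 0.35 |
| openbookqa | 0 | acc | 38.20 | ± | 2.18 |
| | | acc_norm | 49.40 | ± | 2.24 |
| piqa | 0 | acc | 82.97 | ± | 0.88 |
| | | acc_norm | 84.33 | ± | 0.85 |
| winogrande | 0 | acc | 80.51 | ± | 1.11 |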

Average: 76.41%
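Two tasks in this suite (`boolq` and `winogrande`) report only `acc`. A sketch that reproduces the reported average, assuming `acc_norm` is used where it exists and `acc` otherwise (the fallback rule is inferred from the numbers and not stated in the source):

```python
# Metric values copied from the GPT4All table.
gpt4all = {
    "arc_challenge": {"acc": 63.74, "acc_norm": 63.99},
    "arc_easy": {"acc": 85.94, "acc_norm": 83.50},
    "boolq": {"acc": 87.80},
    "hellaswag": {"acc": 67.50, "acc_norm": 85.31},
    "openbookqa": {"acc": 38.20, "acc_norm": 49.40},
    "piqa": {"acc": 82.97, "acc_norm": 84.33},
    "winogrande": {"acc": 80.51},
}


def headline_score(metrics):
    """Assumed rule: prefer acc_norm, fall back to acc when it is absent."""
    return metrics.get("acc_norm", metrics["acc"])


scores = [headline_score(m) for m in gpt4all.values()]
gpt4all_average = sum(scores) / len(scores)
print(f"GPT4All average: {gpt4all_average:.2f}%")
```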

TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 45.78 | ± | 1.74 |
| | | mc2 | 61.73 | ± | 1.54 |

Average: 61.73%
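The section figure here matches the `mc2` row. To read the `Stderr` column, one rough option is a normal-approximation interval. The sketch below assumes the value after ± is one standard error and that a z-score of 1.96 is an acceptable approximation; the helper name `approx_95_interval` is hypothetical and not part of any evaluation tool.

```python
def approx_95_interval(value, stderr, z=1.96):
    """Return (low, high) for a normal-approximation interval.

    Assumes `stderr` is one standard error of `value`.
    """
    half_width = z * stderr
    return value - half_width, value + half_width


# mc2 and its standard error, copied from the TruthfulQA table.
low, high = approx_95_interval(61.73, 1.54)
print(f"mc2 approx. 95% interval: {low:.2f} to {high:.2f}")
```

The same helper applies to any row in the other tables. Tasks with few examples, such as the AGIEval subsets, have wider intervals, so small differences between models on them may not be meaningful.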

Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 57.89 | ± | 3.59 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 65.58 | ± | 2.48 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 38.76 | ± | 3.04 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.06 | ± | 2.12 |
| | | exact_str_match | 5.85 | ± | 1.24 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 30.20 | ± | 2.06 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.71 | ± | 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 52.67 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 43.60 | ± | 2.22 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.50 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 73.15 | ± | 0.99 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 46.65 | ± | 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 25.25 | ± | 1.38 |
| bigbench_snarks | 0 | multiple_choice_grade | 75.14 | ± | 3.22 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 73.12 | ± | 1.41 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 47.20 | ± | 1.58 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.04 | ± | 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 18.69 | ± | 0.93 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 52.67 | ± | 2.89 |

Average: 45.27%
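The Bigbench table mixes two metrics, since `bigbench_geometric_shapes` also reports `exact_str_match`. A sketch that reproduces the reported average, assuming only the 18 `multiple_choice_grade` rows are averaged (inferred from the numbers, not stated in the source):

```python
# (task, metric, value) rows copied from the Bigbench table.
bigbench_rows = [
    ("causal_judgement", "multiple_choice_grade", 57.89),
    ("date_understanding", "multiple_choice_grade", 65.58),
    ("disambiguation_qa", "multiple_choice_grade", 38.76),
    ("geometric_shapes", "multiple_choice_grade", 20.06),
    ("geometric_shapes", "exact_str_match", 5.85),
    ("logical_deduction_five_objects", "multiple_choice_grade", 30.20),
    ("logical_deduction_seven_objects", "multiple_choice_grade", 20.71),
    ("logical_deduction_three_objects", "multiple_choice_grade", 52.67),
    ("movie_recommendation", "multiple_choice_grade", 43.60),
    ("navigate", "multiple_choice_grade", 50.50),
    ("reasoning_about_colored_objects", "multiple_choice_grade", 73.15),
    ("ruin_names", "multiple_choice_grade", 46.65),
    ("salient_translation_error_detection", "multiple_choice_grade", 25.25),
    ("snarks", "multiple_choice_grade", 75.14),
    ("sports_understanding", "multiple_choice_grade", 73.12),
    ("temporal_sequences", "multiple_choice_grade", 47.20),
    ("tracking_shuffled_objects_five_objects", "multiple_choice_grade", 23.04),
    ("tracking_shuffled_objects_seven_objects", "multiple_choice_grade", 18.69),
    ("tracking_shuffled_objects_three_objects", "multiple_choice_grade", 52.67),
]

# Assumption: exact_str_match is excluded from the section average.
mc_grades = [v for _, metric, v in bigbench_rows if metric == "multiple_choice_grade"]
bigbench_average = sum(mc_grades) / len(mc_grades)
print(f"Bigbench average over {len(mc_grades)} tasks: {bigbench_average:.2f}%")
```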

Average score: 56.74%
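The overall score is consistent with an unweighted mean of the four suite averages, so each suite counts equally regardless of how many tasks it contains. A short check, under that assumption:

```python
# Suite averages copied from the Nous Benchmark summary row.
suite_averages = {
    "AGIEval": 43.54,
    "GPT4All": 76.41,
    "TruthfulQA": 61.73,
    "Bigbench": 45.27,
}

# Assumption: equal weight per suite, not per task.
overall = sum(suite_averages.values()) / len(suite_averages)
print(f"Average score: {overall:.2f}%")
```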

Elapsed time: 01:47:54