SOLAR-10.7b-Instruct-dpo

This model is a finetune of upstage/SOLAR-10.7B-Instruct-v1.0 using Intel/orca_dpo_pairs

Chat Template

This model follows the chatML chat template.

Evaluations

EQ Bench comparison with base model

These scores are the average of 3 iterations.

----Benchmark Complete---- + 2024-01-25 04:41:01 + Time taken: 236.1 mins + Prompt Format: ChatML + Model: macadeliccc/SOLAR-10.7b-Instruct-dpo + Score (v2): 72.79 + Parseable: 165.67

Batch completed Time taken: 236.1 mins

as compared to the original model:

----Benchmark Complete---- + 2024-01-25 08:45:02 + Time taken: 244.0 mins + Prompt Format: ChatML + Model: upstage/SOLAR-10.7B-Instruct-v1.0 + Score (v2): 71.03 + Parseable: 165.67

Batch completed Time taken: 480.1 mins

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
SOLAR-10.7b-Instruct-dpo	47.57	74.3	72.73	45.76	60.09

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	27.56	±	2.81
		acc_norm	26.77	±	2.78
agieval_logiqa_en	0	acc	41.63	±	1.93
		acc_norm	41.32	±	1.93
agieval_lsat_ar	0	acc	25.22	±	2.87
		acc_norm	24.35	±	2.84
agieval_lsat_lr	0	acc	54.12	±	2.21
		acc_norm	54.31	±	2.21
agieval_lsat_rc	0	acc	68.77	±	2.83
		acc_norm	69.14	±	2.82
agieval_sat_en	0	acc	79.13	±	2.84
		acc_norm	79.13	±	2.84
agieval_sat_en_without_passage	0	acc	44.66	±	3.47
		acc_norm	44.66	±	3.47
agieval_sat_math	0	acc	40.45	±	3.32
		acc_norm	40.91	±	3.32

Average: 47.57%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	60.49	±	1.43
		acc_norm	63.74	±	1.40
arc_easy	0	acc	82.07	±	0.79
		acc_norm	79.92	±	0.82
boolq	1	acc	88.56	±	0.56
hellaswag	0	acc	68.47	±	0.46
		acc_norm	86.06	±	0.35
openbookqa	0	acc	36.20	±	2.15
		acc_norm	46.60	±	2.23
piqa	0	acc	79.38	±	0.94
		acc_norm	79.71	±	0.94
winogrande	0	acc	75.53	±	1.21

Average: 74.3%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	57.77	±	1.73
		mc2	72.73	±	1.49

Average: 72.73%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	55.26	±	3.62
bigbench_date_understanding	0	multiple_choice_grade	62.87	±	2.52
bigbench_disambiguation_qa	0	multiple_choice_grade	46.51	±	3.11
bigbench_geometric_shapes	0	multiple_choice_grade	25.63	±	2.31
		exact_str_match	0.00	±	0.00
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	28.00	±	2.01
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	20.57	±	1.53
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	46.67	±	2.89
bigbench_movie_recommendation	0	multiple_choice_grade	41.80	±	2.21
bigbench_navigate	0	multiple_choice_grade	64.00	±	1.52
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	60.00	±	1.10
bigbench_ruin_names	0	multiple_choice_grade	39.96	±	2.32
bigbench_salient_translation_error_detection	0	multiple_choice_grade	47.90	±	1.58
bigbench_snarks	0	multiple_choice_grade	64.09	±	3.58
bigbench_sports_understanding	0	multiple_choice_grade	71.10	±	1.44
bigbench_temporal_sequences	0	multiple_choice_grade	59.90	±	1.55
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	24.96	±	1.22
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.89	±	0.92
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	46.67	±	2.89

Average: 45.76%

Average score: 60.09%

Elapsed time: 02:10:16

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	73.54
AI2 Reasoning Challenge (25-Shot)	71.76
HellaSwag (10-Shot)	88.08
MMLU (5-Shot)	66.06
TruthfulQA (0-shot)	71.98
Winogrande (5-shot)	82.32
GSM8k (5-shot)	61.03

Downloads last month: 94

Safetensors

Model size

10.7B params

Tensor type

FP16

Inference Examples

Text Generation

This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for macadeliccc/SOLAR-10.7b-Instruct-dpo

Quantizations

3 models

Spaces using macadeliccc/SOLAR-10.7b-Instruct-dpo 6

Collection including macadeliccc/SOLAR-10.7b-Instruct-dpo

DPO fine tunes

Collection

3 items • Updated Jul 11

Evaluation results

normalized accuracy on AI2 Reasoning Challenge (25-Shot)
test set Open LLM Leaderboard

71.760
normalized accuracy on HellaSwag (10-Shot)
validation set Open LLM Leaderboard

88.080
accuracy on MMLU (5-Shot)
test set Open LLM Leaderboard

66.060
mc2 on TruthfulQA (0-shot)
validation set Open LLM Leaderboard

71.980
accuracy on Winogrande (5-shot)
validation set Open LLM Leaderboard

82.320
accuracy on GSM8k (5-shot)
test set Open LLM Leaderboard

61.030

View on Papers With Code