# Technoculture/MT7Bi-alpha adapter merged with its base model (Meditron 7B)
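MT7Bi-alpha is distributed as an adapter on top of Meditron 7B. Below is a minimal sketch of how such a merged checkpoint could be produced with Hugging Face `transformers` and `peft`; it assumes the adapter is a standard PEFT/LoRA adapter and that the repository IDs and dtype shown are correct, none of which is confirmed by this card.

```python
# Minimal sketch (not necessarily the exact procedure used for this checkpoint):
# merge a PEFT/LoRA adapter into its base model. Repository IDs are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "epfl-llm/meditron-7b"          # assumed Hub ID of the Meditron 7B base model
ADAPTER_ID = "Technoculture/MT7Bi-alpha"  # assumed Hub ID of the MT7Bi-alpha adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16)

# Attach the adapter, then fold its low-rank updates into the base weights
# so the result behaves like a single standalone checkpoint.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
merged = model.merge_and_unload()

merged.save_pretrained("MT7Bi-alpha-merged")
tokenizer.save_pretrained("MT7Bi-alpha-merged")
```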
## Evaluations

### Model Evaluation Benchmark
| Category   | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
|------------|-------|--------------|-------------|------------|-------------|------------|--------------|
| Health     |       | 81.8         | 69.1        | 83.6       | 27.3        | 16.4       | 3.6          |
| Nutrition  |       | 77.9         | 68.8        | 62.5       | 31.1        | 12.5       | 6.3          |
| Psychology |       | 47.4         | 36.8        | 52.6       | 21.1        | 10.5       | 0.0          |
| Science    |       | 77.8         | 44.4        | 33.3       | 33.3        | 11.1       | 0.0          |
| Avg        |       | 71.2         | 54.8        | 58.0       | 28.3        | 12.6       | 2.5          |
| Dataset        | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
|----------------|-------|--------------|-------------|------------|---------------------|
| MMLU-Medical   | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
| PubMedQA       | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
| MedMCQA        | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
| MedQA          |       | 64.4         | 61.5        | 59.1       | 50.8                |
| MedQA-4-Option | 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
| Avg            |       | 72.0         | 69.2        | 63.6       | 57.4                |
| Dataset        | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
|----------------|-------------|------------|--------------|-----------------|----------------------|-------|
| MMLU-Medical   | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
| PubMedQA       | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
| MedMCQA        | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
| MedQA          | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 |       |
| MedQA-4-Option | 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | 44.3  |
| Avg            | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 |       |
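In the tables above, Avg appears to be the unweighted mean of the listed scores in each column; for example, for meditron-7b: (54.2 + 74.4 + 59.2 + 47.9 + 52.0) / 5 = 57.5.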
### Open LLM Leaderboard
| Model Name | ARC  | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|------------|------|-----------|------|------------|------------|-------|
| Orca-2-7b  | 78.4 | 76.1      | 53.7 | 52.4       | 74.2       | 47.2  |
| LLAMA-2-7b | 43.2 | 77.1      | 44.4 | 38.7       | 69.5       | 16    |
| MT7Bi-sft  | 54.1 | 75.11     | -    | 43.08      | 72.14      | 15.54 |
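The per-task breakdowns below follow the result format of EleutherAI's lm-evaluation-harness. As a rough sketch only, a comparable run via the harness's Python API (v0.4.x) might look like the following; the model path, task list, and batch size are assumptions rather than the exact configuration used here.

```python
# Sketch of an lm-evaluation-harness run producing tables like those below.
# Model path, tasks, and batch size are assumptions, not this card's exact setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-alpha-merged,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size=8,
)

# results["results"] maps each task to its metrics (e.g. acc,none, acc_norm,none)
# together with the matching *_stderr entries, mirroring the tables reported here.
for task, metrics in results["results"].items():
    print(task, metrics)
```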
ARC: 54.1%
| Task          | Version | Metric        | Value         | Stderr |
|---------------|---------|---------------|---------------|--------|
| arc_challenge | 1       | acc,none      | 0.51          | 0.01   |
|               |         | acc_norm,none | 0.54          | 0.01   |
|               |         | alias         | arc_challenge |        |
HellaSwag: 75.11%
| Task      | Version | Metric        | Value     | Stderr |
|-----------|---------|---------------|-----------|--------|
| hellaswag | 1       | acc,none      | 0.57      | 0      |
|           |         | acc_norm,none | 0.75      | 0      |
|           |         | alias         | hellaswag |        |
TruthfulQA: 43.08%
| Task           | Version | Metric           | Value            | Stderr |
|----------------|---------|------------------|------------------|--------|
| truthfulqa     | N/A     | bleu_max,none    | 18.31            | 0.46   |
|                |         | bleu_acc,none    | 0.39             | 0      |
|                |         | bleu_diff,none   | -1.63            | 0.39   |
|                |         | rouge1_max,none  | 41.99            | 0.71   |
|                |         | rouge1_acc,none  | 0.39             | 0      |
|                |         | rouge1_diff,none | -2.88            | 0.66   |
|                |         | rouge2_max,none  | 27.42            | 0.80   |
|                |         | rouge2_acc,none  | 0.32             | 0      |
|                |         | rouge2_diff,none | -3.11            | 0.78   |
|                |         | rougeL_max,none  | 38.81            | 0.71   |
|                |         | rougeL_acc,none  | 0.38             | 0      |
|                |         | rougeL_diff,none | -3.01            | 0.66   |
|                |         | acc,none         | 0.33             | 0.05   |
|                |         | alias            | truthfulqa       |        |
| truthfulqa_gen | 3       | bleu_max,none    | 18.31            | 0.68   |
|                |         | bleu_acc,none    | 0.39             | 0.02   |
|                |         | bleu_diff,none   | -1.63            | 0.62   |
|                |         | rouge1_max,none  | 41.99            | 0.84   |
|                |         | rouge1_acc,none  | 0.39             | 0.02   |
|                |         | rouge1_diff,none | -2.88            | 0.81   |
|                |         | rouge2_max,none  | 27.42            | 0.89   |
|                |         | rouge2_acc,none  | 0.32             | 0.02   |
|                |         | rouge2_diff,none | -3.11            | 0.88   |
|                |         | rougeL_max,none  | 38.81            | 0.84   |
|                |         | rougeL_acc,none  | 0.38             | 0.02   |
|                |         | rougeL_diff,none | -3.01            | 0.82   |
|                |         | alias            | - truthfulqa_gen |        |
| truthfulqa_mc1 | 2       | acc,none         | 0.28             | 0.02   |
|                |         | alias            | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | 2       | acc,none         | 0.43             | 0.01   |
|                |         | alias            | - truthfulqa_mc2 |        |
Winogrande: 72.14%
| Task       | Version | Metric   | Value      | Stderr |
|------------|---------|----------|------------|--------|
| winogrande | 1       | acc,none | 0.72       | 0.01   |
|            |         | alias    | winogrande |        |
GSM8K: 15.54%
| Task  | Version | Metric                 | Value | Stderr |
|-------|---------|------------------------|-------|--------|
| gsm8k | 2       | exact_match,get-answer | 0.16  | 0.01   |
|       |         | alias                  | gsm8k |        |
Elapsed time: 04:06:36