---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
language:
  - en
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - li
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - ns
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - rm
  - ro
  - ru
  - sa
  - si
  - sc
  - sd
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tn
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zu
datasets:
  - xu-song/cc100-samples
  - jordiclive/wikipedia-summary-dataset
  - JeanKaddour/minipile
  - badrex/llm-emoji-dataset
  - fblgit/simple-math
  - Gusarich/math-expressions-1m
  - AtlasUnified/atlas-math-sets
  - gair-prox/open-web-math-pro
  - bigcode/the-stack-smol-xs
  - rombodawg/code_bagel
  - AtlasUnified/Atlas-Reasoning
  - thesven/gsm8k-reasoning
  - AlgorithmicResearchGroup/math_reasoning_autoformalization_track
  - KingNish/reasoning-base-20k
  - SkunkworksAI/reasoning-0.01
  - Magpie-Align/Magpie-Reasoning-150K
tags:
  - litgpt
  - litdata
---

# tangled-llama-b-128k-base-v0.1


A pretrained language model based on the Llama architecture, with about 62.9M parameters. It was trained on 10.6B (10,630,121,844) tokens drawn from more than 31.3M (31,383,840) dataset rows.

This model is not intended for immediate use; it is a base for continued pretraining and finetuning on downstream tasks. While it can handle context lengths of up to 128K (131,072) tokens, it was pretrained with sequences of only 2K (2,048) tokens, as noted in the loading sketch below.
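
For a quick smoke test, the checkpoint can be loaded as a plain causal LM with transformers. A minimal sketch, assuming an HF-format export of the checkpoint; the model id below is a placeholder, not a confirmed repository:

```python
# Minimal generation sketch; the model id is a placeholder, not a confirmed repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tangled-llama-b-128k-base-v0.1"  # placeholder: hub repo id or local path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Supports up to 131,072 tokens of context, but pretraining used 2,048-token
# sequences, so quality on very long inputs is untested.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```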

The objective is to retain a lean cognitive and reasoning core while eliminating redundant memorized knowledge from the model.
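
Since the intended use is continued pretraining or finetuning, here is a minimal finetuning sketch using transformers' Trainer. The dataset choice and hyperparameters are illustrative assumptions, not the recipe used to train this model:

```python
# Illustrative causal-LM finetuning sketch; not the actual training recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "tangled-llama-b-128k-base-v0.1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed to pad batches
model = AutoModelForCausalLM.from_pretrained(model_id)

# minipile is one of the listed pretraining sources; any text dataset works.
dataset = load_dataset("JeanKaddour/minipile", split="train[:1%]")

def tokenize(batch):
    # Match the 2,048-token pretraining sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/finetune",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```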

*Training curves (loss, val_loss, val_ppl, epoch, learning_rate): plots omitted.*

## Pretrain Evaluation

[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | 0.2201 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2705 | ± | 0.0044 |
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | 0.2857 | ± | 0.0429 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | 0.5130 | ± | 0.0140 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
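
The same quick suite can also be reproduced through the lm-evaluation-harness Python API, assuming the litgpt checkpoint has been converted to Hugging Face format (a sketch; the path is a placeholder):

```python
# Sketch: run the quick suite via the lm-evaluation-harness Python API.
# Assumes an HF-format conversion of the litgpt checkpoint at the given path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=out/pretrain/final/,dtype=bfloat16",  # placeholder path
    tasks=["hellaswag", "gsm8k", "truthfulqa_mc2",
           "mmlu", "winogrande", "arc_challenge"],
    batch_size=4,
)
print(results["results"])
```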
```sh
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| leaderboard | N/A | | | | | | |
| - leaderboard_bbh | N/A | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | 0.4680 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | 0.5187 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | 0.1880 | ± | 0.0248 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | 0.3440 | ± | 0.0301 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | 0.4720 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | 0.2160 | ± | 0.0261 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | 0.1400 | ± | 0.0220 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | 0.3200 | ± | 0.0296 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | 0.2360 | ± | 0.0269 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | 0.4200 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | 0.1000 | ± | 0.0190 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | 0.1575 | ± | 0.0303 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | 0.0920 | ± | 0.0183 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | 0.4888 | ± | 0.0376 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | 0.2440 | ± | 0.0272 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | 0.0960 | ± | 0.0187 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | 0.3800 | ± | 0.0308 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | 0.4720 | ± | 0.0316 |
| - leaderboard_gpqa | N/A | | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.1970 | ± | 0.0283 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2509 | ± | 0.0186 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.2589 | ± | 0.0207 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.2650 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | 0.2530 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | 0.1590 | ± | 0.0157 |
| | | none | 0 | prompt_level_strict_acc | 0.1553 | ± | 0.0156 |
| - leaderboard_math_hard | N/A | | | | | | |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | 0.1174 | ± | 0.0029 |
| - leaderboard_musr | N/A | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | 0.5160 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | 0.2695 | ± | 0.0278 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | 0.3480 | ± | 0.0302 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | 0.2010 | ± | 0.0073 |
| | | none | 0 | acc_norm | 0.2077 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | 0.2857 | ± | 0.0429 |
| mmlu_pro | 2 | custom-extract | | exact_match | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| mmlu_pro | 2 | custom-extract | | exact_match | 0.0000 | ± | 0.0000 |
```sh
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | 0.2201 | ± | 0.0121 |
| boolq | 2 | none | 0 | acc | 0.4446 | ± | 0.0087 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.0859 | ± | 0.0200 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0606 | ± | 0.0170 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1717 | ± | 0.0269 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | 0.2677 | ± | 0.0315 |
| | | none | 0 | acc_norm | 0.2677 | ± | 0.0315 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | 0.1970 | ± | 0.0283 |
| | | none | 0 | acc_norm | 0.1970 | ± | 0.0283 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.0971 | ± | 0.0127 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0696 | ± | 0.0109 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1502 | ± | 0.0153 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | 0.2399 | ± | 0.0183 |
| | | none | 0 | acc_norm | 0.2399 | ± | 0.0183 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | 0.2473 | ± | 0.0185 |
| | | none | 0 | acc_norm | 0.2473 | ± | 0.0185 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1116 | ± | 0.0149 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0625 | ± | 0.0114 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1384 | ± | 0.0163 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | 0.2388 | ± | 0.0202 |
| | | none | 0 | acc_norm | 0.2388 | ± | 0.0202 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | 0.2500 | ± | 0.0205 |
| | | none | 0 | acc_norm | 0.2500 | ± | 0.0205 |
| hellaswag | 1 | none | 0 | acc | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2705 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | 0.1360 | ± | 0.0153 |
| | | none | 0 | acc_norm | 0.2620 | ± | 0.0197 |
| piqa | 1 | none | 0 | acc | 0.5550 | ± | 0.0116 |
| | | none | 0 | acc_norm | 0.5528 | ± | 0.0116 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | 0.5130 | ± | 0.0140 |
```sh
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| qasper_bool | 1 | none | 0 | f1 | 0.8966 | ± | 0.0166 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | 0.0597 | ± | 0.0052 |
| wikitext | 2 | none | 0 | bits_per_byte | 2.2154 | ± | N/A |
| | | none | 0 | byte_perplexity | 4.6441 | ± | N/A |
| | | none | 0 | word_perplexity | 3683.1019 | ± | N/A |
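
As a sanity check, the wikitext metrics above are internally consistent: bits_per_byte is the base-2 log of byte_perplexity.

```python
# Verify: bits_per_byte = log2(byte_perplexity) for the reported wikitext numbers.
import math

byte_perplexity = 4.6441
print(math.log2(byte_perplexity))  # ≈ 2.2154, matching the reported bits_per_byte
```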

## Continued Pretrain Evaluation

[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-contrain-quick/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1894 | ± | 0.0115 |
| | | none | 0 | acc_norm | 0.2193 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | 0.2638 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2655 | ± | 0.0044 |
| mmlu | 2 | none | | acc | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2438 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.2222 | ± | 0.0372 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2485 | ± | 0.0337 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2489 | ± | 0.0281 |
| - international_law | 1 | none | 0 | acc | 0.2397 | ± | 0.0390 |
| - jurisprudence | 1 | none | 0 | acc | 0.2407 | ± | 0.0413 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2025 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | 0.1965 | ± | 0.0214 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1897 | ± | 0.0223 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2529 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | 0.3158 | ± | 0.0357 |
| - other | 2 | none | | acc | 0.2407 | ± | 0.0077 |
| - business_ethics | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2302 | ± | 0.0259 |
| - college_medicine | 1 | none | 0 | acc | 0.2370 | ± | 0.0324 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.3004 | ± | 0.0308 |
| - management | 1 | none | 0 | acc | 0.1845 | ± | 0.0384 |
| - marketing | 1 | none | 0 | acc | 0.2863 | ± | 0.0296 |
| - medical_genetics | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - miscellaneous | 1 | none | 0 | acc | 0.2375 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.2353 | ± | 0.0243 |
| - professional_accounting | 1 | none | 0 | acc | 0.2305 | ± | 0.0251 |
| - professional_medicine | 1 | none | 0 | acc | 0.2096 | ± | 0.0247 |
| - virology | 1 | none | 0 | acc | 0.2289 | ± | 0.0327 |
| - social sciences | 2 | none | | acc | 0.2382 | ± | 0.0077 |
| - econometrics | 1 | none | 0 | acc | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | 0.1818 | ± | 0.0275 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2280 | ± | 0.0303 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.2410 | ± | 0.0217 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2479 | ± | 0.0280 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2055 | ± | 0.0173 |
| - human_sexuality | 1 | none | 0 | acc | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | 0.2565 | ± | 0.0177 |
| - public_relations | 1 | none | 0 | acc | 0.2091 | ± | 0.0390 |
| - security_studies | 1 | none | 0 | acc | 0.2694 | ± | 0.0284 |
| - sociology | 1 | none | 0 | acc | 0.2438 | ± | 0.0304 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - stem | 2 | none | | acc | 0.2249 | ± | 0.0074 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1800 | ± | 0.0386 |
| - anatomy | 1 | none | 0 | acc | 0.1704 | ± | 0.0325 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - college_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - college_mathematics | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_physics | 1 | none | 0 | acc | 0.2353 | ± | 0.0422 |
| - computer_security | 1 | none | 0 | acc | 0.2800 | ± | 0.0451 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2596 | ± | 0.0287 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2345 | ± | 0.0353 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2434 | ± | 0.0221 |
| - high_school_biology | 1 | none | 0 | acc | 0.1871 | ± | 0.0222 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2118 | ± | 0.0287 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2222 | ± | 0.0253 |
| - high_school_physics | 1 | none | 0 | acc | 0.1921 | ± | 0.0322 |
| - high_school_statistics | 1 | none | 0 | acc | 0.2130 | ± | 0.0279 |
| - machine_learning | 1 | none | 0 | acc | 0.3036 | ± | 0.0436 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4931 | ± | 0.0161 |
| winogrande | 1 | none | 0 | acc | 0.5012 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2438 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2407 | ± | 0.0077 |
| - social sciences | 2 | none | | acc | 0.2382 | ± | 0.0077 |
| - stem | 2 | none | | acc | 0.2249 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-contrain-math/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | 0.2124 | ± | 0.0075 |
| | | none | 0 | acc_norm | 0.2137 | ± | 0.0075 |