---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
language:
  - en
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - li
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - ns
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - rm
  - ro
  - ru
  - sa
  - si
  - sc
  - sd
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tn
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zu
datasets:
  - xu-song/cc100-samples
  - jordiclive/wikipedia-summary-dataset
  - JeanKaddour/minipile
  - badrex/llm-emoji-dataset
  - fblgit/simple-math
  - Gusarich/math-expressions-1m
  - AtlasUnified/atlas-math-sets
  - gair-prox/open-web-math-pro
  - bigcode/the-stack-smol-xs
  - rombodawg/code_bagel
  - AtlasUnified/Atlas-Reasoning
  - thesven/gsm8k-reasoning
  - AlgorithmicResearchGroup/math_reasoning_autoformalization_track
  - KingNish/reasoning-base-20k
  - SkunkworksAI/reasoning-0.01
  - Magpie-Align/Magpie-Reasoning-150K
tags:
  - litgpt
  - litdata
---

# tangled-llama-b-128k-base-v0.1


A pretrained language model based on the Llama architecture, with about 62.9M parameters. It was trained on 10.6B (10,630,121,844) tokens drawn from more than 31.3M (31,383,840) dataset rows.

This model is not intended for immediate use; it is a base for continued pretraining and finetuning on downstream tasks. While it can handle context lengths of up to 128K (131,072) tokens, it was pretrained with sequences of only 2K (2,048) tokens, as noted in the loading sketch below.
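
For a quick smoke test, the checkpoint can be loaded as a plain causal LM with transformers. A minimal sketch, assuming an HF-format export of the checkpoint; the model id below is a placeholder, not a confirmed repository:

```python
# Minimal generation sketch; the model id is a placeholder, not a confirmed repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tangled-llama-b-128k-base-v0.1"  # placeholder: hub repo id or local path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Supports up to 131,072 tokens of context, but pretraining used 2,048-token
# sequences, so quality on very long inputs is untested.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```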

The objective is to retain a lean cognitive and reasoning core while eliminating redundant memorized knowledge from the model.
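
Since the intended use is continued pretraining or finetuning, here is a minimal finetuning sketch using transformers' Trainer. The dataset choice and hyperparameters are illustrative assumptions, not the recipe used to train this model:

```python
# Illustrative causal-LM finetuning sketch; not the actual training recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "tangled-llama-b-128k-base-v0.1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed to pad batches
model = AutoModelForCausalLM.from_pretrained(model_id)

# minipile is one of the listed pretraining sources; any text dataset works.
dataset = load_dataset("JeanKaddour/minipile", split="train[:1%]")

def tokenize(batch):
    # Match the 2,048-token pretraining sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/finetune",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```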

*Training curves (loss, val_loss, val_ppl, epoch, learning_rate): plots omitted.*

## Pretrain Evaluation

[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | 0.2201 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2705 | ± | 0.0044 |
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | 0.2857 | ± | 0.0429 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | 0.5130 | ± | 0.0140 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
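
The same quick suite can also be reproduced through the lm-evaluation-harness Python API, assuming the litgpt checkpoint has been converted to Hugging Face format (a sketch; the path is a placeholder):

```python
# Sketch: run the quick suite via the lm-evaluation-harness Python API.
# Assumes an HF-format conversion of the litgpt checkpoint at the given path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=out/pretrain/final/,dtype=bfloat16",  # placeholder path
    tasks=["hellaswag", "gsm8k", "truthfulqa_mc2",
           "mmlu", "winogrande", "arc_challenge"],
    batch_size=4,
)
print(results["results"])
```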
```sh
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| leaderboard | N/A | | | | | | |
| - leaderboard_bbh | N/A | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | 0.4680 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | 0.5187 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | 0.1880 | ± | 0.0248 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | 0.3440 | ± | 0.0301 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | 0.4720 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | 0.2160 | ± | 0.0261 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | 0.1400 | ± | 0.0220 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | 0.3200 | ± | 0.0296 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | 0.2360 | ± | 0.0269 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | 0.4200 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | 0.1000 | ± | 0.0190 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | 0.1575 | ± | 0.0303 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | 0.0920 | ± | 0.0183 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | 0.4888 | ± | 0.0376 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | 0.2440 | ± | 0.0272 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | 0.0960 | ± | 0.0187 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | 0.3800 | ± | 0.0308 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | 0.4720 | ± | 0.0316 |
| - leaderboard_gpqa | N/A | | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.1970 | ± | 0.0283 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2509 | ± | 0.0186 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.2589 | ± | 0.0207 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.2650 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | 0.2530 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | 0.1590 | ± | 0.0157 |
| | | none | 0 | prompt_level_strict_acc | 0.1553 | ± | 0.0156 |
| - leaderboard_math_hard | N/A | | | | | | |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | 0.1174 | ± | 0.0029 |
| - leaderboard_musr | N/A | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | 0.5160 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | 0.2695 | ± | 0.0278 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | 0.3480 | ± | 0.0302 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | 0.2010 | ± | 0.0073 |
| | | none | 0 | acc_norm | 0.2077 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | 0.2857 | ± | 0.0429 |
| mmlu_pro | 2 | custom-extract | | exact_match | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | 0.0000 | ± | 0.0000 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | 0.2404 | ± | 0.0076 |
| mmlu_pro | 2 | custom-extract | | exact_match | 0.0000 | ± | 0.0000 |
```sh
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | 0.2201 | ± | 0.0121 |
| boolq | 2 | none | 0 | acc | 0.4446 | ± | 0.0087 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.0859 | ± | 0.0200 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0606 | ± | 0.0170 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1717 | ± | 0.0269 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | 0.2677 | ± | 0.0315 |
| | | none | 0 | acc_norm | 0.2677 | ± | 0.0315 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | 0.1970 | ± | 0.0283 |
| | | none | 0 | acc_norm | 0.1970 | ± | 0.0283 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.0971 | ± | 0.0127 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0696 | ± | 0.0109 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1502 | ± | 0.0153 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | 0.2399 | ± | 0.0183 |
| | | none | 0 | acc_norm | 0.2399 | ± | 0.0183 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | 0.2473 | ± | 0.0185 |
| | | none | 0 | acc_norm | 0.2473 | ± | 0.0185 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1116 | ± | 0.0149 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | 0.0625 | ± | 0.0114 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | 0.1384 | ± | 0.0163 |
| | | strict-match | 0 | exact_match | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | 0.2388 | ± | 0.0202 |
| | | none | 0 | acc_norm | 0.2388 | ± | 0.0202 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | 0.2500 | ± | 0.0205 |
| | | none | 0 | acc_norm | 0.2500 | ± | 0.0205 |
| hellaswag | 1 | none | 0 | acc | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2705 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | 0.1360 | ± | 0.0153 |
| | | none | 0 | acc_norm | 0.2620 | ± | 0.0197 |
| piqa | 1 | none | 0 | acc | 0.5550 | ± | 0.0116 |
| | | none | 0 | acc_norm | 0.5528 | ± | 0.0116 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | 0.5130 | ± | 0.0140 |
```sh
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| qasper_bool | 1 | none | 0 | f1 | 0.8966 | ± | 0.0166 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | 0.0597 | ± | 0.0052 |
| wikitext | 2 | none | 0 | bits_per_byte | 2.2154 | ± | N/A |
| | | none | 0 | byte_perplexity | 4.6441 | ± | N/A |
| | | none | 0 | word_perplexity | 3683.1019 | ± | N/A |
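
As a sanity check, the wikitext metrics above are internally consistent: bits_per_byte is the base-2 log of byte_perplexity.

```python
# Verify: bits_per_byte = log2(byte_perplexity) for the reported wikitext numbers.
import math

byte_perplexity = 4.6441
print(math.log2(byte_perplexity))  # ≈ 2.2154, matching the reported bits_per_byte
```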

## Continued Pretrain Evaluation

[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-contrain-quick/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| arc_challenge | 1 | none | 0 | acc | 0.1894 | ± | 0.0115 |
| | | none | 0 | acc_norm | 0.2193 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | 0.2638 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2655 | ± | 0.0044 |
| mmlu | 2 | none | | acc | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2438 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | 0.2222 | ± | 0.0372 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2485 | ± | 0.0337 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2489 | ± | 0.0281 |
| - international_law | 1 | none | 0 | acc | 0.2397 | ± | 0.0390 |
| - jurisprudence | 1 | none | 0 | acc | 0.2407 | ± | 0.0413 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2025 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | 0.1965 | ± | 0.0214 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | 0.1897 | ± | 0.0223 |
| - prehistory | 1 | none | 0 | acc | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | 0.2529 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | 0.3158 | ± | 0.0357 |
| - other | 2 | none | | acc | 0.2407 | ± | 0.0077 |
| - business_ethics | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2302 | ± | 0.0259 |
| - college_medicine | 1 | none | 0 | acc | 0.2370 | ± | 0.0324 |
| - global_facts | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | 0.3004 | ± | 0.0308 |
| - management | 1 | none | 0 | acc | 0.1845 | ± | 0.0384 |
| - marketing | 1 | none | 0 | acc | 0.2863 | ± | 0.0296 |
| - medical_genetics | 1 | none | 0 | acc | 0.3000 | ± | 0.0461 |
| - miscellaneous | 1 | none | 0 | acc | 0.2375 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.2353 | ± | 0.0243 |
| - professional_accounting | 1 | none | 0 | acc | 0.2305 | ± | 0.0251 |
| - professional_medicine | 1 | none | 0 | acc | 0.2096 | ± | 0.0247 |
| - virology | 1 | none | 0 | acc | 0.2289 | ± | 0.0327 |
| - social sciences | 2 | none | | acc | 0.2382 | ± | 0.0077 |
| - econometrics | 1 | none | 0 | acc | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | 0.1818 | ± | 0.0275 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.2280 | ± | 0.0303 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.2410 | ± | 0.0217 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.2479 | ± | 0.0280 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2055 | ± | 0.0173 |
| - human_sexuality | 1 | none | 0 | acc | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | 0.2565 | ± | 0.0177 |
| - public_relations | 1 | none | 0 | acc | 0.2091 | ± | 0.0390 |
| - security_studies | 1 | none | 0 | acc | 0.2694 | ± | 0.0284 |
| - sociology | 1 | none | 0 | acc | 0.2438 | ± | 0.0304 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.2900 | ± | 0.0456 |
| - stem | 2 | none | | acc | 0.2249 | ± | 0.0074 |
| - abstract_algebra | 1 | none | 0 | acc | 0.1800 | ± | 0.0386 |
| - anatomy | 1 | none | 0 | acc | 0.1704 | ± | 0.0325 |
| - astronomy | 1 | none | 0 | acc | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | 0.1900 | ± | 0.0394 |
| - college_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - college_mathematics | 1 | none | 0 | acc | 0.2000 | ± | 0.0402 |
| - college_physics | 1 | none | 0 | acc | 0.2353 | ± | 0.0422 |
| - computer_security | 1 | none | 0 | acc | 0.2800 | ± | 0.0451 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2596 | ± | 0.0287 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2345 | ± | 0.0353 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2434 | ± | 0.0221 |
| - high_school_biology | 1 | none | 0 | acc | 0.1871 | ± | 0.0222 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2118 | ± | 0.0287 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2222 | ± | 0.0253 |
| - high_school_physics | 1 | none | 0 | acc | 0.1921 | ± | 0.0322 |
| - high_school_statistics | 1 | none | 0 | acc | 0.2130 | ± | 0.0279 |
| - machine_learning | 1 | none | 0 | acc | 0.3036 | ± | 0.0436 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4931 | ± | 0.0161 |
| winogrande | 1 | none | 0 | acc | 0.5012 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | 0.2438 | ± | 0.0063 |
| - other | 2 | none | | acc | 0.2407 | ± | 0.0077 |
| - social sciences | 2 | none | | acc | 0.2382 | ± | 0.0077 |
| - stem | 2 | none | | acc | 0.2249 | ± | 0.0074 |
```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-contrain-math/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value |  | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | 0.2124 | ± | 0.0075 |
| | | none | 0 | acc_norm | 0.2137 | ± | 0.0075 |