tangled-llama-b-128k-base-v0.1
A pretrained language model based on the Llama architecture, with about 62.9M parameters. It was trained on 10.6B (10,630,121,844) tokens drawn from more than 31.3M (31,383,840) dataset rows.

This model isn't designed for immediate use, but rather for continued pretraining and finetuning on a downstream task. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained on sequences of 2K (2,048) tokens.

The objective is to retain a streamlined reasoning core while eliminating redundant knowledge from the model.
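As a rough illustration of the intended workflow, below is a minimal sketch of loading the checkpoint and computing the next-token loss used for continued pretraining or finetuning, via Hugging Face Transformers. It assumes the LitGPT checkpoint has been converted to (or published in) Transformers format; the checkpoint path is a placeholder, not a confirmed repository id.

```python
# Minimal sketch (not the official usage): load the base checkpoint and compute
# the next-token prediction loss used for continued pretraining / finetuning.
# Assumes a Transformers-format copy of the checkpoint; the path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "tangled-llama-b-128k-base-v0.1"  # placeholder: repo id or local directory

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# The architecture accepts up to 131,072 tokens, but pretraining used 2,048-token
# sequences, so keep training chunks short until the model is adapted to longer ones.
batch = tokenizer("Example continued-pretraining text.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
print(float(loss))
```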
Training curves: loss / val_loss, val_ppl, epoch, learning_rate.
Pretrain Evaluation
lm-evaluation-harness
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | ↑ | 0.2201 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | ↑ | 0.2705 | ± | 0.0044 |
| mmlu | 2 | none | | acc | ↑ | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | ↑ | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | ↑ | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | ↑ | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | ↑ | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.2857 | ± | 0.0429 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.5130 | ± | 0.0140 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | ↑ | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | ↑ | 0.2404 | ± | 0.0076 |
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A | | | | | | | |
| - leaderboard_bbh | N/A | | | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.4680 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5187 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.1880 | ± | 0.0248 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.4720 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5240 | ± | 0.0316 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2160 | ± | 0.0261 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1400 | ± | 0.0220 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3200 | ± | 0.0296 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2360 | ± | 0.0269 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.1000 | ± | 0.0190 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.1575 | ± | 0.0303 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.0920 | ± | 0.0183 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2480 | ± | 0.0274 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.1200 | ± | 0.0206 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.4888 | ± | 0.0376 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2440 | ± | 0.0272 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.0960 | ± | 0.0187 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3800 | ± | 0.0308 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4720 | ± | 0.0316 |
| - leaderboard_gpqa | N/A | | | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.1970 | ± | 0.0283 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2509 | ± | 0.0186 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2589 | ± | 0.0207 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.2650 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.2530 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.1590 | ± | 0.0157 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.1553 | ± | 0.0156 |
| - leaderboard_math_hard | N/A | | | | | | | |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1174 | ± | 0.0029 |
| - leaderboard_musr | N/A | | | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5160 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2695 | ± | 0.0278 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3480 | ± | 0.0302 |
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0205 | ± | 0.0039 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2010 | ± | 0.0073 |
| | | none | 0 | acc_norm | ↑ | 0.2077 | ± | 0.0074 |
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2459 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.3175 | ± | 0.0416 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0332 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2194 | ± | 0.0269 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2479 | ± | 0.0394 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2315 | ± | 0.0408 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2147 | ± | 0.0323 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2168 | ± | 0.0222 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1865 | ± | 0.0221 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2490 | ± | 0.0110 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.3450 | ± | 0.0365 |
| - other | 2 | none | | acc | ↑ | 0.2385 | ± | 0.0076 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2264 | ± | 0.0258 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2601 | ± | 0.0335 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2422 | ± | 0.0288 |
| - management | 1 | none | 0 | acc | ↑ | 0.2330 | ± | 0.0419 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2821 | ± | 0.0295 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2900 | ± | 0.0456 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2388 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.1993 | ± | 0.0229 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2270 | ± | 0.0250 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.2610 | ± | 0.0267 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2349 | ± | 0.0330 |
| - social sciences | 2 | none | | acc | ↑ | 0.2632 | ± | 0.0079 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2544 | ± | 0.0410 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.1869 | ± | 0.0278 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2850 | ± | 0.0326 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.3128 | ± | 0.0235 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2773 | ± | 0.0291 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2422 | ± | 0.0184 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2595 | ± | 0.0384 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2435 | ± | 0.0174 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2273 | ± | 0.0401 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.3265 | ± | 0.0300 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2537 | ± | 0.0308 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.3000 | ± | 0.0461 |
| - stem | 2 | none | | acc | ↑ | 0.2404 | ± | 0.0076 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.1700 | ± | 0.0378 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2074 | ± | 0.0350 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2153 | ± | 0.0344 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.1700 | ± | 0.0378 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2647 | ± | 0.0439 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2766 | ± | 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2552 | ± | 0.0363 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2566 | ± | 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.2097 | ± | 0.0232 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2111 | ± | 0.0249 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.2517 | ± | 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.3056 | ± | 0.0314 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.2857 | ± | 0.0429 |
| mmlu_pro | 2 | custom-extract | | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.2468 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2459 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.2385 | ± | 0.0076 |
| - social sciences | 2 | none | | acc | ↑ | 0.2632 | ± | 0.0079 |
| - stem | 2 | none | | acc | ↑ | 0.2404 | ± | 0.0076 |
| mmlu_pro | 2 | custom-extract | | exact_match | ↑ | 0.0000 | ± | 0.0000 |
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.1852 | ± | 0.0114 |
| | | none | 0 | acc_norm | ↑ | 0.2201 | ± | 0.0121 |
| boolq | 2 | none | 0 | acc | ↑ | 0.4446 | ± | 0.0087 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0859 | ± | 0.0200 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0606 | ± | 0.0170 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1717 | ± | 0.0269 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | ↑ | 0.2677 | ± | 0.0315 |
| | | none | 0 | acc_norm | ↑ | 0.2677 | ± | 0.0315 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | ↑ | 0.1970 | ± | 0.0283 |
| | | none | 0 | acc_norm | ↑ | 0.1970 | ± | 0.0283 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.0971 | ± | 0.0127 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0696 | ± | 0.0109 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1502 | ± | 0.0153 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | ↑ | 0.2399 | ± | 0.0183 |
| | | none | 0 | acc_norm | ↑ | 0.2399 | ± | 0.0183 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | ↑ | 0.2473 | ± | 0.0185 |
| | | none | 0 | acc_norm | ↑ | 0.2473 | ± | 0.0185 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1116 | ± | 0.0149 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.0625 | ± | 0.0114 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1384 | ± | 0.0163 |
| | | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | ↑ | 0.2388 | ± | 0.0202 |
| | | none | 0 | acc_norm | ↑ | 0.2388 | ± | 0.0202 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0205 |
| | | none | 0 | acc_norm | ↑ | 0.2500 | ± | 0.0205 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2628 | ± | 0.0044 |
| | | none | 0 | acc_norm | ↑ | 0.2705 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | ↑ | 0.1360 | ± | 0.0153 |
| | | none | 0 | acc_norm | ↑ | 0.2620 | ± | 0.0197 |
| piqa | 1 | none | 0 | acc | ↑ | 0.5550 | ± | 0.0116 |
| | | none | 0 | acc_norm | ↑ | 0.5528 | ± | 0.0116 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.5010 | ± | 0.0159 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.5130 | ± | 0.0140 |
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| qasper_bool | 1 | none | 0 | f1 | ↑ | 0.8966 | ± | 0.0166 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | ↑ | 0.0597 | ± | 0.0052 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 2.2154 | ± | N/A |
| | | none | 0 | byte_perplexity | ↓ | 4.6441 | ± | N/A |
| | | none | 0 | word_perplexity | ↓ | 3683.1019 | ± | N/A |
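The three wikitext numbers are different normalizations of the same total log-likelihood (assuming the standard lm-evaluation-harness definitions of these metrics), so they can be cross-checked against one another. The short sketch below reproduces the reported bits_per_byte from the reported byte_perplexity and backs out the implied bytes-per-word ratio.

```python
# Consistency check (a sketch, assuming the usual lm-evaluation-harness definitions):
#   bits_per_byte   = -log L / (n_bytes * ln 2)
#   byte_perplexity = exp(-log L / n_bytes)
# hence bits_per_byte should equal log2(byte_perplexity).
import math

byte_perplexity = 4.6441
print(round(math.log2(byte_perplexity), 4))  # ~2.2154, matching the reported bits_per_byte

# word_perplexity = byte_perplexity ** (bytes per word), so the implied ratio is:
word_perplexity = 3683.1019
print(round(math.log(word_perplexity) / math.log(byte_perplexity), 2))  # ~5.35 bytes/word
```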
Continued Pretraining Evaluation
lm-evaluation-harness
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-contrain-quick/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.1894 | ± | 0.0115 |
| | | none | 0 | acc_norm | ↑ | 0.2193 | ± | 0.0121 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2638 | ± | 0.0044 |
| | | none | 0 | acc_norm | ↑ | 0.2655 | ± | 0.0044 |
| mmlu | 2 | none | | acc | ↑ | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2438 | ± | 0.0063 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.2222 | ± | 0.0372 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2485 | ± | 0.0337 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2304 | ± | 0.0296 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2489 | ± | 0.0281 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2397 | ± | 0.0390 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2407 | ± | 0.0413 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2025 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.1965 | ± | 0.0214 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2726 | ± | 0.0149 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1897 | ± | 0.0223 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2191 | ± | 0.0230 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2529 | ± | 0.0111 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.3158 | ± | 0.0357 |
| - other | 2 | none | | acc | ↑ | 0.2407 | ± | 0.0077 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2302 | ± | 0.0259 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2370 | ± | 0.0324 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.3004 | ± | 0.0308 |
| - management | 1 | none | 0 | acc | ↑ | 0.1845 | ± | 0.0384 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2863 | ± | 0.0296 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.3000 | ± | 0.0461 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2375 | ± | 0.0152 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2353 | ± | 0.0243 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2305 | ± | 0.0251 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.2096 | ± | 0.0247 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2289 | ± | 0.0327 |
| - social sciences | 2 | none | | acc | ↑ | 0.2382 | ± | 0.0077 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2368 | ± | 0.0400 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.1818 | ± | 0.0275 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2280 | ± | 0.0303 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2410 | ± | 0.0217 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2479 | ± | 0.0280 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2055 | ± | 0.0173 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2565 | ± | 0.0177 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2091 | ± | 0.0390 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.2694 | ± | 0.0284 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2438 | ± | 0.0304 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2900 | ± | 0.0456 |
| - stem | 2 | none | | acc | ↑ | 0.2249 | ± | 0.0074 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.1800 | ± | 0.0386 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.1704 | ± | 0.0325 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.2105 | ± | 0.0332 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.1900 | ± | 0.0394 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2353 | ± | 0.0422 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2596 | ± | 0.0287 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2345 | ± | 0.0353 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2434 | ± | 0.0221 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.1871 | ± | 0.0222 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.2118 | ± | 0.0287 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2600 | ± | 0.0441 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2222 | ± | 0.0253 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.1921 | ± | 0.0322 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.2130 | ± | 0.0279 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.3036 | ± | 0.0436 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.4931 | ± | 0.0161 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.5012 | ± | 0.0141 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.2376 | ± | 0.0036 |
| - humanities | 2 | none | | acc | ↑ | 0.2438 | ± | 0.0063 |
| - other | 2 | none | | acc | ↑ | 0.2407 | ± | 0.0077 |
| - social sciences | 2 | none | | acc | ↑ | 0.2382 | ± | 0.0077 |
| - stem | 2 | none | | acc | ↑ | 0.2249 | ± | 0.0074 |
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-contrain-math/' --batch_size 4 --dtype 'bfloat16' out/contrain/final/
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0182 | ± | 0.0037 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2124 | ± | 0.0075 |
| | | none | 0 | acc_norm | ↑ | 0.2137 | ± | 0.0075 |