NanoExperiment-Models
Models
| Arch. | Act. | Vocab | Hidden | Interm. | Layers | Heads | KV Heads | Tie Emb. |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |
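
For concreteness, the first row can be written out as a standard `transformers` config. This is a minimal sketch that assumes the columns map to the usual `vocab_size`, `hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads` and `tie_word_embeddings` fields; the repo's actual config files may differ.

```python
# Minimal sketch: a nano-sized LLaMA-style config matching the first table row.
# The field mapping is an assumption; check the repo's config files for the real values.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=2048,           # "Vocab" (2K)
    hidden_size=256,           # "Hidden"
    intermediate_size=768,     # "Interm."
    num_hidden_layers=2,       # "Layers"
    num_attention_heads=8,     # "Heads"
    num_key_value_heads=4,     # "KV Heads" (grouped-query attention)
    hidden_act="silu",         # SwiGLU-style gated MLP
    tie_word_embeddings=True,  # "Tie Emb."
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
```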
Experimental Setup
| Setting | Value |
| :--- | :--- |
| Batch Size | 1024 |
| Grad Acc Steps | 1 |
| Max LR | 1.5 * 10^-3 |
| LR Scheduler | Trapezoidal / Cosine |
| Warmup Ratio | 0.01 |
| Decay Ratio | 0.35 |
| Decay Progress | Exponential |
| Min Decay LR | 0.01 * Max LR |
| Optimizer | AdamW |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Num Epochs | 1 |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | 3407 |
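
The trapezoidal (warmup-stable-decay) schedule is not one of the stock `transformers` schedulers, so the sketch below shows one way to build it with a plain `LambdaLR` from the settings above. The exact shape of the exponential decay phase and the helper name `trapezoidal_lambda` are assumptions, not the experiment's actual implementation; the cosine baseline would correspond to `transformers.get_cosine_schedule_with_warmup`.

```python
# Sketch of a trapezoidal (warmup-stable-decay) LR schedule matching the settings above:
# 1% linear warmup, a constant plateau at max LR, then an exponential decay over the
# last 35% of steps down to 0.01 * max LR. The decay curve is an assumption.
import torch

def trapezoidal_lambda(total_steps, warmup_ratio=0.01, decay_ratio=0.35, min_lr_ratio=0.01):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    decay_steps = max(1, int(total_steps * decay_ratio))
    decay_start = total_steps - decay_steps

    def lr_lambda(step):
        if step < warmup_steps:                     # linear warmup
            return step / warmup_steps
        if step < decay_start:                      # constant plateau at max LR
            return 1.0
        progress = min(1.0, (step - decay_start) / decay_steps)
        return min_lr_ratio ** progress             # exponential decay to min LR

    return lr_lambda

# Usage with the table's settings (max LR 1.5e-3, AdamW, weight decay 0.1):
model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=trapezoidal_lambda(1000))
```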
Results
Trapezoidal vs. Cosine
| Arch. | Training Speed (it/s) | Total Loss (Trapezoidal) | Total Loss (Cosine) | Final Loss (Trapezoidal) | Final Loss (Cosine) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| LLaMA | 4.35 | 1.5734 | 1.5626 | 1.2784 | 1.2855 |
| Qwen2 | 4.41 | 1.5735 | 1.5565 | 1.2760 | 1.2943 |
| Mistral | 4.44 | 1.5756 | 1.5645 | 1.2787 | 1.3004 |
| Gemma | 1.79 | 1.3894 | 1.3737 | 1.0841 | 1.1010 |
| Gemma2 | 1.59 | 1.3754 | 1.3597 | 1.0601 | 1.0752 |
| OLMo | 5.00 | 1.6011 | 1.5855 | 1.2857 | 1.3039 |
| Cohere | 4.04 | 2.1327 | 2.1152 | 1.6244 | 1.6593 |
| Phi | 5.78 | 1.7525 | 1.7419 | 1.4770 | 1.4876 |
| StarCoder2 | 3.01 | 1.6125 | 1.6498 | 1.3044 | 1.3718 |
| StableLM | 5.06 | 1.5835 | 1.5905 | 1.2662 | 1.2998 |
| GPT2 | 3.53 | 2.1100 | 2.1081 | 1.8236 | 1.8508 |
| GPT-J | 3.06 | 1.7198 | 1.6976 | 1.4503 | 1.4541 |
| GPT-NeoX | 5.06 | 1.7233 | 1.6981 | 1.4400 | 1.4303 |
| Bloom | 3.33 | 1.6910 | 1.6704 | 1.3690 | 1.3774 |
| MPT | 4.39 | 1.6466 | 1.6317 | 1.3443 | 1.3550 |
| RWKV | 0.72 | 3.0151 | 3.0810 | 1.8569 | 1.9628 |
| Avg. | - | 1.755 | 1.749 | 1.389 | 1.413 |

*Final Loss: average of the last 10 steps.*
BF16 & FP16
| Arch. | Total Loss (FP16) | Total Loss (BF16) | Final Loss (FP16) | Final Loss (BF16) |
| :--- | :---: | :---: | :---: | :---: |
| LLaMA | 1.5734 | 1.5714 | 1.2784 | 1.2758 |
| Qwen2 | 1.5735 | 1.5675 | 1.2760 | 1.2764 |
| Mistral | 1.5756 | 1.5694 | 1.2787 | 1.2740 |
| OLMo | 1.6011 | 1.6059 | 1.2857 | 1.2901 |
| Cohere | 2.1327 | 2.1112 | 1.6244 | 1.6346 |

*Final Loss: average of the last 10 steps.*
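
If the runs are driven through the `transformers` `Trainer`, the FP16/BF16 comparison reduces to flipping the mixed-precision flags. A hedged sketch (the repo's actual training script is not reproduced here):

```python
# Sketch: the two mixed-precision modes compared above, expressed as Trainer flags.
# Assumes training goes through transformers.Trainer; BF16 also needs supporting hardware.
from transformers import TrainingArguments

fp16_args = TrainingArguments(output_dir="out-fp16", fp16=True)
bf16_args = TrainingArguments(output_dir="out-bf16", bf16=True)
```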
Optimizers
Peak Mem (MB) by batch size:

| Optimizer | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| adamw_torch | 601 | 605 | 633 | 707 | 857 | 1255 | 1637 | 2201 | 3787 | 6945 | 13293 |
| adamw_bnb_8bit | 589 | 595 | 625 | 699 | 849 | 1241 | 1625 | 2187 | 3773 | 6935 | 13283 |
| adamw_hf | 597 | 603 | 633 | 707 | 857 | 1251 | 1635 | 2197 | 3783 | 6941 | 13293 |
| lion_32bit | 591 | 597 | 627 | 701 | 851 | 1243 | 1627 | 2191 | 3777 | 6937 | 13285 |
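
The row labels above are the identifiers accepted by the `optim` field of `TrainingArguments`, so switching optimizers is a one-line change; the 8-bit and Lion variants additionally require `bitsandbytes`. A hedged sketch combining this with the setup table, not the repo's actual launch script:

```python
# Sketch: selecting one of the benchmarked optimizers via TrainingArguments.
# "adamw_bnb_8bit" and "lion_32bit" require the bitsandbytes package.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1024,
    optim="adamw_bnb_8bit",   # or "adamw_torch", "adamw_hf", "lion_32bit"
    learning_rate=1.5e-3,
    weight_decay=0.1,
    max_grad_norm=1.0,
    num_train_epochs=1,
    fp16=True,
    seed=3407,
)
```

At this model scale the optimizer states are tiny next to activations and the CUDA context, which is likely why the gaps between rows in the table are so narrow.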
Citation
@misc{NanoExperiment,
title={NanoExperiment},
url={https://huggingface.co/Mxode/NanoExperiment-Models},
author={Mxode},
month={September},
year={2024}
}