NanoExperiment-Models
Models
| Arch. | Act. | Vocab | Hidden | Interm. | Layers | Heads | KV Heads | Tie Emb. |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |
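
For concreteness, the first row can be written out as a standard `transformers` config. This is a minimal sketch that assumes the columns map to the usual `vocab_size`, `hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads` and `tie_word_embeddings` fields; the repo's actual config files may differ.

```python
# Minimal sketch: a nano-sized LLaMA-style config matching the first table row.
# The field mapping is an assumption; check the repo's config files for the real values.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=2048,           # "Vocab" (2K)
    hidden_size=256,           # "Hidden"
    intermediate_size=768,     # "Interm."
    num_hidden_layers=2,       # "Layers"
    num_attention_heads=8,     # "Heads"
    num_key_value_heads=4,     # "KV Heads" (grouped-query attention)
    hidden_act="silu",         # SwiGLU-style gated MLP
    tie_word_embeddings=True,  # "Tie Emb."
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
```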
Experimental Setup
| Setting | Value |
| :--- | :--- |
| Batch Size | 1024 |
| Grad Acc Steps | 1 |
| Max LR | 1.5 * 10^-3 |
| LR Scheduler | Trapezoidal / Cosine |
| Warmup Ratio | 0.01 |
| Decay Ratio | 0.35 |
| Decay Progress | Exponential |
| Min Decay LR | 0.01 * Max LR |
| Optimizer | AdamW |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Num Epochs | 1 |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | 3407 |
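
The trapezoidal (warmup-stable-decay) schedule is not one of the stock `transformers` schedulers, so the sketch below shows one way to build it with a plain `LambdaLR` from the settings above. The exact shape of the exponential decay phase and the helper name `trapezoidal_lambda` are assumptions, not the experiment's actual implementation; the cosine baseline would correspond to `transformers.get_cosine_schedule_with_warmup`.

```python
# Sketch of a trapezoidal (warmup-stable-decay) LR schedule matching the settings above:
# 1% linear warmup, a constant plateau at max LR, then an exponential decay over the
# last 35% of steps down to 0.01 * max LR. The decay curve is an assumption.
import torch

def trapezoidal_lambda(total_steps, warmup_ratio=0.01, decay_ratio=0.35, min_lr_ratio=0.01):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    decay_steps = max(1, int(total_steps * decay_ratio))
    decay_start = total_steps - decay_steps

    def lr_lambda(step):
        if step < warmup_steps:                     # linear warmup
            return step / warmup_steps
        if step < decay_start:                      # constant plateau at max LR
            return 1.0
        progress = min(1.0, (step - decay_start) / decay_steps)
        return min_lr_ratio ** progress             # exponential decay to min LR

    return lr_lambda

# Usage with the table's settings (max LR 1.5e-3, AdamW, weight decay 0.1):
model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=trapezoidal_lambda(1000))
```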
Results
Trapezoidal vs. Cosine
| Arch. | Training Speed (it/s) | Total Loss (Trapezoidal) | Total Loss (Cosine) | Final Loss (Trapezoidal) | Final Loss (Cosine) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| LLaMA | 4.35 | 1.5734 | 1.5626 | 1.2784 | 1.2855 |
| Qwen2 | 4.41 | 1.5735 | 1.5565 | 1.2760 | 1.2943 |
| Mistral | 4.44 | 1.5756 | 1.5645 | 1.2787 | 1.3004 |
| Gemma | 1.79 | 1.3894 | 1.3737 | 1.0841 | 1.1010 |
| Gemma2 | 1.59 | 1.3754 | 1.3597 | 1.0601 | 1.0752 |
| OLMo | 5.00 | 1.6011 | 1.5855 | 1.2857 | 1.3039 |
| Cohere | 4.04 | 2.1327 | 2.1152 | 1.6244 | 1.6593 |
| Phi | 5.78 | 1.7525 | 1.7419 | 1.4770 | 1.4876 |
| StarCoder2 | 3.01 | 1.6125 | 1.6498 | 1.3044 | 1.3718 |
| StableLM | 5.06 | 1.5835 | 1.5905 | 1.2662 | 1.2998 |
| GPT2 | 3.53 | 2.1100 | 2.1081 | 1.8236 | 1.8508 |
| GPT-J | 3.06 | 1.7198 | 1.6976 | 1.4503 | 1.4541 |
| GPT-NeoX | 5.06 | 1.7233 | 1.6981 | 1.4400 | 1.4303 |
| Bloom | 3.33 | 1.6910 | 1.6704 | 1.3690 | 1.3774 |
| MPT | 4.39 | 1.6466 | 1.6317 | 1.3443 | 1.3550 |
| RWKV | 0.72 | 3.0151 | 3.0810 | 1.8569 | 1.9628 |
| Avg. | - | 1.755 | 1.749 | 1.389 | 1.413 |

*Final Loss: average of the last 10 steps.*
BF16 & FP16
| Arch. | Total Loss (FP16) | Total Loss (BF16) | Final Loss (FP16) | Final Loss (BF16) |
| :--- | :---: | :---: | :---: | :---: |
| LLaMA | 1.5734 | 1.5714 | 1.2784 | 1.2758 |
| Qwen2 | 1.5735 | 1.5675 | 1.2760 | 1.2764 |
| Mistral | 1.5756 | 1.5694 | 1.2787 | 1.2740 |
| OLMo | 1.6011 | 1.6059 | 1.2857 | 1.2901 |
| Cohere | 2.1327 | 2.1112 | 1.6244 | 1.6346 |

*Final Loss: average of the last 10 steps.*
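
If the runs are driven through the `transformers` `Trainer`, the FP16/BF16 comparison reduces to flipping the mixed-precision flags. A hedged sketch (the repo's actual training script is not reproduced here):

```python
# Sketch: the two mixed-precision modes compared above, expressed as Trainer flags.
# Assumes training goes through transformers.Trainer; BF16 also needs supporting hardware.
from transformers import TrainingArguments

fp16_args = TrainingArguments(output_dir="out-fp16", fp16=True)
bf16_args = TrainingArguments(output_dir="out-bf16", bf16=True)
```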
Optimizers
Peak Mem (MB) by batch size:

| Optimizer | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| adamw_torch | 601 | 605 | 633 | 707 | 857 | 1255 | 1637 | 2201 | 3787 | 6945 | 13293 |
| adamw_bnb_8bit | 589 | 595 | 625 | 699 | 849 | 1241 | 1625 | 2187 | 3773 | 6935 | 13283 |
| adamw_hf | 597 | 603 | 633 | 707 | 857 | 1251 | 1635 | 2197 | 3783 | 6941 | 13293 |
| lion_32bit | 591 | 597 | 627 | 701 | 851 | 1243 | 1627 | 2191 | 3777 | 6937 | 13285 |
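
The row labels above are the identifiers accepted by the `optim` field of `TrainingArguments`, so switching optimizers is a one-line change; the 8-bit and Lion variants additionally require `bitsandbytes`. A hedged sketch combining this with the setup table, not the repo's actual launch script:

```python
# Sketch: selecting one of the benchmarked optimizers via TrainingArguments.
# "adamw_bnb_8bit" and "lion_32bit" require the bitsandbytes package.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1024,
    optim="adamw_bnb_8bit",   # or "adamw_torch", "adamw_hf", "lion_32bit"
    learning_rate=1.5e-3,
    weight_decay=0.1,
    max_grad_norm=1.0,
    num_train_epochs=1,
    fp16=True,
    seed=3407,
)
```

At this model scale the optimizer states are tiny next to activations and the CUDA context, which is likely why the gaps between rows in the table are so narrow.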
Citation
@misc{NanoExperiment,
title={NanoExperiment},
url={https://huggingface.co/Mxode/NanoExperiment-Models},
author={Mxode},
month={September},
year={2024}
}