NanoExperiment-Models

Models

| Arch. | Act. | Vocab Size | Hidden Size | Intermediate Size | Layers | Attn Heads | KV Heads | Tie Embeddings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Qwen2 | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Mistral | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Gemma2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| OLMo | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Cohere | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| Phi | GeGLU | 2K | 256 | 1024 | 2 | 8 | 4 | True |
| StarCoder2 | GeGLU(Tanh) | 2K | 256 | 768 | 2 | 8 | 4 | True |
| StableLM | SwiGLU | 2K | 256 | 768 | 2 | 8 | 4 | True |
| GPT2 | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| GPT-J | GeGLU | 2K | 256 | 1024 | 2 | 4 | 4 | True |
| GPT-NeoX | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| Bloom | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| MPT | GeGLU | 2K | 256 | 1024 | 2 | 8 | 8 | True |
| RWKV | - | 2K | 256 | 1024 | 2 | - | - | True |
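
For reference, a row such as the LLaMA configuration above can be expressed directly with Hugging Face `transformers`. The snippet below is a minimal sketch: it assumes the "2K" vocabulary means 2048 tokens, and it is not the original training code.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=2048,            # "Vocab Size = 2K" (assumed to mean 2048 tokens)
    hidden_size=256,
    intermediate_size=768,
    num_hidden_layers=2,
    num_attention_heads=8,
    num_key_value_heads=4,      # grouped-query attention (8 query heads, 4 KV heads)
    hidden_act="silu",          # SwiGLU MLP
    tie_word_embeddings=True,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```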

Experimental Setup

| Hyperparameter | Value |
| --- | --- |
| Batch Size | 1024 |
| Grad Acc Steps | 1 |
| Max LR | 1.5 * 10^-3 |
| LR Scheduler | Trapezoidal / Cosine |
| Warmup Ratio | 0.01 |
| Decay Ratio | 0.35 |
| Decay Progress | Exponential |
| Min Decay LR | 0.01 * Max LR |
| Optimizer | AdamW |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Num Epochs | 1 |
| FP16 | True |
| Device | Tesla-V100-SXM2-32GB |
| Seed | 3407 |
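
The trapezoidal (warmup-stable-decay) scheduler in the table can be read as: linear warmup over the first 1% of steps, a constant plateau at Max LR, then a decay phase over the final 35% of steps down to 0.01 * Max LR. The sketch below assumes "Exponential" decay progress means a geometric interpolation from Max LR to Min Decay LR; the exact formula used in these runs is not given in this card.

```python
def trapezoidal_lr(step, total_steps, max_lr=1.5e-3,
                   warmup_ratio=0.01, decay_ratio=0.35, min_lr_ratio=0.01):
    """Warmup-stable-decay schedule sketched from the setup table (an assumption)."""
    warmup_steps = int(total_steps * warmup_ratio)
    decay_steps = int(total_steps * decay_ratio)
    stable_end = total_steps - decay_steps
    min_lr = max_lr * min_lr_ratio

    if step < warmup_steps:                               # linear warmup
        return max_lr * step / max(1, warmup_steps)
    if step < stable_end:                                 # constant plateau
        return max_lr
    progress = (step - stable_end) / max(1, decay_steps)  # 0 -> 1 over the decay phase
    return max_lr * (min_lr / max_lr) ** progress         # exponential decay to min_lr
```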

Results

Trapezoidal vs. Cosine

Final Loss is averaged over the last 10 training steps.

| Arch. | Training Speed (it/s) | Total Loss (Trapezoidal) | Total Loss (Cosine) | Final Loss (Trapezoidal) | Final Loss (Cosine) |
| --- | --- | --- | --- | --- | --- |
| LLaMA | 4.35 | 1.5734 | 1.5626 | 1.2784 | 1.2855 |
| Qwen2 | 4.41 | 1.5735 | 1.5565 | 1.2760 | 1.2943 |
| Mistral | 4.44 | 1.5756 | 1.5645 | 1.2787 | 1.3004 |
| Gemma | 1.79 | 1.3894 | 1.3737 | 1.0841 | 1.1010 |
| Gemma2 | 1.59 | 1.3754 | 1.3597 | 1.0601 | 1.0752 |
| OLMo | 5.00 | 1.6011 | 1.5855 | 1.2857 | 1.3039 |
| Cohere | 4.04 | 2.1327 | 2.1152 | 1.6244 | 1.6593 |
| Phi | 5.78 | 1.7525 | 1.7419 | 1.4770 | 1.4876 |
| StarCoder2 | 3.01 | 1.6125 | 1.6498 | 1.3044 | 1.3718 |
| StableLM | 5.06 | 1.5835 | 1.5905 | 1.2662 | 1.2998 |
| GPT2 | 3.53 | 2.1100 | 2.1081 | 1.8236 | 1.8508 |
| GPT-J | 3.06 | 1.7198 | 1.6976 | 1.4503 | 1.4541 |
| GPT-NeoX | 5.06 | 1.7233 | 1.6981 | 1.4400 | 1.4303 |
| Bloom | 3.33 | 1.6910 | 1.6704 | 1.3690 | 1.3774 |
| MPT | 4.39 | 1.6466 | 1.6317 | 1.3443 | 1.3550 |
| RWKV | 0.72 | 3.0151 | 3.0810 | 1.8569 | 1.9628 |
| Avg. | - | 1.755 | 1.749 | 1.389 | 1.413 |

BF16 & FP16

Final Loss is averaged over the last 10 training steps.

| Arch. | Total Loss (FP16) | Total Loss (BF16) | Final Loss (FP16) | Final Loss (BF16) |
| --- | --- | --- | --- | --- |
| LLaMA | 1.5734 | 1.5714 | 1.2784 | 1.2758 |
| Qwen2 | 1.5735 | 1.5675 | 1.2760 | 1.2764 |
| Mistral | 1.5756 | 1.5694 | 1.2787 | 1.2740 |
| OLMo | 1.6011 | 1.6059 | 1.2857 | 1.2901 |
| Cohere | 2.1327 | 2.1112 | 1.6244 | 1.6346 |
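
If training goes through the `transformers` `Trainer`, the two precision modes compared above are just flags on `TrainingArguments`; the snippet below is a sketch, not the original training script. Note that the Tesla V100 listed in the setup has no native BF16 support, so BF16 runs need Ampere-or-newer GPUs.

```python
from transformers import TrainingArguments

# FP16 run, matching the setup table above.
fp16_args = TrainingArguments(output_dir="out-fp16", fp16=True)

# BF16 run for the comparison; requires hardware with native BF16 (e.g. A100).
bf16_args = TrainingArguments(output_dir="out-bf16", bf16=True)
```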

Optimizers

Peak memory (MB) by optimizer and batch size:

| Optimizer | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adamw_torch | 601 | 605 | 633 | 707 | 857 | 1255 | 1637 | 2201 | 3787 | 6945 | 13293 |
| adamw_bnb_8bit | 589 | 595 | 625 | 699 | 849 | 1241 | 1625 | 2187 | 3773 | 6935 | 13283 |
| adamw_hf | 597 | 603 | 633 | 707 | 857 | 1251 | 1635 | 2197 | 3783 | 6941 | 13293 |
| lion_32bit | 591 | 597 | 627 | 701 | 851 | 1243 | 1627 | 2191 | 3777 | 6937 | 13285 |
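
The optimizer names in the table correspond to values accepted by the `optim` field of `TrainingArguments` in recent `transformers` versions, so the sweep can be reproduced roughly as sketched below. Treating the table's batch size as the per-device batch size is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="adamw_bnb_8bit",            # or "adamw_torch", "adamw_hf", "lion_32bit"
    per_device_train_batch_size=1024,  # assumed mapping of the table's batch size
    learning_rate=1.5e-3,              # values below taken from the setup table
    weight_decay=0.1,
    max_grad_norm=1.0,
    num_train_epochs=1,
    seed=3407,
)
```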

Citation

@misc{NanoExperiment,
    title={NanoExperiment},
    url={https://huggingface.co/Mxode/NanoExperiment-Models},
    author={Mxode},
    month={September},
    year={2024}
}