tFINE-base-300m

An encoder-decoder model (T5 architecture) pretrained with nanoT5:

  • tokenizer: SentencePiece BPE with byte fallback, 48k vocab (sized via vocab scaling laws)
  • data: fineweb-edu-dedup split of HuggingFaceTB/smollm-corpus
  • context length: 1024 tokens
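nanoT5 pretrains with the standard T5 span-corruption objective: contiguous spans of input tokens are replaced by sentinel markers, and the decoder reconstructs the masked spans. A minimal pure-Python sketch of that input/target format — the `<extra_id_N>` sentinel names follow the usual T5 convention and are illustrative, not necessarily this tokenizer's exact special tokens:

```python
def span_corrupt(tokens, spans):
    """Build a T5-style span-corruption example.

    tokens: list of token strings.
    spans:  list of (start, end) index pairs to mask, in order.
    Returns (inputs, targets): each masked span becomes a sentinel in the
    input, and the target lists each sentinel followed by the tokens it hid.
    """
    inputs, targets = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[prev:start])  # unmasked prefix
        inputs.append(sentinel)            # span replaced by one sentinel
        targets.append(sentinel)           # target echoes the sentinel...
        targets.extend(tokens[start:end])  # ...followed by the hidden tokens
        prev = end
    inputs.extend(tokens[prev:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks, [(1, 3), (5, 6)])
# inp → ['the', '<extra_id_0>', 'fox', 'jumps', '<extra_id_1>', 'the', 'lazy', 'dog']
# tgt → ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'over', '<extra_id_2>']
```

In practice span positions are sampled randomly to hit the configured mask rate; the fixed spans here just keep the example deterministic.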

details

Detailed info, including training logs, configs, and checkpoints, can be found under checkpoints/ in this repo.

hyperparameter overview
  1. Model:

    • Dropout rate: 0.0
    • Activations: silu, gated-silu
    • torch compile: true
  2. Data processing:

    • Input length: 1024
    • MLM probability: 0.15
  3. Optimization:

    • Optimizer: AdamW with scaling
    • Base learning rate: 0.008
    • Batch size: 120
    • Total training steps: 80,000
    • Warmup steps: 10,000
    • Learning rate scheduler: Cosine
    • Weight decay: 0.0001
    • Gradient clipping: 1.0
    • Gradient accumulation steps: 24
    • Final cosine learning rate: 1e-5
  4. Hardware:

    • Device: RTX 4080
    • Precision: bfloat16, tf32
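The optimization settings above describe a linear warmup into a cosine decay that bottoms out at the final learning rate. A sketch reconstructed from the listed hyperparameters (not nanoT5's exact scheduler code):

```python
import math

BASE_LR = 0.008    # base learning rate after warmup
FINAL_LR = 1e-5    # final cosine learning rate
WARMUP = 10_000    # warmup steps
TOTAL = 80_000     # total training steps

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)  # 0 → 1 over decay phase
    return FINAL_LR + 0.5 * (BASE_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))

# lr_at(0) → 0.0, lr_at(10_000) → 0.008, lr_at(80_000) → 1e-5
```

With batch size 120 and 24 gradient accumulation steps, note that the effective optimizer step aggregates 24 micro-batches before each update.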

plots

training loss

[plot: training loss]

grad and weights L2 norms

[plot: gradient L2 norm]

[plot: weights L2 norm]


model size: 301M params (F32, safetensors)
