title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false

📈 Pico: Tiny Language Models for Learning Dynamics Research

Pico consists of two key components:

  1. Pre-trained Model Suite (hosted here on HuggingFace)
  2. Training Framework (available on GitHub)

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

🤗 HuggingFace Resources (You Are Here)

🚧 Coming Soon! Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our GitHub repository for updates!

Pre-trained Model Suite (Releasing January 2025)

Our complete suite of models from 1M to 1B parameters:

  • pico-tiny (1M parameters)
  • pico-small (10M parameters)
  • pico-medium (100M parameters)
  • pico-large (500M parameters)
  • pico-xl (1B parameters)

Each model includes:

  • Complete training checkpoints
  • Saved activations and gradients
  • Pre-computed evaluation perplexity scores
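
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition only; it is not Pico's evaluation code, and the function name and its input (a list of natural-log token probabilities) are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has a perplexity of 4
# (up to floating-point rounding).
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower perplexity means the model assigns higher probability to the evaluation text, so tracking it across checkpoints gives a simple view of training progress.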

Available Datasets

  1. pretokenized-dolma
    • 420B tokens of pre-processed text
    • Cleaned and shuffled DOLMA corpus
  2. pretokenized-dolma-tiny
    • Smaller version for quick experiments
  3. pretokenized-eval-batch
    • Batch of eval data for generating model activations
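
A corpus of this size is usually inspected in streaming mode rather than downloaded in full. The following is a minimal sketch, assuming the datasets live under the `pico-lm` organization with a `train` split; the repository ID, the split name, and both helper functions are assumptions, not confirmed details of the release.

```python
from itertools import islice

def peek(examples, n=3):
    """Return the first n items of an iterable, pulling no more than n from it."""
    return list(islice(examples, n))

def stream_pretokenized(repo_id="pico-lm/pretokenized-dolma", split="train"):
    """Open a Hub dataset in streaming mode (needs `datasets` and network access).

    The default repo_id and split are assumptions for illustration.
    """
    from datasets import load_dataset  # imported lazily so peek() works without it
    return load_dataset(repo_id, split=split, streaming=True)

# Usage (requires network access):
#   for example in peek(stream_pretokenized(), n=2):
#       print(sorted(example.keys()))
```

Printing the keys of a couple of examples is a quick way to discover the actual column names before writing a data pipeline against them.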

🔧 GitHub Training Framework

Want to train your own suite of models? Visit our GitHub repository to:

  • Train models with custom architectures
  • Experiment with different training regimes
  • Modify checkpoint saving behavior
  • Implement custom evaluation metrics

The training framework makes it easy to:

  1. Train multiple models of different sizes
  2. Ensure consistent training across all models
  3. Save rich checkpoint data for learning dynamics analysis
  4. Compare learning dynamics across scales

πŸ› οΈ Using the Resources

Using Pre-trained Models (HuggingFace)

from transformers import AutoModelForCausalLM

# Load our pre-trained model
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint ("step-xyz" is a placeholder for a real revision name)
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
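
To find out which checkpoint revisions exist, one option is to list the branches of the model repository. This is a sketch under the assumption that checkpoints are stored as branches named `step-<integer>`; the naming pattern and both helper functions are hypothetical, while `list_repo_refs` is part of `huggingface_hub`.

```python
import re

STEP_PATTERN = re.compile(r"^step-(\d+)$")  # assumed naming scheme, e.g. "step-1000"

def sort_step_revisions(branch_names):
    """Keep names matching step-<int> and sort them by step number."""
    steps = []
    for name in branch_names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]

def list_checkpoint_revisions(repo_id="pico-lm/pico-small"):
    """List checkpoint branches of a Hub repo (needs `huggingface_hub` and network)."""
    from huggingface_hub import list_repo_refs  # imported lazily
    refs = list_repo_refs(repo_id)
    return sort_step_revisions(branch.name for branch in refs.branches)

print(sort_step_revisions(["main", "step-2000", "step-100", "step-1000"]))
# ['step-100', 'step-1000', 'step-2000']
```

Sorting numerically rather than alphabetically matters here: a plain string sort would place `step-1000` before `step-200`.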

Training Your Own Suite (GitHub)

# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml

📊 Model Details

Architecture

All models (both pre-trained and self-trained) use:

  • LLaMA-style transformer
  • RMSNorm for normalization
  • RoPE positional embeddings
  • Multi-head attention with KV-cache
  • SwiGLU activation function
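
Two of the components listed above can be illustrated in a few lines of plain Python. This is a generic sketch of RMSNorm and SwiGLU gating on lists of floats, not Pico's implementation; the function names and the `eps` default are assumptions. In a full feed-forward block, `gate` and `up` would be two linear projections of the same input, followed by a down projection.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned per-feature weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Elementwise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]

hidden = [3.0, -4.0]
print(rms_norm(hidden, weight=[1.0, 1.0]))
print(swiglu_gate(gate=[0.0, 2.0], up=[5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so the output keeps the sign pattern of the input and has a root mean square close to 1 when the weights are 1.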

Training Configuration

Standard configuration (customizable in GitHub training):

  • Batch size: 1024
  • Learning rate: 1e-3
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Mixed precision training
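
As an illustration of the gradient clipping setting, the sketch below rescales gradients so that their combined L2 norm does not exceed 1.0. It assumes the listed value is a global-norm threshold (the source does not say whether clipping is by norm or by value), and the function is a hypothetical pure-Python stand-in rather than the framework's code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of flat lists of floats, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads], total_norm

clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Gradients whose global norm is already below the threshold pass through unchanged, so clipping only affects unusually large updates.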

🔬 Research Applications

Pico is intended for researchers studying:

  • Learning dynamics across model scales
  • Mechanistic interpretability
  • Architecture and training effects
  • Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

🀝 Contributing

Contributions are welcome on both platforms:

  • HuggingFace: Model weights, datasets, and evaluation results
  • GitHub: Training framework improvements, analysis tools, and documentation

📫 Contact

🔍 Citation

@software{pico2024,
    author = {Martinez, Richard Diehl},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}