---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# Pico: Tiny Language Models for Learning Dynamics Research

Pico consists of two key components:

  1. Pre-trained Model Suite (hosted here on HuggingFace)
  2. Training Framework (available on GitHub)

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

## 🤗 HuggingFace Resources (You Are Here)

🚧 Coming Soon! Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our GitHub repository for updates!

### Pre-trained Model Suite (Releasing January 2025)

Our complete suite of models from 1M to 1B parameters:

- `pico-tiny` (1M parameters)
- `pico-small` (10M parameters)
- `pico-medium` (100M parameters)
- `pico-large` (500M parameters)
- `pico-xl` (1B parameters)

Each model includes:

- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores
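
Checkpoints are published as revisions of each model repository, so you can enumerate them directly from the Hub. Below is a minimal sketch using `huggingface_hub`; it assumes checkpoints are exposed as branches or tags of `pico-lm/pico-small`, and the exact revision naming scheme may differ once the suite is released.

```python
from huggingface_hub import list_repo_refs

# List the revisions (branches and tags) of a model repository on the Hub.
# Assumption: checkpoints are published as branches/tags of pico-lm/pico-small;
# the actual naming scheme may differ once the suite is released.
refs = list_repo_refs("pico-lm/pico-small")

print("Branches:", [branch.name for branch in refs.branches])
print("Tags:", [tag.name for tag in refs.tags])
```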

### Available Datasets

1. `pretokenized-dolma`
   - 420B tokens of pre-processed text
   - Cleaned and shuffled DOLMA corpus
2. `pretokenized-dolma-tiny`
   - Smaller version for quick experiments
3. `pretokenized-eval-batch`
   - Batch of eval data for generating model activations
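
As a minimal sketch, the pre-tokenized corpora can be streamed with the `datasets` library so you don't have to download all 420B tokens up front. The split name and column layout below are assumptions; check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Stream the pre-tokenized DOLMA corpus instead of downloading it in full.
# Assumptions: a "train" split exists and examples carry pre-tokenized ids;
# consult the dataset card for the real schema.
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

for example in dataset.take(1):
    print(example.keys())
```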

## 🔧 GitHub Training Framework

Want to train your own suite of models? Visit our GitHub repository to:

- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics

The training framework makes it easy to:

  1. Train multiple models of different sizes
  2. Ensure consistent training across all models
  3. Save rich checkpoint data for learning dynamics analysis
  4. Compare learning dynamics across scales

## 🛠️ Using the Resources

### Using Pre-trained Models (HuggingFace)

```python
from transformers import AutoModelForCausalLM

# Load our pre-trained model
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Access a specific checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
```
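
As a quick sanity check, you can also run a short generation pass. This sketch assumes a tokenizer is published alongside each model; the prompt and generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: each model repository also ships a tokenizer.
tokenizer = AutoTokenizer.from_pretrained("pico-lm/pico-small")
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```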

### Training Your Own Suite (GitHub)

```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```

## 📊 Model Details

### Architecture

All models (both our pre-trained suite and any models you train yourself) use:

- LLaMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
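
For orientation, here is a minimal PyTorch sketch of two of these components (RMSNorm and SwiGLU). It is illustrative only, not the Pico implementation; see the GitHub repository for the actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by the RMS of the hidden vector (no mean-centering)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit followed by a down projection."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```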

### Training Configuration

Standard configuration (customizable in GitHub training):

- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
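
As a sketch of how these hyperparameters map onto a standard PyTorch training step: the model, data, AdamW optimizer, and bfloat16 autocast below are placeholders and assumptions, not the framework's actual training loop; the authoritative values live in `configs/train.yaml`.

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic data so the sketch runs end to end;
# in practice these come from the training framework and pretokenized-dolma.
model = nn.Linear(128, 128)
dataloader = [(torch.randn(1024, 128), torch.randn(1024, 128)) for _ in range(2)]

# Learning rate 1e-3 and weight decay 0.1, as in the configuration above.
# AdamW is an assumption; the optimizer is set in configs/train.yaml.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.MSELoss()

for inputs, targets in dataloader:  # batch size 1024, matching the configuration above
    optimizer.zero_grad()
    # Mixed-precision forward pass; bfloat16 on CPU is an assumption for this sketch.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        outputs = model(inputs)
    loss = loss_fn(outputs.float(), targets)
    loss.backward()
    # Gradient clipping at 1.0, as in the configuration above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```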

## 🔬 Research Applications

Pico is built for researchers studying:

- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

🀝 Contributing

Contributions are welcome on both platforms:

- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation

## 📫 Contact

## 🔍 Citation

```bibtex
@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
```