---
title: README
emoji: π
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# Pico: Tiny Language Models for Learning Dynamics Research
Pico consists of two key components:
- Pre-trained Model Suite (hosted here on HuggingFace)
- Training Framework (available on GitHub)
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.
## HuggingFace Resources (You Are Here)

> **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our GitHub repository for updates!

### Pre-trained Model Suite (Releasing January 2025)
Our complete suite of models from 1M to 1B parameters:
- pico-tiny (1M parameters)
- pico-small (10M parameters)
- pico-medium (100M parameters)
- pico-large (500M parameters)
- pico-xl (1B parameters)
Each model includes:
- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores
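Once the suite is live, the checkpoint revisions listed above should be discoverable directly from the Hub. Here is a minimal sketch, assuming checkpoints are published as `step-*` branches (matching the `revision="step-xyz"` pattern in the usage example further down); the exact naming scheme may differ once the models are released:

```python
from huggingface_hub import list_repo_refs

# List checkpoint revisions for one model in the suite.
# Assumption: checkpoints are published as branches named "step-<N>".
refs = list_repo_refs("pico-lm/pico-small")
checkpoint_branches = sorted(b.name for b in refs.branches if b.name.startswith("step-"))
print(checkpoint_branches)
```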
### Available Datasets

- 420B tokens of pre-processed text: a cleaned and shuffled version of the DOLMA corpus
- A smaller version of the corpus for quick experiments
- A batch of eval data for generating model activations
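As a rough example of how these could be loaded once published (the repository id below is a placeholder used for illustration, not a confirmed dataset name):

```python
from itertools import islice

from datasets import load_dataset

# Stream the pre-training corpus rather than downloading all 420B tokens up front.
# NOTE: the repo id below is a placeholder; substitute the released dataset name.
dataset = load_dataset("pico-lm/pretraining-corpus", split="train", streaming=True)

for example in islice(dataset, 2):
    print(example)
```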
## GitHub Training Framework
Want to train your own suite of models? Visit our GitHub repository to:
- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics
The training framework makes it easy to:
- Train multiple models of different sizes
- Ensure consistent training across all models
- Save rich checkpoint data for learning dynamics analysis
- Compare learning dynamics across scales
## Using the Resources

### Using Pre-trained Models (HuggingFace)
```python
from transformers import AutoModelForCausalLM

# Load our pre-trained model
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Access a specific training checkpoint by revision
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz",
)
```
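For a quick sanity check you can also run generation. This is a minimal sketch that assumes a tokenizer is published in the same repository as the model weights; adjust the repo id if it is hosted separately:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the tokenizer lives alongside the weights in the same repo.
tokenizer = AutoTokenizer.from_pretrained("pico-lm/pico-small")
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

inputs = tokenizer("Tiny models can still tell us a lot about", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```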
### Training Your Own Suite (GitHub)
```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```
## Model Details

### Architecture
All models (both pre-trained and self-trained) use:
- LLaMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
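For intuition, here are generic reference implementations of two of these components in PyTorch. This is a sketch for illustration only, not the exact modules used in the Pico codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each feature vector by the inverse of its RMS, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated linear unit followed by a down-projection."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate with SiLU (swish), multiply element-wise, project back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```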
### Training Configuration
Standard configuration (customizable in GitHub training):
- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
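As a rough illustration of how these hyperparameters map onto a standard PyTorch update step (the framework's actual training loop lives in the GitHub repository and may differ in details):

```python
import torch


def training_step(model, batch, optimizer):
    """One update using the standard configuration above (illustrative sketch only)."""
    optimizer.zero_grad()
    # Mixed precision forward pass
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    # Gradient clipping at a max norm of 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()


# Learning rate 1e-3 and weight decay 0.1, as listed above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
```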
## Research Applications
Perfect for researchers studying:
- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors
Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
## Contributing
Contributions welcome on both platforms:
- HuggingFace: Model weights, datasets, and evaluation results
- GitHub: Training framework improvements, analysis tools, and documentation
## Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: Richard Diehl Martinez
## Citation

```bibtex
@software{pico2024,
  author = {Martinez, Richard Diehl},
  title = {Pico: Framework for Training Tiny Language Models},
  year = {2024},
}
```