---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---

# Pico: A Lightweight Framework for Studying Learning Dynamics

Pico is a lightweight research framework that demystifies how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes.

Visit our [website](https://www.picolm.io/) for more information.

Pico consists of two key components:

1. **Pre-trained Model Suite** (hosted here on HuggingFace)
2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

## 🤗 HuggingFace Resources (You Are Here)

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

### Pre-trained Model Suite (Releasing January 2025)

Our complete suite of models from 1M to 1B parameters:

- **pico-tiny** (1M parameters)
- **pico-small** (10M parameters)
- **pico-medium** (100M parameters)
- **pico-large** (500M parameters)
- **pico-xl** (1B parameters)

Each model includes:

- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores

### Available Datasets

1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed text
   - Cleaned and shuffled DOLMA corpus
2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - Smaller version for quick experiments
3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
   - Batch of eval data for generating model activations
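The pre-tokenized datasets can be streamed with the 🤗 `datasets` library, so you can start experimenting without downloading the full 420B-token corpus. A minimal sketch, assuming a standard `train` split and an `input_ids`-style column (check the dataset card for the exact split and schema):

```python
from datasets import load_dataset

# Stream the pre-tokenized Dolma corpus rather than downloading all 420B tokens.
# NOTE: the split name and column names below are assumptions; see the dataset
# card on the Hub for the actual schema.
dataset = load_dataset(
    "pico-lm/pretokenized-dolma",
    split="train",
    streaming=True,
)

# Peek at the first pre-tokenized example.
first_example = next(iter(dataset))
print(first_example.keys())  # e.g. may include a column such as "input_ids"
```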
## 🔧 GitHub Training Framework

Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:

- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics

The training framework makes it easy to:

1. Train multiple models of different sizes
2. Ensure consistent training across all models
3. Save rich checkpoint data for learning dynamics analysis
4. Compare learning dynamics across scales

## 🛠️ Using the Resources

### Using Pre-trained Models (HuggingFace)

```python
from transformers import AutoModelForCausalLM

# Load our pre-trained model
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Access a specific training checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
```

### Training Your Own Suite (GitHub)

```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml
```

## 📊 Model Details

### Architecture

All models (both our pre-trained suite and models you train yourself) use:

- LLaMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function

### Training Configuration

Standard configuration (customizable in the GitHub training framework):

- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed-precision training

## 🔬 Research Applications

Perfect for researchers studying:

- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

## 🤝 Contributing

Contributions are welcome on both platforms:

- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation

## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

```bibtex
@software{pico2024,
  author = {Diehl Martinez, Richard},
  title  = {Pico: Framework for Training Tiny Language Models},
  year   = {2024},
}
```