Organization Card

Pico: A Lightweight Framework for Studying Learning Dynamics

Pico is a lightweight research framework for studying how language models learn. Built with simplicity in mind, it provides an efficient way to train models of different sizes and analyze their learning dynamics. Visit our website for more information.

Pico consists of two key components:

Pre-trained Model Suite (hosted here on HuggingFace)
Training Framework (available on GitHub)

This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

🤗 HuggingFace Resources (You Are Here)

🚧 Coming Soon! Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our GitHub repository for updates!

Pre-trained Model Suite (Releasing January 2025)

The suite spans five model sizes:

pico-tiny (1M parameters)
pico-small (10M parameters)
pico-medium (100M parameters)
pico-large (500M parameters)
pico-xl (1B parameters)

Each model includes:

Complete training checkpoints
Saved activations and gradients
Pre-computed evaluation perplexity scores
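
The card does not say how the pre-computed perplexity scores are calculated. As a point of reference, here is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the sample values are illustrative only and are not taken from the Pico code base.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token (cross-entropy in nats).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input: a model that assigns probability 0.25 to every target token.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))  # approximately 4
```

A model that spreads its probability evenly over four choices has a perplexity of about four, which is why perplexity is often read as an effective branching factor.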

Available Datasets

pretokenized-dolma
- 420B tokens of pre-processed text
- Cleaned and shuffled Dolma corpus
pretokenized-dolma-tiny
- Smaller version for quick experiments
pretokenized-eval-batch
- Batch of eval data for generating model activations
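
Because the full corpus holds 420B tokens, streaming is likely more practical than a full download for a first look. The sketch below assumes the dataset is published as `pico-lm/pretokenized-dolma`, exposes a `train` split, and supports the streaming mode of the `datasets` library. The card confirms none of this, and `stream_first_examples` is a hypothetical helper.

```python
from itertools import islice

# Assumed repository id: the organization name plus the dataset name listed above.
DATASET_REPO = "pico-lm/pretokenized-dolma"


def stream_first_examples(repo_id=DATASET_REPO, n=2, split="train"):
    """Return the first n examples without downloading the full dataset."""
    # Imported lazily so this module loads even when `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches records on demand.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `stream_first_examples(n=2)` needs network access and the `datasets` package. Inspect the keys of the returned dictionaries to see which fields the pre-tokenized data provides.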

🔧 GitHub Training Framework

Want to train your own suite of models? Visit our GitHub repository to:

Train models with custom architectures
Experiment with different training regimes
Modify checkpoint saving behavior
Implement custom evaluation metrics

The training framework makes it easy to:

Train multiple models of different sizes
Ensure consistent training across all models
Save rich checkpoint data for learning dynamics analysis
Compare learning dynamics across scales

🛠️ Using the Resources

Using Pre-trained Models (HuggingFace)

from transformers import AutoModelForCausalLM

# Load the pre-trained model at its default revision
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint ("step-xyz" is a placeholder revision name)
model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz"
)
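
To find out which checkpoint revisions exist, one option is to list the branches of the model repository. The sketch below assumes checkpoints are stored as branches named `step-<integer>`, extrapolating from the `step-xyz` placeholder above; the actual naming scheme may differ. `sort_step_revisions` and `list_checkpoint_revisions` are hypothetical helpers, while `list_repo_refs` is part of `huggingface_hub`.

```python
import re

# Assumed naming scheme for checkpoint branches, e.g. "step-1000".
STEP_PATTERN = re.compile(r"^step-(\d+)$")


def sort_step_revisions(branch_names):
    """Return names matching 'step-<int>', ordered by step number."""
    steps = []
    for name in branch_names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def list_checkpoint_revisions(repo_id):
    """Fetch branch names from the Hub and keep those that look like checkpoints."""
    # Imported lazily; this call requires network access.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return sort_step_revisions(branch.name for branch in refs.branches)


print(sort_step_revisions(["main", "step-1000", "step-20", "dev"]))
```

Sorting by the parsed integer rather than by string keeps `step-20` ahead of `step-1000`. Any name returned by `list_checkpoint_revisions("pico-lm/pico-small")` could then be passed as `revision=` to `from_pretrained`.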

Training Your Own Suite (GitHub)

# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh

# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters

# Train your suite
python train.py --config configs/train.yaml

📊 Model Details

Architecture

All models (both pre-trained and self-trained) use:

LLaMA-style transformer
RMSNorm for normalization
RoPE positional embeddings
Multi-head attention with KV-cache
SwiGLU activation function
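
For readers unfamiliar with two of these components, here is a minimal pure-Python sketch of RMSNorm and of the gating step inside a SwiGLU feed-forward layer. It follows the standard textbook definitions, not the Pico implementation: the function names and the `eps` default are assumptions, and the linear projections that surround the SwiGLU gate are omitted.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def swiglu_gate(gate, up):
    """Elementwise SiLU(gate) * up; SiLU(g) = g * sigmoid(g)."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normalized)
print(swiglu_gate([0.0, 1.0], [5.0, 2.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one reduction over the hidden dimension.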

Training Configuration

Standard configuration (customizable in GitHub training):

Batch size: 1024
Learning rate: 1e-3
Weight decay: 0.1
Gradient clipping: 1.0
Mixed precision training
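
The card does not say which kind of gradient clipping the value 1.0 refers to. Assuming it is the common choice of clipping by global L2 norm, the operation can be sketched as follows; `clip_by_global_norm` is a hypothetical helper working on a flat list of gradient values, not the framework's own code.

```python
import math

# Taken from "Gradient clipping: 1.0"; interpreting it as a max L2 norm is an assumption.
MAX_GRAD_NORM = 1.0


def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale gradients so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(norm_before, clipped)
```

Rescaling every component by the same factor preserves the direction of the update while bounding its size. Logging the pre-clip norm alongside checkpoints is one simple signal for studying training stability.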

🔬 Research Applications

Pico is aimed at researchers studying:

Learning dynamics across model scales
Mechanistic interpretability
Architecture and training effects
Emergent model behaviors

Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.

🤝 Contributing

Contributions are welcome on both platforms:

HuggingFace: Model weights, datasets, and evaluation results
GitHub: Training framework improvements, analysis tools, and documentation

📫 Contact

GitHub: rdiehlmartinez/pico
Author: Richard Diehl Martinez

🔍 Citation

@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
