---
title: README
emoji: 📈
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---
# 📈 Pico: Tiny Language Models for Learning Dynamics Research
Pico consists of two key components:
1. **Pre-trained Model Suite** (hosted here on HuggingFace)
2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.
## 🤗 HuggingFace Resources (You Are Here)
> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
### Pre-trained Model Suite (Releasing January 2025)
Our complete suite of models from 1M to 1B parameters:
- **pico-tiny** (1M parameters)
- **pico-small** (10M parameters)
- **pico-medium** (100M parameters)
- **pico-large** (500M parameters)
- **pico-xl** (1B parameters)
Each model includes:
- Complete training checkpoints
- Saved activations and gradients
- Pre-computed evaluation perplexity scores
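
The source does not say how the released perplexity scores are computed. The sketch below only illustrates the standard definition, the exponential of the mean per-token negative log-likelihood, and is not Pico's evaluation code. The function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has a perplexity of about 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Lower values mean the model assigns higher probability to the evaluation text, so tracking this number across checkpoints gives one simple view of learning progress.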
### Available Datasets
1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
   - 420B tokens of pre-processed text
   - Cleaned and shuffled DOLMA corpus
2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
   - Smaller version for quick experiments
3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
   - Batch of eval data for generating model activations
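
Because the full corpus is large, streaming is one way to look at a few rows without downloading everything. The sketch below uses `load_dataset` from the `datasets` library with `streaming=True`. The helper names `stream_examples` and `peek` are hypothetical, and the existence of a `train` split is an assumption, since this README lists neither the splits nor the column names.

```python
from itertools import islice


def stream_examples(repo_id, split="train"):
    """Return a streaming iterator over a Hub dataset without downloading it all."""
    # Imported lazily so `peek` below can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the dataset exposes the requested split (default "train").
    return load_dataset(repo_id, split=split, streaming=True)


def peek(examples, n=2):
    """Materialize the first n items of any iterable of examples."""
    return list(islice(examples, n))
```

With network access, `peek(stream_examples("pico-lm/pretokenized-dolma-tiny"))` should return the first two rows if the split assumption holds. Printing the keys of a row is a quick way to discover the actual column names.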
## 🔧 GitHub Training Framework
Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
- Train models with custom architectures
- Experiment with different training regimes
- Modify checkpoint saving behavior
- Implement custom evaluation metrics
The training framework makes it easy to:
1. Train multiple models of different sizes
2. Ensure consistent training across all models
3. Save rich checkpoint data for learning dynamics analysis
4. Compare learning dynamics across scales
## 🛠️ Using the Resources
### Using Pre-trained Models (HuggingFace)
```python
from transformers import AutoModelForCausalLM

# Load a pre-trained model (default revision)
model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

# Load a specific training checkpoint ("step-xyz" is a placeholder revision name)
checkpoint_model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-small",
    revision="step-xyz",
)
```
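
To find out which revisions a model repository actually offers, the Hub can be asked for its branches. The sketch below uses `list_repo_refs` from `huggingface_hub`. It assumes that checkpoints are stored as branches named `step-<integer>`, which is only a guess based on the `step-xyz` placeholder above. The helper names are hypothetical.

```python
import re


def sort_step_revisions(branch_names):
    """Keep names of the form step-<integer> and sort them by step number."""
    # Assumption: checkpoint branches follow the pattern step-<integer>.
    pattern = re.compile(r"^step-(\d+)$")
    steps = []
    for name in branch_names:
        match = pattern.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def list_checkpoint_revisions(repo_id):
    """List step-* branches of a Hub model repo (requires network access)."""
    # Imported lazily so sort_step_revisions works without huggingface_hub.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return sort_step_revisions(branch.name for branch in refs.branches)
```

Any name returned by `list_checkpoint_revisions("pico-lm/pico-small")` could then be passed as the `revision` argument of `from_pretrained`.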
### Training Your Own Suite (GitHub)
```bash
# Clone the repository
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
source setup.sh
# Configure your model suite
# Edit configs/train.yaml to specify model sizes and training parameters
# Train your suite
python train.py --config configs/train.yaml
```
## 📊 Model Details
### Architecture
All models (both pre-trained and self-trained) use:
- LLAMA-style transformer
- RMSNorm for normalization
- RoPE positional embeddings
- Multi-head attention with KV-cache
- SwiGLU activation function
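
For readers unfamiliar with two of these components, here is a minimal pure-Python sketch of RMSNorm and of the elementwise gating inside SwiGLU. It illustrates the standard definitions on plain lists and is not taken from the Pico code base. In a full SwiGLU feed-forward layer, `gate` and `up` would be two linear projections of the same input, followed by a third projection back to the model dimension.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v). Not hardened against overflow."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Elementwise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


x = [3.0, 4.0]
normed = rms_norm(x, gain=[1.0, 1.0])
# With unit gain, the mean square of the normalized output is close to 1.
print(sum(v * v for v in normed) / len(normed))
print(swiglu(gate=[0.0, 2.0], up=[5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which is why only the root mean square appears in the code.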
### Training Configuration
Standard configuration (customizable in the GitHub training framework):
- Batch size: 1024
- Learning rate: 1e-3
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision training
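
The values above can be collected in a small config object. The sketch below does that and shows what gradient clipping does, under the assumption that 1.0 is a maximum global L2 norm. That is a common convention, but the source does not state it. The field names and the `clip_by_global_norm` helper are illustrative and do not mirror the keys in `configs/train.yaml`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values from the list above; field names are illustrative."""
    batch_size: int = 1024
    learning_rate: float = 1e-3
    weight_decay: float = 0.1
    max_grad_norm: float = 1.0  # assumption: clipping is by global L2 norm


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


config = TrainingConfig()
clipped = clip_by_global_norm([3.0, 4.0], config.max_grad_norm)
print(clipped)  # the input has norm 5.0 and is rescaled to norm 1.0
```

Clipping preserves the direction of the gradient and only shrinks its length, so gradients already within the limit pass through unchanged.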
## 🔬 Research Applications
Pico is aimed at researchers studying:
- Learning dynamics across model scales
- Mechanistic interpretability
- Architecture and training effects
- Emergent model behaviors
Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
## 🤝 Contributing
Contributions are welcome on both platforms:
- **HuggingFace**: Model weights, datasets, and evaluation results
- **GitHub**: Training framework improvements, analysis tools, and documentation
## 📫 Contact
- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)
## 🔍 Citation
```bibtex
@software{pico2024,
  author = {Martinez, Richard Diehl},
  title = {Pico: Framework for Training Tiny Language Models},
  year = {2024},
}
```