rdiehlmartinez committed about 1 month ago

Commit 8ec16ff (1 parent: e948f36)

Updating README


README.md +113 -29

@@ -1,52 +1,136 @@
 ---
 title: README
-emoji: 🎯
 colorFrom: red
 colorTo: yellow
 sdk: static
 pinned: false
 ---
-# 🎯 Pico: Tiny Language Models for Learning Dynamics Research
-Pico is a framework for training and analyzing small language models, designed with clarity and educational purposes in mind. Built on a LLAMA-style architecture, Pico makes it easy to experiment with and understand transformer-based language models.
-## 🔑 Key Features
-- **Simple Architecture**: Clean, modular implementation of core transformer components
-- **Educational Focus**: Well-documented code with clear references to academic papers
-- **Research Ready**: Built-in tools for analyzing model learning dynamics
-- **Efficient Training**: Pre-tokenized dataset and optimized training loop
-- **Modern Stack**: Built with PyTorch Lightning, Wandb, and HuggingFace integrations
-## 🏗️ Core Components
-- **RMSNorm** for stable layer normalization
-- **Rotary Positional Embeddings (RoPE)** for position encoding
-- **Multi-head attention** with KV-cache support
-- **SwiGLU activation** function
-- **Residual connections** throughout
-## 📚 References
-Our implementation draws inspiration from and builds upon:
-- [LLAMA](https://arxiv.org/abs/2302.13971)
-- [RoPE](https://arxiv.org/abs/2104.09864)
-- [SwiGLU](https://arxiv.org/abs/2002.05202)
-## 🤝 Contributing
-We welcome contributions! Whether it's:
-- Adding new features
-- Improving documentation
-- Fixing bugs
-- Sharing experimental results
-## 📝 License
-Apache 2.0 License
 ## 📫 Contact
 - GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
-- Author: Richard Diehl Martinez

 ---
 title: README
+emoji: 📈
 colorFrom: red
 colorTo: yellow
 sdk: static
 pinned: false
 ---
+# 📈 Pico: Tiny Language Models for Learning Dynamics Research
+Pico consists of two key components:
+1. **Pre-trained Model Suite** (hosted here on HuggingFace)
+2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))
+This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.
+## 🤗 HuggingFace Resources (You Are Here)
+> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!
+### Pre-trained Model Suite (Releasing January 2025)
+Our complete suite of models from 1M to 1B parameters:
+- **pico-tiny** (1M parameters)
+- **pico-small** (10M parameters)
+- **pico-medium** (100M parameters)
+- **pico-large** (500M parameters)
+- **pico-xl** (1B parameters)
+Each model includes:
+- Complete training checkpoints
+- Saved activations and gradients
+- Pre-computed evaluation perplexity scores
+### Available Datasets
+1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
+   - 420B tokens of pre-processed text
+   - Cleaned and shuffled Dolma corpus
+2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
+   - Smaller version for quick experiments
+3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
+   - Batch of eval data for generating model activations
+## 🔧 GitHub Training Framework
+Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
+- Train models with custom architectures
+- Experiment with different training regimes
+- Modify checkpoint saving behavior
+- Implement custom evaluation metrics
+The training framework makes it easy to:
+1. Train multiple models of different sizes
+2. Ensure consistent training across all models
+3. Save rich checkpoint data for learning dynamics analysis
+4. Compare learning dynamics across scales
+## 🛠️ Using the Resources
+### Using Pre-trained Models (HuggingFace)
+```python
+from transformers import AutoModelForCausalLM
+# Load the pre-trained model (default revision)
+model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")
+# Load a specific training checkpoint by revision ("step-xyz" is a placeholder)
+checkpoint_model = AutoModelForCausalLM.from_pretrained(
+    "pico-lm/pico-small",
+    revision="step-xyz"
+)
+```
+### Training Your Own Suite (GitHub)
+```bash
+# Clone the repository
+git clone https://github.com/rdiehlmartinez/pico.git
+cd pico
+# Configure your model suite
+# Edit configs/train.yaml to specify model sizes and training parameters
+# Train your suite
+python train.py --config configs/train.yaml
+```
+## 📊 Model Details
+### Architecture
+All models (both pre-trained and self-trained) use:
+- LLaMA-style transformer
+- RMSNorm for normalization
+- RoPE positional embeddings
+- Multi-head attention with KV-cache
+- SwiGLU activation function
+### Training Configuration
+Standard configuration (customizable in GitHub training):
+- Batch size: 1024
+- Learning rate: 1e-3
+- Weight decay: 0.1
+- Gradient clipping: 1.0
+- Mixed precision training
+## 🔬 Research Applications
+Pico is designed for researchers studying:
+- Learning dynamics across model scales
+- Mechanistic interpretability
+- Architecture and training effects
+- Emergent model behaviors
+Whether using our pre-trained models or training your own suite, Pico provides the tools needed for in-depth learning dynamics research.
+## 🤝 Contributing
+Contributions are welcome on both platforms:
+- **HuggingFace**: Model weights, datasets, and evaluation results
+- **GitHub**: Training framework improvements, analysis tools, and documentation
 ## 📫 Contact
+- HuggingFace: [pico-lm](https://huggingface.co/pico-lm)
 - GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
+- Author: Richard Diehl Martinez
+## 🔍 Citation
+```bibtex
+@software{pico2024,
+    author = {Diehl Martinez, Richard},
+    title = {Pico: Framework for Training Tiny Language Models},
+    year = {2024},
+}
+```
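
The model list and standard training configuration in the updated README can be captured as plain data. The following is a minimal sketch, not part of the Pico codebase: the names `TrainingConfig`, `MODEL_SUITE`, `scale_factor`, and `suite_summary` are hypothetical, and the parameter counts are the nominal sizes the README states rather than measured values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingConfig:
    """Standard settings as listed in the README (hypothetical container)."""
    batch_size: int = 1024
    learning_rate: float = 1e-3
    weight_decay: float = 0.1
    gradient_clipping: float = 1.0
    mixed_precision: bool = True


# Nominal parameter counts taken from the README's model list.
MODEL_SUITE = {
    "pico-tiny": 1_000_000,
    "pico-small": 10_000_000,
    "pico-medium": 100_000_000,
    "pico-large": 500_000_000,
    "pico-xl": 1_000_000_000,
}


def scale_factor(smaller: str, larger: str) -> float:
    """Return how many times more parameters `larger` has than `smaller`."""
    return MODEL_SUITE[larger] / MODEL_SUITE[smaller]


def suite_summary(config: TrainingConfig = TrainingConfig()) -> List[str]:
    """Build one line per model, all sharing the same training settings."""
    return [
        f"{name}: {params:,} parameters, "
        f"batch_size={config.batch_size}, lr={config.learning_rate}"
        for name, params in MODEL_SUITE.items()
    ]


if __name__ == "__main__":
    for line in suite_summary():
        print(line)
```

Sharing a single `TrainingConfig` across every entry mirrors the README's stated goal of consistent training across all models, so that differences between models can be attributed to scale rather than to hyperparameters.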