rdiehlmartinez committed
Commit 8ec16ff • 1 Parent(s): e948f36

Updating README

Files changed (1)
  1. README.md +113 -29
README.md CHANGED
@@ -1,52 +1,136 @@
  ---
  title: README
- emoji: 🎯
  colorFrom: red
  colorTo: yellow
  sdk: static
  pinned: false
  ---

- # 🎯 Pico: Tiny Language Models for Learning Dynamics Research

- Pico is a framework for training and analyzing small language models, designed with clarity and educational purposes in mind. Built on a LLAMA-style architecture, Pico makes it easy to experiment with and understand transformer-based language models.

- ## 🔑 Key Features

- - **Simple Architecture**: Clean, modular implementation of core transformer components
- - **Educational Focus**: Well-documented code with clear references to academic papers
- - **Research Ready**: Built-in tools for analyzing model learning dynamics
- - **Efficient Training**: Pre-tokenized dataset and optimized training loop
- - **Modern Stack**: Built with PyTorch Lightning, Wandb, and HuggingFace integrations

- ## 🏗️ Core Components

- - **RMSNorm** for stable layer normalization
- - **Rotary Positional Embeddings (RoPE)** for position encoding
- - **Multi-head attention** with KV-cache support
- - **SwiGLU activation** function
- - **Residual connections** throughout

- ## 📚 References

- Our implementation draws inspiration from and builds upon:
- - [LLAMA](https://arxiv.org/abs/2302.13971)
- - [RoPE](https://arxiv.org/abs/2104.09864)
- - [SwiGLU](https://arxiv.org/abs/2002.05202)

- ## 🤝 Contributing

- We welcome contributions! Whether it's:
- - Adding new features
- - Improving documentation
- - Fixing bugs
- - Sharing experimental results

- ## 📝 License

- Apache 2.0 License

  ## 📫 Contact

  - GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- - Author: Richard Diehl Martinez

  ---
  title: README
+ emoji: 📈
  colorFrom: red
  colorTo: yellow
  sdk: static
  pinned: false
  ---

+ # 📈 Pico: Tiny Language Models for Learning Dynamics Research

+ Pico consists of two key components:
+ 1. **Pre-trained Model Suite** (hosted here on HuggingFace)
+ 2. **Training Framework** (available on [GitHub](https://github.com/rdiehlmartinez/pico))

+ This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the infrastructure to train your own model suites from scratch.

+ ## 🤗 HuggingFace Resources (You Are Here)

+ > 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released here in January 2025. Watch this space or star our [GitHub repository](https://github.com/rdiehlmartinez/pico) for updates!

+ ### Pre-trained Model Suite (Releasing January 2025)
+ Our complete suite of models from 1M to 1B parameters:
+ - **pico-tiny** (1M parameters)
+ - **pico-small** (10M parameters)
+ - **pico-medium** (100M parameters)
+ - **pico-large** (500M parameters)
+ - **pico-xl** (1B parameters)

+ Each model includes:
+ - Complete training checkpoints
+ - Saved activations and gradients
+ - Pre-computed evaluation perplexity scores
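
+ Checkpoints are exposed as revisions of each model repository (see the usage example further down). As a rough sketch, assuming the standard `huggingface_hub` client and that the checkpoint revisions have been published, you can enumerate them like this:

+ ```python
+ from huggingface_hub import list_repo_refs

+ # List every branch published for a model repository; checkpoint
+ # revisions (e.g. "step-xyz") appear here once the suite is released.
+ refs = list_repo_refs("pico-lm/pico-small")
+ for branch in refs.branches:
+     print(branch.name)
+ ```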

+ ### Available Datasets
+ 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
+    - 420B tokens of pre-processed text
+    - Cleaned and shuffled DOLMA corpus
+ 2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
+    - Smaller version for quick experiments
+ 3. **[pretokenized-eval-batch](https://huggingface.co/datasets/pico-lm/pretokenized-eval-batch)**
+    - Batch of eval data for generating model activations
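
+ The full corpus is large, so as a minimal sketch (assuming these are standard Hugging Face datasets with the usual `datasets` API; the split and column names may differ), you can stream it rather than downloading everything up front:

+ ```python
+ from datasets import load_dataset

+ # Stream the pre-tokenized corpus instead of materializing 420B tokens locally.
+ dataset = load_dataset(
+     "pico-lm/pretokenized-dolma",
+     split="train",
+     streaming=True,
+ )

+ # Peek at the first example to see which columns are available.
+ first = next(iter(dataset))
+ print(first.keys())
+ ```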

+ ## 🔧 GitHub Training Framework

+ Want to train your own suite of models? Visit our [GitHub repository](https://github.com/rdiehlmartinez/pico) to:
+ - Train models with custom architectures
+ - Experiment with different training regimes
+ - Modify checkpoint saving behavior
+ - Implement custom evaluation metrics

+ The training framework makes it easy to:
+ 1. Train multiple models of different sizes
+ 2. Ensure consistent training across all models
+ 3. Save rich checkpoint data for learning dynamics analysis
+ 4. Compare learning dynamics across scales

+ ## 🛠️ Using the Resources

+ ### Using Pre-trained Models (HuggingFace)
+ ```python
+ from transformers import AutoModelForCausalLM

+ # Load our pre-trained model
+ model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")

+ # Access specific checkpoint
+ model = AutoModelForCausalLM.from_pretrained(
+     "pico-lm/pico-small",
+     revision="step-xyz"
+ )
+ ```
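
+ To reproduce a perplexity number yourself, a rough sketch using the shared eval batch might look like the following (it assumes the dataset exposes an `input_ids` column and a `train` split; adjust to the actual schema):

+ ```python
+ import math

+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM

+ model = AutoModelForCausalLM.from_pretrained("pico-lm/pico-small")
+ model.eval()

+ # Assumed split and column names; check the dataset card for the real schema.
+ eval_batch = load_dataset("pico-lm/pretokenized-eval-batch", split="train")

+ losses = []
+ with torch.no_grad():
+     for example in eval_batch:
+         input_ids = torch.tensor(example["input_ids"]).unsqueeze(0)
+         # Passing labels makes the causal LM return the shifted cross-entropy loss.
+         output = model(input_ids=input_ids, labels=input_ids)
+         losses.append(output.loss.item())

+ print("perplexity:", math.exp(sum(losses) / len(losses)))
+ ```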

+ ### Training Your Own Suite (GitHub)
+ ```bash
+ # Clone the repository
+ git clone https://github.com/rdiehlmartinez/pico.git

+ # Configure your model suite
+ # Edit configs/train.yaml to specify model sizes and training parameters

+ # Train your suite
+ python train.py --config configs/train.yaml
+ ```

+ ## 📊 Model Details

+ ### Architecture
+ All models (both the pre-trained suite and models you train yourself) use:
+ - LLAMA-style transformer
+ - RMSNorm for normalization
+ - RoPE positional embeddings
+ - Multi-head attention with KV-cache
+ - SwiGLU activation function

+ ### Training Configuration
+ Standard configuration (customizable in the GitHub training framework):
+ - Batch size: 1024
+ - Learning rate: 1e-3
+ - Weight decay: 0.1
+ - Gradient clipping: 1.0
+ - Mixed precision training

+ ## 🔬 Research Applications

+ Perfect for researchers studying:
+ - Learning dynamics across model scales
+ - Mechanistic interpretability
+ - Architecture and training effects
+ - Emergent model behaviors

+ Whether you use our pre-trained models or train your own suite, Pico provides the tools needed for in-depth learning dynamics research.

+ ## 🤝 Contributing

+ Contributions are welcome on both platforms:
+ - **HuggingFace**: Model weights, datasets, and evaluation results
+ - **GitHub**: Training framework improvements, analysis tools, and documentation

  ## 📫 Contact

+ - HuggingFace: [pico-lm](https://huggingface.co/pico-lm)
  - GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
+ - Author: Richard Diehl Martinez

+ ## 🔍 Citation

+ ```bibtex
+ @software{pico2024,
+   author = {Martinez, Richard Diehl},
+   title = {Pico: Framework for Training Tiny Language Models},
+   year = {2024},
+ }
+ ```