lvwerra (HF staff) committed
Commit aa2cfb8 · verified · 1 parent: 845fe94

picotron (#4)


- add picotron code snippets (2a9ca3dfd0ebe633c4292131fda4f8f04c0c6cd5)
- remove old files (8d5f9163eef8b39aa7d00c5e216ce34b33666770)
- Merge branch 'pr/4' into picotron-snippets (0cde7daaa5c1f757066d978a57053edb0e4f92d0)

blog-export-headrs.html DELETED
@@ -1,192 +0,0 @@
- <h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
- <h2>TL;DR</h2>
-
- <h2>First Steps: Training on one GPU</h2>
-
- <h3>Memory usage in Transformers</h3>
-
- <h4>Memory profiling a training step</h4>
-
- <h4>Weights/grads/optimizer states memory</h4>
-
- <h4>Activations memory</h4>
-
- <h3><strong>Activation recomputation</strong></h3>
-
- <h3>Gradient accumulation</h3>
-
- <h2>Data Parallelism</h2>
-
- <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
- <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
- <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
-
- <h3>Revisit global batch size</h3>
-
- <h3>Our journey up to now</h3>
-
- <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
- <h4>Memory usage revisited</h4>
-
- <h4>ZeRO-1: Partitioning Optimizer States</h4>
-
- <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
- <h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
- <h2>Tensor Parallelism</h2>
-
- <h3>Tensor Parallelism in a Transformer Block</h3>
-
- <h3>Sequence Parallelism</h3>
-
- <h2>Context Parallelism</h2>
-
- <h3>Introducing Context Parallelism</h3>
-
- <h3>Discovering Ring Attention</h3>
-
- <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
- <h2></h2>
-
- <h2>Pipeline Parallelism</h2>
-
- <h3>Splitting layers on various nodes - All forward, all backward</h3>
-
- <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
-
- <h3>Interleaving stages</h3>
-
- <h3>Zero Bubble and DualPipe</h3>
-
- <h2>Expert parallelism</h2>
-
- <h2>5D parallelism in a nutshell</h2>
-
- <h2>How to Find the Best Training Configuration</h2>
-
- <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
- <h4>A primer on GPU</h4>
-
- <h3>How to improve performance with Kernels ?</h3>
-
- <h4>Memory Coalescing</h4>
-
- <h4>Tiling</h4>
-
- <h4>Thread Coarsening</h4>
-
- <h4>Minimizing Control Divergence</h4>
-
- <h3>Flash Attention 1-3</h3>
-
- <h3>Fused Kernels</h3>
-
- <h3>Mixed Precision Training</h3>
-
- <h4>FP16 and BF16 training</h4>
-
- <h4>FP8 pretraining</h4>
-
- <h2>Conclusion</h2>
-
- <h3>What you learned</h3>
-
- <h3>What we learned</h3>
-
- <h3>What’s next?</h3>
-
- <h2>References</h2>
-
- <h3>Landmark LLM Scaling Papers</h3>
-
- <h3>Training Frameworks</h3>
-
- <h3>Debugging</h3>
-
- <h3>Distribution Techniques</h3>
-
- <h3>CUDA Kernels</h3>
-
- <h3>Hardware</h3>
-
- <h3>Others</h3>
-
- <h2>Appendix</h2>
-
- <h3>A0: Parallel Programming Crash Course</h3>
-
- <h4>Broadcast</h4>
-
- <h4>Reduce &amp; AllReduce</h4>
-
- <h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
- <h4>Gather &amp; AllGather</h4>
-
- <h4>Scatter &amp; ReduceScatter</h4>
-
- <h4>Barrier</h4>
-
- <h4>NCCL: NVIDIA Collective Communications Library</h4>
-
- <h3>A1: Profiling</h3>
-
- <h4>Kernels</h4>
-
- <h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
- <h2>include <torch/extension.h></h2>
-
- <h2>include <cuda.h></h2>
-
- <h2>include <cuda_runtime.h></h2>
-
- <h2>Load and compile the CUDA extension</h2>
-
- <h2>Define input tensors</h2>
-
- <h2>Run the CUDA kernel</h2>
-
- <h3>A2: TP Backward pass</h3>
-
- <h3>A3: ZeRO-R</h3>
-
- <h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
- <h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
- <h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
- <h4>Communication Analysis of ZeRO-R</h4>
-
- <h3>A5. Memory profile</h3>
-
- <h2>Set up optimizer</h2>
-
- <h3>TP: Practical PyTorch Implementation</h3>
-
- <h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
- <h2>core logic of Column Parallel linear</h2>
-
- <h4>Gelu code</h4>
-
- <h4>Interconnect</h4>
-
- <h3>How to profile your code</h3>
-
- <h3>Formulas for compute / comms the balanhe balance</h3>
-
- <h3>Integrating Context Parallelism with TP/SP</h3>
-
- <h3>The nanotron FP8 recipe</h3>
-
- <h2>Overlapping computation and communication</h2>
-
 
blog-export.html DELETED
The diff for this file is too large to render. See raw diff
 
blog-export.md DELETED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -474,13 +474,9 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
-  <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
-
   <aside>If you are not familiar with distributed communications patterns like broadcast, gather or all-reduce we put together a small crash course in the Appendix [TODO Link].</aside>
 
-  <p>TODO: embed naive DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60</a></p>
-
-  <p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
+  <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
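The hunk above moves the sentence introducing the all-reduce primitive below the crash-course aside. In isolation, the primitive amounts to the following minimal sketch (hypothetical illustration, not the Picotron code), which assumes `torch.distributed` has already been initialized, e.g. via `torchrun`:

```python
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module) -> None:
    """Sum every gradient across all data-parallel ranks, then average it."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # In-place sum of this gradient over all ranks.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # Divide to obtain the mean gradient that every rank will apply.
            param.grad.div_(world_size)
```

After this call every rank holds identical, averaged gradients, which is exactly the synchronization step naive data parallelism needs after each backward pass.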
 
@@ -510,7 +506,18 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
 
-  <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. </p>
+  <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Naive DP implementation with overlap in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script
+  src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
+  </script>
+  </div>
+  </details>
 
   <p>This is our first example of “<em>overlapping computation and communication</em>” which we will discuss several times in this blog post and is an essential technique to maximal scaling efficiency. Let's have a look how we can further improve the DP efficiency!</p>
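The snippet embedded in the hunk above overlaps gradient synchronization with the backward pass. A rough sketch of the same idea (hypothetical code, not the Picotron implementation; assumes PyTorch ≥ 2.1 for `register_post_accumulate_grad_hook` and an initialized process group) launches an asynchronous all-reduce from a per-parameter hook as soon as that gradient is ready:

```python
import torch
import torch.distributed as dist

class OverlappedDataParallel:
    """Launch one async all-reduce per parameter as its gradient becomes ready."""

    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.handles = []  # pending asynchronous all-reduce work handles
        for param in model.parameters():
            if param.requires_grad:
                # Fires once param.grad has been fully accumulated during backward.
                param.register_post_accumulate_grad_hook(self._launch_all_reduce)

    def _launch_all_reduce(self, param: torch.Tensor) -> None:
        # async_op=True returns immediately, so communication for this parameter
        # overlaps with the backward computation of the earlier layers.
        handle = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        self.handles.append((handle, param))

    def finish_gradient_sync(self) -> None:
        # Call after loss.backward() and before optimizer.step().
        world_size = dist.get_world_size()
        for handle, param in self.handles:
            handle.wait()
            param.grad.div_(world_size)
        self.handles.clear()
```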
 
@@ -519,6 +526,18 @@
 
   <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping—it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation for each bucket, we can significantly reduce communication overhead and speed up the communication operation.</p>
 
+  <p>Here's the code implementation with bucketing:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Bucket DP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
+  </script>
+  </div>
+  </details>
+
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
   <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
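The bucket-DP snippet above avoids issuing one collective per parameter. The core of the trick can be sketched like this (hypothetical helper, not the Picotron code): flatten a bucket of gradients into one contiguous buffer, all-reduce it once, and copy the averaged values back:

```python
import torch
import torch.distributed as dist

def all_reduce_bucket(grads: list[torch.Tensor]) -> None:
    """Average a whole bucket of gradients with a single collective call."""
    if not grads:
        return
    # Pack the bucket into one flat tensor: a few big boxes instead of many small ones.
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat.div_(dist.get_world_size())
    # Unpack the averaged values back into the original gradient tensors.
    offset = 0
    for g in grads:
        g.copy_(flat[offset : offset + g.numel()].view_as(g))
        offset += g.numel()
```

A full implementation typically also assigns parameters to fixed-size buckets and fires the collective as soon as all gradients in a bucket are ready during the backward pass, so bucketing composes with the overlap optimization above.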
@@ -749,12 +768,36 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+  <p>Here's the code implementation of column wise tensor parallelism:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Column parallel TP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
 
   <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+  <p>Here's the implementation for row-wise tensor parallelism:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Row parallel TP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
+  <p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
+
   <h3>Tensor Parallelism in a Transformer Block</h3>
 
   <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks : Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
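The two embeds added above are the column- and row-parallel linear layers. Stripped to its essence, a column-parallel linear layer can be sketched as below (hypothetical code, not the Picotron class; assumes the tensor-parallel group is the default process group, and omits the backward-pass all-reduce on the input gradient, the `f` function of the Megatron-LM paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output columns; inputs are replicated."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        tp_size = dist.get_world_size()
        assert out_features % tp_size == 0, "output dim must be divisible by TP size"
        # This rank owns out_features // tp_size rows of the (out, in) weight,
        # i.e. a contiguous block of output columns of the full layer.
        self.weight = nn.Parameter(torch.empty(out_features // tp_size, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features // tp_size))
        nn.init.normal_(self.weight, mean=0.0, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The output is sharded along the last dimension; a following row-parallel
        # layer (or an all-gather) recombines the shards.
        return F.linear(x, self.weight, self.bias)
```

A row-parallel layer is the mirror image: the weight is split along its input dimension, each rank consumes its shard of the (scattered) input, and an all-reduce sums the partial outputs.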
@@ -924,10 +967,6 @@
   </tr>
   </tbody>
   </table>
-
-  <p>You can find an example of implementation of both column and row linear TP in picotron:
-
-  <a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
 
   <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
 
@@ -1102,8 +1141,17 @@
 
   <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
 
-  <p>You can find the full implementation of the AFAB pipeline in picotron: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L54-L83</p>
+  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
 
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 AFAB PP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
 
   <d-math block>
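The AFAB embed above runs every microbatch forward before any backward pass. The schedule itself reduces to the following sketch (hypothetical code, not the Picotron implementation: `recv_forward`, `send_forward`, `recv_backward` and `send_backward` stand in for point-to-point communication between adjacent pipeline stages):

```python
import torch

def afab_step(stage, microbatches, targets, loss_fn, is_first: bool, is_last: bool):
    """All-forward-all-backward schedule for one pipeline stage."""
    num_mb = len(microbatches)
    saved = []  # activations kept alive for the whole step: the memory cost of AFAB

    # Phase 1: all forward passes.
    for mb in microbatches:
        x = mb if is_first else recv_forward().requires_grad_()
        y = stage(x)
        if not is_last:
            send_forward(y.detach())
        saved.append((x, y))

    # Phase 2: all backward passes.
    for i, (x, y) in enumerate(saved):
        if is_last:
            (loss_fn(y, targets[i]) / num_mb).backward()
        else:
            y.backward(recv_backward())  # gradient w.r.t. this stage's output
        if not is_first:
            send_backward(x.grad)        # hand the gradient to the previous stage
```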
@@ -1132,8 +1180,17 @@
 
   <p>Here is the example training loop from the above gist:</p>
 
-  <p>You can find the full implementation in picotron as well: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L85-L145</p>
+  <p>You can find the full implementation in picotron as well:</p>
 
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 1F1B PP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>So reordering a bit the computations helped a lot improving the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
 
   <h3>Interleaving stages</h3>
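The 1F1B embed above interleaves forward and backward passes so that each stage only keeps a bounded number of microbatches' activations alive at once, rather than all of them as in AFAB. Reusing the hypothetical communication helpers from the AFAB sketch, the schedule looks roughly like:

```python
def one_forward_one_backward_step(stage, microbatches, targets, loss_fn,
                                  pp_rank: int, pp_size: int,
                                  is_first: bool, is_last: bool):
    """1F1B schedule: warm-up forwards, steady-state 1F1B, cool-down backwards."""
    num_mb = len(microbatches)
    num_warmup = min(pp_size - pp_rank - 1, num_mb)  # forwards before the first backward
    saved, mb_iter = [], iter(microbatches)

    def run_forward(i):
        x = next(mb_iter) if is_first else recv_forward().requires_grad_()
        y = stage(x)
        if is_last:
            y = loss_fn(y, targets[i]) / num_mb  # keep the scaled loss for backward
        else:
            send_forward(y.detach())
        saved.append((x, y))

    def run_backward():
        x, y = saved.pop(0)  # oldest in-flight microbatch
        if is_last:
            y.backward()
        else:
            y.backward(recv_backward())
        if not is_first:
            send_backward(x.grad)

    for i in range(num_warmup):               # warm-up: forwards only
        run_forward(i)
    for i in range(num_warmup, num_mb):       # steady state: one forward, one backward
        run_forward(i)
        run_backward()
    for _ in range(num_warmup):               # cool-down: drain the remaining backwards
        run_backward()
```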
 
dist/style.css CHANGED
@@ -20,7 +20,6 @@
   margin-top: 0px;
   padding: 0px;
 }
-
 .plotly_caption {
   font-style: italic;
   margin-top: 10px;
src/index.html CHANGED
@@ -474,13 +474,9 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
-  <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
-
   <aside>If you are not familiar with distributed communications patterns like broadcast, gather or all-reduce we put together a small crash course in the Appendix [TODO Link].</aside>
 
-  <p>TODO: embed naive DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60</a></p>
-
-  <p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
+  <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
@@ -510,7 +506,18 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
 
-  <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. </p>
+  <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Naive DP implementation with overlap in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script
+  src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
+  </script>
+  </div>
+  </details>
 
   <p>This is our first example of “<em>overlapping computation and communication</em>” which we will discuss several times in this blog post and is an essential technique to maximal scaling efficiency. Let's have a look how we can further improve the DP efficiency!</p>
 
@@ -519,6 +526,18 @@
 
   <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping—it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation for each bucket, we can significantly reduce communication overhead and speed up the communication operation.</p>
 
+  <p>Here's the code implementation with bucketing:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Bucket DP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
+  </script>
+  </div>
+  </details>
+
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
   <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
@@ -749,12 +768,36 @@
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+  <p>Here's the code implementation of column wise tensor parallelism:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Column parallel TP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
 
   <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
 
   <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+  <p>Here's the implementation for row-wise tensor parallelism:</p>
+
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 Row parallel TP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
+  <p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
+
   <h3>Tensor Parallelism in a Transformer Block</h3>
 
   <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks : Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
@@ -924,10 +967,6 @@
   </tr>
   </tbody>
   </table>
-
-  <p>You can find an example of implementation of both column and row linear TP in picotron:
-
-  <a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
 
   <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
 
@@ -1102,8 +1141,17 @@
 
   <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
 
-  <p>You can find the full implementation of the AFAB pipeline in picotron: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L54-L83</p>
+  <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
 
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 AFAB PP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
 
   <d-math block>
@@ -1132,8 +1180,17 @@
 
   <p>Here is the example training loop from the above gist:</p>
 
-  <p>You can find the full implementation in picotron as well: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L85-L145</p>
+  <p>You can find the full implementation in picotron as well:</p>
 
+  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+  👉 1F1B PP implementation in Picotron (Click to expand)
+  </summary>
+  <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+  <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+  </div>
+  </details>
+
   <p>So reordering a bit the computations helped a lot improving the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
 
   <h3>Interleaving stages</h3>
src/style.css CHANGED
@@ -20,7 +20,6 @@
   margin-top: 0px;
   padding: 0px;
 }
-
 .plotly_caption {
   font-style: italic;
   margin-top: 10px;