picotron (#4)

Commits:
- add picotron code snippets (2a9ca3dfd0ebe633c4292131fda4f8f04c0c6cd5)
- remove old files (8d5f9163eef8b39aa7d00c5e216ce34b33666770)
- Merge branch 'pr/4' into picotron-snippets (0cde7daaa5c1f757066d978a57053edb0e4f92d0)

Files changed:
- blog-export-headrs.html +0 -192
- blog-export.html +0 -0
- blog-export.md +0 -0
- dist/index.html +69 -12
- dist/style.css +0 -1
- src/index.html +69 -12
- src/style.css +0 -1
blog-export-headrs.html
DELETED
@@ -1,192 +0,0 @@
-<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
-<h2>TL;DR</h2>
-
-<h2>First Steps: Training on one GPU</h2>
-
-<h3>Memory usage in Transformers</h3>
-
-<h4>Memory profiling a training step</h4>
-
-<h4>Weights/grads/optimizer states memory</h4>
-
-<h4>Activations memory</h4>
-
-<h3><strong>Activation recomputation</strong></h3>
-
-<h3>Gradient accumulation</h3>
-
-<h2>Data Parallelism</h2>
-
-<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
-<h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
-<h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
-
-<h3>Revisit global batch size</h3>
-
-<h3>Our journey up to now</h3>
-
-<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
-<h4>Memory usage revisited</h4>
-
-<h4>ZeRO-1: Partitioning Optimizer States</h4>
-
-<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
-<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
-<h2>Tensor Parallelism</h2>
-
-<h3>Tensor Parallelism in a Transformer Block</h3>
-
-<h3>Sequence Parallelism</h3>
-
-<h2>Context Parallelism</h2>
-
-<h3>Introducing Context Parallelism</h3>
-
-<h3>Discovering Ring Attention</h3>
-
-<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
-<h2></h2>
-
-<h2>Pipeline Parallelism</h2>
-
-<h3>Splitting layers on various nodes - All forward, all backward</h3>
-
-<h3>One-forward-one-backward and LLama 3.1 schemes</h3>
-
-<h3>Interleaving stages</h3>
-
-<h3>Zero Bubble and DualPipe</h3>
-
-<h2>Expert parallelism</h2>
-
-<h2>5D parallelism in a nutshell</h2>
-
-<h2>How to Find the Best Training Configuration</h2>
-
-<h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
-<h4>A primer on GPU</h4>
-
-<h3>How to improve performance with Kernels ?</h3>
-
-<h4>Memory Coalescing</h4>
-
-<h4>Tiling</h4>
-
-<h4>Thread Coarsening</h4>
-
-<h4>Minimizing Control Divergence</h4>
-
-<h3>Flash Attention 1-3</h3>
-
-<h3>Fused Kernels</h3>
-
-<h3>Mixed Precision Training</h3>
-
-<h4>FP16 and BF16 training</h4>
-
-<h4>FP8 pretraining</h4>
-
-<h2>Conclusion</h2>
-
-<h3>What you learned</h3>
-
-<h3>What we learned</h3>
-
-<h3>What’s next?</h3>
-
-<h2>References</h2>
-
-<h3>Landmark LLM Scaling Papers</h3>
-
-<h3>Training Frameworks</h3>
-
-<h3>Debugging</h3>
-
-<h3>Distribution Techniques</h3>
-
-<h3>CUDA Kernels</h3>
-
-<h3>Hardware</h3>
-
-<h3>Others</h3>
-
-<h2>Appendix</h2>
-
-<h3>A0: Parallel Programming Crash Course</h3>
-
-<h4>Broadcast</h4>
-
-<h4>Reduce & AllReduce</h4>
-
-<h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
-<h4>Gather & AllGather</h4>
-
-<h4>Scatter & ReduceScatter</h4>
-
-<h4>Barrier</h4>
-
-<h4>NCCL: NVIDIA Collective Communications Library</h4>
-
-<h3>A1: Profiling</h3>
-
-<h4>Kernels</h4>
-
-<h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
-<h2>include <torch/extension.h></h2>
-
-<h2>include <cuda.h></h2>
-
-<h2>include <cuda_runtime.h></h2>
-
-<h2>Load and compile the CUDA extension</h2>
-
-<h2>Define input tensors</h2>
-
-<h2>Run the CUDA kernel</h2>
-
-<h3>A2: TP Backward pass</h3>
-
-<h3>A3: ZeRO-R</h3>
-
-<h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
-<h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
-<h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
-<h4>Communication Analysis of ZeRO-R</h4>
-
-<h3>A5. Memory profile</h3>
-
-<h2>Set up optimizer</h2>
-
-<h3>TP: Practical PyTorch Implementation</h3>
-
-<h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
-<h2>core logic of Column Parallel linear</h2>
-
-<h4>Gelu code</h4>
-
-<h4>Interconnect</h4>
-
-<h3>How to profile your code</h3>
-
-<h3>Formulas for compute / comms the balanhe balance</h3>
-
-<h3>Integrating Context Parallelism with TP/SP</h3>
-
-<h3>The nanotron FP8 recipe</h3>
-
-<h2>Overlapping computation and communication</h2>
-
blog-export.html
DELETED
The diff for this file is too large to render.
blog-export.md
DELETED
The diff for this file is too large to render.
dist/index.html
CHANGED
@@ -474,13 +474,9 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
-<p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
-
 <aside>If you are not familiar with distributed communications patterns like broadcast, gather or all-reduce we put together a small crash course in the Appendix [TODO Link].</aside>
 
-<p>
-
-<p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
+<p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
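Editor's note on the hunk above: the all-reduce primitive it refers to can be tried out in a few lines of PyTorch. The snippet below is an illustrative sketch added for this review, not part of the PR; it assumes one process per GPU launched with torchrun.

```python
# Minimal all-reduce demo (illustrative, not part of this PR).
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank starts with a different tensor; after all-reduce every rank holds the sum.
x = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {x.tolist()}")  # identical on every rank

dist.destroy_process_group()
```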
@@ -510,7 +506,18 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
 
-<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism.
+<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Naive DP implementation with overlap in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script
+src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
+</script>
+</div>
+</details>
 
 <p>This is our first example of “<em>overlapping computation and communication</em>” which we will discuss several times in this blog post and is an essential technique to maximal scaling efficiency. Let's have a look how we can further improve the DP efficiency!</p>
 
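Editor's note: the hunk above embeds picotron's naive DP with gradient-synchronization overlap. As a rough, self-contained illustration of the same idea (not the picotron code; the class name and helper are made up for this sketch), one can attach a post-accumulate-grad hook to each parameter so the all-reduce is launched asynchronously while the backward pass is still running:

```python
# Illustrative sketch: overlap gradient all-reduce with the backward pass.
# Assumes a torch.distributed process group is already initialized (e.g. via torchrun).
import torch
import torch.distributed as dist

class NaiveOverlapDP:
    """Launch an async all-reduce for each parameter's gradient as soon as it is ready."""

    def __init__(self, model: torch.nn.Module):
        self.handles = []
        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(self._hook)

    def _hook(self, param: torch.Tensor):
        # Average gradients across ranks without blocking the rest of backward
        # (SUM + divide is used instead of ReduceOp.AVG for portability across backends).
        param.grad.div_(dist.get_world_size())
        self.handles.append(dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True))

    def wait(self):
        # Call after loss.backward() and before optimizer.step().
        for h in self.handles:
            h.wait()
        self.handles.clear()
```

In a training loop this would look like: `loss.backward(); dp.wait(); optimizer.step(); optimizer.zero_grad()`.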
@@ -519,6 +526,18 @@
 
 <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping—it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation for each bucket, we can significantly reduce communication overhead and speed up the communication operation.</p>
 
+<p>Here's the code implementation with bucketing:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Bucket DP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
+</script>
+</div>
+</details>
+
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
 <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
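Editor's note: the bucketing idea embedded above can also be sketched standalone. The toy version below is an illustration (not picotron's bucket manager); it assumes all gradients share a dtype, flattens each bucket into one buffer, runs a single all-reduce, and copies the averaged results back:

```python
# Toy gradient bucketing: one all-reduce per bucket instead of one per gradient.
# Assumes the torch.distributed process group is already initialized.
import torch
import torch.distributed as dist

def allreduce_bucketed(params, bucket_size_mb: float = 25.0):
    max_bytes = int(bucket_size_mb * 1024 * 1024)

    def flush(bucket):
        if not bucket:
            return
        grads = [p.grad for p in bucket]
        flat = torch.cat([g.reshape(-1) for g in grads])   # pack into one buffer
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)
        flat.div_(dist.get_world_size())
        offset = 0
        for g in grads:                                     # unpack the averaged gradients
            g.copy_(flat[offset:offset + g.numel()].view_as(g))
            offset += g.numel()

    bucket, bucket_bytes = [], 0
    for p in params:
        if p.grad is None:
            continue
        bucket.append(p)
        bucket_bytes += p.grad.numel() * p.grad.element_size()
        if bucket_bytes >= max_bytes:
            flush(bucket)
            bucket, bucket_bytes = [], 0
    flush(bucket)  # remaining gradients
```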
@@ -749,12 +768,36 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+<p>Here's the code implementation of column wise tensor parallelism:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Column parallel TP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
 
 <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+<p>Here's the implementation for row-wise tensor parallelism:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Row parallel TP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
+<p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
+
 <h3>Tensor Parallelism in a Transformer Block</h3>
 
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks : Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
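Editor's note on the two embeds above: the forward passes of column-linear and row-linear sharding reduce to a local matmul plus one collective. The sketch below is forward-only and illustrative (the function names and the weight-shard layouts are assumptions, not picotron's classes); it presumes the shards were created consistently on every rank of an initialized process group.

```python
# Forward-only sketch of column-linear and row-linear sharded layers.
# The backward pass would additionally need the conjugate collectives (see the picotron embeds).
import torch
import torch.distributed as dist

def column_linear_forward(x, w_shard, gather_output=True):
    # w_shard: [out_features // tp_size, in_features]; every rank sees the full input x.
    y_local = x @ w_shard.t()                       # [..., out_features // tp_size]
    if not gather_output:
        return y_local
    shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, y_local)                # collect the column shards
    return torch.cat(shards, dim=-1)                # [..., out_features]

def row_linear_forward(x_shard, w_shard):
    # w_shard: [out_features, in_features // tp_size]; x_shard is this rank's slice of the input.
    y_partial = x_shard @ w_shard.t()               # partial sums of the full matmul
    dist.all_reduce(y_partial, op=dist.ReduceOp.SUM)
    return y_partial
```

In a real implementation these are wrapped in autograd functions so that the backward pass runs the conjugate collective of the forward one.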
@@ -924,10 +967,6 @@
 </tr>
 </tbody>
 </table>
-
-<p>You can find an example of implementation of both column and row linear TP in picotron:
-
-<a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
 
 <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
 
@@ -1102,8 +1141,17 @@
 
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
 
-<p>You can find the full implementation of the AFAB pipeline in picotron
+<p>You can find the full implementation of the AFAB pipeline in picotron:</p>
 
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 AFAB PP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
 
 <d-math block>
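Editor's note as a companion to the AFAB embed above: a schematic version of the schedule, written for this review only, is shown below. It is not picotron's code; the activation shape is assumed known in advance, each rank owns one stage, the loss is a placeholder, and real implementations overlap the sends and receives instead of issuing them blocking as here.

```python
# Schematic all-forward-all-backward (AFAB) pipeline schedule (illustration only).
import torch
import torch.distributed as dist

def train_step_afab(stage: torch.nn.Module, microbatches, act_shape, device="cuda"):
    rank, world = dist.get_rank(), dist.get_world_size()
    is_first, is_last = rank == 0, rank == world - 1
    inputs, outputs = [], []

    # 1) All forward passes first.
    for mb in microbatches:
        if is_first:
            x = mb.to(device)
        else:
            x = torch.empty(act_shape, device=device)
            dist.recv(x, src=rank - 1)      # activation from the previous stage
            x.requires_grad_(True)
        y = stage(x)
        if not is_last:
            dist.send(y, dst=rank + 1)      # activation to the next stage
        inputs.append(x)
        outputs.append(y)

    # 2) Then all backward passes, in reverse microbatch order.
    for x, y in reversed(list(zip(inputs, outputs))):
        if is_last:
            y.mean().backward()             # placeholder loss for the sketch
        else:
            grad_y = torch.empty_like(y)
            dist.recv(grad_y, src=rank + 1) # gradient from the next stage
            y.backward(grad_y)
        if not is_first:
            dist.send(x.grad, dst=rank - 1) # gradient to the previous stage
```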
@@ -1132,8 +1180,17 @@
 
 <p>Here is the example training loop from the above gist:</p>
 
-<p>You can find the full implementation in picotron as well
+<p>You can find the full implementation in picotron as well:</p>
 
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 1F1B PP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>So reordering a bit the computations helped a lot improving the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
 
 <h3>Interleaving stages</h3>
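Editor's note for the 1F1B embed above: the schedule warms up with a few forwards, alternates one forward with one backward in steady state, and finishes with the remaining backwards. The helper below only generates that ordering for a given stage (an illustration added for this review, with the actual forward, backward and communication steps abstracted away):

```python
# Ordering of the 1F1B schedule for one pipeline stage (illustration only).
def one_f_one_b_schedule(num_microbatches: int, pp_rank: int, pp_size: int):
    """Return ("F", i) / ("B", i) events in the order a given stage executes them."""
    warmup = min(pp_size - pp_rank - 1, num_microbatches)
    fwd = bwd = 0
    ops = []
    for _ in range(warmup):                      # warm-up: forwards only
        ops.append(("F", fwd)); fwd += 1
    for _ in range(num_microbatches - warmup):   # steady state: one forward, one backward
        ops.append(("F", fwd)); fwd += 1
        ops.append(("B", bwd)); bwd += 1
    for _ in range(warmup):                      # cool-down: remaining backwards
        ops.append(("B", bwd)); bwd += 1
    return ops

# e.g. the last stage alternates immediately:
# one_f_one_b_schedule(4, pp_rank=3, pp_size=4)
# -> [('F',0), ('B',0), ('F',1), ('B',1), ('F',2), ('B',2), ('F',3), ('B',3)]
```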
dist/style.css
CHANGED
@@ -20,7 +20,6 @@
 margin-top: 0px;
 padding: 0px;
 }
-
 .plotly_caption {
 font-style: italic;
 margin-top: 10px;
src/index.html
CHANGED
(The changes to src/index.html are identical to the dist/index.html diff shown above.)
src/style.css
CHANGED
@@ -20,7 +20,6 @@
 margin-top: 0px;
 padding: 0px;
 }
-
 .plotly_caption {
 font-style: italic;
 margin-top: 10px;