fix-figures (#55)

opened by lvwerra
assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
assets/images/tp_diagram.svg CHANGED
assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: f075304c019e12be1ac0ef8afa9241c03bc466f568dca0c66e20b1391a471bca
  • Pointer size: 131 Bytes
  • Size of remote file: 486 kB

Git LFS Details

  • SHA256: a37adac220e4ec37dd58be698d26630520501c2de71161c6601d6318e1cbffcd
  • Pointer size: 131 Bytes
  • Size of remote file: 618 kB
dist/assets/images/5D_nutshell_tp_sp.svg CHANGED
dist/assets/images/5d_nutshell_cp.svg CHANGED
dist/assets/images/5d_nutshell_ep.svg CHANGED
dist/assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
dist/assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/assets/images/tp_diagram.svg CHANGED
dist/assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: 92f1591b62f4f7eb8a059b973a379784523915386ee9f682e17e3ab43d4f494d
  • Pointer size: 130 Bytes
  • Size of remote file: 89.8 kB

Git LFS Details

  • SHA256: cb2772716631ff96aeab01b1eb6cc8e59927d4f30cba72d8ba506dcf326406c7
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
dist/index.html CHANGED
@@ -18,8 +18,28 @@
18
  "title": "The Ultra-Scale Playbook: Training LLMs on GPU Clusters",
19
  "description": "This blog covers everything about scaling LLMs in 2025.",
20
  "published": "Feb 19, 2025",
21
- "affiliation": {"name": "HuggingFace"},
22
  "authors": [
23
  {
24
  "author":"Leandro Werra",
25
  "authorURL":"https://huggingface.co/lvwerra"
@@ -202,6 +222,8 @@
202
  </li>
203
  </ul>
204
 
 
 
205
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->
206
 
207
  <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
@@ -580,7 +602,7 @@
580
  </ul>
581
 
582
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
583
- <p>Figure: Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p>
584
 
585
  <p>The trace helps identify bottlenecks like:</p>
586
  <ul>
@@ -1080,11 +1102,9 @@
1080
 
1081
  <p>In practice we’ll go from the left diagram to the right:</p>
1082
 
1083
- <p><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
  in backward: f = all-reduce ; f* = no-op ; g = reduce-scatter ; g* = all-gather
- SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" /></p>
-
- <p>Where the abbreviations are: in forward: f = no-op ; f<em> = all-reduce ; g = all-gather ; g</em> = reduce-scatter in backward: f = all-reduce ; f<em> = no-op ; g = reduce-scatter ; g</em> = all-gather SP region needs full hidden_dim</p>
1088
 
1089
  <p>The diagram shows how we transition between tensor-parallel and sequence-parallel regions using different collective operations (labeled "f" and "g"). The key challenge is managing these transitions efficiently while keeping memory usage low and maintaining correctness.</p>
1090
 
@@ -1099,7 +1119,7 @@
1099
  <li>"f" is an all-reduce to synchronize gradients</li>
1100
  </ul>
1101
 
1102
- <p>These operations "f" and "f<em>" are called </em><em>conjugate</em>* pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>
1103
 
1104
  <p>For sequence parallelism (SP), we use different operations labeled "g" and "g*". Specifically, we avoid using all-reduce in the SP region since that would require gathering the full activations and increase our peak memory usage, defeating the purpose of SP.</p>
1105
 
@@ -1900,24 +1920,75 @@
1900
  <p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SM). Each SM contains and controls a set of streaming processors, also known as cores. For example, an Nvidia H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
1901
 
1902
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
1903
- <p><em>Source: https://blog.codingconfessions.com/p/gpu-computing.</em></p>
1904
 
1905
  <p>The memory side is also highly hierarchical with several layers of cache and memory: <strong>Registers</strong> are the smallest units and are private to the threads during executions, <strong>Shared Memory</strong> and <strong>L1 cache are</strong> shared between the threads running on a single SM, higher up is the <strong>L2 cache</strong> shared by all SMs, finally there is the <strong>Global Memory</strong> which is the largest memory on the GPU (the advertised 80 GB for a H100 for instance) but also the slowest to access and query.</p>
1906
 
1907
  <p><img alt="image.png" src="/assets/images/diving_primergpu2.svg" /></p>
1908
- <p><em>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</em></p>
1909
-
1910
  <p>The goal of GPU will be to run as many workloads as possible, in parallel, on the GPU cores, by taking advantage of this hierarchical organization of compute/memory.</p>
1911
 
1912
  <p>A piece of code running on a core of the GPU is called a <strong>kernel</strong>. It can be written at a high-level in <strong>CUDA</strong> or <strong>Triton</strong> for instance, and is then compiled to Parallel Thread Execution, PTX, the low-level assembly used by NVIDIA GPUs.</p>
1913
 
1914
  <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
1915
 
1916
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1917
  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
1918
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1919
- <p>Figure 6: Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p>
1920
-
1921
  <p>Kernels are generally scheduled as follow:</p>
1922
 
1923
  <ul>
@@ -1953,8 +2024,9 @@
1953
 
1954
  <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
1955
 
1956
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1957
 
 
1958
 
1959
  <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
1960
 
@@ -1982,7 +2054,7 @@
1982
  tl.store(out_ptr0 + (x0), tmp6, xmask)
1983
  </d-code>
1984
 
1985
- <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
1986
 
1987
  <d-code block language="python">
1988
  @triton.jit
@@ -2013,23 +2085,25 @@
2013
 
2014
  <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
2015
 
2016
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
2017
 
2018
- <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
2019
 
2020
- <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
2021
 
2022
- <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
 
 
2023
 
2024
- <ul>
2025
  <li>Pytorch: easy but slow</li>
2026
  <li>torch.compile: easy, fast, but not flexible</li>
2027
  <li>triton: harder, faster, and more flexible</li>
2028
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
2029
 
2030
- </ul>
2031
 
2032
- <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
2033
 
2034
  <h4>Memory Coalescing</h4>
2035
 
@@ -2060,8 +2134,12 @@
2060
 
2061
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2062
 
2063
- <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
2064
- <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
 
 
 
 
2065
 
2066
 
2067
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
@@ -2091,7 +2169,7 @@
2091
  <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2092
 
2093
 
2094
- <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong> !</p>
2095
  <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2096
 
2097
 
@@ -2197,14 +2275,14 @@
2197
 
2198
  <p>A basic implementation of the attention mechanism involve a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM which means that the results need to be sent to HBM and then back to SRAM for the next computations:</p>
2199
 
2200
- <p><img alt="image.png" src="/assets/images/flashattn.png" /></p>
2201
-
2202
  <p>Since bandwidth is much lower in HBM this introduces a severe bottleneck in the attention computation. Can we do better? Tri Dao says yes!</p>
2203
 
2204
  <p>The key element is to compute the S matrices in small pieces which can fit in the smaller shared memory of the SM. But we can do even better and avoid materializing the very large S matrix all together in favor of keeping only the necessary statistics for computing the normalization factor of the softmax. So we can compute part of <d-math>O</d-math> directly in one computation in SRAM rather than moving intermediate results back and forth. In this case, not even do we make use of the shared memory but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length), the attention matrix.</p>
2205
 
2206
  <p><img alt="image.png" src="/assets/images/flashattn2.png" /></p>
2207
- <p>From the FLASH-ATTENTION paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p>
2208
 
2209
  <p>The idea of flash attention resolves so many bottlenecks in model training that it has quickly become the default way to perform attention in all transformers:</p>
2210
  <ul>
@@ -2503,9 +2581,14 @@
2503
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2504
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2505
  </ul>
2506
-
2507
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2508
 
2509
  <h2>References</h2>
2510
 
2511
  <h3>Landmark LLM Scaling Papers</h3>
 
18
  "title": "The Ultra-Scale Playbook: Training LLMs on GPU Clusters",
19
  "description": "This blog covers everything about scaling LLMs in 2025.",
20
  "published": "Feb 19, 2025",
21
+ "affiliation": {"name": "Hugging Face"},
22
  "authors": [
23
+ {
24
+ "author":"Nouamane Tazi",
25
+ "authorURL":"https://huggingface.co/nouamanetazi"
26
+ },
27
+ {
28
+ "author":"Ferdinand Mom",
29
+ "authorURL":"https://huggingface.co/3outeille"
30
+ },
31
+ {
32
+ "author":"Haojun Zhao",
33
+ "authorURL":"https://huggingface.co/zzhhjjj"
34
+ },
35
+ {
36
+ "author":"Phuc Nguyen",
37
+ "authorURL":"https://huggingface.co/neuralink"
38
+ },
39
+ {
40
+ "author":"Mohamed Mekkouri",
41
+ "authorURL":"https://huggingface.co/medmekk"
42
+ },
43
  {
44
  "author":"Leandro Werra",
45
  "authorURL":"https://huggingface.co/lvwerra"
 
222
  </li>
223
  </ul>
224
 
225
+ <aside>If you would rather watch a video on distributed training than read the blog or picotron code, check out <a href="https://www.youtube.com/watch?v=u2VSwDDpaBM&list=PL-_armZiJvAnhcRr6yTJ0__f3Oi-LLi9S">Ferdinand's YouTube channel</a>.</aside>
226
+
227
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->
228
 
229
  <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
 
602
  </ul>
603
 
604
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
605
+ <div class="figure-legend"><p>Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
606
 
607
  <p>The trace helps identify bottlenecks like:</p>
608
  <ul>
 
1102
 
1103
  <p>In practice we’ll go from the left diagram to the right:</p>
1104
 
1105
+ <p style="text-align: center"><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
  in backward: f = all-reduce ; f* = no-op ; g = reduce-scatter ; g* = all-gather
+ SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" style="width: 500px" /></p>
1108
 
1109
  <p>The diagram shows how we transition between tensor-parallel and sequence-parallel regions using different collective operations (labeled "f" and "g"). The key challenge is managing these transitions efficiently while keeping memory usage low and maintaining correctness.</p>
1110
 
 
1119
  <li>"f" is an all-reduce to synchronize gradients</li>
1120
  </ul>
1121
 
1122
+ <p>These operations "f" and "f*" are called <strong>conjugate</strong> pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>
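 <p>As a rough illustration (not the playbook's own code), such a conjugate pair can be sketched with two <code>torch.autograd.Function</code>s, assuming <code>torch.distributed</code> is already initialized on the tensor-parallel process group; the class names below are ours:</p>
 <d-code block language="python">
 # Sketch of a Megatron-style conjugate pair; assumes torch.distributed is initialized.
 import torch
 import torch.distributed as dist

 class CopyToTPRegion(torch.autograd.Function):
     """'f': no-op in forward, all-reduce of the gradient in backward."""
     @staticmethod
     def forward(ctx, x):
         return x

     @staticmethod
     def backward(ctx, grad_output):
         dist.all_reduce(grad_output)   # sum gradients over tensor-parallel ranks
         return grad_output

 class ReduceFromTPRegion(torch.autograd.Function):
     """'f*': all-reduce in forward, no-op in backward."""
     @staticmethod
     def forward(ctx, x):
         dist.all_reduce(x)             # sum partial outputs over tensor-parallel ranks
         return x

     @staticmethod
     def backward(ctx, grad_output):
         return grad_output
 </d-code>
 <p>In a Megatron-style MLP, "f" would wrap the input of the column-parallel linear and "f*" the output of the row-parallel linear.</p>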
1123
 
1124
  <p>For sequence parallelism (SP), we use different operations labeled "g" and "g*". Specifically, we avoid using all-reduce in the SP region since that would require gathering the full activations and increase our peak memory usage, defeating the purpose of SP.</p>
1125
 
 
1920
  <p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SM). Each SM contains and controls a set of streaming processors, also known as cores. For example, an Nvidia H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
1921
 
1922
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
1923
+ <div class="figure-legend"><p>Source: https://blog.codingconfessions.com/p/gpu-computing</p></div>
1924
 
1925
  <p>The memory side is also highly hierarchical with several layers of cache and memory: <strong>Registers</strong> are the smallest units and are private to the threads during executions, <strong>Shared Memory</strong> and <strong>L1 cache are</strong> shared between the threads running on a single SM, higher up is the <strong>L2 cache</strong> shared by all SMs, finally there is the <strong>Global Memory</strong> which is the largest memory on the GPU (the advertised 80 GB for a H100 for instance) but also the slowest to access and query.</p>
1926
 
1927
  <p><img alt="image.png" src="/assets/images/diving_primergpu2.svg" /></p>
1928
+ <div class="figure-legend"><p>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</p></div>
1929
+
1930
  <p>The goal of GPU will be to run as many workloads as possible, in parallel, on the GPU cores, by taking advantage of this hierarchical organization of compute/memory.</p>
1931
 
1932
  <p>A piece of code running on a core of the GPU is called a <strong>kernel</strong>. It can be written at a high-level in <strong>CUDA</strong> or <strong>Triton</strong> for instance, and is then compiled to Parallel Thread Execution, PTX, the low-level assembly used by NVIDIA GPUs.</p>
1933
 
1934
  <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
1935
 
1936
+ <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
1937
+ <div>
1938
+ <d-code block language="python">
1939
+ // Host code
1940
+ void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
1941
+ // Allocate vectors in device memory
1942
+ int size = n * sizeof(float);
1943
+ float *d_A, *d_B, *d_C;
1944
+ cudaMalloc(&d_A, size);
1945
+ cudaMalloc(&d_B, size);
1946
+ cudaMalloc(&d_C, size);
1947
+
1948
+ // Copy vectors from host memory to device memory
1949
+ cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
1950
+ cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
1951
+
1952
+ // Invoke kernel
1953
+ int threadsPerBlock = 256;
1954
+ int blocksPerGrid =
1955
+ (N + threadsPerBlock - 1) / threadsPerBlock;
1956
+ VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
1957
+
1958
+ // Copy result from device memory to host memory
1959
+ // h_C contains the result in host memory
1960
+ cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
1961
+
1962
+ // Free device memory
1963
+ cudaFree(d_A);
1964
+ cudaFree(d_B);
1965
+ cudaFree(d_C);
1966
+ }</d-code>
1967
+ <div class="figure-legend">
1968
+ <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
1969
+ </div>
1970
+ </div>
1971
+ <div>
1972
+ <d-code block language="python">
1973
+ // Device code
1974
+ __global__ void VecAdd(float* A, float* B, float* C, int N)
1975
+ {
1976
+ int i = blockDim.x * blockIdx.x + threadIdx.x;
1977
+ if (i < N)
1978
+ C[i] = A[i] + B[i];
1979
+ }
1980
+ </d-code>
1981
+ <div class="figure-legend">
1982
+ <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
1983
+
1984
+ </div>
1985
+ </div>
1986
+ </div>
1987
+
1988
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1989
  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
1990
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1991
+ -->
 
1992
  <p>Kernels are generally scheduled as follow:</p>
1993
 
1994
  <ul>
 
2024
 
2025
  <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
2026
 
2027
+ <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
2028
 
2029
+ <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
2030
 
2031
  <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
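 <p>As a minimal sketch (the function below is illustrative, not the exact one used here), dumping the generated code can look like this:</p>
 <d-code block language="python">
 # Illustrative only: run with  TORCH_LOGS="output_code" python example.py
 # to have torch.compile log the Triton kernel it generates.
 import torch

 @torch.compile
 def elementwise(x):
     return torch.sin(x) ** 2 + torch.cos(x) ** 2

 x = torch.randn(1024, 1024, device="cuda")
 elementwise(x)  # first call triggers compilation and logs the generated code
 </d-code>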
2032
 
 
2054
  tl.store(out_ptr0 + (x0), tmp6, xmask)
2055
  </d-code>
2056
 
2057
+ <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
2058
 
2059
  <d-code block language="python">
2060
  @triton.jit
 
2085
 
2086
  <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
2087
 
2088
+ <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
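 <p>For reference, such a benchmark is typically wired up along the following lines; the two providers below are stand-ins, not the exact kernels benchmarked here:</p>
 <d-code block language="python">
 # Sketch of a triton.testing benchmark; both workloads are placeholders.
 import torch
 import triton
 import triton.testing

 @torch.compile
 def compiled_ref(x):
     return torch.relu(x)          # placeholder for the torch.compile reference

 def custom_kernel(x):
     return torch.relu(x)          # placeholder for the standalone Triton kernel

 @triton.testing.perf_report(
     triton.testing.Benchmark(
         x_names=["N"], x_vals=[2**i for i in range(8, 14)],
         line_arg="provider", line_vals=["triton", "compile"],
         line_names=["Triton kernel", "torch.compile"],
         ylabel="ms", plot_name="kernel-benchmark", args={},
     )
 )
 def benchmark(N, provider):
     x = torch.randn(4096, N, device="cuda")
     fn = custom_kernel if provider == "triton" else compiled_ref
     return triton.testing.do_bench(lambda: fn(x))

 benchmark.run(print_data=True)
 </d-code>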
2089
 
2090
+ <p>This standalone kernel even demonstrates superior performance at smaller sizes compared to <code>@torch.compile</code>, but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention on optimizing their performance, saving you a lot of time in the process.</p>
2091
 
2092
+ <p>Even in Triton, however, we sometimes cannot fully reach the peak performance of the device because of the language's limitations in handling low-level details such as shared memory and scheduling within streaming multiprocessors (SMs). Triton's capabilities are restricted to blocks and the scheduling of blocks across SMs. To gain even deeper control, you will need to implement kernels directly in CUDA, where you have access to all the underlying low-level details.</p>
2093
 
2094
+ <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
2095
+
2096
+ <p>Before we dive deeper into CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
2097
 
2098
+ <ol>
2099
  <li>Pytorch: easy but slow</li>
2100
  <li>torch.compile: easy, fast, but not flexible</li>
2101
  <li>triton: harder, faster, and more flexible</li>
2102
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
2103
 
2104
+ </ol>
2105
 
2106
+ <p>Let’s talk about one of the most frequently used techniques in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in the diagram above) has long latency and low bandwidth compared to the cache, which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can greatly improve performance.</p>
2107
 
2108
  <h4>Memory Coalescing</h4>
2109
 
 
2134
 
2135
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2136
 
2137
+ <div class="large-image-background">
2138
+ <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
2139
+ </div>
2140
+ <div class="large-image-background">
2141
+ <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2142
+ </div>
2143
 
2144
 
2145
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
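 <p>A quick back-of-the-envelope check makes this concrete. With an illustrative row-major float32 matrix of width 1024, the addresses read by consecutive threads in the first iteration are 4 kB apart:</p>
 <d-code block language="python">
 # Illustrative only: byte offsets touched by threads (0,0), (1,0), (2,0), ... at i = 0,
 # assuming a row-major float32 matrix A of width N = 1024.
 N, ELEM = 1024, 4

 def offset(row, col):
     return (row * N + col) * ELEM

 # Each thread reads A[threadIdx.x][0]: different rows, same column.
 print([offset(row, 0) for row in range(4)])   # [0, 4096, 8192, 12288] -> 4 kB strides, uncoalesced

 # If consecutive threads instead walked along a row, the reads would be contiguous:
 print([offset(0, col) for col in range(4)])   # [0, 4, 8, 12] -> one coalesced transaction
 </d-code>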
 
2169
  <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2170
 
2171
 
2172
+ <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>!</p>
2173
  <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2174
 
2175
 
 
2275
 
2276
  <p>A basic implementation of the attention mechanism involve a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM which means that the results need to be sent to HBM and then back to SRAM for the next computations:</p>
2277
 
2278
+ <p style="text-align: center"><img alt="image.png" src="/assets/images/flashattn.png" style="width: 500px" /></p>
2279
+
2280
  <p>Since bandwidth is much lower in HBM this introduces a severe bottleneck in the attention computation. Can we do better? Tri Dao says yes!</p>
2281
 
2282
  <p>The key element is to compute the S matrices in small pieces which can fit in the smaller shared memory of the SM. But we can do even better and avoid materializing the very large S matrix all together in favor of keeping only the necessary statistics for computing the normalization factor of the softmax. So we can compute part of <d-math>O</d-math> directly in one computation in SRAM rather than moving intermediate results back and forth. In this case, not even do we make use of the shared memory but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length), the attention matrix.</p>
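 <p>Concretely, the "necessary statistics" boil down to a running maximum and a running normalizer. A tiny sketch of this online-softmax bookkeeping (independent of any particular FlashAttention implementation) is:</p>
 <d-code block language="python">
 # Streaming softmax statistics: one pass over the scores in chunks, keeping only
 # the running max m and the running normalizer l = sum(exp(score - m)).
 import torch

 def softmax_stats(scores, chunk=4):
     m = torch.tensor(float("-inf"))
     l = torch.tensor(0.0)
     for s in scores.split(chunk):
         m_new = torch.maximum(m, s.max())
         l = l * torch.exp(m - m_new) + torch.exp(s - m_new).sum()
         m = m_new
     return m, l

 scores = torch.randn(16)
 m, l = softmax_stats(scores)
 # exp(s - m) / l reproduces the softmax; FlashAttention applies the same rescaling
 # to partial outputs kept in SRAM instead of materializing the full attention matrix.
 print(torch.allclose(torch.exp(scores - m) / l, torch.softmax(scores, dim=0)))
 </d-code>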
2283
 
2284
  <p><img alt="image.png" src="/assets/images/flashattn2.png" /></p>
2285
+ <div class="figure-legend"><p>Source: FlashAttention paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p></div>
2286
 
2287
  <p>The idea of flash attention resolves so many bottlenecks in model training that it has quickly become the default way to perform attention in all transformers:</p>
2288
  <ul>
 
2581
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2582
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2583
  </ul>
2584
+
2585
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2586
 
2587
+ <h3>Acknowledgements</h3>
2588
+
2589
+ <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
2590
+
2591
+
2592
  <h2>References</h2>
2593
 
2594
  <h3>Landmark LLM Scaling Papers</h3>
dist/main.bundle.js CHANGED
@@ -5396,7 +5396,7 @@ function _loadFragments() {
5396
  while (1) switch (_context5.prev = _context5.next) {
5397
  case 0:
5398
  fragmentName = element.id.replace('fragment-', '');
5399
- fragmentPath = "/fragments/".concat(fragmentName, ".html");
5400
  return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
5401
  var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
5402
  var fetchPromise;
 
5396
  while (1) switch (_context5.prev = _context5.next) {
5397
  case 0:
5398
  fragmentName = element.id.replace('fragment-', '');
5399
+ fragmentPath = "fragments/".concat(fragmentName, ".html");
5400
  return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
5401
  var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
5402
  var fetchPromise;
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
424
  d-code {
425
  font-size: 12px;
426
  }
424
  d-code {
425
  font-size: 12px;
426
  }
427
+
428
+ .large-image-background {
429
+ width: 100vw;
430
+ padding-top: 10px;
431
+ padding-bottom: 10px;
432
+ margin-left: calc(-50vw + 50%);
433
+ margin-right: calc(-50vw + 50%);
434
+ background: white;
435
+ height: fit-content; /* This will make it match the image height */
436
+ display: flex;
437
+ justify-content: center; /* This will center your image */
438
+ }
src/fragmentLoader.js CHANGED
@@ -36,7 +36,7 @@ async function loadFragments() {
36
 
37
  async addFetch(element) {
38
  const fragmentName = element.id.replace('fragment-', '');
39
- const fragmentPath = `/fragments/${fragmentName}.html`;
40
 
41
  return new Promise(async (resolve, reject) => {
42
  try {
 
36
 
37
  async addFetch(element) {
38
  const fragmentName = element.id.replace('fragment-', '');
39
+ const fragmentPath = `fragments/${fragmentName}.html`;
40
 
41
  return new Promise(async (resolve, reject) => {
42
  try {
src/index.html CHANGED
(Same changes as shown above for dist/index.html.)
src/style.css CHANGED
(Same changes as shown above for dist/style.css.)