fix-figures (#55)

opened by lvwerra
assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
assets/images/tp_diagram.svg CHANGED
assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: f075304c019e12be1ac0ef8afa9241c03bc466f568dca0c66e20b1391a471bca
  • Pointer size: 131 Bytes
  • Size of remote file: 486 kB

Git LFS Details

  • SHA256: a37adac220e4ec37dd58be698d26630520501c2de71161c6601d6318e1cbffcd
  • Pointer size: 131 Bytes
  • Size of remote file: 618 kB
dist/assets/images/5D_nutshell_tp_sp.svg CHANGED
dist/assets/images/5d_nutshell_cp.svg CHANGED
dist/assets/images/5d_nutshell_ep.svg CHANGED
dist/assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
dist/assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/assets/images/tp_diagram.svg CHANGED
dist/assets/images/tp_diagram4.png CHANGED

Git LFS Details

  • SHA256: 92f1591b62f4f7eb8a059b973a379784523915386ee9f682e17e3ab43d4f494d
  • Pointer size: 130 Bytes
  • Size of remote file: 89.8 kB

Git LFS Details

  • SHA256: cb2772716631ff96aeab01b1eb6cc8e59927d4f30cba72d8ba506dcf326406c7
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
dist/index.html CHANGED
@@ -18,8 +18,28 @@
18
  "title": "The Ultra-Scale Playbook: Training LLMs on GPU Clusters",
19
  "description": "This blog covers everything about scaling LLMs in 2025.",
20
  "published": "Feb 19, 2025",
21
- "affiliation": {"name": "HuggingFace"},
22
  "authors": [
23
  {
24
  "author":"Leandro Werra",
25
  "authorURL":"https://huggingface.co/lvwerra"
@@ -202,6 +222,8 @@
202
  </li>
203
  </ul>
204
 
 
 
205
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->
206
 
207
  <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
@@ -580,7 +602,7 @@
580
  </ul>
581
 
582
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
583
- <p>Figure: Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p>
584
 
585
  <p>The trace helps identify bottlenecks like:</p>
586
  <ul>
@@ -1080,11 +1102,9 @@
1080
 
1081
  <p>In practice we’ll go from the left diagram to the right:</p>
1082
 
1083
- <p><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
  in backward: f = all-reduce ; f* = no-op ; g = reduce-scatter ; g* = all-gather
- SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" /></p>
-
- <p>Where the abbreviations are: in forward: f = no-op ; f<em> = all-reduce ; g = all-gather ; g</em> = reduce-scatter in backward: f = all-reduce ; f<em> = no-op ; g = reduce-scatter ; g</em> = all-gather SP region needs full hidden_dim</p>
1088
 
1089
  <p>The diagram shows how we transition between tensor-parallel and sequence-parallel regions using different collective operations (labeled "f" and "g"). The key challenge is managing these transitions efficiently while keeping memory usage low and maintaining correctness.</p>
1090
 
@@ -1099,7 +1119,7 @@
1099
  <li>"f" is an all-reduce to synchronize gradients</li>
1100
  </ul>
1101
 
1102
- <p>These operations "f" and "f<em>" are called </em><em>conjugate</em>* pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>
1103
 
1104
  <p>For sequence parallelism (SP), we use different operations labeled "g" and "g*". Specifically, we avoid using all-reduce in the SP region since that would require gathering the full activations and increase our peak memory usage, defeating the purpose of SP.</p>
1105
 
@@ -1900,24 +1920,75 @@
1900
  <p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SM). Each SM contains and controls a set of streaming processors, also known as cores. For example, an Nvidia H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
1901
 
1902
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
1903
- <p><em>Source: https://blog.codingconfessions.com/p/gpu-computing.</em></p>
1904
 
1905
  <p>The memory side is also highly hierarchical with several layers of cache and memory: <strong>Registers</strong> are the smallest units and are private to the threads during executions, <strong>Shared Memory</strong> and <strong>L1 cache are</strong> shared between the threads running on a single SM, higher up is the <strong>L2 cache</strong> shared by all SMs, finally there is the <strong>Global Memory</strong> which is the largest memory on the GPU (the advertised 80 GB for a H100 for instance) but also the slowest to access and query.</p>
1906
 
1907
  <p><img alt="image.png" src="/assets/images/diving_primergpu2.svg" /></p>
1908
- <p><em>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</em></p>
1909
-
1910
  <p>The goal of GPU will be to run as many workloads as possible, in parallel, on the GPU cores, by taking advantage of this hierarchical organization of compute/memory.</p>
1911
 
1912
  <p>A piece of code running on a core of the GPU is called a <strong>kernel</strong>. It can be written at a high-level in <strong>CUDA</strong> or <strong>Triton</strong> for instance, and is then compiled to Parallel Thread Execution, PTX, the low-level assembly used by NVIDIA GPUs.</p>
1913
 
1914
  <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
1915
 
1916
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1917
  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
1918
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1919
- <p>Figure 6: Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p>
1920
-
1921
  <p>Kernels are generally scheduled as follow:</p>
1922
 
1923
  <ul>
@@ -1953,8 +2024,9 @@
1953
 
1954
  <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
1955
 
1956
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1957
 
 
1958
 
1959
  <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
1960
 
@@ -1982,7 +2054,7 @@
1982
  tl.store(out_ptr0 + (x0), tmp6, xmask)
1983
  </d-code>
1984
 
1985
- <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
1986
 
1987
  <d-code block language="python">
1988
  @triton.jit
@@ -2013,23 +2085,25 @@
2013
 
2014
  <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
2015
 
2016
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
2017
 
2018
- <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
2019
 
2020
- <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
2021
 
2022
- <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
 
 
2023
 
2024
- <ul>
2025
  <li>Pytorch: easy but slow</li>
2026
  <li>torch.compile: easy, fast, but not flexible</li>
2027
  <li>triton: harder, faster, and more flexible</li>
2028
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
2029
 
2030
- </ul>
2031
 
2032
- <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
2033
 
2034
  <h4>Memory Coalescing</h4>
2035
 
@@ -2060,8 +2134,12 @@
2060
 
2061
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2062
 
2063
- <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
2064
- <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
 
 
 
 
2065
 
2066
 
2067
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
@@ -2091,7 +2169,7 @@
2091
  <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2092
 
2093
 
2094
- <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong> !</p>
2095
  <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2096
 
2097
 
@@ -2197,14 +2275,14 @@
2197
 
2198
  <p>A basic implementation of the attention mechanism involve a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM which means that the results need to be sent to HBM and then back to SRAM for the next computations:</p>
2199
 
2200
- <p><img alt="image.png" src="/assets/images/flashattn.png" /></p>
2201
-
2202
  <p>Since bandwidth is much lower in HBM this introduces a severe bottleneck in the attention computation. Can we do better? Tri Dao says yes!</p>
2203
 
2204
  <p>The key element is to compute the S matrices in small pieces which can fit in the smaller shared memory of the SM. But we can do even better and avoid materializing the very large S matrix all together in favor of keeping only the necessary statistics for computing the normalization factor of the softmax. So we can compute part of <d-math>O</d-math> directly in one computation in SRAM rather than moving intermediate results back and forth. In this case, not even do we make use of the shared memory but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length), the attention matrix.</p>
2205
 
2206
  <p><img alt="image.png" src="/assets/images/flashattn2.png" /></p>
2207
- <p>From the FLASH-ATTENTION paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p>
2208
 
2209
  <p>The idea of flash attention resolves so many bottlenecks in model training that it has quickly become the default way to perform attention in all transformers:</p>
2210
  <ul>
@@ -2503,9 +2581,14 @@
2503
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2504
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2505
  </ul>
2506
-
2507
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2508
 
2509
  <h2>References</h2>
2510
 
2511
  <h3>Landmark LLM Scaling Papers</h3>
 
18
  "title": "The Ultra-Scale Playbook: Training LLMs on GPU Clusters",
19
  "description": "This blog covers everything about scaling LLMs in 2025.",
20
  "published": "Feb 19, 2025",
21
+ "affiliation": {"name": "Hugging Face"},
22
  "authors": [
23
+ {
24
+ "author":"Nouamane Tazi",
25
+ "authorURL":"https://huggingface.co/nouamanetazi"
26
+ },
27
+ {
28
+ "author":"Ferdinand Mom",
29
+ "authorURL":"https://huggingface.co/3outeille"
30
+ },
31
+ {
32
+ "author":"Haojun Zhao",
33
+ "authorURL":"https://huggingface.co/zzhhjjj"
34
+ },
35
+ {
36
+ "author":"Phuc Nguyen",
37
+ "authorURL":"https://huggingface.co/neuralink"
38
+ },
39
+ {
40
+ "author":"Mohamed Mekkouri",
41
+ "authorURL":"https://huggingface.co/medmekk"
42
+ },
43
  {
44
  "author":"Leandro Werra",
45
  "authorURL":"https://huggingface.co/lvwerra"
 
222
  </li>
223
  </ul>
224
 
225
+ <aside>If you would rather watch a video on distributed training than read the blog or picotron code, check out <a href="https://www.youtube.com/watch?v=u2VSwDDpaBM&list=PL-_armZiJvAnhcRr6yTJ0__f3Oi-LLi9S">Ferdinand's YouTube channel</a>.</aside>
226
+
227
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->
228
 
229
  <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
 
602
  </ul>
603
 
604
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
605
+ <div class="figure-legend"><p>Example trace showing CPU thread launching kernels asynchronously to GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
606
 
607
  <p>The trace helps identify bottlenecks like:</p>
608
  <ul>
 
1102
 
1103
  <p>In practice we’ll go from the left diagram to the right:</p>
1104
 
1105
+ <p style="text-align: center"><img alt=" in forward: f = no-op ; f* = all-reduce ; g = all-gather ; g* = reduce-scatter
  in backward: f = all-reduce ; f* = no-op ; g = reduce-scatter ; g* = all-gather
+ SP region needs full hidden_dim" src="/assets/images/tp_sp_diagram.png" style="width: 500px" /></p>
1108
 
1109
  <p>The diagram shows how we transition between tensor-parallel and sequence-parallel regions using different collective operations (labeled "f" and "g"). The key challenge is managing these transitions efficiently while keeping memory usage low and maintaining correctness.</p>
1110
 
 
1119
  <li>"f" is an all-reduce to synchronize gradients</li>
1120
  </ul>
1121
 
1122
+ <p>These operations "f" and "f*" are called <strong>conjugate</strong> pairs because they complement each other - when one is a no-op in forward, the other is an all-reduce in backward, and vice versa.</p>
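 <p>As a rough illustration (not the playbook's own code), such a conjugate pair can be sketched with two <code>torch.autograd.Function</code>s, assuming <code>torch.distributed</code> is already initialized on the tensor-parallel process group; the class names below are ours:</p>
 <d-code block language="python">
 # Sketch of a Megatron-style conjugate pair; assumes torch.distributed is initialized.
 import torch
 import torch.distributed as dist

 class CopyToTPRegion(torch.autograd.Function):
     """'f': no-op in forward, all-reduce of the gradient in backward."""
     @staticmethod
     def forward(ctx, x):
         return x

     @staticmethod
     def backward(ctx, grad_output):
         dist.all_reduce(grad_output)   # sum gradients over tensor-parallel ranks
         return grad_output

 class ReduceFromTPRegion(torch.autograd.Function):
     """'f*': all-reduce in forward, no-op in backward."""
     @staticmethod
     def forward(ctx, x):
         dist.all_reduce(x)             # sum partial outputs over tensor-parallel ranks
         return x

     @staticmethod
     def backward(ctx, grad_output):
         return grad_output
 </d-code>
 <p>In a Megatron-style MLP, "f" would wrap the input of the column-parallel linear and "f*" the output of the row-parallel linear.</p>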
1123
 
1124
  <p>For sequence parallelism (SP), we use different operations labeled "g" and "g*". Specifically, we avoid using all-reduce in the SP region since that would require gathering the full activations and increase our peak memory usage, defeating the purpose of SP.</p>
1125
 
 
1920
  <p>On the compute side, GPUs consist of an array of compute units called <strong>Streaming Multiprocessors</strong> (SM). Each SM contains and controls a set of streaming processors, also known as cores. For example, an Nvidia H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
1921
 
1922
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
1923
+ <div class="figure-legend"><p>Source: https://blog.codingconfessions.com/p/gpu-computing</p></div>
1924
 
1925
  <p>The memory side is also highly hierarchical with several layers of cache and memory: <strong>Registers</strong> are the smallest units and are private to the threads during executions, <strong>Shared Memory</strong> and <strong>L1 cache are</strong> shared between the threads running on a single SM, higher up is the <strong>L2 cache</strong> shared by all SMs, finally there is the <strong>Global Memory</strong> which is the largest memory on the GPU (the advertised 80 GB for a H100 for instance) but also the slowest to access and query.</p>
1926
 
1927
  <p><img alt="image.png" src="/assets/images/diving_primergpu2.svg" /></p>
1928
+ <div class="figure-legend"><p>Source: https://www.youtube.com/watch?v=ZQKMZIP3Fzg</p></div>
1929
+
1930
  <p>The goal of GPU will be to run as many workloads as possible, in parallel, on the GPU cores, by taking advantage of this hierarchical organization of compute/memory.</p>
1931
 
1932
  <p>A piece of code running on a core of the GPU is called a <strong>kernel</strong>. It can be written at a high-level in <strong>CUDA</strong> or <strong>Triton</strong> for instance, and is then compiled to Parallel Thread Execution, PTX, the low-level assembly used by NVIDIA GPUs.</p>
1933
 
1934
  <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
1935
 
1936
+ <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
1937
+ <div>
1938
+ <d-code block language="python">
1939
+ // Host code
1940
+ void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
1941
+ // Allocate vectors in device memory
1942
+ int size = n * sizeof(float);
1943
+ float *d_A, *d_B, *d_C;
1944
+ cudaMalloc(&d_A, size);
1945
+ cudaMalloc(&d_B, size);
1946
+ cudaMalloc(&d_C, size);
1947
+
1948
+ // Copy vectors from host memory to device memory
1949
+ cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
1950
+ cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
1951
+
1952
+ // Invoke kernel
1953
+ int threadsPerBlock = 256;
1954
+ int blocksPerGrid =
1955
+ (N + threadsPerBlock - 1) / threadsPerBlock;
1956
+ VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
1957
+
1958
+ // Copy result from device memory to host memory
1959
+ // h_C contains the result in host memory
1960
+ cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
1961
+
1962
+ // Free device memory
1963
+ cudaFree(d_A);
1964
+ cudaFree(d_B);
1965
+ cudaFree(d_C);
1966
+ }</d-code>
1967
+ <div class="figure-legend">
1968
+ <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
1969
+ </div>
1970
+ </div>
1971
+ <div>
1972
+ <d-code block language="python">
1973
+ // Device code
1974
+ __global__ void VecAdd(float* A, float* B, float* C, int N)
1975
+ {
1976
+ int i = blockDim.x * blockIdx.x + threadIdx.x;
1977
+ if (i < N)
1978
+ C[i] = A[i] + B[i];
1979
+ }
1980
+ </d-code>
1981
+ <div class="figure-legend">
1982
+ <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
1983
+
1984
+ </div>
1985
+ </div>
1986
+ </div>
1987
+
1988
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1989
  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
1990
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1991
+ -->
 
1992
  <p>Kernels are generally scheduled as follow:</p>
1993
 
1994
  <ul>
 
2024
 
2025
  <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
2026
 
2027
+ <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
2028
 
2029
+ <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
2030
 
2031
  <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
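 <p>As a minimal sketch (the function below is illustrative, not the exact one used here), dumping the generated code can look like this:</p>
 <d-code block language="python">
 # Illustrative only: run with  TORCH_LOGS="output_code" python example.py
 # to have torch.compile log the Triton kernel it generates.
 import torch

 @torch.compile
 def elementwise(x):
     return torch.sin(x) ** 2 + torch.cos(x) ** 2

 x = torch.randn(1024, 1024, device="cuda")
 elementwise(x)  # first call triggers compilation and logs the generated code
 </d-code>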
2032
 
 
2054
  tl.store(out_ptr0 + (x0), tmp6, xmask)
2055
  </d-code>
2056
 
2057
+ <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
2058
 
2059
  <d-code block language="python">
2060
  @triton.jit
 
2085
 
2086
  <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
2087
 
2088
+ <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
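 <p>For reference, such a benchmark is typically wired up along the following lines; the two providers below are stand-ins, not the exact kernels benchmarked here:</p>
 <d-code block language="python">
 # Sketch of a triton.testing benchmark; both workloads are placeholders.
 import torch
 import triton
 import triton.testing

 @torch.compile
 def compiled_ref(x):
     return torch.relu(x)          # placeholder for the torch.compile reference

 def custom_kernel(x):
     return torch.relu(x)          # placeholder for the standalone Triton kernel

 @triton.testing.perf_report(
     triton.testing.Benchmark(
         x_names=["N"], x_vals=[2**i for i in range(8, 14)],
         line_arg="provider", line_vals=["triton", "compile"],
         line_names=["Triton kernel", "torch.compile"],
         ylabel="ms", plot_name="kernel-benchmark", args={},
     )
 )
 def benchmark(N, provider):
     x = torch.randn(4096, N, device="cuda")
     fn = custom_kernel if provider == "triton" else compiled_ref
     return triton.testing.do_bench(lambda: fn(x))

 benchmark.run(print_data=True)
 </d-code>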
2089
 
2090
+ <p>This standalone kernel even demonstrates superior performance at smaller sizes compared to <code>@torch.compile</code>, but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention on optimizing their performance, saving you a lot of time in the process.</p>
2091
 
2092
+ <p>Even in Triton, however, we sometimes cannot fully reach the peak performance of the device because of the language's limitations in handling low-level details such as shared memory and scheduling within streaming multiprocessors (SMs). Triton's capabilities are restricted to blocks and the scheduling of blocks across SMs. To gain even deeper control, you will need to implement kernels directly in CUDA, where you have access to all the underlying low-level details.</p>
2093
 
2094
+ <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
2095
+
2096
+ <p>Before we dive deeper into CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
2097
 
2098
+ <ol>
2099
  <li>Pytorch: easy but slow</li>
2100
  <li>torch.compile: easy, fast, but not flexible</li>
2101
  <li>triton: harder, faster, and more flexible</li>
2102
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
2103
 
2104
+ </ol>
2105
 
2106
+ <p>Let’s talk about one of the most frequently used techniques in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in the diagram above) has long latency and low bandwidth compared to the cache, which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can greatly improve performance.</p>
2107
 
2108
  <h4>Memory Coalescing</h4>
2109
 
 
2134
 
2135
  <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
2136
 
2137
+ <div class="large-image-background">
2138
+ <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
2139
+ </div>
2140
+ <div class="large-image-background">
2141
+ <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
2142
+ </div>
2143
 
2144
 
2145
  <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
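 <p>A quick back-of-the-envelope check makes this concrete. With an illustrative row-major float32 matrix of width 1024, the addresses read by consecutive threads in the first iteration are 4 kB apart:</p>
 <d-code block language="python">
 # Illustrative only: byte offsets touched by threads (0,0), (1,0), (2,0), ... at i = 0,
 # assuming a row-major float32 matrix A of width N = 1024.
 N, ELEM = 1024, 4

 def offset(row, col):
     return (row * N + col) * ELEM

 # Each thread reads A[threadIdx.x][0]: different rows, same column.
 print([offset(row, 0) for row in range(4)])   # [0, 4096, 8192, 12288] -> 4 kB strides, uncoalesced

 # If consecutive threads instead walked along a row, the reads would be contiguous:
 print([offset(0, col) for col in range(4)])   # [0, 4, 8, 12] -> one coalesced transaction
 </d-code>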
 
2169
  <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2170
 
2171
 
2172
+ <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>!</p>
2173
  <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2174
 
2175
 
 
2275
 
2276
  <p>A basic implementation of the attention mechanism involve a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM which means that the results need to be sent to HBM and then back to SRAM for the next computations:</p>
2277
 
2278
+ <p style="text-align: center"><img alt="image.png" src="/assets/images/flashattn.png" style="width: 500px" /></p>
2279
+
2280
  <p>Since bandwidth is much lower in HBM this introduces a severe bottleneck in the attention computation. Can we do better? Tri Dao says yes!</p>
2281
 
2282
  <p>The key element is to compute the S matrices in small pieces which can fit in the smaller shared memory of the SM. But we can do even better and avoid materializing the very large S matrix all together in favor of keeping only the necessary statistics for computing the normalization factor of the softmax. So we can compute part of <d-math>O</d-math> directly in one computation in SRAM rather than moving intermediate results back and forth. In this case, not even do we make use of the shared memory but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length), the attention matrix.</p>
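 <p>Concretely, the "necessary statistics" boil down to a running maximum and a running normalizer. A tiny sketch of this online-softmax bookkeeping (independent of any particular FlashAttention implementation) is:</p>
 <d-code block language="python">
 # Streaming softmax statistics: one pass over the scores in chunks, keeping only
 # the running max m and the running normalizer l = sum(exp(score - m)).
 import torch

 def softmax_stats(scores, chunk=4):
     m = torch.tensor(float("-inf"))
     l = torch.tensor(0.0)
     for s in scores.split(chunk):
         m_new = torch.maximum(m, s.max())
         l = l * torch.exp(m - m_new) + torch.exp(s - m_new).sum()
         m = m_new
     return m, l

 scores = torch.randn(16)
 m, l = softmax_stats(scores)
 # exp(s - m) / l reproduces the softmax; FlashAttention applies the same rescaling
 # to partial outputs kept in SRAM instead of materializing the full attention matrix.
 print(torch.allclose(torch.exp(scores - m) / l, torch.softmax(scores, dim=0)))
 </d-code>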
2283
 
2284
  <p><img alt="image.png" src="/assets/images/flashattn2.png" /></p>
2285
+ <div class="figure-legend"><p>Source: FlashAttention paper<d-cite bibtex-key="dao2022flashattention"></d-cite></p></div>
2286
 
2287
  <p>The idea of flash attention resolves so many bottlenecks in model training that it has quickly become the default way to perform attention in all transformers:</p>
2288
  <ul>
 
2581
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2582
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2583
  </ul>
2584
+
2585
  <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2586
 
2587
+ <h3>Acknowledgements</h3>
2588
+
2589
+ <p>We thank <a href="https://huggingface.co/eliebak">Elie</a> for conducting thorough reviews and creating the audio components using NotebookLM. Special thanks to <a href="https://huggingface.co/hynky">Hynek</a> for optimizing the frontend performance. We also thank <a href="https://huggingface.co/sbrandeis">Simon</a> for resolving some issues on the hub.</p>
2590
+
2591
+
2592
  <h2>References</h2>
2593
 
2594
  <h3>Landmark LLM Scaling Papers</h3>
dist/main.bundle.js CHANGED
@@ -5396,7 +5396,7 @@ function _loadFragments() {
5396
  while (1) switch (_context5.prev = _context5.next) {
5397
  case 0:
5398
  fragmentName = element.id.replace('fragment-', '');
5399
- fragmentPath = "/fragments/".concat(fragmentName, ".html");
5400
  return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
5401
  var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
5402
  var fetchPromise;
 
5396
  while (1) switch (_context5.prev = _context5.next) {
5397
  case 0:
5398
  fragmentName = element.id.replace('fragment-', '');
5399
+ fragmentPath = "fragments/".concat(fragmentName, ".html");
5400
  return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
5401
  var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
5402
  var fetchPromise;
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
424
  d-code {
425
  font-size: 12px;
426
  }
424
  d-code {
425
  font-size: 12px;
426
  }
427
+
428
+ .large-image-background {
429
+ width: 100vw;
430
+ padding-top: 10px;
431
+ padding-bottom: 10px;
432
+ margin-left: calc(-50vw + 50%);
433
+ margin-right: calc(-50vw + 50%);
434
+ background: white;
435
+ height: fit-content; /* This will make it match the image height */
436
+ display: flex;
437
+ justify-content: center; /* This will center your image */
438
+ }
src/fragmentLoader.js CHANGED
@@ -36,7 +36,7 @@ async function loadFragments() {
36
 
37
  async addFetch(element) {
38
  const fragmentName = element.id.replace('fragment-', '');
39
- const fragmentPath = `/fragments/${fragmentName}.html`;
40
 
41
  return new Promise(async (resolve, reject) => {
42
  try {
 
36
 
37
  async addFetch(element) {
38
  const fragmentName = element.id.replace('fragment-', '');
39
+ const fragmentPath = `fragments/${fragmentName}.html`;
40
 
41
  return new Promise(async (resolve, reject) => {
42
  try {
src/index.html CHANGED
(Same changes as shown above for dist/index.html.)
src/style.css CHANGED
(Same changes as shown above for dist/style.css.)