diff --git "a/dist/index.html" "b/dist/index.html"
--- "a/dist/index.html"
+++ "b/dist/index.html"
@@ -61,7 +61,7 @@

The Ultra-Scale Playbook:
Training LLMs on GPU Clusters

-
+

We ran over 4000 scaling experiments on up to 512 GPUs and measured throughput (size of markers) and GPU utilization (color of markers). Note that both are normalized per model size in this visualization.

@@ -1355,7 +1355,7 @@
  1. Send “current keys and values” to the next machine except during the last time step in a non-blocking manner, so we can start the following step before this step is finished
- 2. Locally compute the attention score on the “current keys and values” it already has, which typically involves performing Softmax(\frac{QK^T}{\sqrt{d}}) * Vd-math>.
+ 2. Locally compute the attention score on the “current keys and values” it already has, which typically involves performing Softmax(\frac{QK^T}{\sqrt{d}}) * V.
  3. Wait to receive keys and values from the previous GPU and then circle back to step 1, where “current keys and values” are now the key/values just received from the previous GPU (a minimal code sketch of one such step follows this list).
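
To make these steps concrete, below is a minimal PyTorch sketch of a single step around the ring, assuming a torch.distributed process group is already initialized and each rank holds its local query shard plus the “current” key/value shard. The function name ring_attention_step and all tensor names are illustrative only (not from the playbook's code), and the online-softmax rescaling needed to correctly merge the partial outputs accumulated across steps is omitted for brevity.

import torch
import torch.distributed as dist

def ring_attention_step(q_local, k_current, v_current):
    # Illustrative sketch, not the playbook's implementation.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    next_rank = (rank + 1) % world_size
    prev_rank = (rank - 1) % world_size

    # Step 1: send the current keys/values to the next GPU without blocking,
    # so the communication overlaps with the local computation below.
    k_recv = torch.empty_like(k_current)
    v_recv = torch.empty_like(v_current)
    reqs = [
        dist.isend(k_current, dst=next_rank),
        dist.isend(v_current, dst=next_rank),
        dist.irecv(k_recv, src=prev_rank),
        dist.irecv(v_recv, src=prev_rank),
    ]

    # Step 2: locally compute Softmax(Q K^T / sqrt(d)) * V on the shard we already hold.
    d = q_local.shape[-1]
    scores = torch.softmax(q_local @ k_current.transpose(-2, -1) / d**0.5, dim=-1)
    partial_out = scores @ v_current

    # Step 3: wait for the keys/values from the previous GPU; they become the
    # "current" shard for the next iteration around the ring.
    for req in reqs:
        req.wait()
    return partial_out, k_recv, v_recv
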
@@ -1874,7 +1874,7 @@

Clearly, none of these techniques is a silver bullet for magical scaling, and we'll often have to combine them in one way or another. Can we actually come up with a few rules to help us find a good starting point for choosing among them and combining them? This will be the topic of our next section.

-How to Find the Best Training Configuration
+Finding the Best Training Configuration

We’ve now covered all the parallelism techniques that are actually used to distribute and train larger models, as well as how and why they can be combined. There remains a general question: which ones should we choose in the end, and how do we decide on a specific combination?