We ran over 4000 scaling experiments on up to 512 GPUs and measured throughput (size of markers) and GPU utilization (color of markers). Note that both are normalized per model size in this visualization.
- Send the “current keys and values” to the next machine, except during the last time step, in a non-blocking manner so we can start the following step before this one finishes.
- Locally compute the attention score on the “current keys and values” it already has, which typically involves performing $\text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \times V$.
- Wait to receive keys and values from the previous GPU and then circle back to step 1, where the “current keys and values” are now the keys/values just received from the previous GPU.
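Below is a minimal PyTorch sketch of this three-step loop, not the playbook's actual implementation: it assumes `torch.distributed` is already initialized, that each rank holds one query/key/value shard, and that attention is non-causal. The function name `ring_attention` and the tensor shapes are illustrative assumptions.

```python
# Minimal ring attention sketch (illustrative, not a library API).
# Assumes torch.distributed is initialized and each rank holds local
# shards q, k, v of shape (seq_shard, d); attention is non-causal.
import torch
import torch.distributed as dist

def ring_attention(q, k, v):
    rank, world = dist.get_rank(), dist.get_world_size()
    nxt, prv = (rank + 1) % world, (rank - 1) % world
    scale = q.shape[-1] ** -0.5

    # Online-softmax accumulators so per-shard partial results combine exactly.
    acc = torch.zeros_like(q)                                # running numerator
    row_max = q.new_full((*q.shape[:-1], 1), float("-inf"))  # running score max
    denom = torch.zeros_like(row_max)                        # running denominator

    cur_k, cur_v = k, v
    for step in range(world):
        reqs = []
        if step != world - 1:
            # Step 1: pass the current K/V shard to the next GPU without
            # blocking, so the transfer overlaps with the compute below.
            recv_k, recv_v = torch.empty_like(cur_k), torch.empty_like(cur_v)
            ops = [
                dist.P2POp(dist.isend, cur_k.contiguous(), nxt),
                dist.P2POp(dist.isend, cur_v.contiguous(), nxt),
                dist.P2POp(dist.irecv, recv_k, prv),
                dist.P2POp(dist.irecv, recv_v, prv),
            ]
            reqs = dist.batch_isend_irecv(ops)

        # Step 2: attention against the shard we currently hold, i.e.
        # Softmax(Q K^T / sqrt(d)) V restricted to this block of keys/values.
        scores = (q @ cur_k.transpose(-2, -1)) * scale
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        corr = torch.exp(row_max - new_max)  # rescale the old accumulators
        p = torch.exp(scores - new_max)
        acc = acc * corr + p @ cur_v
        denom = denom * corr + p.sum(dim=-1, keepdim=True)
        row_max = new_max

        # Step 3: wait for the shard from the previous GPU, then loop.
        for req in reqs:
            req.wait()
        if step != world - 1:
            cur_k, cur_v = recv_k, recv_v

    return acc / denom
```

Because `corr` rescales the previously accumulated numerator and denominator whenever a new block raises the running maximum, the final `acc / denom` equals exact softmax attention over the full sequence, regardless of the order in which the shards arrive around the ring.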