Low Precision Training Methods

Accelerate provides integrations for training in lower precision on supported hardware through the TransformersEngine and MS-AMP packages. This documentation will guide you through which hardware is supported, how to configure your Accelerator to leverage the low precision methods, and what you can expect when training.

What training on FP8 means

To explore more of the nitty-gritty of training in FP8 with PyTorch and Accelerate, check out the concept guide on why this can be difficult. Essentially, rather than training in BF16, some (or all) aspects of training a model can be performed using 8 bits instead of 16. The challenge is doing so without degrading final performance.

This is only enabled on specific NVIDIA hardware, namely:

Anything after the 3000 series consumer graphics cards (such as the 4090)
Hopper-based GPU architectures (such as the H100 and H200)
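
As a rough sketch of the hardware requirement above, the check below compares a CUDA compute capability against 8.9, which is what Ada-generation cards such as the 4090 report (Hopper reports 9.0). The helper name supports_fp8 is hypothetical and not part of Accelerate. On a real machine the tuple would typically come from torch.cuda.get_device_capability(), and a passing check says nothing about whether a backend is installed.

```python
def supports_fp8(capability):
    """Return True if a (major, minor) CUDA compute capability is 8.9 or newer.

    Hypothetical helper for illustration only; it is not an Accelerate API.
    """
    return tuple(capability) >= (8, 9)


# Compute capabilities of a few well-known GPUs.
print(supports_fp8((8, 6)))  # RTX 3090 (Ampere): False
print(supports_fp8((8, 9)))  # RTX 4090 (Ada): True
print(supports_fp8((9, 0)))  # H100 (Hopper): True
```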

Training in FP8 reduces the memory used (the memory needed for some parts of training is cut in half), and larger models that can replace certain layers with FP8-enabled ones should also see an increase in throughput.
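
To make the memory claim concrete, here is a back-of-the-envelope sketch. It assumes that only some fraction of a BF16 footprint is moved to FP8 and that everything else stays at 16 bits. The function name and the example numbers are hypothetical, and real savings depend on the backend and the model.

```python
def mixed_footprint(bf16_bytes, fp8_fraction):
    """Estimate memory after storing a fraction of a BF16 footprint in FP8.

    FP8 uses 1 byte per value instead of 2, so the converted part shrinks by half.
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    converted = bf16_bytes * fp8_fraction
    return (bf16_bytes - converted) + converted / 2


# Hypothetical example: 16 GB of BF16 tensors, half of them moved to FP8.
print(mixed_footprint(16e9, 0.5) / 1e9)  # 12.0
```

Converting half of the tensors therefore saves a quarter of the total, not half, which is why the gain applies to "some parts" of training.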

Configuring the Accelerator

Currently two different backends for FP8 are supported (TransformersEngine and MS-AMP), each with different capabilities and configurations.

Both backends use the same core API. Pass mixed_precision="fp8" to the Accelerator, select it during accelerate config when prompted about mixed precision, or set it in the mixed_precision key of your config.yaml file:

from accelerate import Accelerator
accelerator = Accelerator(mixed_precision="fp8")

By default, if MS-AMP is available in your environment, Accelerate will automatically use it as the backend. To specify the backend yourself (and customize other parts of the FP8 mixed precision setup), use utils.FP8RecipeKwargs, or set it in your config yaml or during accelerate launch:

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs
kwargs = [FP8RecipeKwargs(backend="msamp")]
# Or to specify the backend as `TransformersEngine` even if MS-AMP is installed
# kwargs = [FP8RecipeKwargs(backend="te")]
accelerator = Accelerator(mixed_precision="fp8", kwarg_handlers=kwargs)

In your config.yaml, the backend is set under the fp8_config key:

mixed_precision: fp8
fp8_config:
  amax_compute_algorithm: max
  amax_history_length: 1024
  backend: TE
  fp8_format: HYBRID
  interval: 1
  margin: 0
  override_linear_precision: false
  use_autocast_during_eval: false

Configuring MS-AMP

Of the two, MS-AMP is traditionally the easier one to configure as there is only a single argument: the optimization level.

Currently two levels of optimization are supported in the Accelerate integration, "O1" and "O2" (using the letter ‘o’, not zero).

"O1" casts the weight gradients and all_reduce communications to 8 bit, while the rest is done in 16 bit. This reduces overall GPU memory usage and speeds up communication.
"O2" additionally casts first-order optimizer states to 8 bit, while the second-order states are kept in FP16 (currently only the Adam optimizer is supported). This level saves the most memory while trying to minimize final accuracy degradation.

To specify an optimization level, pass it to FP8RecipeKwargs by setting the opt_level argument:

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs
# opt_level matches the opt_level key in config.yaml and the --fp8_opt_level launch flag
kwargs = [FP8RecipeKwargs(backend="msamp", opt_level="O2")]
accelerator = Accelerator(mixed_precision="fp8", kwarg_handlers=kwargs)

Or during accelerate launch via --fp8_backend=msamp --fp8_opt_level=O2

Similarly this can be set in your config.yaml:

mixed_precision: fp8
fp8_config:
    backend: MSAMP
    opt_level: O2
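
To put the "O2" description above in numbers, the sketch below estimates the memory taken by the two Adam states. The "O2" precisions are the ones described in this section. The FP32 baseline is an assumption about a conventional Adam setup, not something stated by MS-AMP or Accelerate, and all names are hypothetical.

```python
# Bytes per value for each precision used below.
BYTES = {"fp32": 4, "fp16": 2, "fp8": 1}

# (first-order state, second-order state) precision per setup.
ADAM_STATE_PRECISION = {
    "baseline": ("fp32", "fp32"),  # assumed reference point, not from MS-AMP
    "O2": ("fp8", "fp16"),         # as described for the "O2" level
}


def adam_state_bytes(num_params, level):
    """Return the bytes needed to hold both Adam states for num_params weights."""
    first, second = ADAM_STATE_PRECISION[level]
    return num_params * (BYTES[first] + BYTES[second])


# Hypothetical model with one billion parameters.
n = 1_000_000_000
print(adam_state_bytes(n, "baseline") / 1e9)  # 8.0 (GB)
print(adam_state_bytes(n, "O2") / 1e9)        # 3.0 (GB)
```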

Configuring TransformersEngine

TransformersEngine offers many more options for customizing how and which FP8 calculations are performed. A full list of supported arguments and what they mean is available in NVIDIA’s documentation; they are also restated in the FP8RecipeKwargs docstring for your convenience.

Accelerate tries to set sensible defaults, but exploring and tweaking the various parameters yourself can potentially lead to better performance.

To use it, specify backend="te" and modify any of the arguments you want as part of your kwarg handler:

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs
kwargs = [FP8RecipeKwargs(backend="te", ...)]
accelerator = Accelerator(mixed_precision="fp8", kwarg_handlers=kwargs)

Or during accelerate launch via --fp8_backend=te .... Use accelerate launch --fp8_backend=te -h to see relevant arguments.

Similarly this can be set in your config.yaml:

mixed_precision: fp8
fp8_config:
    amax_compute_algorithm: max
    amax_history_length: 1024
    backend: TE
    fp8_format: HYBRID
    interval: 1
    margin: 0
    override_linear_precision: false
    use_autocast_during_eval: false
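
To give a feel for what amax_compute_algorithm, amax_history_length, margin, and fp8_format control, here is a minimal pure-Python sketch of delayed scaling. It follows the formula given in NVIDIA's documentation, scale = (FP8_MAX / amax) / 2^margin. The function name and the most-recent-last ordering of the history are assumptions made for illustration; this is not the TransformersEngine implementation.

```python
# Largest representable magnitude of each FP8 format.
# HYBRID uses E4M3 in the forward pass and E5M2 for gradients in the backward pass.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def scaling_factor(amax_history, fp8_format="E4M3", margin=0, amax_compute_algo="max"):
    """Derive a scaling factor from a history of absolute-maximum values.

    amax_history holds the largest absolute tensor value seen in recent steps,
    most recent last (an assumption of this sketch).
    """
    if amax_compute_algo == "max":
        amax = max(amax_history)
    elif amax_compute_algo == "most_recent":
        amax = amax_history[-1]
    else:
        raise ValueError(f"unknown algorithm: {amax_compute_algo}")
    return (FP8_MAX[fp8_format] / amax) / (2 ** margin)


history = [1.0, 4.0, 2.0]
print(scaling_factor(history))                                   # 112.0
print(scaling_factor(history, amax_compute_algo="most_recent"))  # 224.0
print(scaling_factor(history, margin=1))                         # 56.0
```

A longer history with the max algorithm yields a more conservative scale that is less likely to overflow, while a larger margin leaves extra headroom at the cost of precision.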

Example Zoo

We have examples showcasing training with FP8, both with Accelerate and with its underlying implementation, available in the accelerate repo. Currently we support scripts showcasing:

Single GPU
Distributed Data Parallelism (Multi-GPU)
Fully Sharded Data Parallelism
DeepSpeed ZeRO 1 through 3

Find out more in the accelerate repo.