DeepSpeed for HPUs

DeepSpeed enables you to fit and train larger models on HPUs thanks to various optimizations described in the ZeRO paper. In particular, you can use the following ZeRO configurations, which have been validated as fully functional on Gaudi:

ZeRO-1: partitions the optimizer states across processes.
ZeRO-2: partitions the optimizer states and gradients across processes.
ZeRO-3: partitions the optimizer states, gradients, and model parameters across processes.

These configurations are fully compatible with Intel Gaudi Mixed Precision and can thus be used to train your model in bf16 precision.
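
To give a rough sense of what each ZeRO stage saves, the sketch below estimates per-device memory for model states using the accounting from the ZeRO paper: with mixed-precision Adam, each parameter takes 2 bytes for the 16-bit weight, 2 bytes for the 16-bit gradient, and 12 bytes of fp32 optimizer state (master weights, momentum, variance). The function name and the example model size are illustrative assumptions, and activations, temporary buffers, and fragmentation are ignored, so actual HPU memory usage will be higher.

```python
def zero_model_state_bytes(num_params: int, num_devices: int, stage: int) -> float:
    """Approximate per-device bytes for model states under a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes (weights) + 2 bytes (gradients)
    + 12 bytes (fp32 optimizer states) per parameter.
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:  # ZeRO-1: optimizer states are partitioned
        optim /= num_devices
    if stage >= 2:  # ZeRO-2: gradients are partitioned as well
        grads /= num_devices
    if stage >= 3:  # ZeRO-3: model parameters are partitioned as well
        weights /= num_devices
    return weights + grads + optim


# Hypothetical example: a 7B-parameter model on 8 HPUs.
for stage in range(4):
    gib = zero_model_state_bytes(7_000_000_000, 8, stage) / 2**30
    print(f"ZeRO-{stage}: about {gib:.1f} GiB of model states per device")
```

Stage 0 in this sketch stands for plain data parallelism with no partitioning, which is useful as a baseline.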

You can find more information about DeepSpeed Gaudi integration here.

Setup

To use DeepSpeed on Gaudi, you need to install Optimum for Intel Gaudi and the DeepSpeed fork for Intel Gaudi with:

pip install optimum[habana]
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0

Using DeepSpeed with Optimum for Intel Gaudi

The GaudiTrainer allows using DeepSpeed as easily as the Transformers Trainer. This can be done in 3 steps:

A DeepSpeed configuration has to be defined.
The deepspeed training argument lets you specify the path to the DeepSpeed configuration.
The deepspeed launcher must be used to run your script.

These steps are detailed below. A comprehensive guide about how to use DeepSpeed with the Transformers Trainer is also available here.

DeepSpeed configuration

The DeepSpeed configuration to use is passed through a JSON file and enables you to choose the optimizations to apply. Here is an example for applying ZeRO-2 optimizations and bf16 precision:

{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}

The special value "auto" tells the integration to fill in the correct or most efficient value automatically. You can also specify the values yourself, but if you do so, make sure they do not conflict with your training arguments. It is strongly advised to read this section in the Transformers documentation to completely understand how this works.
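
As an illustration of what a conflict looks like, here is a minimal sketch of how the three batch-related "auto" values relate to the usual training arguments. It mimics the relationship DeepSpeed enforces (global batch size = per-device batch size × gradient accumulation steps × number of devices); it is not the code used by the Trainer integration, and the function name `resolve_auto` is hypothetical.

```python
def resolve_auto(ds_config: dict, per_device_batch_size: int,
                 grad_accum_steps: int, world_size: int) -> dict:
    """Fill in "auto" batch settings and reject explicit values that conflict."""
    expected = {
        "train_micro_batch_size_per_gpu": per_device_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "train_batch_size": per_device_batch_size * grad_accum_steps * world_size,
    }
    resolved = dict(ds_config)
    for key, value in expected.items():
        # In this sketch, a missing key is treated the same way as "auto".
        current = resolved.get(key, "auto")
        if current == "auto":
            resolved[key] = value
        elif current != value:
            raise ValueError(
                f"{key}={current} conflicts with the training arguments (expected {value})"
            )
    return resolved


config = {"train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto"}
print(resolve_auto(config, per_device_batch_size=4, grad_accum_steps=2, world_size=8))
```

Setting, for example, "train_batch_size": 32 in the configuration while training with the arguments above would raise an error in this sketch, which is the kind of mismatch to avoid.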

Intel provides other example configurations for HPUs here.

The Transformers documentation explains how to write a configuration from scratch very well. A more complete description of all configuration possibilities is available here.
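
If you prefer to generate the configuration file from Python rather than write the JSON by hand, a minimal sketch could look like the following. It reproduces the ZeRO-2 example above; the helper name `write_deepspeed_config` and the file name `ds_config.json` are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Same settings as the ZeRO-2 + bf16 example configuration.
ds_config = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,
        "reduce_scatter": False,
        "contiguous_gradients": False,
    },
}


def write_deepspeed_config(config: dict, path: Path) -> Path:
    """Serialize the configuration as JSON and return the file path."""
    path.write_text(json.dumps(config, indent=4) + "\n")
    return path


# Written to a temporary directory here; use your own location in practice.
config_path = write_deepspeed_config(ds_config, Path(tempfile.mkdtemp()) / "ds_config.json")
print(config_path)
```

The resulting path is what you would then pass as the deepspeed training argument.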

The deepspeed training argument

To use DeepSpeed, you must specify deepspeed=path_to_my_deepspeed_config in your GaudiTrainingArguments instance:

from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    # my usual training arguments...
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config,
    deepspeed=path_to_my_deepspeed_config,
)

This argument both indicates that DeepSpeed should be used and points to your DeepSpeed configuration.

Launching your script

Finally, there are two possible ways to launch your script:

Using the gaudi_spawn.py script:

python gaudi_spawn.py \
    --world_size number_of_hpu_you_have --use_deepspeed \
    path_to_script.py --args1 --args2 ... --argsN \
    --deepspeed path_to_deepspeed_config

where --argX is an argument of the script to run with DeepSpeed.

Using the DistributedRunner directly in code:

from optimum.habana.distributed import DistributedRunner
from optimum.utils import logging

world_size = 8  # Number of HPUs to use (1 or 8)

# define distributed runner
distributed_runner = DistributedRunner(
    command_list=["scripts/train.py --args1 --args2 ... --argsN --deepspeed path_to_deepspeed_config"],
    world_size=world_size,
    use_deepspeed=True,
)

# start job
ret_code = distributed_runner.run()

You should set "use_fused_adam": false in your Gaudi configuration because it is not compatible with DeepSpeed yet.
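
As a minimal sketch, assuming your Gaudi configuration is stored as a local JSON file, the setting could be applied programmatically as shown below. The helper name `disable_fused_adam` and the file name `gaudi_config.json` are illustrative; only the "use_fused_adam" key comes from the note above.

```python
import json
import tempfile
from pathlib import Path


def disable_fused_adam(gaudi_config_path: Path) -> dict:
    """Set "use_fused_adam" to false in a Gaudi configuration JSON file."""
    config = json.loads(gaudi_config_path.read_text())
    config["use_fused_adam"] = False
    gaudi_config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway file; point this at your own Gaudi configuration instead.
demo_path = Path(tempfile.mkdtemp()) / "gaudi_config.json"
demo_path.write_text(json.dumps({"use_fused_adam": True}))
print(disable_fused_adam(demo_path))
```

Any other keys already present in the file are preserved; only "use_fused_adam" is changed.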
