Accelerate documentation

FSDP vs DeepSpeed

You are viewing v1.1.0 version. A newer version v1.2.1 is available.

Accelerate offers flexibility in the choice of training framework by integrating two extremely powerful tools for distributed training, namely PyTorch FSDP and Microsoft DeepSpeed. The aim of this tutorial is to draw parallels between the two, as well as to outline potential differences, so that the user can switch seamlessly between these two frameworks.

To switch between the frameworks, we recommend launching code with accelerate launch, either passing in the correct config file with --config_file or passing in the respective arguments for FSDP and DeepSpeed directly.
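As a rough sketch of the two launch styles, the snippet below assembles the accelerate launch command lines without running them. The file names fsdp_config.yaml and train.py are placeholders, the helper functions are hypothetical, and the flag values are illustrative only.

```python
import shlex


def launch_with_config(config_file, script):
    """Command that reads every setting from an Accelerate config file."""
    return ["accelerate", "launch", "--config_file", config_file, script]


def launch_with_args(framework, script):
    """Command that passes the framework settings directly as arguments."""
    if framework == "fsdp":
        flags = ["--use_fsdp", "--fsdp_sharding_strategy", "FULL_SHARD"]
    elif framework == "deepspeed":
        flags = ["--use_deepspeed", "--zero_stage", "3"]
    else:
        raise ValueError(f"unknown framework: {framework}")
    return ["accelerate", "launch", *flags, script]


# Placeholder file names; substitute your own config and training script.
fsdp_cmd = launch_with_config("fsdp_config.yaml", "train.py")
ds_cmd = launch_with_args("deepspeed", "train.py")

print(shlex.join(fsdp_cmd))
print(shlex.join(ds_cmd))
```

Switching frameworks then amounts to swapping the config file (or the flag group) while the training script stays the same.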

Example Accelerate configurations are available for both DeepSpeed and FSDP, including in the example zoo under “Launch Configurations”.

This tutorial covers single-node, multi-GPU scenarios only.

Configuring Functionalities

Model tensors are split across different GPUs in an attempt to scale up model sizes; this is termed sharding in FSDP and partitioning in DeepSpeed. FSDP sharding and DeepSpeed ZeRO (partitioning) stages are configured by --fsdp_sharding_strategy and --zero_stage, respectively. In particular, FSDP FULL_SHARD maps to DeepSpeed ZeRO stage 3; the Accelerate documentation gives a comprehensive mapping between FSDP sharding and DeepSpeed ZeRO settings. The table below summarizes and groups similar settings:

| Group | Framework | Configuration | Example | Restrictions (if any) |
| --- | --- | --- | --- | --- |
| sharding / partitioning | FSDP | --fsdp_sharding_strategy | 1 (FULL_SHARD) | |
| | DeepSpeed | --zero_stage | 3 | |
| offload | FSDP | --fsdp_offload_params | true | all or nothing |
| | DeepSpeed | --offload_param_device | cpu | |
| | DeepSpeed | --offload_optimizer_device | cpu | |
| model loading | FSDP | --fsdp_cpu_ram_efficient_loading | true | |
| | DeepSpeed | --zero3_init_flag | true | only ZeRO 3 |
| efficient checkpointing | FSDP | --fsdp_state_dict_type | SHARDED_STATE_DICT | |
| | DeepSpeed | --zero3_save_16bit_model | true | only ZeRO 3 |
| weights prefetching | FSDP | --fsdp_forward_prefetch | true | |
| | FSDP | --fsdp_backward_prefetch | BACKWARD_PRE | |
| | DeepSpeed | None | | |
| model | FSDP | --fsdp_auto_wrap_policy | TRANSFORMER_BASED_WRAP | |
| | FSDP | --fsdp_transformer_layer_cls_to_wrap | <Layer Class> | Usually not needed |
| | DeepSpeed | None | | Transparent to user |
| parameters summoning | FSDP | --fsdp_use_orig_params | true | required for torch.compile |
| | DeepSpeed | None | | Transparent to user |
| parameters syncing | FSDP | --fsdp_sync_module_states | true | |
| | DeepSpeed | None | | |
| training | FSDP | None | | Transparent to user |
| | DeepSpeed | --gradient_accumulation_steps | auto | |
| | DeepSpeed | --gradient_clipping | auto | |

For detailed descriptions of the above, refer to the Accelerate launch documentation.
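The sharding / partitioning correspondence can be written down as a small lookup, sketched below. Only the FULL_SHARD to ZeRO stage 3 pairing is stated on this page; the SHARD_GRAD_OP and NO_SHARD entries follow the mapping given in the Accelerate FSDP usage guide and are worth checking against the version you have installed. The helper name zero_stage_for is hypothetical.

```python
# FSDP sharding strategy -> closest DeepSpeed ZeRO stage.
# Only FULL_SHARD -> 3 is stated on this page; the rest is taken from the
# Accelerate FSDP usage guide and should be verified for your version.
FSDP_TO_ZERO_STAGE = {
    "FULL_SHARD": 3,     # parameters, gradients, and optimizer states are sharded
    "SHARD_GRAD_OP": 2,  # gradients and optimizer states are sharded
    "NO_SHARD": 0,       # nothing is sharded (plain data parallelism)
}


def zero_stage_for(fsdp_sharding_strategy):
    """Return the ZeRO stage that most closely matches an FSDP strategy."""
    try:
        return FSDP_TO_ZERO_STAGE[fsdp_sharding_strategy]
    except KeyError:
        raise ValueError(
            f"no direct ZeRO equivalent recorded for {fsdp_sharding_strategy!r}"
        ) from None


print(zero_stage_for("FULL_SHARD"))
```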

To access other DeepSpeed configurations, such as mixed precision settings, you need to pass in a --deepspeed_config_file; see the documentation.

DeepSpeed can also be configured via DeepSpeedPlugin, e.g., DeepSpeedPlugin.zero_stage is equivalent to --zero_stage, and DeepSpeedPlugin.hf_ds_config can be used to pass --deepspeed_config_file.

FSDP can also be configured via FullyShardedDataParallelPlugin, e.g., FullyShardedDataParallelPlugin.sharding_strategy is equivalent to --fsdp_sharding_strategy.
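To make the parallels concrete, the sketch below writes roughly equivalent settings for the two frameworks as plain Python dictionaries, mirroring the key names used in Accelerate config files (the launch flags without the leading dashes). The particular values are illustrative assumptions rather than recommended defaults, and a real config file contains further keys (number of processes, mixed precision, and so on).

```python
import json

# Illustrative FSDP settings (values are assumptions, not recommendations).
fsdp_settings = {
    "distributed_type": "FSDP",
    "fsdp_config": {
        "fsdp_sharding_strategy": "FULL_SHARD",
        "fsdp_offload_params": False,
        "fsdp_cpu_ram_efficient_loading": True,
        "fsdp_sync_module_states": True,
        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
        "fsdp_backward_prefetch": "BACKWARD_PRE",
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_use_orig_params": True,
    },
}

# Roughly corresponding DeepSpeed settings; groups that are transparent
# to the user under DeepSpeed simply have no key here.
deepspeed_settings = {
    "distributed_type": "DEEPSPEED",
    "deepspeed_config": {
        "zero_stage": 3,
        "offload_param_device": "none",
        "offload_optimizer_device": "none",
        "zero3_init_flag": True,
        "zero3_save_16bit_model": True,
        "gradient_accumulation_steps": 1,
        "gradient_clipping": 1.0,
    },
}

print(json.dumps(fsdp_settings, indent=2))
print(json.dumps(deepspeed_settings, indent=2))
```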

Checkpointing

Note that FSDP can be configured via --fsdp_state_dict_type to save either full or sharded checkpoints.

For DeepSpeed ZeRO 3, one can pass --zero3_save_16bit_model true, which conveniently consolidates the model to a single rank and saves it; this is the DeepSpeed equivalent of FSDP's fsdp_state_dict_type: FULL_STATE_DICT.

For large models, consolidating the model to a single rank can be very slow.

For quicker checkpointing, use fsdp_state_dict_type: SHARDED_STATE_DICT for FSDP, and for DeepSpeed ZeRO 3 use the zero_to_fp32.py script to convert sharded checkpoints after training.
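A back-of-the-envelope calculation shows why consolidation is costly. The sketch below compares how many bytes each rank has to write for a full (consolidated) versus a sharded checkpoint. It assumes a hypothetical 7-billion-parameter model saved in 16-bit precision on 8 GPUs, counts weights only, and assumes perfectly even sharding; real checkpoints also carry optimizer state and metadata.

```python
def bytes_written_per_rank(num_params, bytes_per_param, world_size, sharded):
    """Bytes each rank writes when saving model weights only."""
    total = num_params * bytes_per_param
    if sharded:
        # Every rank writes its own equal slice in parallel.
        return [total / world_size] * world_size
    # The whole model is gathered onto rank 0, which writes everything.
    return [total] + [0] * (world_size - 1)


# Hypothetical example: 7e9 parameters, 2 bytes each, 8 GPUs.
full = bytes_written_per_rank(7_000_000_000, 2, 8, sharded=False)
shard = bytes_written_per_rank(7_000_000_000, 2, 8, sharded=True)

print(f"full:    rank 0 writes {full[0] / 2**30:.2f} GiB")
print(f"sharded: each rank writes {shard[0] / 2**30:.2f} GiB")
```

The total volume is identical in both cases; the difference is that the consolidated save funnels all of it through one rank after an extra gather.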

Offloading

FSDP only allows all-or-nothing offload (i.e., either offload parameters, gradients, and optimizer, or keep them all on the GPU), whereas DeepSpeed can offload parameters and optimizer independently. Furthermore, DeepSpeed also supports offloading to NVMe.
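The difference in granularity can be sketched as follows: FSDP exposes a single boolean, while DeepSpeed takes one device per component. The two helper functions are hypothetical and only assemble launch flags; the device names none, cpu, and nvme are the values accepted by the corresponding accelerate launch options.

```python
VALID_DEVICES = {"none", "cpu", "nvme"}


def fsdp_offload_flags(offload):
    """FSDP: one switch covers parameters, gradients, and optimizer."""
    return ["--fsdp_offload_params", "true" if offload else "false"]


def deepspeed_offload_flags(param_device="none", optimizer_device="none"):
    """DeepSpeed: parameters and optimizer are placed independently."""
    for device in (param_device, optimizer_device):
        if device not in VALID_DEVICES:
            raise ValueError(f"unsupported offload device: {device}")
    return [
        "--offload_param_device", param_device,
        "--offload_optimizer_device", optimizer_device,
    ]


print(fsdp_offload_flags(True))
# Keep parameters on the GPU but move the optimizer state to CPU.
print(deepspeed_offload_flags(param_device="none", optimizer_device="cpu"))
```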

Prefetching

FSDP allows two prefetching configurations, --fsdp_forward_prefetch and --fsdp_backward_prefetch, to improve the overlap of communication and computation at the cost of extra memory; see the FSDP documentation. For DeepSpeed, prefetching is turned on when needed, depending on certain hyperparameters such as stage3_param_persistence_threshold and stage3_max_reuse_distance, which can be configured for ZeRO 3. Accelerate may set these hyperparameters automatically if you don't set them explicitly in the DeepSpeed config file.

For FSDP, set fsdp_backward_prefetch: BACKWARD_PRE for improved throughput if memory allows.

Model Loading

While FSDP requires an explicit --fsdp_cpu_ram_efficient_loading true to activate efficient model loading, transformers will activate a similar feature whenever DeepSpeed ZeRO 3 is used.

For FSDP, whenever --fsdp_cpu_ram_efficient_loading true is set, accelerate will automatically set sync_module_states to true. With RAM-efficient loading, the weights are loaded on a single rank only, so sync_module_states is required to broadcast the weights to the other ranks.

Model

FSDP requires an explicit --fsdp_auto_wrap_policy for the algorithm to decide how to schedule the all-gather and reduce-scatter operations; for DeepSpeed this is transparent to the user.

For FSDP, simply set fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP. With the latest transformers versions, we try our best to figure out a suitable fsdp_transformer_layer_cls_to_wrap for HF transformers models. However, if you get an error regarding it, specify fsdp_transformer_layer_cls_to_wrap explicitly.

Parameters Summoning

FSDP requires an explicit --fsdp_use_orig_params flag if using torch.compile; see the PyTorch documentation. For DeepSpeed this is transparent to the user.

For FSDP, when using torch.compile, set fsdp_use_orig_params: True.

Training

DeepSpeed requires explicit --gradient_accumulation_steps and --gradient_clipping flags. For FSDP this is transparent to the user.

When using DeepSpeed, set gradient_accumulation_steps: "auto" and gradient_clipping: "auto" to automatically pick up values set in the Accelerator or TrainingArguments (if using transformers).
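As an illustration of the "auto" values, the sketch below writes a minimal DeepSpeed config file that could be passed via --deepspeed_config_file. It is deliberately incomplete: a real config usually needs further keys (batch size, mixed precision, and so on), and the output file name ds_config.json is a placeholder.

```python
import json
import os
import tempfile

# Minimal fragment: "auto" defers to the values set in the Accelerator
# (or TrainingArguments when using transformers).
ds_config = {
    "zero_optimization": {"stage": 3},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# Write to a temporary directory; ds_config.json is a placeholder name.
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

# Read it back to confirm the file is valid JSON.
with open(config_path) as f:
    loaded = json.load(f)

print(config_path)
print(loaded)
```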

On Differences in Data Precision Handling

To discuss how data precision is handled in both FSDP and DeepSpeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, a preparation step first “flattens” them into one-dimensional torch.Tensor objects. FSDP and DeepSpeed differ in the dtype in which these “flattened” parameters are stored, which in turn affects the dtype in which the torch.optim.Optimizer allocates its state. The table below outlines the processes for both frameworks; the “Local” column marks processes that occur at a per-GPU level, so any memory overhead from upcasting should be understood to be amortized over the number of GPUs used.

As a rule of thumb, for stable training with automatic mixed precision, all the trainable parameters have to be in torch.float32.

| Process | Local | Framework | Details |
| --- | --- | --- | --- |
| Loading, i.e., AutoModel.from_pretrained(..., torch_dtype=torch_dtype) | | | |
| Preparation, i.e., creation of “flat params” | yes | FSDP | created in torch_dtype |
| | | DeepSpeed | disregards torch_dtype, created in float32 |
| Optimizer initialization | yes | FSDP | creates parameters in torch_dtype |
| | | DeepSpeed | creates parameters in float32 |
| Training Step, i.e., forward, backward, reduction | | FSDP | follows MixedPrecision |
| | | DeepSpeed | follows deepspeed_config_file mixed precision settings |
| Optimizer (Pre-Step) | yes | FSDP | upcasting (if any) to torch_dtype |
| | | DeepSpeed | upcasted to float32 |
| Optimizer (Actual Step) | yes | FSDP | occurs in torch_dtype |
| | | DeepSpeed | occurs in float32 |

Therefore, when using DeepSpeed with a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preparation.

With FSDP, in the absence of mixed precision, it is possible to operate the torch.optim.Optimizer in the low-precision torch_dtype, which may be helpful when using a small number of GPUs.

With mixed precision, FSDP and DeepSpeed will upcast in the model preparation step (cf. table above). Note, however, that FSDP will then save checkpoints in the upcasted precision, whereas DeepSpeed may still save low-precision checkpoints if --zero3_save_16bit_model is specified.

To clarify the above table, consider the concrete examples below; the optimizer pre-step and actual step are combined for brevity. With FSDP it is possible to operate in the two modes shown below, but DeepSpeed can only operate in one.

| Framework | Model Loading (torch_dtype) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
| --- | --- | --- | --- | --- | --- |
| FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
| FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
| DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
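To get a feel for what the “Local” columns mean in terms of memory, the sketch below estimates the per-GPU footprint of the flat parameters plus optimizer state for the three rows above. It is a rough estimate under stated assumptions: a hypothetical 7-billion-parameter model, an Adam-style optimizer holding two state tensors per parameter, perfectly even sharding, and no accounting for gradients, activations, or buffers.

```python
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}


def local_state_gib(num_params, world_size, prep_dtype, optim_dtype,
                    optim_states_per_param=2):
    """Rough per-GPU GiB for flat params plus optimizer state."""
    shard = num_params / world_size  # elements held by one GPU
    flat_params = shard * BYTES_PER_ELEMENT[prep_dtype]
    optim_state = shard * BYTES_PER_ELEMENT[optim_dtype] * optim_states_per_param
    return (flat_params + optim_state) / 2**30


# (label, preparation dtype, optimizer dtype), following the table rows.
modes = [
    ("FSDP, no mixed precision", "bf16", "bf16"),
    ("FSDP, bf16 mixed precision", "fp32", "fp32"),
    ("DeepSpeed, bf16 mixed precision", "fp32", "fp32"),
]

num_params = 7_000_000_000  # hypothetical model size
for world_size in (2, 8):
    for label, prep, optim in modes:
        gib = local_state_gib(num_params, world_size, prep, optim)
        print(f"{world_size} GPUs | {label}: {gib:.1f} GiB per GPU")
```

Under these assumptions the upcast modes need twice the local memory of the pure bf16 mode, and the absolute gap shrinks as the number of GPUs grows, which is why the overhead matters most on small setups.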