Fully Sharded Data Parallel (FSDP)

Overview

Fully Sharded Data Parallel (FSDP) is a type of data-parallel training. Unlike traditional data parallelism, which maintains a per-GPU copy of the model’s parameters, gradients, and optimizer states, FSDP shards all of these states across the data-parallel workers and can optionally offload the sharded model parameters to CPUs.
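For background, the following is a minimal sketch of how these sharding choices look in plain PyTorch FSDP, outside of NeMo. The model, hyperparameters, and launch assumptions (torchrun setting RANK, WORLD_SIZE, and LOCAL_RANK) are illustrative only.

# Minimal PyTorch FSDP sketch (illustrative, not NeMo code).
# Assumes the script is launched with torchrun so the distributed
# environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are set.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    CPUOffload,
)

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy model standing in for a real transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Shard parameters, gradients, and optimizer states across all ranks
    # ('full' sharding); optionally offload sharded parameters to CPU.
    fsdp_model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        cpu_offload=CPUOffload(offload_params=False),
    )

    # The optimizer only sees this rank's shard of the parameters.
    optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = fsdp_model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

NeMo wires this up for you through the configuration options described below; the sketch only illustrates the underlying mechanism.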

NeMo Framework supports FSDP for GPT-based models such as GPT-3 and Llama.

Usage

Model Training

Define the FSDP settings in the model configuration:

megatron_amp_O2: False # megatron_amp_O2 is not supported by FSDP
fsdp: True # Enable training with torch FSDP.
fsdp_sharding_strategy: 'full' # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
fsdp_grad_reduce_dtype: 'bf16' # Gradient reduction data type.
fsdp_sharded_checkpoint: False # Store and load FSDP sharded checkpoints.

optim:
   name: fused_adam # distributed_fused_adam is currently not supported by FSDP

Please note that FSDP does not currently support the distributed_fused_adam optimizer or megatron_amp_O2.
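The fsdp_sharding_strategy values select among the standard torch FSDP sharding strategies. The mapping below is an illustrative sketch of what each option means; it is not NeMo source code:

# Illustrative mapping (not NeMo source) from the config strings above
# to the corresponding torch FSDP sharding strategies.
from torch.distributed.fsdp import ShardingStrategy

FSDP_SHARDING_STRATEGIES = {
    "full": ShardingStrategy.FULL_SHARD,      # shard params, grads, and optimizer states
    "hybrid": ShardingStrategy.HYBRID_SHARD,  # full shard within a node, replicate across nodes
    "grad": ShardingStrategy.SHARD_GRAD_OP,   # shard grads and optimizer states only
}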

Model Fine-Tuning

Define the FSDP settings in the model configuration:

megatron_amp_O2: False # megatron_amp_O2 is not supported by FSDP
fsdp: True # Enable training with torch FSDP.
fsdp_sharding_strategy: 'full' # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
fsdp_grad_reduce_dtype: 'bf16' # Gradient reduction data type.
fsdp_sharded_checkpoint: False # Store and load FSDP sharded checkpoints.
fsdp_use_orig_params: False # Set to True when using FSDP with a parameter-efficient fine-tuning scheme (ptuning, lora, adapter, etc.).

optim:
   name: fused_adam # distributed_fused_adam is currently not supported by FSDP

Please note that FSDP does not currently support the distributed_fused_adam optimizer or megatron_amp_O2.
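To see why fsdp_use_orig_params must be True for parameter-efficient fine-tuning, the sketch below freezes a base layer and trains only a small adapter under torch FSDP. With use_orig_params=True, FSDP accepts parameters with mixed requires_grad flags, so the optimizer can be built over just the trainable ones. The module, sizes, and setup are illustrative only, not NeMo code.

# Illustrative torch FSDP sketch (not NeMo code): frozen base + trainable adapter.
# Assumes a process group is already initialized (e.g. via torchrun + init_process_group).
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class AdapterBlock(nn.Module):
    def __init__(self, hidden=1024, bottleneck=16):
        super().__init__()
        self.base = nn.Linear(hidden, hidden)    # stand-in for a pretrained, frozen weight
        self.adapter = nn.Sequential(            # small trainable adapter
            nn.Linear(hidden, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden),
        )

    def forward(self, x):
        return self.base(x) + self.adapter(x)

model = AdapterBlock().cuda()
for p in model.base.parameters():
    p.requires_grad = False  # only the adapter is fine-tuned

# use_orig_params=True lets FSDP handle parameters with mixed requires_grad
# flags, which is what PEFT-style fine-tuning (ptuning, lora, adapter) needs.
fsdp_model = FSDP(model, use_orig_params=True)

optimizer = torch.optim.Adam(
    [p for p in fsdp_model.parameters() if p.requires_grad], lr=1e-4
)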