Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Fully Sharded Data Parallel (FSDP)
Overview
Fully Sharded Data Parallel (FSDP) is a type of data-parallel training. Unlike traditional data parallelism, which maintains a per-GPU copy of a model's parameters, gradients, and optimizer states, FSDP shards all of these states across data-parallel workers and can optionally offload the sharded model parameters to CPUs.
NeMo Framework supports FSDP for GPT-based models such as GPT-3 and Llama.
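To make the sharding behavior concrete, the following is a minimal sketch using plain PyTorch FSDP (torch.distributed.fsdp), independent of NeMo. The Linear layer is a toy stand-in for a real model, and the launch is assumed to go through torchrun; treat this as an illustration of the idea rather than NeMo's implementation.

# Minimal plain-PyTorch FSDP sketch (illustration only, not NeMo code).
# Assumes the script is launched with torchrun so the distributed environment is set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy, CPUOffload

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for a transformer model

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer states
    cpu_offload=CPUOffload(offload_params=True),    # optional: offload sharded params to CPU
)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

With full sharding, each rank holds only a shard of every parameter and gathers the full parameters just in time for the forward and backward passes.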
Usage
Model Training
Define the configuration for FSDP in the model configuration:
megatron_amp_O2: False # megatron_amp_O2 is not supported by FSDP
fsdp: True # Enable training with torch FSDP.
fsdp_sharding_strategy: 'full' # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
fsdp_grad_reduce_dtype: 'bf16' # Gradient reduction data type.
fsdp_sharded_checkpoint: False # Store and load FSDP sharded checkpoint.
optim:
  name: fused_adam # distributed_fused_adam is currently not supported by FSDP
Please note that FSDP is currently not supported with the distributed_fused_adam optimizer or with megatron_amp_O2.
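For orientation, the three fsdp_sharding_strategy values most plausibly correspond to the torch FSDP sharding strategies sketched below; this mapping is an assumption made for illustration, not a statement about the NeMo source.

from torch.distributed.fsdp import ShardingStrategy

# Assumed mapping of NeMo's fsdp_sharding_strategy values to torch FSDP enums:
SHARDING_STRATEGY_MAP = {
    "full": ShardingStrategy.FULL_SHARD,      # shard parameters, gradients, and optimizer states
    "hybrid": ShardingStrategy.HYBRID_SHARD,  # full sharding within a node, replication across nodes
    "grad": ShardingStrategy.SHARD_GRAD_OP,   # shard gradients and optimizer states only (ZeRO-2 style)
}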
Model Fine-Tuning
Define the configuration for FSDP in the model configuration:
megatron_amp_O2: False # megatron_amp_O2 is not supported by FSDP
fsdp: True # Enable training with torch FSDP.
fsdp_sharding_strategy: 'full' # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
fsdp_grad_reduce_dtype: 'bf16' # Gradient reduction data type.
fsdp_sharded_checkpoint: False # Store and load FSDP sharded checkpoint.
fsdp_use_orig_params: False # Set to True to use FSDP with specific fine-tuning schemes (p-tuning, LoRA, adapters, etc.).
optim:
  name: fused_adam # distributed_fused_adam is currently not supported by FSDP
Please note that FSDP is currently not supported with the distributed_fused_adam optimizer or with megatron_amp_O2.
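As a hedged, torch-level illustration of what fsdp_use_orig_params and fsdp_grad_reduce_dtype control (an assumption for clarity, not the NeMo implementation): use_orig_params keeps the original parameter objects addressable, which parameter-efficient schemes such as LoRA, adapters, or p-tuning rely on, and MixedPrecision(reduce_dtype=...) sets the gradient-reduction dtype.

# Hedged sketch of the torch FSDP arguments that fsdp_use_orig_params and
# fsdp_grad_reduce_dtype plausibly correspond to (assumption, not NeMo source).
# Assumes the process group is already initialized as in the earlier sketch.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision, ShardingStrategy

base_model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for a model with adapter/LoRA params attached

fsdp_model = FSDP(
    base_model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    use_orig_params=True,  # expose the original nn.Parameter objects, as PEFT-style fine-tuning needs
    mixed_precision=MixedPrecision(reduce_dtype=torch.bfloat16),  # bf16 gradient reduction
)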