Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel (FSDP) is a form of data-parallel training. Unlike traditional data parallelism, which keeps a full copy of the model's parameters, gradients, and optimizer states on every GPU, FSDP shards all of these states across the data-parallel workers and can optionally offload the sharded model parameters to CPUs.
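To make the memory effect of sharding concrete, here is a rough back-of-the-envelope sketch in Python. The helper `per_gpu_state_bytes` is hypothetical and not part of NeMo. The byte counts are illustrative assumptions (2 bytes per parameter, 2 bytes per gradient, and 12 bytes of Adam-style optimizer state per parameter). The estimate ignores activations and the full parameters that FSDP temporarily gathers during computation.

```python
def per_gpu_state_bytes(num_params, world_size, sharded,
                        param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate bytes of model state (params + grads + optimizer) per GPU.

    The default byte counts are assumptions for illustration only;
    adjust them to match the precision and optimizer actually in use.
    """
    total = num_params * (param_bytes + grad_bytes + optim_bytes)
    # Traditional data parallelism replicates all state on every GPU;
    # full sharding splits it evenly across the data-parallel workers.
    return total // world_size if sharded else total


if __name__ == "__main__":
    n, gpus = 1_000_000_000, 8  # hypothetical 1B-parameter model on 8 GPUs
    replicated = per_gpu_state_bytes(n, gpus, sharded=False)
    fsdp = per_gpu_state_bytes(n, gpus, sharded=True)
    print(f"replicated:    {replicated / 1e9:.1f} GB per GPU")
    print(f"fully sharded: {fsdp / 1e9:.1f} GB per GPU")
```

Under these assumptions, the per-GPU state shrinks in proportion to the number of data-parallel workers, which is the main motivation for sharding.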

NeMo Framework supports FSDP for GPT-based models such as GPT-3 and Llama.

Model Training

Define the configuration for FSDP in the model configuration:

    megatron_amp_O2: False  # megatron_amp_O2 is not supported by FSDP
    fsdp: True  # Enable training with torch FSDP.
    fsdp_sharding_strategy: 'full'  # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
    fsdp_grad_reduce_dtype: 'bf16'  # Gradient reduction data type.
    fsdp_sharded_checkpoint: False  # Store and load FSDP sharded checkpoint.

    optim:
      name: fused_adam  # distributed_fused_adam is currently not supported by FSDP

Note that FSDP currently supports neither the distributed_fused_adam optimizer nor megatron_amp_O2.
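The constraints above can be checked before launching a job. The following is a minimal sketch, assuming the model configuration is available as a plain Python dictionary. The helper `check_fsdp_config` is hypothetical and not part of NeMo; it only encodes the restrictions stated in this section.

```python
VALID_SHARDING = {"full", "hybrid", "grad"}


def check_fsdp_config(model_cfg):
    """Return a list of human-readable problems found in an FSDP model config.

    Hypothetical helper: it only checks the restrictions listed in this guide.
    """
    problems = []
    if not model_cfg.get("fsdp", False):
        return problems  # FSDP is disabled, so nothing to check

    if model_cfg.get("megatron_amp_O2", False):
        problems.append("megatron_amp_O2 must be False when fsdp is True")

    optim_name = model_cfg.get("optim", {}).get("name")
    if optim_name == "distributed_fused_adam":
        problems.append("distributed_fused_adam is not supported by FSDP; "
                        "use fused_adam instead")

    # Only validate the strategy if one was given explicitly.
    strategy = model_cfg.get("fsdp_sharding_strategy")
    if strategy is not None and strategy not in VALID_SHARDING:
        problems.append(f"unknown fsdp_sharding_strategy: {strategy!r}")

    return problems


if __name__ == "__main__":
    cfg = {
        "megatron_amp_O2": True,
        "fsdp": True,
        "fsdp_sharding_strategy": "full",
        "optim": {"name": "distributed_fused_adam"},
    }
    for problem in check_fsdp_config(cfg):
        print("config error:", problem)
```

Returning a list rather than raising on the first problem lets a launcher report every misconfiguration in one pass.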

Model Fine-Tuning

Define the configuration for FSDP in the model configuration:

    megatron_amp_O2: False  # megatron_amp_O2 is not supported by FSDP
    fsdp: True  # Enable training with torch FSDP.
    fsdp_sharding_strategy: 'full'  # Method to shard model states. Available options are 'full', 'hybrid', and 'grad'.
    fsdp_grad_reduce_dtype: 'bf16'  # Gradient reduction data type.
    fsdp_sharded_checkpoint: False  # Store and load FSDP sharded checkpoint.
    fsdp_use_orig_params: False  # Set to True to use FSDP with specific fine-tuning scheme (ptuning, lora, adapter, etc.).

    optim:
      name: fused_adam  # distributed_fused_adam is currently not supported by FSDP

Note that FSDP currently supports neither the distributed_fused_adam optimizer nor megatron_amp_O2.
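As an illustration of how the fine-tuning settings fit together, the sketch below builds the FSDP portion of a model configuration as a Python dictionary. The helper `fsdp_finetune_overrides` and its `peft_scheme` argument are hypothetical and not part of NeMo. The sketch assumes that any named parameter-efficient scheme (ptuning, lora, adapter, and so on) calls for `fsdp_use_orig_params: True`, while full fine-tuning leaves it False.

```python
def fsdp_finetune_overrides(peft_scheme=None):
    """Build FSDP-related model settings for a fine-tuning run.

    peft_scheme: name of a parameter-efficient scheme such as "lora",
    or None for full fine-tuning. Hypothetical helper for illustration.
    """
    return {
        "megatron_amp_O2": False,          # not supported by FSDP
        "fsdp": True,
        "fsdp_sharding_strategy": "full",  # 'full', 'hybrid', or 'grad'
        "fsdp_grad_reduce_dtype": "bf16",
        "fsdp_sharded_checkpoint": False,
        # Assumption: enable original-parameter mode whenever a PEFT scheme is used.
        "fsdp_use_orig_params": peft_scheme is not None,
        "optim": {"name": "fused_adam"},   # distributed_fused_adam is not supported
    }


if __name__ == "__main__":
    print(fsdp_finetune_overrides("lora"))
    print(fsdp_finetune_overrides())
```

The resulting dictionary mirrors the keys shown in the configuration above and could be merged into the model section of a training config.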

Last updated on May 30, 2024.