Megatron-FSDP User Guide

Megatron-FSDP Quick Start

We recommend using the latest NVIDIA NeMo Framework Container, which provides a tested software stack and optimized performance.

For your reference, we provide an example launch script for DeepSeek-V3: sbatch_mfsdp_deepseek_v3.sh.

Required Configurations

To enable Megatron-FSDP, add the following required flags to your training script:

--use-megatron-fsdp
--data-parallel-sharding-strategy optim_grads_params
--no-gradient-accumulation-fusion
--use-distributed-optimizer
--ckpt-format fsdp_dtensor
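
For example, a minimal sketch of a launch command with these flags appended (pretrain_gpt.py, the process counts, and any remaining arguments are placeholders for your own configuration):

torchrun --nproc_per_node=8 --nnodes=1 \
    pretrain_gpt.py \
    --use-megatron-fsdp \
    --data-parallel-sharding-strategy optim_grads_params \
    --no-gradient-accumulation-fusion \
    --use-distributed-optimizer \
    --ckpt-format fsdp_dtensor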

Checkpoint Conversion from 3D-Parallel to Megatron-FSDP

Megatron-FSDP introduces fsdp_dtensor, a DTensor-based distributed checkpoint format that serves as its standard. To help you transition smoothly from 3D-Parallel to Megatron-FSDP, we provide a script that converts checkpoints from the torch_dist format to the fsdp_dtensor format. The detailed conversion process is described below, using DeepSeek-V3 as an example.

Step 1: Generate 3D-Parallel Checkpoint with param_to_param_group_map

Run your 3D-parallel + EP training script to generate a torch_dist checkpoint along with a directory containing param_to_param_group_map files. Add the following flag to your training script:

--dump-param-to-param-group-map /path/to/param_to_param_group_map

If you already have a torch_dist checkpoint, simply specify the --dump-param-to-param-group-map /path/to/param_to_param_group_map flag and run a very short experiment; this will create the param_to_param_group_map you need without full pretraining.
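
As a rough sketch, assuming your launch script resumes from the existing checkpoint via --load, such a dump-only run might look like the following (--train-iters 1 is just one way to keep the run short; pretrain_gpt.py and the omitted model/data arguments are placeholders for your own configuration):

torchrun --nproc_per_node=8 --nnodes=1 \
    pretrain_gpt.py \
    --load /path/to/input_torch_dist_checkpoint \
    --train-iters 1 \
    --dump-param-to-param-group-map /path/to/param_to_param_group_map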

Step 2: Export param_to_param_group_map to a JSON File

Convert the param_to_param_group_map into a JSON file for easier processing by running:

python tools/checkpoint/checkpoint_inspector.py print-torch-dcp-in-json /path/to/param_to_param_group_map

This will create a param_to_param_group_map.json file in the /path/to/param_to_param_group_map directory.
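
To sanity-check the export, you can pretty-print the JSON with the Python standard library (the path mirrors the example above):

python -m json.tool /path/to/param_to_param_group_map/param_to_param_group_map.json | head -n 20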

Step 3: Convert Checkpoint from torch_dist to fsdp_dtensor

Convert your torch_dist checkpoint to the fsdp_dtensor format using the param_to_param_group_map JSON file:

torchrun --nproc_per_node=8 --nnodes=1 \
    tools/checkpoint/checkpoint_inspector.py \
    convert-torch-dist-to-fsdp-dtensor --swiglu \
    /path/to/input_torch_dist_checkpoint \
    /path/to/output_fsdp_dtensor_checkpoint \
    --param-to-param-group-map-json /path/to/param_to_param_group_map.json

Note: For multi-node conversion tasks, please refer to the example script: sbatch_checkpoint_convert.sh.
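
If you prefer a hand-rolled launch instead of the reference script, a minimal sketch of a multi-node SLURM invocation might look like the following (the rendezvous settings and node count are assumptions; treat sbatch_checkpoint_convert.sh as the authoritative example):

srun torchrun --nproc_per_node=8 --nnodes=$SLURM_NNODES \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    tools/checkpoint/checkpoint_inspector.py \
    convert-torch-dist-to-fsdp-dtensor --swiglu \
    /path/to/input_torch_dist_checkpoint \
    /path/to/output_fsdp_dtensor_checkpoint \
    --param-to-param-group-map-json /path/to/param_to_param_group_map.json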

Step 4: Launch Megatron-FSDP Training

Start your Megatron-FSDP training job from the converted fsdp_dtensor checkpoint: point --load at it and keep the required flags from the Required Configurations section.
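
Putting it together, a sketch of the final launch (pretrain_gpt.py and the remaining arguments again stand in for your own configuration):

torchrun --nproc_per_node=8 --nnodes=1 \
    pretrain_gpt.py \
    --load /path/to/output_fsdp_dtensor_checkpoint \
    --use-megatron-fsdp \
    --data-parallel-sharding-strategy optim_grads_params \
    --no-gradient-accumulation-fusion \
    --use-distributed-optimizer \
    --ckpt-format fsdp_dtensor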