nemo_automodel.components.distributed.fsdp2
Module Contents
Classes
Functions
Data
API
Manager for parallelizing models using FSDP2 with tensor-parallel (TP), data-parallel (DP), and context-parallel (CP) sharding.
This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and CPU offloading options.
The device mesh must be created externally and passed in.
Parameters:
Configuration for FSDP2 distributed training.
Device mesh for distributed operations.
Optional device mesh for expert parallelism.
Apply per-layer compile after sharding, alongside whole-model compile_model().
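Because the device mesh is created by the caller, the mesh dimensions have to be worked out before the manager is constructed. The sketch below is only an illustration of that arithmetic: `mesh_shape` is a hypothetical helper, not part of this module, and the dimension names `("dp", "cp", "tp")` in the trailing comment are assumptions, since the names the manager expects are not stated here.

```python
def mesh_shape(world_size, tp_size=1, cp_size=1):
    """Return (dp, cp, tp) sizes whose product equals world_size.

    Hypothetical helper for illustration; not part of nemo_automodel.
    """
    if tp_size < 1 or cp_size < 1:
        raise ValueError("tp_size and cp_size must be >= 1")
    model_parallel = tp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size*cp_size={model_parallel}"
        )
    # Whatever is left after TP and CP becomes the data-parallel dimension.
    return (world_size // model_parallel, cp_size, tp_size)


shape = mesh_shape(world_size=8, tp_size=2, cp_size=2)

# Inside a job launched with torchrun, a mesh of this shape could then be
# built with PyTorch's init_device_mesh (dimension names are assumed here):
#
#   from torch.distributed.device_mesh import init_device_mesh
#   mesh = init_device_mesh("cuda", shape, mesh_dim_names=("dp", "cp", "tp"))
```

The resulting mesh object is what would be passed in as the device mesh parameter described above.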
Parallelizes the given model using FSDP2 and TP sharding strategies.
Parameters:
The model to be parallelized.
Returns:
The parallelized model.
Eliminate CPU-GPU sync from flash attention for standard (non-packed) training.
transformers._is_packed_sequence() returns a GPU bool scalar when batch_size==1. Evaluating that scalar in a Python if statement calls aten::is_nonzero, which forces a CPU-GPU sync once per attention layer per forward pass. With FSDP+TP+gradient-checkpointing this fires hundreds of times per iteration.
For standard (non-packed) training, sequences are never packed, so returning the Python False immediately is correct and avoids the sync. Do NOT apply this patch when using packed-sequence training (multiple sequences concatenated into one tensor with position_ids that reset to 0 mid-sequence).
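The patching idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's actual implementation: `apply_non_packed_patch` and `_never_packed` are hypothetical names, and a `SimpleNamespace` stands in for the real library module so the example runs without transformers or a GPU.

```python
from types import SimpleNamespace


def _never_packed(*args, **kwargs):
    """Replacement check: returns the Python bool False without touching a tensor."""
    return False


def apply_non_packed_patch(module, uses_packed_sequences, attr="_is_packed_sequence"):
    """Swap module.<attr> for _never_packed.

    Returns the original function, or None if the patch was skipped.
    Hypothetical helper for illustration only.
    """
    if uses_packed_sequences:
        # Packed batches need the real check, so leave the module untouched.
        return None
    original = getattr(module, attr, None)
    if original is None:
        # Attribute missing in this library version: nothing to patch.
        return None
    setattr(module, attr, _never_packed)
    return original


# Stand-in for the real library module (assumption, for illustration only).
fake_module = SimpleNamespace(
    _is_packed_sequence=lambda position_ids, batch_size: "gpu-bool"
)

original = apply_non_packed_patch(fake_module, uses_packed_sequences=False)
result = fake_module._is_packed_sequence(None, 1)
```

Returning the original function lets the caller restore it later. Note that replacing a module attribute only affects callers that look the function up through the module at call time; code that imported the name directly keeps its own reference to the original.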