nemo_automodel.components.distributed.fsdp2

Module Contents

Classes

Name	Description
`FSDP2Manager`	Manager for parallelizing models using FSDP2 with TP, DP, CP sharding.

Functions

Name	Description
`_patch_is_packed_sequence_for_training`	Eliminate CPU-GPU sync from flash attention for standard (non-packed) training.

Data

logger

API

class nemo_automodel.components.distributed.fsdp2.FSDP2Manager(
    config: nemo_automodel.components.distributed.config.FSDP2Config,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)

Manager for parallelizing models using FSDP2 with TP, DP, CP sharding.

This manager applies parallelization to the model using a prescribed TP sharding plan. It supports mixed precision and CPU offloading options.

The device mesh must be created externally and passed in.
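Since the mesh is built by the caller, the sketch below shows one way it might be constructed with PyTorch's `init_device_mesh`. The helper names `dp_size_for` and `build_device_mesh`, the dimension order, and the dimension names `("dp", "cp", "tp")` are all assumptions made for illustration; the manager may expect different names, so check them against your installed version.

```python
def dp_size_for(world_size: int, tp_size: int = 1, cp_size: int = 1) -> int:
    """Return the data-parallel size left after reserving ranks for TP and CP."""
    if min(world_size, tp_size, cp_size) < 1:
        raise ValueError("all sizes must be positive")
    model_parallel = tp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size*cp_size={model_parallel}"
        )
    return world_size // model_parallel


def build_device_mesh(world_size, tp_size=1, cp_size=1, device_type="cuda",
                      dim_names=("dp", "cp", "tp")):
    """Create a 3-D DeviceMesh (hypothetical helper).

    Needs a distributed launch (for example via torchrun) so that
    torch.distributed can set up its process group.
    The dim_names default is a placeholder, not a documented requirement.
    """
    # Imported lazily so dp_size_for stays usable without PyTorch installed.
    from torch.distributed.device_mesh import init_device_mesh

    dp_size = dp_size_for(world_size, tp_size, cp_size)
    return init_device_mesh(
        device_type,
        (dp_size, cp_size, tp_size),
        mesh_dim_names=dim_names,
    )
```

For example, 8 ranks with `tp_size=2` and `cp_size=1` leave a data-parallel size of 4, giving a mesh of shape `(4, 1, 2)`.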

Parameters:

config

FSDP2Config

Configuration for FSDP2 distributed training.

device_mesh

DeviceMesh

Device mesh for distributed operations.

moe_mesh

Optional[DeviceMesh], defaults to None

Optional device mesh for expert parallelism.

activation_checkpointing

= config.activation_checkpointing

defer_fsdp_grad_sync

= config.defer_fsdp_grad_sync

enable_async_tensor_parallel

= config.enable_async_tensor_parallel

enable_compile

= config.enable_compile

enable_fsdp2_prefetch

= config.enable_fsdp2_prefetch

fsdp2_backward_prefetch_depth

= config.fsdp2_backward_prefetch_depth

fsdp2_forward_prefetch_depth

= config.fsdp2_forward_prefetch_depth

mp_policy

= config.mp_policy

offload_policy

= config.offload_policy

reshard_after_forward

= config.reshard_after_forward

sequence_parallel

= config.sequence_parallel

tp_plan

= config.tp_plan

nemo_automodel.components.distributed.fsdp2.FSDP2Manager.maybe_compile(
    model
)

Apply per-layer compile after sharding, alongside whole-model compile_model().

nemo_automodel.components.distributed.fsdp2.FSDP2Manager.parallelize(
    model
)

Parallelizes the given model using FSDP2 and TP sharding strategies.

Parameters:

model

nn.Module

The model to be parallelized.

Returns:

The parallelized model.
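A minimal sketch of how the constructor and `parallelize` might be combined, using only the signatures listed on this page. The wrapper `parallelize_with_fsdp2` and its `manager_cls` parameter are hypothetical; the latter exists only so the call sequence can be exercised with a stub when `nemo_automodel` is not installed.

```python
def parallelize_with_fsdp2(model, config, device_mesh, moe_mesh=None,
                           manager_cls=None):
    """Build a manager and return the parallelized model (hypothetical wrapper).

    config is expected to be an FSDP2Config and device_mesh a DeviceMesh
    created by the caller; neither is constructed here.
    """
    if manager_cls is None:
        # Lazy import: only needed when running against the real library.
        from nemo_automodel.components.distributed.fsdp2 import FSDP2Manager
        manager_cls = FSDP2Manager

    manager = manager_cls(config, device_mesh, moe_mesh)
    # Use the return value rather than assuming the input was changed in place.
    return manager.parallelize(model)
```

Keeping the returned object avoids relying on whether the model is modified in place, which this page does not specify.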

nemo_automodel.components.distributed.fsdp2._patch_is_packed_sequence_for_training() -> None

Eliminate CPU-GPU sync from flash attention for standard (non-packed) training.

transformers._is_packed_sequence() returns a GPU bool scalar when batch_size==1. Evaluating that tensor in a Python if statement calls aten::is_nonzero, which forces a CPU-GPU sync once per attention layer per forward pass. With FSDP+TP+gradient-checkpointing this fires hundreds of times per iteration.

In standard (non-packed) training, sequences are never packed, so returning the Python False immediately is both correct and sync-free. Do NOT apply this patch when using packed-sequence training (multiple sequences concatenated into one tensor with position_ids that reset to 0 mid-sequence).
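The patching idea described above can be sketched as a plain attribute replacement. This is an illustration of the technique, not the library's implementation: `patch_is_packed_sequence` is a hypothetical helper, and a `SimpleNamespace` stands in for whichever module actually defines `_is_packed_sequence`.

```python
import types


def patch_is_packed_sequence(module):
    """Replace module._is_packed_sequence with a function returning Python False.

    Returns the original callable so the patch can be undone, or None if the
    module has no such attribute.
    """
    original = getattr(module, "_is_packed_sequence", None)
    if original is None:
        return None

    def _never_packed(*args, **kwargs):
        # A plain Python bool: no tensor is created, so `if` needs no device sync.
        return False

    module._is_packed_sequence = _never_packed
    return original


# Stand-in for the real module (assumption: the real function takes
# position_ids and batch_size; only the replacement pattern matters here).
fake_module = types.SimpleNamespace(
    _is_packed_sequence=lambda position_ids, batch_size: batch_size == 1
)
original = patch_is_packed_sequence(fake_module)
result = fake_module._is_packed_sequence(position_ids=[0, 1, 2], batch_size=1)
```

Code that imported the function by name before the patch keeps its own reference to the original, so a patch of this kind has to be applied before those imports or to each module that holds a reference.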

nemo_automodel.components.distributed.fsdp2.logger = logging.getLogger(__name__)