# nemo_automodel.components.distributed.fsdp2

## Module Contents

### Classes

| Name                                                                        | Description                                                            |
| --------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| [`FSDP2Manager`](#nemo_automodel-components-distributed-fsdp2-FSDP2Manager) | Manager for parallelizing models using FSDP2 with TP, DP, CP sharding. |

### Functions

| Name                                                                                                                            | Description                                                                     |
| ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`_patch_is_packed_sequence_for_training`](#nemo_automodel-components-distributed-fsdp2-_patch_is_packed_sequence_for_training) | Eliminate CPU-GPU sync from flash attention for standard (non-packed) training. |

### Data

[`logger`](#nemo_automodel-components-distributed-fsdp2-logger)

### API

```python
class nemo_automodel.components.distributed.fsdp2.FSDP2Manager(
    config: nemo_automodel.components.distributed.config.FSDP2Config,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    moe_mesh: typing.Optional[torch.distributed.device_mesh.DeviceMesh] = None
)
```

Manager for parallelizing models using FSDP2 with tensor-parallel (TP), data-parallel (DP), and context-parallel (CP) sharding.

This manager applies parallelization to the model using a prescribed
TP sharding plan. It supports mixed precision and CPU offloading options.

The device mesh must be created externally and passed in.

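**Parameters:**

`config` (`FSDP2Config`): Configuration for FSDP2 distributed training.

`device_mesh` (`DeviceMesh`): Device mesh for distributed operations.

`moe_mesh` (`Optional[DeviceMesh]`, default `None`): Optional device mesh for expert parallelism.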
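Because the device mesh is created outside the manager, the caller has to decide how the available ranks are split across parallelism dimensions. The following is a minimal sketch of that bookkeeping in plain Python. The helper `infer_mesh_shape` and the dimension names `dp`, `cp`, and `tp` are illustrative assumptions, not part of this module; the names that `FSDP2Manager` actually expects on the mesh are not stated here.

```python
def infer_mesh_shape(world_size, tp_size=1, cp_size=1):
    """Split world_size ranks into data-, context-, and tensor-parallel groups.

    Hypothetical helper for illustration: the data-parallel size is whatever
    remains after the tensor- and context-parallel sizes are fixed.
    """
    if world_size <= 0 or tp_size <= 0 or cp_size <= 0:
        raise ValueError("world_size, tp_size and cp_size must be positive")
    model_parallel = tp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * cp_size = {model_parallel}"
        )
    return {"dp": world_size // model_parallel, "cp": cp_size, "tp": tp_size}


shape = infer_mesh_shape(world_size=8, tp_size=2, cp_size=1)
print(shape)  # {'dp': 4, 'cp': 1, 'tp': 2}

# Inside a launched distributed job, a mesh with this shape could be built
# with PyTorch's init_device_mesh (dimension names here are assumptions):
#
#   from torch.distributed.device_mesh import init_device_mesh
#   device_mesh = init_device_mesh(
#       "cuda", tuple(shape.values()), mesh_dim_names=tuple(shape.keys())
#   )
```

The resulting `device_mesh` is what would be passed as the second constructor argument.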

```python
nemo_automodel.components.distributed.fsdp2.FSDP2Manager.maybe_compile(
    model
)
```

Apply per-layer compile after sharding, alongside whole-model `compile_model()`.

```python
nemo_automodel.components.distributed.fsdp2.FSDP2Manager.parallelize(
    model
)
```

Parallelizes the given model using FSDP2 and TP sharding strategies.

**Parameters:**

`model`: The model to be parallelized.

**Returns:**

The parallelized model.
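A minimal usage sketch based only on the signatures documented above. The wrapper name `parallelize_with_fsdp2` is hypothetical, and the sketch assumes that `config` and `device_mesh` have already been built elsewhere (how to construct an `FSDP2Config` is not covered on this page).

```python
def parallelize_with_fsdp2(model, config, device_mesh, moe_mesh=None):
    """Build an FSDP2Manager and return the parallelized model.

    Hypothetical convenience wrapper; it only uses the constructor and
    parallelize() signatures shown in this reference.
    """
    # Imported lazily so this sketch can be defined without the package installed.
    from nemo_automodel.components.distributed.fsdp2 import FSDP2Manager

    manager = FSDP2Manager(config, device_mesh, moe_mesh=moe_mesh)
    # parallelize() returns the parallelized model, so use the return value
    # rather than assuming the input object was modified in place.
    return manager.parallelize(model)
```

Keeping the returned model, instead of continuing to use the original reference, avoids relying on whether parallelization happens in place.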

```python
nemo_automodel.components.distributed.fsdp2._patch_is_packed_sequence_for_training() -> None
```

Eliminate CPU-GPU sync from flash attention for standard (non-packed) training.

`transformers._is_packed_sequence()` returns a GPU bool scalar when `batch_size == 1`,
which causes Python's `if` to call `aten::is_nonzero` (a CPU-GPU sync) once per
attention layer per forward pass. With FSDP + TP + gradient checkpointing, this fires
hundreds of times per iteration.

For standard (non-packed) training, sequences are never packed, so returning the
Python `False` immediately is both correct and avoids the sync. Do NOT apply this
patch when using packed-sequence training (multiple sequences concatenated into one
tensor with `position_ids` that reset to 0 mid-sequence).

```python
nemo_automodel.components.distributed.fsdp2.logger = logging.getLogger(__name__)
```