# nemo_automodel.components.distributed.parallelizer

## Module Contents

### Classes

| Name                                                                                                                         | Description                                                                          |
| ---------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`DeepseekV4ParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-DeepseekV4ParallelizationStrategy) | DeepSeek-V4 keeps a small set of reference-sensitive parameters in fp32.             |
| [`DefaultParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-DefaultParallelizationStrategy)       | Default parallelization strategy used by most models.                                |
| [`Gemma4ForConditionalGeneration`](#nemo_automodel-components-distributed-parallelizer-Gemma4ForConditionalGeneration)       | Placeholder when the installed transformers build has no Gemma4.                     |
| [`HunyuanParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-HunyuanParallelizationStrategy)       | Parallelization strategy for Hunyuan-style transformer modules used in HunyuanVideo. |
| [`NemotronHParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-NemotronHParallelizationStrategy)   | Specialized parallelization strategy for NemotronH models.                           |
| [`ParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-ParallelizationStrategy)                     | Abstract base class for model parallelization strategies.                            |
| [`Qwen3_5ParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-Qwen3_5ParallelizationStrategy)       | Parallelization strategy for Qwen3.5 dense models with mixed-dtype GatedDeltaNet.    |
| [`WanParallelizationStrategy`](#nemo_automodel-components-distributed-parallelizer-WanParallelizationStrategy)               | Parallelization strategy for Wan-style transformer modules used in Diffusers.        |

### Functions

| Name                                                                                                                                                       | Description                                                                                      |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| [`_apply_bagel_full_layer_activation_checkpointing`](#nemo_automodel-components-distributed-parallelizer-_apply_bagel_full_layer_activation_checkpointing) | Apply native BAGEL-style activation checkpointing to whole logical layers.                       |
| [`_apply_per_layer_compile`](#nemo_automodel-components-distributed-parallelizer-_apply_per_layer_compile)                                                 | Compile each decoder layer in-place after FSDP2 sharding.                                        |
| [`_attention_is_head_sharded`](#nemo_automodel-components-distributed-parallelizer-_attention_is_head_sharded)                                             | Return True when the TP plan column-wise shards any QKV attention projection.                    |
| [`_extract_model_layers`](#nemo_automodel-components-distributed-parallelizer-_extract_model_layers)                                                       | Extract layers from different model architectures for parallelization.                           |
| [`_find_largest_module_list`](#nemo_automodel-components-distributed-parallelizer-_find_largest_module_list)                                               | Heuristic function to find the largest layer container in a model.                               |
| [`_get_module_by_fqn`](#nemo_automodel-components-distributed-parallelizer-_get_module_by_fqn)                                                             | -                                                                                                |
| [`_get_parallel_plan`](#nemo_automodel-components-distributed-parallelizer-_get_parallel_plan)                                                             | Select the tensor-parallel plan for the given model.                                             |
| [`_is_checkpoint_wrapped`](#nemo_automodel-components-distributed-parallelizer-_is_checkpoint_wrapped)                                                     | -                                                                                                |
| [`_is_transformers_v5_or_higher`](#nemo_automodel-components-distributed-parallelizer-_is_transformers_v5_or_higher)                                       | Check if transformers version is 5.x or higher.                                                  |
| [`_nemotronh_decoder_blocks`](#nemo_automodel-components-distributed-parallelizer-_nemotronh_decoder_blocks)                                               | Return `(container, blocks)` for a NemotronH model's decoder blocks.                             |
| [`_patch_dtensor_spec_hash_for_symint`](#nemo_automodel-components-distributed-parallelizer-_patch_dtensor_spec_hash_for_symint)                           | Fix a crash when torch.compile + DTensor are used together.                                      |
| [`_subtree_all_frozen`](#nemo_automodel-components-distributed-parallelizer-_subtree_all_frozen)                                                           | Return True if `module` owns parameters and none of them require grad.                           |
| [`_update_attention_head_counts_for_tp`](#nemo_automodel-components-distributed-parallelizer-_update_attention_head_counts_for_tp)                         | Update attention head counts to per-rank values after TP sharding.                               |
| [`apply_fsdp2_sharding_recursively`](#nemo_automodel-components-distributed-parallelizer-apply_fsdp2_sharding_recursively)                                 | Recursively apply FSDP2 sharding to modules, with optimizations for ModuleList.                  |
| [`apply_selective_activation_checkpointing`](#nemo_automodel-components-distributed-parallelizer-apply_selective_activation_checkpointing)                 | Apply selective activation checkpointing to `model` end to end.                                  |
| [`fsdp2_strategy_parallelize`](#nemo_automodel-components-distributed-parallelizer-fsdp2_strategy_parallelize)                                             | Apply parallelisms and activation checkpointing to the model.                                    |
| [`get_hf_tp_shard_plan`](#nemo_automodel-components-distributed-parallelizer-get_hf_tp_shard_plan)                                                         | Get the Hugging Face tensor parallel plan from the model.                                        |
| [`get_parallelization_strategy`](#nemo_automodel-components-distributed-parallelizer-get_parallelization_strategy)                                         | Get the appropriate parallelization strategy for the given model.                                |
| [`import_class_from_path`](#nemo_automodel-components-distributed-parallelizer-import_class_from_path)                                                     | Import a class from a string path (e.g. 'torch.optim.AdamW').                                    |
| [`import_classes_from_paths`](#nemo_automodel-components-distributed-parallelizer-import_classes_from_paths)                                               | Helper function to import classes from string paths.                                             |
| [`megatron_fsdp_strategy_parallelize`](#nemo_automodel-components-distributed-parallelizer-megatron_fsdp_strategy_parallelize)                             | Apply tensor/data parallelism (MegatronFSDP) and optional activation-checkpointing to the model. |
| [`register_parallel_strategy`](#nemo_automodel-components-distributed-parallelizer-register_parallel_strategy)                                             | Decorator to register out-of-tree parallelism strategies.                                        |
| [`translate_to_torch_parallel_style`](#nemo_automodel-components-distributed-parallelizer-translate_to_torch_parallel_style)                               | Translates string descriptions to parallelism plans.                                             |
| [`unshard_fsdp2_model`](#nemo_automodel-components-distributed-parallelizer-unshard_fsdp2_model)                                                           | Explicitly unshard and then reshard the FSDP2 modules. Useful for logprob inference.             |
| [`validate_tp_mesh`](#nemo_automodel-components-distributed-parallelizer-validate_tp_mesh)                                                                 | Validate that attention heads and key-value heads are divisible by the TP size.                  |
| [`validate_tp_mesh_for_nemotron_nas`](#nemo_automodel-components-distributed-parallelizer-validate_tp_mesh_for_nemotron_nas)                               | Validate that a Nemotron-NAS model can be tensor-parallel sharded.                               |

### Data

[`HAVE_MEGATRON_FSDP`](#nemo_automodel-components-distributed-parallelizer-HAVE_MEGATRON_FSDP)

[`PARALLELIZATION_STRATEGIES`](#nemo_automodel-components-distributed-parallelizer-PARALLELIZATION_STRATEGIES)

[`_BAGEL_FULL_LAYER_CHECKPOINT_MODULE_LISTS`](#nemo_automodel-components-distributed-parallelizer-_BAGEL_FULL_LAYER_CHECKPOINT_MODULE_LISTS)

[`_DEFAULT_STRATEGY`](#nemo_automodel-components-distributed-parallelizer-_DEFAULT_STRATEGY)

[`logger`](#nemo_automodel-components-distributed-parallelizer-logger)

### API

```python
class nemo_automodel.components.distributed.parallelizer.DeepseekV4ParallelizationStrategy()
```

**Bases:** [DefaultParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-DefaultParallelizationStrategy)

DeepSeek-V4 keeps a small set of reference-sensitive parameters in fp32.

```python
nemo_automodel.components.distributed.parallelizer.DeepseekV4ParallelizationStrategy.parallelize(
    model,
    device_mesh,
    dp_shard_cp_mesh_name = 'dp_shard_cp',
    kwargs = {}
)
```

```python
class nemo_automodel.components.distributed.parallelizer.DefaultParallelizationStrategy()
```

**Bases:** [ParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-ParallelizationStrategy)

Default parallelization strategy used by most models.

```python
nemo_automodel.components.distributed.parallelizer.DefaultParallelizationStrategy.parallelize(
    model: torch.nn.Module,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    enable_async_tensor_parallel: bool = False,
    enable_compile: bool = False,
    enable_fsdp2_prefetch: bool = True,
    fsdp2_backward_prefetch_depth: int = 2,
    fsdp2_forward_prefetch_depth: int = 1,
    reshard_after_forward: typing.Optional[bool] = None,
    fully_shard_fn = None
) -> torch.nn.Module
```

Apply the default parallelization flow.

```python
class nemo_automodel.components.distributed.parallelizer.Gemma4ForConditionalGeneration()
```

Placeholder when the installed transformers build has no Gemma4.

```python
class nemo_automodel.components.distributed.parallelizer.HunyuanParallelizationStrategy()
```

**Bases:** [ParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-ParallelizationStrategy)

Parallelization strategy for Hunyuan-style transformer modules used in HunyuanVideo.

```python
nemo_automodel.components.distributed.parallelizer.HunyuanParallelizationStrategy.parallelize(
    model: torch.nn.Module,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = True,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    kwargs = {}
) -> torch.nn.Module
```

```python
class nemo_automodel.components.distributed.parallelizer.NemotronHParallelizationStrategy()
```

**Bases:** [ParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-ParallelizationStrategy)

Specialized parallelization strategy for NemotronH models.

```python
nemo_automodel.components.distributed.parallelizer.NemotronHParallelizationStrategy.parallelize(
    model: torch.nn.Module,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    kwargs = {}
) -> torch.nn.Module
```

Apply NemotronH-specific parallelization.

```python
class nemo_automodel.components.distributed.parallelizer.ParallelizationStrategy()
```

Abstract base class for model parallelization strategies.

```python
nemo_automodel.components.distributed.parallelizer.ParallelizationStrategy.parallelize(
    model: torch.nn.Module,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    kwargs = {}
) -> torch.nn.Module
```

Abstract method. Apply the parallelization strategy to the model.

```python
class nemo_automodel.components.distributed.parallelizer.Qwen3_5ParallelizationStrategy()
```

**Bases:** [DefaultParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-DefaultParallelizationStrategy)

Parallelization strategy for Qwen3.5 dense models with mixed-dtype GatedDeltaNet.

Qwen3.5 has linear\_attn layers with float32 params (A\_log, norm) alongside
bfloat16 params. Overrides the FSDP sharding step to use fully\_shard\_by\_dtype
per layer, and sets the CP mesh on CPAwareGatedDeltaNet modules.

```python
nemo_automodel.components.distributed.parallelizer.Qwen3_5ParallelizationStrategy.parallelize(
    model,
    device_mesh,
    dp_shard_cp_mesh_name = 'dp_shard_cp',
    kwargs = {}
)
```

```python
class nemo_automodel.components.distributed.parallelizer.WanParallelizationStrategy()
```

**Bases:** [ParallelizationStrategy](#nemo_automodel-components-distributed-parallelizer-ParallelizationStrategy)

Parallelization strategy for Wan-style transformer modules used in Diffusers.

Applies TP to condition embedders, FFN projections in each block, and final projection,
then applies FSDP sharding similarly to other strategies.

```python
nemo_automodel.components.distributed.parallelizer.WanParallelizationStrategy.parallelize(
    model: torch.nn.Module,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    kwargs = {}
) -> torch.nn.Module
```

```python
nemo_automodel.components.distributed.parallelizer._apply_bagel_full_layer_activation_checkpointing(
    model: torch.nn.Module
) -> bool
```

Apply native BAGEL-style activation checkpointing to whole logical layers.

```python
nemo_automodel.components.distributed.parallelizer._apply_per_layer_compile(
    model: torch.nn.Module
) -> None
```

Compile each decoder layer in-place after FSDP2 sharding.

Compiles at decoder-layer granularity (not sub-module) so that AOT autograd traces
the joint fwd+bwd graph under the training loop's enable\_grad context.  Sub-module
compile (e.g. on mlp alone) would be traced during activation checkpointing's first
forward pass which runs under no\_grad, producing a forward-only graph that drops
LoRA and other trainable-parameter gradients.

Prerequisite: NO\_REENTRANT checkpoint\_wrapper must already be applied to self\_attn
and mlp before FSDP2 sharding (done in DefaultParallelizationStrategy).  This
function only handles the compile step.

Whole-block selective-AC wrappers (tagged with `SELECTIVE_AC_WRAPPER_FLAG`)
are compiled OUTER -- the wrapper itself is compiled so the selective policy
is traced and the partitioner honors its recompute tags. Other layer-level
CheckpointWrappers (e.g. the PP path) are unwrapped and the decoder layer is
compiled directly.

nn.Module.compile() is used instead of torch.compile() to compile in-place without
introducing an \_orig\_mod wrapper, which would add a key prefix and break checkpoint
loading.

\_patch\_dtensor\_spec\_hash\_for\_symint() is called to allow torch.compile with dynamic
shapes to coexist with DTensor's lru\_cache-based sharding propagation.

```python
nemo_automodel.components.distributed.parallelizer._attention_is_head_sharded(
    model_parallel_plan: dict
) -> bool
```

Return True when the TP plan column-wise shards any QKV attention projection.

When Q/K/V projections use `ColwiseParallel` with sharded output (the
default), each TP rank holds `num_heads / tp_size` heads and the model
config / layer attributes must be updated accordingly.

Plans that keep attention replicated (e.g. Phi-3 with `RowwiseParallel`
on fused QKV and `Replicate` output) should *not* trigger a head-count
update.

```python
nemo_automodel.components.distributed.parallelizer._extract_model_layers(
    model: torch.nn.Module
) -> typing.List[torch.nn.Module]
```

Extract layers from different model architectures for parallelization.

This function handles various model types including vision-language models,
causal language models, and multimodal models. It collects both language
model layers and vision model layers where applicable.

**Parameters:**

* `model`: The model to extract layers from.

**Returns:** `List[nn.Module]`

A list of all layers that should be parallelized.

```python
nemo_automodel.components.distributed.parallelizer._find_largest_module_list(
    model: torch.nn.Module
) -> typing.Optional[typing.Union[torch.nn.ModuleList, torch.nn.ModuleDict]]
```

Heuristic function to find the largest layer container in a model.

This function recursively traverses the model to find all nn.ModuleList and
pipeline-split nn.ModuleDict instances and returns the one with the most
modules. This is useful as a fallback when the model architecture is unknown,
since transformer layers are typically organized in ModuleLists. Pipeline
splitting converts ModuleLists to ModuleDicts keyed by original layer index.

**Parameters:**

* `model`: The model to search through.

**Returns:** `Optional[Union[nn.ModuleList, nn.ModuleDict]]`

The largest layer container found, or None.

```python
nemo_automodel.components.distributed.parallelizer._get_module_by_fqn(
    module: torch.nn.Module,
    fqn: str
) -> typing.Optional[torch.nn.Module]
```

```python
nemo_automodel.components.distributed.parallelizer._get_parallel_plan(
    model: torch.nn.Module,
    sequence_parallel: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    tp_size: int = 1
) -> typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle]
```

Select the tensor-parallel plan for the given model.

Priority order:

1. If `tp_shard_plan` is provided as a dict or import path, use it.
2. If the model type exists in `PARALLELIZE_FUNCTIONS`, use its optimised plan; on failure, fall back to HF plan.
3. Otherwise, prefer the model's HF-native `_tp_plan` (via `get_hf_tp_shard_plan`).
4. Otherwise, fall back to the default base plan.

When `tp_size > 1` and the model falls through to path 4 *and* the
model class was loaded from a custom-code source (HF's
`trust_remote_code=True` path, where the dynamic class lives under
`transformers_modules.*`), this raises `ValueError` instead of
returning the default base plan. On recent PyTorch the default plan's
placements do not populate `shard_order` and trip an internal assert in
`torch.distributed.tensor._redistribute` on the first weight
redistribute, which surfaces to the user as an opaque PyTorch internal
error. Custom-code architectures are the only known-broken case (see
[https://github.com/NVIDIA-NeMo/Automodel/issues/2243](https://github.com/NVIDIA-NeMo/Automodel/issues/2243)); known HF
architectures that happen to fall through (e.g. Mixtral) are left on the
default plan with a warning, since they have been working in practice.

When the model *did* define a `_tp_plan` but `get_hf_tp_shard_plan`
raised while translating it (e.g. styles nemo does not recognize), the
translator's error message is folded into the `ValueError` as a
diagnostic so the user can tell whether to add a `_tp_plan` from
scratch or fix the styles in the one they already have.
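
The priority order above can be pictured with a small torch-free sketch. Everything here is a hypothetical stand-in: plans are plain dicts of strings rather than `ParallelStyle` objects, `optimised_plans` plays the role of `PARALLELIZE_FUNCTIONS`, `hf_plan_fn` plays the role of `get_hf_tp_shard_plan`, and `is_custom_code` stands for the `transformers_modules.*` check. The real function may differ in detail, for example in how it handles an HF failure after an optimised plan fails.

```python
from typing import Callable, Dict, Optional

Plan = Dict[str, str]  # stand-in: real plans map module names to ParallelStyle objects


def select_plan_sketch(
    model_type: str,
    user_plan: Optional[Plan],
    optimised_plans: Dict[str, Callable[[], Plan]],
    hf_plan_fn: Callable[[], Plan],
    default_plan: Plan,
    tp_size: int = 1,
    is_custom_code: bool = False,
) -> Plan:
    # 1. An explicitly provided plan always wins.
    if user_plan is not None:
        return user_plan

    # 2. Model-specific optimised plan; on failure, fall back to the HF plan.
    if model_type in optimised_plans:
        try:
            return optimised_plans[model_type]()
        except Exception:
            return hf_plan_fn()

    # 3. HF-native plan, remembering the error for diagnostics if it fails.
    try:
        return hf_plan_fn()
    except Exception as exc:
        hf_error = exc

    # 4. Default base plan, except for custom-code models with tp_size > 1.
    if tp_size > 1 and is_custom_code:
        raise ValueError(f"No usable TP plan for custom-code model: {hf_error}")
    return default_plan
```

In this sketch the translator's error is carried into the `ValueError` message, mirroring the diagnostic behaviour described above.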

```python
nemo_automodel.components.distributed.parallelizer._is_checkpoint_wrapped(
    module: torch.nn.Module
) -> bool
```

```python
nemo_automodel.components.distributed.parallelizer._is_transformers_v5_or_higher() -> bool
```

Check if transformers version is 5.x or higher.

```python
nemo_automodel.components.distributed.parallelizer._nemotronh_decoder_blocks(
    model: torch.nn.Module
) -> tuple[torch.nn.Module, list[torch.nn.Module]]
```

Return `(container, blocks)` for a NemotronH model's decoder blocks.

Two distinct classes share the name `NemotronHForCausalLM`:

* the HF model keeps its blocks in `model.backbone.layers` (an `nn.ModuleList`), while
* the native Nemotron-V3 model (`NemotronV3Model`) keeps them in `model.model.layers`
  (an `nn.ModuleDict` keyed `"0".."N-1"`).

`container` is the underlying `ModuleList`/`ModuleDict` (so callers can write rewrapped
blocks back into the model), and `blocks` is the ordered list of block modules.
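
A minimal sketch of the two layouts, using plain Python stand-ins (a `list` for `nn.ModuleList`, a `dict` for `nn.ModuleDict`, and `SimpleNamespace` for the model). Detecting the layout through a `backbone` attribute and sorting the dict keys numerically are assumptions made for illustration, not confirmed details of the real helper.

```python
from types import SimpleNamespace


def decoder_blocks_sketch(model):
    backbone = getattr(model, "backbone", None)
    if backbone is not None:
        # HF layout: blocks live in a list-like container.
        container = backbone.layers
        return container, list(container)
    # Native layout: blocks live in a dict keyed "0".."N-1".
    container = model.model.layers
    # Sort numerically so that "10" comes after "9", not after "1".
    return container, [container[key] for key in sorted(container, key=int)]


hf_like = SimpleNamespace(backbone=SimpleNamespace(layers=["block0", "block1"]))
native_like = SimpleNamespace(
    model=SimpleNamespace(layers={"0": "block0", "10": "block10", "2": "block2"})
)

container, blocks = decoder_blocks_sketch(native_like)
# Callers can write a rewrapped block back through the container.
container["2"] = "wrapped(block2)"
```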

```python
nemo_automodel.components.distributed.parallelizer._patch_dtensor_spec_hash_for_symint() -> None
```

Fix a crash when torch.compile + DTensor are used together.

Problem: torch.compile traces with symbolic shapes (SymInt). DTensorSpec hashes
its shape to cache sharding decisions, but SymInt is not hashable -> crash.

Fix: if hashing the shape fails, fall back to hashing only (mesh, placements).
Cache hits are slightly reduced but correctness is unaffected.
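
The fallback can be illustrated without torch. In this sketch `FakeSymInt` is a hypothetical stand-in for a symbolic dimension, and `spec_hash_sketch` is an illustrative function, not the patched `DTensorSpec` method. Catching `TypeError` is an assumption about how the hashing failure surfaces.

```python
class FakeSymInt:
    """Stand-in for a symbolic dimension: deliberately unhashable."""

    __hash__ = None  # hash() on an instance raises TypeError


def spec_hash_sketch(mesh, placements, shape):
    try:
        # Normal path: the shape takes part in the cache key.
        return hash((mesh, placements, tuple(shape)))
    except TypeError:
        # Symbolic shape: fall back to a coarser key without the shape.
        return hash((mesh, placements))


mesh, placements = "mesh0", ("Shard(0)",)
static_key = spec_hash_sketch(mesh, placements, (4, 8))
symbolic_key = spec_hash_sketch(mesh, placements, (FakeSymInt(), 8))
```

With the coarser key, specs that differ only in their symbolic shape share a hash, so they can no longer be told apart by hash alone.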

```python
nemo_automodel.components.distributed.parallelizer._subtree_all_frozen(
    module: torch.nn.Module
) -> bool
```

Return True if `module` owns parameters and none of them require grad.

Used to skip FSDP-wrapping a frozen submodule that never runs in the forward
(e.g. the audio tower on image/text-only data); see
`apply_fsdp2_sharding_recursively`.
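
The predicate has one subtle case: a module with no parameters at all is not considered frozen. A minimal sketch, assuming only that the module exposes a `parameters()` iterator whose items carry `requires_grad` (here `FakeModule` is a hypothetical stand-in for `torch.nn.Module`):

```python
from types import SimpleNamespace


class FakeModule:
    """Stand-in exposing parameters() like torch.nn.Module does."""

    def __init__(self, requires_grad_flags):
        self._params = [SimpleNamespace(requires_grad=f) for f in requires_grad_flags]

    def parameters(self):
        return iter(self._params)


def subtree_all_frozen_sketch(module) -> bool:
    params = list(module.parameters())
    # Must own at least one parameter, and none may require grad.
    return len(params) > 0 and not any(p.requires_grad for p in params)


frozen_tower = FakeModule([False, False])
trainable_layer = FakeModule([False, True])
parameterless = FakeModule([])
```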

```python
nemo_automodel.components.distributed.parallelizer._update_attention_head_counts_for_tp(
    model: torch.nn.Module,
    tp_size: int
) -> None
```

After TP sharding, the Q/K/V outputs are split across ranks (each rank has
num\_heads/tp\_size heads). Update the config and each attention layer's
num\_heads / num\_key\_value\_heads so the forward uses the local head count
instead of the global one (avoids shape mismatches in .view()).
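
The arithmetic behind the update is a plain integer division. The sketch below assumes a config object with the Hugging Face attribute names `num_attention_heads` and `num_key_value_heads`, and adds a divisibility check for illustration. The real function also updates each attention layer, which is omitted here.

```python
from types import SimpleNamespace


def update_head_counts_sketch(config, tp_size: int) -> None:
    names = ("num_attention_heads", "num_key_value_heads")
    # Check everything first so a failure leaves the config untouched.
    for name in names:
        total = getattr(config, name)
        if total % tp_size != 0:
            raise ValueError(f"{name}={total} is not divisible by tp_size={tp_size}")
    # Replace each global count with the per-rank (local) count.
    for name in names:
        setattr(config, name, getattr(config, name) // tp_size)


config = SimpleNamespace(num_attention_heads=32, num_key_value_heads=8)
update_head_counts_sketch(config, tp_size=4)
# Each rank now reshapes its Q output with 8 heads and its K/V outputs with 2.
```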

```python
nemo_automodel.components.distributed.parallelizer.apply_fsdp2_sharding_recursively(
    module: torch.nn.Module,
    mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy],
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    enable_fsdp2_prefetch: bool = True,
    fsdp2_backward_prefetch_depth: int = 2,
    fsdp2_forward_prefetch_depth: int = 1,
    reshard_after_forward: typing.Optional[bool] = None,
    fully_shard_fn = None
) -> None
```

Recursively apply FSDP2 sharding to modules, with optimizations for ModuleList.

This utility function traverses a model hierarchy and applies FSDP2 sharding
to each module. For ModuleList instances (commonly used for transformer layers),
it applies an optimization where the last layer doesn't reshard after forward
since FSDP2 will prefetch it immediately.

Handles both single-level and nested ModuleList/ModuleDict structures. If a
ModuleList contains other ModuleLists, it will recurse into them instead of trying
to wrap them (since ModuleList doesn't have a forward method).

Note:
This function modifies the module in-place by replacing modules with their
FSDP2-subclassed versions.

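**Parameters:**

* `module`: The module to apply FSDP sharding to.
* `mesh`: The device mesh for FSDP sharding.
* `mp_policy`: Mixed precision policy for FSDP.
* `offload_policy`: CPU offload policy for FSDP. Defaults to None.
* `enable_fsdp2_prefetch`: Enable explicit forward/backward prefetch chains.
* `fsdp2_backward_prefetch_depth`: Backward prefetch depth.
* `fsdp2_forward_prefetch_depth`: Forward prefetch depth.
* `reshard_after_forward`: Optional override for each layer's `fully_shard` reshard behavior.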
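
The last-layer optimization for a `ModuleList` can be pictured as a per-layer flag. This is a torch-free sketch: `reshard_flags_sketch` is a hypothetical helper, and treating a non-None `reshard_after_forward` as a uniform override for every layer is an assumption about how the override is applied.

```python
from typing import List, Optional


def reshard_flags_sketch(
    num_layers: int, reshard_after_forward: Optional[bool] = None
) -> List[bool]:
    flags = []
    for idx in range(num_layers):
        if reshard_after_forward is not None:
            # Explicit override: use the same value for every layer.
            flags.append(reshard_after_forward)
        else:
            # Default: every layer reshards after forward except the last,
            # which would be prefetched again immediately.
            flags.append(idx < num_layers - 1)
    return flags


default_flags = reshard_flags_sketch(4)
forced_flags = reshard_flags_sketch(4, reshard_after_forward=False)
```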

```python
nemo_automodel.components.distributed.parallelizer.apply_selective_activation_checkpointing(
    model: torch.nn.Module,
    enable_compile: bool = False
) -> None
```

Apply selective activation checkpointing to `model` end to end.

Standalone entry point (detects KV-sharing, disables `use_cache`, and
wraps transformer blocks) for paths where the FSDP2 parallelize flow is
skipped -- notably single-GPU training.

**Parameters:**

* `model`: The model to checkpoint.
* `enable_compile`: Whether per-layer `torch.compile` will be applied.

```python
nemo_automodel.components.distributed.parallelizer.fsdp2_strategy_parallelize(
    model,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    mp_policy: typing.Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
    offload_policy: typing.Optional[torch.distributed.fsdp.OffloadPolicy] = None,
    sequence_parallel: bool = False,
    activation_checkpointing: bool = False,
    tp_shard_plan: typing.Optional[typing.Union[typing.Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
    dp_replicate_mesh_name: str = 'dp_replicate',
    dp_shard_cp_mesh_name: str = 'dp_shard_cp',
    tp_mesh_name: str = 'tp',
    enable_async_tensor_parallel: bool = False,
    enable_compile: bool = False,
    enable_fsdp2_prefetch: bool = True,
    fsdp2_backward_prefetch_depth: int = 2,
    fsdp2_forward_prefetch_depth: int = 1,
    reshard_after_forward: typing.Optional[bool] = None
)
```

Apply parallelisms and activation checkpointing to the model.

Enhanced version that uses a strategy pattern for different model parallelization approaches:

* Automatic strategy selection based on model type
* Polymorphic parallelization strategies for different model families
* Custom parallel plan support (dict or string path)
* Sequence parallel support
* Activation checkpointing for linear layers
* Model validation (attention heads divisible by TP size)
* Better fallback logic

NOTE: The passed-in model should preferably be on the meta device.
Otherwise, the model must fit in GPU or CPU memory.

**Parameters:**

* `model`: The model to be parallelized.
* `device_mesh`: The device mesh for distributed training.
* `mp_policy`: Mixed precision policy for model parallelism.
* `offload_policy`: The offload policy for FSDP.
* `sequence_parallel`: Whether to use sequence parallelism. Defaults to False.
* `activation_checkpointing`: Whether to use activation checkpointing. Defaults to False.
* `tp_shard_plan`: Custom tensor parallel plan for the model. If provided, this takes precedence over automatic plan generation. Can be:
  * A dictionary mapping module names to parallel styles
  * A string path to a dictionary or function that returns a dictionary
* `dp_replicate_mesh_name`: Key name for the data parallel replicate mesh in device\_mesh. Used when data parallel replicate is enabled. Defaults to "dp\_replicate".
* `dp_shard_cp_mesh_name`: Key name for the data parallel shard + context parallel mesh in device\_mesh. Used when data parallel shard is enabled. Defaults to "dp\_shard\_cp".
* `tp_mesh_name`: Key name for the tensor parallel mesh in device\_mesh. Defaults to "tp".

**Returns:**

The parallelized model.

```python
nemo_automodel.components.distributed.parallelizer.get_hf_tp_shard_plan(
    model
)
```

Get the Hugging Face tensor parallel plan from the model.

This function:

* Retrieves TP strategies from model class, instance, and inner model levels.
* Handles `embed_tokens` and `lm_head` as special cases for speed.
* Converts string-based parallel styles to DTensor parallelization strategies.

Taken and modified from: [https://github.com/NVIDIA/NeMo/blob/6c6169db01bcca73ae8ad3ac35242fadbb9a78ba/nemo/lightning/pytorch/strategies/utils.py#L532](https://github.com/NVIDIA/NeMo/blob/6c6169db01bcca73ae8ad3ac35242fadbb9a78ba/nemo/lightning/pytorch/strategies/utils.py#L532)

**Parameters:**

* `model`: A Hugging Face model instance

**Returns:**

A dictionary mapping model component paths to their parallelization strategies

**Raises:**

* `AssertionError`: If no TP plan is found

```python
nemo_automodel.components.distributed.parallelizer.get_parallelization_strategy(
    model: torch.nn.Module
) -> nemo_automodel.components.distributed.parallelizer.ParallelizationStrategy
```

Get the appropriate parallelization strategy for the given model.

```python
nemo_automodel.components.distributed.parallelizer.import_class_from_path(
    name: str
) -> typing.Any
```

Import a class from a string path (e.g. 'torch.optim.AdamW').

**Parameters:**

* `name`: Full path to class including module path and class name

**Returns:** `Any`

The imported class object

```python
nemo_automodel.components.distributed.parallelizer.import_classes_from_paths(
    class_paths: typing.List[str]
)
```

Helper function to import classes from string paths.

**Parameters:**

* `class_paths`: The list of string paths to the classes.

**Returns:**

List of imported classes.
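
A minimal sketch of how the two import helpers could work, built on the standard `importlib` module. The `_sketch` functions are illustrative and not the library's implementation. A stdlib class is used in place of `'torch.optim.AdamW'` so the example runs without torch installed.

```python
import importlib
from typing import Any, List


def import_class_from_path_sketch(name: str) -> Any:
    # Split "package.module.ClassName" at the last dot.
    module_name, _, class_name = name.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path like 'pkg.mod.Class', got {name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def import_classes_from_paths_sketch(class_paths: List[str]) -> List[Any]:
    # Resolve each path in order, preserving the input ordering.
    return [import_class_from_path_sketch(path) for path in class_paths]


ordered_dict_cls = import_class_from_path_sketch("collections.OrderedDict")
classes = import_classes_from_paths_sketch(["collections.OrderedDict", "pathlib.Path"])
```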

```python
nemo_automodel.components.distributed.parallelizer.megatron_fsdp_strategy_parallelize(
    model,
    device_mesh: torch.distributed.device_mesh.DeviceMesh,
    optimizer = None,
    megatron_fsdp_unit_modules: typing.Optional[typing.List[str]] = None,
    tp_shard_plan: typing.Optional[typing.Dict[str, typing.Union[torch.distributed.tensor.parallel.RowwiseParallel, torch.distributed.tensor.parallel.ColwiseParallel, torch.distributed.tensor.parallel.SequenceParallel]]] = None,
    zero_dp_strategy: int = 3,
    init_fsdp_with_meta_device: bool = False,
    grad_reduce_in_fp32: bool = False,
    preserve_fp32_weights: bool = False,
    overlap_grad_reduce: bool = True,
    overlap_param_gather: bool = True,
    check_for_nan_in_grad: bool = True,
    average_in_collective: bool = False,
    disable_bucketing: bool = False,
    calculate_per_token_loss: bool = False,
    keep_fp8_transpose_cache: bool = False,
    nccl_ub: bool = False,
    fsdp_double_buffer: bool = False,
    dp_shard_dim: str = 'dp',
    tp_dim: str = 'tp'
)
```

Apply tensor/data parallelism (MegatronFSDP) and optional activation-checkpointing to the model.

NOTE: The passed-in model should preferably reside on the meta device.
Otherwise, ensure the model fits into available GPU or CPU memory.

NOTE: The user must ensure that the provided `tp_shard_plan` is compatible
with the model architecture.

**Parameters:**

* `model`: The model to be parallelized.
* `device_mesh`: The device mesh describing the physical devices used for distributed training.
* `megatron_fsdp_unit_modules`: Names of submodules that should become individual MegatronFSDP units. If `None`, the full model is wrapped as a single unit.
* `tp_shard_plan`: A tensor-parallel sharding plan. Keys are module names; values specify the parallel style to apply (e.g., `RowwiseParallel`, `ColwiseParallel`, `SequenceParallel`).
* `zero_dp_strategy`: The zero-DP strategy to use.
* `init_fsdp_with_meta_device`: If `True`, construct the model on a meta device first and materialize weights lazily to reduce memory fragmentation.
* `grad_reduce_in_fp32`: Reduce gradients in FP32 irrespective of the parameter precision to improve numerical stability.
* `preserve_fp32_weights`: Keep a master FP32 copy of weights when training in reduced precision (e.g., FP16/BF16).
* `overlap_grad_reduce`: If `True`, overlap gradient reduction with backward computation.
* `overlap_param_gather`: If `True`, overlap parameter gathering with forward computation.
* `check_for_nan_in_grad`: Whether to check gradients for NaNs/Infs before applying the optimizer step.
* `average_in_collective`: Perform gradient averaging inside the collective operation instead of dividing afterward.
* `disable_bucketing`: Disable gradient bucketing; gradients are reduced immediately as they are produced.
* `calculate_per_token_loss`: Compute loss normalized by the number of tokens instead of the number of sequences.
* `keep_fp8_transpose_cache`: Retain the FP8 transpose cache when using a custom MegatronFSDP wrapper.
* `nccl_ub`: Enable the NCCL user-buffer API (experimental) for reduced latency on some networks.
* `fsdp_double_buffer`: Enable double buffering of parameters to overlap communication and computation in MegatronFSDP.
* `dp_shard_dim`: Key name for the data-parallel mesh in `device_mesh`. Defaults to `"dp"`.
* `tp_dim`: Key name for the tensor-parallel mesh in `device_mesh`. Defaults to `"tp"`.

```python
nemo_automodel.components.distributed.parallelizer.register_parallel_strategy(
    arg = None,
    name: typing.Optional[str] = None
)
```

Decorator to register out-of-tree parallelism strategies.

Supports:

* `@register_parallel_strategy(name="CustomModelName")`

```python
nemo_automodel.components.distributed.parallelizer.translate_to_torch_parallel_style(
    style: str
)
```

Translates string descriptions to parallelism plans.

Model configurations specify parallel styles with a neutral type (string);
this function translates them into torch.distributed tensor-parallel
types.

```python
nemo_automodel.components.distributed.parallelizer.unshard_fsdp2_model(
    model: torch.nn.Module
) -> typing.Generator[None, None, None]
```

Explicitly unshard and then reshard the FSDP2 modules. Useful for logprob inference.
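The `Generator` return type suggests the function is meant to be used as a context manager. The sketch below shows the unshard-then-reshard pattern with a stand-in class, assuming each FSDP2-wrapped module exposes `unshard()` and `reshard()` methods; `FakeShardedModule` and `unshard_then_reshard` are hypothetical and do not touch real FSDP state.

```python
from contextlib import contextmanager
from typing import Generator, Iterable


class FakeShardedModule:
    """Stand-in for an FSDP2-wrapped module (hypothetical); records calls."""

    def __init__(self):
        self.calls = []

    def unshard(self):
        self.calls.append("unshard")

    def reshard(self):
        self.calls.append("reshard")


@contextmanager
def unshard_then_reshard(
    modules: Iterable[FakeShardedModule],
) -> Generator[None, None, None]:
    modules = list(modules)
    try:
        for module in modules:
            module.unshard()  # gather full parameters before the body runs
        yield
    finally:
        for module in modules:
            module.reshard()  # release them again, even if the body raised


layers = [FakeShardedModule(), FakeShardedModule()]
with unshard_then_reshard(layers):
    # Inside the block every module has been unsharded but not yet resharded.
    inside = [module.calls[:] for module in layers]
```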

```python
nemo_automodel.components.distributed.parallelizer.validate_tp_mesh(
    model,
    tp_mesh
)
```

Validate that the numbers of attention heads and key-value heads are divisible by the TP size.
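The divisibility requirement can be sketched as a standalone check. The attribute names `num_attention_heads` and `num_key_value_heads` follow the common Hugging Face config convention and are an assumption here, as are the helper name and the choice of `ValueError`; the real function takes a model and a TP mesh and may report failures differently.

```python
from types import SimpleNamespace


def check_heads_divisible(config, tp_size: int) -> None:
    # Assumption: head counts live on HF-style config attributes.
    for attr in ("num_attention_heads", "num_key_value_heads"):
        heads = getattr(config, attr)
        if heads % tp_size != 0:
            raise ValueError(
                f"{attr}={heads} is not divisible by TP size {tp_size}"
            )


config = SimpleNamespace(num_attention_heads=32, num_key_value_heads=8)

# 32 and 8 are both divisible by 4, so this call returns without raising.
check_heads_divisible(config, tp_size=4)

# With TP size 16, the 8 key-value heads cannot be split evenly.
try:
    check_heads_divisible(config, tp_size=16)
    failed = False
except ValueError:
    failed = True
```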

```python
nemo_automodel.components.distributed.parallelizer.validate_tp_mesh_for_nemotron_nas(
    model,
    tp_size
)
```

Validate that a Nemotron-NAS model can be tensor-parallel sharded.

```python
nemo_automodel.components.distributed.parallelizer.HAVE_MEGATRON_FSDP = True
```

```python
nemo_automodel.components.distributed.parallelizer.PARALLELIZATION_STRATEGIES: Dict[str, ParallelizationStrategy] = {'NemotronHForCausalLM': NemotronHParallelizationStrategy(), 'DeepseekV4ForCausa...
```

```python
nemo_automodel.components.distributed.parallelizer._BAGEL_FULL_LAYER_CHECKPOINT_MODULE_LISTS = ('model.language_model.model.layers', 'model.vit_model.vision_model.encoder.laye...
```

```python
nemo_automodel.components.distributed.parallelizer._DEFAULT_STRATEGY = DefaultParallelizationStrategy()
```

```python
nemo_automodel.components.distributed.parallelizer.logger = logging.getLogger(__name__)
```