# nemo_automodel.components.distributed.pipelining.functional

## Module Contents

### Classes

| Name                                                                                                          | Description                                                        |
| ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| [`ParallelizeFnProtocol`](#nemo_automodel-components-distributed-pipelining-functional-ParallelizeFnProtocol) | Callable protocol for applying distributed parallelism to a model. |

### Functions

| Name                                                                                                                                        | Description                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| [`_get_hidden_and_vocab_size`](#nemo_automodel-components-distributed-pipelining-functional-_get_hidden_and_vocab_size)                     | Extract hidden\_size and vocab\_size from a model config.                                                  |
| [`_get_optional_hook`](#nemo_automodel-components-distributed-pipelining-functional-_get_optional_hook)                                     | -                                                                                                          |
| [`_precompute_stage_shapes`](#nemo_automodel-components-distributed-pipelining-functional-_precompute_stage_shapes)                         | Precompute input/output meta tensors for each pipeline stage to bypass serial shape inference.             |
| [`_wrap_stage_forward_to_emit_tensor`](#nemo_automodel-components-distributed-pipelining-functional-_wrap_stage_forward_to_emit_tensor)     | Make a pipeline stage's `forward` emit a tensor, not a `ModelOutput`.                                      |
| [`build_pipeline_schedule`](#nemo_automodel-components-distributed-pipelining-functional-build_pipeline_schedule)                           | Builds a pipeline schedule for the given job configuration and stages.                                     |
| [`calculate_virtual_stages`](#nemo_automodel-components-distributed-pipelining-functional-calculate_virtual_stages)                         | Calculate virtual pipeline stages and layers per stage.                                                    |
| [`generate_hf_model_fqn_per_model_part`](#nemo_automodel-components-distributed-pipelining-functional-generate_hf_model_fqn_per_model_part) | Generates module names for each pipeline stage for HuggingFace models.                                     |
| [`pipeline_model`](#nemo_automodel-components-distributed-pipelining-functional-pipeline_model)                                             | HF-specific pipeline model splitting.                                                                      |
| [`reset_pp_stage_shapes`](#nemo_automodel-components-distributed-pipelining-functional-reset_pp_stage_shapes)                               | Reset pipeline stage infrastructure and recompute shapes for a new sequence length.                        |
| [`scale_grads_by_divisor`](#nemo_automodel-components-distributed-pipelining-functional-scale_grads_by_divisor)                             | Scale pipeline stage gradients by a common divisor when supported.                                         |
| [`split_model_into_stages`](#nemo_automodel-components-distributed-pipelining-functional-split_model_into_stages)                           | Splits a HuggingFace model for pipeline parallelism.                                                       |
| [`stage_ids_this_rank`](#nemo_automodel-components-distributed-pipelining-functional-stage_ids_this_rank)                                   | Compute the stage ids for the stages that run on this pp rank, for either a looped or V-style schedule.    |

### Data

[`logger`](#nemo_automodel-components-distributed-pipelining-functional-logger)

### API

```python
class nemo_automodel.components.distributed.pipelining.functional.ParallelizeFnProtocol()
```

**Bases:** `Protocol`

Callable protocol for applying distributed parallelism to a model.

```python
nemo_automodel.components.distributed.pipelining.functional.ParallelizeFnProtocol.__call__(
    model: torch.nn.Module,
    world_mesh: torch.distributed.device_mesh.DeviceMesh,
    moe_mesh: torch.distributed.device_mesh.DeviceMesh,
    dp_axis_names: tuple[str, ...],
    cp_axis_name: str | None = None,
    tp_axis_name: str | None = None,
    ep_axis_name: str | None = None,
    ep_shard_axis_names: tuple[str, ...] | None = None
) -> None
```

```python
nemo_automodel.components.distributed.pipelining.functional._get_hidden_and_vocab_size(
    model_config
) -> tuple[int, int]
```

Extract hidden\_size and vocab\_size from a model config.

Handles both flat configs (LLM) and nested configs where these attributes
live under `text_config` (VLM models such as Qwen3-VL, LLaVA, etc.).
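The flat-versus-nested lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `get_hidden_and_vocab_size_sketch` is a hypothetical name, the configs are plain `SimpleNamespace` objects, and preferring `text_config` whenever it is present is an assumption.

```python
from types import SimpleNamespace


def get_hidden_and_vocab_size_sketch(model_config) -> tuple[int, int]:
    """Illustrative stand-in for the lookup described above."""
    # Assumption: when a nested text_config exists, it holds the LM fields.
    text_config = getattr(model_config, "text_config", None)
    source = text_config if text_config is not None else model_config
    return source.hidden_size, source.vocab_size


# Flat config, as used by a text-only LLM.
flat_cfg = SimpleNamespace(hidden_size=4096, vocab_size=32000)

# Nested config, as used by a VLM that wraps a language model.
nested_cfg = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=2048, vocab_size=151936)
)

flat_sizes = get_hidden_and_vocab_size_sketch(flat_cfg)
nested_sizes = get_hidden_and_vocab_size_sketch(nested_cfg)
```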

```python
nemo_automodel.components.distributed.pipelining.functional._get_optional_hook(
    module: object,
    name: str
) -> typing.Callable | None
```

```python
nemo_automodel.components.distributed.pipelining.functional._precompute_stage_shapes(
    stages: list[torch.distributed.pipelining.PipelineStage],
    model_config,
    microbatch_size: int,
    seq_len: int,
    tensor_dtype: torch.dtype | None = None
) -> None
```

Precompute input/output meta tensors for each pipeline stage to bypass serial shape inference.

By default, PipelineStage performs shape inference at runtime via a serial P2P chain:
stage 0 → send → stage 1 → send → ... → stage N-1.  This is O(N) in the number of
pipeline stages and becomes a bottleneck for large world sizes.

This function sets `inputs_meta` and `_outputs_meta` on each stage *before* the
first `step()` call, so that `_shape_inference` is never invoked and the serial
chain is completely eliminated.

**Parameters:**

`stages`: The local pipeline stages (already parallelized).

`model_config`: The HuggingFace model config (`model.config`).

`microbatch_size`: Microbatch size used by the pipeline schedule.

`seq_len`: Sequence length of the input data.
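To make the idea of computing shapes analytically more concrete, the sketch below derives per-stage input and output shapes from the same four quantities. It rests on assumptions not stated in this page: the first stage receives token ids of shape `(microbatch_size, seq_len)`, intermediate boundaries carry hidden states, and the last stage emits logits. `stage_io_shapes` is a hypothetical helper, and plain tuples stand in for meta tensors so the example runs without PyTorch.

```python
def stage_io_shapes(
    num_stages: int,
    microbatch_size: int,
    seq_len: int,
    hidden_size: int,
    vocab_size: int,
) -> list[tuple[tuple[int, ...], tuple[int, ...]]]:
    """Return (input_shape, output_shape) per stage under the stated assumptions."""
    token_shape = (microbatch_size, seq_len)
    hidden_shape = (microbatch_size, seq_len, hidden_size)
    logits_shape = (microbatch_size, seq_len, vocab_size)

    shapes = []
    for stage_idx in range(num_stages):
        is_first = stage_idx == 0
        is_last = stage_idx == num_stages - 1
        in_shape = token_shape if is_first else hidden_shape
        out_shape = logits_shape if is_last else hidden_shape
        shapes.append((in_shape, out_shape))
    return shapes


shapes = stage_io_shapes(
    num_stages=3, microbatch_size=2, seq_len=128, hidden_size=64, vocab_size=1000
)
```

Every stage can compute its own entry locally, which is why no stage-to-stage communication is needed to learn the shapes.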

```python
nemo_automodel.components.distributed.pipelining.functional._wrap_stage_forward_to_emit_tensor(
    stage_model: torch.nn.Module
) -> None
```

Make a pipeline stage's `forward` emit a tensor, not a `ModelOutput`.

Custom `*ForCausalLM` / `*ForConditionalGeneration` models now return a
`CausalLMOutputWithPast` from `forward` (fused-linear cross-entropy
support, `compute_lm_head_logits`). `torch.distributed.pipelining`
requires every stage to emit a tensor (or tuple/list of tensors):
`PipelineStage._validate_fwd_outputs` and the inter-stage P2P send/recv
treat the output as tensor leaves and read `.shape` on each, which raises
`AttributeError: 'CausalLMOutputWithPast' object has no attribute 'shape'`.

The stage's outer `forward` is left intact (a) for models that opt out of
patching via `_pp_keep_self_forward` and (b) for MoE configs that set
`patch_causal_lm_model=False` so only the inner model is patched. In both
cases the kept outer `forward` returns a `ModelOutput`. This wraps it so
the return is unwrapped to its `.logits` tensor:
`compute_lm_head_logits` puts the projected logits there on the final stage
and the pass-through `hidden_states` on non-final stages (`lm_head is
None`) -- exactly the tensor each stage must forward, and the logits the
last-stage loss (`PipelineCausalLMLoss` / `MaskedCrossEntropy`) consumes.

No-op when `forward` already returns a tensor or a tuple (the patched
`create_pipeline_forward_causal_lm` path, and MTP models that emit a
`(logits, *mtp, seq_idx)` tuple), since only `ModelOutput` is unwrapped.
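The unwrapping behaviour described above can be illustrated with a small, dependency-free sketch. `FakeModelOutput`, `FakeStage`, and `wrap_forward_to_emit_logits` are hypothetical stand-ins; the real function operates on `torch.nn.Module` stages and `ModelOutput` instances, and its internals may differ.

```python
import functools
from dataclasses import dataclass


@dataclass
class FakeModelOutput:
    """Stand-in for a ModelOutput such as CausalLMOutputWithPast."""

    logits: object


class FakeStage:
    """Stand-in for a stage whose kept outer forward returns a ModelOutput."""

    def forward(self, x):
        return FakeModelOutput(logits=[v * 2 for v in x])


def wrap_forward_to_emit_logits(stage_model) -> None:
    """Replace stage_model.forward with a version that unwraps `.logits`."""
    original_forward = stage_model.forward

    @functools.wraps(original_forward)
    def forward(*args, **kwargs):
        out = original_forward(*args, **kwargs)
        # Only the ModelOutput-like container is unwrapped;
        # tensors and tuples pass through unchanged.
        if isinstance(out, FakeModelOutput):
            return out.logits
        return out

    stage_model.forward = forward  # instance-level override


stage = FakeStage()
wrap_forward_to_emit_logits(stage)
result = stage.forward([1, 2, 3])
```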

```python
nemo_automodel.components.distributed.pipelining.functional.build_pipeline_schedule(
    pipeline_parallel_schedule_csv: str | None,
    pipeline_parallel_schedule: str | None,
    microbatch_size: int,
    local_batch_size: int,
    stages: list[torch.distributed.pipelining.PipelineStage],
    loss_fn: typing.Callable,
    scale_grads: bool = False
) -> torch.distributed.pipelining.schedules._PipelineSchedule
```

Builds a pipeline schedule for the given job configuration and stages.

**Parameters:**

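`pipeline_parallel_schedule_csv`: The path to the pipeline parallel schedule csv file.

`pipeline_parallel_schedule`: The name of the pipeline parallel schedule.

`microbatch_size`: The microbatch size.

`local_batch_size`: The local batch size.

`stages`: The stages to be scheduled.

`loss_fn`: The loss function.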

**Returns:** `_PipelineSchedule`

The pipeline schedule for the given stages.

```python
nemo_automodel.components.distributed.pipelining.functional.calculate_virtual_stages(
    num_layers: int,
    layers_per_stage: typing.Optional[int],
    pp_size: int,
    is_single_stage_schedule: bool,
    round_to_pp_multiple: str | None = None
) -> tuple[int, int]
```

Calculate virtual pipeline stages and layers per stage.

```python
nemo_automodel.components.distributed.pipelining.functional.generate_hf_model_fqn_per_model_part(
    num_stages: int,
    num_layers: int,
    include_embeddings: bool = True,
    include_lm_head: bool = True,
    include_rotary_emb: bool = True,
    include_multimodal_encoders: bool = True,
    extra_module_fqns: typing.Optional[list[str]] = None,
    fqn_prefix: str = 'model.',
    lm_head_fqn: str = 'lm_head'
) -> list[list[str]]
```

Generates module names for each pipeline stage for HuggingFace models.

**Parameters:**

`num_stages`: Number of pipeline stages

`num_layers`: Total number of transformer layers in the model

`include_embeddings`: Whether to include embedding layer in first stage

`include_lm_head`: Whether to include lm\_head in last stage (for CausalLM models)

`include_multimodal_encoders`: Whether to include common vision/audio encoder modules in stage 0

`extra_module_fqns`: Optional list of extra module FQNs to include in stage 0

**Returns:** `list[list[str]]`

List of lists containing module names for each stage
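A simplified picture of the returned structure is sketched below. It is not the library's algorithm: the even split of layers, the module names `embed_tokens` and `norm` (taken from common HuggingFace decoder layouts), and the helper name `sketch_fqns_per_stage` are all assumptions made for illustration, and the rotary-embedding, multimodal, and extra-module options are omitted.

```python
def sketch_fqns_per_stage(
    num_stages: int,
    num_layers: int,
    include_embeddings: bool = True,
    include_lm_head: bool = True,
    fqn_prefix: str = "model.",
    lm_head_fqn: str = "lm_head",
) -> list[list[str]]:
    """Assign layer FQNs to stages, spreading any remainder over early stages."""
    base, extra = divmod(num_layers, num_stages)
    stages: list[list[str]] = []
    layer_idx = 0
    for stage_idx in range(num_stages):
        names: list[str] = []
        if stage_idx == 0 and include_embeddings:
            names.append(f"{fqn_prefix}embed_tokens")  # assumed module name
        layers_here = base + (1 if stage_idx < extra else 0)
        for _ in range(layers_here):
            names.append(f"{fqn_prefix}layers.{layer_idx}")
            layer_idx += 1
        if stage_idx == num_stages - 1:
            names.append(f"{fqn_prefix}norm")  # assumed module name
            if include_lm_head:
                names.append(lm_head_fqn)
        stages.append(names)
    return stages


fqns = sketch_fqns_per_stage(num_stages=2, num_layers=4)
```

Each inner list names the modules a stage keeps, so the first stage owns the embeddings and the last stage owns the head.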

```python
nemo_automodel.components.distributed.pipelining.functional.pipeline_model(
    model: torch.nn.Module,
    world_mesh: torch.distributed.device_mesh.DeviceMesh,
    moe_mesh: torch.distributed.device_mesh.DeviceMesh,
    pp_axis_name: str,
    dp_axis_names: tuple[str, ...],
    cp_axis_name: str | None = None,
    tp_axis_name: str | None = None,
    ep_axis_name: str | None = None,
    ep_shard_axis_names: tuple[str, ...] | None = None,
    layers_per_stage: int | None,
    pipeline_parallel_schedule_csv: str | None,
    pipeline_parallel_schedule: str | None,
    microbatch_size: int,
    local_batch_size: int,
    device: torch.device,
    loss_fn: typing.Callable = None,
    parallelize_fn: typing.Callable | None = None,
    module_fqns_per_model_part: list[list[str]] | None = None,
    patch_inner_model: bool = True,
    patch_causal_lm_model: bool = True,
    scale_grads: bool = False,
    round_to_pp_multiple: str | None = None,
    patch_stage_backward_maybe_with_nosync: bool = False,
    reduce_grad_per_microbatch: bool = False,
    seq_len: int | None = None,
    tensor_dtype: torch.dtype | None = None
) -> tuple[torch.distributed.pipelining.schedules._PipelineSchedule, list[torch.nn.Module], bool, bool, list[torch.distributed.pipelining.PipelineStage]]
```

HF-specific pipeline model splitting.

```python
nemo_automodel.components.distributed.pipelining.functional.reset_pp_stage_shapes(
    schedule: torch.distributed.pipelining.schedules._PipelineSchedule,
    stages: list[torch.distributed.pipelining.PipelineStage],
    model_config,
    microbatch_size: int,
    seq_len: int,
    tensor_dtype: torch.dtype | None = None
) -> None
```

Reset pipeline stage infrastructure and recompute shapes for a new sequence length.

VLM training produces batches with highly variable sequence lengths (image tokens expand
the sequence dramatically).  PyTorch's PipelineStage locks in output shapes and recv
buffer sizes on the first `schedule.step()` call (`_stages_initialized = True`).
Subsequent steps with a different seq\_len therefore hit a shape-mismatch error.

This function resets the per-stage infrastructure so that `_initialize_stages` re-runs
on the next `step()` call. It then calls `_precompute_stage_shapes` to set the
correct shapes analytically, avoiding the expensive real-valued forward pass that
`_shape_inference` would otherwise perform.

**Parameters:**

`schedule`: The active pipeline schedule.

`stages`: The local pipeline stages for this rank.

`model_config`: The HuggingFace model config (`model.config`).

`microbatch_size`: Per-microbatch batch size used by the schedule.

`seq_len`: Sequence length of the upcoming batch (e.g. `input_ids.shape[1]`).
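One possible calling pattern is to reset only when the sequence length actually changes between batches. The sketch below shows that pattern with a hypothetical `SeqLenTracker` helper; the reset callable is injected so the example runs without PyTorch, and whether skipping redundant resets is appropriate for a given training loop is an assumption.

```python
from typing import Callable, Optional


class SeqLenTracker:
    """Hypothetical helper: invoke a reset callable only when seq_len changes."""

    def __init__(
        self,
        reset_fn: Callable[[int], None],
        initial_seq_len: Optional[int] = None,
    ) -> None:
        self._reset_fn = reset_fn
        self._last_seq_len = initial_seq_len

    def maybe_reset(self, seq_len: int) -> bool:
        """Return True if a reset was triggered for this batch."""
        if seq_len == self._last_seq_len:
            return False
        self._reset_fn(seq_len)
        self._last_seq_len = seq_len
        return True


# In a training loop, reset_fn could be something like:
#   lambda s: reset_pp_stage_shapes(schedule, stages, model.config, microbatch_size, s)
# Here a list records the calls so the behaviour is visible.
reset_calls: list[int] = []
tracker = SeqLenTracker(reset_calls.append, initial_seq_len=512)

for batch_seq_len in (512, 2048, 2048, 1024):
    tracker.maybe_reset(batch_seq_len)
```

The tracker would be consulted before each `schedule.step()` call, with `batch_seq_len` read from the upcoming batch.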

```python
nemo_automodel.components.distributed.pipelining.functional.scale_grads_by_divisor(
    stages: list[torch.distributed.pipelining.PipelineStage],
    divisor: int
) -> None
```

Scale pipeline stage gradients by a common divisor when supported.

```python
nemo_automodel.components.distributed.pipelining.functional.split_model_into_stages(
    model: torch.nn.Module,
    pp_mesh: torch.distributed.device_mesh.DeviceMesh,
    pp_axis_name: str,
    pp_schedule: str,
    device: torch.device,
    module_names_per_stage: typing.Optional[list[list[str]]] = None,
    layers_per_stage: typing.Optional[int] = None,
    patch_inner_model: bool = True,
    patch_causal_lm_model: bool = True,
    round_to_pp_multiple: str | None = None
) -> tuple[list[torch.distributed.pipelining.PipelineStage], list[torch.nn.Module]]
```

Splits a HuggingFace model for pipeline parallelism.

**Parameters:**

`model`: The HuggingFace model to split

`pp_mesh`: Pipeline parallel device mesh

`pp_schedule`: Name of pipeline parallelism schedule

`device`: Device to place stages on

`module_names_per_stage`: Optional manual specification of modules per stage

Number of pipeline stages (used if module\_names\_per\_stage not provided)

**Returns:** `tuple[list[PipelineStage], list[torch.nn.Module]]`

Tuple of (stages, models) where stages are PipelineStage objects and models are the corresponding `torch.nn.Module` objects.

```python
nemo_automodel.components.distributed.pipelining.functional.stage_ids_this_rank(
    pp_rank: int,
    pp_size: int,
    num_stages: int,
    style: str = 'loop'
) -> tuple[int]
```

Compute the stage ids for the stages that run on this pp rank, for either a looped or V-style schedule.
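The two layouts can be sketched as follows, assuming the conventional definitions: in a looped layout rank `r` owns stages `r, r + pp_size, r + 2 * pp_size, ...`, and in a V layout each rank owns exactly two stages, `r` and its mirror `num_stages - 1 - r`. `stage_ids_sketch` is a hypothetical name, and the validation rules are assumptions rather than the library's documented behaviour.

```python
def stage_ids_sketch(
    pp_rank: int, pp_size: int, num_stages: int, style: str = "loop"
) -> tuple[int, ...]:
    """Illustrative stage assignment for looped and V-style layouts."""
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be a multiple of pp_size")
    stages_per_rank = num_stages // pp_size

    if style == "loop":
        # Rank r takes every pp_size-th stage starting at r.
        return tuple(pp_rank + i * pp_size for i in range(stages_per_rank))
    if style == "v":
        if stages_per_rank != 2:
            raise ValueError("V-style layout assumes exactly two stages per rank")
        # Rank r pairs stage r with its mirror from the end.
        return (pp_rank, num_stages - 1 - pp_rank)
    raise ValueError(f"unknown style: {style!r}")


loop_ids = stage_ids_sketch(pp_rank=1, pp_size=4, num_stages=8, style="loop")
v_ids = stage_ids_sketch(pp_rank=1, pp_size=4, num_stages=8, style="v")
print(loop_ids)  # (1, 5)
print(v_ids)     # (1, 6)
```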

```python
nemo_automodel.components.distributed.pipelining.functional.logger = logging.getLogger(__name__)
```