nemo_automodel.components.distributed.pipelining.functional
Module Contents
Classes
Functions
Data
API
Callable protocol for applying distributed parallelism to a model.
Extract hidden_size and vocab_size from a model config.
Handles both flat configs (LLM) and nested configs where these attributes
live under text_config (VLM models such as Qwen3-VL, LLaVA, etc.).
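The flat-versus-nested lookup described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `get_text_dims` and the flat-first lookup order are assumptions, and `SimpleNamespace` objects stand in for real HuggingFace configs.

```python
from types import SimpleNamespace


def get_text_dims(config):
    """Return (hidden_size, vocab_size) from a flat or nested config.

    Hypothetical helper: the name and the flat-first lookup order are
    assumptions made for illustration.
    """
    hidden_size = getattr(config, "hidden_size", None)
    vocab_size = getattr(config, "vocab_size", None)

    # Fall back to the nested text_config used by VLM-style configs.
    text_config = getattr(config, "text_config", None)
    if text_config is not None:
        if hidden_size is None:
            hidden_size = getattr(text_config, "hidden_size", None)
        if vocab_size is None:
            vocab_size = getattr(text_config, "vocab_size", None)

    if hidden_size is None or vocab_size is None:
        raise ValueError("config exposes neither flat nor text_config dimensions")
    return hidden_size, vocab_size


# Stand-ins for a flat LLM config and a nested VLM config.
llm_config = SimpleNamespace(hidden_size=4096, vocab_size=32000)
vlm_config = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=2048, vocab_size=151936)
)
```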
Precompute input/output meta tensors for each pipeline stage to bypass serial shape inference.
By default, PipelineStage performs shape inference at runtime via a serial P2P chain: stage 0 → send → stage 1 → send → … → stage N-1. This is O(N) in the number of pipeline stages and becomes a bottleneck for large world sizes.
This function sets inputs_meta and _outputs_meta on each stage before the
first step() call, so that _shape_inference is never invoked and the serial
chain is completely eliminated.
Parameters:
The local pipeline stages (already parallelized).
The HuggingFace model config (model.config).
Microbatch size used by the pipeline schedule.
Sequence length of the input data.
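The reason this can replace the serial chain is that each stage's shapes follow from the config and batch geometry alone, so every rank can compute them locally. The sketch below shows only that shape arithmetic with plain tuples. It assumes that the first stage receives token ids of shape (microbatch, seq_len), that intermediate stages exchange hidden states, and that the last stage emits logits; the helper name `stage_io_shapes` is hypothetical, and a real implementation would presumably build tensors on the `meta` device from these shapes.

```python
def stage_io_shapes(is_first, is_last, microbatch_size, seq_len,
                    hidden_size, vocab_size):
    """Return (input_shape, output_shape) for one pipeline stage.

    Hypothetical helper; the shape conventions are assumptions:
      - first stage input:  token ids     (mb, seq)
      - inter-stage tensor: hidden states (mb, seq, hidden)
      - last stage output:  logits        (mb, seq, vocab)
    """
    hidden = (microbatch_size, seq_len, hidden_size)
    input_shape = (microbatch_size, seq_len) if is_first else hidden
    output_shape = (microbatch_size, seq_len, vocab_size) if is_last else hidden
    return input_shape, output_shape


# Every stage derives its shapes independently, with no P2P communication.
num_stages = 4
shapes = [
    stage_io_shapes(
        is_first=(i == 0),
        is_last=(i == num_stages - 1),
        microbatch_size=2,
        seq_len=128,
        hidden_size=64,
        vocab_size=1000,
    )
    for i in range(num_stages)
]
```

Under these assumptions, the output shape of stage i always equals the input shape of stage i+1, which is the invariant the runtime shape inference would otherwise establish by sending tensors down the chain.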
Make a pipeline stage’s forward emit a tensor, not a ModelOutput.
Custom *ForCausalLM / *ForConditionalGeneration models now return a
CausalLMOutputWithPast from forward (fused-linear cross-entropy
support, compute_lm_head_logits). torch.distributed.pipelining
requires every stage to emit a tensor (or tuple/list of tensors):
PipelineStage._validate_fwd_outputs and the inter-stage P2P send/recv
treat the output as tensor leaves and read .shape on each, which raises
AttributeError: 'CausalLMOutputWithPast' object has no attribute 'shape'.
The stage’s outer forward is left intact (a) for models that opt out of
patching via _pp_keep_self_forward and (b) for MoE configs that set
patch_causal_lm_model=False so only the inner model is patched. In both
cases the kept outer forward returns a ModelOutput. This wraps it so
the return is unwrapped to its .logits tensor:
compute_lm_head_logits puts the projected logits there on the final stage
and the pass-through hidden_states on non-final stages (lm_head is None) — exactly the tensor each stage must forward, and the logits the
last-stage loss (PipelineCausalLMLoss / MaskedCrossEntropy) consumes.
No-op when forward already returns a tensor or a tuple (the patched
create_pipeline_forward_causal_lm path, and MTP models that emit a
(logits, *mtp, seq_idx) tuple), since only ModelOutput is unwrapped.
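The unwrapping behaviour can be illustrated with a small wrapper. This is a sketch under stated assumptions rather than the module's implementation: `FakeModelOutput` is a stand-in for a HuggingFace `ModelOutput`, and `unwrap_to_logits` is a hypothetical name.

```python
import functools
from dataclasses import dataclass


@dataclass
class FakeModelOutput:
    """Stand-in for a HuggingFace ModelOutput (assumption for this sketch)."""
    logits: object


def unwrap_to_logits(forward):
    """Wrap `forward` so a ModelOutput-like return is replaced by its .logits.

    Tensors and tuples pass through unchanged, mirroring the no-op case
    described above.
    """
    @functools.wraps(forward)
    def wrapped(*args, **kwargs):
        out = forward(*args, **kwargs)
        if isinstance(out, FakeModelOutput):
            return out.logits
        return out

    return wrapped


@unwrap_to_logits
def forward_returning_output(x):
    return FakeModelOutput(logits=[v * 2 for v in x])


@unwrap_to_logits
def forward_returning_tuple(x):
    return (x, "mtp", 0)
```

Checking the return type at call time, rather than deciding once at patch time, keeps the wrapper safe to apply to stages whose forward already returns a tensor or tuple.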
Builds a pipeline schedule for the given job configuration and stages.
Parameters:
The path to the pipeline parallel schedule csv file.
The name of the pipeline parallel schedule.
The microbatch size.
The local batch size.
The stages to be scheduled.
The loss function.
Returns: _PipelineSchedule
The pipeline schedule for the given stages.
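The function takes both a microbatch size and a local batch size, which suggests that the number of microbatches per step is derived from the two. The sketch below shows that arithmetic only. It is an assumption about how the values relate, not a description of the actual implementation, and `num_microbatches` is a hypothetical helper.

```python
def num_microbatches(local_batch_size, microbatch_size):
    """Number of microbatches a schedule would process per step.

    Hypothetical helper; assumes the local batch is split evenly.
    """
    if microbatch_size <= 0:
        raise ValueError("microbatch_size must be positive")
    if local_batch_size % microbatch_size != 0:
        raise ValueError(
            f"local_batch_size ({local_batch_size}) must be divisible by "
            f"microbatch_size ({microbatch_size})"
        )
    return local_batch_size // microbatch_size
```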
Calculate virtual pipeline stages and layers per stage.
Generates module names for each pipeline stage for HuggingFace models.
Parameters:
Number of pipeline stages.
Total number of transformer layers in the model.
Whether to include the embedding layer in the first stage.
Whether to include lm_head in the last stage (for CausalLM models).
Whether to include common vision/audio encoder modules in stage 0.
Optional list of extra module FQNs to include in stage 0.
Returns: list[list[str]]
List of lists containing module names for each stage
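A simplified version of this kind of split is sketched below. It is illustrative only: the module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) follow a common HuggingFace decoder layout and are assumptions, the even layer distribution is one plausible policy, and `split_layer_names` is a hypothetical helper that ignores the vision/audio and extra-module options.

```python
def split_layer_names(num_stages, num_layers,
                      include_embeddings=True, include_lm_head=True):
    """Return one list of module names per pipeline stage.

    Hypothetical helper. Layers are spread as evenly as possible, with any
    remainder going to the earliest stages.
    """
    if num_stages <= 0 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")

    base, extra = divmod(num_layers, num_stages)
    stages = []
    next_layer = 0
    for stage in range(num_stages):
        count = base + (1 if stage < extra else 0)
        names = []
        if stage == 0 and include_embeddings:
            names.append("model.embed_tokens")
        names.extend(
            f"model.layers.{i}" for i in range(next_layer, next_layer + count)
        )
        next_layer += count
        if stage == num_stages - 1:
            names.append("model.norm")
            if include_lm_head:
                names.append("lm_head")
        stages.append(names)
    return stages


two_stage_split = split_layer_names(num_stages=2, num_layers=4)
```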
HF-specific pipeline model splitting.
Reset pipeline stage infrastructure and recompute shapes for a new sequence length.
VLM training produces batches with highly variable sequence lengths (image tokens expand
the sequence dramatically). PyTorch’s PipelineStage locks in output shapes and recv
buffer sizes on the first schedule.step() call (_stages_initialized = True).
Subsequent steps with a different seq_len therefore hit a shape-mismatch error.
This function resets the per-stage infrastructure so that _initialize_stages re-runs
on the next step() call. It then calls _precompute_stage_shapes to set the
correct shapes analytically — avoiding the expensive real-valued forward pass that
_shape_inference would otherwise perform.
Parameters:
The active pipeline schedule.
The local pipeline stages for this rank.
The HuggingFace model config (model.config).
Per-microbatch batch size used by the schedule.
Sequence length of the upcoming batch (e.g. input_ids.shape[1]).
Scale pipeline stage gradients by a common divisor when supported.
Splits a HuggingFace model for pipeline parallelism.
Parameters:
The HuggingFace model to split.
Pipeline parallel device mesh.
Name of the pipeline parallelism schedule.
Device to place stages on.
Optional manual specification of modules per stage.
Number of pipeline stages (used if module_names_per_stage is not provided).
Returns: list[PipelineStage]
Tuple of (stages, models) where stages are PipelineStage objects and models are the
Compute the stage IDs of the stages that will run on this PP rank, for either a looped or a V-style schedule.
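The two assignment styles can be sketched as follows. This is a minimal illustration of the usual conventions, assumed rather than taken from this module: in a looped schedule a rank owns stages spaced `pp_size` apart, and in a V-style schedule each rank owns exactly two stages, pairing early stages with late ones. The function name and the `style` argument are hypothetical.

```python
def stage_ids_for_rank(pp_size, pp_rank, stages_per_rank, style="loop"):
    """Return the tuple of stage ids owned by `pp_rank`.

    Hypothetical helper illustrating looped vs V-style assignment.
    """
    if not 0 <= pp_rank < pp_size:
        raise ValueError("pp_rank out of range")
    num_stages = pp_size * stages_per_rank

    if style == "loop":
        # Rank r owns r, r + pp_size, r + 2 * pp_size, ...
        return tuple(pp_rank + s * pp_size for s in range(stages_per_rank))

    if style == "v":
        if stages_per_rank != 2:
            raise ValueError("V-style schedules assume exactly 2 stages per rank")
        # Pair stage i with stage num_stages - 1 - i, tracing a "V" over ranks.
        pairs = list(zip(range(pp_size), range(num_stages - 1, pp_size - 1, -1)))
        return pairs[pp_rank]

    raise ValueError(f"unknown style: {style!r}")
```

With 4 ranks and 2 stages per rank, rank 1 would own stages (1, 5) in the looped layout and (1, 6) in the V-style layout under these assumptions.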