Pipeline Parallelism with AutoPipeline

Introduction

As large language models continue to grow in size, training and fine-tuning them efficiently across multiple GPUs has become increasingly challenging. While data parallelism works well for smaller models, models with billions of parameters require more sophisticated parallelization strategies to overcome memory constraints and communication overhead.

Pipeline parallelism addresses these challenges by splitting a model’s layers across different devices and processing them in a pipelined fashion. Each device processes a different stage of the model, enabling training of models that wouldn’t fit on a single device while maintaining high GPU utilization through overlapped computation.

AutoPipeline is NeMo AutoModel’s high-level pipeline parallelism interface specifically designed for Hugging Face models, making pipeline parallelism as simple as data parallelism. Built on PyTorch’s native torch.distributed.pipelining, AutoPipeline provides seamless pipeline parallelism support for any Hugging Face decoder-only causal language model with minimal code changes.

For custom models and more granular control, the functional API in nemo_automodel.components.distributed.pipelining.functional provides modular, accessible building blocks that can be used with any PyTorch model architecture.

This guide walks you through the complete process of using AutoPipeline for Hugging Face models and the functional API for custom models. You’ll learn how to configure pipeline stages, integrate with existing training workflows, optimize performance, and combine pipeline parallelism with other parallelization strategies.

Prerequisites:

$ # Install uv from https://docs.astral.sh/uv/getting-started/installation/
$ # Initialize the virtual environment using uv
$ uv venv

$ # Install the latest stable release from PyPI
$ uv pip install nemo-automodel

$ # Or install from source for the latest features
$ uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git

Before proceeding with this guide, make sure NeMo AutoModel is installed on your machine. For a complete guide and additional options, consult the AutoModel Installation Guide.

Key Features

AutoPipeline provides the following capabilities:

Universal Hugging Face Support: Works with any Hugging Face decoder-only causal language model including Llama, Qwen, Mistral, Gemma, and more
PyTorch Native Integration: Built on PyTorch’s torch.distributed.pipelining for optimal performance
Flexible Configuration: Multiple scheduling strategies, configurable microbatch sizes, and automatic or manual layer splitting
Mixed Parallelism Support: Combine pipeline parallelism with data parallelism, tensor parallelism, and FSDP
Modular Functional API: For custom models, the functional module provides accessible, low-level building blocks
Minimal Opinions: Easy to extend and integrate with existing training workflows

Quick Start with AutoPipeline (Hugging Face Models)

Here’s a minimal example to get started with AutoPipeline using 2 pipeline stages with a Hugging Face model:

import torch
from torch.distributed.device_mesh import init_device_mesh
from nemo_automodel.components.distributed.pipelining import AutoPipeline
from transformers import AutoModelForCausalLM
from transformers.integrations.accelerate import init_empty_weights
from transformers.modeling_utils import no_init_weights
from transformers.utils import ContextManagers

def loss_fn(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Define loss function for pipeline training."""
    return torch.nn.functional.cross_entropy(
        logits.float().view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100
    )

if __name__ == "__main__":
    # 1) Initialize device mesh with 2 pipeline stages
    world_mesh = init_device_mesh("cuda", mesh_shape=(2,), mesh_dim_names=("pp",))

    # 2) Load model on meta device to avoid OOM with large models
    init_ctx = ContextManagers([no_init_weights(), init_empty_weights()])
    with init_ctx:
        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # 3) Configure and build pipeline
    ap = AutoPipeline(
        world_mesh=world_mesh,
        pp_axis_name="pp",
        pp_schedule="1f1b",
        pp_microbatch_size=1,
        pp_batch_size=8,  # Total batch size across pipeline
        device=torch.cuda.current_device(),
        dtype=torch.bfloat16,
    ).build(model, loss_fn=loss_fn)

    # 4) Access pipeline components
    print(ap.debug_summary())
    print(ap.pretty_print_stages())

Run the Quick Start Example

Save the above code as pipeline_example.py and run with:

$ # Run with 2 GPUs for 2 pipeline stages
$ uv run torchrun --nproc-per-node=2 pipeline_example.py

For a complete training example:

$ # Run fine-tuning with 2-way pipeline parallelism using Llama 3.1 8B
$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_1/llama3_1_8b_hellaswag_pp.yaml

Configuration Options

Basic Configuration

AutoPipeline provides comprehensive control over pipeline behavior:

ap = AutoPipeline(
    # Device mesh configuration
    world_mesh=world_mesh,           # DeviceMesh with pipeline axis
    pp_axis_name="pp",              # Name of pipeline axis (default: "pp")

    # Schedule configuration
    pp_schedule="1f1b",             # Pipeline schedule ("1f1b", "looped_bfs", etc.)
    pp_microbatch_size=1,           # Microbatch size per stage
    # pp_batch_size is automatically inferred from dataloader.batch_size

    # Stage configuration
    layers_per_stage=None,          # Layers per stage (None for auto)
    module_fqns_per_model_part=None,  # Manual module assignment

    # Model patching (HF-specific)
    patch_inner_model=True,         # Make decoder forward stage-friendly
    patch_causal_lm_model=True,     # Make CausalLM wrapper return tensors (hidden/logits)
).build(model, loss_fn=loss_fn)

Model Patching (`patch_inner_model`, `patch_causal_lm_model`)

AutoPipeline splits a model by deep-copying it per stage and pruning away modules that don’t belong to that stage. Many Hugging Face models assume the full module tree is present and return ModelOutput objects; after pruning, their original forward() often breaks (or returns objects that are awkward to pipeline).

These two flags switch AutoPipeline to lightweight, pipeline-friendly forward() implementations that return tensors (see nemo_automodel.components.distributed.pipelining.hf_utils.patch_hf_model_for_pp):

patch_inner_model: patches the decoder module (model.model for ...ForCausalLM, otherwise the module itself) so each stage can run even after pruning.
- Stage 0 (has embed_tokens): takes token IDs and produces hidden states.
- Middle stages (no embed_tokens): take hidden states from the previous stage (using inputs_embeds, or a float tensor passed through input_ids) and produce hidden states.
- Handles sliced layer containers (e.g., layers becoming dict-like after stage pruning) and returns a tensor of hidden states so stages can be chained.
For compilation/performance, this patched forward prefers a precomputed causal_mask_mapping dict (it will fall back to computing masks and warn if you don’t provide it).
patch_causal_lm_model: patches the ...ForCausalLM wrapper forward (the module that owns lm_head) so pipeline stages return tensors:
- Returns hidden states when lm_head is absent on that stage.
- Returns logits when lm_head is present (typically only the last stage).
- Supports logits_to_keep to compute logits for only the last k tokens.
Note: this is only used when the module you pipeline is a ...ForCausalLM-style wrapper (i.e., it has a .model attribute). If you pass a base decoder module directly, patch_causal_lm_model typically has no effect.

When Should I Change These?

Leave both True (default) for standard Hugging Face AutoModelForCausalLM / ...ForCausalLM models. This is the common case and gives the expected behavior: token IDs -> hidden states -> logits across stages.
Set both False when your model already has a pipeline-friendly forward (returns tensors and can accept hidden states when embeddings are absent) or it needs custom kwargs/paths that the HF patch doesn’t preserve (common for NeMo AutoModel-native model implementations, packed-sequence/thd paths, extra args like padding_mask, etc.). Many benchmark configs for NeMo-native models do this (for example examples/benchmark/configs/qwen3_moe_30b_torch.yaml).
Set patch_inner_model=False, patch_causal_lm_model=True when your inner model is already stage-friendly, but the wrapper forward still returns a ModelOutput and you only want the wrapper simplified to “hidden states or logits”.

If you disable patch_causal_lm_model, your last stage will typically output hidden states instead of logits; in that case, make sure your loss_fn (or your last-stage module) applies the LM head explicitly.

Automatic vs. Manual Layer Distribution

AutoPipeline offers flexible control over how your model is split across pipeline stages:

Automatic Distribution

Let AutoPipeline automatically balance layers across stages:

ap = AutoPipeline(
    world_mesh=world_mesh,
    pp_schedule="1f1b",
    layers_per_stage=8,  # Each stage gets ~8 transformer layers
).build(model, loss_fn=loss_fn)

Manual Distribution

Specify exactly which modules go to each stage:

from nemo_automodel.components.distributed.pipelining.functional import (
    generate_hf_model_fqn_per_model_part
)

# Generate balanced assignments
module_fqns = generate_hf_model_fqn_per_model_part(
    num_stages=4,
    num_layers=32,
    include_embeddings=True,
    include_lm_head=True,
    include_rotary_emb=True,
    fqn_prefix="model."
)

# Or define custom assignments
custom_module_fqns = [
    # Stage 0: Embeddings + first 8 layers
    ["model.embed_tokens", "model.rotary_emb"] +
    [f"model.layers.{i}" for i in range(8)],

    # Stage 1: Next 8 layers
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(8, 16)],

    # Stage 2: Next 8 layers
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(16, 24)],

    # Stage 3: Final 8 layers + output
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(24, 32)] +
    ["model.norm", "lm_head"]
]

ap = AutoPipeline(
    world_mesh=world_mesh,
    module_fqns_per_model_part=custom_module_fqns,
).build(model, loss_fn=loss_fn)
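A hand-written assignment is easy to get subtly wrong, for example by skipping a layer or listing one twice. The following is a minimal sketch of a sanity check you could run before passing the list to AutoPipeline. `check_layer_coverage` is a hypothetical helper written for this guide, not part of NeMo AutoModel, and it assumes the `model.layers.{i}` naming used in the custom assignment above.

```python
from collections import Counter


def check_layer_coverage(module_fqns_per_stage, num_layers, layer_prefix="model.layers."):
    """Return (missing, duplicated) layer indices for a per-stage FQN list.

    Hypothetical helper for illustration; not a NeMo AutoModel API.
    """
    counts = Counter()
    for stage_fqns in module_fqns_per_stage:
        for fqn in stage_fqns:
            if fqn.startswith(layer_prefix):
                suffix = fqn[len(layer_prefix):]
                if suffix.isdigit():
                    counts[int(suffix)] += 1
    # Layers that no stage owns, and layers owned by more than one stage.
    missing = [i for i in range(num_layers) if counts[i] == 0]
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated


# Same layout as the four-stage custom assignment above.
stages = [
    ["model.embed_tokens", "model.rotary_emb"] + [f"model.layers.{i}" for i in range(8)],
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(8, 16)],
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(16, 24)],
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(24, 32)] + ["model.norm", "lm_head"],
]
missing, duplicated = check_layer_coverage(stages, num_layers=32)
print("missing:", missing, "duplicated:", duplicated)
```

Shared modules such as `model.rotary_emb` are deliberately ignored by the check, since they are expected to appear on every stage.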

Understand Model Splitting

When AutoPipeline splits your model, it assigns each component to a stage based on where that component sits in the forward pass. Here’s how a typical model gets split:

Example: 32-Layer Model Across 2 Stages

# Stage 0 (Rank 0): Input processing + first half
stage_0_modules = [
    "model.embed_tokens",     # Token embeddings
    "model.layers.0-15",      # First 16 transformer layers
    "model.rotary_emb"        # Position embeddings (shared)
]

# Stage 1 (Rank 1): Second half + output processing
stage_1_modules = [
    "model.layers.16-31",     # Last 16 transformer layers
    "model.norm",             # Final layer norm
    "lm_head",               # Language modeling head
    "model.rotary_emb"        # Position embeddings (shared)
]

Example: 32-Layer Model Across 4 Stages

# Stage 0 (Rank 0): Input processing
stage_0_modules = [
    "model.embed_tokens",     # Token embeddings
    "model.layers.0-7",       # First 8 transformer layers
    "model.rotary_emb"        # Position embeddings (shared)
]

# Stage 1 (Rank 1): Early layers
stage_1_modules = [
    "model.layers.8-15",      # Next 8 transformer layers
    "model.rotary_emb"
]

# Stage 2 (Rank 2): Middle layers
stage_2_modules = [
    "model.layers.16-23",     # Next 8 transformer layers
    "model.rotary_emb"
]

# Stage 3 (Rank 3): Output processing
stage_3_modules = [
    "model.layers.24-31",     # Final 8 transformer layers
    "model.norm",             # Final layer norm
    "lm_head",               # Language modeling head
    "model.rotary_emb"
]

Key observations:

Embeddings only exist on the first stage
Language modeling head only exists on the last stage
Rotary embeddings are shared across all stages (for position encoding)
Transformer layers are evenly distributed
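The observations above can be sketched as a small pure-Python function. This only illustrates the layout pattern: `sketch_stage_fqns` is a hypothetical name, it assumes the layer count divides evenly by the stage count, and it is not how `generate_hf_model_fqn_per_model_part` is implemented.

```python
def sketch_stage_fqns(num_stages, num_layers, prefix="model."):
    """Illustrative even split that follows the observations listed above.

    Hypothetical helper; assumes num_layers is divisible by num_stages.
    """
    if num_layers % num_stages != 0:
        raise ValueError("this sketch only handles evenly divisible splits")
    per_stage = num_layers // num_stages
    stages = []
    for stage in range(num_stages):
        fqns = []
        if stage == 0:
            fqns.append(f"{prefix}embed_tokens")  # embeddings: first stage only
        start = stage * per_stage
        fqns += [f"{prefix}layers.{i}" for i in range(start, start + per_stage)]
        if stage == num_stages - 1:
            fqns += [f"{prefix}norm", "lm_head"]  # final norm and head: last stage only
        fqns.append(f"{prefix}rotary_emb")  # rotary embeddings: every stage
        stages.append(fqns)
    return stages


for idx, fqns in enumerate(sketch_stage_fqns(num_stages=2, num_layers=32)):
    print(f"stage {idx}: {fqns[0]} ... {fqns[-1]} ({len(fqns)} modules)")
```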

Use the Functional API for Custom Models

While AutoPipeline is specifically designed as a high-level interface for Hugging Face models, the functional API in nemo_automodel.components.distributed.pipelining.functional provides more modular and accessible building blocks that can be used with any PyTorch model, including custom architectures. This separation allows for cleaner code organization where AutoPipeline handles Hugging Face-specific optimizations while the functional module remains model-agnostic.

Key Functional API Components

The functional API provides several utilities for building custom pipeline parallel systems:

Stage ID Calculation

from nemo_automodel.components.distributed.pipelining.functional import stage_ids_this_rank

# Calculate which stages run on this rank
# For a "loop" style schedule (default)
stage_ids = stage_ids_this_rank(pp_rank=0, pp_size=4, num_stages=8, style="loop")
# Returns: (0, 4) - rank 0 gets stages 0 and 4

# For a "v" style schedule (for zero-bubble schedules)
stage_ids = stage_ids_this_rank(pp_rank=0, pp_size=4, num_stages=8, style="v")
# Returns: (0, 7) - rank 0 gets stages 0 and 7
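To make the two styles concrete, here is a pure-Python sketch that reproduces the assignments shown in the comments above. `sketch_stage_ids` is a hypothetical stand-in rather than the library function, and its "v" branch assumes exactly two stages per rank.

```python
def sketch_stage_ids(pp_rank, pp_size, num_stages, style="loop"):
    """Illustrative stage-to-rank mapping (hypothetical, not the library code)."""
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be a multiple of pp_size")
    stages_per_rank = num_stages // pp_size
    if style == "loop":
        # Rank r owns stages r, r + pp_size, r + 2 * pp_size, ...
        return tuple(pp_rank + s * pp_size for s in range(stages_per_rank))
    if style == "v":
        # Stages run down the ranks and then back up, forming a "V".
        if stages_per_rank != 2:
            raise ValueError("this sketch assumes two stages per rank for style='v'")
        return (pp_rank, num_stages - 1 - pp_rank)
    raise ValueError(f"unknown style: {style}")


for rank in range(4):
    print(rank, sketch_stage_ids(rank, 4, 8, "loop"), sketch_stage_ids(rank, 4, 8, "v"))
```

In the "loop" layout, activations wrap around from the last rank back to the first between stage 3 and stage 4. In the "v" layout, the last rank holds two consecutive stages, so the turnaround needs no communication.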

Module Name Generation

from nemo_automodel.components.distributed.pipelining.functional import (
    generate_hf_model_fqn_per_model_part
)

# Generate balanced module assignments for any model
module_names = generate_hf_model_fqn_per_model_part(
    num_stages=4,
    num_layers=32,
    include_embeddings=True,
    include_lm_head=True,
    include_rotary_emb=False,  # Set based on your model
    fqn_prefix=""  # Use "model." for nested models
)

Virtual Stage Calculation

from nemo_automodel.components.distributed.pipelining.functional import calculate_virtual_stages

# Calculate virtual stages for interleaved schedules
num_virtual_stages, stages_per_rank = calculate_virtual_stages(
    num_layers=32,
    layers_per_stage=4,  # Each virtual stage has 4 layers
    pp_size=4,
    is_single_stage_schedule=False,
    round_to_pp_multiple="up"  # Round up to nearest multiple of pp_size
)
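The arithmetic behind this call can be sketched as follows. This is an assumption inferred from the parameter names and comments above (ceiling division, then rounding up to a multiple of `pp_size`). `sketch_virtual_stages` is a hypothetical name, and the real function may validate inputs differently or support more rounding modes.

```python
import math


def sketch_virtual_stages(num_layers, layers_per_stage, pp_size, round_to_pp_multiple=None):
    """Illustrative virtual-stage arithmetic (assumed behaviour, not the library code)."""
    num_virtual_stages = math.ceil(num_layers / layers_per_stage)
    remainder = num_virtual_stages % pp_size
    if remainder != 0:
        if round_to_pp_multiple == "up":
            # Pad the stage count so every rank holds the same number of stages.
            num_virtual_stages += pp_size - remainder
        else:
            raise ValueError("virtual stage count is not a multiple of pp_size")
    stages_per_rank = num_virtual_stages // pp_size
    return num_virtual_stages, stages_per_rank


print(sketch_virtual_stages(num_layers=32, layers_per_stage=4, pp_size=4))
print(sketch_virtual_stages(num_layers=36, layers_per_stage=4, pp_size=4, round_to_pp_multiple="up"))
```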

Pipeline Schedule Build

from nemo_automodel.components.distributed.pipelining.functional import build_pipeline_schedule

# Build a schedule for your stages
schedule = build_pipeline_schedule(
    pipeline_parallel_schedule_csv=None,  # Optional CSV schedule
    pipeline_parallel_schedule="1f1b",
    microbatch_size=1,
    local_batch_size=8,
    stages=stages,  # List of PipelineStage objects
    loss_fn=loss_fn,
    scale_grads=False
)

Example: Pipeline Parallelism for Custom Models

Here’s how to use the functional API to implement pipeline parallelism for a custom model:

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.pipelining import PipelineStage
from nemo_automodel.components.distributed.pipelining.functional import (
    stage_ids_this_rank,
    build_pipeline_schedule,
    calculate_virtual_stages
)

class CustomTransformerBlock(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # batch_first=True because inputs arrive as (batch, seq, hidden)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.GELU(),
            nn.Linear(hidden_size * 4, hidden_size)
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Simplified transformer block
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.mlp(x))
        return x

class CustomModel(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList([
            CustomTransformerBlock(hidden_size) for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for layer in self.layers:
            x = layer(x)
        return self.output_proj(x)

def split_custom_model_for_pipeline(model, pp_rank, pp_size, num_stages):
    """Split a custom model into pipeline stages."""

    # Determine which stages this rank handles
    stage_indices = stage_ids_this_rank(pp_rank, pp_size, num_stages, style="loop")

    stages = []
    for stage_idx in stage_indices:
        # Create a stage-specific version of the model.
        # create_stage_model is a placeholder: implement it so that it returns
        # only the submodules that belong to stage_idx.
        stage_model = create_stage_model(model, stage_idx, num_stages)

        # Create PipelineStage
        stage = PipelineStage(
            stage_model,
            stage_idx,
            num_stages,
            device=torch.cuda.current_device(),
            group=None  # Set your process group here
        )
        stages.append(stage)

    return stages

# Usage
def main():
    # Initialize device mesh
    world_mesh = init_device_mesh("cuda", mesh_shape=(4,), mesh_dim_names=("pp",))
    pp_rank = world_mesh["pp"].get_local_rank()
    pp_size = world_mesh["pp"].size()

    # Create model
    model = CustomModel(vocab_size=50000, hidden_size=768, num_layers=24)

    # Calculate virtual stages
    num_virtual_stages, stages_per_rank = calculate_virtual_stages(
        num_layers=24,
        layers_per_stage=3,  # 8 virtual stages total
        pp_size=4,
        is_single_stage_schedule=False
    )

    # Split model into stages
    stages = split_custom_model_for_pipeline(model, pp_rank, pp_size, num_virtual_stages)

    # Define loss function
    def loss_fn(logits, targets):
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1)
        )

    # Build pipeline schedule
    schedule = build_pipeline_schedule(
        pipeline_parallel_schedule_csv=None,
        pipeline_parallel_schedule="interleaved_1f1b",  # Good for multi-stage
        microbatch_size=1,
        local_batch_size=8,
        stages=stages,
        loss_fn=loss_fn,
        scale_grads=True
    )

    # Training loop
    # `dataloader` is a placeholder for your own DataLoader that yields
    # dicts with "input_ids" and "labels"
    for batch in dataloader:
        # Use schedule.step() for training
        losses = []
        schedule.step(batch["input_ids"], target=batch["labels"], losses=losses)

        # losses will contain the loss values from the last stage
        if losses:
            print(f"Loss: {sum(losses) / len(losses)}")

if __name__ == "__main__":
    main()

Advanced: Custom Model Splitting Logic

For more complex custom models, you can implement your own splitting logic:

import torch
from nemo_automodel.components.distributed.pipelining.functional import pipeline_model

def custom_parallelize_fn(
    model, world_mesh, moe_mesh, *,
    pp_enabled, dp_axis_names, **kwargs
):
    """Custom parallelization function for each pipeline stage."""
    # Apply your custom parallelization logic here
    # This is called for each pipeline stage
    if dp_axis_names:
        # Apply data parallelism
        pass
    # Add any other parallelization strategies
    pass

# Use pipeline_model for complete pipeline setup
schedule, model_parts, has_first, has_last, stages = pipeline_model(
    model=your_custom_model,
    world_mesh=world_mesh,
    moe_mesh=None,
    pp_axis_name="pp",
    dp_axis_names=("dp",),
    layers_per_stage=4,
    pipeline_parallel_schedule="1f1b",
    pipeline_parallel_schedule_csv=None,
    microbatch_size=1,
    local_batch_size=8,
    device=torch.cuda.current_device(),
    loss_fn=loss_fn,
    parallelize_fn=custom_parallelize_fn,
    module_fqns_per_model_part=None,  # Provide custom module names
    patch_inner_model=False,  # Custom model: don't apply HF forward patches
    patch_causal_lm_model=False,  # Custom model: don't apply HF forward patches
)

Tips for Using Functional API with Custom Models

The functional API is more modular than AutoPipeline, but it leaves more decisions to you. Keep the following in mind:

Module Naming: Ensure your model has consistent module naming that can be mapped to stages
State Management: Handle model state (embeddings, buffers) carefully across stages
Communication: First and last stages need special handling for inputs/outputs
Flexibility: The functional API gives you complete control over how models are split and parallelized
Testing: Start with a small model and verify correct splitting before scaling up

The functional module’s modular design makes it easier to integrate pipeline parallelism into existing custom model training workflows without the Hugging Face-specific assumptions that AutoPipeline makes.

Mixed Parallelism

AutoPipeline can be combined with other parallelization strategies for optimal performance:

def parallelize_fn(
    model, world_mesh, moe_mesh, *,
    pp_enabled, dp_axis_names,
    cp_axis_name=None, tp_axis_name=None, ep_axis_name=None
):
    """Apply additional parallelization to each pipeline stage."""
    # Example: Apply FSDP to each stage
    if dp_axis_names:
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
        # Wrap model with FSDP (simplified example)
        # In practice, you'd configure FSDP parameters
        pass

    # Example: Apply tensor parallelism
    if tp_axis_name:
        # Apply tensor parallelism to attention/MLP layers
        pass

# Build pipeline with custom parallelization
ap = AutoPipeline(world_mesh=world_mesh).build(
    model,
    loss_fn=loss_fn,
    parallelize_fn=parallelize_fn
)

Monitor and Debug

AutoPipeline provides comprehensive tools for understanding your pipeline configuration:

Pipeline Information

# Get pipeline info
info = ap.info
print(f"Pipeline enabled: {info.enabled}")
print(f"Has first stage: {info.has_first_stage}")
print(f"Has last stage: {info.has_last_stage}")

# Access model parts
model_parts = ap.parts  # List of pipeline stages
stage_modules = ap.list_stage_modules()  # Module names per stage

Analysis

# Parameter distribution
stage_param_counts = ap.get_stage_param_counts()
total_params = ap.get_total_param_count()
trainable_params = ap.get_total_param_count(trainable_only=True)

for i, params in enumerate(stage_param_counts):
    percentage = (params / total_params) * 100
    print(f"Stage {i}: {params:,} parameters ({percentage:.1f}%)")

# Debug summary
print(ap.debug_summary())
print(ap.pretty_print_stages(max_modules_per_stage=10))

# Visualize schedule
ap.visualize_current_schedule("pipeline_schedule.png")

Gradient Management

# Scale gradients for mixed parallelism
ap.scale_grads_by_divisor(divisor=8)

# Clip gradients across pipeline stages
grad_norm = ap.clip_grad_norm(max_norm=1.0, norm_type=2.0)
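Because each rank holds only its own stages, a global gradient norm has to be assembled from per-stage contributions. The sketch below shows the underlying L2 arithmetic with plain Python lists. It illustrates the math only and makes no claim about how `clip_grad_norm` is implemented; both function names are hypothetical.

```python
import math


def global_l2_norm(per_stage_grads):
    """Combine per-stage gradients into one L2 norm: sqrt of the summed squares."""
    return math.sqrt(sum(g * g for grads in per_stage_grads for g in grads))


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor applied to every gradient on every stage; never above 1.0."""
    return min(1.0, max_norm / (total_norm + eps))


stage_grads = [[3.0, 0.0], [0.0, 4.0]]  # toy gradients living on two stages
total = global_l2_norm(stage_grads)
coef = clip_coefficient(total, max_norm=1.0)
print(f"global norm = {total}, clip coefficient = {coef:.4f}")
```

Clipping each stage against its own local norm would scale stages by different factors and distort the gradient direction, which is why the norm is computed over all stages before any scaling happens.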

Add Pipeline Parallelism to Existing Configurations

You can easily add pipeline parallelism to any existing training configuration through command-line overrides or YAML modifications.

Command-Line Override Method

Add pipeline parallelism to an existing config using command-line arguments:

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>     --distributed.strategy fsdp2 \
>     --distributed.pp_size 2 \
>     --distributed.pipeline.pp_schedule 1f1b \
>     --distributed.pipeline.pp_microbatch_size 1 \
>     --distributed.pipeline.round_virtual_stages_to_pp_multiple up \
>     --distributed.pipeline.scale_grads_in_schedule false

Key parameters to override:

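--distributed.pp_size: Number of pipeline-parallel ranks. The product of pp_size, dp_size, tp_size, and cp_size must equal the total number of GPUs, so with pipeline parallelism alone it matches nproc-per-node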
pp_batch_size is automatically inferred from --dataloader.batch_size
--distributed.pipeline.pp_schedule: Pipeline schedule (1f1b, interleaved_1f1b, etc.)

YAML Configuration Method

Add these sections to your existing YAML config:

distributed:
  strategy: fsdp2
  dp_size: 1
  tp_size: 1
  cp_size: 1
  pp_size: 4  # Enable 4-way pipeline parallelism
  sequence_parallel: false
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # pp_batch_size is automatically inferred from dataloader.batch_size
    round_virtual_stages_to_pp_multiple: up
    scale_grads_in_schedule: false
    layers_per_stage: null  # Auto-compute, or specify number

Mixed Parallelism Examples

Pipeline + Data Parallelism (4 GPUs Total)

$ automodel --nproc-per-node=4 your_config.yaml \
>     --distributed.pp_size 2 \
>     --distributed.dp_size 2 \
>     --dataloader.batch_size 16

Pipeline + Tensor Parallelism (4 GPUs Total)

$ automodel --nproc-per-node=4 your_config.yaml \
>     --distributed.pp_size 2 \
>     --distributed.tp_size 2 \
>     --dataloader.batch_size 8

Full Hybrid: PP + DP + TP (8 GPUs Total)

$ automodel --nproc-per-node=8 your_config.yaml \
>     --distributed.pp_size 2 \
>     --distributed.dp_size 2 \
>     --distributed.tp_size 2 \
>     --dataloader.batch_size 32
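In each of the commands above, the parallelism sizes multiply to the number of launched processes. A quick pre-flight check along these lines can catch a mismatch before any GPU work starts. `check_parallel_sizes` is a hypothetical helper, and treating unspecified sizes as 1 is an assumption made for this sketch.

```python
def check_parallel_sizes(world_size, pp_size=1, dp_size=1, tp_size=1, cp_size=1):
    """Raise if the parallelism sizes do not multiply to the process count.

    Hypothetical helper for illustration; not a NeMo AutoModel API.
    """
    product = pp_size * dp_size * tp_size * cp_size
    if product != world_size:
        raise ValueError(
            f"pp*dp*tp*cp = {product}, but {world_size} processes were launched"
        )
    return product


# The three layouts from the commands above.
print(check_parallel_sizes(4, pp_size=2, dp_size=2))
print(check_parallel_sizes(4, pp_size=2, tp_size=2))
print(check_parallel_sizes(8, pp_size=2, dp_size=2, tp_size=2))
```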

Integrate with Training Recipes

AutoPipeline seamlessly integrates with NeMo AutoModel’s recipe system. Here’s a complete example YAML configuration:

# config.yaml
distributed:
  strategy: fsdp2
  dp_size: 1
  tp_size: 1
  cp_size: 1
  pp_size: 2          # 2-way pipeline parallelism
  sequence_parallel: false
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # pp_batch_size is automatically inferred from dataloader.batch_size
    layers_per_stage: null  # Auto-compute layer distribution
    round_virtual_stages_to_pp_multiple: up
    scale_grads_in_schedule: false

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.SQuAD
  path_or_dataset: squad
  split: train

dataloader:
  batch_size: 8
  shuffle: true

Run training with:

$ # Run with 2 GPUs for 2-way pipeline parallelism
$ automodel --nproc-per-node=2 config.yaml

Troubleshooting

Common Issues

Model doesn’t fit in memory:

Increase number of pipeline stages
Reduce microbatch size
Enable gradient checkpointing

Pipeline bubbles reducing efficiency:

Increase batch size to have more microbatches
Try different schedules (e.g., interleaved_1f1b)
Adjust virtual stages configuration
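The first suggestion follows from the standard bubble estimate for non-interleaved schedules such as GPipe and 1F1B: with p stages and m equally sized microbatches, the idle fraction is (p - 1) / (m + p - 1). The sketch below evaluates that formula. It is a textbook approximation rather than a measurement of AutoPipeline, and `bubble_fraction` is a hypothetical helper name.

```python
def bubble_fraction(num_stages, batch_size, microbatch_size):
    """Idealized idle fraction for a non-interleaved pipeline schedule."""
    if batch_size % microbatch_size != 0:
        raise ValueError("batch_size must be divisible by microbatch_size")
    num_microbatches = batch_size // microbatch_size
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Four stages, microbatch size 1: a larger batch means more microbatches.
for batch_size in (4, 8, 32):
    print(batch_size, round(bubble_fraction(4, batch_size, 1), 3))
```

Interleaved schedules reduce the bubble further by giving each rank several smaller virtual stages, at the cost of more communication between ranks.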

Uneven stage distribution:

Use manual module assignment for fine control
Adjust layers_per_stage parameter
Check parameter counts with get_stage_param_counts()
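One way to turn per-stage parameter counts into a single number is to compare the largest stage with the mean. The sketch below works on a plain list of integers, such as the values printed from `get_stage_param_counts()`. The helper name is hypothetical and the counts are made-up toy values, not measurements of any real model.

```python
def stage_imbalance(param_counts):
    """Ratio of the largest stage to the mean stage size (1.0 means balanced)."""
    if not param_counts:
        raise ValueError("param_counts must not be empty")
    mean = sum(param_counts) / len(param_counts)
    return max(param_counts) / mean


toy_counts = [300, 100, 100, 300]  # illustrative values only
print(f"imbalance: {stage_imbalance(toy_counts):.2f}")
```

A ratio well above 1.0 suggests that one stage dominates memory use. The first and last stages are the usual suspects because they carry the embeddings and the language modeling head.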

Conclusion

AutoPipeline and the functional API together provide a complete pipeline parallelism solution for both Hugging Face and custom models. AutoPipeline offers a high-level, optimized interface specifically for Hugging Face models, while the functional module provides modular, accessible building blocks for custom architectures.

Key takeaways:

Pipeline parallelism enables training of models too large for a single GPU
AutoPipeline provides a simple API for Hugging Face models with powerful customization options
The functional API offers modular components for implementing pipeline parallelism with any PyTorch model
Both can be combined with other parallelization strategies for optimal performance
Use built-in monitoring tools to understand and optimize your pipeline