Pipeline Parallelism with AutoPipeline


Introduction

As large language models continue to grow in size, training and fine-tuning them efficiently across multiple GPUs has become increasingly challenging. While data parallelism works well for smaller models, models with billions of parameters require more sophisticated parallelization strategies to overcome memory constraints and communication overhead.

Pipeline parallelism addresses these challenges by splitting a model’s layers across different devices and processing them in a pipelined fashion. Each device processes a different stage of the model, enabling training of models that wouldn’t fit on a single device while maintaining high GPU utilization through overlapped computation.

AutoPipeline is NeMo AutoModel’s high-level pipeline parallelism interface specifically designed for Hugging Face models, making pipeline parallelism as simple as data parallelism. Built on PyTorch’s native torch.distributed.pipelining, AutoPipeline provides seamless pipeline parallelism support for any Hugging Face decoder-only causal language model with minimal code changes.

For custom models and more granular control, the functional API in nemo_automodel.components.distributed.pipelining.functional provides modular, accessible building blocks that can be used with any PyTorch model architecture.

This guide walks you through the complete process of using AutoPipeline for Hugging Face models and the functional API for custom models. You’ll learn how to configure pipeline stages, integrate with existing training workflows, optimize performance, and combine pipeline parallelism with other parallelization strategies.

Prerequisites:

# Install uv from https://docs.astral.sh/uv/getting-started/installation/
# Initialize the virtual environment using uv
$ uv venv

# Install the latest stable release from PyPI
$ uv pip install nemo-automodel

# Or install from source for the latest features
$ uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git

Before proceeding, make sure NeMo AutoModel is installed on your machine. For the complete procedure and additional options, consult the AutoModel Installation Guide.

Key Features

AutoPipeline provides the following capabilities:

  • Universal Hugging Face Support: Works with any Hugging Face decoder-only causal language model including Llama, Qwen, Mistral, Gemma, and more
  • PyTorch Native Integration: Built on PyTorch’s torch.distributed.pipelining for optimal performance
  • Flexible Configuration: Multiple scheduling strategies, configurable microbatch sizes, and automatic or manual layer splitting
  • Mixed Parallelism Support: Combine pipeline parallelism with data parallelism, tensor parallelism, and FSDP
  • Modular Functional API: For custom models, the functional module provides accessible, low-level building blocks
  • Minimal Opinions: Easy to extend and integrate with existing training workflows

Quick Start with AutoPipeline (Hugging Face Models)

Here’s a minimal example to get started with AutoPipeline using 2 pipeline stages with a Hugging Face model:

import torch
from torch.distributed.device_mesh import init_device_mesh
from nemo_automodel.components.distributed.pipelining import AutoPipeline
from transformers import AutoModelForCausalLM
from transformers.integrations.accelerate import init_empty_weights
from transformers.modeling_utils import no_init_weights
from transformers.utils import ContextManagers

def loss_fn(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Define loss function for pipeline training."""
    return torch.nn.functional.cross_entropy(
        logits.float().view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100
    )

if __name__ == "__main__":
    # 1) Initialize device mesh with 2 pipeline stages
    world_mesh = init_device_mesh("cuda", mesh_shape=(2,), mesh_dim_names=("pp",))

    # 2) Load model on meta device to avoid OOM with large models
    init_ctx = ContextManagers([no_init_weights(), init_empty_weights()])
    with init_ctx:
        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # 3) Configure and build pipeline
    ap = AutoPipeline(
        world_mesh=world_mesh,
        pp_axis_name="pp",
        pp_schedule="1f1b",
        pp_microbatch_size=1,
        pp_batch_size=8,  # Total batch size across pipeline
        device=torch.cuda.current_device(),
        dtype=torch.bfloat16,
    ).build(model, loss_fn=loss_fn)

    # 4) Access pipeline components
    print(ap.debug_summary())
    print(ap.pretty_print_stages())

Run the Quick Start Example

Save the above code as pipeline_example.py and run with:

# Run with 2 GPUs for 2 pipeline stages
$ uv run torchrun --nproc-per-node=2 pipeline_example.py

For a complete training example:

# Run fine-tuning with 2-way pipeline parallelism using Llama 3.1 8B
$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_1/llama3_1_8b_hellaswag_pp.yaml

Configuration Options

Basic Configuration

AutoPipeline provides comprehensive control over pipeline behavior:

ap = AutoPipeline(
    # Device mesh configuration
    world_mesh=world_mesh,                # DeviceMesh with pipeline axis
    pp_axis_name="pp",                    # Name of pipeline axis (default: "pp")

    # Schedule configuration
    pp_schedule="1f1b",                   # Pipeline schedule ("1f1b", "looped_bfs", etc.)
    pp_microbatch_size=1,                 # Microbatch size per stage
    # pp_batch_size is automatically inferred from dataloader.batch_size

    # Stage configuration
    layers_per_stage=None,                # Layers per stage (None for auto)
    module_fqns_per_model_part=None,      # Manual module assignment

    # Model patching (HF-specific)
    patch_inner_model=True,               # Make decoder forward stage-friendly
    patch_causal_lm_model=True,           # Make CausalLM wrapper return tensors (hidden/logits)
).build(model, loss_fn=loss_fn)

Model Patching (patch_inner_model, patch_causal_lm_model)

AutoPipeline splits a model by deep-copying it per stage and pruning away modules that don’t belong to that stage. Many Hugging Face models assume the full module tree is present and return ModelOutput objects; after pruning, their original forward() often breaks (or returns objects that are awkward to pipeline).

These two flags switch AutoPipeline to lightweight, pipeline-friendly forward() implementations that return tensors (see nemo_automodel.components.distributed.pipelining.hf_utils.patch_hf_model_for_pp):

  • patch_inner_model: patches the decoder module (model.model for ...ForCausalLM, otherwise the module itself) so each stage can run even after pruning.

    • Stage 0 (has embed_tokens): takes token IDs and produces hidden states.
    • Middle stages (no embed_tokens): take hidden states from the previous stage (using inputs_embeds, or a float tensor passed through input_ids) and produce hidden states.
    • Handles sliced layer containers (e.g., layers becoming dict-like after stage pruning) and returns a tensor of hidden states so stages can be chained.

    For compilation/performance, this patched forward prefers a precomputed causal_mask_mapping dict (it will fall back to computing masks and warn if you don’t provide it).

  • patch_causal_lm_model: patches the ...ForCausalLM wrapper forward (the module that owns lm_head) so pipeline stages return tensors:

    • Returns hidden states when lm_head is absent on that stage.
    • Returns logits when lm_head is present (typically only the last stage).
    • Supports logits_to_keep to compute logits for only the last k tokens.

    Note: this is only used when the module you pipeline is a ...ForCausalLM-style wrapper (i.e., it has a .model attribute). If you pass a base decoder module directly, patch_causal_lm_model typically has no effect.

When Should I Change These?

  • Leave both True (default) for standard Hugging Face AutoModelForCausalLM / ...ForCausalLM models. This is the common case and gives the expected behavior: token IDs -> hidden states -> logits across stages.
  • Set both False when your model already has a pipeline-friendly forward (returns tensors and can accept hidden states when embeddings are absent) or it needs custom kwargs/paths that the HF patch doesn’t preserve (common for NeMo AutoModel-native model implementations, packed-sequence/thd paths, extra args like padding_mask, etc.). Many benchmark configs for NeMo-native models do this (for example examples/benchmark/configs/qwen3_moe_30b_torch.yaml).
  • Set patch_inner_model=False, patch_causal_lm_model=True when your inner model is already stage-friendly, but the wrapper forward still returns a ModelOutput and you only want the wrapper simplified to “hidden states or logits”.

If you disable patch_causal_lm_model, your last stage will typically output hidden states instead of logits; in that case, make sure your loss_fn (or your last-stage module) applies the LM head explicitly.

Automatic vs. Manual Layer Distribution

AutoPipeline offers flexible control over how your model is split across pipeline stages:

Automatic Distribution

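Let AutoPipeline decide which layers go to which stage. Leave layers_per_stage unset (None) to have the distribution computed automatically, or pass a target number of layers per stage, as in this example: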

ap = AutoPipeline(
    world_mesh=world_mesh,
    pp_schedule="1f1b",
    layers_per_stage=8,  # Each stage gets ~8 transformer layers
).build(model, loss_fn=loss_fn)

Manual Distribution

Specify exactly which modules go to each stage:

from nemo_automodel.components.distributed.pipelining.functional import (
    generate_hf_model_fqn_per_model_part
)

# Generate balanced assignments
module_fqns = generate_hf_model_fqn_per_model_part(
    num_stages=4,
    num_layers=32,
    include_embeddings=True,
    include_lm_head=True,
    include_rotary_emb=True,
    fqn_prefix="model."
)

# Or define custom assignments
custom_module_fqns = [
    # Stage 0: Embeddings + first 8 layers
    ["model.embed_tokens", "model.rotary_emb"] +
    [f"model.layers.{i}" for i in range(8)],

    # Stage 1: Next 8 layers
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(8, 16)],

    # Stage 2: Next 8 layers
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(16, 24)],

    # Stage 3: Final 8 layers + output
    ["model.rotary_emb"] + [f"model.layers.{i}" for i in range(24, 32)] +
    ["model.norm", "lm_head"]
]

ap = AutoPipeline(
    world_mesh=world_mesh,
    module_fqns_per_model_part=custom_module_fqns,
).build(model, loss_fn=loss_fn)

Understand Model Splitting

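When AutoPipeline splits your model, each stage keeps only the modules it needs: the embeddings, a contiguous block of transformer layers, the final norm, and the LM head are each assigned to specific stages. Here’s how a typical model gets split: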

Example: 32-Layer Model Across 2 Stages

# Stage 0 (Rank 0): Input processing + first half
stage_0_modules = [
    "model.embed_tokens",   # Token embeddings
    "model.layers.0-15",    # First 16 transformer layers
    "model.rotary_emb"      # Position embeddings (shared)
]

# Stage 1 (Rank 1): Second half + output processing
stage_1_modules = [
    "model.layers.16-31",   # Last 16 transformer layers
    "model.norm",           # Final layer norm
    "lm_head",              # Language modeling head
    "model.rotary_emb"      # Position embeddings (shared)
]

Example: 32-Layer Model Across 4 Stages

# Stage 0 (Rank 0): Input processing
stage_0_modules = [
    "model.embed_tokens",   # Token embeddings
    "model.layers.0-7",     # First 8 transformer layers
    "model.rotary_emb"      # Position embeddings (shared)
]

# Stage 1 (Rank 1): Early layers
stage_1_modules = [
    "model.layers.8-15",    # Next 8 transformer layers
    "model.rotary_emb"
]

# Stage 2 (Rank 2): Middle layers
stage_2_modules = [
    "model.layers.16-23",   # Next 8 transformer layers
    "model.rotary_emb"
]

# Stage 3 (Rank 3): Output processing
stage_3_modules = [
    "model.layers.24-31",   # Final 8 transformer layers
    "model.norm",           # Final layer norm
    "lm_head",              # Language modeling head
    "model.rotary_emb"
]

Key observations:

  • Embeddings only exist on the first stage
  • Language modeling head only exists on the last stage
  • Rotary embeddings are shared across all stages (for position encoding)
  • Transformer layers are evenly distributed

Use the Functional API for Custom Models

While AutoPipeline is specifically designed as a high-level interface for Hugging Face models, the functional API in nemo_automodel.components.distributed.pipelining.functional provides more modular and accessible building blocks that can be used with any PyTorch model, including custom architectures. This separation allows for cleaner code organization where AutoPipeline handles Hugging Face-specific optimizations while the functional module remains model-agnostic.

Key Functional API Components

The functional API provides several utilities for building custom pipeline parallel systems:

Stage ID Calculation

from nemo_automodel.components.distributed.pipelining.functional import stage_ids_this_rank

# Calculate which stages run on this rank
# For a "loop" style schedule (default)
stage_ids = stage_ids_this_rank(pp_rank=0, pp_size=4, num_stages=8, style="loop")
# Returns: (0, 4) - rank 0 gets stages 0 and 4

# For a "v" style schedule (for zero-bubble schedules)
stage_ids = stage_ids_this_rank(pp_rank=0, pp_size=4, num_stages=8, style="v")
# Returns: (0, 7) - rank 0 gets stages 0 and 7
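To make the two placement styles concrete, here is a standalone sketch that reproduces the results quoted above. `sketch_stage_ids` is a hypothetical re-implementation written for illustration; the real stage_ids_this_rank may validate its inputs differently.

```python
def sketch_stage_ids(
    pp_rank: int, pp_size: int, num_stages: int, style: str = "loop"
) -> tuple[int, ...]:
    """Illustrative only: which stage indices a pipeline rank owns."""
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be a multiple of pp_size")
    stages_per_rank = num_stages // pp_size

    if style == "loop":
        # Round-robin: rank r owns stages r, r + pp_size, r + 2 * pp_size, ...
        return tuple(pp_rank + i * pp_size for i in range(stages_per_rank))

    if style == "v":
        # V shape: stages walk down the ranks and then back up,
        # so rank r owns stage r and its mirror image.
        if stages_per_rank != 2:
            raise ValueError("this sketch assumes exactly 2 stages per rank for style='v'")
        return (pp_rank, num_stages - 1 - pp_rank)

    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    for rank in range(4):
        print(rank, sketch_stage_ids(rank, 4, 8, "loop"), sketch_stage_ids(rank, 4, 8, "v"))
```

In both styles every stage index is owned by exactly one rank; only the pairing differs.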

Module Name Generation

from nemo_automodel.components.distributed.pipelining.functional import (
    generate_hf_model_fqn_per_model_part
)

# Generate balanced module assignments for any model
module_names = generate_hf_model_fqn_per_model_part(
    num_stages=4,
    num_layers=32,
    include_embeddings=True,
    include_lm_head=True,
    include_rotary_emb=False,  # Set based on your model
    fqn_prefix=""              # Use "model." for nested models
)

Virtual Stage Calculation

from nemo_automodel.components.distributed.pipelining.functional import calculate_virtual_stages

# Calculate virtual stages for interleaved schedules
num_virtual_stages, stages_per_rank = calculate_virtual_stages(
    num_layers=32,
    layers_per_stage=4,         # Each virtual stage has 4 layers
    pp_size=4,
    is_single_stage_schedule=False,
    round_to_pp_multiple="up"   # Round up to nearest multiple of pp_size
)
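The arithmetic behind the call above can be sketched in plain Python. `sketch_virtual_stages` is a hypothetical helper, and it assumes the stage count is simply the number of layers divided by layers_per_stage, rounded up, then padded to a multiple of pp_size; the real function may also account for embeddings, the LM head, or single-stage schedules.

```python
import math
from typing import Optional


def sketch_virtual_stages(
    num_layers: int,
    layers_per_stage: int,
    pp_size: int,
    round_to_pp_multiple: Optional[str] = None,
) -> tuple[int, int]:
    """Illustrative only: return (num_virtual_stages, stages_per_rank)."""
    num_virtual = math.ceil(num_layers / layers_per_stage)
    remainder = num_virtual % pp_size
    if remainder:
        if round_to_pp_multiple == "up":
            # Pad so that every rank owns the same number of virtual stages.
            num_virtual += pp_size - remainder
        else:
            raise ValueError(
                f"{num_virtual} virtual stages cannot be spread evenly over {pp_size} ranks"
            )
    return num_virtual, num_virtual // pp_size


if __name__ == "__main__":
    print(sketch_virtual_stages(32, 4, 4, "up"))  # same inputs as the call above
    print(sketch_virtual_stages(36, 4, 4, "up"))  # 9 stages padded to the next multiple of 4
```

Note that padding the stage count means each stage ends up with fewer layers than layers_per_stage asked for, which is why the parameter is best read as a target rather than a guarantee.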

Pipeline Schedule Build

from nemo_automodel.components.distributed.pipelining.functional import build_pipeline_schedule

# Build a schedule for your stages
schedule = build_pipeline_schedule(
    pipeline_parallel_schedule_csv=None,  # Optional CSV schedule
    pipeline_parallel_schedule="1f1b",
    microbatch_size=1,
    local_batch_size=8,
    stages=stages,                        # List of PipelineStage objects
    loss_fn=loss_fn,
    scale_grads=False
)
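The relationship between local_batch_size and microbatch_size is worth spelling out: PyTorch pipeline schedules operate on a fixed number of microbatches, so the local batch has to divide evenly. The sketch below assumes the microbatch count is local_batch_size / microbatch_size (8 microbatches for the call above); `split_into_microbatches` is a hypothetical helper that works on plain lists rather than tensors.

```python
def split_into_microbatches(batch: list, microbatch_size: int) -> list[list]:
    """Illustrative only: cut one local batch into equally sized microbatches."""
    if microbatch_size < 1:
        raise ValueError("microbatch_size must be at least 1")
    if len(batch) % microbatch_size != 0:
        raise ValueError(
            f"batch of {len(batch)} samples is not divisible by microbatch_size={microbatch_size}"
        )
    return [batch[i:i + microbatch_size] for i in range(0, len(batch), microbatch_size)]


if __name__ == "__main__":
    samples = list(range(8))  # stand-in for a local batch of 8 samples
    for size in (1, 2, 4):
        microbatches = split_into_microbatches(samples, size)
        print(f"microbatch_size={size}: {len(microbatches)} microbatches")
```

More microbatches keep the stages busier, at the cost of smaller per-step kernels; the batch and microbatch sizes are usually tuned together.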

Example: Pipeline Parallelism for Custom Models

Here’s how to use the functional API to implement pipeline parallelism for a custom model:

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.pipelining import PipelineStage
from nemo_automodel.components.distributed.pipelining.functional import (
    stage_ids_this_rank,
    build_pipeline_schedule,
    calculate_virtual_stages
)

class CustomTransformerBlock(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # batch_first=True so inputs are (batch, seq, hidden), matching nn.Embedding output
        self.attention = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.GELU(),
            nn.Linear(hidden_size * 4, hidden_size)
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Simplified transformer block (no causal mask)
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.mlp(x))
        return x

class CustomModel(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList([
            CustomTransformerBlock(hidden_size) for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for layer in self.layers:
            x = layer(x)
        return self.output_proj(x)

def split_custom_model_for_pipeline(model, pp_rank, pp_size, num_stages):
    """Split a custom model into pipeline stages."""

    # Determine which stages this rank handles
    stage_indices = stage_ids_this_rank(pp_rank, pp_size, num_stages, style="loop")

    stages = []
    for stage_idx in stage_indices:
        # Create a stage-specific version of the model.
        # create_stage_model is a placeholder that you must implement: it should
        # return a module holding only the submodules that belong to stage_idx.
        stage_model = create_stage_model(model, stage_idx, num_stages)

        # Create PipelineStage
        stage = PipelineStage(
            stage_model,
            stage_idx,
            num_stages,
            device=torch.cuda.current_device(),
            group=None  # Set your process group here
        )
        stages.append(stage)

    return stages

# Usage
def main():
    # Initialize device mesh
    world_mesh = init_device_mesh("cuda", mesh_shape=(4,), mesh_dim_names=("pp",))
    pp_rank = world_mesh["pp"].get_local_rank()
    pp_size = world_mesh["pp"].size()

    # Create model
    model = CustomModel(vocab_size=50000, hidden_size=768, num_layers=24)

    # Calculate virtual stages
    num_virtual_stages, stages_per_rank = calculate_virtual_stages(
        num_layers=24,
        layers_per_stage=3,  # 8 virtual stages total
        pp_size=4,
        is_single_stage_schedule=False
    )

    # Split model into stages
    stages = split_custom_model_for_pipeline(model, pp_rank, pp_size, num_virtual_stages)

    # Define loss function
    def loss_fn(logits, targets):
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1)
        )

    # Build pipeline schedule
    schedule = build_pipeline_schedule(
        pipeline_parallel_schedule_csv=None,
        pipeline_parallel_schedule="interleaved_1f1b",  # Good for multi-stage
        microbatch_size=1,
        local_batch_size=8,
        stages=stages,
        loss_fn=loss_fn,
        scale_grads=True
    )

    # Training loop (dataloader is a placeholder for your own DataLoader
    # yielding dicts with "input_ids" and "labels")
    for batch in dataloader:
        # Use schedule.step() for training
        losses = []
        schedule.step(batch["input_ids"], target=batch["labels"], losses=losses)

        # losses will contain the loss values from the last stage
        if losses:
            print(f"Loss: {sum(losses) / len(losses)}")

if __name__ == "__main__":
    main()

Advanced: Custom Model Splitting Logic

For more complex custom models, you can implement your own splitting logic:

import torch
from nemo_automodel.components.distributed.pipelining.functional import pipeline_model

def custom_parallelize_fn(
    model, world_mesh, moe_mesh, *,
    pp_enabled, dp_axis_names, **kwargs
):
    """Custom parallelization function for each pipeline stage."""
    # Apply your custom parallelization logic here
    # This is called for each pipeline stage
    if dp_axis_names:
        # Apply data parallelism
        pass
    # Add any other parallelization strategies

# Use pipeline_model for complete pipeline setup.
# your_custom_model, world_mesh, and loss_fn are placeholders defined elsewhere.
schedule, model_parts, has_first, has_last, stages = pipeline_model(
    model=your_custom_model,
    world_mesh=world_mesh,
    moe_mesh=None,
    pp_axis_name="pp",
    dp_axis_names=("dp",),
    layers_per_stage=4,
    pipeline_parallel_schedule="1f1b",
    pipeline_parallel_schedule_csv=None,
    microbatch_size=1,
    local_batch_size=8,
    device=torch.cuda.current_device(),
    loss_fn=loss_fn,
    parallelize_fn=custom_parallelize_fn,
    module_fqns_per_model_part=None,  # Provide custom module names
    patch_inner_model=False,          # Custom model: don't apply HF forward patches
    patch_causal_lm_model=False,      # Custom model: don't apply HF forward patches
)

Tips for Using Functional API with Custom Models

Because the functional API leaves more decisions to you than AutoPipeline does, keep the following in mind:

  1. Module Naming: Ensure your model has consistent module naming that can be mapped to stages
  2. State Management: Handle model state (embeddings, buffers) carefully across stages
  3. Communication: First and last stages need special handling for inputs/outputs
  4. Flexibility: The functional API gives you complete control over how models are split and parallelized
  5. Testing: Start with a small model and verify correct splitting before scaling up

The functional module’s modular design makes it easier to integrate pipeline parallelism into existing custom model training workflows without the Hugging Face-specific assumptions that AutoPipeline makes.

Mixed Parallelism

AutoPipeline can be combined with other parallelization strategies for optimal performance:

def parallelize_fn(
    model, world_mesh, moe_mesh, *,
    pp_enabled, dp_axis_names,
    cp_axis_name=None, tp_axis_name=None, ep_axis_name=None
):
    """Apply additional parallelization to each pipeline stage."""
    # Example: Apply FSDP to each stage
    if dp_axis_names:
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
        # Wrap model with FSDP (simplified example)
        # In practice, you'd configure FSDP parameters
        pass

    # Example: Apply tensor parallelism
    if tp_axis_name:
        # Apply tensor parallelism to attention/MLP layers
        pass

# Build pipeline with custom parallelization
ap = AutoPipeline(world_mesh=world_mesh).build(
    model,
    loss_fn=loss_fn,
    parallelize_fn=parallelize_fn
)

Monitor and Debug

AutoPipeline provides comprehensive tools for understanding your pipeline configuration:

Pipeline Information

1# Get pipeline info
2info = ap.info
3print(f"Pipeline enabled: {info.enabled}")
4print(f"Has first stage: {info.has_first_stage}")
5print(f"Has last stage: {info.has_last_stage}")
6
7# Access model parts
8model_parts = ap.parts # List of pipeline stages
9stage_modules = ap.list_stage_modules() # Module names per stage

Analysis

# Parameter distribution
stage_param_counts = ap.get_stage_param_counts()
total_params = ap.get_total_param_count()
trainable_params = ap.get_total_param_count(trainable_only=True)

for i, params in enumerate(stage_param_counts):
    percentage = (params / total_params) * 100
    print(f"Stage {i}: {params:,} parameters ({percentage:.1f}%)")

# Debug summary
print(ap.debug_summary())
print(ap.pretty_print_stages(max_modules_per_stage=10))

# Visualize schedule
ap.visualize_current_schedule("pipeline_schedule.png")

Gradient Management

# Scale gradients for mixed parallelism
ap.scale_grads_by_divisor(divisor=8)

# Clip gradients across pipeline stages
grad_norm = ap.clip_grad_norm(max_norm=1.0, norm_type=2.0)
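Clipping needs pipeline-aware handling because each rank holds only the gradients of its own stage, while the clipping threshold applies to the norm over all parameters. The sketch below shows the underlying math for the L2 case using plain lists of floats. It is not the internals of clip_grad_norm; `global_grad_norm` and `clip_by_global_norm` are hypothetical names, and in a real run the per-stage sums would be combined with a collective across pipeline ranks.

```python
import math


def global_grad_norm(per_stage_grads: list[list[float]]) -> float:
    """L2 norm over all stages: sqrt of the sum of per-stage squared norms."""
    return math.sqrt(sum(g * g for grads in per_stage_grads for g in grads))


def clip_by_global_norm(
    per_stage_grads: list[list[float]], max_norm: float, eps: float = 1e-6
) -> tuple[list[list[float]], float]:
    """Scale every stage by the same factor so the global norm is at most max_norm."""
    total = global_grad_norm(per_stage_grads)
    scale = min(1.0, max_norm / (total + eps))
    clipped = [[g * scale for g in grads] for grads in per_stage_grads]
    return clipped, total


if __name__ == "__main__":
    grads = [[3.0], [4.0]]  # toy gradients: one value on each of two stages
    clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
    print(f"norm before: {norm_before:.3f}, after: {global_grad_norm(clipped):.3f}")
```

Clipping each stage against its own local norm instead would change the relative scale of gradients between stages, which is why a single shared factor is used here.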

Add Pipeline Parallelism to Existing Configurations

You can easily add pipeline parallelism to any existing training configuration through command-line overrides or YAML modifications.

Command-Line Override Method

Add pipeline parallelism to an existing config using command-line arguments:

$ automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
    --distributed.strategy fsdp2 \
    --distributed.pp_size 2 \
    --distributed.pipeline.pp_schedule 1f1b \
    --distributed.pipeline.pp_microbatch_size 1 \
    --distributed.pipeline.round_virtual_stages_to_pp_multiple up \
    --distributed.pipeline.scale_grads_in_schedule false

Key parameters to override:

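  • --distributed.pp_size: Pipeline-parallel size. With pipeline parallelism alone it must equal nproc-per-node; when combined with other strategies, the product of pp_size, dp_size, and tp_size must equal the number of GPUs, as in the mixed parallelism examples below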
  • pp_batch_size is automatically inferred from --dataloader.batch_size
  • --distributed.pipeline.pp_schedule: Pipeline schedule (1f1b, interleaved_1f1b, etc.)

YAML Configuration Method

Add these sections to your existing YAML config:

distributed:
  strategy: fsdp2
  dp_size: 1
  tp_size: 1
  cp_size: 1
  pp_size: 4  # Enable 4-way pipeline parallelism
  sequence_parallel: false
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # pp_batch_size is automatically inferred from dataloader.batch_size
    round_virtual_stages_to_pp_multiple: up
    scale_grads_in_schedule: false
    layers_per_stage: null  # Auto-compute, or specify number

Mixed Parallelism Examples

Pipeline + Data Parallelism (4 GPUs Total)

$ automodel --nproc-per-node=4 your_config.yaml \
    --distributed.pp_size 2 \
    --distributed.dp_size 2 \
    --dataloader.batch_size 16

Pipeline + Tensor Parallelism (4 GPUs Total)

$ automodel --nproc-per-node=4 your_config.yaml \
    --distributed.pp_size 2 \
    --distributed.tp_size 2 \
    --dataloader.batch_size 8

Full Hybrid: PP + DP + TP (8 GPUs Total)

$ automodel --nproc-per-node=8 your_config.yaml \
    --distributed.pp_size 2 \
    --distributed.dp_size 2 \
    --distributed.tp_size 2 \
    --dataloader.batch_size 32
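The three commands above follow one rule: the parallel sizes multiply out to the number of GPUs. A quick pre-flight check along these lines can catch a mismatched layout before a job is launched. `layout_matches` is a hypothetical helper, and it assumes every size is set explicitly (a framework may also infer an unset size from the others).

```python
def layout_matches(
    num_gpus: int, pp_size: int = 1, dp_size: int = 1, tp_size: int = 1, cp_size: int = 1
) -> bool:
    """Illustrative only: True when the parallel sizes exactly tile the available GPUs."""
    sizes = (pp_size, dp_size, tp_size, cp_size)
    if any(size < 1 for size in sizes):
        raise ValueError("all parallel sizes must be at least 1")
    product = 1
    for size in sizes:
        product *= size
    return product == num_gpus


if __name__ == "__main__":
    print(layout_matches(4, pp_size=2, dp_size=2))             # PP + DP on 4 GPUs
    print(layout_matches(4, pp_size=2, tp_size=2))             # PP + TP on 4 GPUs
    print(layout_matches(8, pp_size=2, dp_size=2, tp_size=2))  # PP + DP + TP on 8 GPUs
    print(layout_matches(4, pp_size=4, dp_size=2))             # needs 8 GPUs, not 4
```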

Integrate with Training Recipes

AutoPipeline seamlessly integrates with NeMo AutoModel’s recipe system. Here’s a complete example YAML configuration:

# config.yaml
distributed:
  strategy: fsdp2
  dp_size: 1
  tp_size: 1
  cp_size: 1
  pp_size: 2  # 2-way pipeline parallelism
  sequence_parallel: false
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # pp_batch_size is automatically inferred from dataloader.batch_size
    layers_per_stage: null  # Auto-compute layer distribution
    round_virtual_stages_to_pp_multiple: up
    scale_grads_in_schedule: false

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.SQuAD
  path_or_dataset: squad
  split: train

dataloader:
  batch_size: 8
  shuffle: true

Run training with:

# Run with 2 GPUs for 2-way pipeline parallelism
$ automodel --nproc-per-node=2 config.yaml

Troubleshooting

Common Issues

Model doesn’t fit in memory:

  • Increase number of pipeline stages
  • Reduce microbatch size
  • Enable gradient checkpointing

Pipeline bubbles reducing efficiency:

  • Increase batch size to have more microbatches
  • Try different schedules (e.g., interleaved_1f1b)
  • Adjust virtual stages configuration
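A rough estimate shows why these remedies help. Under the idealized assumption that every stage takes the same time per microbatch and communication is free, a 1F1B-style schedule with p pipeline ranks and m microbatches idles for a fraction (p - 1) / (m + p - 1) of each step; with v virtual stages per rank, the interleaved estimate becomes (p - 1) / (v·m + p - 1). `bubble_fraction` below is a hypothetical helper for this back-of-the-envelope model, not a measurement of any real schedule.

```python
def bubble_fraction(pp_size: int, num_microbatches: int, stages_per_rank: int = 1) -> float:
    """Idealized share of a training step that each pipeline rank spends idle."""
    if pp_size < 1 or num_microbatches < 1 or stages_per_rank < 1:
        raise ValueError("all arguments must be positive integers")
    idle = pp_size - 1                         # warm-up and drain slots
    busy = num_microbatches * stages_per_rank  # useful work slots per rank
    return idle / (busy + idle)


if __name__ == "__main__":
    for microbatches in (4, 8, 32):
        plain = bubble_fraction(pp_size=4, num_microbatches=microbatches)
        interleaved = bubble_fraction(pp_size=4, num_microbatches=microbatches, stages_per_rank=2)
        print(f"m={microbatches:>2}: 1f1b ~{plain:.1%} idle, interleaved (v=2) ~{interleaved:.1%} idle")
```

Both levers appear in the formula: a larger batch raises m, and an interleaved schedule raises v.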

Uneven stage distribution:

  • Use manual module assignment for fine control
  • Adjust layers_per_stage parameter
  • Check parameter counts with get_stage_param_counts()
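To turn those parameter counts into a single number, one option is the ratio between the heaviest stage and the average stage. The sketch below assumes the counts are a plain list of integers, one per stage, like the values iterated over in the Analysis section; `stage_imbalance` is a hypothetical helper, and parameter count is only a proxy for per-stage compute and memory.

```python
def stage_imbalance(param_counts: list[int]) -> tuple[int, float]:
    """Return (index of heaviest stage, heaviest / mean). A ratio of 1.0 means perfectly even."""
    if not param_counts:
        raise ValueError("param_counts must not be empty")
    mean = sum(param_counts) / len(param_counts)
    if mean == 0:
        raise ValueError("stages have no parameters")
    heaviest = max(range(len(param_counts)), key=lambda i: param_counts[i])
    return heaviest, param_counts[heaviest] / mean


if __name__ == "__main__":
    # Toy counts: the last stage is heavier, for example because it also holds the LM head.
    counts = [2_000_000_000, 1_800_000_000, 1_800_000_000, 2_400_000_000]
    stage, ratio = stage_imbalance(counts)
    print(f"heaviest stage: {stage}, {ratio:.2f}x the average")
```

A ratio well above 1.0 suggests moving a layer away from the heaviest stage through manual module assignment.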

Conclusion

AutoPipeline and the functional API together provide a complete pipeline parallelism solution for both Hugging Face and custom models. AutoPipeline offers a high-level, optimized interface specifically for Hugging Face models, while the functional module provides modular, accessible building blocks for custom architectures.

Key takeaways:

  • Pipeline parallelism enables training of models too large for a single GPU
  • AutoPipeline provides a simple API for Hugging Face models with powerful customization options
  • The functional API offers modular components for implementing pipeline parallelism with any PyTorch model
  • Both can be combined with other parallelization strategies for optimal performance
  • Use built-in monitoring tools to understand and optimize your pipeline