Pipeline Parallelism with AutoPipeline
Introduction
As large language models continue to grow in size, training and fine-tuning them efficiently across multiple GPUs has become increasingly challenging. While data parallelism works well for smaller models, models with billions of parameters require more sophisticated parallelization strategies to overcome memory constraints and communication overhead.
Pipeline parallelism addresses these challenges by splitting a model’s layers across different devices and processing them in a pipelined fashion. Each device processes a different stage of the model, enabling training of models that wouldn’t fit on a single device while maintaining high GPU utilization through overlapped computation.
AutoPipeline is NeMo AutoModel’s high-level pipeline parallelism interface specifically designed for Hugging Face models, making pipeline parallelism as simple as data parallelism. Built on PyTorch’s native torch.distributed.pipelining, AutoPipeline provides seamless pipeline parallelism support for any Hugging Face decoder-only causal language model with minimal code changes.
For custom models and more granular control, the functional API in nemo_automodel.components.distributed.pipelining.functional provides modular, accessible building blocks that can be used with any PyTorch model architecture.
This guide walks you through the complete process of using AutoPipeline for Hugging Face models and the functional API for custom models. You’ll learn how to configure pipeline stages, integrate with existing training workflows, optimize performance, and combine pipeline parallelism with other parallelization strategies.
Prerequisites:
Before proceeding, ensure that NeMo AutoModel is installed on your machine. For complete instructions and additional options, consult the AutoModel Installation Guide.
Key Features
AutoPipeline provides the following capabilities:
- Universal Hugging Face Support: Works with any Hugging Face decoder-only causal language model including Llama, Qwen, Mistral, Gemma, and more
- PyTorch Native Integration: Built on PyTorch’s torch.distributed.pipelining for optimal performance
- Flexible Configuration: Multiple scheduling strategies, configurable microbatch sizes, and automatic or manual layer splitting
- Mixed Parallelism Support: Combine pipeline parallelism with data parallelism, tensor parallelism, and FSDP
- Modular Functional API: For custom models, the functional module provides accessible, low-level building blocks
- Minimal Opinions: Easy to extend and integrate with existing training workflows
Quick Start with AutoPipeline (Hugging Face Models)
Here’s a minimal example to get started with AutoPipeline using 2 pipeline stages with a Hugging Face model:
Run the Quick Start Example
Save the above code as pipeline_example.py and run with:
For a complete training example:
Configuration Options
Basic Configuration
AutoPipeline provides comprehensive control over pipeline behavior:
Model Patching (patch_inner_model, patch_causal_lm_model)
AutoPipeline splits a model by deep-copying it per stage and pruning away modules that don’t belong to that stage. Many Hugging Face models assume the full module tree is present and return ModelOutput objects; after pruning, their original forward() often breaks (or returns objects that are awkward to pipeline).
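The deep-copy-then-prune idea can be illustrated with a toy module tree. This is only a sketch under stated assumptions: the nested dict stands in for a real torch.nn.Module hierarchy, and build_stage is a hypothetical helper, not part of AutoPipeline.

```python
import copy

# Toy stand-in for a module tree; real models are torch.nn.Module objects.
FULL_MODEL = {
    "embed_tokens": "embedding",
    "layers": {str(i): f"decoder_layer_{i}" for i in range(4)},
    "norm": "final_norm",
    "lm_head": "output_projection",
}


def build_stage(model: dict, keep: set[str]) -> dict:
    """Deep-copy the model, then drop every module not named in `keep`."""
    stage = copy.deepcopy(model)
    for name in ("embed_tokens", "norm", "lm_head"):
        if name not in keep:
            stage[name] = None  # pruned modules are absent on this stage
    # The layer container keeps its original indices, so it behaves like a
    # dict keyed by layer number rather than a dense list.
    stage["layers"] = {
        idx: layer
        for idx, layer in stage["layers"].items()
        if f"layers.{idx}" in keep
    }
    return stage


stage0 = build_stage(FULL_MODEL, {"embed_tokens", "layers.0", "layers.1"})
stage1 = build_stage(FULL_MODEL, {"layers.2", "layers.3", "norm", "lm_head"})
print(sorted(stage1["layers"]))
```

Because each stage is an independent copy, pruning one stage leaves the full model and the other stages untouched. A forward() that indexes layers positionally or expects embed_tokens to exist would fail on stage1, which is the problem the patched forwards address.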
These two flags switch AutoPipeline to lightweight, pipeline-friendly forward() implementations that return tensors (see nemo_automodel.components.distributed.pipelining.hf_utils.patch_hf_model_for_pp):
- patch_inner_model: patches the decoder module (model.model for ...ForCausalLM, otherwise the module itself) so each stage can run even after pruning.
  - Stage 0 (has embed_tokens): takes token IDs and produces hidden states.
  - Middle stages (no embed_tokens): take hidden states from the previous stage (using inputs_embeds, or a float tensor passed through input_ids) and produce hidden states.
  - Handles sliced layer containers (e.g., layers becoming dict-like after stage pruning) and returns a tensor of hidden states so stages can be chained.
  For compilation/performance, this patched forward prefers a precomputed causal_mask_mapping dict (it will fall back to computing masks and warn if you don’t provide it).
- patch_causal_lm_model: patches the ...ForCausalLM wrapper forward (the module that owns lm_head) so pipeline stages return tensors:
  - Returns hidden states when lm_head is absent on that stage.
  - Returns logits when lm_head is present (typically only the last stage).
  - Supports logits_to_keep to compute logits for only the last k tokens.
  Note: this is only used when the module you pipeline is a ...ForCausalLM-style wrapper (i.e., it has a .model attribute). If you pass a base decoder module directly, patch_causal_lm_model typically has no effect.
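The per-stage behavior described for the two patches can be sketched in plain Python. This is an illustration only: stage_forward is a hypothetical function, the "modules" are simple callables on lists of floats, and the real patched forwards also handle masks, position embeddings, and other arguments.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def stage_forward(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    embed_tokens: Optional[Callable[[Vector], Vector]] = None,
    lm_head: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Toy per-stage forward that always returns a plain sequence."""
    # A stage that owns embed_tokens receives token IDs; any other stage
    # receives hidden states from the previous stage and skips the lookup.
    hidden = embed_tokens(x) if embed_tokens is not None else x
    for layer in layers:
        hidden = layer(hidden)
    # Only a stage that still owns lm_head turns hidden states into logits.
    return lm_head(hidden) if lm_head is not None else hidden


# Hypothetical stand-ins for real modules.
embed = lambda ids: [float(i) * 0.5 for i in ids]
double = lambda h: [2.0 * v for v in h]
head = lambda h: [v + 1.0 for v in h]

hidden = stage_forward([1, 2], [double], embed_tokens=embed)  # first stage
logits = stage_forward(hidden, [double], lm_head=head)        # last stage
print(hidden, logits)
```

The point of the sketch is the contract, not the arithmetic: every stage accepts whatever the previous stage emitted and returns a bare tensor-like value, so stages can be chained without inspecting a ModelOutput.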
When Should I Change These?
- Leave both True (default) for standard Hugging Face AutoModelForCausalLM / ...ForCausalLM models. This is the common case and gives the expected behavior: token IDs -> hidden states -> logits across stages.
- Set both False when your model already has a pipeline-friendly forward (returns tensors and can accept hidden states when embeddings are absent) or it needs custom kwargs/paths that the HF patch doesn’t preserve (common for NeMo AutoModel-native model implementations, packed-sequence/thd paths, extra args like padding_mask, etc.). Many benchmark configs for NeMo-native models do this (for example examples/llm_benchmark/qwen/qwen3_moe_30b_torch.yaml).
- Set patch_inner_model=False, patch_causal_lm_model=True when your inner model is already stage-friendly, but the wrapper forward still returns a ModelOutput and you only want the wrapper simplified to “hidden states or logits”.
If you disable patch_causal_lm_model, your last stage will typically output hidden states instead of logits; in that case, make sure your loss_fn (or your last-stage module) applies the LM head explicitly.
Automatic vs. Manual Layer Distribution
AutoPipeline offers flexible control over how your model is split across pipeline stages:
Automatic Distribution
Let AutoPipeline automatically balance layers across stages:
Manual Distribution
Specify exactly which modules go to each stage:
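When writing a manual assignment by hand, it can help to sanity-check it before building the pipeline. The sketch below is not AutoPipeline's configuration format or API: the 4-layer split, the Llama-style module names, and the check_manual_split helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical manual split of a 4-layer decoder across 2 stages.
# Names follow the usual Hugging Face Llama-style layout.
MANUAL_SPLIT = [
    ["model.embed_tokens", "model.layers.0", "model.layers.1", "model.rotary_emb"],
    ["model.layers.2", "model.layers.3", "model.norm", "lm_head", "model.rotary_emb"],
]
SHARED = {"model.rotary_emb"}  # assumed to be replicated on every stage

EXPECTED = (
    {"model.embed_tokens", "model.norm", "lm_head", "model.rotary_emb"}
    | {f"model.layers.{i}" for i in range(4)}
)


def check_manual_split(stages, expected, shared=frozenset()):
    """Return a list of problems: missing, duplicated, or unknown module names."""
    counts = Counter(name for stage in stages for name in stage)
    problems = []
    for name in sorted(expected):
        if counts[name] == 0:
            problems.append(f"missing: {name}")
        elif counts[name] > 1 and name not in shared:
            problems.append(f"duplicated: {name}")
    for name in sorted(set(counts) - set(expected)):
        problems.append(f"unknown: {name}")
    return problems


print(check_manual_split(MANUAL_SPLIT, EXPECTED, SHARED))
```

An empty list means every expected module is assigned exactly once, apart from the modules declared as shared.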
Understand Model Splitting
When AutoPipeline splits your model, it intelligently distributes components across pipeline stages. Here’s how a typical model gets split:
Example: 32-Layer Model Across 2 Stages
Example: 32-Layer Model Across 4 Stages
Key observations:
- Embeddings only exist on the first stage
- Language modeling head only exists on the last stage
- Rotary embeddings are shared across all stages (for position encoding)
- Transformer layers are evenly distributed
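The observations above can be reproduced with a small layout generator. It is a sketch only: split_layout is a hypothetical helper, the module names assume a Hugging Face Llama-style model, and giving any remainder layers to the earliest stages is an assumption. AutoPipeline's own balancing may differ.

```python
def split_layout(num_layers: int, num_stages: int) -> list[list[str]]:
    """Evenly assign decoder layers to stages and attach the non-layer modules."""
    base, extra = divmod(num_layers, num_stages)
    layout, start = [], 0
    for stage in range(num_stages):
        # Assumption: earlier stages absorb the remainder when it is not zero.
        count = base + (1 if stage < extra else 0)
        names = []
        if stage == 0:
            names.append("model.embed_tokens")  # embeddings: first stage only
        names += [f"model.layers.{i}" for i in range(start, start + count)]
        if stage == num_stages - 1:
            names += ["model.norm", "lm_head"]  # LM head: last stage only
        names.append("model.rotary_emb")  # present on every stage
        layout.append(names)
        start += count
    return layout


for stage, names in enumerate(split_layout(32, 4)):
    layers = [n for n in names if ".layers." in n]
    others = [n for n in names if ".layers." not in n]
    print(f"stage {stage}: {layers[0]} .. {layers[-1]} + {others}")
```

Calling split_layout(32, 2) instead places layers 0 to 15 on the first stage and layers 16 to 31 on the second.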
Use the Functional API for Custom Models
While AutoPipeline is specifically designed as a high-level interface for Hugging Face models, the functional API in nemo_automodel.components.distributed.pipelining.functional provides more modular and accessible building blocks that can be used with any PyTorch model, including custom architectures. This separation allows for cleaner code organization where AutoPipeline handles Hugging Face-specific optimizations while the functional module remains model-agnostic.
Key Functional API Components
The functional API provides several utilities for building custom pipeline parallel systems:
Stage ID Calculation
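When a rank hosts more than one stage, the stage indices it owns depend on the placement style. The sketch below shows two common placements, round-robin ("loop") and mirrored ("v"). The function name stage_ids_for_rank and its signature are assumptions for illustration and should not be read as the functional module's actual API.

```python
def stage_ids_for_rank(
    pp_rank: int, pp_size: int, num_stages: int, style: str = "loop"
) -> tuple[int, ...]:
    """Return the pipeline stage indices hosted by one pipeline rank."""
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be divisible by pp_size")
    stages_per_rank = num_stages // pp_size
    if style == "loop":
        # Round-robin: rank r hosts stages r, r + pp_size, r + 2 * pp_size, ...
        return tuple(pp_rank + k * pp_size for k in range(stages_per_rank))
    if style == "v":
        # V-shape: rank r hosts stage r and its mirror image from the end.
        if stages_per_rank != 2:
            raise ValueError("the V layout needs exactly 2 stages per rank")
        return (pp_rank, num_stages - 1 - pp_rank)
    raise ValueError(f"unknown style: {style}")


for rank in range(4):
    print(rank, stage_ids_for_rank(rank, 4, 8, "loop"), stage_ids_for_rank(rank, 4, 8, "v"))
```

With 4 ranks and 8 stages, rank 1 hosts stages (1, 5) in the loop layout and (1, 6) in the V layout.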
Module Name Generation
Virtual Stage Calculation
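A rough way to reason about virtual stages is to derive the stage count from a layers-per-stage target and check that it divides evenly across pipeline ranks. This is a simplified sketch: virtual_stage_plan is a hypothetical helper, and it counts only decoder layers, ignoring any extra weight a real implementation might give to the embedding and output modules.

```python
import math


def virtual_stage_plan(
    num_layers: int, layers_per_stage: int, pp_size: int
) -> tuple[int, int]:
    """Return (num_virtual_stages, stages_per_rank) for a layers-per-stage target."""
    if layers_per_stage <= 0:
        raise ValueError("layers_per_stage must be positive")
    num_virtual_stages = math.ceil(num_layers / layers_per_stage)
    if num_virtual_stages % pp_size != 0:
        raise ValueError(
            f"{num_virtual_stages} stages cannot be spread evenly over {pp_size} ranks"
        )
    return num_virtual_stages, num_virtual_stages // pp_size


print(virtual_stage_plan(32, 8, 4))  # one stage per rank
print(virtual_stage_plan(32, 4, 4))  # two stages per rank
```

In torch.distributed.pipelining, a result with more than one stage per rank calls for a multi-stage schedule such as interleaved 1F1B, while one stage per rank matches single-stage schedules such as 1F1B.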
Pipeline Schedule Build
Example: Pipeline Parallelism for Custom Models
Here’s how to use the functional API to implement pipeline parallelism for a custom model:
Advanced: Custom Model Splitting Logic
For more complex custom models, you can implement your own splitting logic:
Tips for Using Functional API with Custom Models
The functional API is designed to be more accessible and modular than AutoPipeline:
- Module Naming: Ensure your model has consistent module naming that can be mapped to stages
- State Management: Handle model state (embeddings, buffers) carefully across stages
- Communication: First and last stages need special handling for inputs/outputs
- Flexibility: The functional API gives you complete control over how models are split and parallelized
- Testing: Start with a small model and verify correct splitting before scaling up
The functional module’s modular design makes it easier to integrate pipeline parallelism into existing custom model training workflows without the Hugging Face-specific assumptions that AutoPipeline makes.
Mixed Parallelism
AutoPipeline can be combined with other parallelization strategies for optimal performance:
Monitor and Debug
AutoPipeline provides comprehensive tools for understanding your pipeline configuration:
Pipeline Information
Analysis
Gradient Management
Add Pipeline Parallelism to Existing Configurations
You can easily add pipeline parallelism to any existing training configuration through command-line overrides or YAML modifications.
Command-Line Override Method
Add pipeline parallelism to an existing config using command-line arguments:
Key parameters to override:
- --distributed.pp_size: Number of pipeline stages (must match nproc-per-node)
- pp_batch_size is automatically inferred from --dataloader.batch_size
- --distributed.pipeline.pp_schedule: Pipeline schedule (1f1b, interleaved_1f1b, etc.)
YAML Configuration Method
Add these sections to your existing YAML config:
Mixed Parallelism Examples
Pipeline + Data Parallelism (4 GPUs Total)
Pipeline + Tensor Parallelism (4 GPUs Total)
Full Hybrid: PP + DP + TP (8 GPUs Total)
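For the three combinations above, the GPU count must factor into the parallelism sizes. The sketch below derives the data-parallel size left over after pipeline and tensor parallelism. The parallel_layout helper and the specific sizes (pp_size=2, tp_size=2) are assumptions chosen to match the GPU totals in the headings, not values taken from the recipes.

```python
def parallel_layout(world_size: int, pp_size: int, tp_size: int = 1) -> dict[str, int]:
    """Derive the data-parallel size left over after pipeline and tensor parallelism."""
    model_parallel = pp_size * tp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by pp_size*tp_size={model_parallel}"
        )
    return {"pp": pp_size, "dp": world_size // model_parallel, "tp": tp_size}


print(parallel_layout(4, pp_size=2))             # PP + DP on 4 GPUs
print(parallel_layout(4, pp_size=2, tp_size=2))  # PP + TP on 4 GPUs
print(parallel_layout(8, pp_size=2, tp_size=2))  # PP + DP + TP on 8 GPUs
```

If the product of the sizes does not divide the number of launched processes, the layout is invalid, so it is worth checking this before starting a job.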
Integrate with Training Recipes
AutoPipeline seamlessly integrates with NeMo AutoModel’s recipe system. Here’s a complete example YAML configuration:
Run training with:
Troubleshooting
Common Issues
Model doesn’t fit in memory:
- Increase number of pipeline stages
- Reduce microbatch size
- Enable gradient checkpointing
Pipeline bubbles reducing efficiency:
- Increase batch size to have more microbatches
- Try different schedules (e.g., interleaved_1f1b)
- Adjust virtual stages configuration
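To see why these suggestions help, the standard back-of-the-envelope estimate for 1F1B-style schedules puts idle (bubble) time at (p - 1) / (v * m) of the ideal compute time, where p is the number of pipeline ranks, m the number of microbatches, and v the number of virtual stages per rank. The sketch below assumes every microbatch takes the same time on every stage and ignores communication cost. Both helper names are hypothetical.

```python
def count_microbatches(batch_size: int, microbatch_size: int) -> int:
    """Number of microbatches a per-pipeline batch is split into."""
    if batch_size % microbatch_size != 0:
        raise ValueError("batch_size must be divisible by microbatch_size")
    return batch_size // microbatch_size


def bubble_ratio(pp_size: int, num_microbatches: int, virtual_stages_per_rank: int = 1) -> float:
    """Idle (bubble) time divided by ideal compute time for a 1F1B-style schedule."""
    return (pp_size - 1) / (virtual_stages_per_rank * num_microbatches)


small = bubble_ratio(4, count_microbatches(8, 1))         # 4 ranks, 8 microbatches
large = bubble_ratio(4, count_microbatches(32, 1))        # 4 ranks, 32 microbatches
interleaved = bubble_ratio(4, count_microbatches(8, 1), virtual_stages_per_rank=2)
print(small, large, interleaved)
```

Under these assumptions, quadrupling the number of microbatches cuts the bubble ratio by four, and using two virtual stages per rank halves it at the same batch size, at the cost of more inter-stage communication.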
Uneven stage distribution:
- Use manual module assignment for fine control
- Adjust layers_per_stage parameter
- Check parameter counts with get_stage_param_counts()
Conclusion
AutoPipeline and the functional API together provide a complete pipeline parallelism solution for both Hugging Face and custom models. AutoPipeline offers a high-level, optimized interface specifically for Hugging Face models, while the functional module provides modular, accessible building blocks for custom architectures.
Key takeaways:
- Pipeline parallelism enables training of models too large for a single GPU
- AutoPipeline provides a simple API for Hugging Face models with powerful customization options
- The functional API offers modular components for implementing pipeline parallelism with any PyTorch model
- Both can be combined with other parallelization strategies for optimal performance
- Use built-in monitoring tools to understand and optimize your pipeline