Examples#
Note
This section provides practical, runnable examples of using CUDA Graphs in real-world scenarios.
Overview#
The following examples demonstrate CUDA Graph usage across different application domains. Each example is designed to be self-contained and adaptable to your specific use case.
Available Examples#
- RNN-T: Speech recognition with dynamic shapes, greedy decoding, and bucketing strategies
- Stable Diffusion v2: Diffusion model with UNet, full-iteration capture, and PyTorch Lightning
- GPT-3 175B: Per-layer graphing with FP8, pipeline parallelism, and complex interleaved microbatch execution
- Llama 2 70B LoRA: Fine-tuning with LoRA adapters, FP8 + FP4, and per-layer graphing
- Llama 3.1 405B: Full-iteration graphing at massive scale with FP4 and Megatron-LM for training and inference
Comparison Table#
The following table compares CUDA Graph implementations across different models and frameworks:
| Aspect | RNN-T | Stable Diffusion v2 | GPT-3 175B | Llama 2 70B LoRA | Llama 3.1 405B |
|---|---|---|---|---|---|
| MLPerf Version | v3.1 (2023/11) | v5.0 (2025/06) | v4.1 (2024/11) | v5.0 (2025/06) | v5.1 (2025/11) |
| Software Stack | Pure PyTorch + APEX + DALI | NeMo + PyTorch Lightning + APEX | NeMo + Megatron-LM + TE | Megatron-LM + TE | NeMo + Megatron-LM + PyTorch Lightning + TE |
| Model Size | ~49M parameters | ~865M parameters (UNet) | ~175B parameters | ~70B base + LoRA adapters | ~405B parameters |
| Capture Scope | Partial (encoder + prediction networks) | Full iteration (forward + backward + optimizer) | Partial (per-layer) | Partial (per-layer) | Full iteration (forward + backward) |
| Graph Count | 8 graphs with bucketing (4 per network × 2 networks) | 1 graph (training) | Per layer × per microbatch × 2 | Per layer × per microbatch × 2 | 2 graphs (training + validation) |
| Capture Mechanism | Custom | NeMo CUDAGraphCallback | TE | TE | Megatron-LM |
| Static Shapes | ✅ Bucketing (audio: 400-1600, text: 150-600) | ✅ Fixed (latents: 4×64×64, text: 77×1024) | ✅ Fixed (seq_len: 2048) | ✅ Fixed shapes | ✅ Fixed (seq_len: 8192) |
| Static Control Flow | ✅ Greedy decoding with masks | ✅ Conditional optimizer step | ✅ Conditional FP8 weight transpose | ❌ No dynamic control flow | ❌ No dynamic control flow |
| Distributed Training | ✅ DDP (via DistributedFusedLAMB) | ✅ DDP | ✅ 3D parallelism (DP + TP + PP) | ✅ DP + TP + CP | ✅ DP + TP + PP + CP |
| Mixed Precision | FP16 | FP16 | BF16 + FP8 | BF16 + FP8 + FP4 | BF16 + FP4 |
| Key Challenges | Variable-length sequences | PyTorch Lightning integration | FP8 compatibility | FP8 + FP4 compatibility | FP4 compatibility |
| Notable Techniques | Bucketing | Capturable optimizer (noop flag) | Per-layer per-microbatch graphing | Per-layer per-microbatch graphing | Full-iteration graphing with Megatron-LM |
| Performance Gain | Significant | Up to 1.58× (58% faster) cumulative | 2.2% (256 GPUs) → 3.0% (11,616 GPUs) | Variable (depends on LoRA config) | ~15-25% at large scale (preliminary) |
Legend:

- ✅ = Supported/Handled
- ❌ = Not present/Not supported
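The Graph Count and Static Shapes rows above show the bucketing pattern at its most visible: instead of one graph, RNN-T keeps a small set of graphs, one per input-length bucket, and pads each batch up to the nearest bucket. The sketch below illustrates that pattern under simplified assumptions; the bucket sizes, the stand-in model, and the `run_graphed` helper are illustrative and are not taken from the MLPerf submission.

```python
import torch

# Illustrative bucket sizes for a variable-length (batch, time, features) input;
# the real RNN-T submission uses its own audio/text buckets.
BUCKETS = [400, 800, 1200, 1600]
BATCH, FEATS = 8, 80

model = torch.nn.Linear(FEATS, 128).cuda()  # stand-in for the real network

static_in, static_out, graphs = {}, {}, {}

with torch.no_grad():
    # Warm up every bucket on a side stream before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for b in BUCKETS:
            static_in[b] = torch.zeros(BATCH, b, FEATS, device="cuda")
            for _ in range(3):
                model(static_in[b])
    torch.cuda.current_stream().wait_stream(side)

    # Capture one graph per bucket, each with its own static input/output buffers.
    for b in BUCKETS:
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            static_out[b] = model(static_in[b])
        graphs[b] = g

def run_graphed(x):
    """Pad x (BATCH, time, FEATS) to the nearest bucket and replay that graph."""
    b = next(bb for bb in BUCKETS if bb >= x.shape[1])
    static_in[b].zero_()
    static_in[b][:, : x.shape[1]].copy_(x)
    graphs[b].replay()
    return static_out[b][:, : x.shape[1]]
```

Padding wastes some compute at the top of each bucket, so the bucket boundaries trade graph count against padding overhead; the "8 graphs" in the table correspond to four buckets for each of the two captured networks.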
General Patterns#
All examples demonstrate key CUDA Graph concepts:

- Capture Approaches: From manual per-layer graphing (GPT-3, Llama 2 LoRA) to automatic full-iteration graphing (Stable Diffusion, Llama 3.1); a per-layer sketch follows this list
- Framework Integration: Pure PyTorch (RNN-T), PyTorch Lightning (Stable Diffusion), Transformer Engine (GPT-3, Llama 2 LoRA), and Megatron-LM (Llama 3.1)
- Distributed Training: DDP (RNN-T, Stable Diffusion), 3D parallelism with pipeline parallelism (GPT-3, Llama 3.1), and tensor + context parallelism for LoRA fine-tuning (Llama 2)
- Mixed Precision: FP16 (RNN-T, Stable Diffusion), BF16 + FP8 (GPT-3, Llama 2), and BF16 + FP4 (Llama 2, Llama 3.1)
- Dynamic Challenges: Bucketing for variable shapes (RNN-T), conditional optimizer steps (Stable Diffusion), and pipeline schedule ordering (GPT-3, Llama 3.1)
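As a concrete illustration of the per-layer approach, PyTorch offers `torch.cuda.make_graphed_callables`, which captures each callable's forward and backward as graphs while the surrounding loss, optimizer, and any dynamic logic stay in eager mode; the Transformer Engine per-layer graphing used in the GPT-3 and Llama 2 examples follows the same idea through its own API. The sketch below is a minimal, self-contained version with a toy two-layer model; the sizes and the SGD optimizer are illustrative assumptions.

```python
import torch

# Two toy sub-modules standing in for transformer layers; sizes are illustrative.
layer1 = torch.nn.Linear(512, 512).cuda()
layer2 = torch.nn.Linear(512, 512).cuda()
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(list(layer1.parameters()) + list(layer2.parameters()), lr=0.1)

# Sample inputs fix shapes, dtypes, and requires_grad for capture; their values
# do not matter. Inputs to later layers require grad, like real activations.
sample1 = torch.randn(32, 512, device="cuda")
sample2 = torch.randn(32, 512, device="cuda", requires_grad=True)
layer1, layer2 = torch.cuda.make_graphed_callables(
    (layer1, layer2), ((sample1,), (sample2,))
)

# The surrounding loop stays eager: only the layers' forward/backward replay as
# graphs, so the loss, optimizer, and any dynamic logic are unchanged.
for _ in range(5):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randn(32, 512, device="cuda")
    opt.zero_grad(set_to_none=True)
    loss = loss_fn(layer2(layer1(x)), y)
    loss.backward()
    opt.step()
```

A sketch of the opposite, full-iteration approach appears after the list in "What You'll Learn" below.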
What You'll Learn#
- Per-layer vs. full-iteration graphing: When to use each approach and their tradeoffs
- FP8 training with CUDA Graphs: Handling global buffers, weight caching, and dynamic scaling state
- Pipeline parallelism compatibility: Managing complex interleaved schedules and memory pools
- Framework-specific patterns: Integration with PyTorch Lightning, Transformer Engine, and Megatron-LM
- Handling dynamic patterns: Bucketing strategies and conditional execution within graphs (a sketch of the latter follows this list)
- Performance optimization: Measuring speedups from 2% to 58% across different scales and models
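Full-iteration graphing and conditional execution go together: once the forward pass, backward pass, and optimizer step are all captured, any data-dependent decision must be expressed as tensor math, driven by a flag tensor set outside the graph, rather than a Python `if`. The sketch below captures one whole training step and predicates the parameter update with a 0/1 flag tensor; the toy model, plain SGD optimizer, and skip schedule are illustrative assumptions, not the mechanism used by any particular submission (the Stable Diffusion example, for instance, relies on a capturable optimizer with a noop flag).

```python
import torch

# Toy model and static buffers; all sizes are illustrative.
model = torch.nn.Linear(1024, 1024).cuda()
loss_fn = torch.nn.MSELoss()
# Plain SGD (no momentum/weight decay), so zeroed gradients make the captured
# optimizer step an effective no-op for that replay.
opt = torch.optim.SGD(model.parameters(), lr=0.1)

static_x = torch.randn(64, 1024, device="cuda")
static_y = torch.randn(64, 1024, device="cuda")
# 0/1 flag tensor that predicates the update; a Python `if` cannot run
# during graph replay.
update_flag = torch.ones(1, device="cuda")

def train_step():
    opt.zero_grad(set_to_none=False)          # keep gradient buffers allocated
    loss = loss_fn(model(static_x), static_y)
    loss.backward()
    for p in model.parameters():
        p.grad.mul_(update_flag)              # predication instead of branching
    opt.step()
    return loss

# Warm up on a side stream, then capture one full iteration.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        train_step()
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_loss = train_step()

# Replay: refill the static buffers, set the flag, replay. Setting the flag to
# 0 outside the graph skips the parameter update on that replay.
for step in range(8):
    static_x.copy_(torch.randn(64, 1024))
    static_y.copy_(torch.randn(64, 1024))
    update_flag.fill_(1.0 if step % 2 == 0 else 0.0)
    graph.replay()
    print(f"step {step}: loss {static_loss.item():.4f}")
```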
What's Next?#
Select an example that matches your use case from the list above, or explore all examples:
| Example | Best For |
|---|---|
| RNN-T | Variable sequence lengths, dynamic control flow |
| Stable Diffusion v2 | PyTorch Lightning, maximum performance with full-iteration graphing |
| GPT-3 175B | Per-layer graphing with FP8, pipeline parallelism |
| Llama 2 70B LoRA | Fine-tuning, LoRA adapters |
| Llama 3.1 405B | Maximum scale, Megatron-LM |
After Reviewing Examples

- Best Practices: General guidance for all use cases
- Troubleshooting: Debug common issues and failures
- PyTorch Integration: Deep dive into PyTorch's CUDA Graph APIs