# Quick Checklist

> **Note:** A complete checklist to verify your PyTorch code is ready for CUDA Graph capture. Check each item before attempting to graph your workload.
## Asynchronous Execution

- No host-device synchronization (sync-free code, capture failures)
  - No explicit sync: `torch.cuda.synchronize()`, `stream.synchronize()`, `event.synchronize()` (details)
  - No blocking GPU→CPU transfers: `.item()`, `.cpu()`, `.numpy()`, `print(tensor)` (details) (see the sketch after this list)
  - No direct CUDA tensor creation from Python objects (details)
  - No data-dependent control flow: `if tensor:`, `loss.item()` (details)
  - No GPU tensor indexing with CPU tensors or Python lists (details)
  - No slicing with CUDA tensor bounds: `x[i:j]` where `i`, `j` are CUDA tensors (details)
- No default stream usage
- No event/stream query
  - No `stream.query()` or `event.query()` during capture (details)
  - No background thread queries (details)
  - No pinned memory allocation during capture (triggers a hidden event query) (details)
  - DataLoader with `pin_memory=True`: use `thread_local` mode or disable pin_memory (details)
  - NCCL watchdog handled (automatic in PyTorch 2.2+) (details)
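
A minimal sketch of the sync-free pattern referenced above, assuming a toy model and illustrative names (`static_input`, `running_loss`): metrics stay on the GPU inside the region you intend to capture, and the host reads them back only after replay.

```python
import torch

device = "cuda"
model = torch.nn.Linear(128, 10).to(device)
static_input = torch.randn(32, 128, device=device)

# Keep the running metric on the GPU so no .item()/.cpu() sync is needed
# inside the region you plan to capture.
running_loss = torch.zeros((), device=device)

def step():
    out = model(static_input)
    loss = out.square().mean()
    running_loss.add_(loss)   # device-side accumulation, capture-safe
    # BAD inside capture: print(loss.item()) would force a GPU->CPU sync

# ... warm up, capture step() into a CUDA graph, replay it N times ...

print(running_loss.item())    # single host read, outside the captured region
```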
## Static Graph

- Static graph topology (details)
  - No dynamic control flow (`if`/`else` based on tensor values); use `torch.where()` or capture multiple graphs (details)
  - Gradient clipping: use sync-free `clip_grad_norm_` (PyTorch 1.13+) (details)
  - Early exit and adaptive inference: capture separate graphs per path (details)
  - Capture-aware code (`is_current_stream_capturing()`) doesn’t change the computation (details)
- Static memory addresses (details)
  - Static input tensors allocated before capture, updated via `.copy_()` (details) (see the sketch after this list)
  - Global tensors used within the graph are persistent (details)
  - Grouped GEMM / pointer arrays: keep host pointer tensors alive (details)
  - AMP autocast cache disabled (`cache_enabled=False`) or capture autocast inside the graph (details)
- Static scalars (details)
  - CPU variable scalars converted to GPU tensors, updated via `.fill_()` (details) (see the sketch after this list)
  - Learning rate / global step: use a capturable optimizer (e.g., APEX FusedAdam) (details)
- Handling RNG state correctly
  - Custom generators registered with `graph.register_generator_state()` (details)
  - Use graph-safe APIs: `graphsafe_get_state()`, `graphsafe_set_state()` (details)
  - Activation checkpointing uses `preserve_rng_state=False` (details)
  - Partial graphing uses `use_reentrant=False` (details)
- `torch.compile` functions warmed up before capture (details)
- Static shapes (details)
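
The static-placeholder sketch referenced in the list above, assuming illustrative names (`static_x`, `static_scale`, `static_y`): inputs and scalars are allocated once before capture, refreshed in place with `.copy_()` / `.fill_()`, and never reassigned.

```python
import torch

device = "cuda"
model = torch.nn.Linear(128, 128).to(device)

# Static placeholders allocated once, before capture.
static_x = torch.randn(32, 128, device=device)
static_scale = torch.ones((), device=device)   # CPU scalar promoted to a GPU tensor

# Warmup on a side stream, as recommended before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        model(static_x) * static_scale
torch.cuda.current_stream().wait_stream(side)

# Capture: the graph records the fixed addresses of static_x and static_scale.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x) * static_scale

# Replay: refresh contents in place; never rebind the Python variables.
for step in range(10):
    static_x.copy_(torch.randn(32, 128, device=device))  # new data, same address
    static_scale.fill_(1.0 / (step + 1))                  # new scalar via .fill_()
    g.replay()
    result = static_y.clone()   # clone if the result must outlive the next replay
```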
## Self-Contained Stream Capture
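
A minimal sketch of a self-contained capture, assuming an illustrative helper stream `s_side`: work placed on another stream inside the captured region must fork from the capture stream and rejoin it before the capture ends, so every dependency lives inside the graph.

```python
import torch

device = "cuda"
a = torch.randn(1024, device=device)
b = torch.randn(1024, device=device)
s_side = torch.cuda.Stream()

# Lightweight warmup so kernels and the allocator are initialized before capture.
(a * 2 + b * 3).sum()
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    capture_stream = torch.cuda.current_stream()
    out1 = a * 2                           # work on the capture stream

    s_side.wait_stream(capture_stream)     # fork: side stream enters the capture
    with torch.cuda.stream(s_side):
        out2 = b * 3                       # recorded on the side stream

    capture_stream.wait_stream(s_side)     # rejoin before capture ends
    total = out1 + out2                    # depends on both branches
```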
## CPU Code Is Not Captured
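
A minimal illustration of this point, using an illustrative Python counter: CPU-side code inside the captured region runs once, at capture time, and is not re-executed by `graph.replay()`; only the enqueued GPU kernels are.

```python
import torch

device = "cuda"
x = torch.randn(4, device=device)
calls = {"count": 0}      # illustrative CPU-side state

def step():
    calls["count"] += 1   # pure CPU side effect
    return x * 2          # the GPU kernel is what actually gets captured

# Warmup on a side stream.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    step()
torch.cuda.current_stream().wait_stream(side)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = step()            # the counter increments here, once, at capture time

for _ in range(10):
    g.replay()            # GPU work repeats; the Python counter does not move

print(calls["count"])     # 2 (warmup + capture), not 12
```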
## Memory Requirements

- No pinned memory alloc/free in global mode (details)
- Persistent graph input tensors
  - No cross-iteration reuse of an output tensor without cloning (details)
- Memory pool sharing (if using shared pools):
  - Reuse static input tensors across graphs when possible (details)
  - Chain graph outputs as inputs to the next graph (details) (see the sketch after this list)
- Memory usage awareness (for OOM prevention):
  - Be aware: intermediate tensors can’t be reused across different pools (details)
  - Be aware: operations after capture can’t reuse graph pool memory (details)
  - Be aware: memory fragmentation across pools (details)
  - Be aware: deferred memory recycling with multiple streams during capture (details)
  - Be aware: gradient accumulator cross-stream growth (details)
  - Be aware: `cudaFree` is suppressed during capture (details)
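
A minimal sketch of pool sharing and output chaining between two graphs, assuming illustrative two-stage modules (`stage1`, `stage2`): the second graph is captured into the first graph's pool via `pool=g1.pool()` and consumes the first graph's output tensor directly.

```python
import torch

device = "cuda"
stage1 = torch.nn.Linear(256, 256).to(device)
stage2 = torch.nn.Linear(256, 10).to(device)
static_in = torch.randn(64, 256, device=device)

# Warm up both stages on a side stream before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        stage2(stage1(static_in))
torch.cuda.current_stream().wait_stream(side)

g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    static_mid = stage1(static_in)          # output stays at a fixed address

# Share g1's memory pool and chain its output as g2's input.
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
    static_out = stage2(static_mid)

for _ in range(5):
    static_in.copy_(torch.randn(64, 256, device=device))
    g1.replay()
    g2.replay()
    result = static_out.clone()             # clone before the next replay overwrites it
```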
## Other Considerations

- Warmup iterations before capture on the same side stream (details) (see the sketch after this list)
- Capture mode: use `global` mode unless specific multi-threading needs require `thread_local` or `relaxed` (details)
- Module hooks: only top-level module hooks fire with `make_graphed_callables` (details)
- Deferred gradient hooks: `make_graphed_callables` defers gradient accumulation and DDP hooks (details)
- NCCL communicator lifecycle: destroy CUDA graphs before NCCL communicators (details)
- Pinned memory race condition: synchronize before CPU writes to pinned memory (details)
- Stream count: avoid too many streams to prevent channel serialization (details)
- NCCL buffer registration: set `NCCL_GRAPH_REGISTER=0` if using expandable segments with older NCCL (details)
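
A minimal sketch of the warmup and capture-mode items above, assuming a toy model: warmup iterations run on a side stream, capture then uses that same stream, and `capture_error_mode` (available in recent PyTorch releases) stays at its default `"global"` unless other threads legitimately issue CUDA work during capture.

```python
import torch

device = "cuda"
model = torch.nn.Linear(64, 64).to(device)
static_x = torch.randn(8, 64, device=device)

# Warmup iterations on a side stream.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(side)

# Capture on the same side stream, with the default "global" error mode.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=side, capture_error_mode="global"):
    static_y = model(static_x)

static_x.copy_(torch.randn(8, 64, device=device))
g.replay()
```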
## What’s Next?

- Best Practices: a systematic approach to adopting CUDA Graphs
- Writing Sync-Free Code: eliminate CPU-GPU synchronizations
- Handling Dynamic Patterns: solutions for common obstacles
- Troubleshooting: debug capture failures and issues