Quick Checklist

Note

A complete checklist to verify your PyTorch code is ready for CUDA Graph capture. Check each item before attempting to graph your workload.

Asynchronous Execution

  • No host-device synchronization (sync-free code, capture failures)

    • No explicit sync: torch.cuda.synchronize(), stream.synchronize(), event.synchronize() (details)

    • No blocking GPU→CPU transfers: .item(), .cpu(), .numpy(), print(tensor) (details); see the sync-free sketch after this list

    • No direct CUDA tensor creation from Python objects (details)

    • No data-dependent control flow: if tensor:, loss.item() (details)

    • No GPU tensor indexing with CPU tensors or Python lists (details)

    • No slicing with CUDA tensor bounds: x[i:j] where i, j are CUDA tensors (details)

  • No default stream usage

    • Tensors with gradient tape execute on side stream before capture (details)

    • DDP/FSDP initialized on side stream (details)

    • Ensure extensions/libraries use PyTorch’s current stream, not default stream (details)

  • No event/stream query

    • No stream.query(), event.query() during capture (details)

    • No background thread queries (details)

    • No pinned memory allocation during capture (triggers hidden event query) (details)

    • DataLoader with pin_memory=True: use thread_local mode or disable pin_memory (details)

    • NCCL watchdog handled (auto in PyTorch 2.2+) (details)
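
One way to verify this group of items is to run a representative step with PyTorch's synchronization debug mode enabled and to keep scalars on the GPU until you are outside the region you intend to capture. A minimal sketch (the linear model and the loss are placeholders):

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model
x = torch.randn(32, 1024, device=device)
running_loss = torch.zeros((), device=device)    # GPU accumulator instead of a Python float

# Flag any operation that implicitly synchronizes with the host:
# "warn" prints a warning, "error" raises, which is handy while cleaning up a training step.
torch.cuda.set_sync_debug_mode("error")

loss = model(x).square().mean()
running_loss += loss.detach()      # stays on the GPU: no .item(), .cpu(), or print(tensor)

torch.cuda.set_sync_debug_mode("default")
print(running_loss.item())         # blocking transfer, done outside the sync-free region
```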

Static Graph

  • Static graph topology (details)

    • No dynamic control flow (if/else based on tensor values); use torch.where() or capture multiple graphs (details)

    • Gradient clipping: use sync-free clip_grad_norm_ (PyTorch 1.13+) (details)

    • Early exit and adaptive inference: capture separate graphs per path (details)

    • Capture-aware code (is_current_stream_capturing()) doesn’t change computation (details)

  • Static memory addresses (details)

    • Static input tensors allocated before capture, updated via .copy_() (details); see the capture-and-replay sketch after this list

    • Global tensors used within graph are persistent (details)

    • Grouped GEMM / pointer arrays: keep host pointer tensors alive (details)

    • AMP autocast cache disabled (cache_enabled=False) or capture autocast inside graph (details)

  • Static scalars (details)

    • CPU variable scalars converted to GPU tensors, updated via .fill_() (details)

    • Learning rate / global step: use capturable optimizer (e.g., APEX FusedAdam) (details)

    • RNG state handled correctly

      • Custom generators registered with graph.register_generator_state() (details)

      • Use graph-safe APIs: graphsafe_get_state(), graphsafe_set_state() (details)

      • Activation checkpointing uses preserve_rng_state=False (details)

      • Partial graphing uses use_reentrant=False (details)

      • torch.compile functions warmed up before capture (details)

  • Static shapes (details)

    • Tensor shapes fixed across replays; use padding or bucketing (details); see the padding sketch after this list

    • MoE with dynamic routing: graph only static parts (details)
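
Most of the items under “Static memory addresses” and “Static scalars” reduce to one pattern: allocate placeholder tensors once, capture the work that reads them, and refresh them in place before every replay. The sketch below follows the usual torch.cuda.graph warmup/capture/replay recipe; the model, shapes, and the use of torch.optim.Adam's capturable flag (in place of the APEX FusedAdam named above) are illustrative choices:

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(512, 512).to(device)
# The checklist names APEX FusedAdam; torch.optim.Adam's capturable flag is used here instead.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)

# Placeholders allocated once, before capture; their addresses must stay fixed.
static_input = torch.zeros(64, 512, device=device)
static_scale = torch.ones((), device=device)      # GPU scalar instead of a Python float

# Warmup on a side stream (see "Other Considerations").
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        (model(static_input) * static_scale).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward, backward, and the optimizer step.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)   # grads set to None so backward allocates them from the graph pool
with torch.cuda.graph(g):
    static_loss = (model(static_input) * static_scale).sum()
    static_loss.backward()
    opt.step()

# Replay loop: refresh the placeholders in place; never rebind the names.
for step in range(10):
    batch = torch.randn(64, 512, device=device)   # stand-in for real data
    static_input.copy_(batch)                     # same storage, new values
    static_scale.fill_(0.5)                       # update the GPU scalar via .fill_()
    g.replay()                                    # no zero_grad() needed: the captured backward
                                                  # rewrites the static .grad tensors in place
```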
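
For the control-flow and shape items, branches can usually be rewritten as tensor arithmetic with torch.where, and variable-length inputs padded to a fixed bucket so every replay sees the same shapes. A rough sketch (the bucket length and the scaling rule are illustrative):

```python
import torch
import torch.nn.functional as F

device = torch.device("cuda")

# Branch on a tensor value without leaving the GPU: torch.where instead of `if loss > 1.0:`.
loss = torch.randn((), device=device).abs()
threshold = torch.tensor(1.0, device=device)
scaled_loss = torch.where(loss > threshold, 0.5 * loss, loss)

# Keep shapes static across replays by padding to a fixed bucket length.
BUCKET_LEN = 128                                            # illustrative bucket size
tokens = torch.randint(0, 1000, (1, 97), device=device)     # this batch happens to be length 97
padded = F.pad(tokens, (0, BUCKET_LEN - tokens.shape[1]))   # always (1, 128) at replay time
```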

Self-Contained Stream Capture

  • Side streams fork from capture stream via side_stream.wait_stream(capture_stream) (details)

  • Side streams join back to capture stream via capture_stream.wait_stream(side_stream) (details)

  • No dependency on external work, or use external events (details)
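
If the captured region itself spans multiple streams, the fork and the join must both be expressed against the capture stream, as in this sketch (the matmul and the elementwise add stand in for whatever work you want to overlap):

```python
import torch

device = torch.device("cuda")
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# Warm up the kernels outside the capture.
_ = a @ b
_ = a + b
torch.cuda.synchronize()

side = torch.cuda.Stream()
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    capture_stream = torch.cuda.current_stream()

    # Fork: the side stream waits on the capture stream before doing any work.
    side.wait_stream(capture_stream)
    with torch.cuda.stream(side):
        out_side = a @ b              # recorded on the side stream

    out_main = a + b                  # recorded on the capture stream

    # Join: the capture stream waits on the side stream before the capture ends.
    capture_stream.wait_stream(side)
    out = out_main + out_side         # safe to combine after the join

g.replay()
```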

CPU Code Is Not Captured

  • No host state mutation inside graph that affects code outside (details)

  • CPU code requiring execution on every replay moved outside graph (details)

  • Use cudaLaunchHostFunc() for CPU code that must run inside the graph
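
The pitfall is easy to demonstrate: Python statements inside the captured region run exactly once, at capture time, so per-replay host bookkeeping has to live outside the graph. A small sketch:

```python
import torch

device = torch.device("cuda")
x = torch.randn(8, device=device)
host_counter = 0

# Minimal warmup so the kernel exists before capture.
_ = x * 2
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    host_counter += 1     # Python statement: runs once, at capture time only
    y = x * 2             # GPU kernel: recorded into the graph

for _ in range(5):
    g.replay()            # replays only the recorded GPU work

print(host_counter)       # 1, not 6: per-replay CPU work has to live outside the graph
```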

Memory Requirements

  • No pinned memory alloc/free in global mode (details)

  • Persistent graph input tensors

    • CUDA input tensors not freed before graph replay (details)

    • CPU input tensors (for H2D copy) kept alive for graph lifetime (details)

  • No cross-iteration reuse of output tensor without cloning (details); see the sketch after this list

  • Memory pool sharing (if using shared pools):

    • Intermediate tensors exposed as output handled carefully (details)

    • Replay order matches capture order (details)

    • No parallel replay of graphs sharing pools (details)

  • Memory usage awareness (for OOM prevention):

    • Reuse static input tensors across graphs when possible (details)

    • Chain graph outputs as inputs to next graph (details)

    • Be aware: intermediate tensors can’t be reused across different pools (details)

    • Be aware: operations after capture can’t reuse graph pool memory (details)

    • Be aware: memory fragmentation across pools (details)

    • Be aware: deferred memory recycling with multi-streams during capture (details)

    • Be aware: gradient accumulator cross-stream growth (details)

    • Be aware: cudaFree is suppressed during capture (details)
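
Two of the rules above are easy to violate in a replay loop: the graph's input placeholders must stay alive, and an output you want to keep across iterations must be cloned, because the next replay overwrites its storage. The sketch below also chains two graphs through a shared memory pool and replays them in capture order; the shapes and workloads are placeholders:

```python
import torch

device = torch.device("cuda")
static_x = torch.randn(256, 256, device=device)   # keep this reference alive for the graphs' lifetime

# Warm up the kernels outside the capture.
_ = static_x @ static_x
_ = static_x + 1
torch.cuda.synchronize()

g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    out1 = static_x @ static_x                    # output lives in g1's private pool

# g2 shares g1's pool and chains g1's output as its input.
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
    out2 = out1 + 1

results = []
for step in range(3):
    static_x.copy_(torch.randn(256, 256, device=device))
    g1.replay()
    g2.replay()                   # replay in the same order as capture, never in parallel
    results.append(out2.clone())  # clone before reuse: the next replay overwrites out2's storage
```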

Other Considerations

  • Warmup iterations before capture on the same side stream (details)

  • Capture mode: Use global mode unless specific multi-threading needs require thread_local or relaxed (details); see the capture-mode sketch after this list

  • Module hooks: Only top-level module hooks fire with make_graphed_callables (details); see the make_graphed_callables sketch after this list

  • Deferred gradient hooks: make_graphed_callables defers gradient accumulation and DDP hooks (details)

  • NCCL communicator lifecycle: Destroy CUDA graphs before NCCL communicators (details)

  • Pinned memory race condition: Synchronize before CPU writes to pinned memory (details)

  • Stream count: Avoid too many streams to prevent channel serialization (details)

  • NCCL buffer registration: Set NCCL_GRAPH_REGISTER=0 if using expandable segments with older NCCL (details)
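
For partial graphing, torch.cuda.make_graphed_callables performs the warmup and placeholder bookkeeping internally; keep in mind the hook caveats above (only top-level module hooks fire, and gradient-accumulation/DDP hooks are deferred). A minimal sketch with a placeholder model:

```python
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to(device)

# Sample args fix the shapes, dtypes, and requires_grad states the graphs assume;
# real inputs at each step must match them.
sample_input = torch.randn(32, 256, device=device)
graphed_model = torch.cuda.make_graphed_callables(model, (sample_input,))

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(3):
    opt.zero_grad(set_to_none=True)
    x = torch.randn(32, 256, device=device)   # same shape and requires_grad as the sample
    loss = graphed_model(x).sum()
    loss.backward()                            # replays the captured backward graph
    opt.step()                                 # left eager (not graphed)
```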
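
If a background thread must keep touching the CUDA driver while you capture (for example a DataLoader pin-memory thread), the capture error mode can be relaxed from the default global mode. A sketch, assuming a PyTorch version whose torch.cuda.graph exposes the capture_error_mode argument:

```python
import torch

device = torch.device("cuda")
x = torch.zeros(16, device=device)

# Warm up the kernel outside the capture.
_ = x + 1
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
# In "thread_local" mode only disallowed CUDA calls made from this thread invalidate
# the capture, so pin-memory workers or watchdog threads elsewhere do not break it.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x + 1
```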

What’s Next?