CUDA Graph#

Note

This section provides a quick overview of CUDA Graph—what it is, how it works, and when to use it—without diving into too much low-level CUDA detail.

What is CUDA Graph?#

As we’ve seen in the previous section, kernel launch overhead can significantly impact GPU performance, especially when the CPU cannot keep up with feeding work to the GPU. CUDA Graphs are designed specifically to address this problem.

CUDA Graph is a powerful feature that lets you define a graph of GPU operations once and launch it many times with minimal CPU overhead. Instead of launching each kernel individually—paying the full overhead cost every time—you bundle multiple operations and their dependencies into a graph, then submit the entire graph to the GPU as a single unit. Launching whole workflows this way dramatically reduces cumulative kernel launch overhead and improves GPU utilization, especially for workloads with many small operations.

The core idea is straightforward. Instead of the traditional approach:

CPU: Launch kernel 1 → GPU: Execute kernel 1
CPU: Launch kernel 2 → GPU: Execute kernel 2
CPU: Launch kernel 3 → GPU: Execute kernel 3
...

With CUDA Graphs, you:

Definition: "Here's my graph of kernels and their dependencies"
Instantiation: "Pre-build the launch structure"
Execution: Launch entire graph → GPU: Execute all kernels

CUDA Graphs work in three distinct phases:

  1. Definition: Describe what operations you want to perform and their dependencies—think of it as writing a recipe: “First do A, then do B and C in parallel, then do D.”

  2. Instantiation: Take that recipe and prepare everything for execution: validating the graph structure, pre-allocating resources, building optimized launch descriptors, and doing as much work upfront as possible. This is like prepping all your ingredients and tools before cooking—it takes time, but you only do it once!

  3. Execution: Launch the prepared graph, which is now extremely fast because all the setup work was done during instantiation. It’s like cooking with everything already prepped—much faster!

CUDA Graph Structure#

At its core, a CUDA Graph is a directed acyclic graph (DAG) that represents the execution flow of GPU operations. Understanding its structure helps you leverage CUDA Graphs effectively.

Nodes: The Work Units#

Nodes represent the actual operations that need to be executed on the GPU. Each node encapsulates a specific operation:

  • Kernel launches: The computational work (e.g., matrix multiplication, convolution)

  • Memory operations: Copying data between CPU and GPU, or within GPU memory (device-to-device transfers)

  • Memory management: Allocating or freeing memory on the GPU

  • Host functions: CPU-side operations that need to be synchronized with GPU work

  • Empty nodes: Synchronization points with no actual work, useful for coordinating dependencies

Each node is immutable once the graph is instantiated—you cannot change what operation a node performs, though you can update certain parameters (like memory addresses) in limited ways.

Edges: The Dependencies#

Edges define the execution order by specifying dependencies between nodes. An edge from node A to node B means “B cannot start until A completes.” These dependencies ensure correctness while allowing maximum parallelism:

  • Sequential dependencies: Operations that must happen in order (e.g., forward pass → backward pass)

  • Parallel opportunities: Independent operations can run simultaneously (e.g., parallel branches in a network)

  • Synchronization points: Where parallel operations must complete before proceeding (e.g., all-reduce a bucket of gradients in distributed training)

The CUDA runtime analyzes the edges to determine which operations can run concurrently and schedules them efficiently across the GPU’s streaming multiprocessors.

Visual Example#

Here’s a simple graph showing how nodes and edges create an execution flow:

graph TD
    A[Kernel A: Preprocess] --> B[Kernel B: Process Left]
    A --> C[Kernel C: Process Right]
    B --> D[Kernel D: Combine Results]
    C --> D

This graph structure means:

  1. Kernel A (Preprocess) runs first

  2. Once A completes, Kernels B and C become eligible to run (no dependency between them)

  3. Both B and C must complete before Kernel D (Combine Results) can start

  4. The GPU runtime may execute B and C in parallel when resources allow

This structure is powerful because the graph explicitly captures the “what must happen before what” relationships, allowing the GPU driver to optimize scheduling without any additional CPU overhead during graph launch.
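To make this concrete, the diamond above could be built with the explicit graph API (introduced in the next section). This is a minimal sketch that uses empty nodes as stand-ins for the four kernels; real code would add kernel nodes instead, but the dependency calls are identical:

cudaGraph_t graph;
cudaGraphCreate(&graph, 0);

// Empty nodes stand in for Kernels A-D to keep the sketch short
cudaGraphNode_t a, b, c, d;
cudaGraphAddEmptyNode(&a, graph, NULL, 0);
cudaGraphAddEmptyNode(&b, graph, NULL, 0);
cudaGraphAddEmptyNode(&c, graph, NULL, 0);
cudaGraphAddEmptyNode(&d, graph, NULL, 0);

// Edges of the diamond: A -> B, A -> C, B -> D, C -> D
cudaGraphNode_t from[] = {a, a, b, c};
cudaGraphNode_t to[]   = {b, c, d, d};
cudaGraphAddDependencies(graph, from, to, 4);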

Creating CUDA Graphs#

There are two primary ways to create a CUDA Graph: building it explicitly by manually adding nodes and edges, or capturing it from existing CUDA work. Each approach has its use cases and trade-offs.

Method 1: Explicit Graph Construction#

With explicit construction, you manually build the graph by creating nodes and connecting them with edges. This gives you fine-grained control over the graph structure and is useful when you need to programmatically construct complex execution patterns.

How it works:

  1. Create an empty graph object

  2. Add nodes one by one (kernel launches, memory copies, etc.)

  3. Define edges to specify dependencies between nodes

  4. Instantiate the graph to prepare it for execution (one-time cost)

  5. Execute (launch) the graph repeatedly with minimal overhead

Example concept (in CUDA C++):

cudaGraph_t graph;
cudaGraphCreate(&graph, 0);

// Define kernel parameters
cudaKernelNodeParams params_a = {};
params_a.func = (void*)my_kernel;
params_a.gridDim = dim3(grid_size);
params_a.blockDim = dim3(block_size);
params_a.kernelParams = kernel_args;  // Array of pointers to arguments
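// (params_b for the second kernel is set up the same way)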

// Add kernel nodes (no dependencies = NULL, 0)
cudaGraphNode_t kernel_a, kernel_b;
cudaGraphAddKernelNode(&kernel_a, graph, NULL, 0, &params_a);
cudaGraphAddKernelNode(&kernel_b, graph, NULL, 0, &params_b);

// Add memcpy node
cudaGraphNode_t memcpy_node;
cudaGraphAddMemcpyNode(&memcpy_node, graph, NULL, 0, &memcpy_params);

// Add edges (dependencies): B and memcpy depend on A
cudaGraphNode_t deps[] = {kernel_a};
cudaGraphAddDependencies(graph, deps, &kernel_b, 1);
cudaGraphAddDependencies(graph, deps, &memcpy_node, 1);

// Instantiate (one-time setup cost)
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);

// Execute many times (fast!)
for (int i = 0; i < 1000; i++) {
    cudaGraphLaunch(graph_exec, stream);
}

Advantages:

  • Full control over graph structure

  • Can create complex dependency patterns programmatically

  • Useful for dynamic graph generation based on runtime conditions

Disadvantages:

  • More verbose and complex code

  • Requires low-level CUDA knowledge

  • Manual management of nodes and edges

  • Needs extensive refactoring to apply to existing code

Method 2: Stream Capture#

Stream capture is the more practical approach, especially for existing CUDA code. You execute your operations normally, and CUDA records them into a graph automatically.

How it works:

  1. Begin capture on a CUDA stream

  2. Execute your operations normally (they get recorded instead of executed)

  3. End capture to get the graph

  4. Instantiate the graph

  5. Launch the graph multiple times

CUDA C++ concept:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Begin capture
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Execute operations (they get recorded)
kernel_a<<<grid, block, 0, stream>>>(...);
kernel_b<<<grid, block, 0, stream>>>(...);
cudaMemcpyAsync(..., stream);

// End capture and get the graph
cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);

// Instantiate (one-time setup)
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);

// Execute many times (fast!)
for (int i = 0; i < 1000; i++) {
    cudaGraphLaunch(graph_exec, stream);
}

Advantages:

  • Much simpler to use—minimal code changes

  • Works with existing code—just wrap it in capture

  • Automatically handles dependencies based on stream ordering

  • Captures complex operations (cuDNN, cuBLAS calls) automatically

Disadvantages:

  • Less control over graph structure—dependencies are inferred from stream ordering

  • Sequential operations on a single stream become serialized in the graph, even if they could run in parallel

  • Must use multiple streams (forked from the capture stream) to express parallelism

Note

How Dependencies Are Captured

With explicit graph construction, you define dependencies directly by adding edges between nodes based on data dependencies. With stream capture, dependencies are inferred from CUDA stream and event ordering—operations on the same stream are serialized, and cudaStreamWaitEvent creates cross-stream dependencies. This means if you run operations with potential parallelism sequentially on a single stream, the captured graph will reflect that serial ordering rather than the underlying data dependencies. To capture parallelism, you must use multiple streams during capture.
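As a sketch of that pattern, here is how the diamond from the earlier visual example could be captured with a forked stream. As in the snippets above, kernel_a through kernel_d, grid, and block are placeholders:

cudaStream_t main_stream, side_stream;
cudaStreamCreate(&main_stream);
cudaStreamCreate(&side_stream);

cudaEvent_t fork_event, join_event;
cudaEventCreate(&fork_event);
cudaEventCreate(&join_event);

cudaStreamBeginCapture(main_stream, cudaStreamCaptureModeGlobal);

// Kernel A runs first on the capture stream
kernel_a<<<grid, block, 0, main_stream>>>(...);

// Fork: side_stream joins the capture by waiting on an event recorded in main_stream
cudaEventRecord(fork_event, main_stream);
cudaStreamWaitEvent(side_stream, fork_event, 0);

// B and C are captured as parallel branches
kernel_b<<<grid, block, 0, main_stream>>>(...);
kernel_c<<<grid, block, 0, side_stream>>>(...);

// Join: main_stream waits for the side branch before D
cudaEventRecord(join_event, side_stream);
cudaStreamWaitEvent(main_stream, join_event, 0);
kernel_d<<<grid, block, 0, main_stream>>>(...);

cudaGraph_t graph;
cudaStreamEndCapture(main_stream, &graph);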

The Magic: Reusability#

Now we come to the key advantage that makes CUDA Graphs so powerful: once you create and instantiate a graph, you can execute it many times with minimal overhead. This reusability is what allows CUDA Graphs to dramatically reduce the launch overhead we discussed in the introduction.

Recall that launching kernels individually involves significant CPU overhead—language transitions, runtime processing, driver operations, and hardware submission—typically costing anywhere from 20 to 200 μs per operation in deep learning applications. CUDA Graphs eliminate most of this overhead by batching all the setup work into the graph instantiation phase, which happens only once.

Here’s the performance breakdown:

Initial setup (one-time cost):

  • Graph capture/construction: ~100 ms (depending on actual workload) - Record or build the graph structure

  • Graph instantiation: varies with graph size - Validate, optimize, and prepare for execution

  • First execution: ~10 μs (depending on graph size) - Launch the entire graph with a single command

Every subsequent execution:

  • Graph launch: ~10 μs (depending on graph size) - Just launch, all setup already done

The magic is in what gets eliminated during graph launch. Instead of paying the launch overhead for each operation, you pay a single small cost (~10 μs) to launch the entire graph. All the expensive work—parameter validation, kernel selection, memory preparation, buffer setup—was done once during instantiation and is simply reused on every launch.
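To see this on your own workload, a rough way to measure the per-launch CPU cost is to time a loop of graph launches with a host timer. This sketch reuses graph_exec and stream from the examples above; the exact numbers will vary with your system and graph size:

#include <chrono>
#include <cstdio>

// Warm up once so the first (slower) launch is excluded
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);

auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000; i++) {
    cudaGraphLaunch(graph_exec, stream);  // CPU cost per iteration: one submission
}
auto t1 = std::chrono::high_resolution_clock::now();
cudaStreamSynchronize(stream);

double avg_us = std::chrono::duration<double, std::micro>(t1 - t0).count() / 1000.0;
printf("Average CPU cost per graph launch: %.1f us\n", avg_us);
// Compare against the same loop launching every kernel individually.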

This reusability makes CUDA Graphs especially valuable for:

  • Training loops: The same forward and backward operations repeat across thousands of iterations

  • Inference serving: The same model graph processes millions of requests

  • Simulation timesteps: Identical operations execute for thousands of steps

  • Iterative algorithms: Same computation pattern repeats until convergence

Which Method to Use?#

The choice between explicit construction and stream capture depends on your use case and framework:

For PyTorch users: PyTorch uses stream capture (Method 2) under the hood. When you use torch.cuda.graph(), PyTorch automatically handles the capture process—beginning stream capture, executing your model to record operations, ending capture, and instantiating the graph. This makes CUDA Graphs accessible with minimal code changes. You simply wrap the PyTorch code you want to graph in the capture context, and PyTorch takes care of all the low-level details. This is the recommended and most widely-used approach for PyTorch-based deep learning.

For TensorFlow/JAX users: TensorFlow and JAX take a different approach using command buffers, which are built on top of explicit graph construction (Method 1). Instead of capturing a stream, these frameworks construct graphs explicitly by tracing your computation and building the node-edge structure programmatically. This gives them more control over graph optimization and allows for advanced features. However, as a user, you typically interact with high-level APIs (tf.function with XLA, JAX’s JIT compilation) that hide these details.

For CUDA C++ developers: Use stream capture (Method 2) for most cases—it’s simpler and works well for repetitive workloads. Only use explicit construction (Method 1) if you need to do advanced work like:

  • Programmatically generate complex graph patterns that can’t be expressed as linear stream execution

  • Reuse and compose graph components across different graphs

  • Build graphs conditionally based on runtime logic

  • Implement custom graph optimization passes

Since PyTorch is the dominant framework for deep learning and uses stream capture, the rest of this documentation focuses on Method 2 (stream capture) and how to effectively use it in PyTorch applications.

Advanced Topics#

While the basics covered above are sufficient for most use cases, CUDA Graphs offer several advanced features for specialized scenarios:

Graph Updates: Instead of recapturing a graph from scratch when you need to change parameters (like kernel arguments or memory addresses), you can update an existing graph in place. This is much faster than full recapture and useful when only small parts of your computation change. See the CUDA Programming Guide on Updating Instantiated Graphs for details.
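As a rough sketch using the names from the explicit-construction example in Method 1 (kernel_a, params_a, graph_exec), swapping a kernel node's arguments in the already-instantiated graph could look like this, where new_kernel_args is a hypothetical replacement argument array:

// Same kernel, grid, and block; only the argument pointers change
cudaKernelNodeParams new_params = params_a;
new_params.kernelParams = new_kernel_args;  // hypothetical new argument array
cudaGraphExecKernelNodeSetParams(graph_exec, kernel_a, &new_params);

// Subsequent cudaGraphLaunch(graph_exec, stream) calls use the new arguments
// without recapturing or re-instantiating the graph.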

Device-Side Graph Launch: CUDA allows kernels running on the GPU to launch other graphs directly, without CPU involvement. This enables complex hierarchical execution patterns where GPU code can dynamically decide which pre-built graph to execute next, reducing CPU-GPU round trips. Learn more in the CUDA Programming Guide on Device Graph Launch.

Conditional Nodes: Conditional nodes allow graphs to contain branching logic that is evaluated on the GPU at runtime. A conditional node can choose between different subgraphs based on a device-accessible condition, enabling some dynamic behavior within an otherwise static graph structure. See CUDA Programming Guide on Conditional Graph Nodes for implementation details.

These advanced features are primarily relevant for CUDA C++ developers and complex use cases. Most PyTorch users won’t need them, but it’s useful to know they exist for specialized optimization scenarios.

What’s Next?#

Warning

Understand the Constraints

Before diving deeper into CUDA Graphs, it’s essential to understand their fundamental constraints. Not all workloads are suitable for graph capture. See Constraints to learn about the limitations imposed by the CUDA runtime, including requirements for static memory addresses, static shapes, and restrictions on synchronization and control flow.

Now that you understand the CUDA Graph basics, you can explore:

  • Constraints: Understand the fundamental limitations imposed by the CUDA runtime before using CUDA Graphs

  • Quantitative Benefits: Learn how to measure and quantify CUDA Graph benefits in your application