Overview#
The cuDNN library exposes open-source frontend Python and C++ API layers, which provide a simplified programming model that is sufficient for most use cases. These layers offer all of the graph functionality of the cuDNN backend while adding abstractions and utilities for ease of use. In the frontend API, you can describe multiple operations that form subgraphs through a persistent graph object. You don’t have to worry about specifying the shapes and sizes of the intermediate virtual tensors.
The Python frontend API layer and C++ frontend API layer are functionally equivalent. Therefore, you can choose which API layer to use according to your language preference.
Building and Running a cuDNN Graph Workflow#
1. Create a cuDNN graph and specify the global properties. Global properties such as compute precision and input/output data types help infer properties that are not explicitly specified.
2. Create and add the input tensors.
3. Create and add the operation nodes. The outputs of these operations are of tensor type and can be used sequentially as inputs to the next node.
4. Validate the operation graph. This step makes sure the graph is well built and does not have dangling tensors or nodes.
5. Build the cuDNN operation graph. This step lowers the graph into the cuDNN dialect.
6. Create the execution plans, based on the heuristics mode of your choice.
7. (Optional) Check support of the operation graph.
8. (Optional) Filter out plans by your custom criteria.
9. Build (one or all of) the execution plans.
10. (Optional) Run autotuning on the filtered plans.
11. Execute the graph with the relevant data pointers.
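The following is a minimal C++ sketch of these steps for a graph containing a single pointwise addition. The tensor shapes, data types, and device pointers (ptr_a, ptr_b, ptr_c) are illustrative assumptions, not part of the API.

```cpp
#include <cuda_runtime.h>
#include <cudnn_frontend.h>
#include <unordered_map>

namespace fe = cudnn_frontend;

void run_add_graph(cudnnHandle_t handle, void *ptr_a, void *ptr_b, void *ptr_c) {
    // 1. Create the graph and set global properties (example types).
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    // 2. Add the input tensors (example shapes, packed strides).
    auto A = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("A").set_dim({8, 64, 64}).set_stride({64 * 64, 64, 1}));
    auto B = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("B").set_dim({8, 64, 64}).set_stride({64 * 64, 64, 1}));

    // 3. Add an operation node and mark its output as a graph output.
    auto C = graph.pointwise(A, B, fe::graph::Pointwise_attributes()
                                       .set_mode(fe::PointwiseMode_t::ADD));
    C->set_output(true);

    // 4-9. Validate, lower, query heuristics, check support, and build plans.
    if (!graph.validate().is_good()) return;
    if (!graph.build_operation_graph(handle).is_good()) return;
    if (!graph.create_execution_plans({fe::HeurMode_t::A}).is_good()) return;
    if (!graph.check_support(handle).is_good()) return;
    if (!graph.build_plans(handle, fe::BuildPlanPolicy_t::HEURISTICS_CHOICE, false).is_good()) return;

    // 10-11. Allocate workspace and execute with the data pointers.
    void *workspace = nullptr;
    cudaMalloc(&workspace, graph.get_workspace_size());
    std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void *> variant_pack =
        {{A, ptr_a}, {B, ptr_b}, {C, ptr_c}};
    graph.execute(handle, variant_pack, workspace);
    cudaFree(workspace);
}
```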
APIs#
The frontend API follows a functional style of building a graph. Operations take in input tensors and return output tensors. This also allows composition of operations.
| Purpose | C++ API | Python API |
|---|---|---|
| Create tensor | tensor | tensor |

Refer to the Operations section for the APIs of each operation type.
Creating the Graph#
Instantiate an object of class cudnn_frontend::graph::Graph, which will house tensors and operations.
Optional graph-level attributes can be set on the object:
cudnn_frontend::graph::Graph& set_io_data_type(cudnn_frontend::DataType_t)
cudnn_frontend::graph::Graph& set_intermediate_data_type(cudnn_frontend::DataType_t)
cudnn_frontend::graph::Graph& set_compute_data_type(cudnn_frontend::DataType_t)
These attributes are meant to be used as the default in case they are not provided for constituent tensors and operations.
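For example, a minimal sketch (the chosen data types are illustrative):

```cpp
namespace fe = cudnn_frontend;

fe::graph::Graph graph;
// These defaults apply to any tensor or operation that does not set its own types.
graph.set_io_data_type(fe::DataType_t::HALF)
     .set_intermediate_data_type(fe::DataType_t::FLOAT)
     .set_compute_data_type(fe::DataType_t::FLOAT);
```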
Defining Tensors#
You can create input tensors to provide operations within a graph. To add tensors in a graph, use:
std::shared_ptr<cudnn_frontend::graph::Tensor_attributes> cudnn_frontend::graph::tensor(cudnn_frontend::graph::Tensor_attributes)
As the API returns a shared pointer, both the user and the frontend graph are owners of the tensor.
Tensor attributes is a lightweight structure with setters for each attribute.
cudnn_frontend::graph::Tensor_attributes& set_data_type(cudnn_frontend::DataType_t)
cudnn_frontend::graph::Tensor_attributes& set_dim(std::vector<int64_t>&)
cudnn_frontend::graph::Tensor_attributes& set_stride(std::vector<int64_t>&)
cudnn_frontend::graph::Tensor_attributes& set_is_virtual(bool)
cudnn_frontend::graph::Tensor_attributes& set_is_pass_by_value(bool)
cudnn_frontend::graph::Tensor_attributes& set_reordering_type(cudnn_frontend::TensorReordering_t)
cudnn_frontend::graph::Tensor_attributes& set_name(std::string&)
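For example, a minimal sketch creating a 4D input tensor (the shape and strides are illustrative; the snippet assumes the fe alias and graph object from above):

```cpp
// Strides here describe a fully packed, row-major layout.
auto X = graph.tensor(fe::graph::Tensor_attributes()
                          .set_name("X")
                          .set_data_type(fe::DataType_t::HALF)
                          .set_dim({8, 32, 16, 16})
                          .set_stride({32 * 16 * 16, 16 * 16, 16, 1}));
```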
Defining Operations#
Operations take in mandatory input tensors via positional arguments. Optional input tensors are provided using corresponding setters in operation attributes.
Operations return an ordered array of output tensors. Any optional outputs, if not present, will have their shared pointers set to nullptr.
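For example, a minimal matmul sketch (A and B are assumed to have been created as above; the data types are illustrative):

```cpp
// Mandatory inputs are positional; optional properties go through the attributes.
auto C = graph.matmul(A, B,
                      fe::graph::Matmul_attributes()
                          .set_name("GEMM")
                          .set_compute_data_type(fe::DataType_t::FLOAT));
C->set_output(true);  // mark as a non-virtual graph output
```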
Refer to the Operations section for more details.
Validating the Graph#
The validate API ensures the API usage is sound, checks against dangling tensors, and so on. Internally, any unspecified properties, such as dimensions and strides, are inferred.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::validate()
Building the Backend Graph#
This method creates the cuDNN backend descriptors for all constituents of the graph.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::build_operation_graph(cudnnHandle_t handle)
Creating the Execution Plan#
This method internally queries the heuristics for engine configs for the given heuristics modes.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::create_execution_plans(std::vector<cudnn_frontend::HeurMode_t> const&)
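For example (heuristic mode A with a fallback is an illustrative choice):

```cpp
auto status = graph.create_execution_plans({fe::HeurMode_t::A, fe::HeurMode_t::FALLBACK});
```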
Getting the Execution Plan Count#
This method returns the number of execution plans returned by cuDNN heuristics. Each plan gets an index from 0 to #plans-1, with 0 having the top priority.
int64_t cudnn_frontend::graph::Graph::get_execution_plan_count() const;
Checking Graph Support#
This method guarantees that executing the graph with the queried plans will succeed.
cudnn_frontend::error_t cudnn_frontend::graph::Graph::check_support(cudnnHandle_t h);
Building the Execution Plan#
This function builds the execution plans queried with the create_execution_plans(...) API.
There are two flavors of this API:
To build execution plans according to a policy. This allows cuDNN heuristics to construct and return the best execution plan.
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::build_plans(cudnnHandle_t const &handle,
                                          cudnn_frontend::BuildPlanPolicy_t const policy,
                                          bool const do_multithreaded_builds);
To build an individual plan by index. The main use case is to build execution plans in parallel when autotuning. The range of valid plan indices can be derived from the get_execution_plan_count(...) API.
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::build_plan_at_index(cudnnHandle_t const &handle,
                                                  int64_t plan_index);
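A minimal sketch of both flavors (the plan index 0 is illustrative):

```cpp
// Flavor 1: policy-based; heuristics pick and build the preferred plan.
graph.build_plans(handle, fe::BuildPlanPolicy_t::HEURISTICS_CHOICE, false);

// Flavor 2: build one candidate by index, e.g. while autotuning.
graph.build_plan_at_index(handle, 0);
```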
Filtering Plans (Optional)#
You can filter plans by their numerical or behavioral notes, or filter out plans that do not provide the desired functional correctness.
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::select_numeric_notes(std::vector<cudnn_frontend::NumericalNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::select_behavior_notes(std::vector<cudnn_frontend::BehaviorNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_numeric_notes(std::vector<cudnn_frontend::NumericalNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_behavior_notes(std::vector<cudnn_frontend::BehaviorNote_t> const&);
cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_workspace_greater_than(int64_t const workspace);
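For example, a sketch filtering by note and by workspace, assuming these selectors are invoked on the graph object (the chosen note and the 64 MiB limit are illustrative):

```cpp
// Skip non-deterministic plans and any plan needing more than 64 MiB of workspace.
graph.deselect_numeric_notes({fe::NumericalNote_t::NONDETERMINISTIC});
graph.deselect_workspace_greater_than(64 * 1024 * 1024);
```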
Autotuning#
Autotuning provides a way to execute different execution plans for a given graph and measure their relative performance under runtime conditions. This generally helps validate and improve upon the results provided by the heuristics.
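A sketch of a simple autotuning loop under these APIs, using CUDA event timing. The variant_pack and workspace are assumed to be set up as described under Executing the Graph, with the workspace sized by get_autotune_workspace_size().

```cpp
#include <cuda_runtime.h>
#include <limits>

// Build and time each candidate plan once; return the fastest plan index.
int64_t autotune(fe::graph::Graph &graph, cudnnHandle_t handle,
                 std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void *> &variant_pack,
                 void *workspace) {
    int64_t best_index = 0;
    float best_ms = std::numeric_limits<float>::max();
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int64_t i = 0; i < graph.get_execution_plan_count(); ++i) {
        if (!graph.build_plan_at_index(handle, i).is_good()) continue;  // skip unbuildable plans
        cudaEventRecord(start);
        if (!graph.execute(handle, variant_pack, workspace, i).is_good()) continue;
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_index = i; }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best_index;
}
```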
Executing the Graph#
Executing the graph requires device pointers to all input and output tensors, as well as a user-allocated device workspace pointer.
Two flavors of execution exist, corresponding to the two flavors of the build_plans(...) API.
The first flavor targets the candidate execution plan. The candidate execution plan gets internally set when either the BuildPlanPolicy_t::HEURISTICS_CHOICE policy is used, or when any plan is built by the build_plan_at_index(...) API using a plan_index.
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
                                      std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> var_pack,
                                      void *workspace);
The execute API also takes a plan index to target a specific plan. This may be used when autotuning, in conjunction with the build_plan_at_index(...) API.
cudnn_frontend::error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle,
                                      std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> var_pack,
                                      void *workspace,
                                      int64_t plan_index);
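A minimal sketch of wiring up and running the graph (A, B, and C are the tensor handles created earlier; ptr_a, ptr_b, and ptr_c are assumed to be valid device allocations):

```cpp
// Map each non-virtual tensor to its device buffer.
std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void *> variant_pack =
    {{A, ptr_a}, {B, ptr_b}, {C, ptr_c}};

// The workspace must be at least get_workspace_size() bytes.
void *workspace = nullptr;
cudaMalloc(&workspace, graph.get_workspace_size());

auto status = graph.execute(handle, variant_pack, workspace);
```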
Miscellaneous APIs#
Use get_workspace_size to query the workspace needed to execute the currently selected execution plan.
A plan-index variant queries the workspace for a specific plan. This may be used when autotuning, in conjunction with the build_plan_at_index(...) API.
int64_t get_workspace_size() const;
int64_t get_workspace_size_plan_index(int64_t plan_index) const;
Use get_autotune_workspace_size to query the workspace needed to autotune all plans.
int64_t get_autotune_workspace_size() const;
Serialization#
The frontend API provides two flavors of serialization: one checkpoints the graph after the initial specification (before calling validate), and the other serializes it after building the execution plan (to save on plan creation time).
void serialize(json &j) const
void deserialize(const json &j)
The above two APIs capture the user-specified input tensors and nodes of the graph in JSON form. This can be used to generate a log (for debugging) or to visualize the graph being created.
error_t serialize(std::vector<uint8_t> &data) const
error_t deserialize(cudnnHandle_t handle, std::vector<uint8_t> const &data)
A fully built graph can be serialized into a binary blob of data with the above two APIs.
Note
Not all engine configs support serialization.
It is the user’s responsibility to make sure the UIDs of tensors being passed to the variant pack remain consistent before and after serialization.
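A minimal round-trip sketch for the binary flavor (assumes the graph was fully built first):

```cpp
std::vector<uint8_t> blob;
if (graph.serialize(blob).is_good()) {
    fe::graph::Graph restored;
    // Restores the built plan from the blob instead of re-running plan creation.
    auto status = restored.deserialize(handle, blob);
}
```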
Error Handling#
The C++ API returns an error object that carries an error code and an error message.
The Python API raises an exception with a similar error message.
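For example, a minimal C++ check (the message formatting is illustrative):

```cpp
auto status = graph.validate();
if (!status.is_good()) {
    std::cerr << "cuDNN frontend error: " << status.get_message() << std::endl;
}
```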
Operations#
Refer to the Operations section for APIs of different operation types.