Odds and Ends#
This section includes a random set of topics and concepts.
cuDNN Library Configuration#
cuDNN is delivered as a collection of sub-libraries. The default usage of cuDNN requires all sub-libraries; however, there are some sub-libraries that can be dropped and cuDNN will still work, saving binary size with some reduction in support surface and performance. Here, we document which subsets of sub-libraries are valid configurations of the library.
A configuration is enabled through an environment variable, `CUDNN_LIB_CONFIG`. The following table lists the valid configurations. With the environment variable set appropriately, you can remove all but the required sub-libraries from your installation of cuDNN. Over time, we expect to support more configurations.
| Configuration | Summary | Required Sub-Libraries |
|---|---|---|
| `FULL` | Default usage of cuDNN | All |
| `GRAPH_JIT_ONLY` | Support is limited to the graph API, using only JIT engines (no precompiled kernels). | Refer to the library overview in the API reference |
The library overview section of the API reference has more details on the sub-library breakdown and dependency graph.
FULL Configuration#
This is the default usage of cuDNN. We expect most users to continue using this, unless they want to explore tradeoffs for binary size reduction.
GRAPH_JIT_ONLY Configuration#
This configuration provides:
- Significant binary size reduction vs `FULL`
- JIT engines only:
  - Generic Runtime Fusion Engines (for example, Convolution fusions and Matmul fusions)
  - Specialized Runtime Fusion Engines (for example, SDPA)
- Support for GPU architectures NVIDIA Ampere and later
- Functional coverage of the graph API (with some caveats, listed below)
- Performance approaching `FULL` in many cases, with caveats listed below
Because the `GRAPH_JIT_ONLY` configuration does not include engines with precompiled kernels, which reside in the largest sub-library, its binary size is significantly smaller than that of `FULL`. To achieve this reduction, retain only the libraries in the Required Sub-Libraries column in the table above, and remove the remaining, non-required cuDNN sub-libraries.
The tradeoff for this reduced binary size is that, depending on your use case, you may see functional or performance gaps vs `FULL`. In particular, you may see a runtime error due to missing functional support for graph patterns that are only supported by the Specialized Pre-compiled Engines. All other graph patterns supported by `FULL` are also supported by `GRAPH_JIT_ONLY`.
Also, for functionally supported patterns, there may be a performance difference between the `FULL` and `GRAPH_JIT_ONLY` configurations, depending on the graph pattern and problem size. As rough guidance:

- Single-op `ConvolutionBwdFilter` has large performance gaps, on average.
- Single-op `Matmul` and single-op `ConvolutionBwdData` have small performance gaps, on average.
- Single-op `ConvolutionFwd` is at performance parity on average, but you may see performance gaps depending on the problem size.
- Multi-op fusions should be at performance parity when moving to `GRAPH_JIT_ONLY`.
This is an initial version of `GRAPH_JIT_ONLY`. We are actively working on closing gaps with `FULL`. In the future, we expect `GRAPH_JIT_ONLY` to have the same functional coverage of the graph API as `FULL`, with some performance gaps for some specialized cases that have highly optimized pre-compiled kernels in `FULL` only.
Note

`GRAPH_JIT_ONLY` supports NVIDIA Ampere and later, but does not support GPU architectures prior to Ampere.
Sub-Library Configuration Instructions#
Because the `FULL` configuration is the default, there’s no action needed to enable it.

To use a non-default configuration (currently, the only one is `GRAPH_JIT_ONLY`), do the following:

1. Set the environment variable `CUDNN_LIB_CONFIG` to `GRAPH_JIT_ONLY`.
2. Ensure the required sub-libraries are in the `LD_LIBRARY_PATH` (or `PATH` on Windows).

Note

The `GRAPH_JIT_ONLY` configuration is only supported with the cuDNN dynamic libraries and does not currently work with static libraries. We are working on adding this support.
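For illustration, here is a minimal sketch of selecting the configuration from within the process itself. It assumes `CUDNN_LIB_CONFIG` is read when cuDNN first initializes, so the variable must be set before any cuDNN call; exporting the variable in the shell before launching the application is equivalent, and Windows builds would use `_putenv_s` instead of `setenv`.

```cpp
// Minimal sketch: select the GRAPH_JIT_ONLY configuration from within the
// process. Assumption: CUDNN_LIB_CONFIG is read at first cuDNN initialization,
// so it must be set before the first cuDNN call.
#include <cstdio>
#include <cstdlib>
#include <cudnn.h>

int main() {
    // Equivalent to `export CUDNN_LIB_CONFIG=GRAPH_JIT_ONLY` in the shell.
    setenv("CUDNN_LIB_CONFIG", "GRAPH_JIT_ONLY", /*overwrite=*/1);

    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);  // first cuDNN call
    if (status != CUDNN_STATUS_SUCCESS) {
        // If required sub-libraries are missing, creation fails here.
        std::fprintf(stderr, "cudnnCreate failed: %s\n",
                     cudnnGetErrorString(status));
        return 1;
    }
    std::printf("cuDNN initialized with GRAPH_JIT_ONLY\n");
    cudnnDestroy(handle);
    return 0;
}
```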
Thread Safety#
The cuDNN library is thread-safe. Its functions can be called from multiple host threads, so long as the threads do not share the same cuDNN handle simultaneously.
When creating a per-thread cuDNN handle, it is recommended that a single synchronous call to cudnnCreate() be made first, before each thread creates its own handle asynchronously.
Per cudnnCreate(), for multi-threaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handles per thread and use that cuDNN handle for the entire life of the thread.
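As a minimal sketch of this recommendation (assuming a C++11 build linked against cuDNN), the pattern looks like the following: one synchronous `cudnnCreate()` up front, then one handle per worker thread for that thread’s lifetime.

```cpp
// Minimal sketch of the recommended threading pattern: one synchronous
// cudnnCreate() first, then each thread owns its own handle for its lifetime.
#include <cstdio>
#include <thread>
#include <vector>
#include <cudnn.h>

void worker(int id) {
    cudnnHandle_t handle;  // per-thread handle, never shared across threads
    if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) {
        std::fprintf(stderr, "thread %d: cudnnCreate failed\n", id);
        return;
    }
    // ... issue cuDNN work through `handle` for the life of this thread ...
    cudnnDestroy(handle);
}

int main() {
    // Single synchronous cudnnCreate() before spawning threads, so library
    // initialization happens once, up front.
    cudnnHandle_t warmup;
    if (cudnnCreate(&warmup) != CUDNN_STATUS_SUCCESS) return 1;
    cudnnDestroy(warmup);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker, i);
    for (auto& t : threads) t.join();
    return 0;
}
```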
cuDNN Backend Requirements#
For the compatible versions of the cuDNN backend, refer to Supported Products.
SM Carveout#
Starting in cuDNN 8.9.5, SM carveout is supported on NVIDIA Hopper GPUs, allowing expert users to reserve SMs for concurrent execution on a separate CUDA stream. Users can pass a target SM count to cuDNN heuristics and get back a list of engine configs that will use that number of SMs during execution. For advanced use cases that bypass cuDNN heuristics, users can also create the engine config from scratch with the SM carveout configured (the engines that support this feature are listed in the table below).
The following code snippet is a sample for heuristics use cases.
```cpp
// Create heuristics descriptor
cudnnBackendDescriptor_t engHeur;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &engHeur);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_MODE,
                         CUDNN_TYPE_HEUR_MODE, 1, &heurMode);

// SM carveout
int32_t targetSMCount = 66;
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_SM_COUNT_TARGET,
                         CUDNN_TYPE_INT32, 1, &targetSMCount);
cudnnBackendFinalize(engHeur);

// Create engine config descriptor
cudnnBackendDescriptor_t engConfig;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engConfig);

// Retrieve optimal engine config(s) from heuristics
cudnnBackendGetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returnedCount, engConfig);

// "engConfig" should now be ready with target SM count as 66
```
This feature is currently supported by normal convolutions (`Fprop`, `Dgrad`, and `Wgrad`) as well as the Conv-Bias-Act fusions.
[Table: engines supporting SM carveout, by operation: Convolution Forward, Convolution Backward Data, and Convolution Backward Filter]
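Once heuristics return an engine config with the SM count target applied, it can be finalized into an execution plan through the usual backend API. The following is a minimal sketch continuing the snippet above; `handle` is an existing `cudnnHandle_t`, `varPack` is an already-finalized variant pack descriptor, and error checking is omitted.

```cpp
// Minimal sketch: build an execution plan from the engine config returned by
// heuristics above. Error checking omitted for brevity.
cudnnBackendDescriptor_t plan;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                         CUDNN_TYPE_HANDLE, 1, &handle);
cudnnBackendSetAttribute(plan, CUDN N_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engConfig);
cudnnBackendFinalize(plan);

// Execution now uses at most the requested number of SMs, leaving the
// remainder free for concurrent work on a separate CUDA stream.
cudnnBackendExecute(handle, plan, varPack);
```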
Native CUDA Graph API#
For select engines, cuDNN now provides a way to directly build a CUDA graph (not to be confused with a cuDNN graph) from an execution plan and a set of variant pointers. This is a more flexible alternative to using CUDA graph capture, as it allows the updating of an existing CUDA graph with new variant pack pointers. Like other CUDA graphs, the generated graph can be executed arbitrarily many times from different streams, and can be embedded as a child node of a larger CUDA graph.
Engines supporting these APIs have the behavior note `CUDNN_BEHAVIOR_NOTE_SUPPORTS_CUDA_GRAPH_NATIVE_API`.
Both the C++ and Python APIs consist of two functions:
- `populate_cuda_graph()`
- `update_cuda_graph()`
For an example usage of these functions in C++, refer to the cudagraphs.cpp sample.
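As a rough illustration of the flow, the sketch below uses the cuDNN frontend C++ API. The exact argument lists of `populate_cuda_graph()` and `update_cuda_graph()` are assumptions here (they can vary across frontend versions), and CUDA 12-style graph APIs are assumed; treat this as an outline and refer to cudagraphs.cpp for authoritative usage.

```cpp
// Outline of the native CUDA graph flow. The populate/update signatures are
// assumptions based on the cuDNN frontend v1.x API; see cudagraphs.cpp.
#include <unordered_map>
#include <cuda_runtime.h>
#include <cudnn_frontend.h>

void run_with_cuda_graph(cudnn_frontend::graph::Graph& graph, cudnnHandle_t handle,
                         std::unordered_map<int64_t, void*>& variant_pack,
                         void* workspace, cudaStream_t stream) {
    // Build a CUDA graph (not a cuDNN graph) directly from the execution plan.
    cudaGraph_t cuda_graph;
    cudaGraphCreate(&cuda_graph, 0);
    graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);

    // Instantiate once; launch as many times as needed, from any stream.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, cuda_graph, 0);  // CUDA 12-style signature
    cudaGraphLaunch(exec, stream);

    // Later: rebind new device pointers without rebuilding the CUDA graph.
    graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);
    cudaGraphExecUpdateResultInfo info;
    cudaGraphExecUpdate(exec, cuda_graph, &info);  // refresh executable graph
    cudaGraphLaunch(exec, stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(cuda_graph);
}
```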