Odds and Ends#
This section includes a random set of topics and concepts.
cuDNN Library Configuration#
cuDNN is delivered as a collection of sub-libraries. The default usage of cuDNN requires all sub-libraries; however, there are some sub-libraries that can be dropped and cuDNN will still work, saving binary size with some reduction in support surface and performance. Here, we document which subsets of sub-libraries are valid configurations of the library.
A configuration is enabled through an environment variable, `CUDNN_LIB_CONFIG`. The following table lists the valid configurations. With the environment variable set appropriately, you can remove all but the required sub-libraries from your installation of cuDNN. Over time, we expect to support more configurations.
| Configuration | Summary | Required Sub-Libraries |
|---|---|---|
| `FULL` | Default usage of cuDNN | All |
| `GRAPH_JIT_ONLY` | Support is limited to the graph API, using only JIT engines (no precompiled kernels). | Refer to the library overview in the API reference |
The library overview section of the API reference has more details on the sub-library breakdown and dependency graph.
FULL Configuration#
This is the default usage of cuDNN. We expect most users to continue using this, unless they want to explore tradeoffs for binary size reduction.
GRAPH_JIT_ONLY Configuration#
This configuration provides:
- Significant binary size reduction vs `FULL`
- JIT engines only:
  - Generic Runtime Fusion Engines (for example, Convolution fusions and Matmul fusions)
  - Specialized Runtime Fusion Engines (for example, SDPA)
- Support for GPU architectures NVIDIA Ampere and later
- Functional coverage of the graph API (with some caveats, listed below)
- Performance approaching `FULL` in many cases, with caveats listed below
Because the `GRAPH_JIT_ONLY` configuration does not include engines with precompiled kernels, which reside in the largest sub-library, its binary size is significantly smaller than that of `FULL`. To achieve this reduction, retain only the libraries in the Required Sub-Libraries column in the table above, and remove the remaining, non-required cuDNN sub-libraries.
The tradeoff for this reduced binary size is that, depending on your use case, you may see functional or performance gaps vs `FULL`. In particular, you may see a runtime error due to missing functional support for graph patterns that are only supported by the Specialized Pre-compiled Engines. All other graph patterns supported by `FULL` are also supported by `GRAPH_JIT_ONLY`.
Also, for functionally supported patterns, there may be a performance difference between the `FULL` and `GRAPH_JIT_ONLY` configurations, depending on the graph pattern and problem size. As rough guidance:

- Single-op `ConvolutionBwdFilter` has large performance gaps, on average.
- Single-op `Matmul` and single-op `ConvolutionBwdData` have small performance gaps, on average.
- Single-op `ConvolutionFwd` is at performance parity on average, but you may see performance gaps depending on the problem size.
- Multi-op fusions should be at performance parity when moving to `GRAPH_JIT_ONLY`.
This is an initial version of `GRAPH_JIT_ONLY`. We are actively working on closing gaps with `FULL`. In the future, we expect `GRAPH_JIT_ONLY` to have the same functional coverage of the graph API as `FULL`, with some performance gaps for some specialized cases that have highly optimized pre-compiled kernels in `FULL` only.
Note

`GRAPH_JIT_ONLY` supports NVIDIA Ampere and later, but does not support GPU architectures prior to Ampere.
Sub-Library Configuration Instructions#
Because the `FULL` configuration is the default, there’s no action needed to enable it.

To use a non-default configuration (currently, the only one is `GRAPH_JIT_ONLY`), do the following:

1. Set the environment variable `CUDNN_LIB_CONFIG` to `GRAPH_JIT_ONLY`.
2. Ensure the required sub-libraries are in the `LD_LIBRARY_PATH` (or `PATH` on Windows).

Note

The `GRAPH_JIT_ONLY` configuration is only supported with the cuDNN dynamic libraries and does not currently work with static libraries. We are working on adding this support.
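For illustration, here is a minimal sketch of selecting the configuration from within the process itself. It assumes `CUDNN_LIB_CONFIG` is read when cuDNN first initializes, so the variable must be set before any cuDNN call; exporting the variable in the shell before launching the application is equivalent, and Windows builds would use `_putenv_s` instead of `setenv`.

```cpp
// Minimal sketch: select the GRAPH_JIT_ONLY configuration from within the
// process. Assumption: CUDNN_LIB_CONFIG is read at first cuDNN initialization,
// so it must be set before the first cuDNN call.
#include <cstdio>
#include <cstdlib>
#include <cudnn.h>

int main() {
    // Equivalent to `export CUDNN_LIB_CONFIG=GRAPH_JIT_ONLY` in the shell.
    setenv("CUDNN_LIB_CONFIG", "GRAPH_JIT_ONLY", /*overwrite=*/1);

    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);  // first cuDNN call
    if (status != CUDNN_STATUS_SUCCESS) {
        // If required sub-libraries are missing, creation fails here.
        std::fprintf(stderr, "cudnnCreate failed: %s\n",
                     cudnnGetErrorString(status));
        return 1;
    }
    std::printf("cuDNN initialized with GRAPH_JIT_ONLY\n");
    cudnnDestroy(handle);
    return 0;
}
```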
Thread Safety#
The cuDNN library is thread-safe. Its functions can be called from multiple host threads, so long as the threads do not share the same cuDNN handle simultaneously.
When creating a per-thread cuDNN handle, it is recommended that a single synchronous call to cudnnCreate() be made first, before each thread creates its own handle asynchronously.
Per cudnnCreate(), for multi-threaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handles per thread and use that cuDNN handle for the entire life of the thread.
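As a minimal sketch of this recommendation (assuming a C++11 build linked against cuDNN), the pattern looks like the following: one synchronous `cudnnCreate()` up front, then one handle per worker thread for that thread’s lifetime.

```cpp
// Minimal sketch of the recommended threading pattern: one synchronous
// cudnnCreate() first, then each thread owns its own handle for its lifetime.
#include <cstdio>
#include <thread>
#include <vector>
#include <cudnn.h>

void worker(int id) {
    cudnnHandle_t handle;  // per-thread handle, never shared across threads
    if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) {
        std::fprintf(stderr, "thread %d: cudnnCreate failed\n", id);
        return;
    }
    // ... issue cuDNN work through `handle` for the life of this thread ...
    cudnnDestroy(handle);
}

int main() {
    // Single synchronous cudnnCreate() before spawning threads, so library
    // initialization happens once, up front.
    cudnnHandle_t warmup;
    if (cudnnCreate(&warmup) != CUDNN_STATUS_SUCCESS) return 1;
    cudnnDestroy(warmup);

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker, i);
    for (auto& t : threads) t.join();
    return 0;
}
```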
cuDNN Backend Requirements#
For the compatible versions of the cuDNN backend, refer to Supported Products.
SM Carveout#
Starting in cuDNN 8.9.5, SM carveout is supported on NVIDIA Hopper GPUs, allowing expert users to reserve SMs for concurrent execution on a separate CUDA stream. Users can pass a target SM count to cuDNN heuristics and get back a list of engine configs that will use that number of SMs during execution. For advanced use cases that bypass cuDNN heuristics, users can also create the engine config from scratch with the SM carveout configured (the engines that support this feature are listed in the table below).
The following code snippet is a sample for heuristics use cases.
```cpp
// Create heuristics descriptor
cudnnBackendDescriptor_t engHeur;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &engHeur);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_MODE,
                         CUDNN_TYPE_HEUR_MODE, 1, &heurMode);

// SM carveout
int32_t targetSMCount = 66;
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_SM_COUNT_TARGET,
                         CUDNN_TYPE_INT32, 1, &targetSMCount);
cudnnBackendFinalize(engHeur);

// Create engine config descriptor
cudnnBackendDescriptor_t engConfig;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engConfig);

// Retrieve optimal engine config(s) from heuristics
cudnnBackendGetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returnedCount, engConfig);

// "engConfig" should now be ready with target SM count as 66
```
This feature is currently supported by normal convolutions (`Fprop`, `Dgrad`, and `Wgrad`) as well as the Conv-Bias-Act fusions.
[Table: engines supporting SM carveout, by operation: Convolution Forward, Convolution Backward Data, and Convolution Backward Filter]
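Once heuristics return an engine config with the SM count target applied, it can be finalized into an execution plan through the usual backend API. The following is a minimal sketch continuing the snippet above; `handle` is an existing `cudnnHandle_t`, `varPack` is an already-finalized variant pack descriptor, and error checking is omitted.

```cpp
// Minimal sketch: build an execution plan from the engine config returned by
// heuristics above. Error checking omitted for brevity.
cudnnBackendDescriptor_t plan;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                         CUDNN_TYPE_HANDLE, 1, &handle);
cudnnBackendSetAttribute(plan, CUDN N_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                         CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engConfig);
cudnnBackendFinalize(plan);

// Execution now uses at most the requested number of SMs, leaving the
// remainder free for concurrent work on a separate CUDA stream.
cudnnBackendExecute(handle, plan, varPack);
```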
Native CUDA Graph API#
For select engines, cuDNN now provides a way to directly build a CUDA graph (not to be confused with a cuDNN graph) from an execution plan and a set of variant pointers. This is a more flexible alternative to using CUDA graph capture, as it allows the updating of an existing CUDA graph with new variant pack pointers. Like other CUDA graphs, the generated graph can be executed arbitrarily many times from different streams, and can be embedded as a child node of a larger CUDA graph.
Engines supporting these APIs have the behavior note `CUDNN_BEHAVIOR_NOTE_SUPPORTS_CUDA_GRAPH_NATIVE_API`.
Both the C++ and Python APIs consist of two functions:
- `populate_cuda_graph()`
- `update_cuda_graph()`
For an example usage of these functions in C++, refer to the cudagraphs.cpp sample.
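As a rough illustration of the flow, the sketch below uses the cuDNN frontend C++ API. The exact argument lists of `populate_cuda_graph()` and `update_cuda_graph()` are assumptions here (they can vary across frontend versions), and CUDA 12-style graph APIs are assumed; treat this as an outline and refer to cudagraphs.cpp for authoritative usage.

```cpp
// Outline of the native CUDA graph flow. The populate/update signatures are
// assumptions based on the cuDNN frontend v1.x API; see cudagraphs.cpp.
#include <unordered_map>
#include <cuda_runtime.h>
#include <cudnn_frontend.h>

void run_with_cuda_graph(cudnn_frontend::graph::Graph& graph, cudnnHandle_t handle,
                         std::unordered_map<int64_t, void*>& variant_pack,
                         void* workspace, cudaStream_t stream) {
    // Build a CUDA graph (not a cuDNN graph) directly from the execution plan.
    cudaGraph_t cuda_graph;
    cudaGraphCreate(&cuda_graph, 0);
    graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);

    // Instantiate once; launch as many times as needed, from any stream.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, cuda_graph, 0);  // CUDA 12-style signature
    cudaGraphLaunch(exec, stream);

    // Later: rebind new device pointers without rebuilding the CUDA graph.
    graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);
    cudaGraphExecUpdateResultInfo info;
    cudaGraphExecUpdate(exec, cuda_graph, &info);  // refresh executable graph
    cudaGraphLaunch(exec, stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(cuda_graph);
}
```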