FE-OSS APIs Overview

FE-OSS APIs are experimental and subject to change.

This folder documents the Python FE APIs implemented under python/cudnn. For details on currently implemented operations, see:

Installation and setup

All Frontend OSS APIs come installed with the nvidia-cudnn-frontend package. However, each API may require additional optional dependencies defined in the pyproject.toml file. For instance, GEMM + Amax and GEMM + SwiGLU require the cute-dsl optional dependency, which can be installed via:

pip install "nvidia-cudnn-frontend[cutedsl]"

After installation, import the APIs directly from the cudnn package, for example: from cudnn import {your_operation}
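Before importing an operation that depends on an optional extra, it can be useful to confirm that the required top-level modules are importable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper written for this illustration and is not part of the cudnn package.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    Only pass top-level names (no dots): find_spec imports parent
    packages for dotted names and may raise if they are absent.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# "cudnn" is the import name given in the text above; an empty list
# means every requested module can be located.
print(missing_modules(["cudnn"]))
```

If an optional extra is needed, add its import name to the list once you know it; this page does not state which module the cutedsl extra provides.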

API Usage

Each operation exposes two APIs:

1. High-level wrapper

- Single Pythonic function call
- Allocates and returns output tensors
- No explicit compilation step; compiled kernels are cached internally via a simple dictionary lookup

When to use:
- Fast prototyping and common cases
- You want automatic allocation and minimal boilerplate
- You are okay with the library managing the compiled-kernel cache

from cudnn import {your_operation}_wrapper
outputs = {your_operation}_wrapper(
    inputs,
    ...,
    config_options,
    ...,
    stream=None,
)

2. Class API

- Explicit lifecycle with compile and execute steps
- Reusable object that holds the compiled kernel for multiple executions
- Requires preallocated output tensors

When to use:
- You need to reuse a compiled kernel across many calls
- You want explicit control over compilation and lifecycle management

from cudnn import {your_operation}

op = {your_operation}(
    sample_inputs,
    ...,
    sample_outputs,
    ...,
    config_options,
    ...,
)
op.compile(
    current_stream=None,
)
op.execute(
    inputs,
    ...,
    outputs,
    ...,
    current_stream=None,
    skip_compile=False,
)

Methods:

- check_support(): validates the target problem configuration (e.g. tensor shapes, tensor strides, dtypes, tiling/cluster/kernel configurations, and environment).
- compile(current_stream): compiles the kernel using the sample tensors and parameters passed to the constructor.
- execute(inputs, ..., outputs, ..., current_stream, skip_compile): runs the kernel on the provided inputs and writes into the provided outputs.
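The lifecycle of these three methods can be sketched with a pure-Python stand-in. `ToyScale` is a hypothetical class and not part of cudnn; in particular, having compile() call check_support() and having execute() raise when nothing has been compiled are assumptions of this sketch rather than documented behaviour.

```python
class ToyScale:
    """Hypothetical stand-in for a class-API operation (not a cudnn class)."""

    def __init__(self, sample_input, sample_output, factor=2.0):
        # Sample tensors only describe the problem (here: their lengths).
        self.n_in = len(sample_input)
        self.n_out = len(sample_output)
        self.factor = factor
        self._kernel = None

    def check_support(self):
        """Validate the problem configuration before compiling."""
        if self.n_in != self.n_out:
            raise ValueError("input and output must have the same length")
        return True

    def compile(self, current_stream=None):
        """Build the kernel once so later execute() calls can reuse it."""
        self.check_support()  # assumed ordering for this sketch
        factor = self.factor

        def kernel(x, out):
            for i, v in enumerate(x):
                out[i] = v * factor  # writes into the preallocated output

        self._kernel = kernel

    def execute(self, x, out, current_stream=None):
        """Run the compiled kernel on caller-provided input and output."""
        if self._kernel is None:
            raise RuntimeError("call compile() before execute()")
        self._kernel(x, out)


op = ToyScale([0.0] * 3, [0.0] * 3)
op.compile()
out = [0.0] * 3            # the caller preallocates the output
op.execute([1.0, 2.0, 3.0], out)
print(out)                 # prints [2.0, 4.0, 6.0]
```

The same `op` object can then be executed repeatedly with new inputs and outputs of the same shape, which is the reuse case the class API targets.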

Common Parameters and Conventions

CUDA stream (current_stream in the class API, stream in the wrapper)
- The CUDA stream on which the operation's kernel executes.
- Default: None (uses the default stream)
skip_compile: bool (used by the class API execute method)
- If False, the caller must explicitly call compile before calling execute; execute then uses the precompiled kernel.
- If True, execute runs a JIT path that (re)compiles the kernel on each call.
- Default: False
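The practical difference between the two skip_compile settings is how often compilation runs. The toy below follows the description above literally and counts compilations; `CountingOp` is hypothetical, and raising an error when execute is called without a prior compile is an assumption of this sketch, not documented library behaviour.

```python
class CountingOp:
    """Hypothetical toy operation that counts how often it compiles."""

    def __init__(self):
        self.compile_count = 0
        self._kernel = None

    def compile(self, current_stream=None):
        self.compile_count += 1
        self._kernel = lambda x: [v + 1 for v in x]

    def execute(self, x, out, current_stream=None, skip_compile=False):
        if skip_compile:
            # JIT path: (re)compile on every call, as described above.
            self.compile(current_stream)
        elif self._kernel is None:
            # Assumed behaviour for this sketch only.
            raise RuntimeError("compile() must be called before execute()")
        out[:] = self._kernel(x)  # write into the preallocated output


precompiled = CountingOp()
precompiled.compile()          # explicit compile, done once
jit = CountingOp()             # never compiled explicitly
out = [0, 0]
for _ in range(3):
    precompiled.execute([1, 2], out)
    jit.execute([1, 2], out, skip_compile=True)

print(precompiled.compile_count, jit.compile_count)  # prints 1 3
```

In this model the default (skip_compile=False) pays the compilation cost once, so it is the natural choice when the same kernel runs many times.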

File structure and examples

All FE-OSS APIs are implemented in the python/cudnn directory.
Correctness tests and samples are in the test/python/fe_api directory.