Debugging#

This page provides an overview of debugging techniques and tools for CuTe DSL programs.

Getting Familiar with the Limitations#

Before diving into comprehensive debugging capabilities, it’s important to understand the limitations of CuTe DSL. Understanding these limitations will help you avoid potential pitfalls from the start.

Please refer to Limitations for more details.

Source Code Correlation#

CuTe DSL provides Python code to PTX/SASS correlation to enable the profiling/debugging of generated kernels with debug symbols by generating line info when compiling the kernel.

You can enable that globally via the environment variable CUTE_DSL_LINEINFO=1. Alternative, you can use compilation options to enable that per kernel. Please refer to JIT Compilation Options for more details.

DSL Debugging#

CuTe DSL provides built-in logging mechanisms to help you understand the code execution flow and some of the internal state.

Enabling Logging#

CuTe DSL provides environment variables to control logging level:

# Enable console logging (default: False)
export CUTE_DSL_LOG_TO_CONSOLE=1

# Log to file instead of console (default: False)
export CUTE_DSL_LOG_TO_FILE=my_log.txt

# Control log verbosity (0, 10, 20, 30, 40, 50, default: 10)
export CUTE_DSL_LOG_LEVEL=20

Log Categories and Levels#

Similar to standard Python logging, different log levels provide varying degrees of detail:

Level	Description
0	Disabled
10	Debug
20	Info
30	Warning
40	Error
50	Critical

Dump the generated IR#

For users familiar with MLIR and compilers, CuTe DSL supports dumping the Intermediate Representation (IR). This helps you verify whether the IR is generated as expected.

# Dump Generated CuTe IR (default: False)
export CUTE_DSL_PRINT_IR=1

# Keep Generated CuTe IR in a file (default: False)
export CUTE_DSL_KEEP_IR=1

Dump the generated PTX & CUBIN#

For users familiar with PTX and SASS, CuTe DSL supports dumping the generated PTX and CUBIN.

# Dump generated PTX in a .ptx file (default: False)
export CUTE_DSL_KEEP_PTX=1

# Dump generated cubin in a .cubin file (default: False)
export CUTE_DSL_KEEP_CUBIN=1

To further get SASS from cubin, users can use nvdisasm (usually installed with CUDA toolkit) to disassemble the cubin.

nvdisasm your_dsl_code.cubin > your_dsl_code.sass

Access the dumped contents programmatically#

For compiled kernels, the generated PTX/CUBIN/IR can be accessed programmatically as well through following attributes:

__ptx__: The generated PTX code of the compiled kernel.
__cubin__: The generated CUBIN data of the compiled kernel.
__mlir__: The generated IR code of the compiled kernel.

compiled_foo = cute.compile(foo, ...)
print(f"PTX: {compiled_foo.__ptx__}")
with open("foo.cubin", "wb") as f:
    f.write(compiled_foo.__cubin__)

Change the dump directory#

By default, all dumped files are saved in the current working directory. To specify a different directory for the dumped files, please set the environment variable CUTE_DSL_DUMP_DIR accordingly.

Kernel Functional Debugging#

Using Python’s `print` and CuTe’s `cute.printf`#

CuTe DSL programs can use both Python’s native print() as well as our own cute.printf() to print debug information during kernel generation and execution. They differ in a few key ways:

Python’s print() executes during compile-time only (no effect on the generated kernel) and is typically used for printing static values (e.g. a fully static layouts).
cute.printf() executes at runtime on the GPU itself and changes the PTX being generated. This can be used for printing values of tensors at runtime for diagnostics, but comes at a performance overhead similar to that of printf() in CUDA C.

For detailed examples of using these functions for debugging, please refer to the associated notebook referenced in Educational Notebooks.

Handling Unresponsive/Hung Kernels#

When a kernel becomes unresponsive and SIGINT (CTRL+C) fails to terminate it, you can follow these steps to forcefully terminate the process:

Use CTRL+Z to suspend the unresponsive kernel
Execute the following command to terminate the suspended process:

# Terminate the most recently suspended process
kill -9 $(jobs -p | tail -1)

CuTe DSL can also be debugged using standard NVIDIA CUDA tools.

Using Compute-Sanitizer#

For detecting memory errors and race conditions:

compute-sanitizer --some_options python your_dsl_code.py

Please refer to the compute-sanitizer documentation for more details.

Set function name prefix#

By default, the function name (host function or kernel function) is automatically generated based on the function name and its parameters. Sometimes you may want to attach some runtime information to the function name to make performance profiling and debugging easier, e.g., the kernel configs or the rank ids. You can assign a name prefix to the name by calling the set_name_prefix method on the host function or kernel function.

@cute.kernel
def kernel(arg1, arg2, ...):
    ...
@cute.jit
def launch_kernel():
    kernel.set_name_prefix("your_custom_name_prefix")
    kernel(arg1, arg2, ...).launch(grid=[1, 1, 1], block=[1, 1, 1], ...)

For above example, the generated kernel name will be “your_custom_name_prefix_xxx”.

Conclusion#

This page covered several key methods for debugging CuTe DSL programs. Effective debugging typically requires a combination of these approaches. If you encounter issues with DSL, you can enable logging and share the logs with the CUTLASS team as a GitHub issue to report a bug.