Debugging#
This page provides an overview of debugging techniques and tools for CuTe DSL programs.
Getting Familiar with the Limitations#
Before using the debugging tools described on this page, it helps to understand the limitations of CuTe DSL so that you can avoid common pitfalls from the start.
Please refer to Limitations for more details.
DSL Debugging#
CuTe DSL provides built-in logging mechanisms to help you understand the code execution flow and some of the internal state.
Enabling Logging#
CuTe DSL provides environment variables that control where logs are written and how verbose they are:
# Enable console logging (default: False)
export CUTE_DSL_LOG_TO_CONSOLE=1
# Log to file instead of console (default: False)
export CUTE_DSL_LOG_TO_FILE=my_log.txt
# Control log verbosity (0, 10, 20, 30, 40, 50, default: 10)
export CUTE_DSL_LOG_LEVEL=20
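To enable logging for a single run without exporting the variables into the whole shell session, the assignments can be placed directly in front of the command. The following is a minimal sketch that uses only the environment variables listed above; the helper name `run_with_cute_logs` is hypothetical and not part of CuTe DSL.

```shell
# Hypothetical helper (not part of CuTe DSL): run one command with
# debug-level file logging; the CUTE_DSL_* variables are set only for
# that command and are not exported into the calling shell.
run_with_cute_logs() {
    log_file="$1"   # path handed to CUTE_DSL_LOG_TO_FILE
    shift           # the remaining arguments are the command to run
    CUTE_DSL_LOG_TO_FILE="$log_file" CUTE_DSL_LOG_LEVEL=10 "$@"
}

# Example usage:
#   run_with_cute_logs my_log.txt python your_dsl_code.py
```

Level 10 corresponds to the most detailed (debug) output; a higher value such as 20 or 30 reduces the amount of logging.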
Log Categories and Levels#
Similar to standard Python logging, different log levels provide varying degrees of detail:
Level | Description |
---|---|
0 | Disabled |
10 | Debug |
20 | Info |
30 | Warning |
40 | Error |
50 | Critical |
Dump the generated IR#
For users familiar with MLIR and compilers, CuTe DSL supports dumping the Intermediate Representation (IR). This helps you verify whether the IR is generated as expected.
# Dump Generated CuTe IR (default: False)
export CUTE_DSL_PRINT_IR=1
# Keep Generated CuTe IR in a file (default: False)
export CUTE_DSL_KEEP_IR=1
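Printed IR can be long, so it is often easier to inspect in a file than in the terminal. The following is a minimal sketch, assuming the IR enabled by `CUTE_DSL_PRINT_IR` is written to the process's standard output or standard error (both are captured here); the helper name `dump_cute_ir` is hypothetical.

```shell
# Hypothetical helper (not part of CuTe DSL): run a command with IR
# printing enabled and capture everything it prints in a file.
dump_cute_ir() {
    out_file="$1"   # file that receives stdout and stderr
    shift           # the remaining arguments are the command to run
    CUTE_DSL_PRINT_IR=1 "$@" > "$out_file" 2>&1
}

# Example usage:
#   dump_cute_ir ir_dump.txt python your_dsl_code.py
```

The file also contains any other output of the program, so the IR may need to be located within it.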
Kernel Functional Debugging#
Using Python’s print and CuTe’s cute.printf#
CuTe DSL programs can use both Python’s native print() and CuTe’s cute.printf() to print debug information during kernel generation and execution. They differ in a few key ways:
- Python’s print() executes at compile time only (it has no effect on the generated kernel) and is typically used for printing static values (e.g. fully static layouts).
- cute.printf() executes at runtime on the GPU itself and changes the generated PTX. It can be used to print tensor values at runtime for diagnostics, but it incurs a performance overhead similar to that of printf() in CUDA C.
For detailed examples of using these functions for debugging, please refer to the associated notebook referenced in Educational Notebooks.
Handling Unresponsive/Hung Kernels#
When a kernel becomes unresponsive and SIGINT (CTRL+C) fails to terminate it, you can follow these steps to forcefully terminate the process:
1. Use CTRL+Z to suspend the unresponsive process.
2. Execute the following command to terminate the suspended process:
# Terminate the most recently suspended process
kill -9 $(jobs -p | tail -1)
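As a complement to the manual steps above, a run can be given a time limit up front so that a hang does not require intervention at the keyboard. This is a sketch that assumes the `timeout` utility from GNU coreutils is installed; the helper name `run_with_deadline` is hypothetical.

```shell
# Hypothetical helper: run a command, but kill it if it is still
# running after the given number of seconds.
run_with_deadline() {
    seconds="$1"    # time limit in seconds
    shift           # the remaining arguments are the command to run
    # -s KILL sends SIGKILL, the same signal that `kill -9` sends.
    timeout -s KILL "$seconds" "$@"
}

# Example usage:
#   run_with_deadline 120 python your_dsl_code.py
```

A non-zero exit status then means that the command either failed on its own or was killed at the deadline.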
CuTe DSL can also be debugged using standard NVIDIA CUDA tools.
Using Compute-Sanitizer#
For detecting memory errors and race conditions:
compute-sanitizer --some_options python your_dsl_code.py
Please refer to the compute-sanitizer documentation for more details.
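As an illustration of what `--some_options` might be replaced with, the sketch below selects a specific checker through compute-sanitizer's `--tool` option: `memcheck` reports invalid memory accesses, and `racecheck` reports shared-memory data hazards. The wrapper name `sanitize` is hypothetical, and `your_dsl_code.py` is the same placeholder script name used above.

```shell
# Hypothetical wrapper: run a command under one compute-sanitizer tool.
sanitize() {
    tool="$1"       # e.g. memcheck or racecheck
    shift           # the remaining arguments are the command to run
    compute-sanitizer --tool "$tool" "$@"
}

# Example usage:
#   sanitize memcheck python your_dsl_code.py
#   sanitize racecheck python your_dsl_code.py
```

Running the tools one at a time keeps each report focused on a single class of error.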
Conclusion#
This page covered several key methods for debugging CuTe DSL programs. Effective debugging typically requires a combination of these approaches. If you encounter an issue with the DSL itself, enable logging and include the logs when you report the bug to the CUTLASS team in a GitHub issue.