FAQs#

General#

Are the DSLs replacing C++ templates?

TL;DR: No - but also yes. The CUTLASS 4.0 release (CuTe DSL), along with all future extensions to our Python-native programming models, does not come at the expense of CUTLASS C++. The CUTLASS 2.x and 3.x C++ APIs will both continue to receive fixes and updates for the architectures we support them on. However, CUTLASS 4.x CuTe DSL is fully isomorphic with CuTe C++ in both programming model and performance on Blackwell, and we hope the community embraces it for much easier, yet equally performant, custom kernel development. This is why we are releasing CuTe DSL with support for all architectures starting with the NVIDIA Ampere Architecture.

What is the difference between CuTe DSL, CUTLASS Python, and CUTLASS DSLs?

CUTLASS Python was the Python interface for instantiating C++ kernels via a Python frontend. This is now deprecated with the release of CUTLASS 4.0. CUTLASS DSLs are a family of Python DSLs for native device programming in Python. Currently, this is limited to our initial release of CuTe DSL, but future versions will include higher-level abstractions that gradually trade off control for convenience.

What should I learn, CUTLASS C++ or the Python DSLs?

We believe the Python DSLs will significantly ease the learning curve and recommend that all newcomers start with them, as they eliminate the inherent complexity of learning C++ metaprogramming for GPU kernel programming. Since CuTe C++ and CuTe DSL share fully isomorphic programming models and patterns, any knowledge gained can later be applied to C++.

Where will the code live? PIP wheel or GitHub repo? Do I have to build it myself?

This is a major change compared to CUTLASS C++. Going forward, the GitHub repository exists primarily as a place for users to file issues and pull requests. While the repository sources can be used together with the pip wheel, we do not recommend this unless you are hacking on the DSL itself. For all other users, we recommend simply running pip install nvidia-cutlass-dsl and using the pip wheel as the single source of truth for the dialect compiler and DSL implementation. The CUTLASS GitHub repository contains a requirements.txt file pinning the version of the wheel consistent with the state of the OSS repository (please see the Quick Start Guide). This means getting started with CUTLASS is easier than ever: no more CMake command lines to learn and no more builds to kick off. Simply install the pip wheel and start running the examples.
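For example, a project checkout can pin the wheel the same way the repository does. The version below is illustrative only; copy the actual pin from the repository's requirements.txt:

```text
# requirements.txt -- pin the DSL wheel to match the OSS checkout
# (version shown is illustrative; use the pin from the repository)
nvidia-cutlass-dsl==4.0.0
```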

Migration#

Should I port my code from C++ templates to Python?

Almost certainly not, unless you need extremely fast JIT times for your kernel and C++ compile times are a blocker for you. The 2.x and 3.x APIs will continue to be supported, and for NVIDIA's Hopper and Blackwell architectures, the 3.x API will continue to improve in both features and performance.

Are portability promises different with Python?

For the initial release, while the DSL is still in beta, we do not promise any portability, as we may make changes to the DSL itself. While we do not expect any changes to the CuTe operations, the DSL utilities, decorators, and helper classes such as pipelines and schedulers may change as we refine them with community feedback. We encourage users to file issues and discussions on GitHub during this beta period with their feedback!

In the long term, we plan to continue to treat the OSS community with care. Just like the prior history of CUTLASS, we plan not to break users unless necessary, but we reserve the right to make limited breaking changes in case we believe it is a net benefit to the community and project. These will be announced ahead of time and/or clearly highlighted in the CHANGELOG of each release.

Technical#

What NVIDIA architectures will it support?

CuTe DSL will support all NVIDIA GPU architectures starting with NVIDIA Ampere Architecture (SM80).

Will it be compatible with DL frameworks (e.g., PyTorch, JAX)?

Yes, we will provide utilities to convert from DLPack-supported tensor formats to cute.Tensor. This should allow a user to never have to leave Python when writing model code in their framework of choice. Our JAX interoperability story is not as strong as PyTorch's today; however, we are actively working on improving it and welcome contributions in this space.
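As a minimal illustration of the DLPack protocol itself, the sketch below uses NumPy as both producer and consumer; in the DSL, the consumer side would instead be the cute.Tensor conversion utility shipped with the wheel:

```python
import numpy as np

# Any object implementing the DLPack protocol (__dlpack__/__dlpack_device__)
# can be consumed zero-copy by another DLPack-aware library. NumPy arrays
# implement the protocol, so they stand in here for a framework tensor
# (e.g. torch.Tensor) on the producer side.
producer = np.arange(6, dtype=np.float32).reshape(2, 3)

# np.from_dlpack consumes the producer's DLPack capsule; the DSL's analogous
# utility would yield a cute.Tensor instead of an ndarray.
consumer = np.from_dlpack(producer)

# Zero-copy: mutating the producer is visible through the consumer.
producer[0, 0] = 42.0
assert consumer[0, 0] == 42.0
```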

Does it compile to PTX or SASS?

CuTe DSL compiles the program down to PTX. From there, we currently use the PTX compiler that ships with the CUDA toolkit to compile the PTX down to SASS. We plan to remove this limitation in the future and allow use of the PTX JIT included in the CUDA driver, for cases where a user does not have a CUDA toolkit installed.

Do I need to use NVCC or NVRTC?

No, the nvidia-cutlass-dsl wheel packages everything needed to generate GPU kernels. It shares the driver requirements of the CUDA 12.9 toolkit, which can be found here.

How would one debug the code?

Since CuTe DSL is an embedded DSL rather than native Python, tools like pdb cannot be used. However, if you have experience with GPU kernel programming, the debugging techniques will be nearly identical. Typically, compile-time and runtime printing of types and values is the most expedient. Please see the documentation on printing to learn how to print types and values at both compile time and runtime. You can also use cuda-gdb to set breakpoints in the program and step through the execution, or use tools such as compute-sanitizer to detect and triage bugs in your program. As the DSL matures, our source-location tracking from Python user programs will also improve, providing more helpful source-level mapping when setting breakpoints and using other tools such as Nsight.
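To see why pdb does not apply, a toy staged-execution sketch in plain Python (not the CuTe DSL API) can help: the Python body of an embedded-DSL function runs once at trace/compile time, while the object it builds runs later.

```python
# Toy illustration of an embedded DSL's two phases (not CuTe DSL API):
# the traced Python body executes once while tracing ("compile time"),
# and the object it returns executes later ("runtime"). A pdb breakpoint
# in the body would therefore fire during tracing, not during execution.
compile_log = []
runtime_log = []

def trace(fn):
    # "Compile time": run the Python body once to record its operations.
    ops = []
    fn(ops.append)
    compile_log.append(f"traced {fn.__name__} with {len(ops)} ops")

    def compiled(x):
        # "Runtime": replay the recorded operations, like a launched kernel.
        for op in ops:
            x = op(x)
        runtime_log.append(x)
        return x

    return compiled

def kernel(emit):
    print("this prints once, at trace time")  # compile-time print
    emit(lambda x: x * 2)
    emit(lambda x: x + 1)

k = trace(kernel)        # the Python body runs here, exactly once
results = [k(1), k(10)]  # replays run here; the body does not re-execute
assert results == [3, 21]
```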

How would one implement warp specialization in CuTe DSL?

Exactly the same way you would in C++, but in Python-native syntax. Consult our Control Flow documentation and the "Blackwell kernel example" for a detailed how-to guide.
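The pattern keeps the familiar C++ shape: branch on the warp's role, then run a producer or consumer loop against a shared pipeline. The sketch below is pseudocode with assumed helper names (warp_idx, the pipeline methods), not the exact CuTe DSL API; see the linked example for the real code.

```python
# Pseudocode sketch of warp specialization (assumed names, not exact API).
@cute.kernel
def specialized_kernel(...):
    warp_idx = cute.arch.warp_idx()          # which role does this warp play?
    if warp_idx == PRODUCER_WARP:
        # Producer warp: issue async copies from global to shared memory.
        for stage in range(num_stages):
            pipeline.producer_acquire(stage)
            copy_gmem_to_smem(stage)
            pipeline.producer_commit(stage)
    else:
        # Consumer warps: wait on filled stages and run the math on them.
        for stage in range(num_stages):
            pipeline.consumer_wait(stage)
            mma_compute(stage)
            pipeline.consumer_release(stage)
```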

Can I call functions from other functions or use OOP?

Yes. Our code frequently has functions calling one another, and we set up class hierarchies to organize and modularize our pipelines and schedulers. Consult the Introduction documentation or our examples for more details.
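As a plain-Python illustration of the kind of organization described (the class names here are illustrative, not DSL-provided APIs):

```python
# Illustrative organization only -- these classes are not CuTe DSL APIs.
# Kernels are often factored into small cooperating objects: a scheduler
# that hands out tiles of work, and a pipeline that sequences stages.
class TileScheduler:
    """Hands out tile indices to process (illustrative)."""
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles

    def tiles(self):
        return range(self.num_tiles)

class Pipeline:
    """Runs a sequence of stages over each tile (illustrative)."""
    def __init__(self, stages):
        self.stages = stages  # ordinary functions, freely composed

    def run(self, tile):
        value = tile
        for stage in self.stages:
            value = stage(value)
        return value

# Ordinary function composition and OOP work exactly as in regular Python.
load = lambda t: t * 10
compute = lambda v: v + 1
pipe = Pipeline([load, compute])
sched = TileScheduler(3)
results = [pipe.run(t) for t in sched.tiles()]
assert results == [1, 11, 21]
```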

License#

What is the license for CuTe DSL and the associated GitHub samples?

CuTe DSL components available on GitHub and via the nvidia-cutlass-dsl Python pip wheel are released under the "NVIDIA Software End User License Agreement (EULA)". Because the pip package includes a compiler that shares several components with the CUDA Toolkit, it is subject to usage terms and restrictions similar to those of the CUDA SDK. Please refer to the EULA for specific terms of use.

CuTe DSL samples and Jupyter notebooks, released on GitHub, are provided under the BSD 3-Clause License and may be used and redistributed under those terms. This distinction ensures that developers have flexibility when using or modifying the code samples, independent of the compiler and runtime components governed by the EULA.

If you have any questions or need clarification, feel free to contact us.