CUTLASS 3.0 GEMM Backwards Compatibility#
Although CUTLASS 3.0 restructures the GEMM hierarchy and introduces new types for the
threadblock layer and below, we intend the entire source code to be usable in user applications.
We expect users to be able to #include
any source file from CUTLASS 3.0, whether
they implement the 2.x or the 3.x API, without breaking user builds. This means that a single
translation unit should be able to contain any valid kernel regardless of its API version. The
sections below discuss how device
and kernel
layer type names are made compatible across the
two API versions, and what users can expect from the threadblock
layer API going forward.
Compatible Device API#
The entry point for CUTLASS’s Device GEMM API
is the class
cutlass::gemm::device::GemmUniversalAdapter
.
This class lives in the header file
include/cutlass/gemm/device/gemm_universal_adapter.h.
GemmUniversalAdapter
is a “universal adapter”
and serves as a common device interface
for both CUTLASS 3.x and CUTLASS 2.x kernels.
Its template parameter GemmKernel
,
the GEMM kernel type, can be any of the following:
cutlass::gemm::kernel::GemmUniversal, implementing CUTLASS 3.x API kernels;
cutlass::gemm::kernel::GemmUniversal, implementing CUTLASS 2.x API kernels;
any valid CUTLASS 2.x kernel layer GEMM that was previously composable with device::GemmUniversalAdapter.
Users implementing new kernels in either API should prefer
using kernel::GemmUniversal
as the kernel type
and compose it with device::GemmUniversalAdapter
.
Users with existing kernel::Gemm
kernels
can continue to use them as template arguments
of device::GemmUniversalAdapter
. They can adopt
GemmUniversal
as a gradual migration path,
since GemmUniversal
accepts either 3.0 or 2.x collectives.
See the section on kernel::GemmUniversal
below for details.
GemmUniversalAdapter
presents a single
host-side interface to both 3.0 and 2.x kernels.
CUTLASS accomplishes this by
specializing GemmUniversalAdapter
’s implementation
for kernel layer GEMMs that implement the 2.x API
and for those that implement the 3.x API (as detected by gemm::detail::IsCutlass3GemmKernel
, discussed below). As a result, GemmUniversalAdapter
’s behavior
might differ between the two specializations.
Device API design differences#
In CUTLASS 2.x, the Device API was more closely tied to the Kernel API. In CUTLASS 3.0, the Device API accepts any kernel type that meets the Kernel API interface requirements. CUTLASS 3.0’s Device API code is parameterized by the kernel type, but the code itself is generic: the same implementation works for any conforming kernel type.
The device layer compatibility interface, device::GemmUniversalAdapter
,
also provides reflective mappings from 3.0-specific types
back to the closest possible 2.x equivalent types. This is discussed further in the section below.
CUTLASS 3.0’s device::GemmUniversalAdapter
also exposes some new APIs that the 2.x device::GemmUniversalAdapter
implementation does not. Most notably, this includes the ability to bypass the GemmKernel::Arguments
to GemmKernel::Params
lowering.
// Primary run() entry point API that is static allowing users to create and manage their own params.
static Status
run(Params& params, cudaStream_t stream = nullptr);
This new API is useful in the following scenarios:
Running the kernel again does not require reinvoking GemmKernel::to_underlying_arguments().
Manual control over the construction of GemmKernel::Params for custom kernels with custom stride types.
Fully static problem shapes and strides for bespoke kernels where no argument mapping needs to take place.
Compatible Kernel API#
The CUTLASS 3.x API shares the kernel layer API with CUTLASS 2.x
through the single entry point type cutlass::gemm::kernel::GemmUniversal
.
All kernel layer GEMMs are viewed as a composition of a collective mainloop
and a collective epilogue.
kernel::GemmUniversal
implements both 2.x and 3.x APIs
The entry point for CUTLASS’s kernel API is the class
cutlass::gemm::kernel::GemmUniversal
.
This class’ declaration lives in the header file
include/cutlass/gemm/kernel/gemm_universal.hpp.
/*
 * Stateless universal device GEMM kernel type that treats GEMM as
 * a composition of a collective mainloop and a collective epilogue.
 * SFINAE shims both 2.x and 3.0 API kernels based on ProblemShapeOrThreadblockMma_.
 */
template <
class ProblemShapeOrThreadblockMma_,
class CollectiveMainloopOrEpilogue_,
class CollectiveEpilogueOrThreadblockSwizzle_,
class TileScheduler_ = void,
class Enable = void
>
class GemmUniversal;
We call this class “universal” because it can be built
using either the CUTLASS 3.0 or the 2.x mainloops and epilogues.
If GemmUniversal
’s first template argument
(ProblemShapeOrThreadblockMma_
) is a cute::tuple
,
then GemmUniversal
assumes that
the remaining three template arguments
(the mainloop, epilogue, and grid swizzle)
implement the 3.0 APIs.
Otherwise, GemmUniversal
assumes that
the remaining three template arguments
implement the 2.x APIs.
All the template arguments must be either
CUTLASS 3.0 or CUTLASS 2.x types. For example,
GemmUniversal
does not permit using
a 2.x mainloop with a 3.0 collective epilogue.
CUTLASS 3.x implements various embodiments of kernel::GemmUniversal
.
Each kernel layer schedule is specialized
for a GEMM scheduling algorithm and GPU architecture.
Specializations of kernel::GemmUniversal
for 3.0 APIs live in
any of various gemm_*.hpp
files in the directory
include/cutlass/gemm/kernel/.
The specialization to which to dispatch is decided through the dispatch policy’s Schedule
type.
Specializations for 2.x APIs live in the header file include/cutlass/gemm/kernel/gemm_universal.h.
Kernel API design differences#
The CUTLASS 2.x Kernel API was more closely tied to the Device API, as we mentioned above. In particular, the 2.x Device API specified the grid shape used to launch the Kernel API. In CUTLASS 3.0, the Kernel API controls its own grid shape, while the device adapter simply queries the kernel with which it needs to be launched.
This change is required to support various kernel schedules that may need their own schedule-specific grid planning logic. For example, persistent kernel schedules generally launch only as many threadblocks as there are multiprocessors on the GPU.
All CUTLASS 3 kernel::GemmUniversal
specializations expose the following (static) API:
// Returns true if the kernel can execute the provided GEMM arguments.
static bool
can_implement(Arguments const& args);
// Returns a dim3 representing the threadblock shape.
static dim3
get_block_shape();
// Returns a dim3 representing the grid shape in terms of threadblocks.
static dim3
get_grid_shape(Params const& params);
The device adapter simply queries the kernel through these three functions before launching it on the device.
CUTLASS 3.0 provides a meta-function to detect whether a cutlass::gemm::kernel::*
implements
the 3.x API or 2.x API:
// include/cutlass/gemm/gemm.h
namespace cutlass::gemm::detail {
// The following metafunction is used to detect whether a
// `kernel::Gemm` or `kernel::GemmUniversal` implements the CUTLASS 3.x API,
// by checking whether the problem shape type is aliased within.
template <class GemmKernel, class = void>
struct IsCutlass3GemmKernel;
} // namespace cutlass::gemm::detail
Users can use this as a type trait for the kernel API version to dispatch their generic code between 2.x and 3.x specializations.
Threadblock API and Inner Loops#
Much of the CUTLASS 3 GEMM hierarchy for mainloops and inner loops diverges
from that of CUTLASS 2.x. With that also comes the introduction of the
cutlass::gemm::collective
layer as a direct replacement and a superset
of the 2.x cutlass::gemm::threadblock
layer. Going forward,
CUTLASS 3.x will discontinue new developments in the following namespaces.
cutlass::*::threadblock::*
cutlass::*::warp::*
cutlass::gemm::thread::*
cutlass::arch::* (except barrier.h)
The cutlass::gemm::collective layer is a superset of the threadblock layer where
all new mainloops will be developed. Users should look to the CollectiveMma
type
if they wish to author custom mainloop code in the 3.x API.
Similarly, for the GEMM inner loops, cute::MMA_Atom
s replace the
gemm::warp
and gemm::thread
layer code. Going forward, all new PTX instructions
and associated metadata development will occur directly inside cute/arch/*.hpp
and cute/atom/*.hpp
.
The desired inner loop MMA iteration order and tiling can be achieved through careful
selection of the atom layout, value layout, and permutations of the cute::TiledMma
.
For epilogues, the cutlass::epilogue::collective
layer replaces cutlass::epilogue::threadblock
. However, the thread-level epilogue elementwise operations
in cutlass::epilogue::thread
will continue to be used in 3.x kernels as well, albeit with
a more idiomatic epilogue vectorization strategy.
Example 50
shows how to use 2.x epilogue thread operators with 3.0 API kernels.
Copyright#
Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: BSD-3-Clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.