Changelog
CUTLASS 4.x
4.2.1 (2025-09-22)
CuTe DSL
- Bug fixes and improvements
  - Fixed an issue when running DSL code with cuda-python 13.0
  - Fixed an issue when running inductor with DSL code
  - Fixed an issue with unexpected logging when running DSL code in FlashInfer
  - Fixed the issue reported in https://github.com/NVIDIA/cutlass/issues/2647
  - Fixed an issue with conditional definition of variables outside of dynamic control flow
CUTLASS C++
- Bypass EVT for nosmem blockwise kernels on Blackwell. 
- Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen. 
4.2.0 (2025-09-15)
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64, including
  - Python 3.10, 3.11, 3.12, and 3.13
- Added new example and updated notebook to get started with CuTe DSL
  - Updates on TensorSSA demonstration
    - Added a section introducing broadcast
- API updates
  - Please refer to the DSL API changelog for details
- Bug fixes and improvements
  - Fixed cute.print_tensor for coordinate tensors
  - Fixed cute.print for tuples of layouts
  - Fixed an issue where a frozen object was not properly updated after being fully assigned in dynamic control flow
  - Fixed an issue where assigning a tuple/list element in dynamic control flow could cause a compilation failure
  - Improved error message when CUDA context is not initialized
  - Improved docstring of congruent and weakly_congruent
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
  - Collective mainloop codes: Blockscaled datatypes with support for dense GEMM mainloop
  - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
  - Kernel codes: Blockscaled datatypes with support for dense GEMM kernel.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
  - Unit test files with the prefix sm103_ under GEMM device unit tests.
- Support for Blackwell SM121 kernels for DGX Spark GPUs.
  - Shares most of its code with Blackwell SM120 kernels.
- Add support for heuristics-based kernel filtering and autotuning using nvidia-matmul-heuristics to find the best kernels for a given scenario.
  - For details, please refer to the heuristics doc.
- Further enhance Blackwell SM100 Attention kernels in example 77.
  - Add fused reduction kernel support for cutlass MLA.
  - Add softmax skip correction.
  - Support for GQA in FMHA backward kernel.
  - Fix an issue where get_unmasked_trip_count may return a negative value.
  - Fix an issue where mbarriers are initialized with a zero arrival count.
  - Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
  - Remove tma padding for forward kernel inputs.
- Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allows only one problem dimension to vary across groups/experts, unlike general Grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
- Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
  - On Blackwell SM120, a blockwise gemm kernel is added: example 87.
  - On Hopper, add K major scale factor support for SM90 blockwise kernels.
  - On Hopper, relax the restriction that the k dimension of the problem size has to be a multiple of the k dimension of the tile size.
  - On Hopper, grouped version supports the case when k = 0.
- Support for Blackwell SM100 fp4 gemv kernels.
  - Kernel codes: Gemv kernel.
  - Example codes: example 91.
- Support for Blackwell SM100 legacy mixed input GEMM kernels.
  - Collective mainloop codes: Mixed input mainloop.
  - Kernel codes: Mixed input kernel.
  - Example codes: example 86.
- Support for Blackwell SM100 cpasync kernel.
  - Collective mainloop codes: cpasync mainloop.
  - Kernel codes: cpasync kernel.
- Support Blackwell SM120 mixed input blockscaled grouped GEMM. 
- Instantiating more Blackwell kernels in profiler.
  - Blackwell SM100 and SM103 kernels support CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate all possible combinations.
  - To use this feature, CUTLASS_LIBRARY_KERNELS must be non-empty. Profiler will combine CUTLASS_LIBRARY_KERNELS and CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate specific kernels.
  - For details, please check the Profiler Doc.
- Fix some profiler issues:
  - Modify default cluster callback values to non-zero to avoid profiler failure when these values are not set on the command line.
  - Fix some no output and timeout issues.
  - Fix Pingpong Blockwise Hopper library generation.
- From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
  - For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
  - For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
- Rename legacy Python API package from cutlass to cutlass_cppgen and add Blackwell EVT support to legacy Python interface.
  - Restructuring the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface’s EpilogueDescriptors.
  - Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
  - Added some support for running SM100 kernels via the Python interface.
- CuTe changes:
  - Fix inaccurate GridDim calculation under CuTe tutorial.
  - Add movmatrix support.
  - Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
  - Support fp16 accumulator for sm89 fp8 mma.
  - Shorten nullspace implementation.
  - Isolate and comment on cosize risky changes.
  - Important documentation correction: E<0,1> == 1@0@1.
- Fix some kernel issues:
  - Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
  - Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
- Add following unit tests: 
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 13.0U1. 
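The Thor renaming listed for this release amounts to a check on the CUDA toolkit version. A minimal sketch in Python, assuming a hypothetical helper name (thor_sm_arch) and a (major, minor) tuple for the toolkit version; neither is part of CUTLASS:

```python
def thor_sm_arch(cuda_version):
    """Return the SM name used for Thor GPUs under a given CUDA toolkit.

    cuda_version is a (major, minor) tuple, e.g. (12, 9) or (13, 0).
    Hypothetical helper for illustration only; not a CUTLASS API.
    """
    # Tuples compare element by element, so (13, 0) >= (13, 0) is True.
    return "SM110" if tuple(cuda_version) >= (13, 0) else "SM101"


if __name__ == "__main__":
    for version in [(12, 8), (12, 9), (13, 0)]:
        print(version, "->", thor_sm_arch(version))
```

Build scripts that hard-code SM101 would need an equivalent branch once they move to CUDA 13.0 or later.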
4.1.0 (2025-07-16)
CuTe DSL
- Add aarch64 support; you can now pip install nvidia-cutlass-dsl on GB200 systems!
- More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- API updates
  - Please refer to the DSL API changelog for details
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
  - Add variable sequence length support for FMHA Backward kernel.
  - Add varlen test support to Backward runner.
  - Codes support empty batch sequences.
- Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays.
- CuTe changes:
  - Rewrite ArithTuple and ScaledBasis for robustness and clarity.
  - Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
  - Factor out print_latex and friends and rewrite.
  - Factor out print_svg and friends and rewrite.
- Support Blackwell SM100 SIMT packed fp32x2 kernels. 
- Support residual add for implicit gemm kernels. 
- Various fixes for CUTLASS C++ Python interface’s EVT tracer:
  - Add verifier for sm90 to report the invalid input.
  - When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
  - Register operations of tanh, sigmoid, exp, gelu to the python ast frontend.
  - Replace the NotImplemented Error by packing all nodes into a single topological visitor node as a fallback.
- Fix profiler bugs in exhaustive perf search.
  - Fix incorrect cluster shape output issue when doing exhaustive search.
  - Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
- Fix some profiler issues.
  - Complete the reference for Blackwell blockwise gemm kernels.
  - Fix incorrect regex logic for L1 test.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 12.9. 
4.0.0 (2025-06-03)
CuTe DSL
- CuTe DSL, a Python DSL centered around CuTe’s abstractions
- Set of examples demonstrating how to use CuTe DSL to write peak-performance kernels
- API updates
  - Please refer to the DSL API changelog for details
CUTLASS C++
- Support Family Specific Architecture Features, which were introduced in CUDA 12.9
  - 100f, 101f, and 120f were added to support Family Specific Architecture Features, which allow running the same binary on different chips belonging to the same Family (e.g. sm100) without recompiling. Note that 101a has been supported since CUTLASS 3.9
- Instruction shapes and redundant accumulation type have been removed from CUTLASS 3.x-style library kernel names to disambiguate kernels and shorten names.
  - For example:
    - (old) cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma
    - (new) cutlass3x_sm90_tensorop_gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma
  - If you are using the CUTLASS library kernel names directly (e.g. to compile a subset of the CUTLASS library with -DCUTLASS_LIBRARY_KERNELS, or to filter kernels in the CUTLASS profiler with --kernels), please update your uses accordingly. This is a breaking change.
- Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell.
  - Added non-power-of-two tile sizes.
  - Improved performance for K-major scale factors.
  - The argument mma_promotion_interval has been removed from non-grouped GEMM to align with the grouped and Blackwell SM100 versions.
- Enhance Blackwell SM100 Attention kernels in example 77.
  - Support LSE output in FMHA Forward kernel.
  - Enhance performance measurement: support of different warmup iterations; buffer rotation to keep L2 cold; separate testing of persistent and non-persistent.
  - Enhance testing of variable sequence length.
  - Disable B2B mode in MLA to simplify the sample.
  - Clarify that fmha_gen sample only supports head dim 128.
  - Fixes for split-kv output in MLA.
- Improve Blackwell and Hopper grouped GEMM performance, functionality, and profiler support.
  - Enable runtime datatype for Blackwell SM100 grouped GEMM. Profiler support is also added.
  - Enable kernel parameter exploration for Blackwell SM100 grouped GEMM - raster_order, swizzle.
- Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests. 
- Add dynamic and preferred cluster support for convolution Blackwell SM100 kernels. 
- Fix profiler issues which cause no output or not supported error for some kernels. 
- Optimizations for Blackwell SM100 and SM120 block scaled kernels. 
- Support for Blackwell SM120 blockwise dense gemm in CUTLASS library and profiler. 
- New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA. 
- CuTe changes:
  - Rework cute::copy_if so that the predicate tensor is also a true CuTe Tensor rather than a lambda, and introduce transform-tensors to avoid any extra register or load/store overhead in using bool-tensors.
  - New CuTe tutorial to show the usage of copy_if in tile copy.
  - Add CuTe C++ reduce op.
    - Add several unit tests for CuTe tensor algorithms.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 12.9. 
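For scripts that still hold old-style kernel names, the renaming described in this release can be approximated with a regular expression. This is a sketch only: the helper name to_new_kernel_name is hypothetical, and the pattern is inferred from the single old/new pair shown in this changelog (an accumulator letter plus an MxNxK instruction shape directly in front of "gemm_"), so other kernel families may need different handling.

```python
import re

# Assumed pattern: accumulator-type letter(s) followed by an MxNxK instruction
# shape, sitting directly in front of "gemm_" (as in the changelog's example).
_OLD_PREFIX = re.compile(r"(?<=_)[a-z]+\d+x\d+x\d+(?=gemm_)")


def to_new_kernel_name(name):
    """Drop the instruction shape and accumulation type from an old-style name.

    Hypothetical helper for illustration; names already in the new style
    are returned unchanged.
    """
    return _OLD_PREFIX.sub("", name)


old = ("cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_"
       "128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma")
new = to_new_kernel_name(old)
print(new)
```

Applying the helper to a name that is already in the new style leaves it untouched, so it can be run over a mixed list of filters.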
CUTLASS 3.x
3.9.2 (2025-05-03)
3.9.1 (2025-04-30)
3.9.0 (2025-04-24)
- Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API: 
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
  - Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor.
  - Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation.
  - Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor.
  - Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor.
  - Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor.
- Set of unit tests that demonstrate the usage of both sparse and dense Blackwell SM120 blockscaled GEMM.
- Support for Blackwell SM100 Sparse kernels:
  - Collective mainloop that target for
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
- Set of unit tests that demonstrate the usage of sparse and blockscaled sparse Blackwell SM100 GEMM. 
- A new Multi-head Latent Attention (MLA) for SM100 Blackwell architecture in CUTLASS example covers the flashMLA-like weight-absorbed decoding use-case. 
- A new FMHA Backward kernel for SM100 Blackwell architecture extends CUTLASS example to show how the five backward pass MMAs can be fused into a single kernel to achieve high performance. 
- A new distributed GEMM example for SM100 Blackwell architecture. 
- Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
  - Enhancement of blockwise GEMM for Hopper architecture.
  - Enhancement of groupwise GEMM for Hopper architecture.
  - Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
  - Support for grouped-wise GEMM in CUTLASS profiler.
  - Support for blockwise GEMM for Blackwell architecture.
  - Support for groupwise GEMM for Blackwell architecture.
  - Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
- Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
  - Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
  - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
  - Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
  - More detailed introductions and examples to leverage this feature can be found in profiler.md.
- Support void as the D element in sm100 kernel epilogues.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 12.8U1. 
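The GFLOPs/second ranking mentioned for the profiler's performance search can be illustrated with the conventional GEMM operation count of 2*M*N*K. The sketch below is not the profiler's implementation; the function name, record layout, and timings are made up for illustration.

```python
def gemm_gflops(m, n, k, runtime_ms):
    """GFLOP/s of an M x N x K GEMM, counting 2*M*N*K floating-point operations."""
    flops = 2.0 * m * n * k
    return flops / (runtime_ms * 1e-3) / 1e9


# Made-up timings for three hypothetical kernel configurations.
results = [
    {"kernel": "config_a", "m": 4096, "n": 4096, "k": 4096, "runtime_ms": 2.0},
    {"kernel": "config_b", "m": 4096, "n": 4096, "k": 4096, "runtime_ms": 1.0},
    {"kernel": "config_c", "m": 4096, "n": 4096, "k": 4096, "runtime_ms": 4.0},
]

for r in results:
    r["gflops"] = gemm_gflops(r["m"], r["n"], r["k"], r["runtime_ms"])

# Highest throughput first, as in a report sorted by GFLOPs/second.
ranked = sorted(results, key=lambda r: r["gflops"], reverse=True)
for r in ranked:
    print(f'{r["kernel"]}: {r["gflops"]:.1f} GFLOP/s')
```

For a fixed GEMM shape this ordering is the same as sorting by runtime; the GFLOP/s figure matters when comparing across problem sizes.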
3.8.0 (2025-01-25)
- Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
  - 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
  - Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
  - Exposure of Blackwell’s new tensor memory (note: distinct from TMA) as tmem across CuTe as a first class data locale.
  - Exposure of tmem->rmem, rmem->tmem and smem->tmem data movement instructions as copy atoms in CuTe.
  - make_tmem_copy() utility method to ease creation of tiled copies for tmem copy atoms.
  - Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
- Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
  - Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
  - Pipelines that implement Blackwell specific synchronization.
  - Cluster launch control API supporting preferred and fallback cluster shapes.
  - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
  - Tile schedulers using Blackwell’s Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, and stream-K.
  - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
- Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
  - Blackwell specific kernel layers that
    - Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
    - Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
    - Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
  - Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
    - Non-block scaled data types without support for pointer array and grouped GEMM with TMA
    - Non-block scaled data types with support for pointer array and grouped GEMM with TMA
    - Block scaled data types without support for pointer array and grouped GEMM with TMA
    - Block scaled data types with support for pointer array and grouped GEMM with TMA
  - Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
  - New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
  - Blackwell epilogue that supports loading accumulators from tmem and full set of EVT fusions.
- CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
  - Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
  - Support for dynamic datatypes via profiler command line argument parsing to set the dynamic datatype in TCGen05 MMA instruction descriptors.
  - Support for mixed input GEMM kernels on Hopper in the profiler.
- New CUTLASS profiler flag use-cuda-graphs to reduce overheads when benchmarking launch-bound kernels.
- A new 3.x version of grouped GEMM is added to the CUTLASS library, with kernels generated for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (./cutlass_profiler --operation=GroupedGemm --help for details).
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
  - Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
  - GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
  - Block scaled data type GEMMs targeting Blackwell’s native block scaled Tensor Cores:
  - GEMM example demonstrating Blackwell’s new preferred cluster support via dynamic cluster shapes for increased occupancy.
  - Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
  - Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
  - A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
- Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
  - A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
- Documentation updates:
  - Detailed Blackwell block-scaled GEMM functionality documentation
  - New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x supported architectures.
  - Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
  - Updates to profiler documentation for testing mixed input GEMM kernels on Hopper.
3.7.0 (2025-01-11)
- Hopper blockwise scaling FP8 GEMM uses a 2D scaling tensor, assigning one value per threadblock. This allows a finer-grained scaling to be applied for each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are here.
- Distributed GEMM is a new (experimental) API which can turn existing CUTLASS GEMM kernels into pipelined Tensor Parallel GEMMs that run efficiently on an NVLink-based network of GPUs. Its pipelining schedules can hide most of the communication behind computation, and rely on point-to-point communication, which can simply use CUDA runtime’s peer device access feature. It also utilizes remote TMA loads and memcopies with CUDA graphs to handle communication primarily through the Copy Engine, leaving all SMs free for Hopper’s persistent kernels. For more details you can refer to the DistGEMM blog post.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new make_kernel_hardware_info API as shown in example 48.
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
- Potential API breaking changes:
  - Fix cute::UniversalCopy for type safety.
  - No longer implicitly select cute::SM80_CP_ASYNC_* based on input tensors. This avoids implicit downstream synchronization requirements. To use SM80_CP_ASYNC, users must explicitly select the appropriate CopyAtom.
  - Fix cute::SM80_CP_ASYNC_CACHEALWAYS, cute::SM80_CP_ASYNC_CACHEGLOBAL, cute::SM80_CP_ASYNC_CACHEALWAYS_ZFILL, cute::SM80_CP_ASYNC_CACHEGLOBAL_ZFILL to avoid implicitly selecting ZFILL behavior on predication.
  - Remove cute::copy_vec<T> in favor of cute::copy_aligned and cute::copy(AutoVectorizingCopyWithAssumedAlignment<NumBits>,...).
  - A refactor of default epilogue struct DefaultEpilogue API to avoid reading non-void ElementC value for ElementC = void kernel.
- New CUTLASS profiler flags: profiling-duration, min-iterations, and kernels-file, documented in profiler.md.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 12.6. 
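The blockwise scaling described for the Hopper FP8 GEMM in this release (one scale value per block, applied per output tile on every gemm-k iteration inside the mainloop) can be sketched in plain Python. The scale layouts below (scale_a indexed by M-block and K-block, scale_b by K-block and N-block), the function name, and the use of Python floats in place of FP8 are assumptions for illustration; this is not the CUTLASS kernel.

```python
def blockwise_scaled_gemm(a_q, b_q, scale_a, scale_b, tile_m, tile_n, tile_k):
    """Compute C from quantized A (M x K) and B (K x N) with per-block scales.

    Assumed layouts: scale_a[m // tile_m][k // tile_k] and
    scale_b[k // tile_k][n // tile_n].
    """
    m_size, k_size, n_size = len(a_q), len(b_q), len(b_q[0])
    c = [[0.0] * n_size for _ in range(m_size)]
    for m0 in range(0, m_size, tile_m):            # output tile rows
        for n0 in range(0, n_size, tile_n):        # output tile columns
            for k0 in range(0, k_size, tile_k):    # "mainloop" over K blocks
                # One scalar per operand block for this K iteration.
                s = (scale_a[m0 // tile_m][k0 // tile_k]
                     * scale_b[k0 // tile_k][n0 // tile_n])
                for i in range(m0, min(m0 + tile_m, m_size)):
                    for j in range(n0, min(n0 + tile_n, n_size)):
                        partial = sum(a_q[i][k] * b_q[k][j]
                                      for k in range(k0, min(k0 + tile_k, k_size)))
                        # Scale the partial sum, then add it to the accumulator.
                        c[i][j] += s * partial
    return c


a_q = [[1, 2, 3, 4], [5, 6, 7, 8]]            # 2 x 4
b_q = [[1, 0], [0, 1], [2, 0], [0, 2]]        # 4 x 2
scale_a = [[0.5, 2.0]]                        # one M block, two K blocks
scale_b = [[4.0], [0.25]]                     # two K blocks, one N block
c = blockwise_scaled_gemm(a_q, b_q, scale_a, scale_b, tile_m=2, tile_n=2, tile_k=2)
print(c)
```

Because the scale changes with the K block, it has to be applied to each partial sum before accumulation rather than once in the epilogue, which is why the scaling sits inside the K loop.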
3.6.0 (2024-10-03)
- A refactor to the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. The 3.x convolution API is no longer considered a beta API.
- An improved mixed input GEMM and a lookup table implementation for INT4 x FP8 scale-only mode.
- EVT nodes for Top-K selection and softmax and GEMM example using those.
- Programmatic Dependent Launch (PDL) that leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
- A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details. 
- A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support. 
- A SIMT-enabled pointer-array epilogue. 
- A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations. 
- A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler. 
- New hardware support for comparisons and computations of cutlass::bfloat16_t
- Fixed use of isnan on Windows for half_t.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! 
- Optimal code generation with CUDA toolkit versions 12.6. 
3.5.1 (2024-07-25)
- Exposure of raster order and tile swizzle extent in CUTLASS library profiler, and example 48. 
- TMA store based and EVT supported epilogues for Hopper pointer array batched kernels. 
- A new GemmSparseUniversal API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inference:
- CUDA host adapter extensions to support TMA descriptor construction driver APIs. 
- Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler. 
- Support for residual add (beta != 0) in convolution kernels. 
- A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output. 
- A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them. 
- A guide for setting up VSCode to work well with CUTLASS and expanded code style guide. 
- Better support for MSVC as a host compiler. 
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2. 
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1. 
3.5.0 (2024-04-09)
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col
  - Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
  - Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
  - CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
  - NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback is welcome on the design!
- Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer. 
- Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x
  - Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
  - Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices. 
- Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series.
- Extensions to CuTe to support L2 prefetching and TMA store+reductions. 
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17. 
- Fixes to greatly reduce build warnings. 
- Updates and bugfixes from the community (thanks!) 
3.4.1 (2024-02-14)
- Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users’ side. 
- Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs. 
- Updates and bugfixes from the community (thanks!). 
3.4.0 (2024-01-12)
- Expanded Mixed-input Hopper GEMMs support covering {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors. 
- Performance improvements to Mixed-input Hopper GEMMs 
- Beta release of Pointer-Array Batched GEMMs now available on Hopper GPUs utilizing TMA and WGMMA (requires CUDA 12.3 or above). 
- Beta release of Group-GEMM utilizing TMA and WGMMA (requires CUDA 12.3 or above). 
- Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now. 
- NamedBarriers usability improvement and list of ReservedNamedBarriers has been officially released. 
- Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved. 
3.3 (2023-10-31)
- Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input operand types. 
- Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8}, and upcast on operandA {s8, u8} x {fp16, bf16}. 
- Copy Async based Hopper GEMMs, which support input tensors with less than 16B alignment.
- Kernel schedules and Builder support for mixed precision and Copy Async GEMMs with < 16B aligned input tensors. 
- Profiler support for lower-aligned Hopper GEMMs. 
- Performance Improvements to Scatter-Gather Hopper Example. 
- Sub-Byte type fixes and improvements. 
- EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details. 
- Fusion support for backprop fusions including drelu, dgelu, and dbias. 
- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface 
3.2.2 (2023-10-25)
- Minor patch for issue/1138 
3.2.1 (2023-09-22)
- Python support SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0. 
- SM80 EVT support in C++ and Python. 
- Other SM90 epilogue improvements. 
- Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See 1105 for details. 
- Making tools/library/scripts packageable: tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
- SM90 TF32 kernel improvements for all layouts. 
- SM90 rasterization direction support in the CUTLASS profiler. 
- Improvement for CUTLASS profiler build times. 
- Remove Python-C++ bindings. 
3.2.0 (2023-08-03)
- New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs. FP8 GEMMs come with a fast accumulation mode. When enabled, problem execution might be faster but at the cost of lower accuracy because intermediate results will not periodically be promoted to a higher precision.
- New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow for user-defined customized epilogue fusion patterns without having to write a new epilogue.
- Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release. 
- Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler. 
- Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA. 
- Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue. 
- New CUTLASS 2D Convolution Python interface. New example here. 
- Support for Windows (MSVC) builds. Tested with Visual Studio 2019 v16.11.27 on Windows 10.0. 
- Optimal performance using CUDA 12.2u1 
- Updates and bugfixes from the community (thanks!) 
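The accuracy trade-off of the FP8 fast accumulation mode noted for this release (intermediate results are not periodically promoted to a higher precision) can be illustrated with a toy accumulator. In the sketch below, IEEE half precision stands in for a limited-precision accumulator and a Python float for the higher-precision one; the function names, the interval of 32, and the choice of formats are assumptions for illustration and do not reflect the actual hardware accumulator or CUTLASS promotion interval.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate_fast(values):
    """Keep the running sum in the low-precision accumulator throughout."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def accumulate_with_promotion(values, interval):
    """Every `interval` elements, add the low-precision partial sum into a
    higher-precision accumulator and restart the partial sum."""
    total = 0.0      # higher-precision accumulator (Python float)
    partial = 0.0    # low-precision partial sum
    for i, v in enumerate(values, start=1):
        partial = to_fp16(partial + v)
        if i % interval == 0:
            total += partial
            partial = 0.0
    return total + partial


values = [1.0] * 4096
print(accumulate_fast(values))                # stalls once increments round away
print(accumulate_with_promotion(values, 32))  # keeps every increment
```

The low-precision sum stops growing once an increment is smaller than half the spacing between representable values, while periodic promotion keeps each partial sum small enough to stay exact.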
3.1.0 (2023-04-14)
- New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and new examples. 
- New efficient epilogues using TMA for Hopper. 
- Support for fused epilogues, such as Bias, ReLU and GELU, using the new efficient epilogues.
- New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA. 
- New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper. 
- An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper. 
- Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization. 
- Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler. 
- Performance optimizations for the warp-specialized persistent ping-pong kernel. 
- Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
- FMHA Backward Pass from Meta xFormers. 
- Streamk GEMM with Broadcast enables epilogue broadcast with StreamK GEMM. 
- Batched B2B GEMM now can run multiple Back-to-Back GEMM with the same problem size in parallel. 
- Batched Strided GEMV supports both row major and column major input matrices.
- Permute + GEMM fusion can now fuse Permute with the following GEMM. Previously, only fusing GEMM with Permute in the epilogue was supported.
- Row Broadcast can be fused in the epilogue. 
- The GitHub branch is renamed from master to main in this release.
- Optimal performance using CUDA 12.1 
- Updates and bugfixes from the community (thanks!) 
3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors. 
- A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0’s GEMM API changes. 
- Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same device::GemmUniversalAdapter and kernel::GemmUniversal types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
- Updates to Functionality which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3. 
- Updates to Compatibility Section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures and Target Architecture. 
- New warp-specialized GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. 
- Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations. 
- CUTLASS library integration for 3.x API kernels built through the new CollectiveBuilder API, enabling CUTLASS profiler.
- Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features. 
- Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50. 
CUTLASS 2.x#
2.11.0 (2022-11-19)#
- Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one. 
- Fused multi-head attention Kernel. It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel. 
- Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency. 
- Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8. 
- BLAS3 functions with Hopper's new double precision matrix multiplication instructions. 
- ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary. 
- Optimized Group Conv for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N. 
- Optimized DepthWise Conv. Two new modes are added: - kOptimized, which uses direct convolution instead of implicit GEMM. - The restrictions are: 1) the input channel, output channel, and group numbers should each be a multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration. 
 
- kFixedStrideDilation, which puts stride and dilation into templates to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended. - The restrictions are: 1) the input channel, output channel, and group numbers should each be a multiple of (128 / sizeof(input element)). 2) The input filter size, stride, and dilation should be the same as the template parameter configuration. 
 
 
- Scripts to fuse multiple back-to-back GEMM. Its implementation was discussed in a GTC’22 Spring talk. 
- Updates and bugfixes from the community (thanks!). Big shout out to Meta’s xFormers. 
- Deprecation announcement: CUTLASS plans to deprecate the following: - Maxwell and Pascal GPU architectures 
- Ubuntu 16.04 
- CUDA 10.2 
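
The Stream-K entry in 2.11.0 describes a generalization of split-K. As a rough illustration only (not CUTLASS's actual scheduler), the basic idea can be sketched as dividing the total number of mainloop iterations evenly across workers, so that a worker's share may start and end in the middle of an output tile. The function name `stream_k_partition` and its parameters are hypothetical.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_workers):
    """Assign each worker a contiguous range of mainloop iterations.

    Returns one list per worker of (tile, k_begin, k_end) segments.
    Hypothetical helper for illustration; not a CUTLASS API.
    """
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_workers)
    assignments = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional iteration.
        end = start + base + (1 if worker < extra else 0)
        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile
            k_begin = it % iters_per_tile
            # Stop at the tile boundary or at the end of this worker's share.
            k_end = min(iters_per_tile, k_begin + (end - it))
            segments.append((tile, k_begin, k_end))
            it += k_end - k_begin
        assignments.append(segments)
        start = end
    return assignments


if __name__ == "__main__":
    for worker, segs in enumerate(stream_k_partition(3, 4, 4)):
        print(worker, segs)
```

With 3 tiles of 4 iterations each and 4 workers, every worker receives 3 iterations even though the tile count does not divide evenly by the worker count. Tiles that are split across workers need their partial accumulators combined afterwards, as in split-K.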
 
2.10.0 (2022-08-23)#
- CUTLASS Python now supports GEMM, CONV, Group GEMM for different data types as well as different epilogue flavours. 
- Optimizations for CUTLASS’s Grouped GEMM kernel. Threadblock scheduling part is improved. Some computation can be moved to the host side if applicable. Grouped Syr2k kernels are added, too. 
- Optimizations for GEMM+Softmax. All the reduction computation is fused into the previous GEMM. More template arguments are provided to fine tune the performance. 
- Grouped GEMM for Multihead Attention. This general grouped-GEMM-based MHA does not require the sequence length of all GEMMs to be the same, which makes it most useful for natural language processing. 
- GEMM + Layer norm fusion for Ampere splits the layernorm into two parts, which can be fused separately into the GEMMs before and after it. In addition to using the square sum to compute the variance of layernorm, Shift-K is provided in case the square sum raises numerical issues. 
- GEMM Epilogue Permutation Fusion can apply a user-provided permutation layout mapping in the GEMM epilogue. 
- Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized one. The restrictions are: 1) input and output channel numbers should be multiples of the group number. 2) split-K is not supported. The implementation has 2 modes: - kSingleGroup: output channel per group is a multiple of Threadblock tile N. 
- kMultipleGroup: Threadblock tile N is a multiple of output channel per group. 
 
- Depthwise separable convolution introduces the first depthwise convolution, which is also Analytical for now. The restrictions are: 1) SIMT only 2) No split-K 3) the input channel count, output channel count, and group number are all equal. 
- Back-to-back GEMM/CONV relaxes the requirement that the first GEMM K dimension needs to be a multiple of the Threadblock Tile K dimension. 
- Optimal performance using CUDA 11.6u2 
- Updates and bugfixes from the community (thanks!) 
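
The GEMM + Layer norm entry in 2.10.0 mentions Shift-K as an alternative to the square-sum variance. Assuming Shift-K refers to the standard shifted-data formula (subtract a constant K from every element before accumulating), the difference can be sketched as follows. The function names are hypothetical, and this CPU sketch in double precision only illustrates the arithmetic, not the fused kernel.

```python
def variance_square_sum(xs):
    """Population variance as E[x^2] - E[x]^2 (the square-sum form)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean


def variance_shift_k(xs, k=None):
    """Population variance of shifted data: E[(x-K)^2] - E[x-K]^2.

    Mathematically equal to the square-sum form for any K, but the
    accumulated values stay small when K is close to the mean.
    Defaulting K to the first element is an assumption of this sketch.
    """
    if k is None:
        k = xs[0]
    n = len(xs)
    shifted = [x - k for x in xs]
    mean = sum(shifted) / n
    mean_sq = sum(d * d for d in shifted) / n
    return mean_sq - mean * mean


if __name__ == "__main__":
    small = [4.0, 7.0, 13.0, 16.0]
    large = [1e9 + v for v in small]
    print(variance_square_sum(small), variance_shift_k(small))
    print(variance_square_sum(large), variance_shift_k(large))
```

For data with a large common offset, the square-sum form subtracts two nearly equal large numbers and can lose most of its significant digits, while the shifted form works on small differences.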
2.9.0 (2022-04-21)#
- First layer Convolution kernels specialized for small channel counts and reduced alignment - Few channels specialization for reduced alignment capabilities 
- Fixed channels further specialized when channel count perfectly matches the access vector size 
- Python-based instance emitter in the CUTLASS Library and support in the Profiler 
 
- BLAS3 operators accelerated by Tensor Cores 
- CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python - Python-based runtime interoperable with existing emitters 
 
- Gather and Scatter Fusion with GEMM can gather inputs and scatter outputs based on index vectors in the same GEMM kernel. - It can select random rows in a row major matrix. 
- It can select random columns in a column major matrix. 
 
- Back-to-back GEMM/CONV fully supports buffering the first GEMM/CONV results in the shared memory for the latter one to use. It can eliminate register spill when the tile size is big. Additionally, bias vector add is supported in the first GEMM/CONV. - Supported kernels: GEMM and CONV. 
- Supported types: fp16 and int8. 
- Supported architectures: Turing and Ampere. 
 
- Transposed Convolution (a.k.a Deconvolution) support which reuses Dgrad implementation. 
- Utility functions that can pad NHWC and convert between NCHW and NHWC. 
- Small alignment implicit gemm support for Fprop/Dgrad/Wgrad so that padding is no longer mandated to use tensor cores in these kernels. 
- Epilogue enhancement: - Eliminate bank conflicts in int8 tensor core kernels. 
- Half2 usage if epilogue compute type is fp16. 
- More activation functions: Silu, Hardswish, Leaky Relu. 
- New elementwise fusion pattern for residual block. 
 
- Group GEMM thread block number calculation fix which helps to launch the intended number of threadblocks to fully occupy the GPUs. 
- Parallel GEMM splitk support in the CUTLASS profiler. 
- Optimal performance using CUDA 11.6u2 
- Updates and bugfixes from the community (thanks!) 
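
The Gather and Scatter Fusion entry in 2.9.0 can be read as the following reference semantics: rows of A are selected through one index vector, multiplied by B, and the result rows are written to positions given by a second index vector. This is a plain-Python sketch of that reading for the row-major case; the function name and argument layout are assumptions, and the actual CUTLASS kernel interface differs.

```python
def gather_gemm_scatter(a, b, gather_rows, scatter_rows, num_out_rows):
    """Compute D[scatter_rows[i], :] = A[gather_rows[i], :] @ B.

    Rows of D that are not targeted by scatter_rows stay zero.
    Hypothetical reference helper; not a CUTLASS API.
    """
    k = len(b)
    n = len(b[0])
    d = [[0] * n for _ in range(num_out_rows)]
    for src, dst in zip(gather_rows, scatter_rows):
        for j in range(n):
            d[dst][j] = sum(a[src][p] * b[p][j] for p in range(k))
    return d


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[1, 0], [0, 1]]  # identity, so selected rows pass through
    print(gather_gemm_scatter(a, b, [2, 0], [1, 3], 4))
```

Fusing the indexing into the GEMM avoids materializing the gathered copy of A and the unscattered copy of D in memory.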
2.8.0 (2021-11-19)#
- TF32x3: emulated single-precision using Tensor Cores - 45+ TFLOPs on NVIDIA A100 
- GEMM SDK example (real) 
- COMPLEX GEMM SDK example (complex) 
 
- Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu 
- Grouped GEMM: similar to batched GEMM with distinct problem size per group - SDK example with performance comparison with Batched Strided GEMM 
 
- Implicit GEMM Convolution fusion supports staging the first convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure. 
- Optimal performance using CUDA 11.5 
- Updates from the community (thanks!) 
- Deprecation announcement: CUTLASS plans to deprecate the following: - Maxwell and Pascal GPU architectures 
- Ubuntu 16.04 
- CUDA 10.2 
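
The TF32x3 entry in 2.8.0 refers to emulating single-precision multiplication with three TF32 products. A minimal numerical sketch of the idea: each FP32 operand is split into a "big" TF32 part and a "small" TF32 residual, and the product is approximated by three of the four cross terms. The helper names are hypothetical, TF32 conversion is modeled here as simple mantissa truncation, and the products are evaluated in Python floats rather than on Tensor Cores.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest IEEE FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    """Truncate to TF32 precision by clearing the low 13 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_tf32(x):
    """Split an FP32 value into big + small TF32 parts."""
    x = to_f32(x)
    big = to_tf32(x)
    small = to_tf32(x - big)
    return big, small


def mul_1xtf32(a, b):
    """Single TF32 product: both operands lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)


def mul_3xtf32(a, b):
    """Three TF32 products; the small*small term is dropped."""
    a_big, a_small = split_tf32(a)
    b_big, b_small = split_tf32(b)
    return a_big * b_big + (a_big * b_small + a_small * b_big)


if __name__ == "__main__":
    a, b = 1.2345678, 7.6543210
    exact = to_f32(a) * to_f32(b)
    print(abs(mul_1xtf32(a, b) - exact), abs(mul_3xtf32(a, b) - exact))
```

The dropped small-times-small term is far below the error of a single TF32 product, which is why three products recover most of the FP32 accuracy.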
 
2.7.0 (2021-09-24)#
- Mainloop fusion for GEMM: summation over A or B 
- Half-precision GELU_taylor activation functions - Use these when accumulation and epilogue compute types are all cutlass::half_t
 
- Tuning and bug fixes to fused GEMM + GEMM example 
- Support for smaller than 128b aligned Convolutions: see examples 
- Caching of results to accelerate Convolution unit tests - Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
 
- Corrections and bug fixes reported by the CUTLASS community - Thank you for filing these issues! 
 
2.6.1 (2021-09-03)#
- Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators) 
- Tuning for GEMMs fused with partial reductions 
- Corrections and bug fixes reported by the CUTLASS community - Thank you for filing these issues! 
 
2.6.0 (2021-07-22)#
- Optimal performance when compiled with the CUDA 11.4 Toolkit - Adopt the new L2 prefetch feature in cp.async and global load 
 
- Fused operators with GEMM and Convolution 
- 64b tensor strides and leading dimensions support for GEMMs 
- Affine rank=2 matrix layouts - Row stride and column stride for matrices using cutlass::layout::AffineRank2 
- Support FP64 tensor core and SIMT GEMM. 
 
- Batched GEMV preview implementation 
- New strided Dgrad implementation - Accelerates over previous implementation by cutting down redundant math by 4x 
- Support using new Dy and w analytic iterators and existing cutlass::conv::device::ImplicitGemmConvolution interface
 
- Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores) - Updates to quaternion.h and functional.h 
- SDK Example for GEMM and Convolution 
 
- Many improvements to the epilogue. - Provide an option to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations 
- Performance improvement for FP16 tensor core kernels 
- Bug fixes 
 
- Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels from Pascal and Ampere. 
- Updated minimum CUDA Toolkit requirement to 10.2 - CUDA 11.4 Toolkit recommended 
 
- Corrections and bug fixes reported by the CUTLASS community - Thank you for filing these issues! 
 
2.5.0 (2021-02-26)#
- Tensor reductions - m-to-n reductions of tensors with affine layout 
- Specializations for reductions including contiguous dimension 
- Specializations for reductions excluding contiguous dimension 
- Custom reduction functors such as cutlass::logical_and
- Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31) 
 
- Optimizations for 3-D convolution - Optimized tile iterators using precomputed delta table for 3-D convolution 
- Full coverage of forward and backwards passes for 3D convolution 
 
- Corrections and bug fixes reported by the CUTLASS community - Thank you for filing these issues! 
 
2.4.0 (2020-11-19)#
- Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs - Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution 
- Data type: FP32, complex<FP32>, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32 
- Spatial dimensions: 1-D, 2-D, and 3-D 
- Layout: NHWC, NCxHWx 
 
- Implicit GEMM convolution components: - Global memory iterators supporting Fprop, Dgrad, and Wgrad 
- MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
- MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
- Documentation describing Implicit GEMM Convolution algorithm and implementation 
 
2.3.0 (2020-09-23)#
- NVIDIA Ampere Architecture features - Sparse Tensor Core GEMM kernels: - Direct access to Sparse Tensor Cores and maximum performance via mma.sp.sync
 
- Fast SGEMM targeting GeForce RTX 30-series CUDA Cores 
 
- Minor Features: - Activation functions such as GeLU and Sigmoid 
- Small matrix and quaternion template classes in device code 
 
- NVIDIA Ampere GPU Architecture examples and documentation: - Tensor Float 32 and 
- Documentation added on CUTLASS efficient row-major epilogue 
 
2.2.0 (2020-06-08)#
- NVIDIA Ampere Architecture features - Fast Tensor Core operations: 
- Maximum performance via mma.sync
- Tensor Float 32, BFloat16, and double-precision data types 
- Mixed integer data types (int8, int4, bin1) 
- Asynchronous copy for deep software pipelines via cp.async
- Described in GTC 2020 Webinar (SR 21745) (free registration required) 
 
- Features: - SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM 
- Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32 
- Gaussian complex GEMMs using 3m complex multiply algorithm 
- Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions 
 
- Policy updates: - CUDA 11 Toolkit needed to enable NVIDIA Ampere Architecture features 
- Disabled F16C by default for compatibility - enable on cmake command line with -DCUTLASS_ENABLE_F16C=ON
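
The Gaussian complex GEMM entry in 2.2.0 refers to the 3m complex multiply algorithm, which replaces the four real products of a complex multiplication with three. A small sketch at the matrix level, with real and imaginary parts stored as separate matrices, is given below. The helper names are hypothetical, and how CUTLASS arranges the three products internally may differ.

```python
def matmul(x, y):
    """Plain real matrix product of two lists of rows."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add(x, y):
    return [[u + v for u, v in zip(r, s)] for r, s in zip(x, y)]


def sub(x, y):
    return [[u - v for u, v in zip(r, s)] for r, s in zip(x, y)]


def complex_gemm_3m(ar, ai, br, bi):
    """(Ar + i*Ai) @ (Br + i*Bi) using three real matrix products.

    Returns (real_part, imag_part).
    """
    t1 = matmul(add(ar, ai), br)  # (Ar + Ai) Br
    t2 = matmul(ar, sub(bi, br))  # Ar (Bi - Br)
    t3 = matmul(ai, add(br, bi))  # Ai (Br + Bi)
    # t1 - t3 = Ar Br - Ai Bi ; t1 + t2 = Ar Bi + Ai Br
    return sub(t1, t3), add(t1, t2)


if __name__ == "__main__":
    # (1 + 2i) * (3 + 4i) = -5 + 10i
    print(complex_gemm_3m([[1]], [[2]], [[3]], [[4]]))
```

The trade-off is one fewer matrix product in exchange for extra matrix additions, and the rounding behavior differs from the conventional four-product form.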
 
2.1.0 (2020-04-06)#
- BLAS-style host-side API added to CUTLASS Library - API to launch compiled kernel instances for GEMM and planar complex GEMM 
 
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores - Computes complex matrix products on matrices stored as disjoint real and imaginary parts 
 
- Minor enhancements and bug fixes 
2.0.0 (2019-11-19)#
- Substantially refactored for - Better performance, particularly for native Turing Tensor Cores 
- Robust and durable templates spanning the design space 
- Encapsulated functionality embodying modern C++11 programming techniques 
- Optimized containers and data types for efficient, generic, portable device code 
 
- Updates to: 
- Native Turing Tensor Cores - Efficient GEMM kernels targeting Turing Tensor Cores 
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands 
 
- Coverage of existing CUTLASS functionality - GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs 
- Volta Tensor Cores through native mma.sync and through WMMA API 
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions 
- Batched GEMM operations 
- Complex-valued GEMMs 
 
- Note: a host compiler supporting C++11 or greater is required. 
CUTLASS 1.x#
1.3.2 (2019-07-09)#
- Performance improvement for Volta Tensor Cores TN and TT layouts. 
1.3.1 (2019-04-09)#
- Corrected NVRTC unit tests. 
1.3.0 (2019-03-20)#
- Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.
1.2.0 (2018-10-26)#
- Parallelized reductions across threadblocks (“Split-K”) - Improved IGEMM performance 
 
- Batched strided WMMA GEMMs 
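
The Split-K entry in 1.2.0 refers to partitioning the K dimension of a GEMM so that several threadblocks each compute a partial product, followed by a reduction. A serial Python sketch of that decomposition follows; the function names are hypothetical, and the partial products that run in parallel on a GPU are simply computed one after another here.

```python
def gemm_k_slice(a, b, k_begin, k_end):
    """Partial product of A and B restricted to K indices [k_begin, k_end)."""
    return [[sum(a[i][p] * b[p][j] for p in range(k_begin, k_end))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_k_gemm(a, b, slices):
    """Split K into `slices` chunks, compute partials, then reduce them."""
    k = len(b)
    step = -(-k // slices)  # ceiling division
    partials = [gemm_k_slice(a, b, s, min(k, s + step))
                for s in range(0, k, step)]
    m, n = len(a), len(b[0])
    # Reduction step: element-wise sum of all partial results.
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(split_k_gemm(a, b, slices=2))
```

This decomposition helps most when M and N are small relative to K, since the output tiles alone would not provide enough parallel work.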
1.1.0 (2018-09-19)#
- Turing Features - WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit 
 
- Batched Strided GEMM 
- Threadblock rasterization strategies - Improved performance for adverse problem sizes and data layouts 
 
- Extended CUTLASS Core components - Tensor views support arbitrary matrix and tensor layouts 
- Zip iterators for structuring multiple data streams 
 
- Enhanced CUTLASS utilities - Reference code for tensor operations in host and device code 
- Added HostMatrix<> for simplified matrix creation 
 
- Examples - Basic GEMM, tensor views, CUTLASS utilities, batched GEMM, WMMA GEMM 
 
1.0.1 (2018-06-11)#
- Intra-threadblock reduction added for small threadblock tile sizes - sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16 
- igemm_32x32x128 
 
- GEMM K residue handled during prologue prior to mainloop 
- Replaced Google Test copy with submodule. Use git submodule update --init --recursive
1.0.0 (2018-05-16)#
- Substantial rewrite to accommodate new architecture 
- Kernels: SGEMM, DGEMM, IGEMM, HGEMM, WMMA GEMM 
- Unit and performance tests 
0.0.1 (2017-12-04)#
- Initial release 
Copyright#
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: BSD-3-Clause
  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:
  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.
  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.