
NVIDIA CUTLASS Documentation



CUTLASS 3.x

  • Design
    • CUTLASS 3.0 design goals
    • A new Conceptual GEMM Hierarchy
    • Adoption of CuTe Layout and Tensors
    • Reducing the number of named types and iterator concepts
    • Correctness by default, Performance through clear, individual points of tuning
  • GEMM Backwards Compatibility
    • Compatible Device API
    • Compatible Kernel API
    • Threadblock API and Inner Loops
    • Porting from 2.x to 3.0 API
  • GEMM API
    • CUTLASS GEMM Model
    • CUTLASS GEMM Components
    • Kernel API
    • Device API
    • Tiled MMA and Copy
    • Atom API



Copyright © 2025, NVIDIA Corporation.

Last updated on May 14, 2025.