Tile Compilation in CUDA

Beginning in CUDA 13.3, CUDA C++ supports both Tile and SIMT code in the same translation unit 1. Tile compilation support can be enabled by specifying -enable-tile (disabled by default). Tile code generation is available for all architectures greater than sm_75 (Turing). The CUDA Tile C++ builtin functions are provided by the cuda_tile.h header file, this file needs C++20 or later dialect to be enabled.

When Tile compilation and code generation is enabled, in the nvcc default ‘whole program compilation’ mode, there will be two separate fatbinaries embedded in the host object file; one containing the generated SIMT code (PTX and/or SIMT cubin) and the other containing the generated Tile code (cuda_tile and/or the Tile cubin). In separate compilation mode, at present, the Tile code in the translation unit is still compiled in whole program mode i.e. it cannot refer to functions or variables that are not defined in the given translation unit.

The flags -simt-only and -tile-only can be used to enable Tile parser support, while skipping SIMT and Tile code generation, respectively.

#in host object file, embed separate cubins for Tile and SIMT code, generated for target sm_100
nvcc -enable-tile -std=c++20 -gencode arch=compute_100,code=sm_100 p1.cu -c

#since default architecture (sm_75) does not support Tile compilation, skip Tile codegen.
#in host object file, embed PTX and cubin generated for target sm_75
nvcc -enable-tile -std=c++20 p1.cu -c

#in host object file, embed cuda_tile IR and PTX generated for target compute_100
nvcc -enable-tile -std=c++20 -gencode arch=compute_100,code=compute_100 j1.cu -c

#skip SIMT codegen, in host object file, embed only cuda_tile IR generated for target compute_100
nvcc -enable-tile -tile-only -std=c++20 -gencode arch=compute_100,code=compute_100 j1.cu -c

#skip Tile codegen, in host object file, embed only PTX generated for target compute_100
nvcc -enable-tile -simt-only -std=c++20 -gencode arch=compute_100,code=compute_100 j1.cu -c

#in separate compilation mode (-rdc=true), SIMT cubin is in relocatable form, while Tile cubin is in non-relocatable form
nvcc -enable-tile -rdc=true -std=c++20 -gencode arch=compute_100,code=sm_100 p1.cu -c

Note

A CUDA runtime error is generated when invoking a SIMT or Tile kernel whose code generation is skipped. This error can be checked by reading the last error after launching a kernel:

#include <cstdio>

__tile_global__ void tile_kernel() { }
__global__ void simt_kernel() { }

#define checkCudaError(X) do {\
  auto ret = X;\
  if (ret != cudaSuccess) {\
    printf("\n error on line %d, CUDART error string : %s\n", __LINE__, cudaGetErrorString(ret));\
    exit(1);\
  }\
} while (0)

int main() {
  tile_kernel<<<1,1>>>();

  // Triggers when -simt-only is passed or arch <= sm_75 is targeted
  checkCudaError(cudaGetLastError());

  // Triggers when an execution error occurs
  checkCudaError(cudaDeviceSynchronize());


  simt_kernel<<<1,1>>>();

  // Triggers when -tile-only is passed
  checkCudaError(cudaGetLastError());

  // Triggers when an execution error occurs
  checkCudaError(cudaDeviceSynchronize());
}

Tile code generation is skipped when targeting an architecture less than sm_80 (Ampere) or when the -simt-only flag is specified. SIMT code generation is skipped when the -tile-only flag is specified.

1

Refer to the CUDA Programming Guide for details on Tile and SIMT programming.