Quickstart#
This page guides you through setting up cuTile Python and running a first example.
Prerequisites#
cuTile Python requires the following:
Linux x86_64, Linux aarch64 or Windows x86_64
A GPU with compute capability 10.x or 12.x
NVIDIA Driver r580 or later
CUDA Toolkit 13.1 or later
Python version 3.10, 3.11, 3.12 or 3.13
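The Python version and platform requirements above can be checked from Python itself before installing anything. The following is a minimal sketch, not part of cuTile: `check_host`, `SUPPORTED_PYTHON`, `SUPPORTED_PLATFORMS`, and `MACHINE_ALIASES` are hypothetical names introduced for illustration, and the sketch does not check the GPU, driver, or CUDA Toolkit requirements.

```python
import platform
import sys

# Values copied from the prerequisites list above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12), (3, 13)}
SUPPORTED_PLATFORMS = {
    ("Linux", "x86_64"),
    ("Linux", "aarch64"),
    ("Windows", "x86_64"),
}

# Assumption: Windows reports the x86_64 architecture as "AMD64",
# so it is normalized before comparing.
MACHINE_ALIASES = {"amd64": "x86_64"}


def check_host(system, machine, version):
    """Return a list of problems; an empty list means the host checks pass."""
    problems = []
    arch = MACHINE_ALIASES.get(machine.lower(), machine.lower())
    if (system, arch) not in SUPPORTED_PLATFORMS:
        problems.append(f"unsupported platform: {system} {machine}")
    if tuple(version[:2]) not in SUPPORTED_PYTHON:
        problems.append(f"unsupported Python: {version[0]}.{version[1]}")
    return problems


if __name__ == "__main__":
    issues = check_host(platform.system(), platform.machine(), sys.version_info)
    print("host OK" if not issues else "\n".join(issues))
```

The function takes the system, machine, and version as arguments rather than reading them directly, so the same check can be exercised with values from another machine.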
Installing cuTile Python#
With the prerequisites met, installing cuTile Python is a simple pip install:
pip install cuda-tile
Other Packages#
Some of the cuTile Python samples also use other Python packages.
The quickstart sample on this page uses cupy, which can be installed with:
pip install cupy-cuda13x
The cuTile Python samples in the samples/ directory also use the pytest, torch, and numpy packages.
For PyTorch installation instructions, see https://pytorch.org/get-started/locally/.
pytest and NumPy can be installed with:
pip install pytest numpy
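After installing, it can be useful to confirm which of these packages Python can actually find. The sketch below uses only the standard library; `is_importable` and `MODULES` are hypothetical names introduced here, and the module names are the import names used on this page (`cuda.tile`, `cupy`, `numpy`) plus the assumed import names `pytest` and `torch`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if module_name can be located by the import system."""
    try:
        # find_spec imports parent packages (e.g. "cuda") but not the module itself.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError (an ImportError) is raised when a parent package is missing.
        return False


# Import names for the packages mentioned above.
MODULES = ["cuda.tile", "cupy", "numpy", "pytest", "torch"]

if __name__ == "__main__":
    for name in MODULES:
        status = "found" if is_importable(name) else "MISSING"
        print(f"{name:10s} {status}")
```

A module reported as found can still fail at import time, for example if the driver or CUDA Toolkit is missing, so this only confirms that the package is installed.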
Example Code#
The following example implements vector addition, a typical first CUDA kernel, using cuTile's tile-based programming model. It uses one-dimensional tiles to add two one-dimensional vectors.
This example shows a structure common to cuTile kernels:
Load one or more tiles from GPU memory
Perform computation(s) on the tile(s), resulting in new tile(s)
Write the resulting tile(s) out to GPU memory
In this case, the kernel loads tiles from two vectors, a and b. These loads create tiles called a_tile and b_tile. These tiles are added together to form a third tile, called result. In the last step, the kernel stores the result tile to the output vector c.
More samples can be found in the cuTile Python repository.
# SPDX-FileCopyrightText: Copyright (c) <2025> NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
"""
Example demonstrating simple vector addition.
Shows how to perform elementwise operations on vectors.
"""
import cupy as cp
import numpy as np
import cuda.tile as ct


@ct.kernel
def vector_add(a, b, c, tile_size: ct.Constant[int]):
    # Get the 1D pid
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Perform elementwise addition
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid, ), tile=result)


def test():
    # Create input data
    vector_size = 2**12
    tile_size = 2**4
    grid = (ct.cdiv(vector_size, tile_size), 1, 1)

    a = cp.random.uniform(-1, 1, vector_size)
    b = cp.random.uniform(-1, 1, vector_size)
    c = cp.zeros_like(a)

    # Launch kernel
    ct.launch(cp.cuda.get_current_stream(),
              grid,  # 1D grid of processors
              vector_add,
              (a, b, c, tile_size))

    # Copy to host only to compare
    a_np = cp.asnumpy(a)
    b_np = cp.asnumpy(b)
    c_np = cp.asnumpy(c)

    # Verify results
    expected = a_np + b_np
    np.testing.assert_array_almost_equal(c_np, expected)

    print("✓ vector_add_example passed!")


if __name__ == "__main__":
    test()
Run the example from a command line as shown below. If everything has been set up correctly, the test prints that the example passed.
$ python3 samples/quickstart/VectorAdd_quickstart.py
✓ vector_add_example passed!
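To trace the index arithmetic of vector_add without a GPU, the load, compute, and store steps can be mimicked in plain Python. This is only an analogy, not how cuTile executes kernels: `vector_add_block` and `launch_on_cpu` are hypothetical helpers, `cdiv` here is a stand-in for `ct.cdiv`, and the sketch assumes that the tile at index `pid` covers elements `pid * tile_size` up to `(pid + 1) * tile_size`.

```python
def cdiv(a, b):
    """Ceiling division for positive integers (stand-in for ct.cdiv in this sketch)."""
    return -(-a // b)


def vector_add_block(a, b, c, tile_size, pid):
    """CPU stand-in for one block of the vector_add kernel."""
    start = pid * tile_size  # assumed mapping from tile index to element index
    stop = start + tile_size
    a_tile = a[start:stop]  # "load" a tile of a
    b_tile = b[start:stop]  # "load" a tile of b
    result = [x + y for x, y in zip(a_tile, b_tile)]  # elementwise addition
    c[start:stop] = result  # "store" the result tile


def launch_on_cpu(a, b, tile_size):
    """Run every block one after another; a GPU would run them in parallel."""
    c = [0.0] * len(a)
    num_blocks = cdiv(len(a), tile_size)
    for pid in range(num_blocks):
        vector_add_block(a, b, c, tile_size, pid)
    return c, num_blocks


if __name__ == "__main__":
    vector_size, tile_size = 2**12, 2**4
    a = [float(i) for i in range(vector_size)]
    b = [2.0 * i for i in range(vector_size)]
    c, num_blocks = launch_on_cpu(a, b, tile_size)
    print(num_blocks)  # 4096 elements / 16 per tile = 256 blocks
    assert c == [x + y for x, y in zip(a, b)]
```

With the sizes used in the example, the grid has 2**12 / 2**4 = 256 blocks, and each block handles 16 consecutive elements.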
To run other cuTile Python samples, invoke them the same way as the quickstart example:
$ python3 samples/FFT.py
# output not shown
You can also use pytest to run all the samples:
$ pytest samples
========================= test session starts =========================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/ascudiero/sw/cutile-python
configfile: pytest.ini
collected 6 items
samples/test_samples.py ...... [100%]
========================= 6 passed in 30.74s ==========================
Developer Tools#
NVIDIA Nsight Compute can profile cuTile Python kernels in the same way as SIMT CUDA kernels. With Nsight Compute installed, the following command profiles the quickstart vector addition kernel and creates a profile named VecAddProfile:
ncu -o VecAddProfile --set detailed python3 VectorAdd_quickstart.py
The profile can then be opened in the graphical Nsight Compute application, where selecting the vector_add kernel shows its statistics.
Note
Capturing detailed statistics for cuTile Python kernels requires running on NVIDIA Driver r590 or later.
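When profiling several samples, the ncu command above can be assembled from Python. The following sketch uses only the flags shown on this page (`-o` and `--set detailed`); `build_ncu_command` and `profile` are hypothetical helpers, not part of cuTile or Nsight Compute.

```python
import shutil
import subprocess


def build_ncu_command(script, output="VecAddProfile", python="python3"):
    """Assemble the ncu invocation shown above as an argument list."""
    return ["ncu", "-o", output, "--set", "detailed", python, script]


def profile(script, output="VecAddProfile"):
    """Run ncu on the given script; raises if ncu is not on PATH or exits with an error."""
    if shutil.which("ncu") is None:
        raise RuntimeError("ncu not found on PATH; is Nsight Compute installed?")
    return subprocess.run(build_ncu_command(script, output), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without a GPU.
    print(" ".join(build_ncu_command("VectorAdd_quickstart.py")))
```

Calling `profile("VectorAdd_quickstart.py")` on a machine that meets the prerequisites would run the same command as the one shown above.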