cuda.tile.TiledView

class cuda.tile.TiledView

Class for tiled view objects.

property dtype: DType

The data type of the elements in the tiled view.

Return type: DType (constant)

property tile_shape: tuple[int, ...]

The shape of tiles produced by each indexed access.

Return type: tuple[const int,…]

num_tiles(axis)

The number of tiles along the given axis of the tiled view.

Parameters: axis (const int) – The axis of the tile index space.
Return type: int32

property traversal_steps: tuple[int, ...]

Number of elements between consecutive tile origins along each axis.

Defaults to tile_shape when not explicitly provided. If tile_shape is (), traversal_steps is (1,) * rank, where rank is the rank of the tiled view.

Return type: tuple[const int,…]
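
The relationship between a tile index and the elements it covers can be sketched in plain Python. This is a host-side model, not cuda.tile code. The helper name tile_origin is hypothetical, and the sketch assumes that a tile's origin is its index multiplied element-wise by traversal_steps, which is how the description above reads.

```python
def tile_origin(index, traversal_steps):
    """Element offset of a tile's first element along each axis.

    Assumption: origin = index * traversal_steps, taken element-wise.
    """
    if len(index) != len(traversal_steps):
        raise ValueError("index and traversal_steps must have the same rank")
    return tuple(i * s for i, s in zip(index, traversal_steps))


tile_shape = (4, 8)

# Default case: traversal_steps equals tile_shape, so tiles do not overlap.
print(tile_origin((2, 1), tile_shape))  # (8, 8)

# Hypothetical explicit steps smaller than the tile along axis 0:
# consecutive tiles along that axis would then share elements.
print(tile_origin((2, 1), (2, 8)))      # (4, 8)
```

Under this reading, a step smaller than the tile extent yields overlapping tiles and a larger step leaves gaps between them.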

load(index, *, latency=None, allow_tma=None)

Loads a tile from the tiled view at the given tile index.

The returned tile has shape tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are filled according to the view’s padding mode. If the tile lies entirely outside the tiled view, the behavior is undefined.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
latency (const int) – A hint indicating how heavy DRAM traffic will be. It shall be an integer between 1 (low) and 10 (high). By default, the compiler will infer the latency.
allow_tma (const bool) – If False, the load will not use TMA. By default, TMA is allowed.

Return type:

Tile

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    tile = tv.load(0)
    print(tile)

x = torch.arange(8, device='cuda')
ct.launch(stream, (1,), kernel, (x,))

torch.cuda.synchronize()

Output

[0, 1, 2, 3]
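
The partial-tile rule described above can be modeled on the host with a plain list. This sketch is not cuda.tile code: load_tile_1d is a hypothetical helper, it covers the 1-D case only, and pad_value stands in for whatever the view's padding mode would supply (a zero fill is assumed here purely for illustration).

```python
def load_tile_1d(data, tile_size, index, pad_value=0):
    """Host-side model of a 1-D tile load with padding for partial tiles."""
    start = index * tile_size
    if index < 0 or start >= len(data):
        # On the device this case is undefined behavior; the model refuses it.
        raise IndexError("tile lies entirely outside the view")
    return [
        data[i] if i < len(data) else pad_value
        for i in range(start, start + tile_size)
    ]


data = list(range(6))            # 6 elements, tile size 4
print(load_tile_1d(data, 4, 0))  # [0, 1, 2, 3]
print(load_tile_1d(data, 4, 1))  # [4, 5, 0, 0]
```

The second tile starts at element 4 and runs two elements past the end of the data, so its last two entries come from the padding value rather than from memory.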

store(index, tile, *, latency=None, allow_tma=None)

Stores a tile into the tiled view at the given tile index.

The tile’s shape must be broadcastable to tile_shape. If the tile’s dtype differs from the view’s dtype, an implicit cast is performed.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
tile (Tile) – The tile to store.
latency (const int) – A hint indicating how heavy DRAM traffic will be. It shall be an integer between 1 (low) and 10 (high). By default, the compiler will infer the latency.
allow_tma (const bool) – If False, the store will not use TMA. By default, TMA is allowed.

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    tile = ct.full((4,), 99, dtype=ct.int32)
    tv.store(0, tile)

x = torch.zeros(8, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[99, 99, 99, 99, 0, 0, 0, 0]
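
The broadcasting and out-of-bound rules for store can be sketched the same way on the host. This is a 1-D model in plain Python rather than cuda.tile code: store_tile_1d is a hypothetical helper, only length-1 broadcasting is modeled, and the implicit dtype cast is left out.

```python
def store_tile_1d(data, tile_size, index, tile):
    """Host-side model of a 1-D tile store; modifies data in place and returns it."""
    if index < 0 or index * tile_size >= len(data):
        # On the device this case is undefined behavior; the model refuses it.
        raise IndexError("tile lies entirely outside the view")
    if len(tile) == 1:
        tile = list(tile) * tile_size      # broadcast a length-1 tile
    elif len(tile) != tile_size:
        raise ValueError("tile is not broadcastable to tile_shape")
    start = index * tile_size
    for offset, value in enumerate(tile):
        position = start + offset
        if position < len(data):           # out-of-bound elements are ignored
            data[position] = value
    return data


data = [0] * 6
store_tile_1d(data, 4, 0, [1, 2, 3, 4])
store_tile_1d(data, 4, 1, [99])            # partial tile: only 2 elements land
print(data)  # [1, 2, 3, 4, 99, 99]
```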

atomic_store_add(index, update, /)

Atomically adds update to the tiled view at the given tile index.

If update’s dtype differs from the view’s dtype, an implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_add for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_add(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[1, 1, 1, 1]
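
The statement that each element is updated atomically while the order of individual writes is unspecified can be illustrated with a small host-side model. This is plain Python, not cuda.tile code. It assumes two hypothetical blocks adding to the same tile, enumerates every possible order of their per-element writes, and collects the distinct final results.

```python
import itertools


def apply_updates(initial, updates, order):
    """Apply per-element additions in the given order of (block, element) pairs."""
    result = list(initial)
    for block, element in order:
        result[element] += updates[block][element]
    return result


initial = [0, 0]
updates = [[1, 2], [10, 20]]  # two hypothetical blocks updating the same tile
writes = [(b, e) for b in range(2) for e in range(2)]

# Try every interleaving of the four individual element writes.
results = {
    tuple(apply_updates(initial, updates, order))
    for order in itertools.permutations(writes)
}
print(results)  # {(11, 22)}
```

With integer addition every interleaving ends in the same state, because no individual update is lost. With floating-point dtypes the rounding of a sum can depend on the order of additions, so results may differ in the last bits between runs.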

atomic_store_max(index, update, /)

Atomically applies element-wise maximum with update to the tiled view at the given tile index.

If update’s dtype differs from the view’s dtype, an implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_max for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_max(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[1, 1, 1, 1]

atomic_store_min(index, update, /)

Atomically applies element-wise minimum with update to the tiled view at the given tile index.

If update’s dtype differs from the view’s dtype, an implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_min for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_min(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[0, 0, 0, 0]

atomic_store_and(index, update, /)

Atomically applies bitwise AND of update to the tiled view at the given tile index.

update’s dtype must exactly match the view’s dtype; no implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_and for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_and(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[0, 0, 0, 0]

atomic_store_or(index, update, /)

Atomically applies bitwise OR of update to the tiled view at the given tile index.

update’s dtype must exactly match the view’s dtype; no implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_or for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_or(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[1, 1, 1, 1]

atomic_store_xor(index, update, /)

Atomically applies bitwise XOR of update to the tiled view at the given tile index.

update’s dtype must exactly match the view’s dtype; no implicit cast is performed.

This method does not return a value.

For each individual element, the operation is performed atomically, but the operation as a whole is not atomic, and the order of individual writes is unspecified.

update’s shape must be broadcastable to tile_shape.

For a tile that partially extends beyond the tiled view boundaries, out-of-bound elements are ignored. If the tile lies entirely outside the tiled view, the behavior is undefined.

Use this operation instead of ct.atomic_xor for better performance when the modified value is not needed.

Parameters:

index (tuple[int,...]) – An index in the tiled view’s tile space.
update (Tile) – The update values.

Return type:

None

Examples

import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel
def kernel(x):
    tv = x.tiled_view(4)
    update = ct.full((4,), 1, dtype=ct.int32)
    tv.atomic_store_xor(0, update)

x = torch.zeros(4, dtype=torch.int32, device='cuda')
ct.launch(stream, (1,), kernel, (x,))
print(x.tolist())

torch.cuda.synchronize()

Output

[1, 1, 1, 1]
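
The examples for the bitwise operations all start from a zero-filled buffer, which makes it hard to tell AND, OR and XOR apart from the output alone. The following host-side sketch applies the three element-wise operations to distinguishable bit patterns. It is plain Python, not cuda.tile code: bitwise_update is a hypothetical helper, and Python integers stand in for a fixed-width integer dtype.

```python
import operator

BITWISE_OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}


def bitwise_update(current, update, op):
    """Element-wise model of what the view would hold after the update."""
    return [BITWISE_OPS[op](c, u) for c, u in zip(current, update)]


current = [0b1100] * 3  # values already in the view
update = [0b1010] * 3   # values passed as the update tile

for op in ("and", "or", "xor"):
    bits = [format(v, "04b") for v in bitwise_update(current, update, op)]
    print(op, bits)
# and ['1000', '1000', '1000']
# or ['1110', '1110', '1110']
# xor ['0110', '0110', '0110']
```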