cuda.tile.Array
- class cuda.tile.Array
Class for global array objects.
- property shape: tuple[int, ...]
The number of elements in each of the array’s dimensions.
- Return type:
tuple[int32,…]
- property strides: tuple[int, ...]
The number of elements to step in each dimension while traversing the array.
- Return type:
tuple[int32,…]
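To illustrate how strides counted in elements relate to a shape, here is a minimal sketch in plain Python (it is not cuda.tile code). It assumes a contiguous row-major layout, which this page does not guarantee for every array, and the helper names row_major_strides and linear_offset are hypothetical.

```python
def row_major_strides(shape):
    """Element strides of a contiguous row-major array (hypothetical helper)."""
    strides = []
    step = 1
    # Walk the dimensions from last to first; the last one is densest.
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def linear_offset(index, strides):
    """Element offset of a multi-dimensional index from the array base."""
    return sum(i * s for i, s in zip(index, strides))


shape = (4, 4)
strides = row_major_strides(shape)
print(strides)                         # (4, 1)
print(linear_offset((2, 3), strides))  # 11
```

Note that the values here are element counts, as described for the strides property above, whereas NumPy's ndarray.strides reports bytes.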
- slice(axis, start, stop)
Creates a view of the array sliced along a single axis.
The returned array references the same underlying memory as array, but with a restricted range from index start (inclusive) to stop (exclusive) along the specified axis. No data is copied.
axis must be a constant integer. Negative values are supported and count from the last dimension (e.g., axis=-1 refers to the last axis).
start and stop must be integers (scalars or 0D tiles). They must satisfy 0 <= start < N and start <= stop <= N, where N is the size of array along the sliced axis.
For example, consider a 2-dimensional array A of shape (M, N). Slicing along axis 0 from start to stop produces an array of shape (stop - start, N):

    import cuda.tile as ct
    import torch

    torch.cuda.init()
    stream = torch.cuda.current_stream()

    @ct.kernel
    def kernel(x):
        # View of rows 1 and 2 of x; no data is copied.
        sub = x.slice(axis=0, start=1, stop=3)
        print(ct.load(sub, (0, 0), shape=(2, 4)))

    x = torch.arange(16, device='cuda').reshape(4, 4)
    ct.launch(stream, (1,), kernel, (x,))
    torch.cuda.synchronize()

Output
[[4, 5, 6, 7], [8, 9, 10, 11]]
Using NumPy slice notation for illustration, this is equivalent to:
sub = A[start:stop, :] # NumPy notation for reference only
The slice bounds can be dynamic (runtime values):

    import cuda.tile as ct
    import torch

    torch.cuda.init()
    stream = torch.cuda.current_stream()

    @ct.kernel
    def kernel(x, offset, length):
        # start and stop are runtime values passed as kernel arguments.
        sub = x.slice(axis=0, start=offset, stop=offset+length)
        print(ct.load(sub, (0,), shape=(4,)))
        print(ct.load(sub, (1,), shape=(4,)))

    x = torch.arange(16, device='cuda')
    ct.launch(stream, (1,), kernel, (x, 8, 8))
    torch.cuda.synchronize()

Output
[8, 9, 10, 11]
[12, 13, 14, 15]
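The bounds rules and the resulting shape for slice can be sketched in plain Python as follows. This is an illustration of the documented constraints, not the library's implementation: slice_view_metadata is a hypothetical helper, and the returned element offset is an assumption about how a view that shares memory would locate its first element.

```python
def slice_view_metadata(shape, strides, axis, start, stop):
    """Return (shape, element_offset) of a view sliced along one axis.

    Hypothetical helper that mirrors the documented constraints only.
    """
    rank = len(shape)
    if not -rank <= axis < rank:
        raise ValueError("axis out of range")
    axis %= rank  # negative axes count from the last dimension
    n = shape[axis]
    if not (0 <= start < n and start <= stop <= n):
        raise ValueError("need 0 <= start < N and start <= stop <= N")
    new_shape = shape[:axis] + (stop - start,) + shape[axis + 1:]
    # Assumption: the view begins `start` steps further along `axis`.
    return new_shape, start * strides[axis]


# The two examples above: a 4x4 array and a 16-element array.
print(slice_view_metadata((4, 4), (4, 1), axis=0, start=1, stop=3))  # ((2, 4), 4)
print(slice_view_metadata((16,), (1,), axis=0, start=8, stop=16))    # ((8,), 8)
```

The offsets 4 and 8 agree with the first values printed in the two outputs above, since x holds 0 through 15 in order.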
- tiled_view(tile_shape, *, padding_mode=PaddingMode.UNDETERMINED, traversal_steps=None)
Creates a tiled view of this array with a fixed tile_shape.
The resulting TiledView partitions this array into a grid of equally sized tiles.
- Parameters:
tile_shape (tuple[const int,...]) – The shape of each tile in the view. Must have the same rank as this array.
padding_mode (PaddingMode) – The value used to pad tiles that extend beyond the array boundaries. By default, the padding value is undetermined.
traversal_steps (tuple[const int, ...], optional) – Number of elements between consecutive tile origins along each axis. Must have the same rank as the array, or be None (default).
- None or traversal_steps[i] == tile_shape[i]: tiles partition axis i with no overlap or gaps.
- traversal_steps[i] < tile_shape[i]: tiles overlap along axis i.
- traversal_steps[i] > tile_shape[i]: gaps between tiles along axis i.
(Since CTK 13.3)
- Return type:
TiledView
Examples

    import cuda.tile as ct
    import torch

    torch.cuda.init()
    stream = torch.cuda.current_stream()

    @ct.kernel
    def kernel(x):
        # Default traversal: tiles of shape (2, 4) with no overlap or gaps.
        tv = x.tiled_view((2, 4))
        print(tv.load((0, 0)))
        print(tv.load((1, 0)))

        # traversal_steps=(1, 4): advance 1 row per step, tiles overlap
        tv2 = x.tiled_view((2, 4), traversal_steps=(1, 4))
        print(tv2.load((0, 0)))
        print(tv2.load((1, 0)))

    x = torch.arange(16, device='cuda').reshape(4, 4)
    ct.launch(stream, (1,), kernel, (x,))
    torch.cuda.synchronize()

Output
[[0, 1, 2, 3], [4, 5, 6, 7]]
[[8, 9, 10, 11], [12, 13, 14, 15]]
[[0, 1, 2, 3], [4, 5, 6, 7]]
[[4, 5, 6, 7], [8, 9, 10, 11]]
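How a tile index maps to an origin in the array can be sketched in plain Python, under the reading that tile index i along an axis starts at element i * step, where step is traversal_steps[i] or, by default, tile_shape[i]. The helpers tile_origin and load_tile_2d are hypothetical and operate on nested lists rather than device arrays. The sketch does not model padding, so it rejects tiles that extend past the array.

```python
def tile_origin(tile_index, tile_shape, traversal_steps=None):
    """Element coordinates of a tile's first element (hypothetical helper)."""
    steps = tile_shape if traversal_steps is None else traversal_steps
    if not (len(tile_index) == len(tile_shape) == len(steps)):
        raise ValueError("rank mismatch")
    return tuple(i * s for i, s in zip(tile_index, steps))


def load_tile_2d(rows, tile_index, tile_shape, traversal_steps=None):
    """Copy one in-bounds tile out of a 2-D nested list."""
    r0, c0 = tile_origin(tile_index, tile_shape, traversal_steps)
    th, tw = tile_shape
    if r0 + th > len(rows) or c0 + tw > len(rows[0]):
        raise IndexError("tile extends past the array; padding is not modelled")
    return [row[c0:c0 + tw] for row in rows[r0:r0 + th]]


# Same data as the example above: a 4x4 array holding 0..15.
x = [[4 * r + c for c in range(4)] for r in range(4)]
print(load_tile_2d(x, (1, 0), (2, 4)))
# [[8, 9, 10, 11], [12, 13, 14, 15]]
print(load_tile_2d(x, (1, 0), (2, 4), traversal_steps=(1, 4)))
# [[4, 5, 6, 7], [8, 9, 10, 11]]
```

Both printed tiles match the corresponding lines of the output above. With traversal_steps=(1, 4), tile (1, 0) starts one row down instead of two, which is why it overlaps tile (0, 0).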
See also