transpose.h

Functions handling transposes.

Functions

void nvte_cast_transpose(const NVTETensor input, const NVTETensor scale, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor amax, NVTETensor scale_inv, cudaStream_t stream)

Cast and transpose the input.

This function casts the input and produces two results:

cast_output is the result of the cast.
transposed_output is the transposed result of the cast.

Parameters

input – [in] Input tensor of shape [N, H].
scale – [in] Scaling factor used for outputs.
cast_output – [out] Result of the cast. Shape: [N, H].
transposed_output – [out] Result of the cast and transpose. Shape: [H, N].
amax – [inout] AMAX (maximum absolute value) of the output tensor.
scale_inv – [out] Inverse of the output’s scaling factor.
stream – [in] CUDA stream used for the operation.
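
The relationship between the two outputs can be illustrated with a small host-side reference model. This is only a sketch under stated assumptions, not the library's implementation: cast_transpose_reference and CastTransposeResult are hypothetical names, the cast is modeled as a plain multiplication by scale with no rounding to a narrower type, amax is assumed to be the running maximum of the absolute unscaled values, and scale_inv is assumed to be 1 / scale.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the outputs described above.
struct CastTransposeResult {
  std::vector<float> cast_output;        // shape [N, H], row-major
  std::vector<float> transposed_output;  // shape [H, N], row-major
  float amax;                            // assumption: max |x| over unscaled values
  float scale_inv;                       // assumption: 1 / scale
};

// Reference model only: the "cast" is modeled as a multiplication by `scale`.
CastTransposeResult cast_transpose_reference(const std::vector<float>& input,
                                             std::size_t N, std::size_t H,
                                             float scale, float amax_in) {
  assert(input.size() == N * H);
  CastTransposeResult r{std::vector<float>(N * H), std::vector<float>(H * N),
                        amax_in, 1.0f / scale};
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      const float x = input[n * H + h];
      const float y = x * scale;
      r.cast_output[n * H + h] = y;        // element (n, h) of the [N, H] output
      r.transposed_output[h * N + n] = y;  // element (h, n) of the [H, N] output
      r.amax = std::max(r.amax, std::fabs(x));  // amax is updated, hence [inout]
    }
  }
  return r;
}
```

Both outputs hold the same values; only the memory layout differs, which is why a single pass over the input can fill them together.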

void nvte_transpose(const NVTETensor input, NVTETensor transposed_output, cudaStream_t stream)

Transpose the input.

Parameters

input – [in] Input tensor of shape [N, H].
transposed_output – [out] Result of the transpose. Shape: [H, N].
stream – [in] CUDA stream used for the operation.
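
The index mapping between the [N, H] input and the [H, N] output can be sketched on the host as follows. transpose_reference is a hypothetical helper written for illustration, and row-major storage is an assumption; it is not part of the header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: transposes a row-major [N, H] buffer into [H, N].
std::vector<float> transpose_reference(const std::vector<float>& input,
                                       std::size_t N, std::size_t H) {
  assert(input.size() == N * H);
  std::vector<float> out(H * N);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      out[h * N + n] = input[n * H + h];  // (n, h) moves to (h, n)
    }
  }
  return out;
}
```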

void nvte_cast_transpose_dbias(const NVTETensor input, const NVTETensor scale, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor amax, NVTETensor dbias, NVTETensor scale_inv, NVTETensor workspace, cudaStream_t stream)

Cast and transpose the input. Additionally, reduce the input along the first dimension.

This function casts the input and produces three results:

cast_output is the result of the cast.
transposed_output is the transposed result of the cast.
dbias is the result of the reduction of the input along the first dimension.

When workspace is an empty tensor, this function does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.
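
The workspace convention amounts to a two-call pattern: a first call that only reports the workspace requirements, then a second call that does the work. The sketch below mimics that pattern with purely hypothetical host-side types (MockTensor, mock_reduce_rows, run_with_workspace_query). It does not use NVTETensor or any function from this header, and the workspace size it reports is made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor: a shape plus backing storage.
struct MockTensor {
  std::vector<std::size_t> shape;
  std::vector<float> data;
  bool empty() const { return shape.empty(); }
};

// Hypothetical operation following the convention described above.
// Returns false if it only reported the workspace shape, true if it ran.
bool mock_reduce_rows(const MockTensor& input, MockTensor& dbias,
                      MockTensor& workspace) {
  const std::size_t N = input.shape[0], H = input.shape[1];
  if (workspace.empty()) {
    workspace.shape = {H};  // query pass: report the (made-up) requirement
    return false;
  }
  assert(workspace.data.size() == H);
  std::fill(workspace.data.begin(), workspace.data.end(), 0.0f);
  for (std::size_t n = 0; n < N; ++n)
    for (std::size_t h = 0; h < H; ++h)
      workspace.data[h] += input.data[n * H + h];  // accumulate in the workspace
  dbias.shape = {H};
  dbias.data = workspace.data;
  return true;
}

// Caller side: query, allocate, then run.
MockTensor run_with_workspace_query(const MockTensor& input) {
  MockTensor dbias, workspace;                   // workspace starts empty
  mock_reduce_rows(input, dbias, workspace);     // 1st call: sets workspace.shape
  workspace.data.assign(workspace.shape[0], 0.0f);  // allocate what was requested
  mock_reduce_rows(input, dbias, workspace);     // 2nd call: performs the operation
  return dbias;
}
```

The point of the pattern is that the caller owns the allocation, so the workspace can be reused across calls once its shape is known.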

Parameters

input – [in] Input tensor of shape [N, H].
scale – [in] Scaling factor used for outputs.
cast_output – [out] Result of the cast. Shape: [N, H].
transposed_output – [out] Result of the cast and transpose. Shape: [H, N].
amax – [inout] AMAX (maximum absolute value) of the output tensor.
dbias – [out] Result of the reduction of the input along the first dimension. Shape: [H].
scale_inv – [out] Inverse of the output’s scaling factor.
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
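
The dbias output is a reduction over the first dimension, so an [N, H] input yields H values. A minimal host-side sketch, assuming the reduction is a sum and the data is stored row-major, is shown below; dbias_reference is a hypothetical helper, not a function from this header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: sums a row-major [N, H] buffer over N, giving [H].
std::vector<float> dbias_reference(const std::vector<float>& input,
                                   std::size_t N, std::size_t H) {
  assert(input.size() == N * H);
  std::vector<float> dbias(H, 0.0f);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      dbias[h] += input[n * H + h];  // one accumulator per column
    }
  }
  return dbias;
}
```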

void nvte_cast_transpose_dbias_dgelu(const NVTETensor input, const NVTETensor gelu_input, const NVTETensor scale, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor amax, NVTETensor dbias, NVTETensor scale_inv, NVTETensor workspace, cudaStream_t stream)

Compute the backward of the GELU operation on the input, then cast and transpose the result. Additionally, reduce the result of the GELU backward along the first dimension.

This function produces three results:

cast_output is equal to cast(dGELU(input))
transposed_output is equal to transpose(cast(dGELU(input)))
dbias is equal to reduce(dGELU(input), axis=0)

When workspace is an empty tensor, this function does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters

input – [in] Input tensor of shape [N, H].
gelu_input – [in] Tensor used as input to the forward GELU operation. Shape: [N, H].
scale – [in] Scaling factor used for outputs.
cast_output – [out] Result of the cast. Shape: [N, H].
transposed_output – [out] Result of the cast and transpose. Shape: [H, N].
amax – [inout] AMAX (maximum absolute value) of the output tensor.
dbias – [out] Result of the reduction of dGELU(input) along the first dimension. Shape: [H].
scale_inv – [out] Inverse of the output’s scaling factor.
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
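
A host-side sketch of the dGELU and dbias arithmetic may help make the notation concrete. It rests on two assumptions that the text above does not state: that dGELU(input) means input multiplied elementwise by the GELU derivative evaluated at gelu_input, and that GELU is the tanh approximation. gelu_grad_tanh, DGeluResult, and dgelu_dbias_reference are hypothetical names, and the cast and transpose steps are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Derivative of the tanh approximation of GELU (an assumption; the exact
// erf-based form would be an equally plausible choice).
float gelu_grad_tanh(float x) {
  const float k = 0.7978845608f;  // sqrt(2 / pi)
  const float c = 0.044715f;
  const float t = std::tanh(k * (x + c * x * x * x));
  return 0.5f * (1.0f + t) +
         0.5f * x * (1.0f - t * t) * k * (1.0f + 3.0f * c * x * x);
}

// Hypothetical container for the intermediate result and its reduction.
struct DGeluResult {
  std::vector<float> dgelu;  // shape [N, H], before cast and transpose
  std::vector<float> dbias;  // shape [H]
};

// Assumed semantics: dgelu = input * GELU'(gelu_input); dbias = sum over N.
DGeluResult dgelu_dbias_reference(const std::vector<float>& input,
                                  const std::vector<float>& gelu_input,
                                  std::size_t N, std::size_t H) {
  assert(input.size() == N * H && gelu_input.size() == N * H);
  DGeluResult r{std::vector<float>(N * H), std::vector<float>(H, 0.0f)};
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t h = 0; h < H; ++h) {
      const std::size_t i = n * H + h;
      const float d = input[i] * gelu_grad_tanh(gelu_input[i]);
      r.dgelu[i] = d;
      r.dbias[h] += d;  // reduce along the first dimension
    }
  }
  return r;
}
```

In the fused function, the dgelu values would then go through the same cast and transpose steps as in nvte_cast_transpose.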