Functions handling transposes.
void nvte_cast_transpose(const NVTETensor input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
Cast and transpose the input.
This function casts the input and produces 2 results:
- cast_output is the result of the cast.
- transposed_output is the transposed result of the cast.
- Parameters:
input – [in] Input tensor of shape [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
stream – [in] CUDA stream used for the operation.
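The relationship between the two outputs can be sketched with a small CPU reference. This is not the library's implementation: `ref_cast`, `cast_t`, and `ref_cast_transpose` are hypothetical names, the cast is modeled as a plain float-to-int truncation, and tensors are assumed to be stored contiguously in row-major order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the target type of the cast. */
typedef int cast_t;

/* Hypothetical stand-in for the cast itself: truncate float to int. */
static cast_t ref_cast(float x) { return (cast_t)x; }

/* input:             [N, H], row-major (assumed layout)
 * cast_output:       [N, H]
 * transposed_output: [H, N] */
void ref_cast_transpose(const float *input, cast_t *cast_output,
                        cast_t *transposed_output, size_t N, size_t H) {
    assert(input != NULL && cast_output != NULL && transposed_output != NULL);
    for (size_t n = 0; n < N; ++n) {
        for (size_t h = 0; h < H; ++h) {
            cast_t v = ref_cast(input[n * H + h]);
            cast_output[n * H + h] = v;        /* same position as the input */
            transposed_output[h * N + n] = v;  /* row and column swapped */
        }
    }
}
```

Each element is cast once and written to both outputs, so transposed_output is the transpose of cast_output rather than of the uncast input.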
void nvte_transpose(const NVTETensor input, NVTETensor transposed_output, cudaStream_t stream)
Transpose the input.
- Parameters:
input – [in] Input tensor of shape [N, H].
transposed_output – [out] Result of the transpose. Shape: [H, N].
stream – [in] CUDA stream used for the operation.
void nvte_cast_transpose_dbias(const NVTETensor input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
Cast and transpose the input. Additionally, reduce the input along the first dimension.
This function casts the input and produces 3 results:
- cast_output is the result of the cast.
- transposed_output is the transposed result of the cast.
- dbias is the result of the reduction of the input along the first dimension.
Calling this function with workspace being an empty tensor will not perform the operation, but instead set the shape and type of the workspace tensor to the required values.
- Parameters:
input – [in] Input tensor of shape [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
dbias – [out] Result of the reduction of the input along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
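The two-call workspace convention and the dbias reduction can be illustrated with a CPU sketch. Everything here is hypothetical: `ref_workspace` is not NVTETensor, `ref_dbias` is not a library function, and the workspace size of H floats is an assumption made only so the example has something to report. Row-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical workspace descriptor, standing in for the workspace tensor. */
typedef struct {
    float *data;   /* NULL models an "empty" workspace */
    size_t count;  /* number of float elements required or provided */
} ref_workspace;

/* Returns 0 if the call only reported the workspace size,
 * 1 if the reduction was actually performed. */
int ref_dbias(const float *input, float *dbias, ref_workspace *ws,
              size_t N, size_t H) {
    assert(input != NULL && dbias != NULL && ws != NULL);
    if (ws->data == NULL) {
        ws->count = H;  /* assumed requirement, for this sketch only */
        return 0;       /* no computation on the query call */
    }
    assert(ws->count >= H);
    for (size_t h = 0; h < H; ++h) ws->data[h] = 0.0f;
    for (size_t n = 0; n < N; ++n)
        for (size_t h = 0; h < H; ++h)
            ws->data[h] += input[n * H + h];  /* sum over the first dimension */
    for (size_t h = 0; h < H; ++h) dbias[h] = ws->data[h];
    return 1;
}
```

A caller would invoke the function once with an empty workspace, allocate a buffer of the reported size, and then call it again with the same arguments to run the operation.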
void nvte_fp8_transpose_dbias(const NVTETensor input, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
Transpose the FP8 input. Additionally, reduce the input along the first dimension.
This function takes FP8 input and produces 2 results:
- transposed_output is the transposed result of the input.
- dbias is the result of the reduction of the input along the first dimension.
Calling this function with workspace being an empty tensor will not perform the operation, but instead set the shape and type of the workspace tensor to the required values.
- Parameters:
input – [in] Input tensor of shape [N, H].
transposed_output – [inout] Result of the transpose. Shape: [H, N].
dbias – [out] Result of the reduction of the input along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
void nvte_multi_cast_transpose(size_t num_tensors, const NVTETensor *input_list, NVTETensor *cast_output_list, NVTETensor *transposed_output_list, cudaStream_t stream)
Cast and transpose multiple tensors.
This function casts each input tensor and produces 2 results:
- cast_output is the result of the cast.
- transposed_output is the transposed result of the cast.
- Parameters:
num_tensors – [in] Number of tensors.
input_list – [in] List of 2D input tensors.
cast_output_list – [inout] List of cast tensors. Dimensions match the tensors in input_list.
transposed_output_list – [inout] List of cast and transposed tensors. Dimensions are the transposes of the tensors in input_list.
stream – [in] CUDA stream used for the operation.
void nvte_cast_transpose_dbias_dgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
Compute the backward of the activation (ActLU) operation on the input, then cast and transpose. Additionally, reduce the result of the activation backward along the first dimension.
This function produces 3 results:
- cast_output is equal to cast(dact(input))
- transposed_output is equal to transpose(cast(dact(input)))
- dbias is equal to reduce(dact(input), axis=0)
Calling this function with workspace being an empty tensor will not perform the operation, but instead set the shape and type of the workspace tensor to the required values.
Supported activations: GeLU, SiLU, ReLU, QuickGeLU, SquaredReLU
- Parameters:
input – [in] Input tensor of shape [N, H].
act_input – [in] Tensor used as input to the forward of the activation operation. Shape: [N, H].
cast_output – [inout] Result of the cast. Shape: [N, H].
transposed_output – [inout] Result of the cast and transpose. Shape: [H, N].
dbias – [out] Result of the reduction of dact(input) along the first dimension. Shape: [H].
workspace – [out] Workspace tensor.
stream – [in] CUDA stream used for the operation.
void nvte_cast_transpose_dbias_dsilu(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
void nvte_cast_transpose_dbias_drelu(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
void nvte_cast_transpose_dbias_dqgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
void nvte_cast_transpose_dbias_dsrelu(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)
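As an illustration of dact(input) and the dbias reduction, here is a CPU sketch for the ReLU case. It assumes the usual chain-rule reading, in which input carries the incoming gradient and act_input is the tensor the forward activation was applied to. `ref_drelu_dbias` is a hypothetical name, the cast and transpose steps are omitted, and row-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* input, act_input, dact: [N, H], row-major (assumed layout)
 * dbias:                  [H]
 * dact holds dReLU before any cast: the gradient passes through
 * only where the forward input was positive. */
void ref_drelu_dbias(const float *input, const float *act_input,
                     float *dact, float *dbias, size_t N, size_t H) {
    assert(input != NULL && act_input != NULL && dact != NULL && dbias != NULL);
    for (size_t h = 0; h < H; ++h) dbias[h] = 0.0f;
    for (size_t n = 0; n < N; ++n) {
        for (size_t h = 0; h < H; ++h) {
            size_t i = n * H + h;
            float g = act_input[i] > 0.0f ? input[i] : 0.0f;
            dact[i] = g;
            dbias[h] += g;  /* reduce(dact(input), axis=0) */
        }
    }
}
```

The reduction is taken over the uncast dact values, matching the reduce(dact(input), axis=0) expression in the description.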
void nvte_dgeglu_cast_transpose(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
Compute dgeglu of the input, then cast and transpose the dgeglu output.
This function produces 2 results:
- cast_output is the result of the cast.
- transposed_output is the transposed result of the cast.
Supported activations: GeLU, SiLU, ReLU, QuickGeLU, SquaredReLU
- Parameters:
input – [in] Input tensor of shape [N, H].
act_input – [in] Tensor used as input to the forward of GeGLU operation. Shape: [N, H * 2].
cast_output – [inout] Result of the cast. Shape: [N, H * 2].
transposed_output – [inout] Result of the cast and transpose. Shape: [H * 2, N].
stream – [in] CUDA stream used for the operation.
void nvte_dswiglu_cast_transpose(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
void nvte_dreglu_cast_transpose(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
void nvte_dqgeglu_cast_transpose(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
void nvte_dsreglu_cast_transpose(const NVTETensor input, const NVTETensor act_input, NVTETensor cast_output, NVTETensor transposed_output, cudaStream_t stream)
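The shape relationship for the gated variants (gradient of shape [N, H], output of shape [N, H * 2]) can be sketched on the CPU for the ReGLU case. Several things here are assumptions rather than documented behavior: that the forward computes relu(a) * b, that act_input stores a in the first H columns of each row and b in the last H, and that the output uses the same layout. `ref_dreglu` is a hypothetical name, and the cast and transpose steps are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* input:     [N, H]      incoming gradient (assumed meaning)
 * act_input: [N, 2 * H]  assumed layout per row: a[0..H), then b[0..H)
 * dgated:    [N, 2 * H]  gradients w.r.t. a, then w.r.t. b
 * Assumed forward: out = relu(a) * b. */
void ref_dreglu(const float *input, const float *act_input,
                float *dgated, size_t N, size_t H) {
    assert(input != NULL && act_input != NULL && dgated != NULL);
    for (size_t n = 0; n < N; ++n) {
        for (size_t h = 0; h < H; ++h) {
            float g = input[n * H + h];
            float a = act_input[n * 2 * H + h];
            float b = act_input[n * 2 * H + H + h];
            /* d/da = g * relu'(a) * b;  d/db = g * relu(a) */
            dgated[n * 2 * H + h] = a > 0.0f ? g * b : 0.0f;
            dgated[n * 2 * H + H + h] = a > 0.0f ? g * a : 0.0f;
        }
    }
}
```

Because each gradient element contributes to both halves of the gated input, the output is twice as wide as input, which is why cast_output is [N, H * 2] and transposed_output is [H * 2, N].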