fused_attn.h¶
Enums and functions for fused attention.
Enums
- 
enum NVTE_QKV_Layout¶
- Memory layouts of QKV tensors - S,- B,- H,- D, and- Tstand for sequence length, batch size, the number of heads, head size, and the total number of sequences in a batch, i.e.- t = sum(s_i) for i = 0...b-1.- SBHDand- BSHD-based layouts are used when sequences in a batch are of equal length or padded to the same length, and- THD-based layouts are used when sequences have different lengths in a batch.- Note - { - NVTE_QKV_INTERLEAVED,- NVTE_KV_INTERLEAVEDand- NVTE_NOT_INTERLEAVEDwill be deprecated in the next release. Please use their equivalent enums instead, i.e.- NVTE_T3HD,- NVTE_THD_T2HDand- NVTE_THD_THD_THDwhen sequences are of variable lengths, and- NVTE_BS3HD,- NVTE_BSHD_BS2HDand- NVTE_BSHD_BSHD_BSHDwhen sequences are of equal length or padded to equal length.}- Values: - 
enumerator NVTE_NOT_INTERLEAVED¶
- Separate Q, K, V tensors. - Q: [total_seqs_q, num_heads, head_dim] | Q Q Q ... Q | \___________ _____________/ total_seqs_q <| \/ | num_heads * head_dim K: [total_seqs_kv, num_heads, head_dim] | K K K ... K | \___________ _____________/ total_seqs_kv <| \/ | num_heads * head_dim V: [total_seqs_kv, num_heads, head_dim] | V V V ... V | \___________ _____________/ total_seqs_kv <| \/ | num_heads * head_dim 
 - 
enumerator NVTE_QKV_INTERLEAVED¶
- Packed QKV. - QKV: [total_seqs, 3, num_heads, head_dim] | Q Q Q ... Q K K K ... K V V V ... V | \___________ _____________/ total_seqs <| \/ | num_heads * head_dim 
 - 
enumerator NVTE_KV_INTERLEAVED¶
- Q and packed KV. - Q: [total_seqs_q, num_heads, head_dim] | Q Q Q ... Q | \___________ _____________/ total_seqs_q <| \/ | num_heads * head_dim KV: [total_seqs_kv, 2, num_heads, head_dim] | K K K ... K V V V ... V | \___________ _____________/ total_seqs_kv <| \/ | num_heads * head_dim 
 - 
enumerator NVTE_SB3HD¶
 - 
enumerator NVTE_SBH3D¶
 - 
enumerator NVTE_SBHD_SB2HD¶
 - 
enumerator NVTE_SBHD_SBH2D¶
 - 
enumerator NVTE_SBHD_SBHD_SBHD¶
 - 
enumerator NVTE_BS3HD¶
 - 
enumerator NVTE_BSH3D¶
 - 
enumerator NVTE_BSHD_BS2HD¶
 - 
enumerator NVTE_BSHD_BSH2D¶
 - 
enumerator NVTE_BSHD_BSHD_BSHD¶
 - 
enumerator NVTE_T3HD¶
 - 
enumerator NVTE_TH3D¶
 - 
enumerator NVTE_THD_T2HD¶
 - 
enumerator NVTE_THD_TH2D¶
 - 
enumerator NVTE_THD_THD_THD¶
 
- 
enumerator NVTE_NOT_INTERLEAVED¶
- 
enum NVTE_QKV_Layout_Group¶
- Grouping of QKV layouts. - Values: - 
enumerator NVTE_3HD¶
- 3HD QKV layouts, e.g. BS3HD 
 - 
enumerator NVTE_H3D¶
- H3D QKV layouts, e.g. BSH3D 
 - 
enumerator NVTE_HD_2HD¶
- HD_2HD QKV layouts, e.g. BSHD_BS2HD 
 - 
enumerator NVTE_HD_H2D¶
- HD_H2D QKV layouts, e.g. BSHD_BSH2D 
 - 
enumerator NVTE_HD_HD_HD¶
- HD_HD_HD QKV layouts, e.g. BSHD_BSHD_BSHD 
 
- 
enumerator NVTE_3HD¶
- 
enum NVTE_QKV_Format¶
- Dimension formats for QKV tensors. - Values: - 
enumerator NVTE_SBHD¶
- SBHD QKV format 
 - 
enumerator NVTE_BSHD¶
- BSHD QKV format 
 - 
enumerator NVTE_THD¶
- THD QKV format 
 
- 
enumerator NVTE_SBHD¶
- 
enum NVTE_Bias_Type¶
- Bias types. - Values: - 
enumerator NVTE_NO_BIAS¶
- No bias 
 - 
enumerator NVTE_PRE_SCALE_BIAS¶
- Bias before scale 
 - 
enumerator NVTE_POST_SCALE_BIAS¶
- Bias after scale 
 
- 
enumerator NVTE_NO_BIAS¶
- 
enum NVTE_Mask_Type¶
- Attention mask types. - Values: - 
enumerator NVTE_NO_MASK¶
- No masking 
 - 
enumerator NVTE_PADDING_MASK¶
- Padding attention mask 
 - 
enumerator NVTE_CAUSAL_MASK¶
- Causal attention mask 
 
- 
enumerator NVTE_NO_MASK¶
- 
enum NVTE_Fused_Attn_Backend¶
- Fused attention backends. - Values: - 
enumerator NVTE_No_Backend¶
- No supported backend 
 - 
enumerator NVTE_F16_max512_seqlen¶
- cuDNN-based FP16/BF16 fused attention for <= 512 sequence length 
 - 
enumerator NVTE_F16_arbitrary_seqlen¶
- cuDNN-based FP16/BF16 fused attention for any sequence length 
 - 
enumerator NVTE_FP8¶
- cuDNN-based FP8 fused attention for <= 512 sequence length 
 
- 
enumerator NVTE_No_Backend¶
Functions
- 
NVTE_QKV_Layout_Group nvte_get_qkv_layout_group(NVTE_QKV_Layout qkv_layout)¶
- Get layout group for a given QKV layout. - Parameters
- qkv_layout – [in] QKV layout, e.g. sbh3d. 
- Returns
- qkv layout group, e.g. h3d. 
 
- 
NVTE_QKV_Format nvte_get_qkv_format(NVTE_QKV_Layout qkv_layout)¶
- Get QKV format for a given QKV layout. - Parameters
- qkv_layout – [in] QKV layout, e.g. sbh3d. 
- Returns
- qkv format, e.g. sbhd. 
 
- 
NVTE_Fused_Attn_Backend nvte_get_fused_attn_backend(NVTEDType q_dtype, NVTEDType kv_dtype, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, float dropout, size_t max_seqlen_q, size_t max_seqlen_kv, size_t head_dim)¶
- Get fused attention backend based on input parameters. - Parameters
- q_dtype – [in] The data type of Tensor Q. 
- kv_dtype – [in] The data type of Tensors K, V. 
- qkv_layout – [in] The layout of Tensors Q, K, V. 
- bias_type – [in] The attention bias type. 
- attn_mask_type – [in] The attention mask type. 
- dropout – [in] The dropout probability. 
- max_seqlen_q – [in] The sequence length of Q. 
- max_seqlen_kv – [in] The sequence length of K, V. 
- head_dim – [in] The head dimension of Q, K, V. 
 
 
- 
void nvte_fused_attn_fwd_qkvpacked(const NVTETensor QKV, const NVTETensor Bias, NVTETensor S, NVTETensor O, NVTETensorPack *Aux_CTX_Tensors, const NVTETensor cu_seqlens, const NVTETensor rng_state, size_t max_seqlen, bool is_training, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute dot product attention with packed QKV input. - Computes: - P = Q * Transpose(K) + Bias 
- S = ScaleMaskSoftmax(P) 
- D = Dropout(S) 
- O = D * Transpose(V) 
 - Support Matrix: - | backend | precision | qkv layout | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | QKV_INTERLEAVED | NO/POST_SCALE_BIAS | PADDING/CAUSAL/NO_MASK | Yes | <= 512 | 64 | | 1 | FP16/BF16 | QKV_INTERLEAVED | NO_BIAS | CAUSAL_MASK | Yes | > 512 | 64, 128 | | 2 | FP8 | QKV_INTERLEAVED | NO_BIAS | PADDING_MASK | Yes | <= 512 | 64 | - Parameters
- QKV – [in] The QKV tensor in packed format, [total_seqs, 3, num_heads, head_dim]. 
- Bias – [in] The Bias tensor. 
- S – [inout] The S tensor. 
- O – [out] The output O tensor. 
- Aux_CTX_Tensors – [out] Auxiliary output tensors when training, e.g. M, ZInv, rng_state. 
- cu_seqlens – [in] Accumulative sequence lengths, [batch_size + 1]. 
- rng_state – [in] Seed and offset of CUDA random number generator. 
- max_seqlen – [in] Max sequence length used for computing, it may be >= max(seqlen_i) for i=0,…batch_size-1. 
- is_training – [in] Whether this is in training mode or inference. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensor’s layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation. 
 
 
- 
void nvte_fused_attn_bwd_qkvpacked(const NVTETensor QKV, const NVTETensor O, const NVTETensor dO, const NVTETensor S, NVTETensor dP, const NVTETensorPack *Aux_CTX_Tensors, NVTETensor dQKV, NVTETensor dBias, const NVTETensor cu_seqlens, size_t max_seqlen, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute the backward of the dot product attention with packed QKV input. - Support Matrix: - | backend | precision | qkv layout | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | QKV_INTERLEAVED | NO/POST_SCALE_BIAS | PADDING/CAUSAL/NO_MASK | Yes | <= 512 | 64 | | 1 | FP16/BF16 | QKV_INTERLEAVED | NO_BIAS | CAUSAL_MASK | Yes | > 512 | 64, 128 | | 2 | FP8 | QKV_INTERLEAVED | NO_BIAS | PADDING_MASK | Yes | <= 512 | 64 | - Parameters
- QKV – [in] The QKV tensor in packed format, [total_seqs, 3, num_heads, head_dim]. 
- O – [in] The O tensor from forward. 
- dO – [in] The gradient of the O tensor. 
- S – [in] The S tensor. 
- dP – [inout] The gradient of the P tensor. 
- Aux_CTX_Tensors – [in] Auxiliary tensors from context when in training mode, e.g. M, ZInv, rng_state. 
- dQKV – [out] The gradient of the QKV tensor. 
- dBias – [out] The gradient of the Bias tensor. 
- cu_seqlens – [in] Accumulative sequence lengths, [batch_size + 1]. 
- max_seqlen – [in] Max sequence length used for computing, it may be >= max(seqlen_i) for i=0,…batch_size-1. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensor’s layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation. 
 
 
- 
void nvte_fused_attn_fwd_kvpacked(const NVTETensor Q, const NVTETensor KV, const NVTETensor Bias, NVTETensor S, NVTETensor O, NVTETensorPack *Aux_CTX_Tensors, const NVTETensor cu_seqlens_q, const NVTETensor cu_seqlens_kv, const NVTETensor rng_state, size_t max_seqlen_q, size_t max_seqlen_kv, bool is_training, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute dot product attention with packed KV input. - Computes: - P = Q * Transpose(K) + Bias 
- S = ScaleMaskSoftmax(P) 
- D = Dropout(S) 
- O = D * Transpose(V) 
 - Support Matrix: - | backend | precision | qkv layout | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | KV_INTERLEAVED | NO/POST_SCALE_BIAS | PADDING/CAUSAL/NO_MASK | Yes | <= 512 | 64 | - Parameters
- Q – [in] The Q tensor, [total_seqs_q, num_heads, head_dim]. 
- KV – [in] The KV tensor, [total_seqs_kv, 2, num_heads, head_dim]. 
- Bias – [in] The Bias tensor. 
- S – [inout] The S tensor. 
- O – [out] The output O tensor. 
- Aux_CTX_Tensors – [out] Auxiliary output tensors when training, e.g. M, ZInv, rng_state. 
- cu_seqlens_q – [in] Accumulative sequence lengths for Q, [batch_size + 1]. 
- cu_seqlens_kv – [in] Accumulative sequence lengths for KV, [batch_size + 1]. 
- rng_state – [in] Seed and offset of CUDA random number generator. 
- max_seqlen_q – [in] Max sequence length used for computing for Q. it may be >= max(seqlen_q_i) for i=0,…batch_size-1. 
- max_seqlen_kv – [in] Max sequence length used for computing for KV. it may be >= max(seqlen_kv_i) for i=0,…batch_size-1. 
- is_training – [in] Whether this is in training mode or inference. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensor’s layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation. 
 
 
- 
void nvte_fused_attn_bwd_kvpacked(const NVTETensor Q, const NVTETensor KV, const NVTETensor O, const NVTETensor dO, const NVTETensor S, NVTETensor dP, const NVTETensorPack *Aux_CTX_Tensors, NVTETensor dQ, NVTETensor dKV, NVTETensor dBias, const NVTETensor cu_seqlens_q, const NVTETensor cu_seqlens_kv, size_t max_seqlen_q, size_t max_seqlen_kv, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute the backward of the dot product attention with packed KV input. - Support Matrix: - | backend | precision | qkv layout | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | KV_INTERLEAVED | NO/POST_SCALE_BIAS | PADDING/CAUSAL/NO_MASK | Yes | <= 512 | 64 | - Parameters
- Q – [in] The Q tensor, [total_seqs_q, num_heads, head_dim]. 
- KV – [in] The KV tensor, [total_seqs_kv, 2, num_heads, head_dim]. 
- O – [in] The O tensor from forward. 
- dO – [in] The gradient of the O tensor. 
- S – [in] The S tensor. 
- dP – [inout] The gradient of the P tensor. 
- Aux_CTX_Tensors – [in] Auxiliary tensors from context when in training mode, e.g. M, ZInv, rng_state. 
- dQ – [out] The gradient of the Q tensor. 
- dKV – [out] The gradient of the KV tensor. 
- dBias – [out] The gradient of the Bias tensor. 
- cu_seqlens_q – [in] Accumulative sequence lengths for Q, [batch_size + 1]. 
- cu_seqlens_kv – [in] Accumulative sequence lengths for KV, [batch_size + 1]. 
- max_seqlen_q – [in] Max sequence length used for computing for Q. it may be >= max(seqlen_q_i) for i=0,…batch_size-1. 
- max_seqlen_kv – [in] Max sequence length used for computing for KV. it may be >= max(seqlen_kv_i) for i=0,…batch_size-1. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensor’s layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation. 
 
 
- 
void nvte_fused_attn_fwd(const NVTETensor Q, const NVTETensor K, const NVTETensor V, const NVTETensor Bias, NVTETensor S, NVTETensor O, NVTETensorPack *Aux_CTX_Tensors, const NVTETensor cu_seqlens_q, const NVTETensor cu_seqlens_kv, const NVTETensor rng_state, size_t max_seqlen_q, size_t max_seqlen_kv, bool is_training, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute dot product attention with separate Q, K and V. - Computes: - P = Q * Transpose(K) + Bias 
- S = ScaleMaskSoftmax(P) 
- D = Dropout(S) 
- O = D * Transpose(V) 
 - Support Matrix: - | backend | precision | qkv format | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | SBHD, BSHD | NO/POST_SCALE_BIAS | PADDING/CAUSAL_MASK | Yes | <= 512 | 64 | | 1 | FP16/BF16 | SBHD, BSHD | NO/POST_SCALE_BIAS | CAUSAL_MASK | Yes | > 512 | 64, 128 | | 2 | FP8 | THD | NO_BIAS | PADDING_MASK | Yes | <= 512 | 64 | - Parameters
- Q – [in] The Q tensor. 
- K – [in] The K tensor. 
- V – [in] The V tensor. 
- Bias – [in] The Bias tensor. 
- S – [inout] The S tensor. 
- O – [out] The output O tensor. 
- Aux_CTX_Tensors – [out] Auxiliary output tensors when training, e.g. M, ZInv, rng_state. 
- cu_seqlens_q – [in] Cumulative sequence lengths for Q, [batch_size + 1]. 
- cu_seqlens_kv – [in] Cumulative sequence lengths for K and V, [batch_size + 1]. 
- rng_state – [in] Seed and offset of CUDA random number generator. 
- max_seqlen_q – [in] Max sequence length used for computing for Q. it may be >= max(seqlen_q_i) for i=0,…batch_size-1. 
- max_seqlen_kv – [in] Max sequence length used for computing for K and V. it may be >= max(seqlen_kv_i) for i=0,…batch_size-1. 
- is_training – [in] Whether this is in training mode or inference. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensors’ layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation. 
 
 
- 
void nvte_fused_attn_bwd(const NVTETensor Q, const NVTETensor K, const NVTETensor V, const NVTETensor O, const NVTETensor dO, const NVTETensor S, NVTETensor dP, const NVTETensorPack *Aux_CTX_Tensors, NVTETensor dQ, NVTETensor dK, NVTETensor dV, NVTETensor dBias, const NVTETensor cu_seqlens_q, const NVTETensor cu_seqlens_kv, size_t max_seqlen_q, size_t max_seqlen_kv, float attn_scale, float dropout, NVTE_QKV_Layout qkv_layout, NVTE_Bias_Type bias_type, NVTE_Mask_Type attn_mask_type, NVTETensor workspace, cudaStream_t stream)¶
- Compute the backward of the dot product attention with separate Q, K and V. - Support Matrix: - | backend | precision | qkv format | bias | mask | dropout | sequence length | head_dim | | 0 | FP16/BF16 | SBHD, BSHD | NO/POST_SCALE_BIAS | PADDING/CAUSAL_MASK | Yes | <= 512 | 64 | | 1 | FP16/BF16 | SBHD, BSHD | NO/POST_SCALE_BIAS | CAUSAL_MASK | Yes | > 512 | 64, 128 | | 2 | FP8 | THD | NO_BIAS | PADDING_MASK | Yes | <= 512 | 64 | - Parameters
- Q – [in] The Q tensor. 
- K – [in] The K tensor. 
- V – [in] The V tensor. 
- O – [in] The O tensor from forward. 
- dO – [in] The gradient of the O tensor. 
- S – [in] The S tensor. 
- dP – [inout] The gradient of the P tensor. 
- Aux_CTX_Tensors – [in] Auxiliary tensors from context when in training mode, e.g. M, ZInv, rng_state. 
- dQ – [out] The gradient of the Q tensor. 
- dK – [out] The gradient of the K tensor. 
- dV – [out] The gradient of the V tensor. 
- dBias – [out] The gradient of the Bias tensor. 
- cu_seqlens_q – [in] Cumulative sequence lengths for Q, [batch_size + 1]. 
- cu_seqlens_kv – [in] Cumulative sequence lengths for K and V, [batch_size + 1]. 
- max_seqlen_q – [in] Max sequence length used for computing for Q. it may be >= max(seqlen_q_i) for i=0,…batch_size-1. 
- max_seqlen_kv – [in] Max sequence length used for computing for K and V. it may be >= max(seqlen_kv_i) for i=0,…batch_size-1. 
- attn_scale – [in] Scaling factor for Q * K.T. 
- dropout – [in] Dropout probability. 
- qkv_layout – [in] QKV tensors’ layout. 
- bias_type – [in] Bias type. 
- attn_mask_type – [in] Attention mask type. 
- workspace – [in] Workspace tensor. 
- stream – [in] CUDA stream used for this operation.