core.packed_seq_params#

Module Contents#

Classes#

PackedSeqParams

Parameters passed to TEDotProductAttention and fused RoPE kernels for the thd (packed) sequence format.

API#

class core.packed_seq_params.PackedSeqParams#

Parameters passed to TEDotProductAttention and fused RoPE kernels for the thd (packed) sequence format.

qkv_format: str#

None

cu_seqlens_q: torch.Tensor#

None

cu_seqlens_kv: torch.Tensor#

None

cu_seqlens_q_padded: torch.Tensor#

None

cu_seqlens_kv_padded: torch.Tensor#

None

max_seqlen_q: int#

None

max_seqlen_kv: int#

None

local_cp_size: int#

None

cp_group: torch.distributed.ProcessGroup#

None

total_tokens: int#

None

seq_idx: torch.Tensor#

None
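The `cu_seqlens_*` attributes hold cumulative sequence lengths, the standard input format for variable-length (thd) attention kernels. As a minimal sketch, assuming a hypothetical packed batch of three sequences with lengths 5, 2, and 4, the cumulative lengths can be built like this:

```python
import torch

# Hypothetical per-sequence lengths for a packed (thd) batch.
seqlens = torch.tensor([5, 2, 4], dtype=torch.int32)

# Cumulative sequence lengths with a leading zero, the form that
# cu_seqlens_q / cu_seqlens_kv take.
cu_seqlens = torch.cat(
    [
        torch.zeros(1, dtype=torch.int32),
        torch.cumsum(seqlens, dim=0, dtype=torch.int32),
    ]
)
print(cu_seqlens.tolist())  # [0, 5, 7, 11]
```

Entry `i` of `cu_seqlens` is the token offset at which sequence `i` starts, and the final entry is the total number of packed tokens.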

__post_init__()#

Pre-compute seq_idx for Mamba mixer CUDA graph compatibility.

If total_tokens is 16 (for example), this method takes cu_seqlens_q_padded (falling back to cu_seqlens_q), of the form [0, 5, 7, 11], and produces a tensor of the form [0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3], i.e. [0] * (5 - 0) + [1] * (7 - 5) + [2] * (11 - 7) + [3] * (16 - 11). In this example there are three sequences in the pack; in general, the output contains one additional sequence index (3 here) so that any tokens beyond the last padded input sequence are accounted for as an extra sequence. However, if cu_seqlens_q_padded[-1] == max_seqlen, this additional sequence index is not included.
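The computation above can be sketched as follows. This is an illustrative reimplementation, not the library's actual code; the helper name `compute_seq_idx` is hypothetical.

```python
import torch

def compute_seq_idx(cu_seqlens: torch.Tensor, total_tokens: int) -> torch.Tensor:
    # Per-sequence lengths: [5, 2, 4] for cu_seqlens [0, 5, 7, 11].
    lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
    # Tokens beyond the last padded sequence get one extra index,
    # unless the pack already ends exactly at total_tokens.
    tail = total_tokens - int(cu_seqlens[-1])
    if tail > 0:
        lengths.append(tail)
    # Repeat each sequence index by that sequence's length.
    return torch.repeat_interleave(
        torch.arange(len(lengths), dtype=torch.int32),
        torch.tensor(lengths),
    )

cu = torch.tensor([0, 5, 7, 11], dtype=torch.int32)
print(compute_seq_idx(cu, 16).tolist())
# [0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]
```

Pre-computing this tensor once, rather than deriving it inside the forward pass, keeps the shapes and values static, which is what makes the result usable under CUDA graph capture for the Mamba mixer.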