nemo_automodel.components.datasets.llm.packed_sequence
Module Contents
Functions
Data
API
Converts a pack into tensors. Pack comes in as a dict of lists and is converted to tensors.
Pads a pack to packed_sequence_size.
seq_lens contains original lengths. seq_lens_padded applies CP padding (if cp_size > 1) and pack-level padding.
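The relationship between seq_lens and seq_lens_padded can be illustrated with a small plain-Python sketch. The helper names `round_up` and `padded_seq_lens` are hypothetical and not part of this module. The sketch assumes that CP padding rounds each sequence up to a multiple of 2*cp_size and that the remaining pack-level padding is folded into the last sequence; the actual implementation may distribute padding differently.

```python
def round_up(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple


def padded_seq_lens(seq_lens, packed_sequence_size, cp_size=1):
    """Hypothetical helper: derive padded lengths from original lengths."""
    padded = list(seq_lens)
    if cp_size > 1:
        # CP padding: every sequence becomes divisible by 2 * cp_size.
        padded = [round_up(n, 2 * cp_size) for n in padded]
    total = sum(padded)
    if total > packed_sequence_size:
        raise ValueError("padded sequences do not fit in the pack")
    if padded:
        # Assumption: pack-level padding is attributed to the last sequence.
        padded[-1] += packed_sequence_size - total
    return padded


print(padded_seq_lens([3, 2, 1], packed_sequence_size=8))              # [3, 2, 3]
print(padded_seq_lens([3, 2, 1], packed_sequence_size=16, cp_size=2))  # [4, 4, 8]
```

In both cases the padded lengths sum to packed_sequence_size, while seq_lens itself stays [3, 2, 1].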
If max_packs is set, stop packing once that number of packs has been reached.
Splits the current pack at the boundary, processes it, adds it to packs.
…and returns the start of the next pack.
TODO(@akoumparouli): refactor.
Converts a pack to tensors, pads it, and returns it.
Build a [B, 1, T, T] additive block-causal mask directly on device.
In-document causal attention is allowed (0); cross-document and padding
positions are finfo(dtype).min. seq_lens is the [B, max_docs]
0-padded per-document length tensor; each row’s non-zero entries sum to
seq_length (trailing pad folded into the final document).
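A minimal sketch of how such an additive mask could be built from seq_lens is shown here. The function name `additive_block_causal_mask` is hypothetical, and the sketch is not the module's implementation. It assumes each row of seq_lens sums to seq_length (so padding is already folded into the final document) and therefore only masks cross-document and non-causal positions.

```python
import torch


def additive_block_causal_mask(seq_lens, seq_length, dtype=torch.float32):
    """Hypothetical sketch: [B, max_docs] lengths -> [B, 1, T, T] additive mask."""
    device = seq_lens.device
    batch_size = seq_lens.shape[0]
    positions = torch.arange(seq_length, device=device)

    # Document id of each position = number of document ends at or before it.
    ends = seq_lens.cumsum(dim=1)                                          # [B, max_docs]
    doc_ids = (positions[None, :, None] >= ends[:, None, :]).sum(dim=-1)  # [B, T]

    same_doc = doc_ids[:, :, None] == doc_ids[:, None, :]   # [B, T, T]
    causal = positions[:, None] >= positions[None, :]       # [T, T]
    allowed = same_doc & causal

    # 0 where attention is allowed, finfo(dtype).min everywhere else.
    mask = torch.zeros(batch_size, 1, seq_length, seq_length, dtype=dtype, device=device)
    mask.masked_fill_(~allowed[:, None], torch.finfo(dtype).min)
    return mask


mask = additive_block_causal_mask(torch.tensor([[3, 2, 1, 0]]), seq_length=6)
print(mask.shape)  # torch.Size([1, 1, 6, 6])
```

Zero-length entries in seq_lens contribute no positions, so the 0-padding in the last column has no effect on the result.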
Creates causal mask block for specified lengths.
In particular, given a batch tensor of seq lens defining the lengths of samples in each pack, construct a 2D block causal mask for each pack in the batch. For example, if a single sample’s seq_lens is [3, 2, 1], the mask would be::

    mask = [
        [1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 1],
    ]
Parameters:
Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.
Returns: torch.Tensor
Block causal mask of shape (batch_size, packed_sequence_size, packed_sequence_size).
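One way to produce a mask of this shape is to build a lower-triangular block per sequence and place the blocks on the diagonal. The following is a sketch under that assumption, not the module's code; the name `block_causal_mask` is hypothetical, and it assumes every row of seq_lens sums to the same packed_sequence_size so the per-pack masks can be stacked.

```python
import torch


def block_causal_mask(seq_lens):
    """Hypothetical sketch: [B, n] lengths -> [B, T, T] boolean block causal mask."""
    masks = []
    for row in seq_lens:
        # One lower-triangular (causal) block per non-empty sequence.
        blocks = [
            torch.tril(torch.ones(n, n, dtype=torch.bool))
            for n in row.tolist()
            if n > 0
        ]
        # Blocks on the diagonal, False everywhere else.
        masks.append(torch.block_diag(*blocks))
    return torch.stack(masks)


mask = block_causal_mask(torch.tensor([[3, 2, 1]]))
print(mask[0].int())
```

For seq_lens [3, 2, 1] this reproduces the 6 x 6 example above, with True in place of 1.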
Pack the dataset to defined length.
In particular, it iterates through the dataset, using a buffer to hold samples until packed_sequence_size is reached, then appends the buffer to packs as a single “packed” sample. This continues until max_packs is reached or the dataset is exhausted.
Parameters:
Actual dataset (can be ‘train’, ‘val’ or ‘test’)
Whether the dataset is ‘train’, ‘val’ or ‘test’
Number of tokens in a pack
Maximum number of packs. Default: None
If True, drop samples that are longer than packed_sequence_size.
Context parallel size. When > 1, each sequence will be padded to be divisible by 2*cp_size for context parallel processing. Default: 1 (no CP).
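The buffering loop described above can be sketched in plain Python. This is a simplified illustration, not the module's implementation: the function name `greedy_pack`, the parameter name `drop_long_samples`, and the `input_ids` key are assumptions, samples are plain token lists, a sample that does not fit starts a new pack instead of being split at the boundary, and padding and CP handling are omitted.

```python
def greedy_pack(samples, packed_sequence_size, max_packs=None, drop_long_samples=False):
    """Hypothetical sketch of greedy packing over an iterable of token lists."""
    packs = []
    tokens, seq_lens = [], []  # the buffer for the pack being built
    for sample in samples:
        n = len(sample)
        if n > packed_sequence_size:
            if drop_long_samples:
                continue  # skip samples that can never fit in a pack
            raise ValueError(f"sample of length {n} exceeds {packed_sequence_size}")
        if len(tokens) + n > packed_sequence_size:
            # Buffer is full: emit it as one packed sample and start a new one.
            packs.append({"input_ids": tokens, "seq_lens": seq_lens})
            tokens, seq_lens = [], []
            if max_packs is not None and len(packs) >= max_packs:
                return packs
        tokens = tokens + list(sample)
        seq_lens = seq_lens + [n]
    # Flush whatever is left at the end of the dataset.
    if tokens and (max_packs is None or len(packs) < max_packs):
        packs.append({"input_ids": tokens, "seq_lens": seq_lens})
    return packs


samples = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10], [11] * 7]
for pack in greedy_pack(samples, packed_sequence_size=6, drop_long_samples=True):
    print(pack)
```

Here the first three samples fill one pack exactly, the four-token sample starts a second pack, and the seven-token sample is dropped because it exceeds packed_sequence_size.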
Create a 2D block causal document mask for a batch of packed sequences.
Parameters:
Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.
Returns:
A BlockMask, or a Tensor if the torch version is below 2.5.0.
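A document mask of this kind is typically expressed as a predicate over (batch, head, query index, key index). The sketch below shows one possible predicate and evaluates it densely so the result can be inspected. The names `document_ids` and `make_document_mask_mod` are hypothetical; it is an assumption, not something confirmed by this page, that the module builds its BlockMask from a predicate of this form (for example via `create_block_mask` in `torch.nn.attention.flex_attention` on torch 2.5 or newer).

```python
import torch


def document_ids(seq_lens):
    """Hypothetical helper: [B, n] lengths -> [B, T] document id per position."""
    return torch.stack(
        [torch.repeat_interleave(torch.arange(row.numel()), row) for row in seq_lens]
    )


def make_document_mask_mod(doc_ids):
    """Return a predicate that allows causal attention within one document only."""
    def mask_mod(b, h, q_idx, kv_idx):
        same_doc = doc_ids[b, q_idx] == doc_ids[b, kv_idx]
        return same_doc & (q_idx >= kv_idx)
    return mask_mod


seq_lens = torch.tensor([[3, 2, 1]])
doc_ids = document_ids(seq_lens)  # tensor([[0, 0, 0, 1, 1, 2]])
mask_mod = make_document_mask_mod(doc_ids)

# Dense evaluation of the predicate for batch 0, head 0.
T = doc_ids.shape[1]
q = torch.arange(T)[:, None]
kv = torch.arange(T)[None, :]
print(mask_mod(0, 0, q, kv).int())
```

The dense result matches the block causal layout for seq_lens [3, 2, 1]; a block-sparse representation encodes the same predicate without materializing the full T x T tensor.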