nemo_automodel.components.datasets.llm.packed_sequence

Module Contents

Functions

Name	Description
`_convert_to_tensors`	Converts a pack into tensors. Pack comes in as a dict of lists and is converted to tensors.
`_fill_labels_with_cross_entropy_ignore_idx`	-
`_pad_pack`	Pads a pack to `packed_sequence_size`.
`_should_stop_packing`	If max packs is set, stop packing when we reach that number.
`_split_and_add_pack`	Splits the current pack at the boundary, processes it, adds it to `packs`.
`_tensorize_and_pad_pack`	Converts a pack to tensors, pads it, and returns it.
`build_block_causal_additive_mask`	Build a `[B, 1, T, T]` additive block-causal mask directly on `device`.
`create_block_causal_mask`	Creates causal mask block for specified lengths.
`pack_dataset`	Pack the dataset to defined length.
`packed_block_causal_mask`	Create a 2D block causal document mask for a batch of packed sequences.

Data

CROSS_ENTROPY_IGNORE_IDX

PACK_TYPE

logger

API

nemo_automodel.components.datasets.llm.packed_sequence._convert_to_tensors(
    pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE
) -> nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE

Converts a pack into tensors. Pack comes in as a dict of lists and is converted to tensors.

nemo_automodel.components.datasets.llm.packed_sequence._fill_labels_with_cross_entropy_ignore_idx(
    labels: list[int],
    loss_mask: list[int]
) -> list[int]
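
The source gives no docstring for this function. Judging only by its name and signature, it probably replaces the label at every position that `loss_mask` excludes with `CROSS_ENTROPY_IGNORE_IDX`. The sketch below illustrates that reading. The helper name `fill_labels_sketch` and the convention that a mask value of 0 means "ignore" are assumptions, not confirmed behavior of the library.

```python
CROSS_ENTROPY_IGNORE_IDX = -100  # value documented for this module


def fill_labels_sketch(labels, loss_mask, ignore_idx=CROSS_ENTROPY_IGNORE_IDX):
    """Hypothetical stand-in: keep a label where the mask is 1, else use ignore_idx."""
    if len(labels) != len(loss_mask):
        raise ValueError("labels and loss_mask must have the same length")
    # Assumption: loss_mask uses 1 for "contributes to the loss" and 0 for "ignore".
    return [label if keep else ignore_idx for label, keep in zip(labels, loss_mask)]


# Prompt tokens (mask 0) are ignored; response tokens (mask 1) keep their labels.
result = fill_labels_sketch([11, 12, 13, 14], [0, 0, 1, 1])
```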

nemo_automodel.components.datasets.llm.packed_sequence._pad_pack(
    pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
    padding_idx: int,
    packed_sequence_size: int,
    cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX,
    cp_size: int = 1
) -> nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE

Pads a pack to packed_sequence_size.

seq_lens contains original lengths. seq_lens_padded applies CP padding (if cp_size > 1) and pack-level padding.
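
As a rough illustration of how the two length lists can differ, the sketch below rounds each sequence up to a multiple of 2*cp_size (the rule stated for `cp_size` under `pack_dataset`) and then accounts for pack-level padding. The helper name `padded_seq_lens_sketch` is hypothetical, and folding the leftover pack padding into the last sequence is an assumption made for the example, not a statement about the actual implementation.

```python
def padded_seq_lens_sketch(seq_lens, packed_sequence_size, cp_size=1):
    """Hypothetical helper: derive padded lengths from the original lengths."""
    if cp_size > 1:
        multiple = 2 * cp_size
        # Ceiling division: round each length up to the next multiple of 2 * cp_size.
        padded = [-(-n // multiple) * multiple for n in seq_lens]
    else:
        padded = list(seq_lens)

    remainder = packed_sequence_size - sum(padded)
    if remainder < 0:
        raise ValueError("padded sequences do not fit in packed_sequence_size")
    # Assumption: the pack-level padding is attributed to the final sequence.
    if padded:
        padded[-1] += remainder
    return padded


no_cp = padded_seq_lens_sketch([5, 3], packed_sequence_size=16)
with_cp = padded_seq_lens_sketch([5, 3], packed_sequence_size=16, cp_size=2)
```

With `cp_size=2`, the lengths 5 and 3 are first rounded up to 8 and 4, and the remaining 4 positions are then added to the last entry.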

nemo_automodel.components.datasets.llm.packed_sequence._should_stop_packing(
    max_packs: int,
    packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE]
) -> bool

If max packs is set, stop packing when we reach that number.

nemo_automodel.components.datasets.llm.packed_sequence._split_and_add_pack(
    current_pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
    packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE],
    previous_sample_boundary: int,
    packed_sequence_size: int,
    padding_idx: int,
    cross_entropy_ignore_idx = CROSS_ENTROPY_IGNORE_IDX,
    cp_size: int = 1
) -> nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE

Splits the current pack at the boundary, processes it, adds it to packs, and returns the start of the next pack.

TODO(@akoumparouli): refactor.

nemo_automodel.components.datasets.llm.packed_sequence._tensorize_and_pad_pack(
    pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
    padding_idx: int,
    packed_sequence_size: int,
    cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX,
    cp_size: int = 1
) -> None

Converts a pack to tensors, pads it, and returns it.

nemo_automodel.components.datasets.llm.packed_sequence.build_block_causal_additive_mask(
    seq_lens: torch.Tensor,
    seq_length: int,
    dtype: torch.dtype,
    device: torch.device
) -> torch.Tensor

Build a [B, 1, T, T] additive block-causal mask directly on device.

In-document causal attention is allowed (0); cross-document and padding positions are finfo(dtype).min. seq_lens is the [B, max_docs] 0-padded per-document length tensor; each row’s non-zero entries sum to seq_length (trailing pad folded into the final document).
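
The additive convention can be sketched in plain Python for a single batch row. This is a simplified illustration, not the library's code: it omits the batch and head dimensions, the name `additive_mask_row_sketch` is hypothetical, and `-sys.float_info.max` stands in for `finfo(dtype).min` (it is the equivalent value for float64 only).

```python
import sys

NEG = -sys.float_info.max  # stand-in for finfo(dtype).min, float64 only


def additive_mask_row_sketch(doc_lens, seq_length):
    """Hypothetical helper: build one [T, T] additive mask from 0-padded lengths."""
    doc_lens = [n for n in doc_lens if n > 0]  # drop the 0 padding entries
    if sum(doc_lens) != seq_length:
        raise ValueError("non-zero lengths must sum to seq_length")
    # doc_id[t] is the index of the document that owns position t.
    doc_id = [i for i, n in enumerate(doc_lens) for _ in range(n)]
    # A query q may attend to key k only if both are in the same document and k <= q.
    return [
        [0.0 if doc_id[q] == doc_id[k] and k <= q else NEG for k in range(seq_length)]
        for q in range(seq_length)
    ]


mask = additive_mask_row_sketch([2, 1, 0], seq_length=3)
```

Adding such a mask to the attention scores before the softmax leaves allowed positions unchanged and drives disallowed positions to a weight of effectively zero.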

nemo_automodel.components.datasets.llm.packed_sequence.create_block_causal_mask(
    seq_lens: list[torch.Tensor]
) -> torch.Tensor

Creates causal mask block for specified lengths.

In particular, given a batch tensor of seq lens defining the lengths of samples in each pack, construct a 2D block causal mask for each pack in the batch. For example, if a single sample’s seq_lens is [3, 2, 1], the mask would be:

    mask = [
        [1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 1],
    ]

Parameters:

seq_lens

List[torch.Tensor]

Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns: torch.Tensor

Block causal mask of shape (batch_size, packed_sequence_size, packed_sequence_size).
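
The [3, 2, 1] example from the description can be reproduced with a small pure-Python sketch for a single pack. The function name `block_causal_mask_sketch` is hypothetical, and nested lists are used instead of tensors so the example has no dependencies; the documented function operates on a batch and returns a torch.Tensor.

```python
def block_causal_mask_sketch(seq_lens):
    """Hypothetical helper: 0/1 block causal mask for one pack, as nested lists."""
    total = sum(seq_lens)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for n in seq_lens:
        # Fill a lower-triangular n x n block on the diagonal for this sample.
        for q in range(n):
            for k in range(q + 1):
                mask[start + q][start + k] = 1
        start += n
    return mask


mask = block_causal_mask_sketch([3, 2, 1])
for row in mask:
    print(row)
```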

nemo_automodel.components.datasets.llm.packed_sequence.pack_dataset(
    dataset,
    split,
    packed_sequence_size,
    max_packs = None,
    padding_idx = 0,
    drop_long_samples = True,
    cp_size = 1
)

Pack the dataset to defined length.

In particular, it iterates through the dataset, using a buffer to hold samples until packed_sequence_size is reached, then appends the buffer to packs as a single “packed” sample. It continues until max_packs is reached or the dataset ends.

Parameters:

dataset

Actual dataset (can be ‘train’, ‘val’ or ‘test’)

split

str

Whether the dataset is ‘train’, ‘val’ or ‘test’

packed_sequence_size

int

Number of tokens in a pack

max_packs

int, defaults to None

Maximum number of packs. Default: None

drop_long_samples

bool, defaults to True

If True, drop samples that are longer than packed_sequence_size.

cp_size

int, defaults to 1

Context parallel size. When > 1, each sequence will be padded to be divisible by 2*cp_size for context parallel processing. Default: 1 (no CP).
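
The buffering loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: samples are plain token lists rather than dataset rows, padding and context-parallel handling are omitted, the name `pack_sketch` and the output keys are hypothetical, and raising an error when `drop_long_samples` is False is an assumption.

```python
def pack_sketch(samples, packed_sequence_size, max_packs=None, drop_long_samples=True):
    """Hypothetical greedy packer: fill a buffer, flush it when the next sample won't fit."""
    packs, buffer, seq_lens = [], [], []
    for tokens in samples:
        if len(tokens) > packed_sequence_size:
            if drop_long_samples:
                continue  # skip samples that can never fit in a pack
            # Assumption: without dropping, an oversized sample is an error.
            raise ValueError("sample is longer than packed_sequence_size")
        if len(buffer) + len(tokens) > packed_sequence_size:
            # Flush at the previous sample boundary and start a new pack.
            packs.append({"input_ids": buffer, "seq_lens": seq_lens})
            if max_packs is not None and len(packs) >= max_packs:
                return packs
            buffer, seq_lens = [], []
        buffer = buffer + list(tokens)
        seq_lens = seq_lens + [len(tokens)]
    if buffer and (max_packs is None or len(packs) < max_packs):
        packs.append({"input_ids": buffer, "seq_lens": seq_lens})
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 9, [11]]
packs = pack_sketch(samples, packed_sequence_size=6)
first_only = pack_sketch(samples, packed_sequence_size=6, max_packs=1)
```

Here the nine-token sample is dropped because it exceeds the pack size, and the remaining samples land in two packs with `seq_lens` of [3, 2] and [4, 1].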

nemo_automodel.components.datasets.llm.packed_sequence.packed_block_causal_mask(
    seq_lens: list[torch.Tensor]
)

Create a 2D block causal document mask for a batch of packed sequences.

Parameters:

seq_lens

List[torch.Tensor]

Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns:

A BlockMask, or a Tensor if the torch version is < 2.5.0.

nemo_automodel.components.datasets.llm.packed_sequence.CROSS_ENTROPY_IGNORE_IDX = -100

nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE = dict[str, torch.Tensor | list[int]]

nemo_automodel.components.datasets.llm.packed_sequence.logger = logging.getLogger(__name__)