`nemo_automodel.components.datasets.llm.packed_sequence`#

Module Contents#

Functions#

`_fill_labels_with_cross_entropy_ignore_idx`
`_pad_pack`	Pads a pack to `packed_sequence_size`.
`_convert_to_tensors`	Converts a pack into tensors. Pack comes in as a dict of lists and is converted to tensors.
`_tensorize_and_pad_pack`	Converts a pack to tensors, pads it, and returns it.
`_should_stop_packing`	If max packs is set, stop packing when we reach that number.
`_split_and_add_pack`	Splits the current pack at the boundary, processes it, adds it to `packs`.
`pack_dataset`	Pack the dataset to defined length.
`create_block_causal_mask`	Creates causal mask block for specified lengths.
`packed_block_causal_mask`	Create a 2D block causal document mask for a batch of packed sequences.

Data#

`logger`
`CROSS_ENTROPY_IGNORE_IDX`
`PACK_TYPE`

API#

nemo_automodel.components.datasets.llm.packed_sequence.logger#: ‘getLogger(…)’

nemo_automodel.components.datasets.llm.packed_sequence.CROSS_ENTROPY_IGNORE_IDX#: None

nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE#: None

nemo_automodel.components.datasets.llm.packed_sequence._fill_labels_with_cross_entropy_ignore_idx( labels: list[int], loss_mask: list[int], ) → list[int]#
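This entry has no docstring, so the following is only a minimal sketch of what the name and signature suggest: labels whose `loss_mask` entry is 0 are replaced with the ignore index so they do not contribute to the loss. The function name `fill_labels_with_ignore_idx` is hypothetical, and the value -100 is an assumption (it is the default `ignore_index` of PyTorch's cross-entropy loss); the module's actual constant is not shown on this page.

```python
# Assumption: -100, the default ignore_index of PyTorch's cross-entropy loss.
CROSS_ENTROPY_IGNORE_IDX = -100


def fill_labels_with_ignore_idx(labels: list[int], loss_mask: list[int]) -> list[int]:
    """Hypothetical stand-in: keep a label only where its loss_mask entry is non-zero."""
    if len(labels) != len(loss_mask):
        raise ValueError("labels and loss_mask must have the same length")
    return [
        label if keep else CROSS_ENTROPY_IGNORE_IDX
        for label, keep in zip(labels, loss_mask)
    ]


# The first two positions (e.g. prompt tokens) are excluded from the loss.
masked = fill_labels_with_ignore_idx([11, 12, 13, 14], [0, 0, 1, 1])
print(masked)  # [-100, -100, 13, 14]
```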

nemo_automodel.components.datasets.llm.packed_sequence._pad_pack( pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE, padding_idx: int, packed_sequence_size: int, cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX, cp_size: int = 1, ) → nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE#

Pads a pack to packed_sequence_size.

seq_lens contains original lengths. seq_lens_padded applies CP padding (if cp_size > 1) and pack-level padding.
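The relationship between `seq_lens` and `seq_lens_padded` can be illustrated with a small sketch. It assumes that CP padding rounds each sequence length up to a multiple of `2 * cp_size` (as described for `pack_dataset`) and that the remaining pack-level padding is attributed to the last sequence; the second point is an assumption of this sketch, not something this page states. The helper names are hypothetical.

```python
def round_up(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple."""
    return -(-length // multiple) * multiple


def padded_seq_lens(seq_lens: list[int], packed_sequence_size: int, cp_size: int = 1) -> list[int]:
    """Hypothetical helper: derive padded lengths from the original lengths."""
    # CP padding: make every sequence length divisible by 2 * cp_size.
    if cp_size > 1:
        padded = [round_up(n, 2 * cp_size) for n in seq_lens]
    else:
        padded = list(seq_lens)

    # Pack-level padding: fill the pack up to packed_sequence_size.
    remaining = packed_sequence_size - sum(padded)
    if remaining < 0:
        raise ValueError("padded sequences do not fit in the pack")
    if padded:
        # Assumption: the leftover padding is counted toward the last sequence.
        padded[-1] += remaining
    return padded


print(padded_seq_lens([5, 3], packed_sequence_size=16, cp_size=2))  # [8, 8]
print(padded_seq_lens([5, 3], packed_sequence_size=16))             # [5, 11]
```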

nemo_automodel.components.datasets.llm.packed_sequence._convert_to_tensors( pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE, ) → nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE#: Converts a pack into tensors. Pack comes in as a dict of lists and is converted to tensors.

nemo_automodel.components.datasets.llm.packed_sequence._tensorize_and_pad_pack( pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE, padding_idx: int, packed_sequence_size: int, cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX, cp_size: int = 1, ) → None#: Converts a pack to tensors, pads it, and returns it.

nemo_automodel.components.datasets.llm.packed_sequence._should_stop_packing( max_packs: int, packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE], ) → bool#: If max packs is set, stop packing when we reach that number.

nemo_automodel.components.datasets.llm.packed_sequence._split_and_add_pack( current_pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE, packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE], previous_sample_boundary: int, packed_sequence_size: int, padding_idx: int, cross_entropy_ignore_idx=CROSS_ENTROPY_IGNORE_IDX, cp_size: int = 1, ) → nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE#

Splits the current pack at the boundary, processes it, adds it to packs, and returns the start of the next pack.

TODO(@akoumparouli): refactor.

nemo_automodel.components.datasets.llm.packed_sequence.pack_dataset( dataset, split, packed_sequence_size, max_packs=None, padding_idx=0, drop_long_samples=False, cp_size=1, )#

Pack the dataset to defined length.

In particular, it iterates through the dataset, using a buffer to hold samples until packed_sequence_size is reached, then appends the buffer to packs as a single “packed” sample. This continues until max_packs is reached or the dataset is exhausted.

Parameters:

dataset – The dataset to pack (the ‘train’, ‘val’ or ‘test’ split).
split (str) – Which split the dataset is: ‘train’, ‘val’ or ‘test’.
packed_sequence_size (int) – Number of tokens in a pack.
max_packs (int) – Maximum number of packs. Default: None (no limit).
padding_idx (int) – Padding index used when padding a pack to packed_sequence_size. Default: 0.
drop_long_samples (bool) – If True, drop samples that are longer than packed_sequence_size. Default: False.
cp_size (int) – Context parallel size. When > 1, each sequence is padded to a length divisible by 2*cp_size for context parallel processing. Default: 1 (no CP).
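The buffering loop described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: samples are plain lists of token IDs, the pack keys `input_ids` and `seq_lens` are hypothetical, CP padding and label handling are omitted, and raising on an over-long sample when drop_long_samples is False is a choice made for this sketch.

```python
from typing import Iterable, Optional


def pack_samples(
    samples: Iterable[list[int]],
    packed_sequence_size: int,
    max_packs: Optional[int] = None,
    padding_idx: int = 0,
    drop_long_samples: bool = False,
) -> list[dict[str, list[int]]]:
    """Hypothetical greedy packer: fill a buffer, then emit it as one pack."""
    packs: list[dict[str, list[int]]] = []
    buffer_tokens: list[int] = []
    buffer_lens: list[int] = []

    def flush() -> None:
        # Pad the buffer to the pack size and store it as a single packed sample.
        pad = packed_sequence_size - len(buffer_tokens)
        packs.append(
            {
                "input_ids": buffer_tokens + [padding_idx] * pad,
                "seq_lens": list(buffer_lens),
            }
        )
        buffer_tokens.clear()
        buffer_lens.clear()

    for sample in samples:
        if len(sample) > packed_sequence_size:
            if drop_long_samples:
                continue
            # Assumption of this sketch: over-long samples are an error.
            raise ValueError("sample is longer than packed_sequence_size")
        if len(buffer_tokens) + len(sample) > packed_sequence_size:
            flush()
            if max_packs is not None and len(packs) >= max_packs:
                return packs
        buffer_tokens.extend(sample)
        buffer_lens.append(len(sample))

    # Emit whatever is left in the buffer at the end of the dataset.
    if buffer_tokens and (max_packs is None or len(packs) < max_packs):
        flush()
    return packs


packs = pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], packed_sequence_size=6)
for pack in packs:
    print(pack)
```

Keeping `seq_lens` alongside the padded tokens is what later allows a block causal mask to be built per pack.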

nemo_automodel.components.datasets.llm.packed_sequence.create_block_causal_mask(seq_lens: list[torch.Tensor]) → torch.Tensor#

Creates causal mask block for specified lengths.

In particular, given a batch of seq_lens tensors defining the lengths of the samples in each pack, constructs a 2D block causal mask for each pack in the batch. For example, if a single packed sample’s seq_lens is [3, 2, 1], the mask would be::

    mask = [
        [1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 1],
    ]

Parameters:

seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns: Block causal mask of shape (batch_size, packed_sequence_size, packed_sequence_size).

Return type: Tensor
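To make the block structure concrete, here is a minimal pure-Python sketch that builds the mask for a single pack from its sequence lengths. It uses nested lists rather than tensors and handles one pack rather than a batch, so it illustrates the layout only; the function name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(seq_lens: list[int]) -> list[list[int]]:
    """Hypothetical helper: one lower-triangular block per sequence on the diagonal."""
    total = sum(seq_lens)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for length in seq_lens:
        # Within a sequence, position `row` may attend to positions start..row.
        for row in range(start, start + length):
            for col in range(start, row + 1):
                mask[row][col] = 1
        start += length
    return mask


mask = block_causal_mask([3, 2, 1])
for row in mask:
    print(row)
```

Tokens never attend across sequence boundaries, which is what keeps packed samples independent of one another.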

nemo_automodel.components.datasets.llm.packed_sequence.packed_block_causal_mask(seq_lens: list[torch.Tensor])#

Create a 2D block causal document mask for a batch of packed sequences.

Parameters:

seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns: A BlockMask, or a Tensor if the torch version is < 2.5.0.

Return type: _MaskType
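A block causal document mask can also be described as a predicate over query and key positions rather than a dense matrix. The sketch below shows that rule in pure Python for a single pack: a query may attend to a key only if the key is not in the future and both positions belong to the same document. Whether this module builds its BlockMask from such a predicate is an assumption, and the names `document_ids` and `make_mask_mod` are hypothetical.

```python
from typing import Callable


def document_ids(seq_lens: list[int]) -> list[int]:
    """Map every position in the pack to the index of the document it belongs to."""
    ids: list[int] = []
    for doc, length in enumerate(seq_lens):
        ids.extend([doc] * length)
    return ids


def make_mask_mod(seq_lens: list[int]) -> Callable[[int, int], bool]:
    """Hypothetical helper: return a predicate deciding whether q_idx may attend to kv_idx."""
    doc = document_ids(seq_lens)

    def mask_mod(q_idx: int, kv_idx: int) -> bool:
        causal = q_idx >= kv_idx
        same_document = doc[q_idx] == doc[kv_idx]
        return causal and same_document

    return mask_mod


mask_mod = make_mask_mod([3, 2, 1])
print(mask_mod(2, 0))  # True: same document, key is in the past
print(mask_mod(3, 2))  # False: key belongs to the previous document
```

Evaluating the predicate over every (q_idx, kv_idx) pair yields the same pattern as the dense mask documented for `create_block_causal_mask`.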
