nemo_automodel.datasets.llm.packed_sequence#

Module Contents#

Classes#

PackedSequence

Implements Packed Sequence for input dataset.

Functions#

create_block_causal_mask

Creates causal mask block for specified lengths.

packed_block_causal_mask

Create a 2D block causal document mask for a batch of packed sequences.

Data#

API#

nemo_automodel.datasets.llm.packed_sequence.logger#

‘getLogger(…)’

nemo_automodel.datasets.llm.packed_sequence.CROSS_ENTROPY_IGNORE_IDX#

None

nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE#

None

class nemo_automodel.datasets.llm.packed_sequence.PackedSequence(
dataset,
split,
packed_sequence_size,
split_across_pack=False,
max_packs=None,
)[source]#

Implements Packed Sequence for input dataset.

Parameters:
  • dataset – The dataset to pack (may be the ‘train’, ‘val’ or ‘test’ split)

  • split (str) – Whether the dataset is ‘train’, ‘val’ or ‘test’

  • packed_sequence_size (int) – Number of tokens in a pack

  • split_across_pack (bool) – Controls what happens when the last sample in a pack does not fit in packed_sequence_size. If True, the sample is split and its remainder starts the next pack; if False, the sample is moved entirely to the beginning of the next pack. Default: False

  • max_packs (int) – Maximum number of packs to create. Default: None (no limit)

Initialization

Packed Sequence constructor.

Given the dataset and the rest of the arguments, the constructor creates (using the .pack method) another dataset containing packed sequences.


pack()[source]#

Pack the dataset to defined length.

In particular, it iterates through the dataset and uses a buffer to hold samples until the buffer reaches packed_sequence_size tokens, then appends the buffer to self.packs as a single “packed” sample. This continues until max_packs is reached or the dataset is exhausted.
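The buffering loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `pack_samples`, the dictionary keys `input_ids` and `seq_lens`, and the use of plain lists instead of tensors are all hypothetical, and padding of the final pack is omitted.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical stand-in for PACK_TYPE: a dict of plain Python lists.
Pack = Dict[str, List[int]]


def pack_samples(
    samples: Iterable[List[int]],
    packed_sequence_size: int,
    split_across_pack: bool = False,
    max_packs: Optional[int] = None,
) -> List[Pack]:
    """Greedily pack token lists into packs of at most packed_sequence_size tokens."""
    packs: List[Pack] = []
    buffer: List[int] = []    # tokens of the pack being built
    seq_lens: List[int] = []  # lengths of the samples inside the buffer

    def full() -> bool:
        # Mirrors the idea of _should_stop_packing: stop once max_packs is reached.
        return max_packs is not None and len(packs) >= max_packs

    for tokens in samples:
        if full():
            break
        if len(tokens) > packed_sequence_size and not split_across_pack:
            raise ValueError("sample longer than packed_sequence_size")
        prev_boundary = len(buffer)  # where the newest sample starts
        buffer = buffer + list(tokens)
        seq_lens = seq_lens + [len(tokens)]
        while len(buffer) > packed_sequence_size and not full():
            head_lens = seq_lens[:-1]
            if split_across_pack:
                # Cut exactly at the size limit; the newest sample is split.
                boundary = packed_sequence_size
                leftover = boundary - sum(head_lens)
                if leftover > 0:
                    head_lens = head_lens + [leftover]
            else:
                # Cut before the newest sample; it moves whole to the next pack.
                boundary = prev_boundary
            packs.append({"input_ids": buffer[:boundary], "seq_lens": head_lens})
            buffer = buffer[boundary:]
            seq_lens = [len(buffer)]
            prev_boundary = 0

    # Flush whatever is left, unless the pack limit was already reached.
    if buffer and not full():
        packs.append({"input_ids": buffer, "seq_lens": seq_lens})
    return packs
```

With `samples = [[1, 2, 3], [4, 5], [6, 7, 8]]` and `packed_sequence_size=4`, the sketch yields three packs when `split_across_pack=False` and two full packs when it is `True`, which illustrates the trade-off between keeping samples intact and filling every pack.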

_should_stop_packing() bool[source]#

If max_packs is set, stop packing once that number of packs has been reached.

_split_and_add_pack(
current_pack: nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE,
) nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE[source]#

Splits the current pack at the boundary, processes it, adds it to self.packs, and returns the start of the next pack.

TODO(@akoumparouli): refactor.

_add_pack(
pack: nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE,
) None[source]#

Processes, pads and adds a pack to self.packs.

_convert_to_tensors(
pack: nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE,
) nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE[source]#

Converts a pack into tensors. The pack comes in as a dict of lists, and each list is converted to a tensor.

_pad_pack(
pack: nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE,
padding_idx: int,
) nemo_automodel.datasets.llm.packed_sequence.PACK_TYPE[source]#

Pads a pack to self.packed_sequence_size.
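As a rough illustration of what padding a pack to a fixed size can look like, the sketch below works on plain lists. Several details are assumptions rather than facts from this page: the keys `input_ids`, `labels` and `seq_lens`, the helper name `pad_pack`, the choice to record the padding as an extra entry in `seq_lens`, and the value `-100` for the ignore index (this page lists CROSS_ENTROPY_IGNORE_IDX without a value; `-100` is PyTorch's default `ignore_index` for cross-entropy loss).

```python
from typing import Dict, List

# Assumption: same value as PyTorch's default cross-entropy ignore_index.
IGNORE_IDX = -100


def pad_pack(
    pack: Dict[str, List[int]],
    packed_sequence_size: int,
    padding_idx: int,
) -> Dict[str, List[int]]:
    """Return a copy of the pack, right-padded to packed_sequence_size tokens."""
    num_pad = packed_sequence_size - len(pack["input_ids"])
    if num_pad < 0:
        raise ValueError("pack is longer than packed_sequence_size")
    seq_lens = list(pack["seq_lens"])
    if num_pad > 0:
        # Assumption: treat the padding as one extra segment so that
        # sum(seq_lens) == packed_sequence_size.
        seq_lens.append(num_pad)
    return {
        # Inputs are padded with the tokenizer's padding id.
        "input_ids": list(pack["input_ids"]) + [padding_idx] * num_pad,
        # Labels are padded with the ignore index so padding adds no loss.
        "labels": list(pack["labels"]) + [IGNORE_IDX] * num_pad,
        "seq_lens": seq_lens,
    }
```

Padding labels with the ignore index rather than with `padding_idx` keeps padded positions out of the loss even when the padding id is a valid vocabulary entry.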

nemo_automodel.datasets.llm.packed_sequence.create_block_causal_mask(seq_lens: list[torch.Tensor]) torch.Tensor[source]#

Creates causal mask block for specified lengths.

In particular, given a batch of seq_lens tensors defining the lengths of the samples in each pack, this constructs a 2D block causal mask for each pack in the batch. For example, if a single pack’s seq_lens is [3, 2, 1], the mask would be::

    mask = [
        [1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 1],
    ]

Parameters:

seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns:

Block causal mask of shape (batch_size, packed_sequence_size, packed_sequence_size).

Return type:

Tensor
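To make the mask construction concrete, here is a small pure-Python sketch for a single pack. It is only an illustration of the block causal pattern, not the library's code: the function name `block_causal_mask` is hypothetical, and it returns nested lists of 0/1 instead of a tensor.

```python
from typing import List


def block_causal_mask(seq_lens: List[int]) -> List[List[int]]:
    """Build a block-diagonal, lower-triangular 0/1 mask for one pack."""
    total = sum(seq_lens)
    mask = [[0] * total for _ in range(total)]
    start = 0  # offset of the current sample inside the pack
    for n in seq_lens:
        for i in range(n):
            # Token i of this sample may attend to tokens 0..i of the same sample.
            for j in range(i + 1):
                mask[start + i][start + j] = 1
        start += n
    return mask


for row in block_causal_mask([3, 2, 1]):
    print(row)
```

Each sample contributes one lower-triangular block on the diagonal, so no token can attend to a different sample in the same pack. In PyTorch, a similar structure could likely be assembled from `torch.tril` blocks combined with `torch.block_diag`.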

nemo_automodel.datasets.llm.packed_sequence.packed_block_causal_mask(seq_lens: list[torch.Tensor])[source]#

Create a 2D block causal document mask for a batch of packed sequences.

Parameters:

seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.

Returns:

A BlockMask, or a Tensor if the installed torch version is older than 2.5.0.

Return type:

_MaskType
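A document mask of this kind can also be expressed as a predicate over query and key positions. PyTorch's flex attention `create_block_mask` accepts a predicate of a similar shape (with additional batch and head arguments). Whether this function is implemented that way is an assumption; the sketch below only shows, in plain Python with hypothetical helper names, that "same document and causal" yields the block causal pattern.

```python
from typing import Callable, List


def document_ids(seq_lens: List[int]) -> List[int]:
    """Map every token position in a pack to the index of its sample."""
    ids: List[int] = []
    for doc, n in enumerate(seq_lens):
        ids.extend([doc] * n)
    return ids


def make_mask_mod(doc_ids: List[int]) -> Callable[[int, int], bool]:
    """Return a predicate that says whether query q_idx may attend to key kv_idx."""
    def mask_mod(q_idx: int, kv_idx: int) -> bool:
        causal = kv_idx <= q_idx                      # never look ahead
        same_doc = doc_ids[q_idx] == doc_ids[kv_idx]  # stay within one sample
        return causal and same_doc
    return mask_mod


doc_ids = document_ids([3, 2, 1])
mask_mod = make_mask_mod(doc_ids)
size = len(doc_ids)
# Materialize the predicate as a dense 0/1 mask for inspection.
dense = [[int(mask_mod(q, kv)) for kv in range(size)] for q in range(size)]
```

The predicate form avoids storing a full (packed_sequence_size × packed_sequence_size) matrix up front, which is the usual motivation for a block-sparse mask representation.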