nemo_automodel.components.datasets.llm.packed_sequence#
Module Contents#
Functions#
- _pad_pack: Pads a pack to packed_sequence_size.
- _convert_to_tensors: Converts a pack (a dict of lists) into tensors.
- _tensorize_and_pad_pack: Converts a pack to tensors, pads it, and returns it.
- _should_stop_packing: If max_packs is set, stop packing once that number of packs has been reached.
- _split_and_add_pack: Splits the current pack at the boundary, processes it, and adds it to packs.
- pack_dataset: Packs the dataset to a defined length.
- create_block_causal_mask: Creates a block causal mask for the specified lengths.
- packed_block_causal_mask: Creates a 2D block causal document mask for a batch of packed sequences.
Data#
API#
- nemo_automodel.components.datasets.llm.packed_sequence.logger#
'getLogger(…)'
- nemo_automodel.components.datasets.llm.packed_sequence.CROSS_ENTROPY_IGNORE_IDX#
None
- nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE#
None
- nemo_automodel.components.datasets.llm.packed_sequence._fill_labels_with_cross_entropy_ignore_idx(
- labels: list[int],
- loss_mask: list[int],
- nemo_automodel.components.datasets.llm.packed_sequence._pad_pack(
- pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
- padding_idx: int,
- packed_sequence_size: int,
- cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX,
- cp_size: int = 1,
Pads a pack to packed_sequence_size.
seq_lens contains the original lengths; seq_lens_padded applies CP padding (if cp_size > 1) and pack-level padding.
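The padding rule above can be sketched in plain Python. The helper names below are hypothetical, chosen for illustration only; this is not the library's implementation, just the arithmetic it describes: each sequence length is rounded up to a multiple of 2 * cp_size, and any remaining room in the pack becomes pack-level padding.

```python
def cp_padded_len(seq_len: int, cp_size: int) -> int:
    """Round seq_len up to a multiple of 2 * cp_size (no-op when cp_size <= 1)."""
    if cp_size <= 1:
        return seq_len
    multiple = 2 * cp_size
    return ((seq_len + multiple - 1) // multiple) * multiple


def padded_seq_lens(seq_lens: list[int], cp_size: int, packed_sequence_size: int) -> list[int]:
    """Apply CP padding per sequence, then account for pack-level padding.

    The leftover room in the pack is appended as one extra padding "sequence",
    mirroring how seq_lens_padded differs from seq_lens in the docstring above.
    """
    padded = [cp_padded_len(s, cp_size) for s in seq_lens]
    leftover = packed_sequence_size - sum(padded)
    if leftover > 0:
        padded.append(leftover)
    return padded
```

For example, with cp_size=2 a sequence of length 5 is rounded up to 8 (the next multiple of 4), and a pack of size 16 then carries 8 tokens of pack-level padding.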
- nemo_automodel.components.datasets.llm.packed_sequence._convert_to_tensors(
- pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
Converts a pack into tensors. The pack comes in as a dict of lists, and each list is converted to a tensor.
- nemo_automodel.components.datasets.llm.packed_sequence._tensorize_and_pad_pack(
- pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
- padding_idx: int,
- packed_sequence_size: int,
- cross_entropy_ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX,
- cp_size: int = 1,
Converts a pack to tensors, pads it, and returns it.
- nemo_automodel.components.datasets.llm.packed_sequence._should_stop_packing(
- max_packs: int,
- packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE],
If max_packs is set, stop packing once that number of packs has been reached.
- nemo_automodel.components.datasets.llm.packed_sequence._calculate_leftover_seq_len(
- current_pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
- split_across_pack,
- previous_sample_boundary,
- packed_sequence_size,
- nemo_automodel.components.datasets.llm.packed_sequence._split_and_add_pack(
- current_pack: nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE,
- packs: list[nemo_automodel.components.datasets.llm.packed_sequence.PACK_TYPE],
- split_across_pack: bool,
- previous_sample_boundary: int,
- packed_sequence_size: int,
- padding_idx: int,
- cross_entropy_ignore_idx=CROSS_ENTROPY_IGNORE_IDX,
- cp_size: int = 1,
Splits the current pack at the boundary, processes it, adds it to packs, and returns the start of the next pack.
TODO(@akoumparouli): refactor.
- nemo_automodel.components.datasets.llm.packed_sequence.pack_dataset(
- dataset,
- split,
- packed_sequence_size,
- split_across_pack=False,
- max_packs=None,
- padding_idx=0,
- drop_long_samples=False,
- cp_size=1,
Pack the dataset to a defined length.
In particular, it iterates through the dataset, buffering samples until packed_sequence_size is reached, then appends the buffer to the packs as a single "packed" sample. This continues until max_packs is reached or the dataset is exhausted.
- Parameters:
dataset – The dataset to pack.
split (str) – Which split the dataset is: 'train', 'val' or 'test'.
packed_sequence_size (int) – Number of tokens in a pack.
split_across_pack (bool) – If the last sample in a pack does not fit in packed_sequence_size, whether to split the sample across the pack boundary (True) or move it entirely to the beginning of the next pack (False). Default: False.
max_packs (int) – Maximum number of packs. Default: None.
padding_idx (int) – Token index used for padding. Default: 0.
drop_long_samples (bool) – If True, drop samples that are longer than packed_sequence_size. Default: False.
cp_size (int) – Context parallel size. When > 1, each sequence is padded to be divisible by 2*cp_size for context parallel processing. Default: 1 (no CP).
- nemo_automodel.components.datasets.llm.packed_sequence.create_block_causal_mask(seq_lens: list[torch.Tensor]) -> torch.Tensor#
Creates a block causal mask for the specified lengths.
In particular, given a batch of seq_lens tensors defining the lengths of samples in each pack, construct a 2D block causal mask for each pack in the batch. For example, if a single sample's seq_lens is [3, 2, 1], the mask would be:
mask = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
- Parameters:
seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.
- Returns:
Block causal mask of shape (batch_size, packed_sequence_size, packed_sequence_size).
- Return type:
Tensor
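The mask construction above can be illustrated in pure Python (the actual function operates on torch tensors; this sketch only shows the logic). Each sample of length L contributes an L x L lower-triangular causal block on the diagonal, with zeros everywhere else, so tokens attend only to earlier tokens within their own sample.

```python
def block_causal_mask(seq_lens: list[int]) -> list[list[int]]:
    """Build a block causal mask for one pack, given its per-sample lengths."""
    total = sum(seq_lens)
    mask = [[0] * total for _ in range(total)]
    offset = 0
    for length in seq_lens:
        for i in range(length):
            # Causal within the block: attend to self and earlier positions only.
            for j in range(i + 1):
                mask[offset + i][offset + j] = 1
        offset += length
    return mask
```

Calling block_causal_mask([3, 2, 1]) reproduces the 6x6 mask shown in the docstring above.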
- nemo_automodel.components.datasets.llm.packed_sequence.packed_block_causal_mask(seq_lens: list[torch.Tensor])#
Create a 2D block causal document mask for a batch of packed sequences.
- Parameters:
seq_lens (List[torch.Tensor]) – Sequence lengths of samples in each pack in the batch, shape (batch_size, n), where n is the max number of sequences in a pack and can vary across packs.
- Returns:
A BlockMask (torch >= 2.5.0) or a dense Tensor mask (torch < 2.5.0).
- Return type:
_MaskType