nemo_automodel.components.datasets.llm.megatron.helpers

Module Contents

Functions

build_sample_idx

Build the 2-D sample index by calling the appropriately typed instantiation of the templated C++ function in helpers.cpp.

API

nemo_automodel.components.datasets.llm.megatron.helpers.build_sample_idx(
sizes: numpy.ndarray,
document_indices: numpy.ndarray,
sequence_length: int,
num_epochs: int,
tokens_per_epoch: int,
drop_last_partial_sequence: bool = True,
add_extra_token_to_sequence: bool = True,
)

Build the 2-D sample index by calling the appropriately typed instantiation of the templated C++ function in helpers.cpp.

Parameters:
  • sizes (numpy.ndarray) – The 1-D array of document lengths, in tokens

  • document_indices (numpy.ndarray) – The 1-D array of document indices, each an index into sizes

  • sequence_length (int) – The length of each sample, in tokens

  • num_epochs (int) – The number of epochs

  • tokens_per_epoch (int) – The number of tokens per epoch

  • drop_last_partial_sequence (bool) – Whether to omit the final partial sequence from the sample index, if one exists. Defaults to True.

  • add_extra_token_to_sequence (bool) – Whether to build samples with sequence length sequence_length + 1. Defaults to True.

Returns:

The 2-D sample index

Return type:

numpy.ndarray
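
Example:

The page does not describe the layout of the returned array, so the following is only an illustrative pure-Python sketch of how a Megatron-style sample index can be built. It is not the C++ routine in helpers.cpp. The function name sketch_sample_idx is hypothetical, and the code rests on several assumptions: both flags keep their default value of True, document_indices already covers num_epochs passes over the documents, and each output row holds a position in document_indices followed by a token offset inside that document.

```python
import numpy as np


def sketch_sample_idx(sizes, document_indices, sequence_length,
                      num_epochs, tokens_per_epoch):
    """Hypothetical stand-in for illustration; not the library's C++ code."""
    # Each sample spans sequence_length + 1 tokens, and its last token is
    # reused as the first token of the next sample (assumed default flags).
    num_samples = (num_epochs * tokens_per_epoch - 1) // sequence_length

    # Row i marks where sample i starts; row i + 1 marks where it ends.
    # Assumed columns: (position in document_indices, token offset).
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)

    doc_pos = 0     # current position within document_indices
    doc_offset = 0  # token offset inside the current document
    for row in range(1, num_samples + 1):
        remaining = sequence_length + 1
        while remaining > 0:
            doc_len = int(sizes[document_indices[doc_pos]]) - doc_offset
            remaining -= doc_len
            if remaining <= 0:
                # The sample ends inside this document: park the offset on
                # the sample's last token so the next sample starts there.
                doc_offset += remaining + doc_len - 1
            else:
                # Document exhausted: continue with the next document.
                doc_pos += 1
                doc_offset = 0
        sample_idx[row] = (doc_pos, doc_offset)
    return sample_idx


# Three documents of 5, 3 and 4 tokens, visited in order for one epoch.
example = sketch_sample_idx(
    sizes=np.array([5, 3, 4], dtype=np.int32),
    document_indices=np.array([0, 1, 2], dtype=np.int32),
    sequence_length=4,
    num_epochs=1,
    tokens_per_epoch=12,
)
print(example)
```

Under these assumptions, consecutive rows bracket one sample, so a dataset can gather the tokens of sample i by walking from row i to row i + 1. Whether the compiled function uses exactly this layout and these dtypes should be checked against helpers.cpp.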