nemo_automodel.components.datasets.llm.megatron.helpers
Module Contents
Functions
build_sample_idx | Build the 2-D sample index using the properly typed templated C++ function from helpers.cpp |
API
nemo_automodel.components.datasets.llm.megatron.helpers.build_sample_idx(
    sizes: numpy.ndarray,
    document_indices: numpy.ndarray,
    sequence_length: int,
    num_epochs: int,
    tokens_per_epoch: int,
    drop_last_partial_sequence: bool = True,
    add_extra_token_to_sequence: bool = True,
) -> numpy.ndarray
Build the 2-D sample index using the properly typed templated C++ function from helpers.cpp.
- Parameters:
sizes (numpy.ndarray) – The 1-D array of document lengths
document_indices (numpy.ndarray) – The 1-D array of document indices
sequence_length (int) – The sequence length
num_epochs (int) – The number of epochs
tokens_per_epoch (int) – The number of tokens per epoch
drop_last_partial_sequence (bool) – Whether to omit the last partial sequence in the sample index should it exist. Defaults to True.
add_extra_token_to_sequence (bool) – Whether to build samples with sequence length sequence_length + 1. Defaults to True.
- Returns:
The 2-D sample index
- Return type:
numpy.ndarray
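Example

The reference above does not say how the rows of the returned index are laid out. The sketch below is a pure-NumPy illustration of one plausible layout, modeled on the convention commonly used by Megatron-style GPT datasets. Under that assumption, row i holds the position in document_indices and the token offset at which sample i starts, and row i + 1 marks where it ends. The function name build_sample_idx_reference is hypothetical and is not part of nemo_automodel. The compiled helper has not been checked against this sketch, so its behavior may differ.

```python
import numpy as np


def build_sample_idx_reference(
    sizes,
    document_indices,
    sequence_length,
    num_epochs,
    tokens_per_epoch,
    drop_last_partial_sequence=True,
    add_extra_token_to_sequence=True,
):
    """Hypothetical pure-Python sketch of a 2-D sample index (assumed layout)."""
    extra = 1 if add_extra_token_to_sequence else 0
    total_tokens = num_epochs * tokens_per_epoch - extra
    if drop_last_partial_sequence:
        num_samples = total_tokens // sequence_length
    else:
        # Ceiling division keeps the trailing partial sequence.
        num_samples = -(-total_tokens // sequence_length)

    # Row 0 is the start of the first sample: first document, offset 0.
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)
    doc_pos = 0  # position within document_indices
    offset = 0   # token offset within the current document

    for i in range(1, num_samples + 1):
        remaining = sequence_length + extra
        while remaining != 0:
            doc_len = int(sizes[document_indices[doc_pos]]) - offset
            remaining -= doc_len
            if remaining <= 0:
                # The sample ends inside this document. With the extra token,
                # the next sample starts on this sample's last token.
                offset += remaining + doc_len - extra
                remaining = 0
            elif doc_pos == len(document_indices) - 1:
                # Out of documents: close the final partial sample.
                offset = int(sizes[document_indices[doc_pos]]) - extra
                break
            else:
                doc_pos += 1
                offset = 0
        sample_idx[i] = (doc_pos, offset)
    return sample_idx


# Three documents of 5, 3 and 4 tokens, visited in order for one epoch.
sizes = np.array([5, 3, 4], dtype=np.int32)
document_indices = np.array([0, 1, 2], dtype=np.int32)

idx = build_sample_idx_reference(sizes, document_indices, 4, 1, 12)
print(idx.tolist())  # [[0, 0], [0, 4], [2, 0]]

idx_keep = build_sample_idx_reference(
    sizes, document_indices, 4, 1, 12, drop_last_partial_sequence=False
)
print(idx_keep.tolist())  # [[0, 0], [0, 4], [2, 0], [2, 3]]
```

In this sketch, n samples produce n + 1 rows, and consecutive samples overlap by one token when add_extra_token_to_sequence is True. That overlap is what lets each sample supply sequence_length input tokens plus shifted targets. The int32 dtypes used for sizes and document_indices are an assumption. Check the dtypes the compiled helper expects before calling it.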