nemo_automodel.components.datasets.llm.megatron.helpers
Module Contents
Functions
API
Build the 2-D sample index using the properly typed, templated C++ function from helpers.cpp.
Parameters:
sizes: The 1-D array of document lengths.
document_indices: The 1-D array of document indices.
sequence_length: The sequence length.
num_epochs: The number of epochs.
tokens_per_epoch: The number of tokens per epoch.
drop_last_partial_sequence: Whether to omit the last partial sequence in the sample index should it exist. Defaults to True.
add_extra_token_to_sequence: Whether to build samples with sequence length sequence_length + 1. Defaults to True.
Returns:
numpy.ndarray: The 2-D sample index
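
To make the shape of the result concrete, here is a minimal pure-Python sketch of how such a sample index could be built. It is an illustration, not the helpers.cpp implementation. The function name `sketch_sample_index` is hypothetical, and the row layout (position in document_indices, token offset inside that document) is an assumption borrowed from Megatron-LM-style datasets; the description above does not state it. The sketch covers only the default behaviour, with drop_last_partial_sequence and add_extra_token_to_sequence both True, and returns a list of rows rather than a numpy.ndarray.

```python
def sketch_sample_index(sizes, document_indices, sequence_length,
                        num_epochs, tokens_per_epoch):
    """Hypothetical reference sketch (defaults only, not the C++ helper).

    Assumed layout: row i marks where sample i starts, as
    [position in document_indices, token offset inside that document].
    Row i + 1 marks where sample i ends, so there is one more row
    than there are samples.
    """
    # Each sample reads sequence_length + 1 tokens, and consecutive
    # samples share one boundary token, hence the "- 1".
    num_samples = (num_epochs * tokens_per_epoch - 1) // sequence_length

    doc_pos = 0      # current position in document_indices
    doc_offset = 0   # current token offset inside that document
    rows = [[doc_pos, doc_offset]]

    for _ in range(num_samples):
        remaining = sequence_length + 1
        while remaining != 0:
            doc_id = document_indices[doc_pos]
            available = sizes[doc_id] - doc_offset
            remaining -= available
            if remaining <= 0:
                # The sample ends inside this document. Step back one
                # token so the next sample starts on the shared token.
                doc_offset += remaining + available - 1
                remaining = 0
            else:
                # Document exhausted; continue with the next one.
                doc_pos += 1
                doc_offset = 0
        rows.append([doc_pos, doc_offset])
    return rows


# Three documents of 5, 3 and 4 tokens, visited in order, one epoch.
example = sketch_sample_index(
    sizes=[5, 3, 4],
    document_indices=[0, 1, 2],
    sequence_length=4,
    num_epochs=1,
    tokens_per_epoch=12,
)
print(example)  # [[0, 0], [0, 4], [2, 0]]
```

In this example the 12 tokens yield two samples of 4 + 1 tokens each. The first spans tokens 0 to 4 of document 0; the second starts on the last token of document 0, consumes all of document 1, and ends on the first token of document 2. The remaining tokens of document 2 form a partial sequence, which is dropped under the default setting.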