nemo_automodel.components.datasets.llm.megatron.helpers

Module Contents

Functions

Name | Description
build_sample_idx | Build the 2-D sample index using the properly typed templated C++ function from helpers.cpp

API

nemo_automodel.components.datasets.llm.megatron.helpers.build_sample_idx(
sizes: numpy.ndarray,
document_indices: numpy.ndarray,
sequence_length: int,
num_epochs: int,
tokens_per_epoch: int,
drop_last_partial_sequence: bool = True,
add_extra_token_to_sequence: bool = True
)

Build the 2-D sample index using the properly typed templated C++ function from helpers.cpp

Parameters:

sizes
numpy.ndarray

The 1-D array of document lengths

document_indices
numpy.ndarray

The 1-D array of document indices

sequence_length
int

The sequence length

num_epochs
int

The number of epochs

tokens_per_epoch
int

The number of tokens per epoch

drop_last_partial_sequence
bool, defaults to True

Whether to omit the last partial sequence from the sample index, should one exist.

add_extra_token_to_sequence
bool, defaults to True

Whether to build samples with sequence length sequence_length + 1.

Returns:

numpy.ndarray: The 2-D sample index
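
Example:

This page does not describe the layout of the returned index, so the following is only a rough pure-NumPy sketch of how such an index can be built. It assumes the Megatron-style convention (not confirmed by this page) in which each row holds a position in document_indices and a token offset within that document, marking where a sample starts, and in which the array has one more row than there are samples. The name build_sample_idx_sketch is hypothetical; the compiled helper may differ in dtype and edge-case handling.

```python
import numpy as np


def build_sample_idx_sketch(
    sizes,
    document_indices,
    sequence_length,
    num_epochs,
    tokens_per_epoch,
    drop_last_partial_sequence=True,
    add_extra_token_to_sequence=True,
):
    """Illustrative stand-in, NOT the library implementation."""
    extra = 1 if add_extra_token_to_sequence else 0
    total_tokens = num_epochs * tokens_per_epoch - extra

    # Number of samples: floor when the trailing partial sample is dropped,
    # ceiling when it is kept.
    if drop_last_partial_sequence:
        num_samples = total_tokens // sequence_length
    else:
        num_samples = -(-total_tokens // sequence_length)

    # Row i marks where sample i starts; row i + 1 marks where it ends.
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)

    doc_pos = 0  # position inside document_indices
    offset = 0   # token offset inside the current document
    last_pos = len(document_indices) - 1

    for row in range(1, num_samples + 1):
        remaining = sequence_length + extra
        while remaining > 0:
            doc_len = int(sizes[document_indices[doc_pos]]) - offset
            remaining -= doc_len
            if remaining <= 0:
                # The sample ends inside this document. When extra == 1 the
                # next sample starts one token early, so the two overlap.
                offset += remaining + doc_len - extra
                remaining = 0
            elif doc_pos == last_pos:
                # Out of documents: close the final, partial sample.
                offset = int(sizes[document_indices[doc_pos]]) - extra
                break
            else:
                doc_pos += 1
                offset = 0
        sample_idx[row] = (doc_pos, offset)

    return sample_idx


# Three documents of 5, 3 and 4 tokens, visited in order for one epoch.
sizes = np.array([5, 3, 4], dtype=np.int32)
document_indices = np.array([0, 1, 2], dtype=np.int32)

idx = build_sample_idx_sketch(
    sizes, document_indices, sequence_length=4, num_epochs=1, tokens_per_epoch=12
)
print(idx.tolist())  # [[0, 0], [0, 4], [2, 0]]
```

Under these assumptions, the first sample covers tokens 0 through 4 of document 0, and the second starts at token 4 of document 0 and runs to token 0 of document 2. Each sample is sequence_length + 1 tokens long and shares one token with its neighbour.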