nemo_curator.stages.text.io.writer.megatron_tokenizer
Module Contents
Classes
Data
API
Dataclass
Bases: BaseWriter
Writer that writes a DocumentBatch to Megatron-ready tokenized files.
append_eod
cache_dir
fields
file_extension
hf_token
model_identifier
name
text_field
tokenization_batch_size
staticmethod
Build the sequence pointers from the sequence lengths and the dtype size.
Returns: list[int]: The pointer to the beginning of each sequence
Parameters:
sequence_lengths
The length of each sequence
token_size
The size of each token in bytes
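The pointer computation described above can be sketched as a running byte offset. This is a minimal illustration, not the library's code: the function name `build_sequence_pointers` is hypothetical, and it assumes the pointers are byte offsets into the .bin file, which is what a per-token size in bytes suggests.

```python
def build_sequence_pointers(sequence_lengths: list[int], token_size: int) -> list[int]:
    """Return the assumed byte offset at which each sequence starts.

    Hypothetical helper: assumes sequences are stored back to back and
    that every token occupies exactly `token_size` bytes.
    """
    pointers = []
    current = 0
    for length in sequence_lengths:
        pointers.append(current)        # this sequence starts here
        current += length * token_size  # advance past its tokens
    return pointers


# Three sequences of 3, 2 and 4 tokens, 2 bytes per token.
print(build_sequence_pointers([3, 2, 4], 2))
```

Under these assumptions the first pointer is always 0, and each later pointer equals the previous pointer plus the previous sequence's length times `token_size`.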
Write tokens to the .bin file.
Parameters:
tokens_batch
The batch of tokens to write (list[list[int]])
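As a rough illustration of writing a token batch to a binary stream, the sketch below uses only the standard library. The function name `write_tokens`, the size-to-format mapping, the little-endian byte order, and the returned list of lengths are all assumptions made for this example; the writer's actual dtype selection and return value are not documented on this page.

```python
import io
import struct

# Assumed mapping from token size in bytes to a struct format code.
_FORMAT_BY_SIZE = {2: "H", 4: "i"}  # unsigned 16-bit, signed 32-bit


def write_tokens(stream, tokens_batch: list[list[int]], token_size: int) -> list[int]:
    """Append each sequence to `stream` as little-endian integers.

    Hypothetical helper: returns the length of every sequence written,
    which is the information an index file would need afterwards.
    """
    code = _FORMAT_BY_SIZE[token_size]
    lengths = []
    for tokens in tokens_batch:
        stream.write(struct.pack(f"<{len(tokens)}{code}", *tokens))
        lengths.append(len(tokens))
    return lengths


# An in-memory buffer stands in for the .bin file.
buffer = io.BytesIO()
lengths = write_tokens(buffer, [[1, 2, 3], [4, 5]], token_size=2)
print(lengths, len(buffer.getvalue()))
```

Here five 2-byte tokens produce 10 bytes in total, and the recorded lengths are what a pointer computation would consume.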
Write the .idx file data