nemo_automodel.components.datasets.llm.nanogpt_dataset
PyTorch IterableDataset for .bin shards written by NanoGPT preprocessing scripts.
Supports both legacy fineweb.py format and the newer nanogpt_data_processor.py format.
Legacy format (fineweb.py)::
    int32[256] header
        header[0] = 20240520      # magic number
        header[1] = 1             # version
        header[2] = num_tokens    # number of uint16 tokens that follow
        header[3] = (unused)      # defaults to 0
    uint16[num_tokens] tokens
New format (nanogpt_data_processor.py)::
    int32[256] header
        header[0] = 2788_95051      # magic number
        header[1] = 1               # version
        header[2] = num_tokens      # number of tokens that follow
        header[3] = dtype.itemsize  # bytes per token (2 for uint16, 4 for uint32)
    uint16/uint32[num_tokens] tokens
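As an illustration of the new-format layout above, the following is a minimal, dependency-free sketch that writes a shard with the standard library. The helper name `write_shard` is hypothetical and not part of this module, and little-endian byte order is an assumption, since the layout above does not state one.

```python
import struct

NEW_MAGIC = 2788_95051  # magic number of the new format, as listed above
HEADER_INTS = 256       # the header holds 256 int32 values (1024 bytes)


def write_shard(path, tokens, itemsize=2):
    """Write `tokens` as a new-format shard (hypothetical helper)."""
    if itemsize not in (2, 4):
        raise ValueError("itemsize must be 2 (uint16) or 4 (uint32)")
    header = [0] * HEADER_INTS
    header[0] = NEW_MAGIC     # magic number
    header[1] = 1             # version
    header[2] = len(tokens)   # number of tokens that follow
    header[3] = itemsize      # bytes per token
    typecode = "H" if itemsize == 2 else "I"
    with open(path, "wb") as f:
        # Assumption: little-endian, which matches common x86/ARM hosts.
        f.write(struct.pack(f"<{HEADER_INTS}i", *header))
        f.write(struct.pack(f"<{len(tokens)}{typecode}", *tokens))
```

A three-token uint16 shard written this way occupies 1024 + 3 * 2 bytes on disk.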
Optionally, a corresponding .bos.idx file can exist alongside each .bin file::
int32[n_bos_tokens] bos_positions
Array of absolute byte positions where BOS tokens occur in the .bin file
The dataset streams one contiguous seq_len token slice at a time and
returns the pair (inputs, labels) where labels is shifted by one
position. Optionally, slices can be forced to start at the BOS token
(align_to_bos=True). When BOS alignment is enabled, the dataset will use
.bos.idx files for efficient BOS token lookup when available, falling back
to linear search otherwise.
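The shift-by-one relationship between inputs and labels can be sketched as follows. This is an illustration only: `make_pair` is a hypothetical name, and it assumes each sample is cut from a window of `seq_len + 1` consecutive tokens.

```python
def make_pair(tokens, start, seq_len):
    """Cut one (inputs, labels) pair from `tokens` (hypothetical helper)."""
    # One extra token is needed so the last input has a next-token target.
    window = tokens[start : start + seq_len + 1]
    if len(window) < seq_len + 1:
        raise ValueError("not enough tokens left for a full sample")
    inputs = window[:-1]  # seq_len tokens
    labels = window[1:]   # the same tokens shifted by one position
    return inputs, labels
```

With this layout, `labels[i]` is always the token that follows `inputs[i]` in the shard.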
This file is copied (with minimal adjustments) from
modded-nanogpt/data/bin_dataset.py so that projects depending on
nemo_automodel can directly import BinTokenDataset without taking a
runtime dependency on the NanoGPT codebase.
Module Contents
Classes
Functions
Data
API
Bases: IterableDataset
Dataset class for NanoGPT Dataset.
A NanoGPT Dataset is a dataset that stores tokens in a binary file. Each file contains:
- A 256x4-byte header (magic number, version, num_tokens, dtype.itemsize)
- The tokens themselves.
Optionally, a corresponding .bos.idx file can be present alongside each .bin file
containing precomputed BOS token positions for efficient alignment when
align_to_bos=True. If the index file is not present, the dataset falls back
to linear search for BOS tokens.
Parameters:
str | Sequence[str]
Glob pattern (e.g. "data/fineweb_*_train_*.bin") or an explicit
list of file paths.
int
Length of the training sample returned (not counting the next-token
target). labels is the same slice shifted by one position, so labels[i]
is the token that follows inputs[i].
bool, default False
Shuffle the order of shards each epoch/iteration.
bool, default False
Ensure that every slice starts with bos_token. When enabled, the
dataset searches forward from the current position until it finds the
next BOS token and starts there. Uses .bos.idx files when available
for efficient search, falls back to linear search otherwise.
Requires bos_token to be provided.
int, optional, default None
Token ID marking beginning-of-document.
Iterate over training samples from the dataset.
Generate training samples from all assigned files, handling infinite iteration.
Parameters:
List of files assigned to this worker
Random number generator for shuffling
Whether we’re splitting a single file among workers
Starting position in file (for single file splitting)
Ending position in file (for single file splitting)
Process tokens from a single file and yield training samples.
Parameters:
Path to the .bin file to process
Whether we’re splitting a single file among workers
Starting position in the file (for single file splitting)
Ending position in the file (for single file splitting)
Set up worker-specific context including file assignment and splitting parameters.
Returns: tuple[List[str], random.Random, bool, int, int]
Tuple of (worker_files, rng, split_single_file, file_start_pos, file_end_pos)
Find the next BOS token position using the index.
Parameters:
Array of BOS token positions
Current position to search from
Maximum position to search up to
Returns: int
Position of next BOS token, or max_pos if none found.
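An index lookup of this kind is typically a binary search. The sketch below uses the standard-library `bisect` module and is not the module's own implementation. It assumes the positions are sorted in ascending order, are expressed in the same units as the current position, and that the search range is half-open, `[pos, max_pos)`.

```python
from bisect import bisect_left


def find_next_bos(bos_positions, pos, max_pos):
    """Return the first indexed position >= pos, or max_pos if none (sketch)."""
    # bisect_left finds the leftmost entry that is not smaller than pos.
    i = bisect_left(bos_positions, pos)
    if i < len(bos_positions) and bos_positions[i] < max_pos:
        return bos_positions[i]
    # No BOS token inside [pos, max_pos).
    return max_pos
```

The lookup costs O(log n) in the number of indexed BOS tokens, rather than a scan over the tokens themselves.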
Returns the torch.dtype for the given value.
Get the next BOS token position.
Parameters:
Tensor of tokens
BOS token ID
Array of BOS token positions
Current position
Maximum position
Returns: int
Next BOS token position
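The combination of index lookup and linear-search fallback might look like the sketch below. The function name is hypothetical, the behavior when no BOS token is found (returning `max_pos`) is an assumption carried over from the index-based lookup, and the index is assumed to hold sorted positions in the same units as `pos`.

```python
from bisect import bisect_left


def next_bos_position(tokens, bos_token, bos_positions, pos, max_pos):
    """Return the first BOS position in [pos, max_pos), or max_pos (sketch)."""
    if bos_positions is not None:
        # Fast path: binary search in the precomputed, sorted index.
        i = bisect_left(bos_positions, pos)
        found = i < len(bos_positions) and bos_positions[i] < max_pos
        return bos_positions[i] if found else max_pos
    # Fallback: scan the tokens one by one when no index is available.
    for i in range(pos, min(max_pos, len(tokens))):
        if tokens[i] == bos_token:
            return i
    return max_pos
```

Both paths return the same answer for a consistent index; only the cost differs.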
Get the start and end positions for a single file, accounting for the number of workers.
Parameters:
Total number of tokens in the file
Total number of workers
Global worker ID
Returns: tuple[int, int]
Tuple of (start position, end position)
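One simple way to divide a single file among workers is to give each worker an equal contiguous range and let the last worker absorb the remainder. The sketch below illustrates that scheme; the name `split_range` and the remainder policy are assumptions, and the module's actual split may differ.

```python
def split_range(total_tokens, total_workers, worker_id):
    """Return the (start, end) token range for one worker (sketch)."""
    if not 0 <= worker_id < total_workers:
        raise ValueError("worker_id must be in [0, total_workers)")
    per_worker = total_tokens // total_workers
    start = worker_id * per_worker
    # Assumption: the last worker also takes the leftover tokens.
    if worker_id == total_workers - 1:
        end = total_tokens
    else:
        end = start + per_worker
    return start, end
```

The ranges are disjoint and together cover every token in the file, so no sample is read by two workers.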
Get the total number of workers.
Load BOS token positions from a .bos.idx file if it exists.
Parameters:
Path to the .bin file (will look for corresponding .bos.idx file)
Returns: np.ndarray | None
Array of BOS token positions if index file exists, None otherwise.
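A dependency-free sketch of such a loader is shown below. It returns a plain list rather than an np.ndarray, and it makes two assumptions that are not confirmed above: the index path is derived by replacing the `.bin` suffix with `.bos.idx`, and the int32 values are stored little-endian with no header.

```python
import os
import struct


def load_bos_index(bin_path):
    """Return BOS positions from the sibling index file, or None (sketch)."""
    # Assumption: "shard.bin" pairs with "shard.bos.idx".
    root, _ = os.path.splitext(bin_path)
    idx_path = root + ".bos.idx"
    if not os.path.exists(idx_path):
        return None
    with open(idx_path, "rb") as f:
        raw = f.read()
    # Each position is one 4-byte signed integer.
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw[: count * 4]))
```

Returning `None` lets the caller fall back to a linear search without treating a missing index as an error.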
Returns total number of tokens from the shard header, without traversing the data. Supports both legacy fineweb.py and new nanogpt_data_processor.py formats.
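Because the token count sits in the third header field of both formats, it can be read without touching the token data. The following sketch shows the idea with the standard library; `read_num_tokens` is a hypothetical name, and little-endian byte order is an assumption.

```python
import struct

LEGACY_MAGIC = 20240520  # fineweb.py format
NEW_MAGIC = 2788_95051   # nanogpt_data_processor.py format


def read_num_tokens(path):
    """Return header[2], the token count, after validating the header (sketch)."""
    with open(path, "rb") as f:
        raw = f.read(16)  # only the first four int32 fields are needed
    if len(raw) < 16:
        raise ValueError("file is too short to contain a header")
    magic, version, num_tokens, _ = struct.unpack("<4i", raw)
    if magic not in (LEGACY_MAGIC, NEW_MAGIC):
        raise ValueError(f"unexpected magic number: {magic}")
    if version != 1:
        raise ValueError(f"unsupported version: {version}")
    return num_tokens
```

Only 16 bytes are read per shard, so summing token counts over many shards stays cheap.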
Memory-map a .bin shard and return it as a 1-D torch.uint16/uint32 tensor.
The returned tensor shares memory with the underlying file, so creating it is extremely cheap. Do not modify it in-place.
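The zero-copy idea can be illustrated without NumPy or PyTorch by using the standard-library `mmap` module and a typed `memoryview`. This is a sketch under stated assumptions, not the module's implementation: `map_tokens` is a hypothetical name, it returns a read-only `memoryview` instead of a tensor, and it assumes a little-endian host so that the native `H`/`I` formats match the file.

```python
import mmap
import struct

HEADER_BYTES = 256 * 4  # 256 int32 header fields


def map_tokens(path):
    """Map a shard read-only and return its tokens as a typed view (sketch)."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    _, _, num_tokens, itemsize = struct.unpack_from("<4i", mm, 0)
    # Legacy shards leave header[3] at 0 and always store uint16 tokens.
    itemsize = itemsize or 2
    if itemsize not in (2, 4):
        raise ValueError(f"unsupported token size: {itemsize}")
    nbytes = num_tokens * itemsize
    if HEADER_BYTES + nbytes > len(mm):
        raise ValueError("file is shorter than its header claims")
    code = "H" if itemsize == 2 else "I"
    # Slicing and casting a memoryview does not copy the token data.
    return memoryview(mm)[HEADER_BYTES : HEADER_BYTES + nbytes].cast(code)
```

Because the mapping is opened with `ACCESS_READ`, the view is read-only, which enforces the "do not modify in-place" rule rather than relying on convention.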