nemo_automodel.components.datasets.llm.nanogpt_dataset

PyTorch IterableDataset for .bin shards written by NanoGPT preprocessing scripts.

Supports both legacy fineweb.py format and the newer nanogpt_data_processor.py format.

Legacy format (fineweb.py)::

    int32[256] header
        header[0] = 20240520      # magic number
        header[1] = 1             # version
        header[2] = num_tokens    # number of uint16 tokens that follow
        header[3] = (unused)      # defaults to 0

    uint16[num_tokens] tokens

New format (nanogpt_data_processor.py)::

    int32[256] header
        header[0] = 278895051     # magic number
        header[1] = 1             # version
        header[2] = num_tokens    # number of tokens that follow
        header[3] = dtype.itemsize  # bytes per token (2 for uint16, 4 for uint32)

    uint16/uint32[num_tokens] tokens
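
The new-format layout above can be illustrated with a small, standard-library-only writer. This is a minimal sketch, not the code in nanogpt_data_processor.py: the helper name `write_shard` is hypothetical, and little-endian byte order is an assumption that the layout description does not state.

```python
import struct

HEADER_SIZE = 256      # number of int32 slots in the header
MAGIC = 278895051      # new-format magic number
VERSION = 1


def write_shard(path, tokens, itemsize=2):
    """Write `tokens` (a list of ints) as a new-format shard (hypothetical helper)."""
    if itemsize not in (2, 4):
        raise ValueError("itemsize must be 2 (uint16) or 4 (uint32)")
    # Slots 0..3 carry the documented fields; the remaining slots stay zero.
    header = [0] * HEADER_SIZE
    header[0:4] = [MAGIC, VERSION, len(tokens), itemsize]
    token_fmt = "H" if itemsize == 2 else "I"
    with open(path, "wb") as f:
        # Assumption: little-endian encoding for both header and tokens.
        f.write(struct.pack(f"<{HEADER_SIZE}i", *header))
        f.write(struct.pack(f"<{len(tokens)}{token_fmt}", *tokens))
```

A shard written this way is `256 * 4 + num_tokens * itemsize` bytes long, which gives a quick sanity check on any file.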

Optionally, a corresponding .bos.idx file can exist alongside each .bin file::

int32[n_bos_tokens] bos_positions

Array of absolute byte positions where BOS tokens occur in the .bin file

The dataset streams one contiguous seq_len token slice at a time and returns the pair (inputs, labels) where labels is shifted by one position. Optionally, slices can be forced to start at the BOS token (align_to_bos=True). When BOS alignment is enabled, the dataset will use .bos.idx files for efficient BOS token lookup when available, falling back to linear search otherwise.
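
The input/label relationship described above can be sketched on a plain Python list. This is an illustration only: the function name `iter_samples` is hypothetical, and the assumption that consecutive windows advance by exactly `seq_len` tokens is not confirmed by this page.

```python
def iter_samples(tokens, seq_len):
    """Yield (inputs, labels) pairs from consecutive windows of `tokens`."""
    pos = 0
    # Each sample needs seq_len inputs plus one extra token for the final label.
    while pos + seq_len + 1 <= len(tokens):
        window = tokens[pos : pos + seq_len + 1]
        # labels[i] is the token that follows inputs[i] in the stream.
        yield window[:-1], window[1:]
        pos += seq_len  # assumption: stride equals seq_len
```

With `tokens = list(range(10))` and `seq_len = 4`, the first pair is `([0, 1, 2, 3], [1, 2, 3, 4])`.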

This file is copied (with minimal adjustments) from modded-nanogpt/data/bin_dataset.py so that projects depending on nemo_automodel can directly import NanogptDataset without taking a runtime dependency on the NanoGPT codebase.

Module Contents

Classes

| Name | Description |
| --- | --- |
| NanogptDataset | Dataset class for NanoGPT Dataset. |

Functions

| Name | Description |
| --- | --- |
| _find_next_bos_with_index | Find the next BOS token position using the index. |
| _get_dtype_from_val | Returns the torch.dtype for the given number of bytes per token. |
| _get_next_bos_position | Get the next BOS token position. |
| _get_start_end_pos_single_file | Get the start and end positions for a single file, accounting for the number of workers. |
| _get_worker_id_and_total_workers | Get the worker ID and the total number of workers. |
| _load_bos_index | Load BOS token positions from a .bos.idx file if it exists. |
| _peek_num_tokens | Returns total number of tokens from the shard header, without traversing the data. |
| load_bin_shard | Memory-map a .bin shard and return it as a 1-D torch.uint16/uint32 tensor. |

Data

HEADER_BYTES

HEADER_SIZE

LEGACY_MAGIC

MAGIC

VERSION

__all__

API

class nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset(
file_pattern: str | typing.Sequence[str],
seq_len: int,
bos_token: int | None = None,
shuffle_files: bool = False,
align_to_bos: bool = False
)

Bases: IterableDataset

Dataset class for NanoGPT Dataset.

A NanoGPT dataset stores tokens in binary shard files. Each file contains:

  • A 256 x 4-byte header (magic number, version, num_tokens, dtype.itemsize)
  • The tokens themselves.

Optionally, a corresponding .bos.idx file can be present alongside each .bin file containing precomputed BOS token positions for efficient alignment when align_to_bos=True. If the index file is not present, the dataset falls back to linear search for BOS tokens.

Parameters:

file_pattern

str | Sequence[str] Glob pattern (e.g. "data/fineweb_*_train_*.bin") or an explicit list of file paths.

seq_len

int Length of the training sample returned (not counting the next-token target). labels are the inputs shifted by one position, so labels[i] is the token that follows inputs[i].

shuffle_files

bool, default False Shuffle the order of shards each epoch/iteration.

align_to_bos

bool, default False Ensure that every slice starts with bos_token. When enabled, the dataset searches forward from the current position until it finds the next BOS token and starts there. It uses .bos.idx files for efficient search when they are available and falls back to linear search otherwise. Requires bos_token to be provided.

bos_token

int, optional, default None. Token ID marking beginning-of-document.

files: List[str] = sorted(glob.glob(str(file_pattern)))
seq_len = int(seq_len)
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__getitem__(
index: int
)
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__iter__() -> typing.Iterator[dict]

Iterate over training samples from the dataset.

nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__len__() -> int
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._get_file_iterator(
worker_files: typing.List[str],
rng: random.Random,
split_single_file: bool,
file_start_pos: int,
file_end_pos: int
) -> typing.Iterator[dict]

Generate training samples from all assigned files, handling infinite iteration.

Parameters:

worker_files
List[str]

List of files assigned to this worker

rng
random.Random

Random number generator for shuffling

split_single_file
bool

Whether we’re splitting a single file among workers

file_start_pos
int

Starting position in file (for single file splitting)

file_end_pos
int

Ending position in file (for single file splitting)

nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._process_file_tokens(
file: str,
split_single_file: bool,
file_start_pos: int,
file_end_pos: int
) -> typing.Iterator[dict]

Process tokens from a single file and yield training samples.

Parameters:

file
str

Path to the .bin file to process

split_single_file
bool

Whether we’re splitting a single file among workers

file_start_pos
int

Starting position in the file (for single file splitting)

file_end_pos
int

Ending position in the file (for single file splitting)

nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._setup_worker_context(
files,
shuffle
) -> tuple[typing.List[str], random.Random, bool, int, int]

Set up worker-specific context including file assignment and splitting parameters.

Returns: tuple[List[str], random.Random, bool, int, int]

Tuple of (worker_files, rng, split_single_file, file_start_pos, file_end_pos)

nemo_automodel.components.datasets.llm.nanogpt_dataset._find_next_bos_with_index(
bos_positions: numpy.ndarray,
start_pos: int,
max_pos: int
) -> int

Find the next BOS token position using the index.

Parameters:

bos_positions
np.ndarray

Array of BOS token positions

start_pos
int

Current position to search from

max_pos
int

Maximum position to search up to

Returns: int

Position of next BOS token, or max_pos if none found.
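
An indexed lookup with this contract can be sketched with a binary search. This is not the module's implementation. It assumes the positions are sorted in ascending order, that a BOS exactly at `start_pos` counts as a match, and that `max_pos` is an exclusive bound. The name `find_next_bos` is hypothetical.

```python
import bisect


def find_next_bos(bos_positions, start_pos, max_pos):
    """Return the first indexed position p with start_pos <= p < max_pos, else max_pos."""
    # bisect_left finds the first entry >= start_pos in O(log n).
    i = bisect.bisect_left(bos_positions, start_pos)
    if i < len(bos_positions) and bos_positions[i] < max_pos:
        return bos_positions[i]
    return max_pos
```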

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_dtype_from_val(
n_bytes: int
) -> torch.dtype

Returns the torch.dtype for the given number of bytes per token (n_bytes), e.g. 2 for uint16 and 4 for uint32.

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_next_bos_position(
tokens: torch.Tensor,
bos_token: int,
bos_positions: numpy.ndarray,
pos: int,
max_pos: int
) -> int

Get the next BOS token position.

Parameters:

tokens
torch.Tensor

Tensor of tokens

bos_token
int

BOS token ID

bos_positions
np.ndarray

Array of BOS token positions

pos
int

Current position

max_pos
int

Maximum position

Returns: int

Next BOS token position
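
The index-or-fallback behaviour can be sketched as below, using plain Python sequences in place of tensors and arrays. Several points are assumptions rather than documented behaviour: the index positions are taken to be in the same units as `pos`, a missing index is represented by `None`, and `max_pos` is returned when no BOS token is found. The name `next_bos_position` is hypothetical.

```python
import bisect


def next_bos_position(tokens, bos_token, bos_positions, pos, max_pos):
    """Return the next BOS position in [pos, max_pos), or max_pos if there is none."""
    if bos_positions is not None:
        # Fast path: binary search in the precomputed, sorted index.
        i = bisect.bisect_left(bos_positions, pos)
        if i < len(bos_positions) and bos_positions[i] < max_pos:
            return bos_positions[i]
        return max_pos
    # Fallback: linear scan over the token stream.
    for i in range(pos, max_pos):
        if tokens[i] == bos_token:
            return i
    return max_pos
```

Both paths return the same answer on consistent data. The index only changes the cost, from a scan proportional to the distance to the next BOS token to a logarithmic search.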

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_start_end_pos_single_file(
total_tokens: int,
total_workers: int,
global_worker_id: int
) -> tuple[int, int]

Get the start and end positions for a single file, accounting for the number of workers.

Parameters:

total_tokens
int

Total number of tokens in the file

total_workers
int

Total number of workers

global_worker_id
int

Global worker ID

Returns: tuple[int, int]

Tuple of (start position, end position)
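
One plausible way to divide a single file among workers is an even contiguous split. The sketch below is an assumption about the scheme, not the documented algorithm: each worker gets `total_tokens // total_workers` tokens and the last worker also takes the remainder. The name `start_end_for_worker` is hypothetical.

```python
def start_end_for_worker(total_tokens, total_workers, global_worker_id):
    """Return the (start, end) token range for one worker, end exclusive."""
    per_worker = total_tokens // total_workers
    start = global_worker_id * per_worker
    if global_worker_id == total_workers - 1:
        # Assumption: the last worker absorbs the leftover tokens.
        end = total_tokens
    else:
        end = start + per_worker
    return start, end
```

For 10 tokens and 3 workers this yields the ranges (0, 3), (3, 6), and (6, 10), which cover the file without overlap.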

nemo_automodel.components.datasets.llm.nanogpt_dataset._get_worker_id_and_total_workers(
worker: torch.utils.data.get_worker_info
) -> tuple[int, int]

Get the worker ID and the total number of workers.

nemo_automodel.components.datasets.llm.nanogpt_dataset._load_bos_index(
path: str | os.PathLike
) -> numpy.ndarray | None

Load BOS token positions from a .bos.idx file if it exists.

Parameters:

path
str | os.PathLike

Path to the .bin file (will look for corresponding .bos.idx file)

Returns: np.ndarray | None

Array of BOS token positions if index file exists, None otherwise.
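
The "load if present" behaviour can be sketched with the standard library. Unlike the function above, this hypothetical `load_bos_index` takes the index path directly, because the exact naming rule that maps a .bin path to its .bos.idx file is not given here. It also assumes a headerless little-endian int32 array and returns a list instead of a NumPy array.

```python
import os
import struct


def load_bos_index(index_path):
    """Return BOS positions from a raw int32 file, or None if the file is missing."""
    if not os.path.exists(index_path):
        return None
    with open(index_path, "rb") as f:
        raw = f.read()
    count = len(raw) // 4  # 4 bytes per int32 entry
    # Assumption: little-endian int32 values with no header.
    return list(struct.unpack_from(f"<{count}i", raw, 0))
```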

nemo_automodel.components.datasets.llm.nanogpt_dataset._peek_num_tokens(
path: str | os.PathLike
) -> int

Returns total number of tokens from the shard header, without traversing the data. Supports both legacy fineweb.py and new nanogpt_data_processor.py formats.
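
Reading the token count from the header alone can be sketched as follows. The constants match the values listed under Data. The function name `peek_num_tokens`, the little-endian assumption, and the error handling are illustrative and not taken from the module.

```python
import struct

HEADER_BYTES = 256 * 4
LEGACY_MAGIC = 20240520
MAGIC = 278895051


def peek_num_tokens(path):
    """Return header[2] (num_tokens) after validating the magic number."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_BYTES)  # only the header is read, never the tokens
    if len(raw) < HEADER_BYTES:
        raise ValueError(f"{path}: file is shorter than the {HEADER_BYTES}-byte header")
    magic, _version, num_tokens = struct.unpack_from("<3i", raw, 0)
    if magic not in (MAGIC, LEGACY_MAGIC):
        raise ValueError(f"{path}: unrecognised magic number {magic}")
    return num_tokens
```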

nemo_automodel.components.datasets.llm.nanogpt_dataset.load_bin_shard(
path: str | os.PathLike
) -> torch.Tensor

Memory-map a .bin shard and return it as a 1-D torch.uint16/uint32 tensor.

The returned tensor shares memory with the underlying file and is therefore extremely cheap to create. Do not modify it in-place.
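
The memory-mapping idea can be shown without torch or NumPy by using `mmap` and a typed `memoryview`. This is a standard-library analogue, not the function above: `map_shard` is a hypothetical name, the host is assumed to be little-endian with a 4-byte native `unsigned int`, and a legacy header (header[3] == 0) is treated as uint16.

```python
import mmap
import struct

HEADER_BYTES = 256 * 4
LEGACY_MAGIC = 20240520
MAGIC = 278895051


def map_shard(path):
    """Return (mm, view): the read-only mapping and a typed view of its tokens."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, _version, num_tokens, itemsize = struct.unpack_from("<4i", mm, 0)
    if magic not in (MAGIC, LEGACY_MAGIC):
        mm.close()
        raise ValueError(f"{path}: unrecognised magic number {magic}")
    itemsize = itemsize or 2  # legacy shards leave header[3] at 0
    fmt = {2: "H", 4: "I"}[itemsize]
    end = HEADER_BYTES + num_tokens * itemsize
    # No token bytes are copied here; pages are loaded lazily on access.
    view = memoryview(mm)[HEADER_BYTES:end].cast(fmt)
    return mm, view
```

The caller should keep `mm` alive for as long as `view` is in use, and release the view before closing the mapping.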

nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_BYTES = 256 * 4
nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_SIZE = 256
nemo_automodel.components.datasets.llm.nanogpt_dataset.LEGACY_MAGIC = 20240520
nemo_automodel.components.datasets.llm.nanogpt_dataset.MAGIC = 278895051
nemo_automodel.components.datasets.llm.nanogpt_dataset.VERSION = 1
nemo_automodel.components.datasets.llm.nanogpt_dataset.__all__ = ['NanogptDataset', 'load_bin_shard']