# nemo_automodel.components.datasets.llm.nanogpt_dataset

PyTorch IterableDataset for .bin shards written by NanoGPT preprocessing scripts.

Supports both legacy fineweb.py format and the newer nanogpt\_data\_processor.py format.

Legacy format (fineweb.py):

* `int32[256]` header
  * `header[0] = 20240520`: magic number
  * `header[1] = 1`: version
  * `header[2] = num_tokens`: number of uint16 tokens that follow
  * `header[3]`: unused, defaults to 0
* `uint16[num_tokens]` tokens

New format (nanogpt\_data\_processor.py):

* `int32[256]` header
  * `header[0] = 2788_95051`: magic number (that is, 278895051)
  * `header[1] = 1`: version
  * `header[2] = num_tokens`: number of tokens that follow
  * `header[3] = dtype.itemsize`: bytes per token (2 for uint16, 4 for uint32)
* `uint16/uint32[num_tokens]` tokens
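The header layouts described above can be exercised with a few lines of standard-library Python. The sketch below writes a tiny new-format shard and reads its header back. `write_shard` and `read_header` are hypothetical helpers, not part of `nemo_automodel`, and little-endian byte order is an assumption, since the format description does not state it.

```python
import os
import struct
import tempfile
from array import array

HEADER_SIZE = 256               # number of int32 header slots
HEADER_BYTES = HEADER_SIZE * 4  # 1024 bytes
MAGIC = 278895051               # new-format magic number
LEGACY_MAGIC = 20240520         # legacy fineweb.py magic number


def write_shard(path, tokens, itemsize=2):
    """Write a new-format shard: 256 int32 header slots, then the tokens."""
    header = [0] * HEADER_SIZE
    header[0], header[1], header[2], header[3] = MAGIC, 1, len(tokens), itemsize
    typecode = "H" if itemsize == 2 else "I"  # uint16 or uint32 on common platforms
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{HEADER_SIZE}i", *header))  # assumed little-endian
        f.write(array(typecode, tokens).tobytes())          # native byte order


def read_header(path):
    """Return (magic, version, num_tokens, itemsize) from the first 1024 bytes."""
    with open(path, "rb") as f:
        header = struct.unpack(f"<{HEADER_SIZE}i", f.read(HEADER_BYTES))
    magic, version, num_tokens, itemsize = header[:4]
    if magic == LEGACY_MAGIC:
        itemsize = 2  # legacy shards store uint16 tokens; header[3] is unused
    elif magic != MAGIC:
        raise ValueError(f"unrecognized magic number: {magic}")
    return magic, version, num_tokens, itemsize


# Round-trip a five-token shard through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "example.bin")
    write_shard(shard, [1, 2, 3, 4, 5])
    parsed = read_header(shard)
    size = os.path.getsize(shard)

print(parsed, size)
```

Reading only the first 1024 bytes is enough to learn the token count and token width, which is why a header peek does not need to traverse the token data.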

Optionally, a corresponding .bos.idx file can exist alongside each .bin file:

* `int32[n_bos_tokens]` bos\_positions: array of absolute byte positions where BOS tokens occur in the .bin file

The dataset streams one contiguous *seq\_len* token slice at a time and
returns the pair `(inputs, labels)` where `labels` is shifted by one
position.  Optionally, slices can be forced to start at the BOS token
(`align_to_bos=True`). When BOS alignment is enabled, the dataset will use
.bos.idx files for efficient BOS token lookup when available, falling back
to linear search otherwise.
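The slicing described above can be illustrated on a plain Python list. This is a minimal sketch, not the module's implementation: `iter_slices` is a hypothetical name, and advancing by exactly `seq_len` tokens between samples is an assumption.

```python
def iter_slices(tokens, seq_len):
    """Yield (inputs, labels) pairs where labels are inputs shifted by one."""
    pos = 0
    # Each sample needs seq_len + 1 tokens: seq_len inputs plus one extra target.
    while pos + seq_len + 1 <= len(tokens):
        buf = tokens[pos : pos + seq_len + 1]
        yield buf[:-1], buf[1:]
        pos += seq_len  # assumed stride; the real dataset may advance differently


samples = list(iter_slices(list(range(10)), seq_len=4))
for inputs, labels in samples:
    print(inputs, labels)
```

With ten tokens and `seq_len=4`, two full samples fit; the trailing tokens that cannot fill a complete window are dropped in this sketch.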

This file is copied (with minimal adjustments) from
`modded-nanogpt/data/bin_dataset.py` so that projects depending on
`nemo_automodel` can directly import the dataset class (exported here as
`NanogptDataset`) without taking a runtime dependency on the NanoGPT codebase.

## Module Contents

### Classes

| Name                                                                                       | Description                        |
| ------------------------------------------------------------------------------------------ | ---------------------------------- |
| [`NanogptDataset`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-NanogptDataset) | Dataset class for NanoGPT Dataset. |

### Functions

| Name                                                                                                                           | Description                                                                              |
| ------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| [`_find_next_bos_with_index`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_find_next_bos_with_index)               | Find the next BOS token position using the index.                                        |
| [`_get_dtype_from_val`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_get_dtype_from_val)                           | Returns the torch.dtype for the given token width in bytes.                              |
| [`_get_next_bos_position`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_get_next_bos_position)                     | Get the next BOS token position.                                                         |
| [`_get_start_end_pos_single_file`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_get_start_end_pos_single_file)     | Get the start and end positions for a single file, accounting for the number of workers. |
| [`_get_worker_id_and_total_workers`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_get_worker_id_and_total_workers) | Get the worker ID and the total number of workers.                                       |
| [`_load_bos_index`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_load_bos_index)                                   | Load BOS token positions from a .bos.idx file if it exists.                              |
| [`_peek_num_tokens`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-_peek_num_tokens)                                 | Returns total number of tokens from the shard header, without traversing the data.       |
| [`load_bin_shard`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-load_bin_shard)                                     | Memory-map a *.bin* shard and return it as a 1-D `torch.uint16/uint32` tensor.           |

### Data

[`HEADER_BYTES`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-HEADER_BYTES)

[`HEADER_SIZE`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-HEADER_SIZE)

[`LEGACY_MAGIC`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-LEGACY_MAGIC)

[`MAGIC`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-MAGIC)

[`VERSION`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-VERSION)

[`__all__`](#nemo_automodel-components-datasets-llm-nanogpt_dataset-__all__)

### API

```python
class nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset(
    file_pattern: str | typing.Sequence[str],
    seq_len: int,
    bos_token: int | None = None,
    shuffle_files: bool = False,
    align_to_bos: bool = False
)
```

**Bases:** `IterableDataset`

Dataset class for NanoGPT Dataset.

A NanoGPT dataset stores tokens in binary `.bin` files.
Each file contains:

* A 256x4-byte header (magic number, version, num\_tokens, dtype.itemsize)
* The tokens themselves

Optionally, a corresponding .bos.idx file can be present alongside each .bin file
containing precomputed BOS token positions for efficient alignment when
`align_to_bos=True`. If the index file is not present, the dataset falls back
to linear search for BOS tokens.

**Parameters:**

* `file_pattern` (str | Sequence\[str]): Glob pattern (e.g. `"data/fineweb_*_train_*.bin"`) **or** an explicit list of file paths.
* `seq_len` (int): Length of the training sample returned (not counting the next-token target). Labels are the inputs shifted by one position.
* `shuffle_files` (bool, default False): Shuffle the order of shards each epoch/iteration.
* `align_to_bos` (bool, default False): Ensure that every slice starts with `bos_token`. When enabled, the dataset searches forward from the current position until it finds the next BOS token and starts there. Uses .bos.idx files when available for efficient search and falls back to linear search otherwise. Requires `bos_token` to be provided.
* `bos_token` (int, optional, default None): Token ID marking beginning-of-document.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__getitem__(
    index: int
)
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__iter__() -> typing.Iterator[dict]
```

Iterate over training samples from the dataset.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._get_file_iterator(
    worker_files: typing.List[str],
    rng: random.Random,
    split_single_file: bool,
    file_start_pos: int,
    file_end_pos: int
) -> typing.Iterator[dict]
```

Generate training samples from all assigned files, handling infinite iteration.

**Parameters:**

* `worker_files`: List of files assigned to this worker
* `rng`: Random number generator for shuffling
* `split_single_file`: Whether we're splitting a single file among workers
* `file_start_pos`: Starting position in file (for single file splitting)
* `file_end_pos`: Ending position in file (for single file splitting)

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._process_file_tokens(
    file: str,
    split_single_file: bool,
    file_start_pos: int,
    file_end_pos: int
) -> typing.Iterator[dict]
```

Process tokens from a single file and yield training samples.

**Parameters:**

* `file`: Path to the .bin file to process
* `split_single_file`: Whether we're splitting a single file among workers
* `file_start_pos`: Starting position in the file (for single file splitting)
* `file_end_pos`: Ending position in the file (for single file splitting)

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset._setup_worker_context(
    files,
    shuffle
) -> tuple[typing.List[str], random.Random, bool, int, int]
```

Set up worker-specific context including file assignment and splitting parameters.

**Returns:** `tuple[List[str], random.Random, bool, int, int]`

Tuple of (worker\_files, rng, split\_single\_file, file\_start\_pos, file\_end\_pos)

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._find_next_bos_with_index(
    bos_positions: numpy.ndarray,
    start_pos: int,
    max_pos: int
) -> int
```

Find the next BOS token position using the index.

**Parameters:**

* `bos_positions`: Array of BOS token positions
* `start_pos`: Current position to search from
* `max_pos`: Maximum position to search up to

**Returns:** `int`

Position of next BOS token, or max\_pos if none found.
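Because the index is a sorted array of positions, the lookup can be done with a binary search instead of a scan. The sketch below shows one way to get the documented behavior (next BOS position, or `max_pos` if none is found) using `bisect`. The name `find_next_bos` and the assumption that `bos_positions` is sorted in ascending order are illustrative, not taken from the module's source.

```python
from bisect import bisect_left


def find_next_bos(bos_positions, start_pos, max_pos):
    """Return the first indexed BOS position >= start_pos, or max_pos if none."""
    # bisect_left gives the index of the first entry that is >= start_pos.
    i = bisect_left(bos_positions, start_pos)
    if i < len(bos_positions) and bos_positions[i] < max_pos:
        return bos_positions[i]
    return max_pos  # no BOS token inside [start_pos, max_pos)


positions = [0, 7, 19, 42]  # hypothetical contents of a .bos.idx file
hits = [
    find_next_bos(positions, 8, 100),
    find_next_bos(positions, 7, 100),
    find_next_bos(positions, 43, 100),
    find_next_bos(positions, 8, 15),
]
print(hits)
```

The binary search costs O(log n) per lookup in the number of BOS tokens, which is the efficiency gain the index file is meant to provide over a linear scan of the tokens.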

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._get_dtype_from_val(
    n_bytes: int
) -> torch.dtype
```

Returns the torch.dtype for the given token width in bytes (`n_bytes`), for example 2 for uint16 and 4 for uint32.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._get_next_bos_position(
    tokens: torch.Tensor,
    bos_token: int,
    bos_positions: numpy.ndarray,
    pos: int,
    max_pos: int
) -> int
```

Get the next BOS token position.

**Parameters:**

* `tokens`: Tensor of tokens
* `bos_token`: BOS token ID
* `bos_positions`: Array of BOS token positions
* `pos`: Current position
* `max_pos`: Maximum position

**Returns:** `int`

Next BOS token position
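When no .bos.idx file is available, the dataset falls back to a linear search. A fallback of that kind could look like the sketch below, which works on any indexable token sequence. `next_bos_linear` is a hypothetical name, and returning `max_pos` when no BOS token is found is an assumption that mirrors the index-based lookup.

```python
def next_bos_linear(tokens, bos_token, pos, max_pos):
    """Scan forward from pos and return the first index holding bos_token."""
    for i in range(pos, min(max_pos, len(tokens))):
        if tokens[i] == bos_token:
            return i
    return max_pos  # assumed sentinel when no BOS token lies in [pos, max_pos)


tokens = [5, 1, 9, 9, 1, 3]  # token ID 1 plays the role of BOS here
found = [
    next_bos_linear(tokens, 1, 0, len(tokens)),
    next_bos_linear(tokens, 1, 2, len(tokens)),
    next_bos_linear(tokens, 1, 5, len(tokens)),
]
print(found)
```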

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._get_start_end_pos_single_file(
    total_tokens: int,
    total_workers: int,
    global_worker_id: int
) -> tuple[int, int]
```

Get the start and end positions for a single file, accounting for the number of workers.

**Parameters:**

* `total_tokens`: Total number of tokens in the file
* `total_workers`: Total number of workers
* `global_worker_id`: Global worker ID

**Returns:** `tuple[int, int]`

Tuple of (start position, end position)
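One common way to divide a single file among workers is to give each worker a contiguous, non-overlapping token range. The sketch below illustrates that idea only; the even split and the choice to hand the remainder to the last worker are assumptions, and `split_range` is a hypothetical name rather than the module's function.

```python
def split_range(total_tokens, total_workers, global_worker_id):
    """Return the (start, end) token range for one worker, end exclusive."""
    per_worker = total_tokens // total_workers
    start = global_worker_id * per_worker
    if global_worker_id == total_workers - 1:
        end = total_tokens  # assumed: last worker absorbs the remainder
    else:
        end = start + per_worker
    return start, end


ranges = [split_range(10, 3, worker_id) for worker_id in range(3)]
print(ranges)
```

Contiguous ranges keep each worker's reads sequential within the memory-mapped shard and ensure that no token is served by two workers.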

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._get_worker_id_and_total_workers(
    worker: torch.utils.data.get_worker_info
) -> tuple[int, int]
```

Get the worker ID and the total number of workers.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._load_bos_index(
    path: str | os.PathLike
) -> numpy.ndarray | None
```

Load BOS token positions from a .bos.idx file if it exists.

**Parameters:**

* `path`: Path to the .bin file (will look for corresponding .bos.idx file)

**Returns:** `np.ndarray | None`

Array of BOS token positions if index file exists, None otherwise.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset._peek_num_tokens(
    path: str | os.PathLike
) -> int
```

Returns total number of tokens from the shard header, without traversing the data.
Supports both legacy fineweb.py and new nanogpt\_data\_processor.py formats.

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.load_bin_shard(
    path: str | os.PathLike
) -> torch.Tensor
```

Memory-map a *.bin* shard and return it as a 1-D `torch.uint16/uint32` tensor.

The returned tensor **shares** memory with the underlying file, so
creating it is cheap. Do *not* modify it in place.
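The memory-sharing idea can be shown without PyTorch by using the standard-library `mmap` module. This is a sketch of the general technique rather than how `load_bin_shard` is implemented: `map_tokens` is a hypothetical helper, it returns a `memoryview` instead of a tensor, and it assumes a little-endian host so that the native-order view matches the bytes written.

```python
import mmap
import os
import struct
import tempfile

HEADER_BYTES = 256 * 4


def map_tokens(path, itemsize=2):
    """Memory-map a shard and return (mmap object, typed view of the tokens)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f closes.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header, then reinterpret the remaining bytes as uint16/uint32.
    view = memoryview(mm)[HEADER_BYTES:].cast("H" if itemsize == 2 else "I")
    return mm, view


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as f:
        # New-format header: magic, version, num_tokens=4, itemsize=2, padding.
        f.write(struct.pack("<256i", 278895051, 1, 4, 2, *([0] * 252)))
        f.write(struct.pack("<4H", 10, 20, 30, 40))
    mm, view = map_tokens(path)
    first_tokens = list(view)  # pages are read from disk only when accessed
    view.release()             # release the view before closing the mapping
    mm.close()

print(first_tokens)
```

No token data is copied when the view is created; the operating system pages it in on access, which is what makes mapping a large shard inexpensive.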

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_BYTES = 256 * 4
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.HEADER_SIZE = 256
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.LEGACY_MAGIC = 20240520
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.MAGIC = 278895051
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.VERSION = 1
```

```python
nemo_automodel.components.datasets.llm.nanogpt_dataset.__all__ = ['NanogptDataset', 'load_bin_shard']
```