# nemo_automodel.components.datasets.llm.megatron.indexed_dataset

A self-contained port of Megatron-Core's indexed dataset loader.

Supports the original mmap and file-pointer readers for local \*.bin / \*.idx
pairs, plus optional streaming readers for object storage (S3 and MSC).

All three calls below are equivalent for local data:

```python
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset

# Bare prefix, without an extension
ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])

# Explicit path to the data (.bin) file
ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])

# Explicit path to the index (.idx) file
ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])
```

For object-storage data, pass an `ObjectStorageConfig`:

```python
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset, ObjectStorageConfig

cfg = ObjectStorageConfig(path_to_idx_cache="/tmp/idx_cache")
ds = IndexedDataset("s3://bucket/path/shard_00_text_document", object_storage_config=cfg)
```

## Module Contents

### Classes

| Name                                                                                                                            | Description                                                             |
| ------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| [`DType`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-DType)                                               | The NumPy data type Enum for reading the IndexedDataset indices         |
| [`IndexedDataset`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-IndexedDataset)                             | A fast, on-disk dataset backed by Megatron-style index + binary files.  |
| [`IndexedDatasetBuilder`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-IndexedDatasetBuilder)               | Builder class for the IndexedDataset class                              |
| [`ObjectStorageConfig`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-ObjectStorageConfig)                   | Configuration for reading `.bin`/`.idx` files from object storage.      |
| [`_BinReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_BinReader)                                     | Abstract class to read the data (.bin) file                             |
| [`_FileBinReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_FileBinReader)                             | A \_BinReader that reads from the data (.bin) file using a file pointer |
| [`_IndexReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_IndexReader)                                 | Object class to read the index (.idx) file                              |
| [`_IndexWriter`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_IndexWriter)                                 | Object class to write the index (.idx) file                             |
| [`_MMapBinReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_MMapBinReader)                             | A \_BinReader that memory maps the data (.bin) file                     |
| [`_MultiStorageClientBinReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_MultiStorageClientBinReader) | Read `.bin` data via NVIDIA's `multi_storage_client`.                   |
| [`_S3BinReader`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_S3BinReader)                                 | Stream `.bin` data from S3 via chunked ranged `GetObject` calls.        |

### Functions

| Name                                                                                                                  | Description                                                           |
| --------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`_cache_index_file`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_cache_index_file)             | Download `.idx` from object storage to `local_path`.                  |
| [`_get_index_cache_path`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_get_index_cache_path)     | Return the local cache path for `idx_path` under `path_to_idx_cache`. |
| [`_is_object_storage_path`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_is_object_storage_path) | Return `True` if `path` is an `s3://` or `msc://` URI.                |
| [`_normalize_prefix`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_normalize_prefix)             | -                                                                     |
| [`_parse_s3_path`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_parse_s3_path)                   | Split an `s3://bucket/key` URI into `(bucket, key)`.                  |
| [`get_bin_path`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-get_bin_path)                       | Return the binary-data path for a Megatron dataset prefix.            |
| [`get_idx_path`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-get_idx_path)                       | Return the index-file path for a Megatron dataset prefix.             |

### Data

[`OBJECT_STORAGE_BIN_READERS`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-OBJECT_STORAGE_BIN_READERS)

[`_INDEX_HEADER`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_INDEX_HEADER)

[`_MSC_PREFIX`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_MSC_PREFIX)

[`_S3_PREFIX`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_S3_PREFIX)

[`logger`](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-logger)

### API

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType
```

**Bases:** `enum.Enum`

The NumPy data type Enum for reading the IndexedDataset indices

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset(
    path_prefix: str,
    multimodal: bool = False,
    mmap: bool = True,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None
)
```

**Bases:** `Dataset`

A fast, on-disk dataset backed by Megatron-style index + binary files.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset.__getitem__(
    idx: typing.Union[int, numpy.integer, slice]
) -> typing.Union[numpy.ndarray, typing.Tuple[numpy.ndarray, typing.Any], typing.List[numpy.ndarray], typing.Tuple[typing.List[numpy.ndarray], numpy.ndarray]]
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset.exists(
    path_prefix: str
) -> bool
```

staticmethod

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset.get(
    idx: int,
    offset: int = 0,
    length: typing.Optional[int] = None
) -> typing.Union[numpy.ndarray, typing.Tuple[numpy.ndarray, typing.Any]]
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset.initialize(
    path_prefix: str,
    multimodal: bool,
    mmap: bool,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None
) -> None
```

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder(
    bin_path: str,
    dtype: typing.Type[numpy.number] = numpy.int32,
    multimodal: bool = False
)
```

Builder class for the IndexedDataset class

**Parameters:**

* `bin_path`: The path to the data (.bin) file.
* `dtype`: The dtype of the index file. Defaults to `numpy.int32`.
* `multimodal`: Whether the dataset is multimodal. Defaults to `False`.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder.add_document(
    tensor: torch.Tensor,
    lengths: typing.List[int],
    modes: typing.Optional[typing.List[int]] = None
) -> None
```

Add an entire document to the dataset

**Parameters:**

* `tensor`: The document to add.
* `lengths`: The lengths of each item in the document.
* `modes`: The modes for each item in the document. Defaults to `None`.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder.add_index(
    path_prefix: str
) -> None
```

Add an entire IndexedDataset to the dataset

**Parameters:**

* `path_prefix`: The index (.idx) and data (.bin) prefix.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder.add_item(
    tensor: torch.Tensor,
    mode: int = 0
) -> None
```

Add a single item to the dataset

**Parameters:**

* `tensor`: The item to add to the data file.
* `mode`: The mode for the item. Defaults to `0`.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder.end_document() -> None
```

Finalize the document, for use with IndexedDatasetBuilder.add\_item
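The interplay between `add_item` and `end_document` can be pictured with a toy model of the bookkeeping. The sketch below is not the library's implementation: `ToyBuilder` and its attributes are hypothetical, and it assumes that document boundaries are recorded as running sequence counts starting from 0.

```python
from typing import List


class ToyBuilder:
    """Hypothetical stand-in that tracks only the index bookkeeping."""

    def __init__(self) -> None:
        self.sequence_lengths: List[int] = []
        # Assumption: boundaries are sequence counts, with a leading 0.
        self.document_indices: List[int] = [0]

    def add_item(self, tokens: List[int]) -> None:
        # Each item becomes one sequence.
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # Record that the current document ends after the last added sequence.
        self.document_indices.append(len(self.sequence_lengths))


builder = ToyBuilder()
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()  # document 0 holds sequences 0 and 1
builder.add_item([6])
builder.end_document()  # document 1 holds sequence 2
```

Under these assumptions, consecutive entries of `document_indices` bracket the sequences that belong to each document.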

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder.finalize(
    idx_path: str
) -> None
```

Clean up and write the index (.idx) file

**Parameters:**

* `idx_path`: The path to the index file.

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig(
    path_to_idx_cache: str,
    bin_chunk_nbytes: int = 256 * 1024 * 1024
)
```

Dataclass

Configuration for reading `.bin`/`.idx` files from object storage.

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader()
```

Abstract

Abstract class to read the data (.bin) file

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader.read(
    dtype: typing.Type[numpy.number],
    count: int,
    offset: int
) -> numpy.ndarray
```

abstract

Read bytes into a numpy array.

**Parameters:**

* `dtype`: Data type of the returned array.
* `count`: Number of items to read.
* `offset`: Byte offset at which to start reading.

**Returns:** `numpy.ndarray`

An array with `count` items of data type `dtype`, constructed from the bytes
of the data file starting at `offset`.

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(
    bin_path: str
)
```

**Bases:** [\_BinReader](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_BinReader)

A \_BinReader that reads from the data (.bin) file using a file pointer

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader.read(
    dtype: typing.Type[numpy.number],
    count: int,
    offset: int
) -> numpy.ndarray
```

Read bytes into a numpy array.

**Parameters:**

* `dtype`: Data type of the returned array.
* `count`: Number of items to read.
* `offset`: Byte offset at which to start reading.

**Returns:** `numpy.ndarray`

An array with `count` items of data type `dtype`, constructed from the bytes
of the data file starting at `offset`.
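A file-pointer read of this shape amounts to a seek followed by a fixed-size read. The sketch below illustrates the idea with the standard-library `array` module standing in for NumPy; `read_items` and the temporary file are hypothetical and not part of this module.

```python
import array
import os
import tempfile


def read_items(bin_path: str, typecode: str, count: int, offset: int) -> array.array:
    """Read `count` items of `typecode` starting at byte `offset`."""
    items = array.array(typecode)
    with open(bin_path, "rb") as f:
        f.seek(offset)            # offset is in bytes, not items
        items.fromfile(f, count)  # raises EOFError if fewer items remain
    return items


# Write six C ints, then read three of them starting at item 2.
data = array.array("i", [10, 11, 12, 13, 14, 15])
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    data.tofile(tmp)

result = read_items(tmp.name, "i", count=3, offset=2 * data.itemsize)
os.remove(tmp.name)
```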

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(
    idx_path: str,
    multimodal: bool
)
```

Object class to read the index (.idx) file

**Parameters:**

* `idx_path`: The path to the index file.
* `multimodal`: Whether the dataset is multimodal.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader.__del__() -> None
```

Clean up the object

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader.__getitem__(
    idx: int
) -> typing.Tuple[numpy.int32, numpy.int64, typing.Optional[numpy.int8]]
```

Return the pointer, length, and mode at the index

**Parameters:**

* `idx`: The index into the dataset.

**Returns:** `Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]`

The pointer, length, and mode at the index.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader.__len__() -> int
```

Get the number of sequences in the dataset

**Returns:** `int`

The number of sequences in the dataset

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(
    idx_path: str,
    dtype: typing.Type[numpy.number]
)
```

Object class to write the index (.idx) file

**Parameters:**

* `idx_path`: The path to the index file.
* `dtype`: The dtype of the index file.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter.__enter__() -> '_IndexWriter'
```

Enter the context introduced by the 'with' keyword

**Returns:** `'_IndexWriter'`

The instance

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter.__exit__(
    exc_type: typing.Optional[typing.Type[BaseException]],
    exc_val: typing.Optional[BaseException],
    exc_tb: typing.Optional[types.TracebackType]
) -> typing.Optional[bool]
```

Exit the context introduced by the 'with' keyword

**Parameters:**

* `exc_type`: Exception type.
* `exc_val`: Exception value.
* `exc_tb`: Exception traceback object.

**Returns:** `Optional[bool]`

Whether to silence the exception.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter._sequence_pointers(
    sequence_lengths: typing.List[int]
) -> typing.List[int]
```

Build the sequence pointers per the sequence lengths and dtype size

**Parameters:**

* `sequence_lengths`: The length of each sequence.

**Returns:** `List[int]`

The pointer to the beginning of each sequence.
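Pointers of this kind are cumulative byte offsets: each sequence starts where the previous one ends, and its size in bytes is its length times the element size. A minimal sketch, assuming a hypothetical `itemsize` argument in place of the writer's stored dtype:

```python
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Return the byte offset at which each sequence begins."""
    pointers = []
    current = 0
    for length in sequence_lengths:
        pointers.append(current)
        current += length * itemsize
    return pointers


# Three sequences of 3, 2, and 4 tokens stored as 4-byte integers.
pointers = sequence_pointers([3, 2, 4], itemsize=4)
```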

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter.write(
    sequence_lengths: typing.List[int],
    sequence_modes: typing.Optional[typing.List[int]],
    document_indices: typing.List[int]
) -> None
```

Write the index (.idx) file

**Parameters:**

* `sequence_lengths`: The length of each sequence.
* `sequence_modes`: The mode of each sequence.
* `document_indices`: The sequence indices demarcating the end of each document.

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(
    bin_path: str
)
```

**Bases:** [\_BinReader](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_BinReader)

A \_BinReader that memory maps the data (.bin) file

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader.__del__() -> None
```

Clean up the object

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader.read(
    dtype: typing.Type[numpy.number],
    count: int,
    offset: int
) -> numpy.ndarray
```

Read bytes into a numpy array.

**Parameters:**

* `dtype`: Data type of the returned array.
* `count`: Number of items to read.
* `offset`: Byte offset at which to start reading.

**Returns:** `numpy.ndarray`

An array with `count` items of data type `dtype`, constructed from the bytes
of the data file starting at `offset`.
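Memory mapping lets a read at an arbitrary byte offset decode values straight from the mapped file without an explicit seek. The sketch below shows the idea with the standard-library `mmap` and `struct` modules rather than NumPy; `mmap_read` is a hypothetical helper, and the little-endian 32-bit format is an assumption made for the example.

```python
import mmap
import os
import struct
import tempfile
from typing import Tuple


def mmap_read(bin_path: str, count: int, offset: int) -> Tuple[int, ...]:
    """Decode `count` little-endian int32 values starting at byte `offset`."""
    with open(bin_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # unpack_from copies the values out before the mapping is closed.
            return struct.unpack_from(f"<{count}i", mm, offset)


with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(struct.pack("<6i", 10, 11, 12, 13, 14, 15))

values = mmap_read(tmp.name, count=2, offset=4 * 4)  # skip four 4-byte items
os.remove(tmp.name)
```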

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MultiStorageClientBinReader(
    bin_path: str,
    object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig
)
```

**Bases:** [\_BinReader](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_BinReader)

Read `.bin` data via NVIDIA's `multi_storage_client`.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MultiStorageClientBinReader.read(
    dtype: typing.Type[numpy.number],
    count: int,
    offset: int
) -> numpy.ndarray
```

Read `count` elements of `dtype` starting at byte `offset`.

```python
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader(
    bin_path: str,
    object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig
)
```

**Bases:** [\_BinReader](#nemo_automodel-components-datasets-llm-megatron-indexed_dataset-_BinReader)

Stream `.bin` data from S3 via chunked ranged `GetObject` calls.

A single in-memory chunk (sized by `ObjectStorageConfig.bin_chunk_nbytes`)
is cached so consecutive reads within the same chunk avoid network
round-trips. Random-access reads outside the current chunk trigger a new
ranged `GetObject`.
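The single-chunk caching strategy can be sketched independently of S3. In the example below, `ChunkCachedReader` and `fetch_range` are hypothetical; `fetch_range` stands in for a ranged `GetObject` call and simply slices an in-memory buffer. Starting each chunk at the requested offset is an assumption of this sketch, not necessarily how `_S3BinReader` aligns its chunks.

```python
from typing import Callable


class ChunkCachedReader:
    """Serve byte ranges from one cached chunk, refetching on a miss."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], chunk_nbytes: int) -> None:
        self._fetch_range = fetch_range
        self._chunk_nbytes = chunk_nbytes
        self._cache = b""
        self._cache_start = 0
        self.fetch_count = 0  # number of simulated network round-trips

    def read(self, offset: int, size: int) -> bytes:
        cache_end = self._cache_start + len(self._cache)
        hit = self._cache_start <= offset and offset + size <= cache_end
        if not hit:
            # Fetch at least one full chunk, or more if the request is larger.
            nbytes = max(size, self._chunk_nbytes)
            self._cache = self._fetch_range(offset, offset + nbytes)
            self._cache_start = offset
            self.fetch_count += 1
        start = offset - self._cache_start
        return self._cache[start : start + size]


blob = bytes(range(256))
reader = ChunkCachedReader(lambda start, end: blob[start:end], chunk_nbytes=64)

first = reader.read(0, 16)    # miss: fetches bytes 0..63
second = reader.read(16, 16)  # hit: served from the cached chunk
third = reader.read(200, 8)   # miss: outside the cached chunk
```

Sequential access patterns benefit most from this design, since a larger chunk size trades memory for fewer round-trips.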

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader.__del__() -> None
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader._extract_from_cache(
    offset: int,
    size: int
) -> bytes
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader.read(
    dtype: typing.Type[numpy.number],
    count: int,
    offset: int
) -> numpy.ndarray
```

Read `count` elements of `dtype` starting at byte `offset`.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._cache_index_file(
    remote_path: str,
    local_path: str
) -> None
```

Download `.idx` from object storage to `local_path`.

Rank 0 performs the download and other ranks wait on a `torch.distributed`
barrier. If the local file already exists this is a no-op.

**Raises:**

* `ImportError`: If the relevant client library (`boto3` for `s3://` or
  `multi_storage_client` for `msc://`) is not installed.
* `ValueError`: If `remote_path` is neither an `s3://` nor an
  `msc://` URI.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._get_index_cache_path(
    idx_path: str,
    object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig
) -> str
```

Return the local cache path for `idx_path` under `path_to_idx_cache`.
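One plausible layout, shown here only as a sketch, mirrors the remote path beneath the cache directory after dropping the URI scheme. `index_cache_path` is a hypothetical helper, and the actual layout used by `_get_index_cache_path` may differ.

```python
import posixpath


def index_cache_path(idx_path: str, path_to_idx_cache: str) -> str:
    """Map a remote .idx URI to a path under the local cache directory."""
    for prefix in ("s3://", "msc://"):
        if idx_path.startswith(prefix):
            relative = idx_path[len(prefix):]
            return posixpath.join(path_to_idx_cache, relative)
    raise ValueError(f"not an object-storage path: {idx_path}")


cached = index_cache_path("s3://bucket/data/shard_00.idx", "/tmp/idx_cache")
```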

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._is_object_storage_path(
    path: str
) -> bool
```

Return `True` if `path` is an `s3://` or `msc://` URI.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(
    path_prefix: str
) -> str
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._parse_s3_path(
    path: str
) -> typing.Tuple[str, str]
```

Split an `s3://bucket/key` URI into `(bucket, key)`.
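The split itself is simple string handling: drop the scheme, then divide at the first slash. A minimal sketch with a hypothetical `parse_s3_path` function (error handling in the real helper may differ):

```python
from typing import Tuple


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into ('bucket', 'key')."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"expected an s3:// URI, got {path!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = path[len(prefix):].partition("/")
    return bucket, key


bucket, key = parse_s3_path("s3://bucket/path/shard_00_text_document.bin")
```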

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(
    path_prefix: str
) -> str
```

Return the binary-data path for a Megatron dataset prefix.

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(
    path_prefix: str
) -> str
```

Return the index-file path for a Megatron dataset prefix.
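Because `IndexedDataset` accepts a bare prefix, a `.bin` path, or a `.idx` path interchangeably, the path helpers can be pictured as stripping a known suffix and then appending the wanted one. A minimal sketch with hypothetical function names, not the module's actual code:

```python
def normalize_prefix(path_prefix: str) -> str:
    """Drop a trailing '.bin' or '.idx' so all three spellings agree."""
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def bin_path(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".bin"


def idx_path(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".idx"


prefix = "/path/to/shard_00_text_document"
# All three spellings resolve to the same data file.
paths = {bin_path(p) for p in (prefix, prefix + ".bin", prefix + ".idx")}
```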

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.OBJECT_STORAGE_BIN_READERS: Dict[str, Type[_BinReader]] = {'s3': _S3BinReader, 'msc': _MultiStorageClientBinReader}
```
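A registry like this lends itself to dispatch on the URI scheme. The sketch below uses hypothetical placeholder classes and a hypothetical `select_reader` helper to show one way such a lookup might work; it does not reproduce how `IndexedDataset` actually chooses a reader.

```python
from typing import Dict, Optional, Type


class S3Reader:  # placeholder for an S3-backed reader
    pass


class MSCReader:  # placeholder for a multi_storage_client-backed reader
    pass


READERS: Dict[str, Type] = {"s3": S3Reader, "msc": MSCReader}


def select_reader(path: str) -> Optional[Type]:
    """Return the reader class for an object-storage URI, or None if local."""
    scheme, sep, _ = path.partition("://")
    if not sep:
        return None  # no scheme: treat as a local path
    return READERS.get(scheme)


s3_cls = select_reader("s3://bucket/shard.bin")
local_cls = select_reader("/data/shard.bin")
```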

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER = b'MMIDIDX\x00\x00'
```
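A fixed magic value like this is typically compared against the first bytes of a file to reject anything that is not an index. A minimal sketch, assuming the header sits at the very start of the `.idx` file; `has_index_header` is a hypothetical helper.

```python
import io
from typing import BinaryIO

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same 9-byte value as _INDEX_HEADER


def has_index_header(stream: BinaryIO) -> bool:
    """Check whether the stream begins with the index magic bytes."""
    return stream.read(len(INDEX_HEADER)) == INDEX_HEADER


good = has_index_header(io.BytesIO(INDEX_HEADER + b"\x01\x00"))
bad = has_index_header(io.BytesIO(b"not an index file"))
```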

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MSC_PREFIX = 'msc://'
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3_PREFIX = 's3://'
```

```python
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger = logging.getLogger(__name__)
```