nemo_automodel.components.datasets.llm.megatron.indexed_dataset#

A self-contained port of Megatron-Core’s indexed dataset loader.

Supports the original mmap and file-pointer readers for local *.bin / *.idx pairs, plus optional streaming readers for object storage (S3 and MSC).

All three calls below are equivalent for local data:

from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset

ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])

For object-storage data, pass an ObjectStorageConfig:

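from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset, ObjectStorageConfig

cfg = ObjectStorageConfig(path_to_idx_cache="/tmp/idx_cache")
# mmap must be False for object-storage paths (see the IndexedDataset parameters below)
ds = IndexedDataset("s3://bucket/path/shard_00_text_document", mmap=False, object_storage_config=cfg)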

Module Contents#

Classes#

ObjectStorageConfig

Configuration for reading .bin/.idx files from object storage.

DType

The NumPy data type Enum for reading the IndexedDataset indices

_IndexWriter

Object class to write the index (.idx) file

_IndexReader

Object class to read the index (.idx) file

_BinReader

Abstract class to read the data (.bin) file

_MMapBinReader

A _BinReader that memory maps the data (.bin) file

_FileBinReader

A _BinReader that reads from the data (.bin) file using a file pointer

_S3BinReader

Stream .bin data from S3 via chunked ranged GetObject calls.

_MultiStorageClientBinReader

Read .bin data via NVIDIA’s multi_storage_client package.

IndexedDataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

IndexedDatasetBuilder

Builder class for the IndexedDataset class

Functions#

_is_object_storage_path

Return True if path is an s3:// or msc:// URI.

_parse_s3_path

Split an s3://bucket/key URI into (bucket, key).

_get_index_cache_path

Return the local cache path for idx_path under path_to_idx_cache.

_cache_index_file

Download .idx from object storage to local_path.

get_idx_path

Return the index-file path for a Megatron dataset prefix.

get_bin_path

Return the binary-data path for a Megatron dataset prefix.

_normalize_prefix

Data#

API#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3_PREFIX#

‘s3://’

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MSC_PREFIX#

‘msc://’

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig#

Configuration for reading .bin/.idx files from object storage.

.. attribute:: path_to_idx_cache

Local directory where the .idx file is cached on first use. Re-used across ranks via a per-host directory layout.

.. attribute:: bin_chunk_nbytes

Size in bytes of each chunked range read against the .bin object. Defaults to 256 MiB. Larger values reduce request count but increase per-rank memory footprint.

path_to_idx_cache: str#

None

bin_chunk_nbytes: int#

None

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._is_object_storage_path(path: str) bool#

Return True if path is an s3:// or msc:// URI.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._parse_s3_path(path: str) Tuple[str, str]#

Split an s3://bucket/key URI into (bucket, key).
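The two URI helpers above can be sketched as follows. This is an illustration under assumptions, not the module's source: the function names drop the leading underscore, and the error raised for a non-S3 path is a guess.

```python
from typing import Tuple

S3_PREFIX = "s3://"
MSC_PREFIX = "msc://"


def is_object_storage_path(path: str) -> bool:
    """Return True for s3:// and msc:// URIs, False for anything else."""
    return path.startswith((S3_PREFIX, MSC_PREFIX))


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not path.startswith(S3_PREFIX):
        # Assumption: the real helper may report this case differently.
        raise ValueError(f"Not an S3 URI: {path!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = path[len(S3_PREFIX):].partition("/")
    return bucket, key
```

Only the first slash separates bucket from key, so keys that contain further slashes survive intact.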

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._get_index_cache_path(
idx_path: str,
object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
) str#

Return the local cache path for idx_path under path_to_idx_cache.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._cache_index_file(remote_path: str, local_path: str) None#

Download .idx from object storage to local_path.

Rank 0 performs the download and other ranks wait on a torch.distributed barrier. If the local file already exists this is a no-op.

Raises:
  • ImportError – If the relevant client library (boto3 for s3:// or multi_storage_client for msc://) is not installed.

  • ValueError – If remote_path is neither an s3:// nor an msc:// URI.
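The rank-0-downloads, everyone-waits pattern described above might look roughly like the sketch below. To stay dependency-free it takes the rank, the barrier, and the download function as parameters; all three are hypothetical stand-ins for torch.distributed and the boto3 or multi_storage_client calls, and the temporary-file rename is an addition of this sketch rather than documented behavior.

```python
import os
from typing import Callable


def cache_index_file(
    remote_path: str,
    local_path: str,
    download: Callable[[str, str], None],
    rank: int = 0,
    barrier: Callable[[], None] = lambda: None,
) -> None:
    """Make sure local_path exists, downloading it on rank 0 if needed."""
    if rank == 0 and not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        # Download to a temporary name so a partial file is never visible.
        tmp_path = local_path + ".tmp"
        download(remote_path, tmp_path)
        os.replace(tmp_path, local_path)
    # Every rank reaches the barrier, so the collective call stays matched
    # whether or not a download was needed.
    barrier()
```

In a real job the barrier argument would presumably be torch.distributed.barrier and the rank would come from torch.distributed.get_rank().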

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#

b’MMIDIDX\x00\x00’
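A quick way to sanity-check a local .idx file is to compare its leading bytes with this constant. The sketch assumes, as in upstream Megatron-Core, that the header is the first thing written to the file; has_index_header is a hypothetical helper, not part of the module.

```python
INDEX_HEADER = b"MMIDIDX\x00\x00"  # value listed for _INDEX_HEADER above


def has_index_header(idx_path: str) -> bool:
    """Return True if the file begins with the 9-byte index header."""
    with open(idx_path, "rb") as f:
        return f.read(len(INDEX_HEADER)) == INDEX_HEADER
```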

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#

Bases: enum.Enum

The NumPy data type Enum for reading the IndexedDataset indices

Initialization

uint8#

1

int8#

2

int16#

3

int32#

4

int64#

5

float64#

6

float32#

7

uint16#

8

classmethod code_from_dtype(value: Type[numpy.number]) int#

Get the code from the dtype

Parameters:

value (Type[numpy.number]) – The dtype

Returns:

The code

Return type:

int

classmethod dtype_from_code(value: int) Type[numpy.number]#

Get the dtype from the code

Parameters:

value (int) – The code

Returns:

The dtype

Return type:

Type[numpy.number]

classmethod size(key: Union[int, Type[numpy.number]]) int#

Get the size of the dtype/code in bytes

Parameters:

key (Union[int, Type[numpy.number]]) – The dtype or code

Raises:

ValueError – If the key is neither dtype nor integer code

Returns:

The size of the dtype/code in bytes

Return type:

int
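The code-to-size mapping can be illustrated without NumPy. The sketch below copies the integer codes from the listing above and uses the standard-library struct module as a stand-in for numpy item sizes; DTypeCode and size_in_bytes are hypothetical names, and the real classmethod also accepts NumPy types, which this version does not.

```python
import struct
from enum import Enum


class DTypeCode(Enum):
    """Integer codes copied from the DType listing above."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8


# struct format characters with the same width as each dtype
_STRUCT_FORMAT = {
    "uint8": "B", "int8": "b", "int16": "h", "int32": "i",
    "int64": "q", "float64": "d", "float32": "f", "uint16": "H",
}


def size_in_bytes(key) -> int:
    """Accept an integer code or a DTypeCode member and return the item size."""
    if isinstance(key, int):
        member = DTypeCode(key)  # raises ValueError for unknown codes
    elif isinstance(key, DTypeCode):
        member = key
    else:
        raise ValueError(f"Expected an int code or DTypeCode, got {type(key).__name__}")
    # "<" selects standard sizes, independent of the platform's C types.
    return struct.calcsize("<" + _STRUCT_FORMAT[member.name])
```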

classmethod optimal_dtype(
cardinality: Optional[int],
) Type[numpy.number]#

Get the dtype to use for an index of a certain cardinality

Parameters:

cardinality (Optional[int]) – The number of elements to be indexed

Returns:

The dtype to use for the index

Return type:

Type[numpy.number]
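This page does not state the selection rule. Upstream Megatron-Core picks a 16-bit unsigned type when the cardinality (typically the vocabulary size) is below 65500 and a 32-bit signed type otherwise, and the sketch below assumes this port does the same. It returns dtype names as strings to avoid a NumPy dependency; the function name and the constant are hypothetical.

```python
from typing import Optional

UINT16_LIMIT = 65500  # assumed threshold, taken from upstream; not stated on this page


def optimal_dtype_name(cardinality: Optional[int]) -> str:
    """Return the name of the smallest dtype assumed to hold the given cardinality."""
    if cardinality is not None and cardinality < UINT16_LIMIT:
        return "uint16"
    # Unknown or large cardinalities fall back to the wider type.
    return "int32"
```

Under that assumption a 32,000-token vocabulary would be stored at two bytes per token, while a 128,256-token vocabulary would need four.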

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#

Bases: object

Object class to write the index (.idx) file

Parameters:
  • idx_path (str) – The path to the index file

  • dtype (Type[numpy.number]) – The dtype of the index file

Initialization

__enter__() nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter#

Enter the context introduced by the ‘with’ keyword

Returns:

The instance

Return type:

_IndexWriter

__exit__(
exc_type: Optional[Type[BaseException]],
exc_val: Optional[BaseException],
exc_tb: Optional[types.TracebackType],
) Optional[bool]#

Exit the context introduced by the ‘with’ keyword

Parameters:
  • exc_type (Optional[Type[BaseException]]) – Exception type

  • exc_val (Optional[BaseException]) – Exception value

  • exc_tb (Optional[TracebackType]) – Exception traceback object

Returns:

Whether to silence the exception

Return type:

Optional[bool]

write(
sequence_lengths: List[int],
sequence_modes: Optional[List[int]],
document_indices: List[int],
) None#

Write the index (.idx) file

Parameters:
  • sequence_lengths (List[int]) – The length of each sequence

  • sequence_modes (Optional[List[int]]) – The mode of each sequence

  • document_indices (List[int]) – The sequence indices demarcating the end of each document

_sequence_pointers(
sequence_lengths: List[int],
) List[int]#

Build the sequence pointers per the sequence lengths and dtype size

Parameters:

sequence_lengths (List[int]) – The length of each sequence

Returns:

The pointer to the beginning of each sequence

Return type:

List[int]
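Building the pointers amounts to an exclusive running sum of sequence lengths scaled by the item size. A minimal sketch, with itemsize passed explicitly rather than derived from the writer's dtype as the private method presumably does:

```python
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Return the byte offset at which each sequence starts in the .bin file."""
    pointers = []
    offset = 0
    for length in sequence_lengths:
        pointers.append(offset)        # this sequence starts where the previous one ended
        offset += length * itemsize    # advance by the sequence's size in bytes
    return pointers
```

For lengths [3, 5, 2] stored as 4-byte items, the sequences start at byte offsets 0, 12, and 32.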

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#

Object class to read the index (.idx) file

Parameters:
  • idx_path (str) – The path to the index file

  • multimodal (bool) – Whether the dataset is multimodal

Initialization

__del__() None#

Clean up the object

__len__() int#

Get the number of sequences in the dataset

Returns:

The number of sequences in the dataset

Return type:

int

__getitem__(
idx: int,
) Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]#

Return the pointer, length, and mode at the index

Parameters:

idx (int) – The index into the dataset

Returns:

The pointer, length and mode at the index

Return type:

Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#

Bases: abc.ABC

Abstract class to read the data (.bin) file

abstractmethod read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that memory maps the data (.bin) file

Initialization

Initialize the _MMapBinReader

Parameters:

bin_path (str) – The path to the data (.bin) file.

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

__del__() None#

Clean up the object

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that reads from the data (.bin) file using a file pointer

Initialization

Initialize the _FileBinReader

Parameters:

bin_path (str) – The path to the data (.bin) file.

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray
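The seek-then-read pattern behind a file-pointer reader can be sketched with the standard library alone. The version below swaps in array.array for numpy.ndarray so it runs without dependencies, and it opens the file on every call, which may differ from how _FileBinReader manages its handle; read_items is a hypothetical name.

```python
import array


def read_items(bin_path: str, typecode: str, count: int, offset: int) -> array.array:
    """Read count items of the given array typecode, starting at byte offset."""
    items = array.array(typecode)
    with open(bin_path, "rb") as f:
        f.seek(offset)              # offset is in bytes, not items
        items.fromfile(f, count)    # raises EOFError if fewer than count items remain
    return items
```

Unlike a memory map, this approach copies only the requested items into memory, at the cost of a seek and a read per request.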

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader(
bin_path: str,
object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

Stream .bin data from S3 via chunked ranged GetObject calls.

A single in-memory chunk (sized by ObjectStorageConfig.bin_chunk_nbytes) is cached so consecutive reads within the same chunk avoid network round-trips. Random-access reads outside the current chunk trigger a new ranged GetObject.
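The single-chunk cache can be illustrated independently of S3. In the sketch below, fetch_range is a hypothetical callable standing in for a ranged GetObject request, and aligning each fetch to a chunk boundary is an assumption of this example rather than documented behavior.

```python
from typing import Callable, Optional


class ChunkCachedReader:
    """Serve byte ranges from one cached chunk, refetching on a miss."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], chunk_nbytes: int) -> None:
        self._fetch_range = fetch_range  # hypothetical: (start, end_exclusive) -> bytes
        self._chunk_nbytes = chunk_nbytes
        self._cache: Optional[bytes] = None
        self._cache_start = 0

    def _in_cache(self, offset: int, size: int) -> bool:
        return (
            self._cache is not None
            and self._cache_start <= offset
            and offset + size <= self._cache_start + len(self._cache)
        )

    def read_bytes(self, offset: int, size: int) -> bytes:
        if not self._in_cache(offset, size):
            # Start at the enclosing chunk boundary and extend the range
            # if the request runs past the end of that chunk.
            start = (offset // self._chunk_nbytes) * self._chunk_nbytes
            end = max(start + self._chunk_nbytes, offset + size)
            self._cache = self._fetch_range(start, end)
            self._cache_start = start
        begin = offset - self._cache_start
        return self._cache[begin:begin + size]
```

Sequential access issues roughly one request per chunk, whereas a shuffled access pattern can trigger a fetch on almost every read, which is the trade-off the bin_chunk_nbytes setting controls.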

Initialization

_extract_from_cache(offset: int, size: int) bytes#

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read count elements of dtype starting at byte offset.

__del__() None#

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MultiStorageClientBinReader(
bin_path: str,
object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

Read .bin data via NVIDIA’s multi_storage_client package.

Initialization

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read count elements of dtype starting at byte offset.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.OBJECT_STORAGE_BIN_READERS: Dict[str, Type[nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader]]#

None

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset(
path_prefix: str,
multimodal: bool = False,
mmap: bool = True,
object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
)#

Bases: torch.utils.data.Dataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

Initialization

Initialize the IndexedDataset

Parameters:
  • path_prefix (str) – The index (.idx) and data (.bin) prefix. May be an object-storage URI (s3://bucket/key or msc://...) when object_storage_config is provided.

  • multimodal (bool) – Whether the dataset is multimodal. Defaults to False.

  • mmap (bool) – Whether to mmap the .bin files. Defaults to True. Must be False for object-storage paths.

  • object_storage_config (Optional[ObjectStorageConfig]) – When provided and path_prefix is an S3/MSC URI, the .idx file is downloaded to object_storage_config.path_to_idx_cache and the .bin file is streamed via chunked GETs.

initialize(
path_prefix: str,
multimodal: bool,
mmap: bool,
object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
) None#

__len__() int#

__getitem__(
idx: Union[int, numpy.integer, slice],
) Union[numpy.ndarray, Tuple[numpy.ndarray, Any], List[numpy.ndarray], Tuple[List[numpy.ndarray], numpy.ndarray]]#

get(
idx: int,
offset: int = 0,
length: Optional[int] = None,
) Union[numpy.ndarray, Tuple[numpy.ndarray, Any]]#

property sequence_lengths#

property document_indices#

static exists(path_prefix: str) bool#

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder(
bin_path: str,
dtype: Type[numpy.number] = numpy.int32,
multimodal: bool = False,
)#

Bases: object

Builder class for the IndexedDataset class

Parameters:
  • bin_path (str) – The path to the data (.bin) file

  • dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.

  • multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.

Initialization

add_item(tensor: torch.Tensor, mode: int = 0) None#

Add a single item to the dataset

Parameters:
  • tensor (torch.Tensor) – The item to add to the data file

  • mode (int, optional) – The mode for the item. Defaults to 0.

add_document(
tensor: torch.Tensor,
lengths: List[int],
modes: Optional[List[int]] = None,
) None#

Add an entire document to the dataset

Parameters:
  • tensor (torch.Tensor) – The document to add

  • lengths (List[int]) – The lengths of each item in the document

  • modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.

end_document() None#

Finalize the document, for use with IndexedDatasetBuilder.add_item
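How add_item and end_document interact is easiest to see through the bookkeeping they imply. The sketch below tracks only the index metadata, not the token bytes. It assumes, following upstream Megatron-Core, that document_indices starts with a leading 0 and that each end_document call records the current sequence count; DocumentBookkeeping is a hypothetical class used purely for illustration.

```python
from typing import List


class DocumentBookkeeping:
    """Track sequence lengths and document boundaries as a builder might."""

    def __init__(self) -> None:
        self.sequence_lengths: List[int] = []
        self.document_indices: List[int] = [0]  # assumed leading 0, as in upstream

    def add_item(self, tokens: List[int]) -> None:
        # Each item becomes one sequence in the index.
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # The boundary is the number of sequences written so far.
        self.document_indices.append(len(self.sequence_lengths))
```

With this layout, document i would span sequences document_indices[i] up to, but not including, document_indices[i + 1].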

add_index(path_prefix: str) None#

Add an entire IndexedDataset to the dataset

Parameters:

path_prefix (str) – The index (.idx) and data (.bin) prefix

finalize(idx_path: str) None#

Clean up and write the index (.idx) file

Parameters:

idx_path (str) – The path to the index file

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) str#

Return the index-file path for a Megatron dataset prefix.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) str#

Return the binary-data path for a Megatron dataset prefix.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) str#
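
Since the module overview states that a bare prefix, a .bin path, and an .idx path all open the same dataset, the three path helpers plausibly behave like the sketch below. The function names drop the module's prefixes and the exact suffix handling is an assumption.

```python
def normalize_prefix(path_prefix: str) -> str:
    """Strip a trailing .bin or .idx so all three spellings share one prefix."""
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def idx_path(path_prefix: str) -> str:
    """Return the index-file path for a dataset prefix."""
    return normalize_prefix(path_prefix) + ".idx"


def bin_path(path_prefix: str) -> str:
    """Return the binary-data path for a dataset prefix."""
    return normalize_prefix(path_prefix) + ".bin"
```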