`nemo_automodel.components.datasets.llm.megatron.indexed_dataset`#

A self-contained port of Megatron-Core’s indexed dataset loader.

Supports the original mmap and file-pointer readers for local *.bin / *.idx pairs, plus optional streaming readers for object storage (S3 and MSC).

All three calls below are equivalent for local data:

from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset

ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])

For object-storage data, pass an ObjectStorageConfig:

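from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import ObjectStorageConfig
cfg = ObjectStorageConfig(path_to_idx_cache="/tmp/idx_cache")
# mmap must be False for object-storage paths (see the mmap parameter of IndexedDataset)
ds = IndexedDataset("s3://bucket/path/shard_00_text_document", mmap=False, object_storage_config=cfg)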

Module Contents#

Classes#

`ObjectStorageConfig`	Configuration for reading `.bin`/`.idx` files from object storage.
`DType`	The NumPy data type Enum for reading the IndexedDataset indices
`_IndexWriter`	Object class to write the index (.idx) file
`_IndexReader`	Object class to read the index (.idx) file
`_BinReader`	Abstract class to read the data (.bin) file
`_MMapBinReader`	A _BinReader that memory maps the data (.bin) file
`_FileBinReader`	A _BinReader that reads from the data (.bin) file using a file pointer
`_S3BinReader`	Stream `.bin` data from S3 via chunked ranged `GetObject` calls.
`_MultiStorageClientBinReader`	Read `.bin` data via NVIDIA’s `multi_storage_client`.
`IndexedDataset`	A fast, on-disk dataset backed by Megatron-style index + binary files.
`IndexedDatasetBuilder`	Builder class for the IndexedDataset class

Functions#

`_is_object_storage_path`	Return `True` if `path` is an `s3://` or `msc://` URI.
`_parse_s3_path`	Split an `s3://bucket/key` URI into `(bucket, key)`.
`_get_index_cache_path`	Return the local cache path for `idx_path` under `path_to_idx_cache`.
`_cache_index_file`	Download `.idx` from object storage to `local_path`.
`get_idx_path`	Return the index-file path for a Megatron dataset prefix.
`get_bin_path`	Return the binary-data path for a Megatron dataset prefix.
`_normalize_prefix`

Data#

`logger`
`_S3_PREFIX`
`_MSC_PREFIX`
`_INDEX_HEADER`
`OBJECT_STORAGE_BIN_READERS`

API#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#: ‘getLogger(…)’

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3_PREFIX#: ‘s3://’

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MSC_PREFIX#: ‘msc://’

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig#

Configuration for reading .bin/.idx files from object storage.

Attributes:

path_to_idx_cache (str) – Local directory where the .idx file is cached on first use. Reused across ranks via a per-host directory layout.
bin_chunk_nbytes (int) – Size in bytes of each chunked range read against the .bin object. Defaults to 256 MiB. Larger values reduce request count but increase per-rank memory footprint.

path_to_idx_cache: str#: None

bin_chunk_nbytes: int#: None

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._is_object_storage_path(path: str) → bool#: Return True if path is an s3:// or msc:// URI.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._parse_s3_path(path: str) → Tuple[str, str]#: Split an s3://bucket/key URI into (bucket, key).
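The two path helpers above can be approximated with plain string operations. The following is a minimal sketch, assuming simple prefix checks and a split at the first slash after the scheme; the function names are illustrative stand-ins, not the module's private helpers.

```python
from typing import Tuple

S3_PREFIX = "s3://"    # same value as the documented _S3_PREFIX
MSC_PREFIX = "msc://"  # same value as the documented _MSC_PREFIX


def is_object_storage_path(path: str) -> bool:
    """Illustrative stand-in: True for s3:// or msc:// URIs."""
    return path.startswith((S3_PREFIX, MSC_PREFIX))


def parse_s3_path(path: str) -> Tuple[str, str]:
    """Illustrative stand-in: split s3://bucket/key into (bucket, key)."""
    if not path.startswith(S3_PREFIX):
        raise ValueError(f"not an S3 URI: {path!r}")
    # Everything before the first "/" is the bucket; the rest is the key.
    bucket, _, key = path[len(S3_PREFIX):].partition("/")
    return bucket, key
```

Keys that contain further slashes stay intact because only the first slash separates bucket from key.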

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._get_index_cache_path( idx_path: str, object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig, ) → str#: Return the local cache path for idx_path under path_to_idx_cache.
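One plausible way to derive a cache location is to drop the URI scheme and keep the bucket/key layout under the cache directory. The sketch below assumes exactly that mapping; the real helper may lay files out differently, and `index_cache_path` is a hypothetical name that takes the cache directory directly instead of an ObjectStorageConfig.

```python
import os


def index_cache_path(idx_path: str, path_to_idx_cache: str) -> str:
    """Hypothetical mapping from a remote .idx URI to a local cache file."""
    scheme, sep, remainder = idx_path.partition("://")
    if not sep:
        # No scheme present: treat the input as a plain relative path.
        remainder = idx_path.lstrip("/")
    return os.path.join(path_to_idx_cache, remainder)
```

Keeping the bucket and key in the local path means two shards with the same file name in different buckets do not collide in the cache.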

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._cache_index_file(remote_path: str, local_path: str) → None#

Download .idx from object storage to local_path.

Rank 0 performs the download and other ranks wait on a torch.distributed barrier. If the local file already exists this is a no-op.

Raises:

ImportError – If the relevant client library (boto3 for s3:// or multi_storage_client for msc://) is not installed.
ValueError – If remote_path is neither an s3:// nor an msc:// URI.
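The "rank 0 downloads, other ranks wait" pattern can be sketched without any distributed runtime by injecting the download and barrier steps as callables. This is an illustration under stated assumptions: `cache_index_file`, `fake_download`, and `fake_barrier` are hypothetical names, and in real use the barrier would be a `torch.distributed` barrier and the download a boto3 or multi_storage_client call.

```python
import os
import tempfile
from typing import Callable


def cache_index_file(
    remote_path: str,
    local_path: str,
    rank: int,
    download: Callable[[str, str], None],
    barrier: Callable[[], None],
) -> None:
    """Rank 0 downloads if the file is missing; every rank then synchronizes."""
    if rank == 0 and not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        download(remote_path, local_path)
    barrier()  # no rank proceeds until the file is in place


download_calls = []
barrier_calls = []


def fake_download(remote: str, local: str) -> None:
    download_calls.append(remote)
    with open(local, "wb") as f:
        f.write(b"idx-bytes")


def fake_barrier() -> None:
    barrier_calls.append(1)


with tempfile.TemporaryDirectory() as tmp:
    local = os.path.join(tmp, "cache", "shard.idx")
    for rank in (0, 1, 0):  # the second rank-0 call finds the file already cached
        cache_index_file("s3://bucket/shard.idx", local, rank, fake_download, fake_barrier)
    cached = os.path.exists(local)
```

In this sketch every rank always reaches the barrier, which avoids a hang when only some ranks already see the cached file.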

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#: b’MMIDIDX\x00\x00’
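A magic header like this is typically compared against the first bytes of a file before anything else is parsed. The sketch below assumes the header sits at the very start of the .idx file; `has_index_header` is a hypothetical helper, not part of the module.

```python
import io
from typing import BinaryIO

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same bytes as the documented _INDEX_HEADER


def has_index_header(stream: BinaryIO) -> bool:
    """Return True if the stream begins with the 9-byte magic header."""
    return stream.read(len(INDEX_HEADER)) == INDEX_HEADER


good = has_index_header(io.BytesIO(INDEX_HEADER + b"remaining index bytes"))
bad = has_index_header(io.BytesIO(b"not an index file"))
```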

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#

Bases: enum.Enum

The NumPy data type Enum for reading the IndexedDataset indices

Initialization

uint8#: 1

int8#: 2

int16#: 3

int32#: 4

int64#: 5

float64#: 6

float32#: 7

uint16#: 8

classmethod code_from_dtype(value: Type[numpy.number]) → int#

Get the code from the dtype

Parameters: value (Type[numpy.number]) – The dtype
Returns: The code
Return type: int

classmethod dtype_from_code(value: int) → Type[numpy.number]#

Get the dtype from the code

Parameters: value (int) – The code
Returns: The dtype
Return type: Type[numpy.number]

classmethod size(key: Union[int, Type[numpy.number]]) → int#

Get the size of the dtype/code in bytes

Parameters: key (Union[int, Type[numpy.number]]) – The dtype or code
Raises: ValueError – If the key is neither dtype nor integer code
Returns: The size of the dtype/code in bytes
Return type: int

classmethod optimal_dtype( cardinality: Optional[int], ) → Type[numpy.number]#

Get the dtype to use for an index of a certain cardinality

Parameters: cardinality (Optional[int]) – The number of elements to be indexed
Returns: The dtype to use for the index
Return type: Type[numpy.number]
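The code table and the size lookup can be illustrated with the standard library alone. The sketch below mirrors the documented name-to-code values but substitutes `struct` format characters for NumPy dtypes, so it is an analogue and not the module's implementation. The `cutoff` in `optimal_code` is an assumed value chosen to fit in an unsigned 16-bit integer; the module's actual threshold is not stated in this page.

```python
import struct
from enum import Enum
from typing import Optional


class DTypeCode(Enum):
    """Illustrative mirror of the documented name-to-code table."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8


# struct format characters stand in for NumPy dtypes (an assumption of this sketch).
STRUCT_FORMAT = {
    "uint8": "B", "int8": "b", "int16": "h", "int32": "i",
    "int64": "q", "float64": "d", "float32": "f", "uint16": "H",
}


def size(code: int) -> int:
    """Size in bytes of the element type identified by an integer code."""
    name = DTypeCode(code).name
    # "<" selects standard sizes, independent of the host platform.
    return struct.calcsize("<" + STRUCT_FORMAT[name])


def optimal_code(cardinality: Optional[int], cutoff: int = 65500) -> DTypeCode:
    """Pick the narrower type when every index fits; the cutoff is assumed."""
    if cardinality is not None and cardinality < cutoff:
        return DTypeCode.uint16
    return DTypeCode.int32
```

A narrower element type halves the size of the .bin file when the vocabulary is small enough, which is the motivation for choosing a dtype from the cardinality.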

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#

Bases: object

Object class to write the index (.idx) file

Parameters:

idx_path (str) – The path to the index file
dtype (Type[numpy.number]) – The dtype of the index file

Initialization

__enter__() → nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter#

Enter the context introduced by the ‘with’ keyword

Returns: The instance
Return type: _IndexWriter

__exit__( exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException], exc_tb: Optional[types.TracebackType], ) → Optional[bool]#

Exit the context introduced by the ‘with’ keyword

Parameters:

exc_type (Optional[Type[BaseException]]) – Exception type
exc_val (Optional[BaseException]) – Exception value
exc_tb (Optional[TracebackType]) – Exception traceback object

Returns:

Whether to silence the exception

Return type:

Optional[bool]

write( sequence_lengths: List[int], sequence_modes: Optional[List[int]], document_indices: List[int], ) → None#

Write the index (.idx) file

Parameters:

sequence_lengths (List[int]) – The length of each sequence
sequence_modes (Optional[List[int]]) – The mode of each sequence
document_indices (List[int]) – The sequence indices demarcating the end of each document

_sequence_pointers( sequence_lengths: List[int], ) → List[int]#

Build the sequence pointers per the sequence lengths and dtype size

Parameters: sequence_lengths (List[int]) – The length of each sequence
Returns: The pointer to the beginning of each sequence
Return type: List[int]
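Building pointers from lengths and an element size amounts to an exclusive prefix sum in bytes. A minimal sketch, assuming pointers are byte offsets into the .bin file and that the element size is passed in explicitly (the hypothetical `itemsize` argument replaces the writer's stored dtype):

```python
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Byte offset of the first element of each sequence."""
    pointers: List[int] = []
    position = 0
    for length in sequence_lengths:
        pointers.append(position)      # this sequence starts where the last one ended
        position += length * itemsize  # advance by the sequence's size in bytes
    return pointers
```

For example, three sequences of 3, 2, and 4 int32 tokens (4 bytes each) start at byte offsets 0, 12, and 20.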

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#

Object class to read the index (.idx) file

Parameters:

idx_path (str) – The path to the index file
multimodal (bool) – Whether the dataset is multimodal

Initialization

__del__() → None#: Clean up the object

__len__() → int#

Get the number of sequences in the dataset

Returns: The number of sequences in the dataset
Return type: int

__getitem__( idx: int, ) → Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]#

Return the pointer, length, and mode at the index

Parameters: idx (int) – The index into the dataset
Returns: The pointer, length and mode at the index
Return type: Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#

Bases: abc.ABC

Abstract class to read the data (.bin) file

abstractmethod read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that memory maps the data (.bin) file

Initialization

Initialize the _MMapBinReader

Parameters: bin_path (str) – The path to the data (.bin) file.

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

__del__() → None#: Clean up the object
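A memory-mapped reader can be sketched with the standard `mmap` module. This illustration returns an `array.array` instead of a `numpy.ndarray` and takes an `array` type code instead of a NumPy dtype; `MMapReader` is a hypothetical class, not the module's `_MMapBinReader`.

```python
import array
import mmap
import os
import tempfile


class MMapReader:
    """Illustrative mmap-backed reader over a binary file."""

    def __init__(self, bin_path: str) -> None:
        self._file = open(bin_path, "rb")
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def read(self, typecode: str, count: int, offset: int) -> array.array:
        out = array.array(typecode)
        nbytes = count * out.itemsize
        out.frombytes(self._mmap[offset:offset + nbytes])  # offset is in bytes
        return out

    def close(self) -> None:
        self._mmap.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(array.array("i", [10, 20, 30, 40, 50]).tobytes())
    reader = MMapReader(path)
    itemsize = array.array("i").itemsize
    window = reader.read("i", count=2, offset=2 * itemsize).tolist()
    reader.close()  # release the mapping before the directory is removed
```

Here `window` holds the third and fourth items of the file, selected purely by byte offset and count.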

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that reads from the data (.bin) file using a file pointer

Initialization

Initialize the _FileBinReader

Parameters: bin_path (str) – The path to the data (.bin) file.

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray
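A file-pointer reader seeks to the byte offset and reads exactly `count` items. The sketch below uses `array.array.fromfile` in place of NumPy and opens the file on every read, which is an assumption of this illustration; `FileReader` is a hypothetical class.

```python
import array
import os
import tempfile


class FileReader:
    """Illustrative seek-and-read reader over a binary file."""

    def __init__(self, bin_path: str) -> None:
        self._bin_path = bin_path

    def read(self, typecode: str, count: int, offset: int) -> array.array:
        out = array.array(typecode)
        with open(self._bin_path, "rb") as f:
            f.seek(offset)          # offset is in bytes
            out.fromfile(f, count)  # raises EOFError if fewer items are available
        return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(array.array("h", [7, 8, 9, 10]).tobytes())
    itemsize = array.array("h").itemsize
    tail = FileReader(path).read("h", count=3, offset=1 * itemsize).tolist()
```

Compared with a memory map, this approach holds no long-lived mapping, at the cost of a seek and read system call per access.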

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader( bin_path: str, object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig, )#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

Stream .bin data from S3 via chunked ranged GetObject calls.

A single in-memory chunk (sized by ObjectStorageConfig.bin_chunk_nbytes) is cached so consecutive reads within the same chunk avoid network round-trips. Random-access reads outside the current chunk trigger a new ranged GetObject.

Initialization

_extract_from_cache(offset: int, size: int) → bytes#

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#: Read count elements of dtype starting at byte offset.

__del__() → None#
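The single-chunk caching behavior can be sketched independently of S3 by injecting a ranged-fetch callable. This is an illustration under assumptions: `ChunkedReader` and `fetch` are hypothetical, `fetch(start, end)` stands in for a ranged GetObject, and aligning chunks to multiples of the chunk size is a design choice of this sketch that the real reader may not share.

```python
from typing import Callable, List, Optional, Tuple


class ChunkedReader:
    """Single-chunk cache over a ranged-fetch callable."""

    def __init__(self, fetch: Callable[[int, int], bytes], chunk_nbytes: int) -> None:
        self._fetch = fetch  # fetch(start, end) returns bytes in [start, end)
        self._chunk_nbytes = chunk_nbytes
        self._start: Optional[int] = None
        self._cache = b""

    def read_bytes(self, offset: int, size: int) -> bytes:
        end = offset + size
        hit = (
            self._start is not None
            and self._start <= offset
            and end <= self._start + len(self._cache)
        )
        if not hit:
            # Align to a chunk boundary; widen the range if the request spills over.
            start = (offset // self._chunk_nbytes) * self._chunk_nbytes
            stop = max(start + self._chunk_nbytes, end)
            self._cache = self._fetch(start, stop)
            self._start = start
        rel = offset - self._start
        return self._cache[rel:rel + size]


blob = bytes(range(256))
requests: List[Tuple[int, int]] = []


def fetch(start: int, end: int) -> bytes:
    requests.append((start, end))
    return blob[start:end]


reader = ChunkedReader(fetch, chunk_nbytes=64)
first = reader.read_bytes(10, 4)    # miss: fetches bytes [0, 64)
second = reader.read_bytes(20, 8)   # hit: served from the cached chunk
third = reader.read_bytes(100, 4)   # miss: fetches bytes [64, 128)
```

Sequential access touches the network once per chunk, while scattered access can refetch on every read, which is why the chunk size trades request count against memory.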

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MultiStorageClientBinReader( bin_path: str, object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig, )#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

Read .bin data via NVIDIA’s multi_storage_client.

Initialization

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#: Read count elements of dtype starting at byte offset.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.OBJECT_STORAGE_BIN_READERS: Dict[str, Type[nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader]]#: None

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset( path_prefix: str, multimodal: bool = False, mmap: bool = True, object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None, )#

Bases: torch.utils.data.Dataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

Initialization

Initialize the IndexedDataset

Parameters:

path_prefix (str) – The index (.idx) and data (.bin) prefix. May be an S3 or MSC URI (for example s3://bucket/key) when object_storage_config is provided.
multimodal (bool) – Whether the dataset is multimodal. Defaults to False.
mmap (bool) – Whether to mmap the .bin files. Defaults to True. Must be False for object-storage paths.
object_storage_config (Optional[ObjectStorageConfig]) – When provided and path_prefix is an S3/MSC URI, the .idx file is downloaded to object_storage_config.path_to_idx_cache and the .bin file is streamed via chunked GETs.

initialize( path_prefix: str, multimodal: bool, mmap: bool, object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None, ) → None#

__len__() → int#

__getitem__( idx: Union[int, numpy.integer, slice], ) → Union[numpy.ndarray, Tuple[numpy.ndarray, Any], List[numpy.ndarray], Tuple[List[numpy.ndarray], numpy.ndarray]]#

get( idx: int, offset: int = 0, length: Optional[int] = None, ) → Union[numpy.ndarray, Tuple[numpy.ndarray, Any]]#

property sequence_lengths#

property document_indices#

static exists(path_prefix: str) → bool#
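The `get(idx, offset, length)` signature suggests reading part of one sequence. The sketch below shows one reading of those semantics, assuming `offset` and `length` are counted in elements and that the index stores a byte pointer and an element count per sequence; `get_sequence` is a hypothetical stand-alone function operating on in-memory bytes, not the class method.

```python
import array
from typing import List, Optional


def get_sequence(
    data: bytes,
    pointers: List[int],
    lengths: List[int],
    typecode: str,
    idx: int,
    offset: int = 0,
    length: Optional[int] = None,
) -> array.array:
    """Read `length` elements of sequence `idx`, skipping `offset` elements."""
    itemsize = array.array(typecode).itemsize
    if length is None:
        length = lengths[idx] - offset  # default: read to the end of the sequence
    start = pointers[idx] + offset * itemsize
    out = array.array(typecode)
    out.frombytes(data[start:start + length * itemsize])
    return out


sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
itemsize = array.array("i").itemsize
data = b"".join(array.array("i", s).tobytes() for s in sequences)
lengths = [len(s) for s in sequences]
pointers = [0, 3 * itemsize, 5 * itemsize]

whole = get_sequence(data, pointers, lengths, "i", idx=2).tolist()
part = get_sequence(data, pointers, lengths, "i", idx=2, offset=1, length=2).tolist()
```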

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder( bin_path: str, dtype: Type[numpy.number] = numpy.int32, multimodal: bool = False, )#

Bases: object

Builder class for the IndexedDataset class

Parameters:

bin_path (str) – The path to the data (.bin) file
dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.
multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.

Initialization

add_item(tensor: torch.Tensor, mode: int = 0) → None#

Add a single item to the dataset

Parameters:

tensor (torch.Tensor) – The item to add to the data file
mode (int, optional) – The mode for the item. Defaults to 0.

add_document( tensor: torch.Tensor, lengths: List[int], modes: Optional[List[int]] = None, ) → None#

Add an entire document to the dataset

Parameters:

tensor (torch.Tensor) – The document to add
lengths (List[int]) – The lengths of each item in the document
modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.

end_document() → None#: Finalize the document, for use with IndexedDatasetBuilder.add_item

add_index(path_prefix: str) → None#

Add an entire IndexedDataset to the dataset

Parameters: path_prefix (str) – The index (.idx) and data (.bin) prefix

finalize(idx_path: str) → None#

Clean up and write the index (.idx) file

Parameters: idx_path (str) – The path to the index file
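The add_item / end_document workflow can be illustrated with an in-memory stand-in that tracks the same bookkeeping a builder needs: raw bytes, per-sequence lengths, and document boundaries. `TinyBuilder` is hypothetical, it takes plain lists instead of `torch.Tensor`, and starting `document_indices` at `[0]` is an assumed convention.

```python
import array
from typing import List


class TinyBuilder:
    """In-memory stand-in for the add_item / end_document bookkeeping."""

    def __init__(self, typecode: str = "i") -> None:
        self.typecode = typecode
        self.data = bytearray()                 # would be written to the .bin file
        self.sequence_lengths: List[int] = []   # would be written to the .idx file
        self.document_indices: List[int] = [0]  # assumed convention: starts at 0

    def add_item(self, tokens: List[int]) -> None:
        self.data += array.array(self.typecode, tokens).tobytes()
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # Record the sequence index at which the current document ends.
        self.document_indices.append(len(self.sequence_lengths))


builder = TinyBuilder()
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()  # first document holds two sequences
builder.add_item([6])
builder.end_document()  # second document holds one sequence
```

After these calls the lengths are `[3, 2, 1]` and the document boundaries are `[0, 2, 3]`, which is the information a finalize step would persist to the index file.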

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) → str#: Return the index-file path for a Megatron dataset prefix.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) → str#: Return the binary-data path for a Megatron dataset prefix.

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) → str#
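Since a bare prefix, a .bin path, and a .idx path are all accepted for local data, prefix normalization presumably strips a trailing extension. A minimal sketch under that assumption; the three function names are illustrative stand-ins for `_normalize_prefix`, `get_idx_path`, and `get_bin_path`, whose exact behavior is not spelled out in this page.

```python
def normalize_prefix(path_prefix: str) -> str:
    """Strip a trailing .bin or .idx so all three spellings share one prefix."""
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def idx_path_for(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".idx"


def bin_path_for(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".bin"
```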
