nemo_automodel.components.datasets.llm.megatron.indexed_dataset
A self-contained port of Megatron-Core’s indexed dataset loader.
Supports the original mmap and file-pointer readers for local *.bin / *.idx pairs, plus optional streaming readers for object storage (S3 and MSC).
All three calls below are equivalent for local data:
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset
ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])
For object-storage data, pass an ObjectStorageConfig:
cfg = ObjectStorageConfig(path_to_idx_cache="/tmp/idx_cache")
ds = IndexedDataset("s3://bucket/path/shard_00_text_document", object_storage_config=cfg)
API
Bases: enum.Enum
The NumPy data type Enum for reading the IndexedDataset indices
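To illustrate how an enum can tie a compact integer code stored in the index header to a NumPy dtype, here is a minimal sketch. The class name DTypeSketch, its helper methods, and the specific code numbers are assumptions made for illustration; check them against the actual source before relying on them.

```python
from enum import Enum

import numpy as np


class DTypeSketch(Enum):
    """Hypothetical stand-in: member names are NumPy dtype names, values are codes."""

    # Assumption: these code numbers are illustrative, not confirmed by this page.
    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8

    @classmethod
    def code_from_dtype(cls, dtype) -> int:
        """Return the integer code for a NumPy dtype (or anything np.dtype accepts)."""
        return cls[np.dtype(dtype).name].value

    @classmethod
    def dtype_from_code(cls, code: int) -> np.dtype:
        """Return the NumPy dtype that a stored integer code refers to."""
        return np.dtype(cls(code).name)


code = DTypeSketch.code_from_dtype(np.int32)
round_trip = DTypeSketch.dtype_from_code(code)
```

Storing a one-byte code rather than a dtype string keeps the header fixed-width, which is the usual motivation for this kind of mapping.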
Bases: Dataset
A fast, on-disk dataset backed by Megatron-style index + binary files.
Builder class for the IndexedDataset class
Parameters:
The path to the data (.bin) file
The dtype of the index file. Defaults to numpy.int32.
Whether the dataset is multimodal. Defaults to False.
Add an entire document to the dataset
Parameters:
The document to add
The lengths of each item in the document
The modes for each item in the document. Defaults to None.
Add an entire IndexedDataset to the dataset
Parameters:
The index (.idx) and data (.bin) prefix
Add a single item to the dataset
Parameters:
The item to add to the data file
The mode for the item. Defaults to 0.
Finalize the document, for use with IndexedDatasetBuilder.add_item
Clean up and write the index (.idx) file
Parameters:
The path to the index file
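The builder methods above boil down to bookkeeping over sequence lengths and document boundaries. The following in-memory sketch shows one plausible way that bookkeeping fits together. TinyBuilder is a hypothetical class, not the library's builder: it keeps tokens in a list instead of writing a .bin file, and the convention that the document index list starts at 0 is an assumption.

```python
from typing import List


class TinyBuilder:
    """Hypothetical in-memory stand-in for the builder's bookkeeping."""

    def __init__(self) -> None:
        self.tokens: List[int] = []            # stands in for the .bin payload
        self.sequence_lengths: List[int] = []  # one entry per added item
        self.document_indices: List[int] = [0]  # assumption: boundaries start at 0

    def add_item(self, item: List[int]) -> None:
        """Append a single sequence and record its length."""
        self.tokens.extend(item)
        self.sequence_lengths.append(len(item))

    def end_document(self) -> None:
        """Record the current sequence count as a document boundary."""
        self.document_indices.append(len(self.sequence_lengths))

    def add_document(self, document: List[int], lengths: List[int]) -> None:
        """Append a whole document whose items have the given lengths."""
        if sum(lengths) != len(document):
            raise ValueError("lengths must sum to the document size")
        self.tokens.extend(document)
        self.sequence_lengths.extend(lengths)
        self.end_document()


builder = TinyBuilder()
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()                              # document 0 = sequences 0 and 1
builder.add_document([6, 7, 8, 9], lengths=[1, 3])  # document 1 = sequences 2 and 3
```

In this sketch, add_item must be followed by an explicit end_document call, whereas add_document closes the document itself.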
Configuration for reading .bin/.idx files from object storage.
Abstract class to read the data (.bin) file
Read bytes into a numpy array.
Parameters:
Data-type of the returned array.
Number of items to read.
Start reading from this offset (in bytes).
Returns: numpy.ndarray
numpy.ndarray: An array with count items and data-type dtype constructed from
reading bytes from the data file starting at offset.
Bases: _BinReader
A _BinReader that reads from the data (.bin) file using a file pointer
Read bytes into a numpy array.
Parameters:
Data-type of the returned array.
Number of items to read.
Start reading from this offset (in bytes).
Returns: numpy.ndarray
numpy.ndarray: An array with count items and data-type dtype constructed from
reading bytes from the data file starting at offset.
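A file-pointer read with the dtype, count, and offset parameters described above can be sketched as a seek followed by a read into a preallocated array. The function name read_with_file_pointer and the toy file are assumptions for illustration; the actual reader may differ in details such as buffering.

```python
import os
import tempfile

import numpy as np


def read_with_file_pointer(bin_path: str, dtype, count: int, offset: int) -> np.ndarray:
    """Read `count` items of `dtype` starting at byte `offset` of `bin_path`."""
    out = np.empty(count, dtype=dtype)
    with open(bin_path, "rb") as f:
        f.seek(offset)   # offset is in bytes, not in items
        f.readinto(out)  # fill the array's buffer directly from the file
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    np.arange(10, dtype=np.int32).tofile(path)
    # Skip four int32 items (4 bytes each), then read three items.
    tokens = read_with_file_pointer(path, np.int32, count=3, offset=4 * 4)
```

Opening the file on each read avoids holding a handle or mapping open, at the cost of an extra open call per access.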
Object class to read the index (.idx) file
Parameters:
The path to the index file
Whether the dataset is multimodal
Clean up the object
Return the pointer, length, and mode at the index
Parameters:
The index into the dataset
Returns: Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]
Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]: The pointer, length and mode at the index
Get the number of sequences in the dataset
Returns: int
The number of sequences in the dataset
Object class to write the index (.idx) file
Parameters:
The path to the index file
The dtype of the index file
Enter the context introduced by the ‘with’ keyword
Returns: '_IndexWriter'
The instance
Exit the context introduced by the ‘with’ keyword
Parameters:
Exception type
Exception value
Exception traceback object
Returns: Optional[bool]
Optional[bool]: Whether to silence the exception
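The enter and exit methods above follow Python's standard context-manager protocol. As a minimal sketch of that protocol, the hypothetical ToyWriter below opens a file on entry and closes it on exit; it does not reproduce the real index file format.

```python
import os
import tempfile


class ToyWriter:
    """Hypothetical writer that owns a file handle for the span of a `with` block."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.handle = None

    def __enter__(self) -> "ToyWriter":
        self.handle = open(self.path, "wb")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.handle.close()
        return None  # a falsy return lets any exception propagate


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.idx")
    with ToyWriter(path) as writer:
        writer.handle.write(b"\x01\x02\x03")
    closed_after_exit = writer.handle.closed
    size_on_disk = os.path.getsize(path)
```

Returning a truthy value from the exit method would silence the exception instead, which is what the Optional[bool] return type allows for.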
Build the sequence pointers per the sequence lengths and dtype size
Parameters:
The length of each sequence
Returns: List[int]
List[int]: The pointer to the beginning of each sequence
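Building pointers from lengths and dtype size amounts to a running byte offset: each sequence starts where the previous one ended. A minimal sketch, assuming a hypothetical function name sequence_pointers and an explicit itemsize argument in place of a dtype:

```python
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Return the starting byte offset of each sequence in the data file."""
    pointers = []
    position = 0
    for length in sequence_lengths:
        pointers.append(position)      # this sequence starts here
        position += length * itemsize  # the next one starts after its bytes
    return pointers


# Three sequences of 3, 2, and 4 int32 tokens (4 bytes per token).
pointers = sequence_pointers([3, 2, 4], itemsize=4)
```

With these inputs the sequences start at bytes 0, 12, and 20.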
Write the index (.idx) file
Parameters:
The length of each sequence
The mode of each sequence
The sequence indices demarcating the end of each document
Bases: _BinReader
A _BinReader that memory maps the data (.bin) file
Clean up the object
Read bytes into a numpy array.
Parameters:
Data-type of the returned array.
Number of items to read.
Start reading from this offset (in bytes).
Returns: numpy.ndarray
numpy.ndarray: An array with count items and data-type dtype constructed from
reading bytes from the data file starting at offset.
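A memory-mapped read can be sketched with numpy.memmap and numpy.frombuffer: the file is mapped once as raw bytes, and each read reinterprets a slice of that mapping without copying. ToyMMapReader is a hypothetical class for illustration and omits the cleanup that the real reader performs.

```python
import os
import tempfile

import numpy as np


class ToyMMapReader:
    """Hypothetical reader that maps the data file once and slices it per read."""

    def __init__(self, bin_path: str) -> None:
        # Map the whole file as read-only bytes (the default memmap dtype is uint8).
        self._mmap = np.memmap(bin_path, mode="r", order="C")

    def read(self, dtype, count: int, offset: int) -> np.ndarray:
        """Return a read-only view of `count` items of `dtype` at byte `offset`."""
        return np.frombuffer(self._mmap, dtype=dtype, count=count, offset=offset)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    np.arange(10, dtype=np.int64).tofile(path)
    reader = ToyMMapReader(path)
    # Copy so the result does not keep the mapping alive after the demo.
    tokens = reader.read(np.int64, count=2, offset=8 * 3).copy()
    del reader
```

Because the returned array is a view onto the mapping, callers that need to modify it or outlive the reader should copy it first, as the demo does.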
Bases: _BinReader
Read .bin data via NVIDIA’s multi_storage_client.
Read count elements of dtype starting at byte offset.
Bases: _BinReader
Stream .bin data from S3 via chunked ranged GetObject calls.
A single in-memory chunk (sized by
ObjectStorageConfig.bin_chunk_nbytes) is cached so consecutive
reads within the same chunk avoid network round-trips. Random-access
reads outside the current chunk trigger a new ranged GetObject.
Read count elements of dtype starting at byte offset.
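The single-chunk caching behaviour described above can be sketched without any network client. In the sketch below, ChunkCachedReader and its fetch_range callable are hypothetical: fetch_range stands in for a ranged GetObject call, and aligning chunk starts to multiples of the chunk size is an assumption rather than confirmed behaviour.

```python
from typing import Callable, List, Tuple


class ChunkCachedReader:
    """Hypothetical reader that keeps one fetched byte range in memory."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], chunk_nbytes: int) -> None:
        self._fetch_range = fetch_range  # (start, end_exclusive) -> bytes
        self._chunk_nbytes = chunk_nbytes
        self._chunk = b""
        self._chunk_start = 0

    def read_bytes(self, offset: int, nbytes: int) -> bytes:
        """Return `nbytes` bytes at `offset`, fetching a new chunk only on a miss."""
        end = offset + nbytes
        cached_end = self._chunk_start + len(self._chunk)
        if not (self._chunk_start <= offset and end <= cached_end):
            # Miss: fetch the aligned chunk containing `offset` (assumed alignment),
            # extended if the request runs past the chunk boundary.
            start = (offset // self._chunk_nbytes) * self._chunk_nbytes
            self._chunk = self._fetch_range(start, max(start + self._chunk_nbytes, end))
            self._chunk_start = start
        lo = offset - self._chunk_start
        return self._chunk[lo:lo + nbytes]


blob = bytes(range(64))             # stands in for the remote .bin object
calls: List[Tuple[int, int]] = []   # records every simulated ranged request


def fake_fetch(start: int, end: int) -> bytes:
    calls.append((start, end))
    return blob[start:end]


reader = ChunkCachedReader(fake_fetch, chunk_nbytes=16)
first = reader.read_bytes(4, 4)    # miss: fetches bytes 0..16
second = reader.read_bytes(8, 4)   # hit: served from the cached chunk
third = reader.read_bytes(40, 4)   # miss: fetches bytes 32..48
```

Sequential access touches the network once per chunk, while a random-access pattern like the third read pays a full round-trip each time it leaves the cached range.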
Download .idx from object storage to local_path.
Rank 0 performs the download and other ranks wait on a torch.distributed
barrier. If the local file already exists this is a no-op.
Raises:
ImportError: If the relevant client library (boto3 for s3:// or multi_storage_client for msc://) is not installed.
ValueError: If remote_path is neither an s3:// nor an msc:// URI.
Return the local cache path for idx_path under path_to_idx_cache.
Return True if path is an s3:// or msc:// URI.
Split an s3://bucket/key URI into (bucket, key).
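Splitting an S3 URI into bucket and key is a matter of dropping the scheme and partitioning on the first slash. A minimal sketch, assuming a hypothetical function name split_s3_uri and a ValueError for non-S3 input (the library's own error handling is not documented here):

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into ('bucket', 'key')."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    return bucket, key


bucket, key = split_s3_uri("s3://bucket/path/shard_00_text_document.bin")
```

Only the first slash separates bucket from key, so slashes inside the key are preserved.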
Return the binary-data path for a Megatron dataset prefix.
Return the index-file path for a Megatron dataset prefix.
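Since a dataset is identified by a prefix shared by its .bin and .idx files, the two path helpers can be sketched as simple suffix appends. The function names below are hypothetical, and strip_known_suffix is an assumed helper showing one way a prefix, a .bin path, and a .idx path could all resolve to the same dataset.

```python
def bin_path_for(path_prefix: str) -> str:
    """Return the data-file path for a dataset prefix."""
    return path_prefix + ".bin"


def idx_path_for(path_prefix: str) -> str:
    """Return the index-file path for a dataset prefix."""
    return path_prefix + ".idx"


def strip_known_suffix(path: str) -> str:
    """Hypothetical helper: reduce a .bin or .idx path to its shared prefix."""
    for suffix in (".bin", ".idx"):
        if path.endswith(suffix):
            return path[: -len(suffix)]
    return path


prefix = strip_known_suffix("/path/to/shard_00_text_document.idx")
bin_path = bin_path_for(prefix)
idx_path = idx_path_for(prefix)
```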