nemo_automodel.components.datasets.llm.megatron.indexed_dataset#
A self-contained port of Megatron-Core’s indexed dataset loader.
Supports the original mmap and file-pointer readers for local *.bin / *.idx pairs, plus optional streaming readers for object storage (S3 and MSC).
All three calls below are equivalent for local data:
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset
ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])
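For object-storage data, pass an ObjectStorageConfig and disable mmap:
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import ObjectStorageConfig
cfg = ObjectStorageConfig(path_to_idx_cache="/tmp/idx_cache")
ds = IndexedDataset("s3://bucket/path/shard_00_text_document", mmap=False, object_storage_config=cfg)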
Module Contents#
Classes#
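ObjectStorageConfig | Configuration for reading .bin/.idx files from object storage.
DType | The NumPy data type Enum for reading the IndexedDataset indices
_IndexWriter | Object class to write the index (.idx) file
_IndexReader | Object class to read the index (.idx) file
_BinReader | Abstract class to read the data (.bin) file
_MMapBinReader | A _BinReader that memory maps the data (.bin) file
_FileBinReader | A _BinReader that reads from the data (.bin) file using a file pointer
_S3BinReader | Stream .bin data from S3 via chunked ranged GetObject calls.
_MultiStorageClientBinReader | Read .bin data via NVIDIA’s multi_storage_client.
IndexedDataset | A fast, on-disk dataset backed by Megatron-style index + binary files.
IndexedDatasetBuilder | Builder class for the IndexedDataset class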
Functions#
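_is_object_storage_path | Return True if path is an s3:// or msc:// URI.
_parse_s3_path | Split an s3://bucket/key URI into (bucket, key).
_get_index_cache_path | Return the local cache path for idx_path under path_to_idx_cache.
_cache_index_file | Download .idx from object storage to local_path.
get_idx_path | Return the index-file path for a Megatron dataset prefix.
get_bin_path | Return the binary-data path for a Megatron dataset prefix.
_normalize_prefix |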
Data#
API#
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#
‘getLogger(…)’
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3_PREFIX#
‘s3://’
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MSC_PREFIX#
‘msc://’
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig#
Configuration for reading .bin/.idx files from object storage.
path_to_idx_cache – Local directory where the .idx file is cached on first use. Re-used across ranks via a per-host directory layout.
bin_chunk_nbytes – Size in bytes of each chunked range read against the .bin object. Defaults to 256 MiB. Larger values reduce request count but increase per-rank memory footprint.
- path_to_idx_cache: str#
None
- bin_chunk_nbytes: int#
None
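As a rough illustration of the two attributes, the sketch below uses a hypothetical stand-in dataclass (`ObjectStorageConfigSketch` is not the library class). It assumes only what the description states: a cache directory and a chunk size that defaults to 256 MiB.

```python
from dataclasses import dataclass


@dataclass
class ObjectStorageConfigSketch:
    """Hypothetical stand-in mirroring the two documented attributes."""

    path_to_idx_cache: str
    bin_chunk_nbytes: int = 256 * 1024 * 1024  # 256 MiB, the documented default


# Default chunk size.
cfg = ObjectStorageConfigSketch(path_to_idx_cache="/tmp/idx_cache")

# Smaller chunks: more range requests, lower per-rank memory footprint.
small = ObjectStorageConfigSketch("/tmp/idx_cache", bin_chunk_nbytes=64 * 1024 * 1024)
```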
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._is_object_storage_path(path: str) bool#
Return True if path is an s3:// or msc:// URI.
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._parse_s3_path(path: str) Tuple[str, str]#
Split an s3://bucket/key URI into (bucket, key).
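The two path helpers above could behave roughly as follows. This is a minimal sketch with hypothetical names (no leading underscore), and it assumes the split happens at the first "/" after the bucket; the actual implementation may differ, for example in how it reports a malformed URI.

```python
from typing import Tuple

S3_PREFIX = "s3://"
MSC_PREFIX = "msc://"


def is_object_storage_path(path: str) -> bool:
    # str.startswith accepts a tuple of candidate prefixes.
    return path.startswith((S3_PREFIX, MSC_PREFIX))


def parse_s3_path(path: str) -> Tuple[str, str]:
    if not path.startswith(S3_PREFIX):
        raise ValueError(f"not an s3:// URI: {path!r}")
    # Everything up to the first "/" is the bucket, the rest is the key.
    bucket, _, key = path[len(S3_PREFIX):].partition("/")
    return bucket, key
```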
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._get_index_cache_path(
- idx_path: str,
- object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
Return the local cache path for idx_path under path_to_idx_cache.
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._cache_index_file(remote_path: str, local_path: str) None#
Download .idx from object storage to local_path.
Rank 0 performs the download and other ranks wait on a torch.distributed barrier. If the local file already exists this is a no-op.
- Raises:
ImportError – If the relevant client library (boto3 for s3:// or multi_storage_client for msc://) is not installed.
ValueError – If remote_path is neither an s3:// nor an msc:// URI.
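The "rank 0 downloads, everyone else waits" pattern described above can be sketched without torch by injecting the rank, the barrier, and the download step as plain arguments. All names here are hypothetical; in a real job `barrier` would presumably be `torch.distributed.barrier` and `download` would wrap the boto3 or multi_storage_client call.

```python
import os
from typing import Callable


def cache_index_file_sketch(
    local_path: str,
    download: Callable[[str], None],
    rank: int = 0,
    barrier: Callable[[], None] = lambda: None,
) -> None:
    # Only rank 0 touches the network, and only when the file is missing.
    if rank == 0 and not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        download(local_path)
    # Every rank reaches the barrier, so no rank reads a half-written file.
    barrier()
```

Calling the barrier on every rank, even when the file already exists, is a design choice of this sketch: it keeps the collective call balanced across ranks.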
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#
b’MMIDIDX\x00\x00’
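A reader can use this 9-byte magic value to reject files that are not Megatron-style indices before parsing anything else. The helper below is a hypothetical sketch of such a check, not the library's own validation code.

```python
INDEX_HEADER = b"MMIDIDX\x00\x00"  # the documented magic value (9 bytes)


def has_index_header(idx_path: str) -> bool:
    # Compare the first bytes of the file against the magic value.
    with open(idx_path, "rb") as f:
        return f.read(len(INDEX_HEADER)) == INDEX_HEADER
```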
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#
Bases:
enum.Enum
The NumPy data type Enum for reading the IndexedDataset indices
Initialization
- uint8#
1
- int8#
2
- int16#
3
- int32#
4
- int64#
5
- float64#
6
- float32#
7
- uint16#
8
- classmethod code_from_dtype(value: Type[numpy.number]) int#
Get the code from the dtype
- Parameters:
value (Type[numpy.number]) – The dtype
- Returns:
The code
- Return type:
int
- classmethod dtype_from_code(value: int) Type[numpy.number]#
Get the dtype from the code
- Parameters:
value (int) – The code
- Returns:
The dtype
- Return type:
Type[numpy.number]
- classmethod size(key: Union[int, Type[numpy.number]]) int#
Get the size of the dtype/code in bytes
- Parameters:
key (Union[int, Type[numpy.number]]) – The dtype or code
- Raises:
ValueError – If the key is neither dtype nor integer code
- Returns:
The size of the dtype/code in bytes
- Return type:
int
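To make the code-to-size relationship concrete, the sketch below mirrors the eight documented codes in a hypothetical stdlib-only enum (`DTypeCode`) and looks up the byte width of each. The real class maps codes to NumPy dtypes; the widths used here are the standard sizes of those types.

```python
from enum import Enum


class DTypeCode(Enum):
    """Hypothetical mirror of the documented integer codes."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8


# Byte width of each element type.
_ITEM_SIZE = {
    "uint8": 1, "int8": 1, "int16": 2, "int32": 4,
    "int64": 8, "float64": 8, "float32": 4, "uint16": 2,
}


def size_from_code(code: int) -> int:
    # DTypeCode(code) raises ValueError for an unknown code.
    return _ITEM_SIZE[DTypeCode(code).name]
```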
- classmethod optimal_dtype(
- cardinality: Optional[int],
Get the dtype to use for an index of a certain cardinality
- Parameters:
cardinality (Optional[int]) – The number of elements to be indexed
- Returns:
The dtype to use for the index
- Return type:
Type[numpy.number]
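The selection rule is not spelled out here. As an assumption, the sketch below follows the rule used in upstream Megatron-Core (a 16-bit unsigned type when the cardinality is known and below 65500, otherwise 32-bit); this port may use a different threshold. It returns dtype names as strings to stay dependency-free.

```python
from typing import Optional


def optimal_dtype_name(cardinality: Optional[int]) -> str:
    # Threshold assumed from upstream Megatron-Core; this port may differ.
    if cardinality is not None and cardinality < 65500:
        return "uint16"
    return "int32"
```

With a vocabulary of about 50k tokens this would pick the 2-byte type, halving the size of the .bin file relative to int32.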
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#
Bases:
object
Object class to write the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
dtype (Type[numpy.number]) – The dtype of the index file
Initialization
- __enter__() nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter#
Enter the context introduced by the ‘with’ keyword
- Returns:
The instance
- Return type:
_IndexWriter
- __exit__(
- exc_type: Optional[Type[BaseException]],
- exc_val: Optional[BaseException],
- exc_tb: Optional[types.TracebackType],
Exit the context introduced by the ‘with’ keyword
- Parameters:
exc_type (Optional[Type[BaseException]]) – Exception type
exc_val (Optional[BaseException]) – Exception value
exc_tb (Optional[TracebackType]) – Exception traceback object
- Returns:
Whether to silence the exception
- Return type:
Optional[bool]
- write(
- sequence_lengths: List[int],
- sequence_modes: Optional[List[int]],
- document_indices: List[int],
Write the index (.idx) file
- Parameters:
sequence_lengths (List[int]) – The length of each sequence
sequence_modes (Optional[List[int]]) – The mode of each sequence
document_indices (List[int]) – The sequence indices demarcating the end of each document
- _sequence_pointers(
- sequence_lengths: List[int],
Build the sequence pointers per the sequence lengths and dtype size
- Parameters:
sequence_lengths (List[int]) – The length of each sequence
- Returns:
The pointer to the beginning of each sequence
- Return type:
List[int]
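The pointer computation amounts to a running sum of byte lengths. The sketch below is a hypothetical standalone version that takes the element size as an explicit `itemsize` argument, whereas the method presumably derives it from the writer's dtype.

```python
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    pointers: List[int] = []
    offset = 0  # byte offset of the next sequence in the .bin file
    for length in sequence_lengths:
        pointers.append(offset)
        offset += length * itemsize
    return pointers
```

For example, three int32 sequences of lengths 3, 2 and 4 start at byte offsets 0, 12 and 20.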
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#
Object class to read the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
multimodal (bool) – Whether the dataset is multimodal
Initialization
- __del__() None#
Clean up the object
- __len__() int#
Get the number of sequences in the dataset
- Returns:
The number of sequences in the dataset
- Return type:
int
- __getitem__(
- idx: int,
Return the pointer, length, and mode at the index
- Parameters:
idx (int) – The index into the dataset
- Returns:
The pointer, length and mode at the index
- Return type:
Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#
Bases:
abc.ABC
Abstract class to read the data (.bin) file
- abstractmethod read(
- dtype: Type[numpy.number],
- count: int,
- offset: int,
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
A _BinReader that memory maps the data (.bin) file
Initialization
Initialize the _MMapBinReader
- Parameters:
bin_path (str) – The path to the data (.bin) file.
- read(
- dtype: Type[numpy.number],
- count: int,
- offset: int,
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
- __del__() None#
Clean up the object
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
A _BinReader that reads from the data (.bin) file using a file pointer
Initialization
Initialize the _FileBinReader
- Parameters:
bin_path (str) – The path to the data (.bin) file.
- read(
- dtype: Type[numpy.number],
- count: int,
- offset: int,
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
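The difference between the memory-mapped and file-pointer readers can be illustrated with the standard library alone. The sketch below fixes the element type to little-endian int32 and returns tuples instead of NumPy arrays. Both function names are hypothetical. Unlike the real readers, which presumably keep the file or mapping open for the lifetime of the object, these open and close it on every call.

```python
import mmap
import struct
from typing import Tuple

INT32_SIZE = 4  # bytes per element


def mmap_read_int32(bin_path: str, count: int, offset: int) -> Tuple[int, ...]:
    # Map the whole file read-only and decode directly from the mapping.
    with open(bin_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from(f"<{count}i", mm, offset)


def file_read_int32(bin_path: str, count: int, offset: int) -> Tuple[int, ...]:
    # Seek to the byte offset and read exactly count * itemsize bytes.
    with open(bin_path, "rb") as f:
        f.seek(offset)
        data = f.read(count * INT32_SIZE)
    return struct.unpack(f"<{count}i", data)
```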
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._S3BinReader(
- bin_path: str,
- object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
Stream .bin data from S3 via chunked ranged GetObject calls.
A single in-memory chunk (sized by ObjectStorageConfig.bin_chunk_nbytes) is cached so consecutive reads within the same chunk avoid network round-trips. Random-access reads outside the current chunk trigger a new ranged GetObject.
Initialization
- _extract_from_cache(offset: int, size: int) bytes#
- read(
- dtype: Type[numpy.number],
- count: int,
- offset: int,
Read count elements of dtype starting at byte offset.
- __del__() None#
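The single-chunk cache described for this reader can be sketched independently of S3 by injecting the range fetch as a callable. Everything below is hypothetical (`ChunkCachedReader`, `fetch_range`, `fetch_count`). It also assumes that a new chunk starts at the requested offset, which the real reader may or may not do.

```python
from typing import Callable


class ChunkCachedReader:
    """Keeps one chunk in memory; fetch_range(start, end) returns bytes [start, end)."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], total_size: int, chunk_nbytes: int):
        self._fetch = fetch_range
        self._total = total_size
        self._chunk = chunk_nbytes
        self._start = 0      # byte offset of the cached chunk
        self._cache = b""    # cached bytes (empty until the first read)
        self.fetch_count = 0

    def read_bytes(self, offset: int, size: int) -> bytes:
        end = offset + size
        cached_end = self._start + len(self._cache)
        if not (self._start <= offset and end <= cached_end):
            # Cache miss: fetch a new chunk that covers the whole request.
            fetch_end = min(offset + max(self._chunk, size), self._total)
            self._cache = self._fetch(offset, fetch_end)
            self._start = offset
            self.fetch_count += 1
        lo = offset - self._start
        return self._cache[lo:lo + size]
```

Sequential reads inside one chunk cost a single fetch, while a jump outside the chunk costs another, which is why larger chunks trade memory for fewer requests.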
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MultiStorageClientBinReader(
- bin_path: str,
- object_storage_config: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig,
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
Read .bin data via NVIDIA’s multi_storage_client.
Initialization
- read(
- dtype: Type[numpy.number],
- count: int,
- offset: int,
Read count elements of dtype starting at byte offset.
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.OBJECT_STORAGE_BIN_READERS: Dict[str, Type[nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader]]#
None
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset(
- path_prefix: str,
- multimodal: bool = False,
- mmap: bool = True,
- object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
Bases:
torch.utils.data.Dataset
A fast, on-disk dataset backed by Megatron-style index + binary files.
Initialization
Initialize the IndexedDataset
- Parameters:
path_prefix (str) – The index (.idx) and data (.bin) prefix. May be an S3 URI (s3://bucket/key) when object_storage_config is provided.
multimodal (bool) – Whether the dataset is multimodal. Defaults to False.
mmap (bool) – Whether to mmap the .bin files. Defaults to True. Must be False for object-storage paths.
object_storage_config (Optional[ObjectStorageConfig]) – When provided and path_prefix is an S3/MSC URI, the .idx file is downloaded to object_storage_config.path_to_idx_cache and the .bin file is streamed via chunked GETs.
- initialize(
- path_prefix: str,
- multimodal: bool,
- mmap: bool,
- object_storage_config: Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
- __len__() int#
- __getitem__(
- idx: Union[int, numpy.integer, slice],
- get(
- idx: int,
- offset: int = 0,
- length: Optional[int] = None,
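The get parameters are not described on this page. The sketch below assumes, as in upstream Megatron-Core, that offset and length are counted in elements within one sequence and that a missing length means "to the end of the sequence". The hypothetical helper shows how they would translate into the byte offset and element count passed to a bin reader.

```python
from typing import Optional, Tuple


def partial_read_args(
    pointer: int,
    sequence_length: int,
    itemsize: int,
    offset: int = 0,
    length: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (byte_offset, count) for reading part of one sequence."""
    if length is None:
        length = sequence_length - offset
    if offset < 0 or length < 0 or offset + length > sequence_length:
        raise ValueError("requested range falls outside the sequence")
    # Skip `offset` elements past the start of the sequence.
    return pointer + offset * itemsize, length
```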
- property sequence_lengths#
- property document_indices#
- static exists(path_prefix: str) bool#
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder(
- bin_path: str,
- dtype: Type[numpy.number] = numpy.int32,
- multimodal: bool = False,
Bases:
object
Builder class for the IndexedDataset class
- Parameters:
bin_path (str) – The path to the data (.bin) file
dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.
multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.
Initialization
- add_item(tensor: torch.Tensor, mode: int = 0) None#
Add a single item to the dataset
- Parameters:
tensor (torch.Tensor) – The item to add to the data file
mode (int, optional) – The mode for the item. Defaults to 0.
- add_document(
- tensor: torch.Tensor,
- lengths: List[int],
- modes: Optional[List[int]] = None,
Add an entire document to the dataset
- Parameters:
tensor (torch.Tensor) – The document to add
lengths (List[int]) – The lengths of each item in the document
modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.
- end_document() None#
Finalize the document, for use with IndexedDatasetBuilder.add_item
- add_index(path_prefix: str) None#
Add an entire IndexedDataset to the dataset
- Parameters:
path_prefix (str) – The index (.idx) and data (.bin) prefix
- finalize(idx_path: str) None#
Clean up and write the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
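The bookkeeping behind add_item and end_document can be pictured with a toy builder that writes int32 tokens to any binary stream. `ToyBuilder` is hypothetical and uses plain lists instead of tensors. Starting document_indices at [0] is an assumption taken from upstream Megatron-Core.

```python
import io
import struct
from typing import BinaryIO, List


class ToyBuilder:
    def __init__(self, stream: BinaryIO):
        self._stream = stream
        self.sequence_lengths: List[int] = []
        self.document_indices: List[int] = [0]  # assumed initial sentinel

    def add_item(self, tokens: List[int]) -> None:
        # Append the raw little-endian int32 data and record the sequence length.
        self._stream.write(struct.pack(f"<{len(tokens)}i", *tokens))
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # A document ends at the current sequence count.
        self.document_indices.append(len(self.sequence_lengths))


buf = io.BytesIO()
builder = ToyBuilder(buf)
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()          # first document: sequences 0 and 1
builder.add_item([6, 7, 8, 9])
builder.end_document()          # second document: sequence 2
```

The two lists are what a finalize step would hand to the index writer, alongside the dtype.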
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) str#
Return the index-file path for a Megatron dataset prefix.
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) str#
Return the binary-data path for a Megatron dataset prefix.
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) str#
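No description is given for _normalize_prefix. Since the module overview states that a bare prefix, a .bin path and an .idx path are equivalent, a plausible reading is that it strips a trailing extension. The sketch below shows that behaviour with hypothetical names and is an assumption, not the library code.

```python
def normalize_prefix(path_prefix: str) -> str:
    # Drop a trailing .bin or .idx so all three spellings share one prefix.
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def idx_path_for(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".idx"


def bin_path_for(path_prefix: str) -> str:
    return normalize_prefix(path_prefix) + ".bin"
```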