`nemo_automodel.components.datasets.llm.megatron.indexed_dataset`#

A self-contained port of Megatron-Core’s indexed dataset loader.

Supports the original memory-mapped (mmap) and file-pointer readers for *.bin / *.idx file pairs. Both files must reside on a local filesystem.

The path may be given as the bare prefix or with a .bin or .idx extension. All three calls below are equivalent:

from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset

ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])

Module Contents#

Classes#

`DType`	The NumPy data type Enum for reading the IndexedDataset indices
`_IndexWriter`	Object class to write the index (.idx) file
`_IndexReader`	Object class to read the index (.idx) file
`_BinReader`	Abstract class to read the data (.bin) file
`_MMapBinReader`	A _BinReader that memory maps the data (.bin) file
`_FileBinReader`	A _BinReader that reads from the data (.bin) file using a file pointer
`IndexedDataset`	A fast, on-disk dataset backed by Megatron-style index + binary files.
`IndexedDatasetBuilder`	Builder class for the IndexedDataset class

Functions#

`get_idx_path`
`get_bin_path`
`_normalize_prefix`

Data#

`logger`
`_INDEX_HEADER`

API#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#: 'getLogger(…)'

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#: b'MMIDIDX\x00\x00'
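
As a rough illustration of what `_INDEX_HEADER` is for, the sketch below checks whether a file begins with these magic bytes. It assumes, as in upstream Megatron-Core, that an index file starts with the header; the helper `looks_like_megatron_index` is hypothetical and not part of this module.

```python
import os
import tempfile

# Value copied from the `_INDEX_HEADER` entry above (9 bytes).
INDEX_HEADER = b"MMIDIDX\x00\x00"


def looks_like_megatron_index(idx_path: str) -> bool:
    """Return True if the file starts with the expected magic bytes.

    Hypothetical helper for illustration; not part of the module.
    """
    with open(idx_path, "rb") as f:
        return f.read(len(INDEX_HEADER)) == INDEX_HEADER


# Demo on two throwaway files: one with the header, one without.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.idx")
    bad = os.path.join(tmp, "bad.idx")
    with open(good, "wb") as f:
        f.write(INDEX_HEADER + b"\x01")
    with open(bad, "wb") as f:
        f.write(b"not an index")
    good_ok = looks_like_megatron_index(good)
    bad_ok = looks_like_megatron_index(bad)
```

A check like this only confirms the magic bytes; it says nothing about whether the rest of the file is well formed.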

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#

Bases: enum.Enum

The NumPy data type Enum for reading the IndexedDataset indices

Initialization

uint8#: 1

int8#: 2

int16#: 3

int32#: 4

int64#: 5

float64#: 6

float32#: 7

uint16#: 8

classmethod code_from_dtype(value: Type[numpy.number]) → int#

Get the code from the dtype

Parameters: value (Type[numpy.number]) – The dtype
Returns: The code
Return type: int

classmethod dtype_from_code(value: int) → Type[numpy.number]#

Get the dtype from the code

Parameters: value (int) – The code
Returns: The dtype
Return type: Type[numpy.number]

classmethod size(key: Union[int, Type[numpy.number]]) → int#

Get the size of the dtype/code in bytes

Parameters: key (Union[int, Type[numpy.number]]) – The dtype or code
Raises: ValueError – If the key is neither dtype nor integer code
Returns: The size of the dtype/code in bytes
Return type: int

classmethod optimal_dtype( cardinality: Optional[int], ) → Type[numpy.number]#

Get the dtype to use for an index of a certain cardinality

Parameters: cardinality (Optional[int]) – The number of elements to be indexed
Returns: The dtype to use for the index
Return type: Type[numpy.number]
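
The mapping between codes and NumPy dtypes can be sketched with a stand-in enum that reuses the codes listed above. `DTypeSketch` is an illustrative re-implementation, not the module's class. The 65500 cutoff in `optimal_dtype` is an assumption modeled on upstream Megatron-Core and is not stated on this page.

```python
from enum import Enum
from typing import Optional, Type, Union

import numpy as np


class DTypeSketch(Enum):
    """Illustrative stand-in that reuses the codes documented above."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8

    @classmethod
    def code_from_dtype(cls, value: Type[np.number]) -> int:
        # Look the member up by the NumPy type's name, e.g. "int32".
        return cls[value.__name__].value

    @classmethod
    def dtype_from_code(cls, value: int) -> Type[np.number]:
        # Map the integer code back to the NumPy scalar type.
        return getattr(np, cls(value).name)

    @classmethod
    def size(cls, key: Union[int, Type[np.number]]) -> int:
        if isinstance(key, int):
            return np.dtype(cls.dtype_from_code(key)).itemsize
        if isinstance(key, type) and issubclass(key, np.number):
            return np.dtype(key).itemsize
        raise ValueError(f"{key!r} is neither a dtype nor an integer code")

    @classmethod
    def optimal_dtype(cls, cardinality: Optional[int]) -> Type[np.number]:
        # ASSUMPTION: the 65500 cutoff mirrors upstream Megatron-Core.
        if cardinality is not None and cardinality < 65500:
            return np.uint16
        return np.int32
```

With this sketch, a vocabulary of 50,000 tokens would be stored as `uint16` (2 bytes per token), while an unknown or larger cardinality falls back to `int32`.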

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#

Bases: object

Object class to write the index (.idx) file

Parameters:

idx_path (str) – The path to the index file
dtype (Type[numpy.number]) – The dtype of the index file

Initialization

__enter__() → nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter#

Enter the context introduced by the ‘with’ keyword

Returns: The instance
Return type: _IndexWriter

__exit__( exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException], exc_tb: Optional[types.TracebackType], ) → Optional[bool]#

Exit the context introduced by the ‘with’ keyword

Parameters:

exc_type (Optional[Type[BaseException]]) – Exception type
exc_val (Optional[BaseException]) – Exception value
exc_tb (Optional[TracebackType]) – Exception traceback object

Returns:

Whether to silence the exception

Return type:

Optional[bool]

write( sequence_lengths: List[int], sequence_modes: Optional[List[int]], document_indices: List[int], ) → None#

Write the index (.idx) file

Parameters:

sequence_lengths (List[int]) – The length of each sequence
sequence_modes (Optional[List[int]]) – The mode of each sequence
document_indices (List[int]) – The sequence indices demarcating the end of each document

_sequence_pointers( sequence_lengths: List[int], ) → List[int]#

Build the sequence pointers per the sequence lengths and dtype size

Parameters: sequence_lengths (List[int]) – The length of each sequence
Returns: The pointer to the beginning of each sequence
Return type: List[int]
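
A minimal sketch of the pointer computation follows, assuming each pointer is the byte offset at which a sequence starts in the data file (the running sum of the preceding lengths times the dtype size). The function name `sequence_pointers` is a hypothetical stand-in for the private method.

```python
from typing import List, Type

import numpy as np


def sequence_pointers(
    sequence_lengths: List[int], dtype: Type[np.number] = np.int32
) -> List[int]:
    """Return the starting byte offset of each sequence (illustrative)."""
    itemsize = np.dtype(dtype).itemsize
    pointers = []
    offset = 0
    for length in sequence_lengths:
        pointers.append(offset)      # this sequence starts here
        offset += length * itemsize  # the next one starts after its bytes
    return pointers


int32_pointers = sequence_pointers([3, 2, 4], np.int32)
uint16_pointers = sequence_pointers([3, 2, 4], np.uint16)
```

The same lengths give different pointers for different dtypes, which is why the writer needs the dtype as well as the lengths.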

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#

Object class to read the index (.idx) file

Parameters:

idx_path (str) – The path to the index file
multimodal (bool) – Whether the dataset is multimodal

Initialization

__del__() → None#

Clean up the object

__len__() → int#

Get the number of sequences in the dataset

Returns: The number of sequences in the dataset
Return type: int

__getitem__( idx: int, ) → Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]#

Return the pointer, length, and mode at the index

Parameters: idx (int) – The index into the dataset
Returns: The pointer, length and mode at the index
Return type: Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#

Bases: abc.ABC

Abstract class to read the data (.bin) file

abstractmethod read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that memory maps the data (.bin) file

Initialization

Initialize the _MMapBinReader

Parameters: bin_path (str) – The path to the data (.bin) file.

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

__del__() → None#

Clean up the object
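
The memory-mapped read strategy can be sketched with plain NumPy as below. This is an illustration of the technique under the `read(dtype, count, offset)` contract described above, not the class's actual code; `mmap_read` and the demo file name are hypothetical.

```python
import os
import tempfile

import numpy as np


def mmap_read(bin_path: str, dtype, count: int, offset: int) -> np.ndarray:
    """Read `count` items of `dtype` starting at byte `offset` via mmap."""
    # Map the whole file as raw bytes; nothing is read until it is accessed.
    mapped = np.memmap(bin_path, mode="r", order="C")
    # Reinterpret a window of the mapped bytes as the requested dtype.
    return np.frombuffer(memoryview(mapped), dtype=dtype, count=count, offset=offset)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    np.arange(10, dtype=np.int32).tofile(path)
    # Skip the first 4 items (4 bytes each) and read the next 3.
    mmap_tokens = mmap_read(path, np.int32, count=3, offset=4 * 4).tolist()
```

Because the array is a view onto the mapping, no copy is made until the caller converts or modifies the data.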

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that reads from the data (.bin) file using a file pointer

Initialization

Initialize the _FileBinReader

Parameters: bin_path (str) – The path to the data (.bin) file.

read( dtype: Type[numpy.number], count: int, offset: int, ) → numpy.ndarray#

Read bytes into a numpy array.

Parameters:

dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray
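
For comparison, a file-pointer read can be sketched as a seek followed by a read into a preallocated array. As with the other examples, `file_read` is a hypothetical helper that illustrates the technique and is not the class's implementation.

```python
import os
import tempfile

import numpy as np


def file_read(bin_path: str, dtype, count: int, offset: int) -> np.ndarray:
    """Read `count` items of `dtype` starting at byte `offset` via a file pointer."""
    out = np.empty(count, dtype=dtype)
    # Unbuffered binary read straight into the array's memory.
    with open(bin_path, "rb", buffering=0) as f:
        f.seek(offset)
        f.readinto(out)
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    np.arange(100, 110, dtype=np.uint16).tofile(path)
    # Skip the first 2 items (2 bytes each) and read the next 4.
    file_tokens = file_read(path, np.uint16, count=4, offset=2 * 2).tolist()
```

Unlike the memory-mapped variant, this always copies the requested bytes into a fresh array and opens the file on every call.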

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset( path_prefix: str, multimodal: bool = False, mmap: bool = True, )#

Bases: torch.utils.data.Dataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

Initialization

Initialize the IndexedDataset

Parameters:

path_prefix (str) – The index (.idx) and data (.bin) prefix
multimodal (bool) – Whether the dataset is multimodal. Defaults to False.
mmap (bool) – Whether to mmap the .bin files. Defaults to True.

initialize(path_prefix: str, multimodal: bool, mmap: bool) → None#

__len__() → int#

__getitem__( idx: Union[int, numpy.integer, slice], ) → Union[numpy.ndarray, Tuple[numpy.ndarray, Any], List[numpy.ndarray], Tuple[List[numpy.ndarray], numpy.ndarray]]#

get( idx: int, offset: int = 0, length: Optional[int] = None, ) → Union[numpy.ndarray, Tuple[numpy.ndarray, Any]]#

property sequence_lengths#

property document_indices#

static exists(path_prefix: str) → bool#
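
The lookup behind `get` can be pictured with a toy in-memory dataset: the index supplies a byte pointer and a length, and the reader slices the data at that position. The sketch assumes, following upstream Megatron-Core, that `offset` and `length` are counted in items rather than bytes; the module-level names here are illustrative only.

```python
from typing import Optional

import numpy as np

dtype = np.int32
itemsize = np.dtype(dtype).itemsize

# Three sequences stored back to back, standing in for the .bin file.
sequences = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
data = np.concatenate([np.asarray(s, dtype=dtype) for s in sequences]).tobytes()

# Per-sequence lengths and byte pointers, standing in for the .idx file.
lengths = [len(s) for s in sequences]
pointers = [0]
for n in lengths[:-1]:
    pointers.append(pointers[-1] + n * itemsize)


def get(idx: int, offset: int = 0, length: Optional[int] = None) -> np.ndarray:
    """Return part or all of sequence `idx` (ASSUMPTION: item-based offset)."""
    seq_pointer, seq_length = pointers[idx], lengths[idx]
    if length is None:
        length = seq_length - offset
    return np.frombuffer(
        data, dtype=dtype, count=length, offset=seq_pointer + offset * itemsize
    )
```

Here `get(1)` returns the whole second sequence, while `get(2, offset=1, length=2)` returns two items from the middle of the third.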

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder( bin_path: str, dtype: Type[numpy.number] = numpy.int32, multimodal: bool = False, )#

Bases: object

Builder class for the IndexedDataset class

Parameters:

bin_path (str) – The path to the data (.bin) file
dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.
multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.

Initialization

add_item(tensor: torch.Tensor, mode: int = 0) → None#

Add a single item to the dataset

Parameters:

tensor (torch.Tensor) – The item to add to the data file
mode (int, optional) – The mode for the item. Defaults to 0.

add_document( tensor: torch.Tensor, lengths: List[int], modes: Optional[List[int]] = None, ) → None#

Add an entire document to the dataset

Parameters:

tensor (torch.Tensor) – The document to add
lengths (List[int]) – The lengths of each item in the document
modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.

end_document() → None#

Finalize the document, for use with IndexedDatasetBuilder.add_item

add_index(path_prefix: str) → None#

Add an entire IndexedDataset to the dataset

Parameters: path_prefix (str) – The index (.idx) and data (.bin) prefix

finalize(idx_path: str) → None#

Clean up and write the index (.idx) file

Parameters: idx_path (str) – The path to the index file
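
The bookkeeping behind `add_item` and `end_document` can be sketched with a toy builder that writes to an in-memory buffer. `ToyBuilder` is a hypothetical stand-in: it takes plain lists instead of `torch.Tensor` objects, and starting `document_indices` at `[0]` is an assumption modeled on upstream Megatron-Core.

```python
import io
from typing import List

import numpy as np


class ToyBuilder:
    """Illustrative stand-in for the builder's bookkeeping only."""

    def __init__(self, dtype=np.int32) -> None:
        self.dtype = dtype
        self.data = io.BytesIO()  # stands in for the .bin file
        self.sequence_lengths: List[int] = []
        self.document_indices: List[int] = [0]  # ASSUMPTION: starts at 0

    def add_item(self, tokens: List[int]) -> None:
        # Append the raw bytes and remember how many items were written.
        arr = np.asarray(tokens, dtype=self.dtype)
        self.data.write(arr.tobytes(order="C"))
        self.sequence_lengths.append(arr.size)

    def end_document(self) -> None:
        # Record the sequence index at which the current document ends.
        self.document_indices.append(len(self.sequence_lengths))


builder = ToyBuilder()
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()  # first document: two sequences
builder.add_item([6])
builder.end_document()  # second document: one sequence
```

After these calls the builder holds everything an index writer needs: the per-sequence lengths and the sequence indices that mark document boundaries.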

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) → str#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) → str#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) → str#
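
Since a prefix can be passed with or without an extension, the path helpers presumably reduce all three spellings to the same pair of files. One way this could work is sketched below; the functions are hypothetical illustrations and their behavior is an assumption, not the module's actual code.

```python
def normalize_prefix(path_prefix: str) -> str:
    """Strip a trailing .bin or .idx extension, if present (illustrative)."""
    for ext in (".bin", ".idx"):
        if path_prefix.endswith(ext):
            return path_prefix[: -len(ext)]
    return path_prefix


def idx_path(path_prefix: str) -> str:
    # Hypothetical counterpart of get_idx_path.
    return normalize_prefix(path_prefix) + ".idx"


def bin_path(path_prefix: str) -> str:
    # Hypothetical counterpart of get_bin_path.
    return normalize_prefix(path_prefix) + ".bin"


spellings = [
    "/path/to/shard_00_text_document",
    "/path/to/shard_00_text_document.bin",
    "/path/to/shard_00_text_document.idx",
]
resolved = {(idx_path(p), bin_path(p)) for p in spellings}
```

All three spellings resolve to a single `(.idx, .bin)` pair, which matches the equivalence shown in the usage example at the top of this page.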
