nemo_automodel.components.datasets.llm.megatron.indexed_dataset#

A self-contained port of Megatron-Core’s indexed dataset loader.

Supports the original mmap and file-pointer readers for *.bin / *.idx pairs. Both files of a pair are expected to live on a local filesystem.

All three calls below are equivalent:

from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset

ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])

ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])

Module Contents#

Classes#

DType

The NumPy data type Enum for reading the IndexedDataset indices

_IndexWriter

Object class to write the index (.idx) file

_IndexReader

Object class to read the index (.idx) file

_BinReader

Abstract class to read the data (.bin) file

_MMapBinReader

A _BinReader that memory maps the data (.bin) file

_FileBinReader

A _BinReader that reads from the data (.bin) file using a file pointer

IndexedDataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

IndexedDatasetBuilder

Builder class for the IndexedDataset class

Functions#

Data#

API#

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#

‘getLogger(…)’

nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#

b'MMIDIDX\x00\x00'
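As a rough illustration of how such a magic value is typically used, the sketch below checks whether a file begins with these nine bytes. The helper name `has_index_header` is hypothetical and not part of this module, and the sketch says nothing about the fields that follow the header in a real index file.

```python
import os
import tempfile

# Magic bytes copied from the _INDEX_HEADER value documented above.
INDEX_HEADER = b"MMIDIDX\x00\x00"


def has_index_header(idx_path: str) -> bool:
    """Return True if the file starts with the expected magic bytes (hypothetical helper)."""
    with open(idx_path, "rb") as stream:
        return stream.read(len(INDEX_HEADER)) == INDEX_HEADER


# Demo on throwaway files: one that starts with the header, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.idx")
    bad = os.path.join(tmp, "bad.idx")
    with open(good, "wb") as f:
        f.write(INDEX_HEADER + b"\x01")
    with open(bad, "wb") as f:
        f.write(b"not an index")
    good_ok = has_index_header(good)
    bad_ok = has_index_header(bad)

print(good_ok, bad_ok)
```

A check like this is cheap and lets a loader fail early with a clear message when it is pointed at the wrong file.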

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#

Bases: enum.Enum

The NumPy data type Enum for reading the IndexedDataset indices

Initialization

uint8#

1

int8#

2

int16#

3

int32#

4

int64#

5

float64#

6

float32#

7

uint16#

8

classmethod code_from_dtype(value: Type[numpy.number]) int#

Get the code from the dtype

Parameters:

value (Type[numpy.number]) – The dtype

Returns:

The code

Return type:

int

classmethod dtype_from_code(value: int) Type[numpy.number]#

Get the dtype from the code

Parameters:

value (int) – The code

Returns:

The dtype

Return type:

Type[numpy.number]

classmethod size(key: Union[int, Type[numpy.number]]) int#

Get the size of the dtype/code in bytes

Parameters:

key (Union[int, Type[numpy.number]]) – The dtype or code

Raises:

ValueError – If the key is neither dtype nor integer code

Returns:

The size of the dtype/code in bytes

Return type:

int
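The relationship between codes, types, and byte sizes can be sketched without NumPy. The class below is a standard-library mirror of the codes listed above, keyed by type name instead of NumPy type; `DTypeSketch` and the helper functions are hypothetical names used only for illustration, not the module's API.

```python
from enum import Enum


class DTypeSketch(Enum):
    """Stdlib-only mirror of the documented codes (hypothetical name)."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8


# Item sizes in bytes, which follow from the bit widths in the type names.
_ITEMSIZE = {
    "uint8": 1, "int8": 1, "int16": 2, "int32": 4,
    "int64": 8, "float64": 8, "float32": 4, "uint16": 2,
}


def code_from_name(name: str) -> int:
    """Type name to integer code."""
    return DTypeSketch[name].value


def name_from_code(code: int) -> str:
    """Integer code back to type name."""
    return DTypeSketch(code).name


def size(key) -> int:
    """Accept either an integer code or a type name; reject anything else."""
    if isinstance(key, int):
        return _ITEMSIZE[name_from_code(key)]
    if isinstance(key, str):
        return _ITEMSIZE[key]
    raise ValueError(f"expected a code or a type name, got {key!r}")


print(code_from_name("int32"), name_from_code(8), size(4), size("uint16"))
```

Storing a one-byte code rather than a type name keeps the on-disk header compact while still letting the reader recover the item size.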

classmethod optimal_dtype(
cardinality: Optional[int],
) Type[numpy.number]#

Get the dtype to use for an index of a certain cardinality

Parameters:

cardinality (Optional[int]) – The number of elements to be indexed

Returns:

The dtype to use for the index

Return type:

Type[numpy.number]
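The selection rule is not spelled out on this page. The sketch below assumes the rule used by upstream Megatron-Core (a 16-bit unsigned type when the cardinality is known and below 65500, otherwise a 32-bit signed type); treat both the threshold and the function name `optimal_dtype_name` as assumptions rather than as documented behavior of this port.

```python
from typing import Optional

# Assumption: threshold taken from upstream Megatron-Core, not from this page.
_UINT16_LIMIT = 65500


def optimal_dtype_name(cardinality: Optional[int]) -> str:
    """Return the name of the smallest type assumed to hold every index value."""
    if cardinality is not None and cardinality < _UINT16_LIMIT:
        return "uint16"
    return "int32"


# A typical cardinality is a tokenizer's vocabulary size.
print(optimal_dtype_name(50_000))
print(optimal_dtype_name(128_000))
print(optimal_dtype_name(None))
```

Choosing the narrower type halves the size of the data file for small vocabularies, which is why the cardinality is worth passing when it is known.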

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#

Bases: object

Object class to write the index (.idx) file

Parameters:
  • idx_path (str) – The path to the index file

  • dtype (Type[numpy.number]) – The dtype of the index file

Initialization

__enter__() nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter#

Enter the context introduced by the ‘with’ keyword

Returns:

The instance

Return type:

_IndexWriter

__exit__(
exc_type: Optional[Type[BaseException]],
exc_val: Optional[BaseException],
exc_tb: Optional[types.TracebackType],
) Optional[bool]#

Exit the context introduced by the ‘with’ keyword

Parameters:
  • exc_type (Optional[Type[BaseException]]) – Exception type

  • exc_val (Optional[BaseException]) – Exception value

  • exc_tb (Optional[TracebackType]) – Exception traceback object

Returns:

Whether to silence the exception

Return type:

Optional[bool]

write(
sequence_lengths: List[int],
sequence_modes: Optional[List[int]],
document_indices: List[int],
) None#

Write the index (.idx) file

Parameters:
  • sequence_lengths (List[int]) – The length of each sequence

  • sequence_modes (Optional[List[int]]) – The mode of each sequence

  • document_indices (List[int]) – The sequence indices demarcating the end of each document

_sequence_pointers(
sequence_lengths: List[int],
) List[int]#

Build the sequence pointers per the sequence lengths and dtype size

Parameters:

sequence_lengths (List[int]) – The length of each sequence

Returns:

The pointer to the beginning of each sequence

Return type:

List[int]
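One way to read this description: each pointer is the byte offset at which a sequence starts, so it equals the total byte length of all sequences before it. The following is a minimal sketch of that computation, assuming pointers are byte offsets from the start of the data file; `sequence_pointers` and its `itemsize` parameter are illustrative names, not the private method itself.

```python
from itertools import accumulate
from typing import List


def sequence_pointers(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Byte offset of the start of each sequence (exclusive running sum)."""
    byte_lengths = [length * itemsize for length in sequence_lengths]
    # initial=0 makes the first pointer 0; dropping the last entry removes
    # the offset one past the final sequence.
    return list(accumulate(byte_lengths, initial=0))[:-1]


# Three sequences of 3, 5, and 2 items stored as 4-byte values.
pointers = sequence_pointers([3, 5, 2], itemsize=4)
print(pointers)
```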

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#

Object class to read the index (.idx) file

Parameters:
  • idx_path (str) – The path to the index file

  • multimodal (bool) – Whether the dataset is multimodal

Initialization

__del__() None#

Clean up the object

__len__() int#

Get the number of sequences in the dataset

Returns:

The number of sequences in the dataset

Return type:

int

__getitem__(
idx: int,
) Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]#

Return the pointer, length, and mode at the index

Parameters:

idx (int) – The index into the dataset

Returns:

The pointer, length and mode at the index

Return type:

Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#

Bases: abc.ABC

Abstract class to read the data (.bin) file

abstract read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that memory maps the data (.bin) file

Initialization

Initialize the _MMapBinReader

Parameters:

bin_path (str) – The path to the data (.bin) file.

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray
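To illustrate the idea of a memory-mapped read, here is a standard-library sketch that decodes `count` items starting at a byte `offset`. It uses `mmap` and `struct` in place of NumPy, returns a plain list, and reopens the file on every call for brevity; `read_items_mmap` is a hypothetical helper, and the real reader may keep its mapping open for the lifetime of the object.

```python
import mmap
import os
import struct
import tempfile


def read_items_mmap(bin_path: str, fmt: str, count: int, offset: int) -> list:
    """Decode `count` little-endian items of struct format `fmt` at byte `offset`."""
    with open(bin_path, "rb") as stream:
        with mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # unpack_from copies the values out, so the map can be closed afterwards.
            return list(struct.unpack_from(f"<{count}{fmt}", view, offset))


# Demo: six 32-bit integers on disk, read three of them starting at byte 8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<6i", 0, 1, 2, 3, 4, 5))
    mmap_result = read_items_mmap(path, "i", count=3, offset=8)

print(mmap_result)
```

With a mapping, the operating system pages data in on demand, so random access into a large file does not require reading it all into memory.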

__del__() None#

Clean up the object

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#

Bases: nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader

A _BinReader that reads from the data (.bin) file using a file pointer

Initialization

Initialize the _FileBinReader

Parameters:

bin_path (str) – The path to the data (.bin) file.

read(
dtype: Type[numpy.number],
count: int,
offset: int,
) numpy.ndarray#

Read bytes into a numpy array.

Parameters:
  • dtype (Type[numpy.number]) – Data-type of the returned array.

  • count (int) – Number of items to read.

  • offset (int) – Start reading from this offset (in bytes).

Returns:

An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.

Return type:

numpy.ndarray
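The file-pointer variant can be pictured as a seek followed by a fixed-size read. The sketch below shows that pattern with the standard library only; `read_items_file` is a hypothetical helper that returns a list, whereas the documented method returns a NumPy array.

```python
import os
import struct
import tempfile


def read_items_file(bin_path: str, fmt: str, count: int, offset: int) -> list:
    """Seek to byte `offset`, read `count` items, and decode them as little-endian."""
    itemsize = struct.calcsize("<" + fmt)
    with open(bin_path, "rb") as stream:
        stream.seek(offset)
        raw = stream.read(count * itemsize)
    return list(struct.unpack(f"<{count}{fmt}", raw))


# Demo: six 32-bit integers on disk, read two of them starting at byte 16.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<6i", 10, 11, 12, 13, 14, 15))
    file_result = read_items_file(path, "i", count=2, offset=16)

print(file_result)
```

Compared with a memory map, explicit reads copy only the requested bytes, which can be preferable when mapping the whole file is undesirable.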

class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset(
path_prefix: str,
multimodal: bool = False,
mmap: bool = True,
)#

Bases: torch.utils.data.Dataset

A fast, on-disk dataset backed by Megatron-style index + binary files.

Initialization

Initialize the IndexedDataset

Parameters:
  • path_prefix (str) – The index (.idx) and data (.bin) prefix

  • multimodal (bool) – Whether the dataset is multimodal. Defaults to False.

  • mmap (bool) – Whether to mmap the .bin files. Defaults to True.

initialize(path_prefix: str, multimodal: bool, mmap: bool) None#

__len__() int#

__getitem__(
idx: Union[int, numpy.integer, slice],
) Union[numpy.ndarray, Tuple[numpy.ndarray, Any], List[numpy.ndarray], Tuple[List[numpy.ndarray], numpy.ndarray]]#

get(
idx: int,
offset: int = 0,
length: Optional[int] = None,
) Union[numpy.ndarray, Tuple[numpy.ndarray, Any]]#

property sequence_lengths#

property document_indices#

static exists(path_prefix: str) bool#
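The `get` signature suggests a partial read of one sequence. The sketch below shows the arithmetic such a read plausibly involves, assuming `offset` and `length` are counted in items rather than bytes and that a missing `length` means "to the end of the sequence". These semantics are an assumption based on upstream Megatron-Core, and `resolve_read` is a hypothetical helper.

```python
from typing import Optional, Tuple


def resolve_read(
    pointer: int,
    sequence_length: int,
    itemsize: int,
    offset: int = 0,
    length: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (byte offset into the .bin file, number of items to read)."""
    if length is None:
        # Assumption: default to everything from `offset` to the end of the sequence.
        length = sequence_length - offset
    return pointer + offset * itemsize, length


# A 10-item sequence of 4-byte values that starts at byte 1000.
print(resolve_read(1000, 10, 4))
print(resolve_read(1000, 10, 4, offset=3))
print(resolve_read(1000, 10, 4, offset=3, length=2))
```

The resulting pair maps directly onto the `offset` and `count` arguments of the reader's `read` method.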
class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder(
bin_path: str,
dtype: Type[numpy.number] = numpy.int32,
multimodal: bool = False,
)#

Bases: object

Builder class for the IndexedDataset class

Parameters:
  • bin_path (str) – The path to the data (.bin) file

  • dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.

  • multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.

Initialization

add_item(tensor: torch.Tensor, mode: int = 0) None#

Add a single item to the dataset

Parameters:
  • tensor (torch.Tensor) – The item to add to the data file

  • mode (int, optional) – The mode for the item. Defaults to 0.

add_document(
tensor: torch.Tensor,
lengths: List[int],
modes: Optional[List[int]] = None,
) None#

Add an entire document to the dataset

Parameters:
  • tensor (torch.Tensor) – The document to add

  • lengths (List[int]) – The lengths of each item in the document

  • modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.

end_document() None#

Finalize the document, for use with IndexedDatasetBuilder.add_item
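The interplay between `add_item` and `end_document` is easier to see with the bookkeeping isolated. This toy class tracks only sequence lengths and document boundaries and writes nothing to disk. It assumes the boundary list starts at 0, as in upstream Megatron-Core; `BuilderBookkeeping` is a hypothetical name, not part of this module.

```python
from typing import List, Sequence


class BuilderBookkeeping:
    """Toy model of the lengths and boundaries a builder accumulates."""

    def __init__(self) -> None:
        self.sequence_lengths: List[int] = []
        # Assumption: the first document starts at sequence 0.
        self.document_indices: List[int] = [0]

    def add_item(self, tokens: Sequence[int]) -> None:
        # Each item becomes one sequence; only its length is recorded here.
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # Mark the current sequence count as the end of the document.
        self.document_indices.append(len(self.sequence_lengths))


builder = BuilderBookkeeping()
builder.add_item([1, 2, 3])     # document 0, sequence 0
builder.add_item([4, 5])        # document 0, sequence 1
builder.end_document()
builder.add_item([6, 7, 8, 9])  # document 1, sequence 2
builder.end_document()

print(builder.sequence_lengths, builder.document_indices)
```

Under this convention, document `i` spans sequences `document_indices[i]` up to, but not including, `document_indices[i + 1]`.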

add_index(path_prefix: str) None#

Add an entire IndexedDataset to the dataset

Parameters:

path_prefix (str) – The index (.idx) and data (.bin) prefix

finalize(idx_path: str) None#

Clean up and write the index (.idx) file

Parameters:

idx_path (str) – The path to the index file

nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) str#
nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) str#
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) str#
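
These helpers are undocumented here, but the statement that a prefix, a .bin path, and an .idx path are all equivalent suggests behavior along the following lines. This is a sketch of that inferred behavior only; the function names below are illustrative stand-ins and the real implementations may differ.

```python
def normalize_prefix(path_prefix: str) -> str:
    """Strip a trailing .bin or .idx so all three spellings share one prefix."""
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def idx_path(path_prefix: str) -> str:
    """Path of the index file for a prefix given in any of the three forms."""
    return normalize_prefix(path_prefix) + ".idx"


def bin_path(path_prefix: str) -> str:
    """Path of the data file for a prefix given in any of the three forms."""
    return normalize_prefix(path_prefix) + ".bin"


for spelling in (
    "/path/to/shard_00_text_document",
    "/path/to/shard_00_text_document.bin",
    "/path/to/shard_00_text_document.idx",
):
    print(idx_path(spelling), bin_path(spelling))
```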