nemo_automodel.components.datasets.llm.megatron.indexed_dataset
#
A self-contained port of Megatron-Core's indexed dataset loader.
Supports the original mmap and file-pointer readers for local *.bin / *.idx pairs. The file pair is expected to live on a local filesystem.
All three calls below are equivalent:
from nemo_automodel.components.datasets.llm.megatron.indexed_dataset import IndexedDataset
ds = IndexedDataset("/path/to/shard_00_text_document")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.bin")
print(len(ds), ds[0][:20])
ds = IndexedDataset("/path/to/shard_00_text_document.idx")
print(len(ds), ds[0][:20])
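The equivalence of the three calls suggests that a trailing `.bin` or `.idx` suffix is stripped before the file pair is resolved. The following is a minimal sketch of that idea, not the module's code: `normalize_prefix_sketch` and `pair_paths_sketch` are hypothetical names, and the actual behavior of `_normalize_prefix`, `get_idx_path`, and `get_bin_path` is not documented on this page.

```python
from typing import Tuple


def normalize_prefix_sketch(path_prefix: str) -> str:
    """Hypothetical helper: drop a trailing .bin or .idx so every spelling maps to one prefix."""
    for suffix in (".bin", ".idx"):
        if path_prefix.endswith(suffix):
            return path_prefix[: -len(suffix)]
    return path_prefix


def pair_paths_sketch(path_prefix: str) -> Tuple[str, str]:
    """Hypothetical helper: return the (.idx, .bin) paths that share the normalized prefix."""
    prefix = normalize_prefix_sketch(path_prefix)
    return prefix + ".idx", prefix + ".bin"
```

Under this assumption, passing the bare prefix, the `.bin` path, or the `.idx` path all resolve to the same pair of files.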
Module Contents#
Classes#
| DType | The NumPy data type Enum for reading the IndexedDataset indices |
| _IndexWriter | Object class to write the index (.idx) file |
| _IndexReader | Object class to read the index (.idx) file |
| _BinReader | Abstract class to read the data (.bin) file |
| _MMapBinReader | A _BinReader that memory maps the data (.bin) file |
| _FileBinReader | A _BinReader that reads from the data (.bin) file using a file pointer |
| IndexedDataset | A fast, on-disk dataset backed by Megatron-style index + binary files. |
| IndexedDatasetBuilder | Builder class for the IndexedDataset class |
Functions#
| get_idx_path | |
| get_bin_path | |
| _normalize_prefix | |
Data#
| logger | |
| _INDEX_HEADER | |
API#
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.logger#
'getLogger(…)'
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._INDEX_HEADER#
b'MMIDIDX\x00\x00'
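A magic byte string like this is typically written at the start of a file so that a reader can reject inputs that are not in the expected format. The sketch below illustrates that check with the documented value. It assumes the header sits at byte 0 of the `.idx` file; `has_index_header` is a hypothetical helper, not part of this module.

```python
import os
import tempfile

INDEX_HEADER = b"MMIDIDX\x00\x00"  # value documented for _INDEX_HEADER


def has_index_header(idx_path: str) -> bool:
    """Hypothetical check: True if the file begins with the 9-byte magic string."""
    with open(idx_path, "rb") as f:
        return f.read(len(INDEX_HEADER)) == INDEX_HEADER


# Demonstrate on two throwaway files: one with the header, one without.
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.idx")
    bad_path = os.path.join(tmp, "bad.idx")
    with open(good_path, "wb") as f:
        f.write(INDEX_HEADER + b"\x00" * 8)
    with open(bad_path, "wb") as f:
        f.write(b"not an index")
    good = has_index_header(good_path)
    bad = has_index_header(bad_path)

print(good, bad)
```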
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.DType(*args, **kwds)#
Bases:
enum.Enum
The NumPy data type Enum for reading the IndexedDataset indices
Initialization
- uint8#
1
- int8#
2
- int16#
3
- int32#
4
- int64#
5
- float64#
6
- float32#
7
- uint16#
8
- classmethod code_from_dtype(value: Type[numpy.number]) int #
Get the code from the dtype
- Parameters:
value (Type[numpy.number]) – The dtype
- Returns:
The code
- Return type:
int
- classmethod dtype_from_code(value: int) Type[numpy.number] #
Get the dtype from the code
- Parameters:
value (int) – The code
- Returns:
The dtype
- Return type:
Type[numpy.number]
- classmethod size(key: Union[int, Type[numpy.number]]) int #
Get the size of the dtype/code in bytes
- Parameters:
key (Union[int, Type[numpy.number]]) – The dtype or code
- Raises:
ValueError – If the key is neither dtype nor integer code
- Returns:
The size of the dtype/code in bytes
- Return type:
int
- classmethod optimal_dtype(cardinality: Optional[int]) Type[numpy.number] #
Get the dtype to use for an index of a certain cardinality
- Parameters:
cardinality (Optional[int]) – The number of elements to be indexed
- Returns:
The dtype to use for the index
- Return type:
Type[numpy.number]
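The code table and the three lookups above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation. Type names (strings) stand in for NumPy types, `DTypeSketch`, `ITEM_SIZE`, `size_sketch`, and `optimal_name_sketch` are hypothetical names, and the 65500 threshold is taken from upstream Megatron-Core as recalled, so the port may use a different rule.

```python
from enum import Enum
from typing import Optional, Union


class DTypeSketch(Enum):
    """Codes copied from the DType listing above; names stand in for NumPy types."""

    uint8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float64 = 6
    float32 = 7
    uint16 = 8

    @classmethod
    def code_from_name(cls, name: str) -> int:
        return cls[name].value

    @classmethod
    def name_from_code(cls, code: int) -> str:
        return cls(code).name


# Item sizes in bytes for each fixed-width type name.
ITEM_SIZE = {
    "uint8": 1, "int8": 1, "int16": 2, "int32": 4,
    "int64": 8, "float64": 8, "float32": 4, "uint16": 2,
}


def size_sketch(key: Union[int, str]) -> int:
    """Accept an integer code or a type name; anything else raises ValueError."""
    if isinstance(key, int):
        return ITEM_SIZE[DTypeSketch.name_from_code(key)]
    if isinstance(key, str) and key in ITEM_SIZE:
        return ITEM_SIZE[key]
    raise ValueError(f"{key!r} is neither a known type name nor an integer code")


def optimal_name_sketch(cardinality: Optional[int]) -> str:
    # Assumption: small vocabularies fit in uint16, everything else uses int32.
    if cardinality is not None and cardinality < 65500:
        return "uint16"
    return "int32"
```

Storing a one-byte code in the index instead of a type name keeps the header compact and lets the reader recover the item size before touching the data file.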
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter(idx_path: str, dtype: Type[numpy.number])#
Bases:
object
Object class to write the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
dtype (Type[numpy.number]) – The dtype of the index file
Initialization
- __enter__() nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter #
Enter the context introduced by the 'with' keyword
- Returns:
The instance
- Return type:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexWriter
- __exit__(exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException], exc_tb: Optional[types.TracebackType]) Optional[bool] #
Exit the context introduced by the 'with' keyword
- Parameters:
exc_type (Optional[Type[BaseException]]) – Exception type
exc_val (Optional[BaseException]) – Exception value
exc_tb (Optional[TracebackType]) – Exception traceback object
- Returns:
Whether to silence the exception
- Return type:
Optional[bool]
- write(sequence_lengths: List[int], sequence_modes: Optional[List[int]], document_indices: List[int])#
Write the index (.idx) file
- Parameters:
sequence_lengths (List[int]) – The length of each sequence
sequence_modes (Optional[List[int]]) – The mode of each sequence
document_indices (List[int]) – The sequence indices demarcating the end of each document
- _sequence_pointers(sequence_lengths: List[int]) List[int] #
Build the sequence pointers per the sequence lengths and dtype size
- Parameters:
sequence_lengths (List[int]) – The length of each sequence
- Returns:
The pointer to the beginning of each sequence
- Return type:
List[int]
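Building pointers "per the sequence lengths and dtype size" amounts to a running sum of byte lengths, so that each pointer is the byte offset where its sequence begins. A minimal sketch follows, assuming the sequences are packed back to back in the `.bin` file; `sequence_pointers_sketch` is a hypothetical name and takes the item size directly rather than a dtype.

```python
from typing import List


def sequence_pointers_sketch(sequence_lengths: List[int], itemsize: int) -> List[int]:
    """Byte offset of the start of each sequence in a densely packed data file."""
    pointers = []
    offset = 0
    for length in sequence_lengths:
        pointers.append(offset)       # this sequence starts where the last one ended
        offset += length * itemsize   # advance by this sequence's size in bytes
    return pointers


print(sequence_pointers_sketch([3, 5, 2], itemsize=4))  # [0, 12, 32]
```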
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._IndexReader(idx_path: str, multimodal: bool)#
Object class to read the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
multimodal (bool) – Whether the dataset is multimodal
Initialization
- __del__() None #
Clean up the object
- __len__() int #
Get the number of sequences in the dataset
- Returns:
The number of sequences in the dataset
- Return type:
int
- __getitem__(idx: int) Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]] #
Return the pointer, length, and mode at the index
- Parameters:
idx (int) – The index into the dataset
- Returns:
The pointer, length and mode at the index
- Return type:
Tuple[numpy.int32, numpy.int64, Optional[numpy.int8]]
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader#
Bases:
abc.ABC
Abstract class to read the data (.bin) file
- abstract read(dtype: Type[numpy.number], count: int, offset: int) numpy.ndarray #
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._MMapBinReader(bin_path: str)#
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
A _BinReader that memory maps the data (.bin) file
Initialization
Initialize the _MMapBinReader
- Parameters:
bin_path (str) – The path to the data (.bin) file.
- read(dtype: Type[numpy.number], count: int, offset: int) numpy.ndarray #
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
- __del__() None #
Clean up the object
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset._FileBinReader(bin_path: str)#
Bases:
nemo_automodel.components.datasets.llm.megatron.indexed_dataset._BinReader
A _BinReader that reads from the data (.bin) file using a file pointer
Initialization
Initialize the _FileBinReader
- Parameters:
bin_path (str) – The path to the data (.bin) file.
- read(dtype: Type[numpy.number], count: int, offset: int) numpy.ndarray #
Read bytes into a numpy array.
- Parameters:
dtype (Type[numpy.number]) – Data-type of the returned array.
count (int) – Number of items to read.
offset (int) – Start reading from this offset (in bytes).
- Returns:
An array with count items and data-type dtype constructed from reading bytes from the data file starting at offset.
- Return type:
numpy.ndarray
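The two readers differ only in how they reach the bytes: one maps the file into memory, the other seeks with a file pointer. The standard-library sketch below shows both strategies returning the same items for the same `count` and byte `offset`. It is an illustration rather than the module's code: the function names are hypothetical, and items are assumed to be little-endian int32 for the demonstration.

```python
import mmap
import os
import struct
import tempfile


def read_with_file_pointer(bin_path: str, count: int, offset: int) -> list:
    """Seek to a byte offset and read `count` little-endian int32 items."""
    with open(bin_path, "rb") as f:
        f.seek(offset)
        raw = f.read(count * 4)
    return list(struct.unpack(f"<{count}i", raw))


def read_with_mmap(bin_path: str, count: int, offset: int) -> list:
    """Map the whole file read-only and decode the same byte range in place."""
    with open(bin_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{count}i", mm, offset))


# Write six int32 values, then read three of them starting at byte 8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<6i", 10, 11, 12, 13, 14, 15))
    via_fp = read_with_file_pointer(path, count=3, offset=8)
    via_mmap = read_with_mmap(path, count=3, offset=8)

print(via_fp, via_mmap)  # [12, 13, 14] [12, 13, 14]
```

Memory mapping lets the operating system page data in on demand, while an explicit file pointer avoids holding a mapping open.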
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset(path_prefix: str, multimodal: bool = False, mmap: bool = True)#
Bases:
torch.utils.data.Dataset
A fast, on-disk dataset backed by Megatron-style index + binary files.
Initialization
Initialize the IndexedDataset
- Parameters:
path_prefix (str) – The index (.idx) and data (.bin) prefix
multimodal (bool) – Whether the dataset is multimodal. Defaults to False.
mmap (bool) – Whether to mmap the .bin files. Defaults to True.
- initialize(path_prefix: str, multimodal: bool, mmap: bool) None #
- __len__() int #
- __getitem__(idx: Union[int, numpy.integer, slice])#
- get(idx: int, offset: int = 0, length: Optional[int] = None)#
- property sequence_lengths#
- property document_indices#
- static exists(path_prefix: str) bool #
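The `get` signature carries no description here, so the following is only a guess at what `offset` and `length` imply. It assumes both are counted in items within sequence `idx`, and that a missing `length` means "to the end of the sequence". Under those assumptions, a partial read reduces to pointer arithmetic over the index. `partial_read_window` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def partial_read_window(
    pointers: List[int],
    lengths: List[int],
    itemsize: int,
    idx: int,
    offset: int = 0,
    length: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (byte_offset, item_count) for a partial read of sequence `idx`."""
    if length is None:
        # Assumption: default to everything after `offset` in this sequence.
        length = lengths[idx] - offset
    # Skip `offset` items past the start of the sequence, measured in bytes.
    return pointers[idx] + offset * itemsize, length


# Three int32 sequences of lengths 3, 5, and 2 packed back to back.
print(partial_read_window([0, 12, 32], [3, 5, 2], 4, idx=1, offset=2))  # (20, 3)
```

The resulting pair is exactly what a `read(dtype, count, offset)` call on a bin reader would need.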
- class nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDatasetBuilder(bin_path: str, dtype: Type[numpy.number] = numpy.int32, multimodal: bool = False)#
Bases:
object
Builder class for the IndexedDataset class
- Parameters:
bin_path (str) – The path to the data (.bin) file
dtype (Type[numpy.number], optional) – The dtype of the index file. Defaults to numpy.int32.
multimodal (bool, optional) – Whether the dataset is multimodal. Defaults to False.
Initialization
- add_item(tensor: torch.Tensor, mode: int = 0) None #
Add a single item to the dataset
- Parameters:
tensor (torch.Tensor) – The item to add to the data file
mode (int, optional) – The mode for the item. Defaults to 0.
- add_document(tensor: torch.Tensor, lengths: List[int], modes: Optional[List[int]] = None)#
Add an entire document to the dataset
- Parameters:
tensor (torch.Tensor) – The document to add
lengths (List[int]) – The lengths of each item in the document
modes (Optional[List[int]], optional) – The modes for each item in the document. Defaults to None.
- end_document() None #
Finalize the document, for use with IndexedDatasetBuilder.add_item
- add_index(path_prefix: str) None #
Add an entire IndexedDataset to the dataset
- Parameters:
path_prefix (str) – The index (.idx) and data (.bin) prefix
- finalize(idx_path: str) None #
Clean up and write the index (.idx) file
- Parameters:
idx_path (str) – The path to the index file
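The builder methods imply some bookkeeping: each added item contributes one sequence length, and closing a document records a boundary in the document indices. The sketch below models only that bookkeeping with plain lists. It is a hypothetical stand-in: `TinyBuilderSketch` writes no files and takes token lists instead of tensors, and the leading 0 in `document_indices` is an assumption carried over from upstream Megatron-Core.

```python
from typing import List


class TinyBuilderSketch:
    """Hypothetical stand-in that tracks lengths and document boundaries only."""

    def __init__(self) -> None:
        self.sequence_lengths: List[int] = []
        self.document_indices: List[int] = [0]  # assumption: boundaries start at 0

    def add_item(self, tokens: List[int]) -> None:
        # One item is one sequence; only its length goes into the index.
        self.sequence_lengths.append(len(tokens))

    def end_document(self) -> None:
        # Record how many sequences exist once this document is closed.
        self.document_indices.append(len(self.sequence_lengths))

    def add_document(self, tokens: List[int], lengths: List[int]) -> None:
        # A whole document at once: `lengths` splits `tokens` into items.
        if sum(lengths) != len(tokens):
            raise ValueError("lengths must sum to the number of tokens")
        self.sequence_lengths.extend(lengths)
        self.document_indices.append(len(self.sequence_lengths))


builder = TinyBuilderSketch()
builder.add_item([1, 2, 3])
builder.add_item([4, 5])
builder.end_document()                         # first document: two items
builder.add_document([6, 7, 8, 9], [1, 3])     # second document: two items
print(builder.sequence_lengths, builder.document_indices)  # [3, 2, 1, 3] [0, 2, 4]
```

In this model, `add_item` followed by `end_document` and a single `add_document` call leave the index in the same shape, which is why the two entry points can be mixed.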
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_idx_path(path_prefix: str) str #
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset.get_bin_path(path_prefix: str) str #
- nemo_automodel.components.datasets.llm.megatron.indexed_dataset._normalize_prefix(path_prefix: str) str #