Date: 2025-09-25

In order, the mode per sequence (in the multimodal case)

In order, the consecutive sequence index range [...) per document

In order, the byte offset (pointer) per sequence

In order, the number of elements per sequence

The index file stores document-level and sequence-level metadata second:

The number of documents in the dataset

The number of sequences in the dataset

A numeric code corresponding to the data type used to write data to the data file

The IndexedDataset class is the lowest-level data interface in Megatron Core. Internally, an IndexedDataset instance references two binaries: the data file (.bin) contains document/sequence data, and the index file (.idx) contains document/sequence metadata.
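To make the split between data and metadata concrete, here is a minimal sketch of how per-sequence lengths, per-sequence byte offsets, and per-document sequence ranges can locate a sequence inside a flat binary blob. It is an illustration only and does not reproduce Megatron Core's actual on-disk format: the in-memory buffer, the variable names, and the helpers `read_sequence` and `read_document` are all hypothetical.

```python
import io
from array import array

# Toy corpus (hypothetical data): two documents, each a list of token sequences.
documents = [
    [[1, 2, 3], [4, 5]],  # document 0 holds sequences 0 and 1
    [[6, 7, 8, 9]],       # document 1 holds sequence 2
]

TYPECODE = "i"  # signed int; the item size is platform dependent
ITEMSIZE = array(TYPECODE).itemsize

data_file = io.BytesIO()  # stands in for the data file
sequence_lengths = []     # number of elements per sequence, in order
sequence_pointers = []    # byte offset per sequence, in order
# Document d spans sequences [document_indices[d], document_indices[d + 1]).
document_indices = [0]

for doc in documents:
    for seq in doc:
        sequence_pointers.append(data_file.tell())
        sequence_lengths.append(len(seq))
        data_file.write(array(TYPECODE, seq).tobytes())
    document_indices.append(len(sequence_lengths))


def read_sequence(i):
    """Seek to the stored byte offset and read exactly one sequence."""
    data_file.seek(sequence_pointers[i])
    raw = data_file.read(sequence_lengths[i] * ITEMSIZE)
    out = array(TYPECODE)
    out.frombytes(raw)
    return out.tolist()


def read_document(d):
    """Read every sequence in the half-open range belonging to document d."""
    start, end = document_indices[d], document_indices[d + 1]
    return [read_sequence(i) for i in range(start, end)]
```

With this layout, `read_sequence(1)` needs only one seek and one read, and `read_document(0)` resolves to sequences 0 and 1 without scanning the data.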

The IndexedDatasetBuilder is capable of building and merging IndexedDataset instances.
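As a rough illustration of what merging involves, the sketch below appends one toy index to another: the second dataset's byte offsets are shifted by the size of the first data file, and its document boundaries are shifted by the number of sequences already present. The dictionary layout and the function `merge_indexes` are assumptions made for this example, not the IndexedDatasetBuilder API or its implementation.

```python
def merge_indexes(a, b):
    """Return a toy index describing dataset b appended after dataset a.

    Each index is a dict (hypothetical layout) with:
      lengths      number of elements per sequence
      pointers     byte offset per sequence within its own data file
      doc_indices  sequence boundaries per document, starting at 0
      data_size    total size of the data file in bytes
    """
    seq_offset = len(a["lengths"])  # sequences already present in a
    byte_offset = a["data_size"]    # bytes already present in a
    return {
        "lengths": a["lengths"] + b["lengths"],
        "pointers": a["pointers"] + [p + byte_offset for p in b["pointers"]],
        # Skip b's leading 0 so the shared boundary is not duplicated.
        "doc_indices": a["doc_indices"]
        + [d + seq_offset for d in b["doc_indices"][1:]],
        "data_size": a["data_size"] + b["data_size"],
    }


first = {"lengths": [3, 2], "pointers": [0, 12], "doc_indices": [0, 2], "data_size": 20}
second = {"lengths": [4], "pointers": [0], "doc_indices": [0, 1], "data_size": 16}
merged = merge_indexes(first, second)
```

The data files themselves would simply be concatenated in the same order, so that the shifted offsets remain valid.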

At the moment, an end-to-end data preprocessing implementation is left to the user. See the class docstring(s) for more details.

Data preprocessing is built around the following classes:

Data loading: construction

Building the data loaders is a distributed-aware process built around the following classes:

BlendedMegatronDatasetConfig

BlendedMegatronDatasetBuilder

IndexedDataset

MegatronDataset

BlendedDataset

See the class docstrings for more details.

BlendedMegatronDatasetConfig (extendable)

The BlendedMegatronDatasetConfig class parameterizes the BlendedMegatronDatasetBuilder and, in turn, the MegatronDataset and BlendedDataset. Different training/inference regimes will require different extensions, e.g. the GPTDatasetConfig.

BlendedMegatronDatasetBuilder

The BlendedMegatronDatasetBuilder class builds the highest-level data interfaces in Megatron Core. NB: All ranks should attempt to build the dataset via the BlendedMegatronDatasetBuilder, or the program will hang. Which ranks follow through on their attempts can be controlled via the BlendedMegatronDatasetConfig.
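The "every rank attempts, some ranks follow through" pattern can be sketched without any distributed runtime. In the toy example below, each simulated rank calls the same build function, and a predicate carried by the config decides whether that rank actually materializes a dataset. `ToyConfig`, its `is_built_on_rank` field, and `build_dataset` are hypothetical names chosen for illustration; they are not the BlendedMegatronDatasetConfig or BlendedMegatronDatasetBuilder API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyConfig:
    # Hypothetical predicate: should the given rank materialize the dataset?
    is_built_on_rank: Callable[[int], bool]


def build_dataset(rank: int, config: ToyConfig) -> Optional[List[int]]:
    """Called by every rank; only ranks that pass the predicate get a dataset.

    Assumption for this sketch: in a real distributed job, the build would
    contain synchronization points that every rank must reach, so a rank
    that skips the call entirely would leave the others waiting.
    """
    if not config.is_built_on_rank(rank):
        return None
    return list(range(4))  # placeholder for a real dataset


# Simulate four ranks in one process; only rank 0 follows through.
config = ToyConfig(is_built_on_rank=lambda rank: rank == 0)
results = [build_dataset(rank, config) for rank in range(4)]
```

Here every rank makes the call, but only the first entry of `results` holds a dataset; the rest are `None`.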

IndexedDataset

The IndexedDataset class is the lowest-level data interface in Megatron Core. The IndexedDataset should already exist on disk before attempting to build any of the high-level data interfaces.

MegatronDataset (extendable)

The MegatronDataset abstract class is a high-level data interface in Megatron Core. It is an abstraction built upon the IndexedDataset. Different training/inference regimes will require different extensions, e.g. the GPTDataset.
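The relationship between an abstract high-level dataset and its regime-specific extensions can be sketched as follows. The base class wraps a low-level sequence store and leaves sample construction to subclasses. Both class names and the next-token sample format are assumptions made for this example; they do not mirror the actual MegatronDataset or GPTDataset interfaces.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class ToyHighLevelDataset(ABC):
    """Hypothetical abstract wrapper around a low-level sequence store."""

    def __init__(self, low_level: Sequence[List[int]]):
        self.low_level = low_level

    def __len__(self) -> int:
        return len(self.low_level)

    @abstractmethod
    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        """Subclasses decide how a raw sequence becomes a training sample."""


class ToyNextTokenDataset(ToyHighLevelDataset):
    """Hypothetical extension: inputs and labels are offset by one token."""

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        tokens = self.low_level[idx]
        return {"tokens": tokens[:-1], "labels": tokens[1:]}


# A plain list of lists stands in for the low-level dataset.
dataset = ToyNextTokenDataset([[1, 2, 3, 4], [5, 6, 7]])
sample = dataset[0]
```

A different regime would subclass the same base and override only `__getitem__`, leaving the low-level storage untouched.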