
# nemo_automodel.components.datasets.llm.megatron.gpt_dataset

## Module Contents

### Classes

| Name                                                                                                                        | Description                                              |
| --------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| [`BlendedMegatronDatasetConfig`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-BlendedMegatronDatasetConfig) | Configuration object for Megatron Core datasets          |
| [`GPTDataset`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-GPTDataset)                                     | The base GPT dataset                                     |
| [`GPTDatasetConfig`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-GPTDatasetConfig)                         | Configuration object for Megatron Core GPT datasets      |
| [`Split`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-Split)                                               | Dataset split identifiers used by Megatron GPT datasets. |

### Functions

| Name                                                                                                                                        | Description                                                                   |
| ------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`_build_document_index`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-_build_document_index)                               | Build an array with length = num epochs \* num documents                      |
| [`_build_shuffle_index`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-_build_shuffle_index)                                 | Build the range \[0, size) and shuffle                                        |
| [`_get_ltor_masks_and_position_ids`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-_get_ltor_masks_and_position_ids)         | Build masks and position id for left to right model.                          |
| [`convert_split_vector_to_split_matrix`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-convert_split_vector_to_split_matrix) | Build the split matrix from one or optionally two contributing split vectors. |
| [`normalize`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-normalize)                                                       | Do non-exponentiated normalization                                            |
| [`parse_and_normalize_split`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-parse_and_normalize_split)                       | Parse the dataset split ratios from a string                                  |

### Data

[`_PAD_TOKEN_ID`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-_PAD_TOKEN_ID)

[`logger`](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-logger)

### API

```python
class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig(
    random_seed: int,
    sequence_length: int,
    blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
    blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
    split: typing.Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    path_to_cache: typing.Optional[str] = None,
    mmap_bin_files: bool = True,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    mid_level_dataset_surplus: float = 0.005,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None
)
```

Dataclass

Configuration object for Megatron Core datasets

`blend`: The blend, consisting of a list of dataset prefixes and optionally a list of dataset
weights. For example, \[\["dataset-path1", "dataset-path2"], \[0.3, 0.7]]. When the weights are
None, they are inferred from the lengths of the contributing datasets. Not to be used with
'blend\_per\_split'. Defaults to None.

`blend_per_split`: A set of blends, as defined above, one for each split distribution. Not to be used with
'blend'. Defaults to None.

`mid_level_dataset_surplus`: The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005.
This value is irrelevant for single-source data blends. This value may need to be increased
if the top-level dataset oversamples the mid-level dataset(s). This value may be set to 0.0
in future if the top-level dataset is constrained to not oversample the mid-level
dataset(s).

`mmap_bin_files`: Whether to mmap the .bin files or use file pointers.

`mock`: Whether to bypass real data loading and validation in favor of mock data generation.
Created automatically from 'blend' and 'blend\_per\_split'. Not to be passed in to the
constructor.

`num_dataset_builder_threads`: The number of threads to use for dataset building.

`object_storage_config`: When set, the .idx files are downloaded to path\_to\_idx\_cache and .bin files are streamed
from S3/MSC via chunked GETs. mmap\_bin\_files is automatically overridden to False.

`path_to_cache`: Where all reusable dataset indices are to be cached.

`random_seed`: The seed for all RNG during dataset creation.

`sequence_length`: The sequence length.

`split`: The split string, a comma-separated weighting for the dataset splits when drawing samples
from a single distribution. Not to be used with 'blend\_per\_split'. Defaults to None.

`split_matrix`: The split matrix consisting of non-overlapping book-ends of each split in order. For more
information, refer to 'convert\_split\_vector\_to\_split\_matrix'. Created automatically from
'split'. Not to be passed in to the constructor.

`tokenizer`: The PreTrainedTokenizerBase instance. Required for datasets that do online tokenization.
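To make the `blend` layout above concrete, here is a minimal sketch that builds a blend tuple and fills in missing weights from dataset lengths. The helper `infer_blend_weights` and the example lengths are hypothetical, and weighting proportionally to length is only an assumption about what "inferred from the lengths" means; the library's own logic may differ.

```python
from typing import List, Optional, Tuple

# Same shape as the `blend` field: (dataset prefixes, optional weights).
Blend = Tuple[List[str], Optional[List[float]]]


def infer_blend_weights(blend: Blend, dataset_lengths: List[int]) -> List[float]:
    """Return the explicit weights, or length-proportional weights when none are given.

    Hypothetical helper: proportional weighting is an assumption, not library code.
    """
    prefixes, weights = blend
    if len(prefixes) != len(dataset_lengths):
        raise ValueError("need exactly one length per dataset prefix")
    if weights is not None:
        return list(weights)
    total = sum(dataset_lengths)
    return [length / total for length in dataset_lengths]


explicit_blend: Blend = (["dataset-path1", "dataset-path2"], [0.3, 0.7])
implicit_blend: Blend = (["dataset-path1", "dataset-path2"], None)

# Made-up dataset lengths, used only when the weights are None.
explicit_weights = infer_blend_weights(explicit_blend, [1000, 3000])
implicit_weights = infer_blend_weights(implicit_blend, [1000, 3000])
```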

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.BlendedMegatronDatasetConfig.__post_init__() -> None
```

Do asserts and set fields post init

```python
class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset(
    indexed_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset,
    dataset_path: typing.Optional[str],
    indexed_indices: numpy.ndarray,
    num_samples: typing.Optional[int],
    index_split: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split,
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
)
```

**Bases:** `Dataset`

The base GPT dataset

**Parameters:**

`indexed_dataset`: The IndexedDataset around which to build the GPTDataset

`dataset_path`: The real path on disk to the dataset, for bookkeeping

`indexed_indices`: The set of document indices to expose

`num_samples`: The number of samples to draw from the indexed dataset. When
None, build as many samples as correspond to one epoch.

`index_split`: The indexed\_indices Split

`config`: The config

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__getitem__(
    idx: typing.Optional[int]
) -> dict[str, torch.Tensor]
```

Abstract method implementation

**Parameters:**

`idx`: The index into the dataset

**Returns:** `dict[str, torch.Tensor]`

dict\[str, torch.Tensor]: The sample information wrapped in a dictionary

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.__len__() -> int
```

Abstract method implementation

**Returns:** `int`

The effective length of the dataset, capped by num\_samples when provided

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._build_document_sample_shuffle_indices() -> typing.Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
```

Build the document index, the sample index, and the shuffle index

**Returns:** `Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]`

Tuple\[numpy.ndarray, numpy.ndarray, numpy.ndarray]: The document index, the sample
index, and the shuffle index

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_epochs(
    num_tokens_per_epoch: int
) -> int
```

Calculate the number of epochs

**Parameters:**

`num_tokens_per_epoch`: The number of tokens in a single epoch

**Returns:** `int`

The number of epochs
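One plausible way to arrive at an epoch count is sketched below. It assumes the dataset must supply `num_samples * sequence_length` tokens, plus one extra token when `add_extra_token_to_sequence` is set, and that a single epoch is used when `num_samples` is None. The function name `estimate_num_epochs` and the exact token accounting are assumptions, not the method's actual code.

```python
from typing import Optional


def estimate_num_epochs(
    num_tokens_per_epoch: int,
    num_samples: Optional[int],
    sequence_length: int,
    add_extra_token: bool = True,
) -> int:
    """Smallest number of epochs whose tokens cover the requested samples (sketch)."""
    if num_tokens_per_epoch <= 0:
        raise ValueError("num_tokens_per_epoch must be positive")
    if num_samples is None:
        # No sample budget given: one pass over the data.
        return 1
    tokens_requested = num_samples * sequence_length + int(add_extra_token)
    # Ceiling division, never fewer than one epoch.
    return max(1, -(-tokens_requested // num_tokens_per_epoch))


one_epoch = estimate_num_epochs(1000, None, 128)
two_epochs = estimate_num_epochs(1000, 10, 128)  # 1281 tokens requested
```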

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._get_num_tokens_per_epoch() -> int
```

Calculate the number of tokens in a single epoch

**Returns:** `int`

The number of tokens in a single epoch

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._key_config_attributes() -> typing.List[str]
```

staticmethod

Return all config attributes which contribute to uniquely identifying the dataset.

These attributes will be used to build a uniquely identifying string and MD5 hash which
will be used to cache/load dataset resources from run to run.

**Returns:** `List[str]`

List\[str]: The key config attributes
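The caching idea described above can be sketched with the standard library alone: serialize only the key attributes into a stable string, then hash it with MD5. The helper `describe_and_hash`, the attribute list, and the config values are hypothetical; the real attribute set and serialization format may differ.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def describe_and_hash(config: Dict[str, Any], key_attributes: List[str]) -> Tuple[str, str]:
    """Build an identifying description and its MD5 hex digest (sketch)."""
    identifiers = {name: config[name] for name in key_attributes}
    # sort_keys keeps the string stable regardless of dict insertion order.
    description = json.dumps(identifiers, indent=4, sort_keys=True)
    digest = hashlib.md5(description.encode("utf-8"), usedforsecurity=False).hexdigest()
    return description, digest


# Hypothetical key attributes; anything outside this list does not affect the hash.
key_attributes = ["random_seed", "sequence_length", "split"]
config_a = {"random_seed": 1234, "sequence_length": 2048, "split": "99,1,0", "num_dataset_builder_threads": 4}
config_b = dict(config_a, num_dataset_builder_threads=8)  # non-key change
config_c = dict(config_a, random_seed=4321)  # key change

_, hash_a = describe_and_hash(config_a, key_attributes)
_, hash_b = describe_and_hash(config_b, key_attributes)
_, hash_c = describe_and_hash(config_c, key_attributes)
```

Because only key attributes enter the hash, `config_a` and `config_b` would share cached resources, while `config_c` would not.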

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset._query_document_sample_shuffle_indices(
    idx: int
) -> typing.Tuple[numpy.ndarray, numpy.ndarray]
```

Get the text (token ids) and document ids for a given index

**Parameters:**

`idx`: The index into the dataset

**Returns:** `Tuple[numpy.ndarray, numpy.ndarray]`

Tuple\[numpy.ndarray, numpy.ndarray]: The text ids and document ids

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.build_low_level_dataset(
    dataset_path: str,
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
) -> nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
```

staticmethod

Abstract method implementation

**Parameters:**

`dataset_path`: The real path prefix to the IndexedDataset .bin and .idx files

`config`: The config

**Returns:** `IndexedDataset`

The underlying IndexedDataset

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset.numel_low_level_dataset(
    low_level_dataset: nemo_automodel.components.datasets.llm.megatron.indexed_dataset.IndexedDataset
) -> int
```

staticmethod

Abstract method implementation

For GPT, the underlying IndexedDataset should be split by sequence, as opposed to, say,
BERT, which should be split by document

**Parameters:**

`low_level_dataset`: The underlying IndexedDataset

**Returns:** `int`

The number of unique elements in the underlying IndexedDataset

```python
class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig(
    random_seed: int,
    sequence_length: int,
    blend: typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]] = None,
    blend_per_split: typing.Optional[typing.List[typing.Optional[typing.Tuple[typing.List[str], typing.Optional[typing.List[float]]]]]] = None,
    split: typing.Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    path_to_cache: typing.Optional[str] = None,
    mmap_bin_files: bool = True,
    tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    mid_level_dataset_surplus: float = 0.005,
    object_storage_config: typing.Optional[nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig] = None,
    reset_position_ids: typing.Optional[bool] = None,
    reset_attention_mask: typing.Optional[bool] = None,
    eod_mask_loss: typing.Optional[bool] = None,
    create_attention_mask: bool = True,
    drop_last_partial_validation_sequence: bool = True,
    add_extra_token_to_sequence: bool = True
)
```

Dataclass

**Bases:** [BlendedMegatronDatasetConfig](#nemo_automodel-components-datasets-llm-megatron-gpt_dataset-BlendedMegatronDatasetConfig)

Configuration object for Megatron Core GPT datasets

`add_extra_token_to_sequence`: Option to draw sequences with one extra token to ensure the sample input tokens and sample
output tokens are both of the desired sequence length

`create_attention_mask`: Option to enable the attention masks generation. Can be disabled if attention kernel
generates masks by itself.

`drop_last_partial_validation_sequence`: Option to drop the last partial validation sequence

`eod_mask_loss`: Option to enable the EOD mask loss

`reset_attention_mask`: Option to reset the attention mask from the dataset

`reset_position_ids`: Option to reset the position IDs in the dataset at an interval
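The purpose of the extra token can be illustrated with plain lists: drawing `sequence_length + 1` tokens lets the inputs and the shifted next-token labels both have length `sequence_length`. The helper `split_inputs_and_labels` is hypothetical and only demonstrates the shift, not the dataset's actual sample construction.

```python
from typing import List, Tuple


def split_inputs_and_labels(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Shift a window of sequence_length + 1 tokens into inputs and next-token labels."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")
    return tokens[:-1], tokens[1:]


sequence_length = 4
window = [11, 12, 13, 14, 15]  # sequence_length + 1 made-up token ids
inputs, labels = split_inputs_and_labels(window)
```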

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig.__post_init__() -> None
```

Do asserts and set fields post init

```python
class nemo_automodel.components.datasets.llm.megatron.gpt_dataset.Split
```

**Bases:** `enum.Enum`

Dataset split identifiers used by Megatron GPT datasets.

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_document_index(
    documents: numpy.ndarray,
    num_epochs: int,
    numpy_random_state: numpy.random.RandomState,
    separate_final_epoch: bool
) -> numpy.ndarray
```

Build an array with length = num epochs \* num documents

**Parameters:**

`documents`: The subset of exposed document indices

`num_epochs`: The number of epochs

`numpy_random_state`: The NumPy random state

`separate_final_epoch`: Whether to exclude the last epoch from the global shuffle

**Returns:** `numpy.ndarray`

numpy.ndarray: The document index
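The effect of `separate_final_epoch` can be sketched in pure Python. This version substitutes the standard library's `random.Random` for `numpy.random.RandomState` and returns a list rather than an array; `build_document_index_sketch` is a hypothetical name, and the code illustrates the idea rather than reproducing the library function.

```python
import random
from typing import List


def build_document_index_sketch(
    documents: List[int],
    num_epochs: int,
    rng: random.Random,
    separate_final_epoch: bool,
) -> List[int]:
    """Repeat the documents once per epoch and shuffle (sketch)."""
    if not separate_final_epoch or num_epochs == 1:
        # One global shuffle across all epochs.
        index = list(documents) * num_epochs
        rng.shuffle(index)
        return index
    # Shuffle the first num_epochs - 1 epochs together, then the final epoch alone.
    head = build_document_index_sketch(documents, num_epochs - 1, rng, False)
    tail = build_document_index_sketch(documents, 1, rng, False)
    return head + tail


docs = [0, 1, 2, 3]
joint = build_document_index_sketch(docs, 3, random.Random(0), False)
separated = build_document_index_sketch(docs, 3, random.Random(0), True)
```

In `separated`, the last `len(docs)` entries form a permutation of the documents, so the final epoch never mixes with earlier ones.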

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset._build_shuffle_index(
    num_samples: int,
    total_size: int,
    numpy_random_state: numpy.random.RandomState
) -> numpy.ndarray
```

Build the range \[0, size) and shuffle

**Parameters:**

`num_samples`: The size of the first shuffle range \[0, num\_samples)

`total_size`: The size of the entire index. If larger than 'num\_samples', it defines
the second shuffle range \[num\_samples, total\_size)

`numpy_random_state`: The NumPy random state

**Returns:** `numpy.ndarray`

numpy.ndarray: The shuffle index
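The two shuffle ranges described in the parameters can be sketched as follows. As an assumption for portability, the sketch uses `random.Random` in place of `numpy.random.RandomState` and returns a list; `build_shuffle_index_sketch` is a hypothetical name.

```python
import random
from typing import List


def build_shuffle_index_sketch(num_samples: int, total_size: int, rng: random.Random) -> List[int]:
    """Shuffle [0, num_samples) and [num_samples, total_size) independently (sketch)."""
    first = list(range(num_samples))
    rng.shuffle(first)
    if num_samples == total_size:
        return first
    # The second range is shuffled on its own, so indices never cross the boundary.
    last = list(range(num_samples, total_size))
    rng.shuffle(last)
    return first + last


shuffle_index = build_shuffle_index_sketch(6, 10, random.Random(0))
```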

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset._get_ltor_masks_and_position_ids(
    data: torch.Tensor,
    eod_token: int,
    reset_position_ids: bool,
    reset_attention_mask: bool,
    eod_mask_loss: bool,
    create_attention_mask: bool
)
```

Build masks and position id for left to right model.

**Parameters:**

`data`: The data tensor that holds the tokens from the dataset

`eod_token`: ID of the token that is considered the EOD

`reset_position_ids`: Switch to reset the document position IDs

`reset_attention_mask`: Switch to reset the attention mask

`eod_mask_loss`: Switch to enable the EOD mask loss

`create_attention_mask`: Switch to enable the attention masks generation. Can be
disabled if attention kernel generates masks by itself.

**Returns:**

torch.Tensor: Attention mask needed to be used for Attention
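To illustrate two of the switches above, the following simplified sketch computes a loss mask and position IDs over a plain list of token IDs. It omits the attention mask entirely and uses lists instead of `torch.Tensor`. It assumes that positions restart at the token following an EOD and that the loss is masked at EOD positions; the helper name `loss_mask_and_position_ids` is hypothetical.

```python
from typing import List, Tuple


def loss_mask_and_position_ids(
    tokens: List[int],
    eod_token: int,
    reset_position_ids: bool,
    eod_mask_loss: bool,
) -> Tuple[List[float], List[int]]:
    """Compute a per-token loss mask and position IDs (simplified sketch)."""
    # Zero out the loss wherever the token is EOD, if requested.
    loss_mask = [0.0 if (eod_mask_loss and token == eod_token) else 1.0 for token in tokens]

    position_ids = []
    position = 0
    for token in tokens:
        position_ids.append(position)
        position += 1
        # Restart counting for the document that begins after an EOD token.
        if reset_position_ids and token == eod_token:
            position = 0
    return loss_mask, position_ids


tokens = [5, 6, 0, 7, 8]  # made-up ids, with 0 standing in for the EOD token
loss_mask, position_ids = loss_mask_and_position_ids(tokens, 0, True, True)
```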

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.convert_split_vector_to_split_matrix(
    vector_a: typing.List[float],
    vector_b: typing.Optional[typing.List[float]] = None
) -> typing.List[typing.Optional[typing.Tuple[float, float]]]
```

Build the split matrix from one or optionally two contributing split vectors.

Ex. a standard conversion:

\[0.99, 0.01, 0.0] -> \[(0, 0.99), (0.99, 1.0), None]

Ex. a conversion for Retro when Retro pretraining uses a \[0.99, 0.01, 0.0] split and Retro
preprocessing used a \[0.98, 0.02, 0.0] split:

\[0.99, 0.01, 0.0], \[0.98, 0.02, 0.0] -> \[(0, 0.98), (0.99, 1.0), None]

**Parameters:**

`vector_a`: The primary split vector

`vector_b`: An optional secondary split vector which constrains the
primary split vector. Defaults to None.

**Returns:** `List[Optional[Tuple[float, float]]]`

List\[Tuple\[float, float]]: The split matrix consisting of book-ends of each split in order
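The two examples above are consistent with intersecting the cumulative book-ends of each vector, split by split. The sketch below implements that interpretation; `split_matrix_sketch` is a hypothetical name, and the code is an illustration that matches the documented examples rather than the library's own implementation.

```python
from itertools import accumulate
from typing import List, Optional, Tuple


def split_matrix_sketch(
    vector_a: List[float],
    vector_b: Optional[List[float]] = None,
) -> List[Optional[Tuple[float, float]]]:
    """Intersect the per-split intervals of two split vectors (sketch)."""
    if vector_b is None:
        vector_b = vector_a

    def bookends(vector: List[float]) -> List[Tuple[float, float]]:
        # [0.99, 0.01, 0.0] -> edges [0.0, 0.99, 1.0, 1.0] -> consecutive pairs.
        edges = [0.0] + list(accumulate(vector))
        return list(zip(edges[:-1], edges[1:]))

    matrix: List[Optional[Tuple[float, float]]] = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(bookends(vector_a), bookends(vector_b)):
        lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
        # An empty intersection means the split receives no data.
        matrix.append((lo, hi) if hi > lo else None)
    return matrix


standard = split_matrix_sketch([0.99, 0.01, 0.0])
retro = split_matrix_sketch([0.99, 0.01, 0.0], [0.98, 0.02, 0.0])
```

With the second vector, the train split shrinks to the part both vectors agree on, while the validation split stays within the primary vector's range.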

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.normalize(
    weights: list[float]
) -> list[float]
```

Do non-exponentiated normalization

**Parameters:**

`weights`: The weights

**Returns:** `list[float]`

List\[float]: The normalized weights
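"Non-exponentiated" presumably means dividing each weight by the sum of the weights, as opposed to a softmax-style normalization. A minimal sketch under that assumption follows; `normalize_sketch` is a hypothetical name.

```python
from typing import List


def normalize_sketch(weights: List[float]) -> List[float]:
    """Scale the weights so they sum to 1, without exponentiation (sketch)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [weight / total for weight in weights]


normalized = normalize_sketch([99.0, 1.0, 0.0])
```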

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.parse_and_normalize_split(
    split: str
) -> typing.List[float]
```

Parse the dataset split ratios from a string

**Parameters:**

`split`: The train valid test split string e.g. "99,1,0"

**Returns:** `List[float]`

List\[float]: The train valid test split ratios e.g. \[0.99, 0.01, 0.0]
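The parsing step can be sketched with a regular expression followed by a sum normalization. Padding missing entries with zeros up to three splits is an assumption made for this example, and `parse_split_sketch` is a hypothetical name.

```python
import re
from typing import List


def parse_split_sketch(split: str, num_splits: int = 3) -> List[float]:
    """Parse a string such as "99,1,0" into normalized split ratios (sketch)."""
    values = [float(value) for value in re.findall(r"[.0-9]+", split)]
    # Assumption: splits that are not listed receive a weight of zero.
    values += [0.0] * (num_splits - len(values))
    if len(values) != num_splits:
        raise ValueError(f"expected at most {num_splits} split weights")
    total = sum(values)
    if total <= 0:
        raise ValueError("split weights must have a positive sum")
    return [value / total for value in values]


ratios = parse_split_sketch("99,1,0")
train_only = parse_split_sketch("1")
```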

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset._PAD_TOKEN_ID = -100
```

```python
nemo_automodel.components.datasets.llm.megatron.gpt_dataset.logger = logging.getLogger(__name__)
```