# nemo_automodel.components.datasets.llm.megatron.builder

## Module Contents

### Classes

| Name                                                                                                                      | Description                                                      |
| ------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| [`BlendedDataset`](#nemo_automodel-components-datasets-llm-megatron-builder-BlendedDataset)                               | Conjugating class for a set of MegatronDataset instances         |
| [`BlendedMegatronDatasetBuilder`](#nemo_automodel-components-datasets-llm-megatron-builder-BlendedMegatronDatasetBuilder) | Builder class for the BlendedDataset and MegatronDataset classes |

### Functions

| Name                                                                                                                          | Description                                                                           |
| ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| [`_get_size_per_split_per_dataset`](#nemo_automodel-components-datasets-llm-megatron-builder-_get_size_per_split_per_dataset) | Determine the contribution of the MegatronDataset splits to the BlendedDataset splits |

### Data

[`_VERBOSE`](#nemo_automodel-components-datasets-llm-megatron-builder-_VERBOSE)

[`logger`](#nemo_automodel-components-datasets-llm-megatron-builder-logger)

### API

```python
class nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset(
    datasets: typing.List[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset],
    weights: typing.List[typing.Union[int, float]],
    size: typing.Optional[int],
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
)
```

**Bases:** `Dataset`

Conjugating class for a set of MegatronDataset instances

**Parameters:**

* `datasets`: The MegatronDataset instances to blend
* `weights`: The weights that determine the dataset blend ratios
* `size`: The number of samples to draw from the blend. If None, for each dataset index idx, draw exactly weights\[idx] samples from datasets\[idx].
* `config`: The config

**Raises:**

* `RuntimeError`: When the dataset has fewer or more samples than 'size' post-initialization

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.Union[int, numpy.ndarray]]
```

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset._build_indices() -> typing.Tuple[numpy.ndarray, numpy.ndarray]
```

Build and optionally cache the dataset index and the dataset sample index

The dataset index is a 1-D mapping which determines the dataset to query. The dataset
sample index is a 1-D mapping which determines the sample to request from the queried
dataset.

**Returns:** `Tuple[numpy.ndarray, numpy.ndarray]`

Tuple\[numpy.ndarray, numpy.ndarray]: The dataset index and the dataset sample index
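
To make the two mappings concrete, here is a minimal sketch in plain Python. It is not the library's implementation: it assumes a greedy blending rule (each position goes to the dataset that is furthest behind its weighted share) and uses lists where the real method returns numpy arrays. The names `build_blend_indices` and `lookup` are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_blend_indices(weights: Sequence[float], size: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: assign each of `size` blend positions to a dataset."""
    total = float(sum(weights))
    shares = [w / total for w in weights]  # normalize the blend ratios
    drawn = [0] * len(shares)  # samples taken from each dataset so far
    dataset_index: List[int] = []
    dataset_sample_index: List[int] = []
    for i in range(size):
        # Assumed rule: pick the dataset that lags furthest behind its target share.
        lags = [share * max(i, 1) - count for share, count in zip(shares, drawn)]
        chosen = lags.index(max(lags))
        dataset_index.append(chosen)  # which dataset to query
        dataset_sample_index.append(drawn[chosen])  # which sample to request from it
        drawn[chosen] += 1
    return dataset_index, dataset_sample_index


def lookup(datasets, dataset_index, dataset_sample_index, idx):
    """Resolve a blend position to a (dataset, sample) pair and fetch the sample."""
    return datasets[dataset_index[idx]][dataset_sample_index[idx]]


# Two toy "datasets" blended 25% / 75% into 8 positions.
datasets = [["a0", "a1"], ["b0", "b1", "b2", "b3", "b4", "b5"]]
dataset_index, dataset_sample_index = build_blend_indices([0.25, 0.75], size=8)
blend = [lookup(datasets, dataset_index, dataset_sample_index, i) for i in range(8)]
```

Both mappings have one entry per blend position, so fetching an item only needs two array reads followed by a lookup in the selected dataset.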

```python
class nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder(
    sizes: list[int],
    is_built_on_rank: typing.Callable,
    config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig,
    enabled_splits: typing.Optional[list[str]] = None
)
```

Builder class for the BlendedDataset and MegatronDataset classes

**Parameters:**

* `sizes`: The minimum total number of samples to draw, or None, per split
* `is_built_on_rank`: A callable which returns True if the dataset should be built on the current rank and False otherwise. It should be Megatron Core parallelism aware, i.e. global rank, local group rank, and virtual rank may inform its return value.
* `config`: The config object which informs dataset creation
* `enabled_splits`: The splits to build. If None, all splits are enabled.

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_blended_dataset_splits() -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]
```

Build all dataset splits according to the provided blend(s)

See the BlendedMegatronDatasetBuilder.build alias for more information.

**Returns:** `List[Optional[GPTDataset]]`

List\[Optional\[GPTDataset]]: A list containing a dataset instance (or None) per
split

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_megatron_dataset_splits(
    dataset_path: typing.Optional[str],
    split: typing.List[float],
    sizes: typing.List[int],
    synchronize_ranks: bool = True
) -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]
```

Build each MidLevelDataset split from a single LowLevelDataset

**Parameters:**

* `dataset_path`: The path on disk which defines the underlying LowLevelDataset, or None for mock dataset classes
* `split`: The dataset split matrix
* `sizes`: The total number of samples to draw from each split
* `synchronize_ranks`: Whether to call a barrier so that rank 0 builds first and the other ranks build afterwards. Set to False when this behavior is enforced at a higher level.

**Returns:** `List[Optional[GPTDataset]]`

List\[Optional\[GPTDataset]]: The GPTDataset (or None) per split

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._build_megatron_datasets_parallel(
    prefixes: typing.List[str],
    split: typing.List[float],
    sizes_per_dataset: typing.List[typing.List[int]]
) -> typing.List[typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]]
```

Build the megatron datasets for a list of prefixes in parallel

**Parameters:**

* `prefixes`: The list of prefix strings
* `split`: The dataset split ratios (must sum to 1.00)
* `sizes_per_dataset`: The number of samples to request

**Returns:** `List[List[Optional[GPTDataset]]]`

List\[List\[Optional\[GPTDataset]]]: For each split, have a list of

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._is_enabled_index(
    idx: int
) -> bool
```

Return True if a given split index should be built.

If no enabled\_splits were provided, all splits are enabled.

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder._masked_split_matrix(
    split_matrix: typing.List[typing.Optional[tuple]]
) -> typing.List[typing.Optional[tuple]]
```

Mask splits that are not enabled by setting their bookends to None.

This preserves the original split ratios while skipping construction for disabled splits.
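
The masking described above can be sketched as follows. This is an illustration under assumptions, not the library's code: it assumes the splits are ordered train, valid, test and that each split matrix entry is a `(start, end)` pair of bookends. The free functions and the `SPLIT_NAMES` constant are hypothetical stand-ins for the private methods.

```python
from typing import List, Optional, Tuple

# Assumption: split indices 0, 1, 2 correspond to these names.
SPLIT_NAMES = ["train", "valid", "test"]

Bookends = Optional[Tuple[float, float]]


def is_enabled_index(idx: int, enabled_splits: Optional[List[str]]) -> bool:
    """Return True if the split at `idx` should be built."""
    if enabled_splits is None:
        return True  # no filter provided, so every split is enabled
    return SPLIT_NAMES[idx] in enabled_splits


def masked_split_matrix(
    split_matrix: List[Bookends], enabled_splits: Optional[List[str]]
) -> List[Bookends]:
    """Replace the bookends of disabled splits with None, leaving the rest untouched."""
    return [
        bookends if is_enabled_index(idx, enabled_splits) else None
        for idx, bookends in enumerate(split_matrix)
    ]


split_matrix = [(0.0, 0.9), (0.9, 0.95), (0.95, 1.0)]
masked = masked_split_matrix(split_matrix, ["train", "test"])
# masked == [(0.0, 0.9), None, (0.95, 1.0)]
```

Because the surviving bookends are unchanged, the enabled splits still cover the same portions of the data as they would with every split enabled.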

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder.build() -> typing.List[typing.Optional[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset]]
```

Build all dataset splits according to the provided blend(s)

This method is distributed-aware and must be called on all ranks.

The dataset splits returned can vary according to the config. Supply config.blend and
config.split to build BlendedDataset and/or MegatronDataset splits from the same
distribution. Supply config.blend\_per\_split to build BlendedDataset and/or MegatronDataset
splits from separate distributions. In either case, for each split, handle the following
cases:

(1) The split is None

* do nothing

(2) The split has one contributing dataset, and...

(a) 'size' is not None

* Build a mid-level dataset with low-level dataset sampling in proportion to the
  size

(b) 'size' is None

* Build mid-level datasets with no excess low-level dataset sampling

(3) The split has multiple contributing datasets, and...

(a) 'weights' is not None and 'size' is not None

* Build mid-level datasets with low-level dataset sampling in proportion to their
  weights and the size
* Build a top-level dataset of length marginally greater than 'size' with mid-level
  dataset sampling in proportion to their weights and the size

(b) 'weights' is not None and 'size' is None

* Error

(c) 'weights' is None and 'size' is not None

* Build mid-level datasets with no excess low-level dataset sampling
* Build a top-level dataset of length 'size' (capped at the sum of the mid-level
  dataset lengths) with mid-level dataset sampling in proportion to their lengths
  and the size

(d) 'weights' is None and 'size' is None

* Build mid-level datasets with no excess low-level dataset sampling
* Build a top-level dataset with no excess mid-level dataset sampling

**Returns:** `List[Optional[GPTDataset]]`

List\[Optional\[GPTDataset]]: A list containing a dataset instance (or None) per
split
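
The case analysis for `build` can be summarized as a small decision function. The sketch below only restates the cases listed above and builds nothing. The function name, the use of `num_datasets == 0` to represent a split that is None, and the choice of `ValueError` for case (3b) are assumptions made for illustration.

```python
from typing import Optional, Sequence


def describe_split_plan(
    num_datasets: int, weights: Optional[Sequence[float]], size: Optional[int]
) -> str:
    """Hypothetical helper: name the documented case that applies to one split."""
    if num_datasets == 0:
        return "(1) split is None: do nothing"
    if num_datasets == 1:
        if size is not None:
            return "(2a) one mid-level dataset, sampled in proportion to size"
        return "(2b) one mid-level dataset, no excess sampling"
    if weights is not None and size is not None:
        return "(3a) mid-level datasets by weights and size, blended top-level dataset"
    if weights is not None:
        # The documentation only says "Error"; the exception type is an assumption.
        raise ValueError("(3b) weights were given without a size")
    if size is not None:
        return "(3c) mid-level datasets without excess, top-level dataset capped at size"
    return "(3d) mid-level and top-level datasets without excess sampling"


plan = describe_split_plan(num_datasets=2, weights=[0.3, 0.7], size=1000)
```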

```python
nemo_automodel.components.datasets.llm.megatron.builder.BlendedMegatronDatasetBuilder.build_generic_dataset(
    is_built_on_rank: typing.Callable,
    synchronize_ranks: bool,
    args: typing.Any = ()
) -> typing.Optional[typing.Union[nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDataset | nemo_automodel.components.datasets.llm.megatron.builder.BlendedDataset, typing.Iterable]]
```

staticmethod

Build the GPTDataset or BlendedDataset

Return None if and only if the underlying dataset class is not built on the current rank
and torch.distributed is initialized.

**Parameters:**

* The GPTDataset or BlendedDataset class to be built. In special cases, e.g. when we are building the low level dataset for a RawMegatronDataset instance, we can accept a Callable which returns an Iterable.
* `synchronize_ranks`: Whether to call barrier for rank-0 / barrier / other-ranks behavior. Set to False when we enforce this behavior at higher level.
* `args`: The positional arguments used to build the provided GPTDataset or BlendedDataset class

**Returns:** `Optional[Union[GPTDataset | BlendedDataset, Iterable]]`

Optional\[Union\[GPTDataset | BlendedDataset, Iterable]]: The GPTDataset or BlendedDataset instantiation, the
Iterable instantiation, or None

**Raises:**

* `Exception`: When the dataset constructor raises an OSError
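
The rank-0 / barrier / other-ranks behavior can be pictured with a sketch that runs without a distributed setup. This is an assumption-laden illustration, not the library's code: `rank` and `barrier` are hypothetical injected stand-ins for the values that `torch.distributed` would provide, and `rank=None` models the case where `torch.distributed` is not initialized.

```python
from typing import Any, Callable, Optional


def build_on_rank_sketch(
    cls: Callable[..., Any],
    is_built_on_rank: Callable[[], bool],
    synchronize_ranks: bool,
    *args: Any,
    rank: Optional[int] = None,
    barrier: Optional[Callable[[], None]] = None,
) -> Optional[Any]:
    """Hypothetical sketch: rank 0 builds first, then the remaining ranks."""
    if rank is None:
        # Not distributed: always build, so None is never returned here.
        return cls(*args)
    dataset = None
    if rank == 0 and is_built_on_rank():
        try:
            dataset = cls(*args)
        except OSError as err:
            # Surface file-system failures from the first build as a generic Exception.
            raise Exception("Failed to build the dataset on rank 0") from err
    if synchronize_ranks and barrier is not None:
        barrier()  # other ranks wait here until rank 0 has finished
    if rank != 0 and is_built_on_rank():
        dataset = cls(*args)
    return dataset


built = build_on_rank_sketch(list, lambda: True, False, "abc")
skipped = build_on_rank_sketch(list, lambda: False, True, "abc", rank=1, barrier=lambda: None)
```

Letting rank 0 build first is useful when construction writes cache files that the other ranks can then read instead of rebuilding.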

```python
nemo_automodel.components.datasets.llm.megatron.builder._get_size_per_split_per_dataset(
    normalized_weights: typing.List[float],
    target_size_per_split: typing.List[int],
    surplus: float = 0.0
) -> typing.List[typing.List[int]]
```

Determine the contribution of the MegatronDataset splits to the BlendedDataset splits

**Parameters:**

* `normalized_weights`: The normalized blend weights, e.g. \[0.3, 0.7]
* `target_size_per_split`: The number of samples to target for each BlendedDataset split
* `surplus`: The sample surplus to build per split per dataset

**Returns:** `List[List[int]]`

List\[List\[int]]: The number of samples to request per MegatronDataset per split
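
As a worked example of the arithmetic, the sketch below computes per-dataset, per-split request sizes from normalized weights. The rounding scheme (round up, then scale by `1 + surplus` and round up again) is an assumption chosen so that the combined request always covers the target. The function name is hypothetical and this is not presented as the library's implementation.

```python
import math
from typing import List


def size_per_split_per_dataset_sketch(
    normalized_weights: List[float],
    target_size_per_split: List[int],
    surplus: float = 0.0,
) -> List[List[int]]:
    """Hypothetical sketch: one row per dataset, one column per split."""
    # The weights are expected to be normalized already.
    assert math.isclose(sum(normalized_weights), 1.0)
    return [
        [
            # Assumed rounding: round up so the request is never short of the target.
            int(math.ceil(math.ceil(target * weight) * (1 + surplus)))
            for target in target_size_per_split
        ]
        for weight in normalized_weights
    ]


sizes = size_per_split_per_dataset_sketch([0.25, 0.75], [1000, 100])
# sizes == [[250, 25], [750, 75]]
padded = size_per_split_per_dataset_sketch([0.25, 0.75], [1000, 100], surplus=0.5)
```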

```python
nemo_automodel.components.datasets.llm.megatron.builder._VERBOSE = False
```

```python
nemo_automodel.components.datasets.llm.megatron.builder.logger = logging.getLogger(__name__)
```