`nemo_automodel.components.datasets.llm.megatron_dataset`

Module Contents

Classes

MegatronPretraining

Build Megatron pretraining datasets and dataloaders.

Functions

`is_number_tryexcept`	Returns True if string is a number.
`is_zipped_list`	Check if the paths are zipped.
`validate_dataset_asset_accessibility`	Validate the accessibility of the dataset assets. Skips local-filesystem checks for S3/MSC paths when object_storage_config is provided.
`get_list_of_files`	Get the list of unique dataset prefixes (full paths without extension) from a glob pattern.
`try_load_blend_from_json`	Load a data blend configuration from a JSON file.

Data

logger

API

nemo_automodel.components.datasets.llm.megatron_dataset.logger: 'getLogger(…)'

class nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining(
    paths: pathlib.Path | List | Dict[str, List],
    seq_length: int = 2048,
    tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    micro_batch_size: int = 4,
    global_batch_size: int = 8,
    create_attention_mask: bool = False,
    seed: int = 1234,
    split: str = '900,50,50',
    index_mapping_dir: Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    num_train_samples: Optional[int] = None,
    num_val_samples: Optional[int] = None,
    num_test_samples: Optional[int] = None,
    trainer_max_steps: Optional[int] = None,
    trainer_val_check_interval: int = 1000,
    trainer_limit_val_batches: Union[int, float] = 1,
    trainer_limit_test_batches: Union[int, float] = 1,
    mmap_bin_files: bool = True,
    splits_to_build: Optional[Union[str, List[str]]] = None,
    object_storage_config: Optional[Union[Dict, nemo_automodel.components.datasets.llm.megatron.indexed_dataset.ObjectStorageConfig]] = None,
)

Build Megatron pretraining datasets and dataloaders.

Initialization

Pretraining dataset class for Megatron-LM datasets.

Parameters:

paths (Path | List | Dict[str, List]) –
Paths of the data distributions. Can be a single path, a list of paths, a dictionary, or a path to a JSON file containing a dictionary.

If a single path (not JSON) or a list of paths is given, those paths are used to generate the train, validation, and test datasets. A list can take either of two forms:
(1) a list of paths, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or
(2) a flattened, zipped list of weights and paths, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"].

If a dictionary is provided (either directly or via a JSON file), it is expected to have the keys 'train', 'validation', and 'test', where each value is either a path or a list of paths as described above. In this case, each split is generated from its own paths. Split name aliases are supported: 'valid', 'val', and 'dev' are normalized to 'validation'. Note that if limit_val_batches <= 1, the entire validation dataset is generated, so weights should not be provided for the validation split.

Example JSON file format (dict-of-splits):
{ "train": ["30", "path/to/dataset1", "70", "path/to/dataset2"], "valid": ["path/to/val_dataset"], "test": ["path/to/test_dataset"] }
Alternatively, the JSON file may contain a single flattened list (Megatron-LM canonical form), which is paired with the split argument to derive per-split ratios:
["30", "path/to/dataset1", "70", "path/to/dataset2"]
seq_length (int) – Sequence length.
tokenizer (Optional[PreTrainedTokenizerBase]) – An instance of a PreTrainedTokenizerBase object.
micro_batch_size (int) – Batch size per GPU.
global_batch_size (int) – Global batch size.
create_attention_mask (bool) – Whether to generate attention masks. Not supported with fused and flash attention.
seed (int) – Seed for generating the GPT dataset.
split (str) – A string of 3 comma-separated integers denoting how much of the distribution to allocate to train, validation, and test sets, respectively. Unused if paths is a dict.
index_mapping_dir (Optional[str]) – Path to a directory to write index mapping files.
num_dataset_builder_threads (int) – The number of threads to use for dataset building.
num_train_samples (Optional[int]) – The number of samples to use for training, defaults to total train steps times global batch size.
num_val_samples (Optional[int]) – The number of samples to use for validation, defaults to total validation steps times global batch size.
num_test_samples (Optional[int]) – The number of samples to use for testing, defaults to total test steps times global batch size.
trainer_max_steps (Optional[int]) – Maximum training steps. If None or -1, uses full dataset for one epoch.
trainer_val_check_interval (int) – Interval for validation checks.
trainer_limit_val_batches (Union[int, float]) – Limit for validation batches.
trainer_limit_test_batches (Union[int, float]) – Limit for test batches.
splits_to_build (Optional[Union[str, List[str]]]) – Splits to build. If None, builds all splits.
object_storage_config (Optional[Union[Dict, ObjectStorageConfig]]) – Configuration for reading .bin/.idx files from S3/MSC. A dict with path_to_idx_cache (required) and bin_chunk_nbytes (optional, default 256 MiB) is also accepted.
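To make the paths and split conventions above concrete, here is a minimal sketch of how a flattened, zipped list and a split string could be interpreted. It is not the library's implementation: the helper names `unzip_blend` and `split_fractions` are hypothetical, and normalizing the values to fractions that sum to 1 is an assumption made for illustration.

```python
from typing import List, Tuple


def unzip_blend(paths: List[str]) -> Tuple[List[float], List[str]]:
    """Split a zipped ["weight", "prefix", ...] list into weights and prefixes.

    Hypothetical helper: weights are normalized so that they sum to 1.
    """
    if len(paths) % 2 != 0:
        raise ValueError("a zipped list needs an even number of entries")
    weights = [float(w) for w in paths[0::2]]  # entries 0, 2, 4, ... are weights
    prefixes = list(paths[1::2])               # entries 1, 3, 5, ... are prefixes
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights], prefixes


def split_fractions(split: str) -> List[float]:
    """Turn a split string such as '900,50,50' into train/validation/test fractions."""
    parts = [float(p) for p in split.split(",")]
    if len(parts) != 3:
        raise ValueError("split must contain exactly 3 comma-separated values")
    total = sum(parts)
    if total <= 0:
        raise ValueError("split values must sum to a positive value")
    return [p / total for p in parts]
```

Under these assumptions, the default split '900,50,50' corresponds to 90% train, 5% validation, and 5% test, and the zipped list ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"] gives the two prefixes relative weights of 0.3 and 0.7.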

build(): Build the datasets using the trainer parameters provided during initialization.

get_dataset(split: str): Get the dataset for a given split.

property gpt_dataset_config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig: Get the GPT dataset configuration.

nemo_automodel.components.datasets.llm.megatron_dataset.is_number_tryexcept(s): Returns True if string is a number.
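The function name suggests a try/except conversion check. A minimal sketch of that technique follows; the name `looks_like_number` is hypothetical, and using `float()` as the conversion is an assumption rather than the library's confirmed behavior.

```python
def looks_like_number(s) -> bool:
    """Return True if `s` can be converted to a float, False otherwise."""
    try:
        float(s)
    except (TypeError, ValueError):
        # TypeError covers non-string inputs such as None;
        # ValueError covers strings such as "path/to/dataset".
        return False
    return True
```

With this approach, a weight string such as "30" counts as a number while a dataset prefix such as "path/to/dataset1" does not.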

nemo_automodel.components.datasets.llm.megatron_dataset.is_zipped_list(paths): Check if the paths are zipped.
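One way to detect a zipped list is to test whether every even-indexed entry is numeric. The sketch below illustrates that idea under stated assumptions: `looks_zipped` and `_is_number` are hypothetical names, and raising `ValueError` on a partially numeric list is an illustrative choice, not the library's documented behavior.

```python
from typing import Sequence


def _is_number(s) -> bool:
    """Hypothetical helper: True if `s` converts to a float."""
    try:
        float(s)
    except (TypeError, ValueError):
        return False
    return True


def looks_zipped(paths: Sequence[str]) -> bool:
    """Return True if `paths` alternates weights and prefixes."""
    candidates = paths[0::2]  # positions where weights would sit
    if not candidates:
        return False
    numeric = [_is_number(c) for c in candidates]
    if any(numeric) and not all(numeric):
        raise ValueError("malformed zipped list: only some weight slots are numeric")
    if numeric[0] and len(paths) % 2 != 0:
        raise ValueError("malformed zipped list: a weight has no matching path")
    return numeric[0]
```

A plain list of prefixes returns False, a well-formed weight/prefix list returns True, and a list that mixes the two forms is rejected instead of being silently misread.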

nemo_automodel.components.datasets.llm.megatron_dataset.validate_dataset_asset_accessibility(paths, object_storage_config=None): Validate the accessibility of the dataset assets. Skips local-filesystem checks for S3/MSC paths when object_storage_config is provided.
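As a rough illustration of this kind of check, the sketch below verifies that each dataset prefix has readable .bin and .idx files on the local filesystem and skips remote prefixes when a storage configuration is supplied. The function name `check_prefixes`, the `s3://` and `msc://` scheme strings, and the exact exceptions raised are assumptions made for this example, not the library's confirmed behavior.

```python
import os
from typing import Iterable, Optional

# Assumed URI schemes for object-storage paths (illustrative only).
REMOTE_SCHEMES = ("s3://", "msc://")


def check_prefixes(prefixes: Iterable[str], object_storage_config: Optional[dict] = None) -> None:
    """Raise if a local dataset prefix lacks a readable .bin or .idx file."""
    for prefix in prefixes:
        if object_storage_config is not None and prefix.startswith(REMOTE_SCHEMES):
            # Remote assets cannot be checked with local filesystem calls.
            continue
        for suffix in (".bin", ".idx"):
            file_path = prefix + suffix
            if not os.path.isfile(file_path):
                raise FileNotFoundError(f"dataset file not found: {file_path}")
            if not os.access(file_path, os.R_OK):
                raise PermissionError(f"dataset file is not readable: {file_path}")
```

Failing early here gives a clearer error than one raised later from inside dataset construction.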

nemo_automodel.components.datasets.llm.megatron_dataset.get_list_of_files(path: str): Get the list of unique dataset prefixes (full paths without extension) from a glob pattern.
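The idea of collapsing matched files into unique prefixes can be sketched with the standard `glob` module. The name `list_dataset_prefixes` is hypothetical, and returning the prefixes in sorted order is an assumption made so the result is deterministic.

```python
import glob
import os
from typing import List


def list_dataset_prefixes(pattern: str) -> List[str]:
    """Expand a glob pattern and return unique file paths with extensions removed."""
    prefixes = set()
    for match in glob.glob(pattern):
        if os.path.isfile(match):
            # "data/shard_00.bin" and "data/shard_00.idx" both map to "data/shard_00".
            prefixes.add(os.path.splitext(match)[0])
    return sorted(prefixes)
```

Because each dataset is stored as a .bin/.idx pair, a pattern that matches both files yields a single prefix per dataset.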

nemo_automodel.components.datasets.llm.megatron_dataset.try_load_blend_from_json(path: Union[str, pathlib.Path]) → Optional[Union[Dict[str, List], List]]

Load a data blend configuration from a JSON file.

Two top-level JSON shapes are accepted:

Dict-of-splits (Automodel native form): keys are split names ('train', 'validation', 'test'); values are path lists. The common aliases 'valid', 'val', and 'dev' are normalized to 'validation'.
Flat list (Megatron-LM canonical form): a single zipped list of alternating weights and dataset prefixes. The caller uses the split= parameter to allocate this blend across the train, validation, and test splits.

Parameters:

path – Path to a JSON file containing the blend configuration.

Returns:

Dictionary or list containing the blend configuration if path is a JSON file. None if path is not a .json file (caller should fall back to interpreting path as a glob or a literal prefix).

Raises:

FileNotFoundError – If the JSON file does not exist.
PermissionError – If the JSON file cannot be read.
ValueError – If the JSON is invalid or is neither a list nor a dict.

Example dict-of-splits JSON: { "train": ["30", "path/to/dataset1", "70", "path/to/dataset2"], "valid": ["path/to/val_dataset"], "test": ["path/to/test_dataset"] }

Example flat-list JSON (Megatron-LM convention, paired with split=): ["30", "path/to/dataset1", "70", "path/to/dataset2"]
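The documented contract (return None for non-.json paths, normalize split aliases, reject other JSON shapes) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `load_blend` and `SPLIT_ALIASES` are hypothetical names, and the error messages are invented for the example.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Union

# Aliases named in the documentation above, mapped to the canonical split name.
SPLIT_ALIASES = {"valid": "validation", "val": "validation", "dev": "validation"}


def load_blend(path: Union[str, Path]) -> Optional[Union[Dict[str, List], List]]:
    """Load a blend from a .json file, or return None for any other path."""
    path = Path(path)
    if path.suffix != ".json":
        return None  # caller falls back to treating the path as a glob or prefix
    if not path.is_file():
        raise FileNotFoundError(f"blend file not found: {path}")
    try:
        # open() raises PermissionError on its own if the file is unreadable.
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(f"invalid JSON in {path}: {e}") from e
    if isinstance(data, list):
        return data  # flat list: weights and prefixes, paired with split=
    if isinstance(data, dict):
        return {SPLIT_ALIASES.get(key, key): value for key, value in data.items()}
    raise ValueError(f"expected a list or dict in {path}, got {type(data).__name__}")
```

Returning None instead of raising for non-JSON paths lets a caller try JSON loading first and then fall back to glob or literal-prefix handling.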
