nemo_automodel.components.datasets.llm.megatron_dataset
Module Contents
Classes
Functions
Data
API
Build Megatron pretraining datasets and dataloaders.
Get the GPT dataset configuration.
Build the datasets using the trainer parameters provided during initialization.
Get the dataset for a given split.
Get the list of unique dataset prefixes (full paths without extension) from a glob pattern.
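As a rough illustration of the prefix lookup described above, the sketch below expands a glob pattern, strips each file's extension, and removes duplicates. The helper name unique_prefixes is hypothetical, and the sketch assumes the usual Megatron layout in which every dataset is stored as a .bin/.idx pair sharing one prefix; the module's own function may filter or order results differently.

```python
import glob
import os


def unique_prefixes(pattern: str) -> list[str]:
    """Hypothetical helper: return sorted, de-duplicated path prefixes.

    Each matched file contributes its full path without the final
    extension, so "data/shard_0.bin" and "data/shard_0.idx" both map
    to the single prefix "data/shard_0".
    """
    prefixes = {os.path.splitext(path)[0] for path in glob.glob(pattern)}
    return sorted(prefixes)
```

Calling unique_prefixes("data/shard_*") on a directory holding shard_0.bin, shard_0.idx, shard_1.bin and shard_1.idx would yield two prefixes, one per shard.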
Returns True if string is a number.
Check if the paths are zipped.
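One plausible way to implement the two checks above is sketched below. The names is_number and looks_zipped are illustrative, and the rule used here (an even-length list with numeric strings at even positions and non-numeric strings at odd positions) is an assumption about what "zipped" means, based on the alternating weight/prefix convention; the module's actual check may be stricter or looser.

```python
from typing import Sequence


def is_number(s: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False


def looks_zipped(paths: Sequence[str]) -> bool:
    """Assumed rule: weights and prefixes strictly alternate.

    Even positions (0, 2, ...) must be numeric weights and odd
    positions (1, 3, ...) must be non-numeric dataset prefixes.
    """
    if len(paths) < 2 or len(paths) % 2 != 0:
        return False
    weights_ok = all(is_number(w) for w in paths[0::2])
    prefixes_ok = not any(is_number(p) for p in paths[1::2])
    return weights_ok and prefixes_ok
```

Under this rule, ["30", "path/to/dataset1", "70", "path/to/dataset2"] counts as zipped, while a plain list of prefixes does not.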
Load a data blend configuration from a JSON file.
Two top-level JSON shapes are accepted:
- Dict-of-splits (Automodel native form): keys are split names (‘train’, ‘valid’, ‘test’); values are path lists. Common aliases ‘valid’ / ‘val’ / ‘dev’ are normalized to ‘validation’.
- Flat list (Megatron-LM canonical form): a single zipped list of alternating weights and dataset prefixes. The caller uses the split= parameter to allocate this blend across train / validation / test splits.
Example flat-list JSON (Megatron-LM convention, paired with split=):
["30", "path/to/dataset1", "70", "path/to/dataset2"]
Parameters:
Path to a JSON file containing the blend configuration.
Returns: Optional[Union[Dict[str, List], List]]
Dictionary or list containing the blend configuration if path is
Raises:
FileNotFoundError: If the JSON file does not exist.
PermissionError: If the JSON file cannot be read.
ValueError: If the JSON is invalid or is neither a list nor a dict.
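The loading behaviour documented above can be approximated with the sketch below. It is not the module's implementation: the function name load_blend_sketch and the alias set are illustrative, the sketch assumes a path is always supplied, and it only mirrors the documented contract (two accepted top-level shapes, alias normalization, and the three exception types).

```python
import json
from pathlib import Path
from typing import Dict, List, Union

# Aliases the documentation says are normalized to "validation".
_VALIDATION_ALIASES = {"valid", "val", "dev"}


def load_blend_sketch(path: str) -> Union[Dict[str, List], List]:
    """Illustrative loader for a blend JSON file (hypothetical name)."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Blend file not found: {path}")

    try:
        # A PermissionError from open() propagates unchanged.
        with file_path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON in blend file {path}: {exc}") from exc

    if isinstance(data, list):
        # Flat Megatron-LM form: returned as-is for use with split=.
        return data
    if isinstance(data, dict):
        # Dict-of-splits form: rename validation aliases, keep other keys.
        return {
            ("validation" if key in _VALIDATION_ALIASES else key): value
            for key, value in data.items()
        }
    raise ValueError(f"Blend JSON must be a list or a dict, got {type(data).__name__}")
```

With this sketch, a file containing {"train": ["a"], "valid": ["b"]} comes back with the keys "train" and "validation", while a flat list is returned unchanged.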
Validate the accessibility of the dataset assets. Skips local-filesystem checks for S3/MSC paths when object_storage_config is provided.