nemo_automodel.components.datasets.llm.megatron_dataset
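Module Contents
Classes
MegatronPretraining | Pretraining dataset class for Megatron-LM datasets.
Functions
is_number_tryexcept | Returns True if string is a number.
is_zipped_list | Check if the paths are zipped.
validate_dataset_asset_accessibility | Validate the accessibility of the dataset assets.
get_list_of_files | Get the list of unique dataset prefixes (full paths without extension) from a glob pattern.
Data
logger
API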
- nemo_automodel.components.datasets.llm.megatron_dataset.logger
‘getLogger(…)’
- class nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining(
    paths: pathlib.Path | List | Dict[str, List],
    seq_length: int = 2048,
    tokenizer: Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
    micro_batch_size: int = 4,
    global_batch_size: int = 8,
    create_attention_mask: bool = False,
    seed: int = 1234,
    split: str = '900,50,50',
    index_mapping_dir: Optional[str] = None,
    num_dataset_builder_threads: int = 1,
    num_train_samples: Optional[int] = None,
    num_val_samples: Optional[int] = None,
    num_test_samples: Optional[int] = None,
    trainer_max_steps: Optional[int] = None,
    trainer_val_check_interval: int = 1000,
    trainer_limit_val_batches: Union[int, float] = 1,
    trainer_limit_test_batches: Union[int, float] = 1,
    mmap_bin_files: bool = True,
    splits_to_build: Optional[Union[str, List[str]]] = None,
)
Initialization
Pretraining dataset class for Megatron-LM datasets.
- Parameters:
paths (Path | List | Dict[str, List]) – Paths of the data distributions. Can be a single path, a list of paths, or a dictionary.
If a single path or a list of paths is given, those paths are used to generate the train, validation, and test datasets. A list can take one of two forms: (1) a plain list of paths, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and paths, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"].
If a dictionary is given, it is expected to have the keys 'train', 'validation', and 'test', where each value is either a path or a list of paths as described above. In this case, each split is generated from its own paths. Note that if limit_val_batches <= 1, the entire validation dataset is generated, so weights should not be provided for the validation split.
seq_length (int) – Sequence length.
tokenizer (Optional[PreTrainedTokenizerBase]) – An instance of a PreTrainedTokenizerBase object.
micro_batch_size (int) – Batch size per GPU.
global_batch_size (int) – Global batch size.
create_attention_mask (bool) – Whether to generate attention masks. Not supported with fused and flash attention.
seed (int) – Seed for generating the GPT dataset.
split (str) – A string of 3 comma-separated integers denoting how much of the distribution to allocate to train, validation, and test sets, respectively. Unused if paths is a dict.
index_mapping_dir (Optional[str]) – Path to a directory to write index mapping files.
num_dataset_builder_threads (int) – The number of threads to use for dataset building.
num_train_samples (Optional[int]) – The number of samples to use for training, defaults to total train steps times global batch size.
num_val_samples (Optional[int]) – The number of samples to use for validation, defaults to total validation steps times global batch size.
num_test_samples (Optional[int]) – The number of samples to use for testing, defaults to total test steps times global batch size.
trainer_max_steps (Optional[int]) – Maximum training steps. If None or -1, uses full dataset for one epoch.
trainer_val_check_interval (int) – Interval for validation checks.
trainer_limit_val_batches (Union[int, float]) – Limit for validation batches.
trainer_limit_test_batches (Union[int, float]) – Limit for test batches.
splits_to_build (Optional[Union[str, List[str]]]) – Splits to build. If None, builds all splits.
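The `split` string and the zipped form of `paths` are both compact encodings of proportions. The following is a minimal sketch of one plausible reading of them, assuming the values are simply normalized by their sum. The helper names `split_fractions` and `unzip_weighted_paths` are hypothetical and are not part of this module; how MegatronPretraining processes these values internally is not stated on this page.

```python
from typing import List, Tuple


def split_fractions(split: str) -> List[float]:
    """Hypothetical helper: turn a string such as '900,50,50' into fractions."""
    parts = [float(x) for x in split.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated values, got {len(parts)}")
    total = sum(parts)
    if total <= 0:
        raise ValueError("split values must sum to a positive number")
    # Order follows the parameter description: train, validation, test.
    return [p / total for p in parts]


def unzip_weighted_paths(zipped: List[str]) -> Tuple[List[float], List[str]]:
    """Hypothetical helper: separate [w1, p1, w2, p2, ...] into weights and paths."""
    if len(zipped) % 2 != 0:
        raise ValueError("a zipped list needs an even number of entries")
    weights = [float(w) for w in zipped[0::2]]  # even positions hold weights
    paths = list(zipped[1::2])                  # odd positions hold path prefixes
    total = sum(weights)
    if weights and total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalization by the sum is an assumption of this sketch.
    return [w / total for w in weights], paths
```

With the default `split` of '900,50,50', this reading allocates 90% of the distribution to training and 5% each to validation and test.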
- build()
Build the datasets using the trainer parameters provided during initialization.
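The parameter descriptions say that num_train_samples defaults to the total number of train steps times the global batch size, and that a trainer_max_steps of None or -1 means one epoch over the full dataset. The sketch below illustrates only that arithmetic. The function `resolve_train_samples` is hypothetical, and returning `None` as the "one full epoch" marker is an assumption of this sketch rather than documented behavior of build().

```python
from typing import Optional


def resolve_train_samples(
    num_train_samples: Optional[int],
    trainer_max_steps: Optional[int],
    global_batch_size: int,
) -> Optional[int]:
    """Hypothetical helper mirroring the documented default for train samples."""
    # An explicit sample count takes precedence over the derived default.
    if num_train_samples is not None:
        return num_train_samples
    # None or -1: use the full dataset for one epoch (signalled here by None).
    if trainer_max_steps is None or trainer_max_steps == -1:
        return None
    # Default: total train steps times global batch size.
    return trainer_max_steps * global_batch_size
```

For example, 1000 steps with the default global_batch_size of 8 would request 8000 training samples under this reading.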
- get_dataset(split: str)
Get the dataset for a given split.
- property gpt_dataset_config: nemo_automodel.components.datasets.llm.megatron.gpt_dataset.GPTDatasetConfig
Get the GPT dataset configuration.
- nemo_automodel.components.datasets.llm.megatron_dataset.is_number_tryexcept(s)
Returns True if string is a number.
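The function name suggests a try/except conversion. A minimal sketch of that technique follows, assuming `float()` is the conversion used; the name `is_number_sketch` is hypothetical and the actual implementation may differ.

```python
def is_number_sketch(s) -> bool:
    """Hypothetical stand-in: True if `s` can be parsed as a float."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers non-string inputs such as None or a Path object.
        return False
```

Note that under this approach strings such as "inf" and "nan" also count as numbers, because `float()` accepts them.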
- nemo_automodel.components.datasets.llm.megatron_dataset.is_zipped_list(paths)
Check if the paths are zipped.
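A "zipped" list here refers to the flattened weight/path form described for the paths parameter, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]. One way such a check could work is sketched below, assuming a list counts as zipped when it has an even, non-zero length and every even-indexed entry parses as a number. The name `looks_zipped` is hypothetical, and the real function may treat edge cases differently.

```python
from typing import Sequence


def looks_zipped(paths: Sequence[str]) -> bool:
    """Hypothetical check for the [weight, path, weight, path, ...] layout."""

    def _is_number(s) -> bool:
        try:
            float(s)
            return True
        except (TypeError, ValueError):
            return False

    # An empty or odd-length list cannot be weight/path pairs.
    if len(paths) == 0 or len(paths) % 2 != 0:
        return False
    # Every even position must hold a weight.
    return all(_is_number(w) for w in paths[0::2])
```

A path prefix that is itself purely numeric would be misread as a weight by this heuristic, which is a limitation of the sketch.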
- nemo_automodel.components.datasets.llm.megatron_dataset.validate_dataset_asset_accessibility(paths)
Validate the accessibility of the dataset assets.
- nemo_automodel.components.datasets.llm.megatron_dataset.get_list_of_files(path: str)
Get the list of unique dataset prefixes (full paths without extension) from a glob pattern.