core.datasets.blended_megatron_dataset_config#
Module Contents#
Classes#
Configuration object for Megatron Core datasets
Functions#
Parse the dataset split ratios from a string
Build the split matrix from one or optionally two contributing split vectors.
Data#
API#
- core.datasets.blended_megatron_dataset_config.logger#
‘getLogger(…)’
- class core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig#
Configuration object for Megatron Core datasets
- random_seed: int#
None
The seed for all RNG during dataset creation.
- sequence_length: int#
None
The sequence length.
- blend: Optional[Tuple[List[str], Optional[List[float]]]]#
None
The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [["dataset-path1", "dataset-path2"], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.
- blend_per_split: Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]]#
None
A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.
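As a hedged illustration of the two mutually exclusive blend forms described above (the paths below are placeholders, not real datasets), the values look like this as plain Python data:

```python
# Hypothetical dataset paths, for illustration only.
# Single blend used for all splits: (prefixes, optional weights).
blend = (["dataset-path1", "dataset-path2"], [0.3, 0.7])

# Or one blend per split (train, valid, test); a None entry means no data
# for that split, and None weights are inferred from dataset lengths.
blend_per_split = [
    (["train-path1", "train-path2"], [0.5, 0.5]),  # train
    (["valid-path"], None),                        # valid: weights inferred
    None,                                          # test: no data
]
```

Exactly one of the two should be supplied; passing both is a configuration error.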
- multiple_validation_sets: Optional[bool]#
None
Whether the validation split should be treated as multiple separate datasets.
- full_validation: Optional[bool]#
None
Whether to run a full epoch of validation each time validation occurs.
- split: Optional[str]#
None
The split string, a comma separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.
- split_matrix: Optional[List[Tuple[float, float]]]#
‘field(…)’
The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.
- num_dataset_builder_threads: int#
1
The number of threads to use for dataset building.
- path_to_cache: Optional[str]#
None
Where all reusable dataset indices are to be cached.
- mmap_bin_files: bool#
True
Whether to mmap the .bin files or use file pointers.
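The trade-off behind this flag can be sketched with a throwaway file and the standard library (an illustration, not the actual Megatron .bin reader): a file pointer copies the requested bytes into process memory, while mmap maps pages lazily and shares them through the OS page cache, which can reduce memory pressure when many workers read the same file.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a .bin file, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(bytes(range(16)))

# File-pointer access: the requested bytes are copied into process memory.
with open(path, "rb") as f:
    f.seek(4)
    via_pointer = f.read(4)

# mmap access: pages are mapped lazily and backed by the OS page cache.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    via_mmap = bytes(m[4:8])

assert via_pointer == via_mmap
```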
- mock: bool#
‘field(…)’
Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.
- tokenizer: Optional[megatron.core.tokenizers.MegatronTokenizerBase]#
None
The MegatronTokenizerBase instance. Required for datasets that do online tokenization.
- mid_level_dataset_surplus: float#
0.005
The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. This value may need to be increased if the top-level dataset oversamples the mid-level dataset(s). This value may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).
- allow_ambiguous_pad_tokens: Optional[bool]#
False
Whether to prevent pad tokens already present in the dataset from being masked out when the pad token incorrectly shares the same id with other special tokens. Treating such tokens as pad tokens results in training instability and divergence. Such a scenario is best resolved by fixing the tokenizer, but leaving this option as False provides a workaround. This argument will have no effect if the tokenizer is correct. However, should the user desire to train on a dataset that intentionally contains pad tokens - while also using an incorrect tokenizer - this option may be set to True. This is typically not recommended.
- __post_init__() None#
Do asserts and set fields post init
- core.datasets.blended_megatron_dataset_config.parse_and_normalize_split(split: str) List[float]#
Parse the dataset split ratios from a string
- Parameters:
split (str) – The train valid test split string e.g. “99,1,0”
- Returns:
The train valid test split ratios e.g. [0.99, 0.01, 0.0]
- Return type:
List[float]
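A minimal sketch of the parsing step, assuming the documented behavior of splitting on commas and normalizing the weights to sum to 1.0 (the actual Megatron implementation may handle additional edge cases such as short split strings):

```python
from typing import List

def parse_and_normalize_split_sketch(split: str) -> List[float]:
    # "99,1,0" -> [99.0, 1.0, 0.0], then normalize so the weights sum to 1.0
    weights = [float(part) for part in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]

print(parse_and_normalize_split_sketch("99,1,0"))  # [0.99, 0.01, 0.0]
```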
- core.datasets.blended_megatron_dataset_config.convert_split_vector_to_split_matrix(
- vector_a: List[float],
- vector_b: Optional[List[float]] = None,
- ) List[Tuple[float, float]]#
Build the split matrix from one or optionally two contributing split vectors.
Ex. a standard conversion:
[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]
Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:
[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]
- Parameters:
vector_a (List[float]) – The primary split vector
vector_b (Optional[List[float]]) – An optional secondary split vector which constrains the primary split vector. Defaults to None.
- Returns:
The split matrix consisting of book-ends of each split in order
- Return type:
List[Tuple[float, float]]
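The two examples above can be reproduced with a short sketch. It assumes (based on the examples, not on the actual Megatron source) that each vector is first expanded into cumulative-sum book-ends, each split of vector_a is then intersected with the corresponding split of vector_b, and zero-width splits collapse to None:

```python
from typing import List, Optional, Tuple

def vector_to_bookends(vector: List[float]) -> List[Tuple[float, float]]:
    # Cumulative sums give the book-ends, e.g.
    # [0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), (1.0, 1.0)]
    expansion = [sum(vector[:i]) for i in range(len(vector) + 1)]
    return list(zip(expansion[:-1], expansion[1:]))

def convert_sketch(
    vector_a: List[float], vector_b: Optional[List[float]] = None
) -> List[Optional[Tuple[float, float]]]:
    bookends = vector_to_bookends(vector_a)
    if vector_b is not None:
        # Constrain each split of vector_a by the matching split of vector_b
        # (interval intersection, per split).
        bookends = [
            (max(lo_a, lo_b), min(hi_a, hi_b))
            for (lo_a, hi_a), (lo_b, hi_b) in zip(bookends, vector_to_bookends(vector_b))
        ]
    # Empty (or inverted) splits collapse to None.
    return [None if lo >= hi else (lo, hi) for lo, hi in bookends]
```

With these definitions, convert_sketch([0.99, 0.01, 0.0]) yields the documented [(0, 0.99), (0.99, 1.0), None] (up to float rounding), and passing the Retro pair of vectors yields [(0, 0.98), (0.99, 1.0), None].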