core.datasets.blended_megatron_dataset_config#

Module Contents#

Classes#

BlendedMegatronDatasetConfig

Configuration object for Megatron Core datasets

Functions#

parse_and_normalize_split

Parse the dataset split ratios from a string

convert_split_vector_to_split_matrix

Build the split matrix from one or optionally two contributing split vectors.

Data#

API#

core.datasets.blended_megatron_dataset_config.logger#

‘getLogger(…)’

class core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig#

Configuration object for Megatron Core datasets

random_seed: int#

None

The seed for all RNG during dataset creation.

sequence_length: int#

None

The sequence length.

blend: Optional[Tuple[List[str], Optional[List[float]]]]#

None

The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights. For example, [["dataset-path1", "dataset-path2"], [0.3, 0.7]]. When the weights are None, they are inferred from the lengths of the contributing datasets. Not to be used with ‘blend_per_split’. Defaults to None.

blend_per_split: Optional[List[Optional[Tuple[List[str], Optional[List[float]]]]]]#

None

A set of blends, as defined above, one for each split distribution. Not to be used with ‘blend’. Defaults to None.
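As a sketch (the dataset prefixes below are hypothetical, for illustration only), a ‘blend’ pairs a list of prefixes with an optional list of weights, while ‘blend_per_split’ supplies one such pair per split:

```python
# Hypothetical dataset prefixes, for illustration only.

# A single blend: two dataset prefixes with explicit weights.
blend = (["dataset-path1", "dataset-path2"], [0.3, 0.7])

# One blend per split (train, valid, test).
# Weights of None are inferred from dataset lengths; None disables a split.
blend_per_split = [
    (["train-path1", "train-path2"], None),  # train: weights inferred
    (["valid-path"], [1.0]),                 # valid: single weighted source
    None,                                    # test: no data
]
```

Only one of the two may be supplied to the constructor; passing both is an error.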

multiple_validation_sets: Optional[bool]#

None

Whether the validation split should be treated as multiple separate datasets.

full_validation: Optional[bool]#

None

Whether to run a full epoch of validation each time validation occurs.

split: Optional[str]#

None

The split string, a comma-separated weighting for the dataset splits when drawing samples from a single distribution. Not to be used with ‘blend_per_split’. Defaults to None.

split_matrix: Optional[List[Tuple[float, float]]]#

‘field(…)’

The split matrix consisting of non-overlapping book-ends of each split in order. For more information, refer to ‘convert_split_vector_to_split_matrix’. Created automatically from ‘split’. Not to be passed in to the constructor.

num_dataset_builder_threads: int#

1

The number of threads to use for dataset building.

path_to_cache: Optional[str]#

None

Where all reusable dataset indices are to be cached.

mmap_bin_files: bool#

True

Whether to mmap the .bin files or use file pointers.

mock: bool#

‘field(…)’

Whether to bypass real data loading and validation in favor of mock data generation. Created automatically from ‘blend’ and ‘blend_per_split’. Not to be passed in to the constructor.

tokenizer: Optional[megatron.core.tokenizers.MegatronTokenizerBase]#

None

The MegatronTokenizerBase instance. Required for datasets that do online tokenization.

mid_level_dataset_surplus: float#

0.005

The sample surplus to build for the mid-level dataset(s). Defaults arbitrarily to 0.005. This value is irrelevant for single-source data blends. It may need to be increased if the top-level dataset oversamples the mid-level dataset(s), and may be set to 0.0 in the future if the top-level dataset is constrained to not oversample the mid-level dataset(s).

allow_ambiguous_pad_tokens: Optional[bool]#

False

Whether to prevent pad tokens already present in the dataset from being masked out when the pad token incorrectly shares an id with other special tokens. Treating such tokens as pad tokens results in training instability and divergence. This scenario is best resolved by fixing the tokenizer, but leaving this option as False provides a workaround. The argument has no effect if the tokenizer is correct. However, a user who intends to train on a dataset that intentionally contains pad tokens, while also using an incorrect tokenizer, may set this option to True. This is typically not recommended.

__post_init__() None#

Perform asserts and set derived fields after initialization.

core.datasets.blended_megatron_dataset_config.parse_and_normalize_split(split: str) List[float]#

Parse the dataset split ratios from a string

Parameters:

split (str) – The train/valid/test split string, e.g. “99,1,0”

Returns:

The train/valid/test split ratios, e.g. [0.99, 0.01, 0.0]

Return type:

List[float]
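The documented behavior can be sketched as follows. This is an illustrative reimplementation of the described parse-and-normalize step, not the library code (the actual implementation performs additional validation of the string):

```python
from typing import List


def parse_and_normalize_split(split: str) -> List[float]:
    """Parse a comma-separated split string and normalize it to sum to 1.0.

    Sketch of the documented behavior, e.g. "99,1,0" -> [0.99, 0.01, 0.0].
    """
    weights = [float(part) for part in split.split(",")]
    assert all(w >= 0.0 for w in weights), "split weights must be non-negative"
    total = sum(weights)
    assert total > 0.0, "split weights must not all be zero"
    # Normalize so the ratios sum to 1.0.
    return [w / total for w in weights]
```

For example, `parse_and_normalize_split("99,1,0")` yields `[0.99, 0.01, 0.0]`.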

core.datasets.blended_megatron_dataset_config.convert_split_vector_to_split_matrix(
vector_a: List[float],
vector_b: Optional[List[float]] = None,
) List[Optional[Tuple[float, float]]]#

Build the split matrix from one or optionally two contributing split vectors.

Ex. a standard conversion:

[0.99, 0.01, 0.0] -> [(0, 0.99), (0.99, 1.0), None]

Ex. a conversion for Retro when Retro pretraining uses a [0.99, 0.01, 0.0] split and Retro preprocessing used a [0.98, 0.02, 0.0] split:

[0.99, 0.01, 0.0], [0.98, 0.02, 0.0] -> [(0, 0.98), (0.99, 1.0), None]

Parameters:
  • vector_a (List[float]) – The primary split vector

  • vector_b (Optional[List[float]]) – An optional secondary split vector which constrains the primary split vector. Defaults to None.

Returns:

The split matrix consisting of book-ends of each split in order

Return type:

List[Tuple[float, float]]
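The conversion can be sketched as below; this is an illustrative reimplementation of the documented behavior (cumulative book-ends per split, optionally constrained by the book-ends of a second vector, with zero-width splits becoming None), not the library code:

```python
from typing import List, Optional, Tuple


def convert_split_vector_to_split_matrix(
    vector_a: List[float],
    vector_b: Optional[List[float]] = None,
) -> List[Optional[Tuple[float, float]]]:
    """Sketch: build per-split (start, end) book-ends from vector_a and,
    when vector_b is given, constrain each interval by the corresponding
    interval of vector_b. Zero-width splits become None.
    """

    def bookends(vector: List[float]) -> List[Tuple[float, float]]:
        # Cumulative sums give the non-overlapping interval boundaries.
        cumulative = [0.0]
        for weight in vector:
            cumulative.append(cumulative[-1] + weight)
        return list(zip(cumulative[:-1], cumulative[1:]))

    ends_b = bookends(vector_b) if vector_b is not None else None
    matrix: List[Optional[Tuple[float, float]]] = []
    for i, (lo, hi) in enumerate(bookends(vector_a)):
        if ends_b is not None:
            # Intersect with the corresponding interval of vector_b.
            lo, hi = max(lo, ends_b[i][0]), min(hi, ends_b[i][1])
        matrix.append((lo, hi) if hi > lo else None)
    return matrix
```

This reproduces both documented examples: a single [0.99, 0.01, 0.0] vector yields [(0, 0.99), (0.99, 1.0), None], and constraining it by [0.98, 0.02, 0.0] yields [(0, 0.98), (0.99, 1.0), None].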