`core.datasets.utils`

Module Contents

`compile_helpers`	Compile C++ helper functions at runtime. Make sure this is invoked on a single process.
`normalize`	Normalize a list of weights without exponentiation.
`get_blend_from_list`	Get the `blended_megatron_dataset_config.BlendedMegatronDatasetConfig` blend from the blend list.

class core.datasets.utils.Split(*args, **kwds)

Bases: enum.Enum

core.datasets.utils.compile_helpers()

Compile C++ helper functions at runtime. Make sure this is invoked on a single process.
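One common way to meet the single-process requirement in a multi-process launch is to gate the call on the process rank. The sketch below is not part of this module. `run_on_rank_zero` is a hypothetical helper, the `RANK` environment variable is assumed to be exported by the launcher (as `torchrun` does), and `compile_fn` stands in for `compile_helpers`. In a real job the other ranks would also need to wait, for example on a barrier, until compilation has finished.

```python
import os
from typing import Callable


def run_on_rank_zero(compile_fn: Callable[[], None]) -> bool:
    """Call compile_fn only in the process whose RANK is 0.

    Hypothetical helper: RANK is assumed to be exported by the launcher;
    a process without RANK is treated as a single-process run.
    Returns True if this process ran compile_fn, False otherwise.
    """
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        return False
    compile_fn()
    return True


if __name__ == "__main__":
    # Stand-in for core.datasets.utils.compile_helpers.
    run_on_rank_zero(lambda: print("compiling helpers once"))
```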

core.datasets.utils.normalize(weights: List[float]) → List[float]

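Normalize the weights without exponentiation: each weight is divided by the sum of all weights, so the result sums to 1. Unlike a softmax, no exponential is applied.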
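As an illustration, here is a minimal plain-Python sketch of this kind of normalization, assuming that "non-exponentiated" means dividing each weight by the total. `normalize_sketch` is a hypothetical name, and the library's own implementation may differ in details such as numeric type or handling of a zero sum.

```python
from typing import List


def normalize_sketch(weights: List[float]) -> List[float]:
    """Scale weights so they sum to 1, without applying exp() first."""
    total = sum(weights)
    if total == 0:
        # Assumption: a zero total is treated as an error in this sketch.
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]


if __name__ == "__main__":
    print(normalize_sketch([1.0, 3.0]))  # [0.25, 0.75]
```

Relative proportions are preserved exactly, which is usually what is wanted for dataset blend weights: a weight of 3 ends up with three times the share of a weight of 1.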

core.datasets.utils.get_blend_from_list(blend: Optional[List[str]]) → Optional[Tuple[List[str], Optional[List[float]]]]

Get the `blended_megatron_dataset_config.BlendedMegatronDatasetConfig` blend from the blend list.

Parameters: blend (Optional[List[str]]) – The blend list, which can be either (1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"].
Returns: The blend, consisting of a list of dataset prefixes and optionally a list of dataset weights, e.g. [["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], [30.0, 70.0]].
Return type: Optional[Tuple[List[str], Optional[List[float]]]]
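The two accepted input forms can be illustrated with a small stand-alone sketch. `parse_blend_sketch` is a hypothetical reimplementation for illustration only. It assumes a list is treated as the weighted form when it has even length and every even-indexed entry parses as a float, which may not match the library's exact rule.

```python
from typing import List, Optional, Tuple

Blend = Tuple[List[str], Optional[List[float]]]


def _as_float(text: str) -> Optional[float]:
    """Return text as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_blend_sketch(blend: Optional[List[str]]) -> Optional[Blend]:
    """Split a blend list into (prefixes, weights); weights may be None."""
    if blend is None:
        return None
    if len(blend) % 2 == 0:
        # Candidate weighted form: weight, prefix, weight, prefix, ...
        weights = [_as_float(entry) for entry in blend[0::2]]
        if weights and all(w is not None for w in weights):
            prefixes = [entry.strip() for entry in blend[1::2]]
            return prefixes, weights
    # Otherwise every entry is a prefix and no weights are given.
    return [entry.strip() for entry in blend], None


if __name__ == "__main__":
    print(parse_blend_sketch(["30", "path/to/dataset_1_prefix",
                              "70", "path/to/dataset_2_prefix"]))
    print(parse_blend_sketch(["path/to/dataset_1_prefix",
                              "path/to/dataset_2_prefix"]))
```

In this sketch the weights are returned as given (30.0 and 70.0) rather than normalized, matching the example in the Returns description above.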