core.datasets.blended_dataset
Module Contents
Classes
BlendedDataset | Conjugating class for a set of MegatronDataset instances
Data
API
- core.datasets.blended_dataset.logger
‘getLogger(…)’
- core.datasets.blended_dataset._VERBOSE
False
- class core.datasets.blended_dataset.BlendedDataset(
    datasets: List[megatron.core.datasets.megatron_dataset.MegatronDataset],
    weights: List[Union[int, float]],
    size: Optional[int],
    config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
  )
Bases: torch.utils.data.Dataset
Conjugating class for a set of MegatronDataset instances
- Parameters:
datasets (List[MegatronDataset]) – The MegatronDataset instances to blend
weights (List[Union[int, float]]) – The weights that determine the dataset blend ratios
size (Optional[int]) – The number of samples to draw from the blend. If None, for each dataset index idx draw exactly weights[idx] samples from datasets[idx].
config (BlendedMegatronDatasetConfig) – The config
- Raises:
RuntimeError – When the dataset has fewer or more samples than ‘size’ post-initialization
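The relationship between weights and size described above can be illustrated with a small sketch. This is not the library's implementation: `expected_counts` is a hypothetical helper, and the proportional scaling used when size is set is an assumption. Only the size=None behaviour (weights taken as exact per-dataset sample counts) comes from the parameter description.

```python
from typing import List, Optional, Union

Number = Union[int, float]


def expected_counts(weights: List[Number], size: Optional[int]) -> List[Number]:
    """Hypothetical helper: samples each dataset would contribute to the blend."""
    if size is None:
        # Documented behaviour: weights[idx] is an exact sample count for datasets[idx].
        return [int(w) for w in weights]
    # Assumption: weights are normalized, then scaled to the requested blend size.
    total = float(sum(weights))
    return [size * w / total for w in weights]


print(expected_counts([5, 2], None))  # weights read as counts
print(expected_counts([3, 1], 8))     # weights read as ratios of an 8-sample blend
```

With size=None the blend length is the sum of the weights; with a size given, the weights only fix the ratio between datasets.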
Initialization
- __len__() → int
- __getitem__(idx: int)
- _build_indices() → Tuple[numpy.ndarray, numpy.ndarray]
Build and optionally cache the dataset index and the dataset sample index.
The dataset index is a 1-D mapping which determines the dataset to query. The dataset sample index is a 1-D mapping which determines the sample to request from the queried dataset.
- Returns:
The dataset index and the dataset sample index
- Return type:
Tuple[numpy.ndarray, numpy.ndarray]
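To make the two mappings concrete, here is a minimal NumPy sketch of how such indices could be built and then used for lookup. It is an illustration under assumptions, not the library's code: `build_blend_indices` and `lookup` are hypothetical names, the greedy "pick the dataset furthest behind its target share" rule is one plausible strategy, and caching is omitted.

```python
from typing import Sequence, Tuple

import numpy as np


def build_blend_indices(
    weights: Sequence[float], size: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical sketch: build a dataset index and a dataset sample index."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize weights to ratios
    dataset_index = np.zeros(size, dtype=np.int16)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    drawn = np.zeros(len(w), dtype=np.int64)  # samples taken from each dataset so far
    for i in range(size):
        # Assumed rule: choose the dataset furthest below its target share.
        errors = w * max(i, 1) - drawn
        d = int(np.argmax(errors))
        dataset_index[i] = d                # which dataset to query
        dataset_sample_index[i] = drawn[d]  # which sample within that dataset
        drawn[d] += 1
    return dataset_index, dataset_sample_index


def lookup(datasets, dataset_index, dataset_sample_index, idx: int):
    """Resolve a blended index to a sample using the two 1-D mappings."""
    return datasets[dataset_index[idx]][dataset_sample_index[idx]]


ds_idx, sample_idx = build_blend_indices([0.75, 0.25], size=4)
toy_datasets = [["a0", "a1", "a2"], ["b0"]]  # plain lists standing in for datasets
print([lookup(toy_datasets, ds_idx, sample_idx, i) for i in range(4)])
```

Both arrays have one entry per blended sample, so a lookup is two array reads followed by one query into the selected dataset.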