bridge.data.utils#

Module Contents#

Functions#

is_dataset_built_on_rank

Determines whether the dataset should be built on the current rank.

pretrain_train_valid_test_datasets_provider

Build pretraining train, validation, and test datasets.

hf_train_valid_test_datasets_provider

Build train, validation, and test datasets from a Hugging Face dataset.

finetuning_train_valid_test_datasets_provider

Build finetuning train, validation, and test datasets.

get_dataset_provider

Get the appropriate dataset provider function based on the config type.

Data#

API#

bridge.data.utils.is_dataset_built_on_rank() → bool#

Determines whether the dataset should be built on the current rank.

Datasets are typically built only on the first and last pipeline stages and the first tensor parallel rank to save memory and avoid redundancy.

Returns:

True if the dataset should be built on the current rank, False otherwise.
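The gating rule described above can be sketched as a pure-Python predicate. The real function queries Megatron's parallel state at runtime and takes no arguments; the explicit rank parameters here are illustrative stand-ins, assuming the "first/last pipeline stage and tensor-parallel rank 0" rule stated in the docstring:

```python
def should_build_dataset(
    pipeline_rank: int,
    pipeline_world_size: int,
    tensor_parallel_rank: int,
) -> bool:
    """Illustrative stand-in for is_dataset_built_on_rank.

    Build only on the first or last pipeline stage, and only on
    tensor-parallel rank 0, to avoid redundant copies of the data.
    """
    on_boundary_stage = pipeline_rank in (0, pipeline_world_size - 1)
    return on_boundary_stage and tensor_parallel_rank == 0


# With 4 pipeline stages and 2-way tensor parallelism:
print(should_build_dataset(0, 4, 0))  # first stage, TP rank 0 → True
print(should_build_dataset(1, 4, 0))  # interior stage → False
print(should_build_dataset(3, 4, 1))  # last stage, but TP rank 1 → False
```

Interior pipeline stages never touch raw input data, so skipping the build there saves host memory without affecting training.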

bridge.data.utils.pretrain_train_valid_test_datasets_provider(
train_val_test_num_samples: list[int],
dataset_config: megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig,
) → tuple[megatron.core.datasets.gpt_dataset.GPTDataset, megatron.core.datasets.gpt_dataset.GPTDataset, megatron.core.datasets.gpt_dataset.GPTDataset]#

Build pretraining train, validation, and test datasets.

Uses BlendedMegatronDatasetBuilder to create GPTDataset or MockGPTDataset instances.

Parameters:
  • train_val_test_num_samples – A list containing the number of samples for train, validation, and test datasets.

  • dataset_config – Configuration object for the blended Megatron dataset.

Returns:

A tuple containing the train, validation, and test datasets.
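All three providers on this page share the same calling convention: a three-element list of per-split sample counts, followed by a config object (and, for the HF and finetuning variants, a tokenizer). The stub below is a minimal sketch of that contract only; `toy_provider` and its return values are hypothetical, not the real builder output:

```python
from typing import Any, Sequence


def toy_provider(
    train_val_test_num_samples: Sequence[int],
    dataset_config: Any,
) -> tuple[Any, Any, Any]:
    """Stand-in with the same signature shape as the real providers."""
    train_n, valid_n, test_n = train_val_test_num_samples
    # The real provider hands these counts to BlendedMegatronDatasetBuilder;
    # here we return labels to show the (train, valid, test) tuple ordering.
    return (f"train[{train_n}]", f"valid[{valid_n}]", f"test[{test_n}]")


train_ds, valid_ds, test_ds = toy_provider([1000, 100, 10], dataset_config=None)
```

The fixed ordering (train, validation, test) in both the input list and the returned tuple is the invariant callers rely on.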

bridge.data.utils.hf_train_valid_test_datasets_provider(
train_val_test_num_samples: list[int],
dataset_config: megatron.bridge.data.builders.hf_dataset.HFDatasetConfig,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
) → tuple[Any, Any, Any]#

Build train, validation, and test datasets from a Hugging Face dataset.

Uses HFDatasetBuilder to create dataset instances.

Parameters:
  • train_val_test_num_samples – A list containing the number of samples for train, validation, and test datasets.

  • dataset_config – Configuration object for the Hugging Face dataset.

  • tokenizer – The MegatronTokenizer instance.

Returns:

A tuple containing the train, validation, and test datasets.

bridge.data.utils.finetuning_train_valid_test_datasets_provider(
train_val_test_num_samples: list[int],
dataset_config: megatron.bridge.training.config.FinetuningDatasetConfig,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
) → tuple[Any, Any, Any]#

Build finetuning train, validation, and test datasets.

Uses FinetuningDatasetBuilder to create dataset instances.

Parameters:
  • train_val_test_num_samples – A list containing the number of samples for train, validation, and test datasets.

  • dataset_config – Configuration object for the finetuning dataset.

  • tokenizer – The MegatronTokenizer instance.

Returns:

A tuple containing the train, validation, and test datasets.

bridge.data.utils._REGISTRY: Dict[Type[Union[megatron.bridge.training.config.FinetuningDatasetConfig, megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig, megatron.bridge.data.builders.hf_dataset.HFDatasetConfig]], Callable]#

Registry mapping each supported dataset configuration type to its corresponding provider function.

bridge.data.utils.get_dataset_provider(
dataset_config: Union[megatron.bridge.training.config.FinetuningDatasetConfig, megatron.core.datasets.blended_megatron_dataset_config.BlendedMegatronDatasetConfig, megatron.bridge.data.builders.hf_dataset.HFDatasetConfig],
) → Callable#

Get the appropriate dataset provider function based on the config type.

Parameters:

dataset_config – The dataset configuration object.

Returns:

The callable dataset provider function corresponding to the config type.
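The dispatch can be sketched as a type-keyed lookup into a registry shaped like `_REGISTRY` above. The config classes and provider stubs below are placeholders, not the real Megatron types, and the exact-type lookup is an assumption of this sketch:

```python
from typing import Callable, Dict, Type


# Placeholder config classes standing in for the real
# FinetuningDatasetConfig / BlendedMegatronDatasetConfig / HFDatasetConfig.
class FinetuningDatasetConfig: ...
class BlendedMegatronDatasetConfig: ...
class HFDatasetConfig: ...


def finetuning_provider(*args): ...
def pretrain_provider(*args): ...
def hf_provider(*args): ...


# Shaped like bridge.data.utils._REGISTRY: config type -> provider.
_REGISTRY: Dict[Type, Callable] = {
    FinetuningDatasetConfig: finetuning_provider,
    BlendedMegatronDatasetConfig: pretrain_provider,
    HFDatasetConfig: hf_provider,
}


def get_dataset_provider(dataset_config) -> Callable:
    # Exact-type lookup: a subclass of a registered config would need
    # its own registry entry under this scheme.
    return _REGISTRY[type(dataset_config)]


provider = get_dataset_provider(HFDatasetConfig())
```

Keeping the mapping in a module-level dict lets new config/provider pairs be registered without touching the dispatch function itself.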