`bridge.recipes.utils.dataset_utils`

Dataset configuration utilities for recipes and training scripts.

Module Contents

Functions

`default_peft_config`	Create the default PEFT configuration for a finetuning recipe.
`_text_hf_dataset_config`	Create an HF-backed text SFT config with optional offline packing.
`default_squad_config`	Create the default SQuAD dataset configuration for finetuning recipes.
`default_tulu3_config`	Create the default Tulu 3 SFT mixture dataset configuration.
`default_openmathinstruct2_config`	Create the default OpenMathInstruct-2 finetuning dataset.
`default_gsm8k_config`	Create the default GSM8K dataset configuration for finetuning recipes.
`default_openmathinstruct2_thinking_config`	Create the thinking/chat variant of the OpenMathInstruct-2 dataset.
`get_blend_fields_from_data_paths`	Common configuration logic for blend, blend_per_split, split dataset config fields.
`_resolve_seq_length`	Use the selected recipe’s model sequence length for a dataset preset.
`_mock_dataset_config`	Build the mock pretraining dataset preset.
`_megatron_indexed_dataset_config`	Build the Megatron indexed pretraining dataset preset.
`_squad_dataset_config`	Build the SQuAD text SFT dataset preset.
`_tulu3_dataset_config`	Build the Tulu 3 chat SFT dataset preset.
`_openmathinstruct2_dataset_config`	Build the OpenMathInstruct-2 prompt-completion preset.
`_openmathinstruct2_thinking_dataset_config`	Build the OpenMathInstruct-2 thinking/chat preset.
`_gsm8k_dataset_config`	Build the GSM8K text SFT dataset preset.
`_local_jsonl_dataset_config`	Build the local prompt-completion JSONL config before path overrides.
`_local_vlm_json_source`	Build an override-ready local JSON source for one VLM split.
`_require_direct_hf_config`	Return the recipe’s direct-HF config or reject an incompatible preset.
`_local_vlm_dataset_config`	Build an override-ready local JSON/JSONL VLM preset.
`_hf_vlm_dataset_config`	Build a named direct-HF VLM dataset preset.
`build_dataset_config`	Build a dataset config from a public preset name.
`dataset_train_mode`	Return the training loop required by a built dataset config.

Data

`_BLEND_TYPE`
`_BLEND_PER_SPLIT_TYPE`
`_SPLIT_TYPE`
`PublicDatasetConfig`
`DatasetPreset`
`DATASET_PRESETS`

API

bridge.recipes.utils.dataset_utils._BLEND_TYPE#: None

bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE#: None

bridge.recipes.utils.dataset_utils._SPLIT_TYPE#: None

bridge.recipes.utils.dataset_utils.default_peft_config( peft_scheme: str | megatron.bridge.peft.base.PEFT | None, **kwargs: Any, ) → megatron.bridge.peft.base.PEFT | None#

Create the default PEFT configuration for a finetuning recipe.

Parameters:

peft_scheme – PEFT scheme ("lora", "dora"), an existing PEFT instance, or None for full finetuning.
**kwargs – Keyword arguments passed to the selected PEFT configuration.

Returns:

A PEFT configuration, or None for full finetuning.

Raises:

ValueError – If peft_scheme is not supported.
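
The three accepted forms of peft_scheme (a scheme name, an existing instance, or None) suggest a simple dispatch. The following is a minimal stand-alone sketch of that pattern, not the library's implementation: `PEFT`, `LoRA`, `DoRA`, the `dim` field, and `resolve_peft` are hypothetical stand-ins, and ignoring keyword arguments when an instance is passed is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class PEFT:
    """Hypothetical stand-in for a PEFT base class."""

    dim: int = 32  # illustrative field only, not a documented default


class LoRA(PEFT):
    """Hypothetical stand-in for a LoRA configuration."""


class DoRA(PEFT):
    """Hypothetical stand-in for a DoRA configuration."""


# Scheme names taken from the parameter description above.
_SCHEMES = {"lora": LoRA, "dora": DoRA}


def resolve_peft(peft_scheme: Union[str, PEFT, None], **kwargs: Any) -> Optional[PEFT]:
    """Return a PEFT config, or None for full finetuning."""
    if peft_scheme is None:
        return None  # full finetuning: no adapter config
    if isinstance(peft_scheme, PEFT):
        return peft_scheme  # assumption: an existing instance is returned unchanged
    if isinstance(peft_scheme, str) and peft_scheme in _SCHEMES:
        return _SCHEMES[peft_scheme](**kwargs)  # kwargs go to the chosen config
    raise ValueError(f"Unsupported PEFT scheme: {peft_scheme!r}")
```

In this sketch, `resolve_peft("lora", dim=8)` builds a new `LoRA` with `dim=8`, while `resolve_peft("adapter")` raises ValueError.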

bridge.recipes.utils.dataset_utils._text_hf_dataset_config( *, seq_length: int, source: megatron.bridge.data.builders.HFDatasetSourceConfig, preprocessing: megatron.bridge.data.builders.SFTPreprocessingConfig, validation_source: megatron.bridge.data.builders.HFDatasetSourceConfig | None = None, test_source: megatron.bridge.data.builders.HFDatasetSourceConfig | None = None, do_validation: bool = True, do_test: bool = False, enable_offline_packing: bool = False, offline_packing_specs: megatron.bridge.data.packing.PackedSequenceSpecs | None = None, dataset_kwargs: dict[str, Any] | None = None, val_proportion: float | None = None, num_workers: int = 2, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Create an HF-backed text SFT config with optional offline packing.

bridge.recipes.utils.dataset_utils.default_squad_config( seq_length: int, enable_offline_packing: bool = True, pad_seq_to_mult: int = 1, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#

Create the default SQuAD dataset configuration for finetuning recipes.

Parameters:

seq_length – Sequence length for the dataset.
enable_offline_packing – Whether to enable offline packed-sequence preparation.
pad_seq_to_mult – Multiple to pad each sequence to when packing.

Returns:

A dataset configuration for SQuAD finetuning.
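
To make pad_seq_to_mult concrete, the sketch below rounds a sequence length up to the next multiple. `round_up_to_multiple` is a hypothetical helper written for illustration; how the packing code applies the padding internally is not documented on this page.

```python
def round_up_to_multiple(length: int, multiple: int) -> int:
    """Round `length` up to the nearest multiple of `multiple`."""
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    # Ceiling division using integers only, then scale back up.
    return -(-length // multiple) * multiple


print(round_up_to_multiple(1000, 64))  # 1024
print(round_up_to_multiple(1024, 64))  # 1024 (already aligned)
print(round_up_to_multiple(1000, 1))   # 1000 (a multiple of 1 adds no padding)
```

With the default pad_seq_to_mult of 1, lengths are therefore unchanged in this model.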

bridge.recipes.utils.dataset_utils.default_tulu3_config( seq_length: int = 4096, enable_offline_packing: bool = False, pad_seq_to_mult: int = 1, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#

Create the default Tulu 3 SFT mixture dataset configuration.

Parameters:

seq_length – Maximum sequence length.
enable_offline_packing – Whether to enable offline text SFT packing.
pad_seq_to_mult – Sequence-length multiple used by offline packing.

Returns:

A chat SFT configuration for allenai/tulu-3-sft-mixture.

bridge.recipes.utils.dataset_utils.default_openmathinstruct2_config( seq_length: int = 4096, enable_offline_packing: bool = False, pad_seq_to_mult: int = 1, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#

Create the default OpenMathInstruct-2 finetuning dataset.

Parameters:

seq_length – Maximum sequence length.
enable_offline_packing – Whether to enable offline text SFT packing.
pad_seq_to_mult – Sequence-length multiple used by offline packing.

Returns:

An OpenMathInstruct-2 dataset configuration.

bridge.recipes.utils.dataset_utils.default_gsm8k_config( seq_length: int = 2048, enable_offline_packing: bool = False, pad_seq_to_mult: int = 1, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#

Create the default GSM8K dataset configuration for finetuning recipes.

Parameters:

seq_length – Maximum sequence length.
enable_offline_packing – Whether to enable offline text SFT packing.
pad_seq_to_mult – Sequence-length multiple used by offline packing.

Returns:

A GSM8K dataset configuration.

bridge.recipes.utils.dataset_utils.default_openmathinstruct2_thinking_config( seq_length: int = 4096, enable_offline_packing: bool = False, pad_seq_to_mult: int = 1, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#

Create the thinking/chat variant of the OpenMathInstruct-2 dataset.

Parameters:

seq_length – Maximum sequence length.
enable_offline_packing – Whether to enable offline text SFT packing.
pad_seq_to_mult – Sequence-length multiple used by offline packing.

Returns:

An OpenMathInstruct-2 thinking dataset configuration.

bridge.recipes.utils.dataset_utils.get_blend_fields_from_data_paths( data_paths: Optional[List[str]] = None, data_args_path: Optional[str] = None, train_data_path: Optional[List[str]] = None, valid_data_path: Optional[List[str]] = None, test_data_path: Optional[List[str]] = None, per_split_data_args_path: Optional[str] = None, mock: bool = False, ) → Tuple[bridge.recipes.utils.dataset_utils._BLEND_TYPE, bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE, bridge.recipes.utils.dataset_utils._SPLIT_TYPE]#

Common configuration logic for blend, blend_per_split, split dataset config fields.

Handles mock and real data. If no path to data is provided, mock data will be used. Prioritizes data_paths over split data paths.

For all of data_paths, train_data_path, valid_data_path, and test_data_path, two formats are accepted:

(1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"]
(2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]

Parameters:

data_paths (Optional[List[str]]) – List of paths to dataset files.
data_args_path (Optional[str]) – Path to file containing data arguments.
train_data_path (Optional[List[str]]) – List of training data paths.
valid_data_path (Optional[List[str]]) – List of validation data paths.
test_data_path (Optional[List[str]]) – List of test data paths.
per_split_data_args_path (Optional[str]) – Path to JSON file with per-split data configuration.
mock (bool) – Whether to use mock data. If True, ignores data_paths.

Returns:

A tuple (blend, blend_per_split, split), the corresponding fields to be passed to GPTDatasetConfig.
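
The two accepted path formats and the documented priority order can be sketched in plain Python. This is an illustration under stated assumptions, not the function's actual code: `split_weights_and_prefixes` and `choose_source` are hypothetical helpers, the "every other element parses as a number" test is an assumed heuristic, treating any single split path as sufficient is a simplification, and the two `*_args_path` parameters are left out because their precedence is not described here.

```python
from typing import List, Optional, Tuple


def split_weights_and_prefixes(paths: List[str]) -> Tuple[List[str], Optional[List[float]]]:
    """Separate a path list into prefixes and optional weights.

    Assumed heuristic: an even-length list whose 1st, 3rd, ... elements all
    parse as numbers is a zipped [weight, prefix, ...] list; anything else is
    a plain list of prefixes.
    """
    if paths and len(paths) % 2 == 0:
        try:
            weights = [float(w) for w in paths[0::2]]
        except ValueError:
            weights = None  # a non-numeric entry means plain prefixes
        if weights is not None:
            return list(paths[1::2]), weights
    return list(paths), None


def choose_source(
    data_paths: Optional[List[str]] = None,
    train_data_path: Optional[List[str]] = None,
    valid_data_path: Optional[List[str]] = None,
    test_data_path: Optional[List[str]] = None,
    mock: bool = False,
) -> str:
    """Name the data source that wins under the documented priority."""
    if mock:
        return "mock"  # mock=True ignores data_paths
    if data_paths:
        return "data_paths"  # data_paths take priority over split paths
    if train_data_path or valid_data_path or test_data_path:
        return "per_split"  # simplification: any one split path counts
    return "mock"  # no paths at all falls back to mock data


print(split_weights_and_prefixes(["30", "path/a", "70", "path/b"]))
# (['path/a', 'path/b'], [30.0, 70.0])
print(split_weights_and_prefixes(["path/a", "path/b"]))
# (['path/a', 'path/b'], None)
```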

bridge.recipes.utils.dataset_utils.PublicDatasetConfig: TypeAlias#: None

bridge.recipes.utils.dataset_utils.DatasetPreset: TypeAlias#: None

bridge.recipes.utils.dataset_utils._resolve_seq_length( config: megatron.bridge.training.config.ConfigContainer, ) → int#: Use the selected recipe’s model sequence length for a dataset preset.

bridge.recipes.utils.dataset_utils._mock_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.training.config.MockGPTDatasetConfig#: Build the mock pretraining dataset preset.

bridge.recipes.utils.dataset_utils._megatron_indexed_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.training.config.GPTDatasetConfig#: Build the Megatron indexed pretraining dataset preset.

bridge.recipes.utils.dataset_utils._squad_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the SQuAD text SFT dataset preset.

bridge.recipes.utils.dataset_utils._tulu3_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the Tulu 3 chat SFT dataset preset.

bridge.recipes.utils.dataset_utils._openmathinstruct2_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the OpenMathInstruct-2 prompt-completion preset.

bridge.recipes.utils.dataset_utils._openmathinstruct2_thinking_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the OpenMathInstruct-2 thinking/chat preset.

bridge.recipes.utils.dataset_utils._gsm8k_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the GSM8K text SFT dataset preset.

bridge.recipes.utils.dataset_utils._local_jsonl_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.GPTSFTDatasetConfig#: Build the local prompt-completion JSONL config before path overrides.

bridge.recipes.utils.dataset_utils._local_vlm_json_source( split: str, ) → megatron.bridge.data.builders.HFDatasetSourceConfig#: Build an override-ready local JSON source for one VLM split.

bridge.recipes.utils.dataset_utils._require_direct_hf_config( config: megatron.bridge.training.config.ConfigContainer, dataset_name: str, ) → megatron.bridge.data.builders.DirectHFSFTDatasetConfig#: Return the recipe’s direct-HF config or reject an incompatible preset.

bridge.recipes.utils.dataset_utils._local_vlm_dataset_config( config: megatron.bridge.training.config.ConfigContainer, ) → megatron.bridge.data.builders.DirectHFSFTDatasetConfig#: Build an override-ready local JSON/JSONL VLM preset.

bridge.recipes.utils.dataset_utils._hf_vlm_dataset_config( config: megatron.bridge.training.config.ConfigContainer, *, public_name: str, hf_dataset_name: str, train_only: bool = False, supports_test: bool = False, adapter_kwargs: dict[str, object] | None = None, ) → megatron.bridge.data.builders.DirectHFSFTDatasetConfig#: Build a named direct-HF VLM dataset preset.

bridge.recipes.utils.dataset_utils.DATASET_PRESETS: dict[str, bridge.recipes.utils.dataset_utils.DatasetPreset]#: None

bridge.recipes.utils.dataset_utils.build_dataset_config( config: megatron.bridge.training.config.ConfigContainer, dataset_name: str, ) → bridge.recipes.utils.dataset_utils.PublicDatasetConfig#

Build a dataset config from a public preset name.

Parameters:

config – Recipe config supplying model and model-specific dataset defaults.
dataset_name – Public dataset preset or local source selector.

Returns:

A new dataset config. Callers may then apply ordinary dataset.* ConfigContainer overrides before validation and runtime builder selection.

Raises:

ValueError – If the name is unknown or the recipe’s dataset config is incompatible.
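
A name-to-builder registry is one common way to implement this kind of preset lookup. The sketch below shows the pattern with stand-in types only. It assumes each preset is a callable taking the recipe config, which this page does not confirm (DatasetPreset is documented only as a type alias). `RecipeConfig`, `DatasetConfig`, `PRESETS`, `build`, and the preset names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RecipeConfig:
    """Hypothetical stand-in for the recipe's ConfigContainer."""

    seq_length: int


@dataclass
class DatasetConfig:
    """Hypothetical stand-in for a built dataset config."""

    name: str
    seq_length: int
    num_workers: int = 2


# Assumption: a preset is a function from recipe config to dataset config.
Preset = Callable[[RecipeConfig], DatasetConfig]

PRESETS: Dict[str, Preset] = {
    "squad": lambda cfg: DatasetConfig("squad", cfg.seq_length),
    "gsm8k": lambda cfg: DatasetConfig("gsm8k", cfg.seq_length),
}


def build(config: RecipeConfig, dataset_name: str) -> DatasetConfig:
    """Look up a preset by name and build a fresh dataset config."""
    try:
        preset = PRESETS[dataset_name]
    except KeyError:
        known = ", ".join(sorted(PRESETS))
        raise ValueError(
            f"Unknown dataset preset {dataset_name!r}; known presets: {known}"
        ) from None
    return preset(config)


dataset = build(RecipeConfig(seq_length=4096), "squad")
dataset.num_workers = 4  # callers can override fields after the preset is built
print(dataset)  # DatasetConfig(name='squad', seq_length=4096, num_workers=4)
```

Building a new object on every call keeps later overrides from leaking back into the registry.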

bridge.recipes.utils.dataset_utils.dataset_train_mode( dataset_config: bridge.recipes.utils.dataset_utils.PublicDatasetConfig, ) → Literal["pretrain", "finetune"]#: Return the training loop required by a built dataset config.
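
One plausible reading of this function is a type-based dispatch: pretraining-style dataset configs select the pretraining loop and SFT-style configs select the finetuning loop. The sketch below illustrates that reading with stand-in classes. The mapping is an assumption rather than documented behavior, and `PretrainDatasetConfig`, `SFTDatasetConfig`, and `train_mode` are hypothetical names.

```python
from typing import Literal, Union


class PretrainDatasetConfig:
    """Hypothetical stand-in for a pretraining dataset config."""


class SFTDatasetConfig:
    """Hypothetical stand-in for a supervised finetuning dataset config."""


AnyDatasetConfig = Union[PretrainDatasetConfig, SFTDatasetConfig]


def train_mode(dataset_config: AnyDatasetConfig) -> Literal["pretrain", "finetune"]:
    """Pick the training loop from the config's type (assumed mapping)."""
    if isinstance(dataset_config, PretrainDatasetConfig):
        return "pretrain"
    if isinstance(dataset_config, SFTDatasetConfig):
        return "finetune"
    raise TypeError(f"Unsupported dataset config: {type(dataset_config).__name__}")


print(train_mode(PretrainDatasetConfig()))  # pretrain
print(train_mode(SFTDatasetConfig()))       # finetune
```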
