bridge.recipes.utils.dataset_utils#
Dataset configuration utilities for recipes and training scripts.
Module Contents#
Functions#
get_blend_fields_from_data_paths | Common configuration logic for blend, blend_per_split, split dataset config fields.
extract_and_remove_override | Extract a Hydra-style override (key=value) from cli_overrides and remove it.
_resolve_seq_length | Resolve sequence length: explicit arg > model config > 4096 fallback.
apply_dataset_override | Replace the recipe’s dataset config based on the requested dataset type.
infer_mode_from_dataset | Infer training mode from the dataset type prefix.
Data#
API#
- bridge.recipes.utils.dataset_utils.logger#
‘getLogger(…)’
- bridge.recipes.utils.dataset_utils._BLEND_TYPE#
None
- bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE#
None
- bridge.recipes.utils.dataset_utils._SPLIT_TYPE#
None
- bridge.recipes.utils.dataset_utils.get_blend_fields_from_data_paths(
- data_paths: Optional[List[str]] = None,
- data_args_path: Optional[str] = None,
- train_data_path: Optional[List[str]] = None,
- valid_data_path: Optional[List[str]] = None,
- test_data_path: Optional[List[str]] = None,
- per_split_data_args_path: Optional[str] = None,
- mock: bool = False,
Common configuration logic for blend, blend_per_split, split dataset config fields.
Handles mock and real data. If no path to data is provided, mock data will be used. Prioritizes data_paths over split data paths. For all of data_paths, train_data_path, valid_data_path, and test_data_path, two formats are accepted: either (1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"].
- Parameters:
data_paths (Optional[List[str]]) – List of paths to dataset files.
data_args_path (Optional[str]) – Path to file containing data arguments.
train_data_path (Optional[List[str]]) – List of training data paths.
valid_data_path (Optional[List[str]]) – List of validation data paths.
test_data_path (Optional[List[str]]) – List of test data paths.
per_split_data_args_path (Optional[str]) – Path to JSON file with per-split data configuration.
mock (bool) – Whether to use mock data. If True, ignores data_paths.
- Returns:
A tuple (blend, blend_per_split, split), the corresponding fields to be passed to GPTDatasetConfig.
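The two accepted path formats can be told apart by checking whether the even-indexed entries parse as numbers. The sketch below is illustrative only: `split_weights_and_prefixes` and `_is_number` are hypothetical helpers, not part of this module, and the actual function may detect the formats differently.

```python
from typing import List, Optional, Tuple


def _is_number(text: str) -> bool:
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def split_weights_and_prefixes(
    paths: List[str],
) -> Tuple[List[str], Optional[List[float]]]:
    """Hypothetical helper: separate a path list into prefixes and weights.

    Returns (prefixes, None) for a plain list of prefixes, or
    (prefixes, weights) for a flattened, zipped weight/prefix list.
    """
    # Assumption: a zipped list has even length and every even-indexed
    # entry (0, 2, 4, ...) is a numeric weight.
    is_zipped = (
        len(paths) > 0
        and len(paths) % 2 == 0
        and all(_is_number(p) for p in paths[0::2])
    )
    if not is_zipped:
        return list(paths), None
    weights = [float(w) for w in paths[0::2]]
    prefixes = list(paths[1::2])
    return prefixes, weights
```

With this sketch, a plain prefix list yields no weights, while the zipped form yields one weight per prefix. A prefix whose name happens to parse as a number would be misread as a weight, which is one reason to treat this as a sketch rather than the module's logic.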
- bridge.recipes.utils.dataset_utils.DATASET_TYPES#
[‘llm-pretrain’, ‘llm-pretrain-mock’, ‘llm-finetune’, ‘llm-finetune-preloaded’, ‘vlm-energon’, ‘vlm-…
- bridge.recipes.utils.dataset_utils.LLM_FINETUNE_PRESETS: dict[str, Callable]#
None
- bridge.recipes.utils.dataset_utils.extract_and_remove_override(
- cli_overrides: list[str],
- key: str,
- default: str | None = None,
Extract a Hydra-style override (key=value) from cli_overrides and remove it.
Returns the value if found, otherwise default.
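The extract-and-remove behavior described above can be sketched as follows. `extract_override_sketch` is a hypothetical stand-in written for illustration, not the module's source. It assumes a plain `key=value` form and does not handle Hydra prefixes such as `+` or `~`, and the override values shown in the usage note are made up.

```python
from typing import List, Optional


def extract_override_sketch(
    cli_overrides: List[str], key: str, default: Optional[str] = None
) -> Optional[str]:
    """Hypothetical sketch: pop the first `key=value` entry and return value."""
    prefix = f"{key}="
    for index, item in enumerate(cli_overrides):
        if item.startswith(prefix):
            # Mutate the caller's list so later override processing
            # no longer sees the consumed entry.
            del cli_overrides[index]
            return item[len(prefix):]
    return default
```

For example, calling the sketch on `["train.train_iters=100", "dataset.dataset_name=my_dataset"]` with key `dataset.dataset_name` returns `"my_dataset"` and leaves only the first entry in the list. Matching on `key=` rather than `key` avoids consuming an override whose key merely starts with the same characters.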
- bridge.recipes.utils.dataset_utils._resolve_seq_length(
- config: megatron.bridge.training.config.ConfigContainer,
- seq_length: int | None,
Resolve sequence length: explicit arg > model config > 4096 fallback.
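The precedence order can be sketched as below. This is a minimal illustration, not the module's implementation: `resolve_seq_length_sketch` is a hypothetical name, and reading the model's value from `config.model.seq_length` is an assumption about the config layout.

```python
from types import SimpleNamespace
from typing import Any, Optional

DEFAULT_SEQ_LENGTH = 4096  # fallback value stated in the docstring


def resolve_seq_length_sketch(config: Any, seq_length: Optional[int]) -> int:
    """Hypothetical sketch: explicit arg > model config > 4096 fallback."""
    # 1. An explicit argument always wins.
    if seq_length is not None:
        return seq_length
    # 2. Otherwise use the model's value, if present.
    #    Assumption: it is stored at config.model.seq_length.
    model = getattr(config, "model", None)
    model_seq_length = getattr(model, "seq_length", None)
    if model_seq_length is not None:
        return model_seq_length
    # 3. Fall back to the default.
    return DEFAULT_SEQ_LENGTH


# Stand-in config object for demonstration only.
demo_config = SimpleNamespace(model=SimpleNamespace(seq_length=8192))
```

Checking `is not None` rather than truthiness keeps the three cases distinct, so a missing value is never confused with a value that was explicitly supplied.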
- bridge.recipes.utils.dataset_utils.apply_dataset_override(
- config: megatron.bridge.training.config.ConfigContainer,
- dataset_type: str,
- packed_sequence: bool = False,
- seq_length: int | None = None,
- cli_overrides: list[str] | None = None,
Replace the recipe’s dataset config based on the requested dataset type.
- Parameters:
config – The recipe config to modify.
dataset_type – One of DATASET_TYPES.
packed_sequence – Whether to enable packed sequences.
seq_length – Explicit sequence length (None = use the model config’s value, else the 4096 default).
cli_overrides – Mutable list of Hydra-style CLI overrides. For llm-finetune, dataset.dataset_name is extracted and consumed here to select the preset.
- Returns:
The modified ConfigContainer.
- bridge.recipes.utils.dataset_utils.infer_mode_from_dataset(dataset_type: str) → str#
Infer training mode from the dataset type prefix.