bridge.recipes.utils.dataset_utils
Module Contents
Functions
get_blend_fields_from_data_paths | Common configuration logic for blend, blend_per_split, split dataset config fields.
Data
_BLEND_TYPE
_BLEND_PER_SPLIT_TYPE
_SPLIT_TYPE
API
- bridge.recipes.utils.dataset_utils._BLEND_TYPE
None
- bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE
None
- bridge.recipes.utils.dataset_utils._SPLIT_TYPE
None
- bridge.recipes.utils.dataset_utils.get_blend_fields_from_data_paths(
    data_paths: Optional[List[str]] = None,
    data_args_path: Optional[str] = None,
    train_data_path: Optional[List[str]] = None,
    valid_data_path: Optional[List[str]] = None,
    test_data_path: Optional[List[str]] = None,
    per_split_data_args_path: Optional[str] = None,
    mock: bool = False,
)
Common configuration logic for blend, blend_per_split, split dataset config fields.
Handles mock and real data. If no path to data is provided, mock data will be used. Prioritizes data_paths over split data paths. For all of data_paths, train_data_path, valid_data_path, and test_data_path, two formats are accepted: either (1) a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"], or (2) a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"].
- Parameters:
data_paths (Optional[List[str]]) – List of paths to dataset files.
data_args_path (Optional[str]) – Path to file containing data arguments.
train_data_path (Optional[List[str]]) – List of training data paths.
valid_data_path (Optional[List[str]]) – List of validation data paths.
test_data_path (Optional[List[str]]) – List of test data paths.
per_split_data_args_path (Optional[str]) – Path to JSON file with per-split data configuration.
mock (bool) – Whether to use mock data. If True, ignores data_paths.
- Returns:
A tuple (blend, blend_per_split, split), the corresponding fields to be passed to GPTDatasetConfig.
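The two accepted path-list formats can be illustrated with a small, self-contained sketch. The helper `split_weights_and_prefixes` below is hypothetical and is not part of this module; it only shows one plausible way a flattened, zipped list could be told apart from a plain list of prefixes, assuming that weights are always numeric strings in the even positions.

```python
from typing import List, Optional, Tuple


def split_weights_and_prefixes(
    paths: List[str],
) -> Tuple[List[str], Optional[List[float]]]:
    """Hypothetical helper (not the library's implementation).

    Returns (prefixes, weights); weights is None for a plain prefix list.
    """

    def is_number(token: str) -> bool:
        try:
            float(token)
            return True
        except ValueError:
            return False

    # Format (2): even length, and every even-indexed entry parses as a number.
    looks_zipped = (
        len(paths) > 0
        and len(paths) % 2 == 0
        and all(is_number(tok) for tok in paths[0::2])
    )
    if looks_zipped:
        weights = [float(tok) for tok in paths[0::2]]
        prefixes = list(paths[1::2])
        return prefixes, weights

    # Format (1): a plain list of prefixes with no explicit weights.
    return list(paths), None


plain = split_weights_and_prefixes(
    ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"]
)
zipped = split_weights_and_prefixes(
    ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]
)
```

Under these assumptions, `plain` carries no weights, while `zipped` pairs each prefix with its weight. How the actual function normalizes weights or validates prefixes is not stated in this page, so the sketch leaves both untouched.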