bridge.recipes.utils.dataset_utils

Module Contents

Functions

get_blend_fields_from_data_paths

Common configuration logic for blend, blend_per_split, split dataset config fields.

Data

API

bridge.recipes.utils.dataset_utils._BLEND_TYPE

None

bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE

None

bridge.recipes.utils.dataset_utils._SPLIT_TYPE

None

bridge.recipes.utils.dataset_utils.get_blend_fields_from_data_paths(
data_paths: Optional[List[str]] = None,
data_args_path: Optional[str] = None,
train_data_path: Optional[List[str]] = None,
valid_data_path: Optional[List[str]] = None,
test_data_path: Optional[List[str]] = None,
per_split_data_args_path: Optional[str] = None,
mock: bool = False,
) → Tuple[bridge.recipes.utils.dataset_utils._BLEND_TYPE, bridge.recipes.utils.dataset_utils._BLEND_PER_SPLIT_TYPE, bridge.recipes.utils.dataset_utils._SPLIT_TYPE]

Common configuration logic for blend, blend_per_split, split dataset config fields.

Handles both mock and real data. If no data path is provided, mock data is used. data_paths takes priority over the per-split data paths. Each of data_paths, train_data_path, valid_data_path, and test_data_path accepts one of two formats:

  • a list of prefixes, e.g. ["path/to/dataset_1_prefix", "path/to/dataset_2_prefix"]

  • a flattened, zipped list of weights and prefixes, e.g. ["30", "path/to/dataset_1_prefix", "70", "path/to/dataset_2_prefix"]

Parameters:
  • data_paths (Optional[List[str]]) – List of dataset prefixes, optionally interleaved with weights.

  • data_args_path (Optional[str]) – Path to file containing data arguments.

  • train_data_path (Optional[List[str]]) – List of training data paths.

  • valid_data_path (Optional[List[str]]) – List of validation data paths.

  • test_data_path (Optional[List[str]]) – List of test data paths.

  • per_split_data_args_path (Optional[str]) – Path to JSON file with per-split data configuration.

  • mock (bool) – Whether to use mock data. If True, ignores data_paths.

Returns:

A tuple (blend, blend_per_split, split), the corresponding fields to be passed to GPTDatasetConfig.
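The priority rules described above (mock first, then data_paths, then the per-split paths) can be sketched as a small decision function. This is an illustration only, not the function's actual body: choose_data_source is a hypothetical name, the returned labels are invented for the example, and grouping data_args_path with data_paths (and per_split_data_args_path with the per-split paths) is an assumption the documentation does not state.

```python
from typing import List, Optional


def choose_data_source(
    data_paths: Optional[List[str]] = None,
    data_args_path: Optional[str] = None,
    train_data_path: Optional[List[str]] = None,
    valid_data_path: Optional[List[str]] = None,
    test_data_path: Optional[List[str]] = None,
    per_split_data_args_path: Optional[str] = None,
    mock: bool = False,
) -> str:
    """Hypothetical helper: name the source the documented rules would select."""
    # Assumption: data_args_path belongs with the blended inputs.
    blended = [data_paths, data_args_path]
    # Assumption: per_split_data_args_path belongs with the per-split inputs.
    per_split = [
        train_data_path,
        valid_data_path,
        test_data_path,
        per_split_data_args_path,
    ]
    # Mock data is used when requested or when no path is given at all.
    if mock or not any(blended + per_split):
        return "mock"
    # data_paths is prioritized over the per-split paths.
    if any(blended):
        return "blend"
    return "blend_per_split"


print(choose_data_source())  # no paths given
print(choose_data_source(data_paths=["a"], train_data_path=["b"]))  # both given
print(choose_data_source(train_data_path=["b"]))  # per-split only
```

The sketch returns a label rather than the (blend, blend_per_split, split) tuple because the concrete values of those fields are not specified in this page.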