data_designer.config.utils.io_helpers
logger
MAX_CONFIG_URL_SIZE_BYTES
VALID_DATASET_FILE_EXTENSIONS
VALID_CONFIG_FILE_EXTENSIONS
Create configuration directory if it doesn’t exist.
Parameters:
Directory path to create
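A minimal standard-library sketch of this behavior. The function name `ensure_config_dir` and the parameter name `config_dir` are hypothetical, since the actual signature is not shown on this page.

```python
from pathlib import Path


def ensure_config_dir(config_dir: Path) -> None:
    """Create the directory (and any missing parents) if it is absent."""
    # exist_ok=True makes the call idempotent: no error if it already exists.
    config_dir.mkdir(parents=True, exist_ok=True)
```

Calling it twice on the same path is safe; the second call is a no-op.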
Load a YAML configuration file.
Parameters:
Path to the YAML file
Returns:
dict
Parsed YAML content as dictionary
Raises:
If file doesn’t exist
If YAML is malformed
If file is empty
Save configuration to a YAML file.
Parameters:
Path where to save the file
Configuration dictionary to save
Raises:
If file cannot be written
Discover processor names from directories and parquet files under the given path.
Load a processor’s output dataset, checking for a directory first and then for a single parquet file.
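The discovery and lookup order described above could be sketched as follows, using only `pathlib`. The names `discover_processor_names` and `resolve_processor_path`, and the choice of `FileNotFoundError`, are assumptions for illustration; reading the resolved path into a DataFrame is left out.

```python
from pathlib import Path


def discover_processor_names(base: Path) -> list[str]:
    """Collect names from subdirectories and from *.parquet file stems."""
    if not base.is_dir():
        return []
    names = set()
    for entry in base.iterdir():
        if entry.is_dir():
            names.add(entry.name)
        elif entry.is_file() and entry.suffix == ".parquet":
            names.add(entry.stem)
    return sorted(names)


def resolve_processor_path(base: Path, name: str) -> Path:
    """Prefer a directory called `name`; fall back to `<name>.parquet`."""
    directory = base / name
    if directory.is_dir():
        return directory
    single_file = base / f"{name}.parquet"
    if single_file.is_file():
        return single_file
    # Exception type is an assumption; the page does not document it.
    raise FileNotFoundError(f"No output for processor {name!r} under {base}")
```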
Read a parquet dataset from a path.
Parameters:
The path to the parquet dataset, which can be either a file or a directory.
Returns:
pandas.DataFrame
The parquet dataset as a pandas DataFrame.
Validate that a dataset file path has a valid extension and optionally exists.
Parameters:
The path to validate, either as a string or Path object.
If True, verify that the file exists. Defaults to True.
Returns:
pathlib.Path
The validated path as a Path object.
Raises:
If the path is not a file.
If the path does not have a valid extension.
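A rough sketch of the two checks listed above. The function and parameter names, the extension set, and the use of `ValueError` are all assumptions; in particular `ASSUMED_DATASET_EXTENSIONS` is a placeholder and not the actual value of `VALID_DATASET_FILE_EXTENSIONS`.

```python
from pathlib import Path
from typing import Union

# Placeholder set for illustration; not the module's actual constant value.
ASSUMED_DATASET_EXTENSIONS = {".parquet", ".csv", ".json", ".jsonl"}


def validate_dataset_path(path: Union[str, Path], check_exists: bool = True) -> Path:
    """Return `path` as a Path after checking existence and extension."""
    path = Path(path)
    # The existence check is optional so callers can validate output paths too.
    if check_exists and not path.is_file():
        raise ValueError(f"Path {path} is not a file.")
    if path.suffix.lower() not in ASSUMED_DATASET_EXTENSIONS:
        raise ValueError(
            f"Path {path} must end in one of {sorted(ASSUMED_DATASET_EXTENSIONS)}."
        )
    return path
```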
Validate that a path contains files of a specific type.
Parameters:
The path to validate. Can contain wildcards like *.parquet.
The extension of the files to validate (without the dot, e.g., “parquet”).
Returns:
None
None if the path contains files of the specified type; otherwise an error is raised.
Raises:
If the path does not contain files of the specified type.
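One way such a check might work, shown as a standard-library sketch. The function name, parameter names, and the `ValueError` type are assumptions; the real function may handle directories and wildcards differently.

```python
import glob
from pathlib import Path


def validate_path_contains_files_of_type(path: str, file_extension: str) -> None:
    """Raise if no file with the given extension is found at `path`."""
    suffix = f".{file_extension}"
    if any(ch in path for ch in "*?["):
        # Wildcard pattern: expand it and inspect the matches.
        candidates = glob.glob(path)
    elif Path(path).is_dir():
        # Plain directory: look at the files directly inside it.
        candidates = [str(p) for p in Path(path).iterdir() if p.is_file()]
    else:
        # Single file path (or nothing at all).
        candidates = [path] if Path(path).is_file() else []
    if not any(c.endswith(suffix) for c in candidates):
        raise ValueError(f"No files with extension {suffix!r} found at {path!r}.")
```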
Load a dataframe from file if a path is given, otherwise return the dataframe.
Parameters:
A path to a file or a pandas DataFrame object.
Returns:
pandas.DataFrame
A pandas DataFrame object.
Return the YAML config as a dict given flexible input types.
Parameters:
The config as a dict, YAML string, or YAML file path.
Returns:
dict
The config as a dict.
Internal YAML loader with context to prevent URL recursion on fetched payloads.
Check whether a string is an HTTP or HTTPS URL.
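A check like this can be written with `urllib.parse`. The name `is_http_url` is hypothetical, and the sketch below is one plausible reading of the description, not the module's actual code.

```python
from urllib.parse import urlparse


def is_http_url(value: str) -> bool:
    """Return True only for strings with an http/https scheme and a host."""
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse rejects some malformed inputs (e.g. a broken IPv6 host).
        return False
    # urlparse lowercases the scheme, so "HTTP://..." is accepted as well.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Requiring a non-empty host keeps YAML strings such as `key: value` and plain file paths from being mistaken for URLs.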
Rewrite known hosting-provider file-view URLs to raw-content URLs.
Return URL without query/fragment for safe logging.
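Query strings often carry tokens or signatures, which is why they are dropped before logging. A minimal sketch, with the hypothetical name `strip_query_and_fragment`:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Rebuild the URL from scheme, host, and path only."""
    parts = urlsplit(url)
    # Empty strings for query and fragment drop both components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```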
Rewrite GitHub blob URLs to raw.githubusercontent.com equivalents.
GitHub blob URLs (e.g. https://github.com/org/repo/blob/main/config.yaml) serve HTML pages, not raw file content. This rewrites them so that downstream fetchers get the actual file.
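The rewrite amounts to dropping the `blob` path segment and switching the host. The sketch below illustrates that mapping; the function name is hypothetical, and edge cases such as branch names containing slashes are not treated specially.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_github_blob_url(url: str) -> str:
    """Map github.com/<org>/<repo>/blob/<ref>/<path> to its raw equivalent."""
    parts = urlsplit(url)
    if parts.netloc.lower() != "github.com":
        return url
    segments = parts.path.strip("/").split("/")
    # Expected shape: org / repo / "blob" / ref / path...
    if len(segments) < 5 or segments[2] != "blob":
        return url
    org, repo = segments[0], segments[1]
    rest = "/".join(segments[3:])
    raw_path = f"/{org}/{repo}/{rest}"
    return urlunsplit(
        (parts.scheme, "raw.githubusercontent.com", raw_path, parts.query, parts.fragment)
    )
```

URLs that do not match the blob shape are returned unchanged.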
Rewrite Hugging Face Hub blob URLs to raw URL equivalents.
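A sketch of what such a rewrite might look like, assuming the `blob` path segment is swapped for `raw` and that dataset and Space repositories carry an extra leading `datasets/` or `spaces/` segment. Both the function name and these URL-shape assumptions are illustrative, not confirmed by this page.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_hf_blob_url(url: str) -> str:
    """Swap the 'blob' segment of a huggingface.co file URL for 'raw'."""
    parts = urlsplit(url)
    if parts.netloc.lower() != "huggingface.co":
        return url
    segments = parts.path.strip("/").split("/")
    # Assumed shapes: <org>/<repo>/blob/<ref>/<path>
    #             or: datasets|spaces/<org>/<repo>/blob/<ref>/<path>
    offset = 1 if segments[0] in ("datasets", "spaces") else 0
    blob_index = offset + 2
    if len(segments) < blob_index + 3 or segments[blob_index] != "blob":
        return url
    segments[blob_index] = "raw"
    new_path = "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))
```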
Raise a ValueError with actionable details for failing HTTP status codes.
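As an illustration of "actionable details", the sketch below attaches a status-specific hint to the error. The function name, the status codes singled out, and the hint wording are all invented for this example.

```python
def raise_for_failed_status(status_code: int, url: str) -> None:
    """Raise ValueError with a hint when the status code signals failure."""
    if status_code < 400:
        return
    # Hypothetical hints; the module's real messages are not shown here.
    hints = {
        401: "Authentication is required; check your credentials.",
        403: "Access is forbidden; the file may be private.",
        404: "The file was not found; check the URL, branch, and path.",
    }
    hint = hints.get(status_code, "Check that the URL is reachable and try again.")
    raise ValueError(f"Failed to fetch config from {url} (HTTP {status_code}). {hint}")
```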
Fetch a remote YAML/JSON config URL and return the parsed dict.
Parameters:
HTTP(S) URL pointing to a YAML or JSON configuration file.
Returns:
dict
The parsed configuration as a dictionary.
Raises:
If the URL extension is unsupported, the fetch fails, the response exceeds the size limit, or parsing produces a non-dict result.
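The failure modes listed above, apart from the network fetch itself, can be sketched as a validation step applied to an already-downloaded payload. Everything named here is an assumption: the function name, the placeholder size limit (not the real `MAX_CONFIG_URL_SIZE_BYTES`), and the extension set. The parser is injectable so the sketch stays standard-library only; a YAML payload would need a YAML parser passed in.

```python
import json
from pathlib import PurePosixPath
from typing import Any, Callable
from urllib.parse import urlsplit

# Placeholders for illustration; not the module's actual constant values.
ASSUMED_MAX_BYTES = 1_000_000
ASSUMED_CONFIG_EXTENSIONS = {".yaml", ".yml", ".json"}


def parse_fetched_config(
    url: str, payload: bytes, parser: Callable[[str], Any] = json.loads
) -> dict:
    """Validate extension and size, then parse and require a dict result."""
    extension = PurePosixPath(urlsplit(url).path).suffix.lower()
    if extension not in ASSUMED_CONFIG_EXTENSIONS:
        raise ValueError(f"Unsupported config extension {extension!r}.")
    if len(payload) > ASSUMED_MAX_BYTES:
        raise ValueError(f"Response exceeds {ASSUMED_MAX_BYTES} bytes.")
    try:
        parsed = parser(payload.decode("utf-8"))
    except Exception as exc:
        # Parser error types vary, so they are normalized to ValueError.
        raise ValueError(f"Could not parse config: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a mapping, got {type(parsed).__name__}.")
    return parsed
```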
Convert non-JSON-serializable objects to JSON-serializable Python-native types.
Raises:
If the object type is not supported for serialization.
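A converter of this kind is typically passed as the `default` argument to `json.dumps`. The sketch below covers a handful of standard-library types; which types the real function supports, its name, and the use of `TypeError` are assumptions.

```python
import json
from datetime import date, datetime
from decimal import Decimal
from pathlib import Path
from typing import Any


def to_json_serializable(obj: Any) -> Any:
    """Convert a few common non-JSON types to Python-native equivalents."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Path):
        return str(obj)
    if isinstance(obj, (set, frozenset)):
        return list(obj)
    if isinstance(obj, Decimal):
        return float(obj)
    # Anything else is rejected, mirroring the documented error case.
    raise TypeError(f"Object of type {type(obj).__name__} is not serializable.")


# json.dumps calls the hook only for values it cannot encode itself.
encoded = json.dumps({"path": Path("data"), "tags": {"a"}}, default=to_json_serializable)
```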