utils.file_utils#
Module Contents#
Functions#
| Function | Description |
|---|---|
| check_disallowed_kwargs | Check if any of the disallowed keys are in the provided kwargs. Used for read/write kwargs in stages. |
| check_output_mode | Validate and act on the write mode for an output directory. |
| create_or_overwrite_dir | Creates a directory if it does not exist and overwrites it if it does. |
| get_all_file_paths_and_size_under | Get all file paths and their sizes under a given path. |
| get_all_file_paths_under | Get all file paths under a given path. |
| infer_dataset_name_from_path | Infer a dataset name from a path, handling both local and cloud storage paths. |
| infer_protocol_from_paths | Infer a protocol from a list of paths, if any. |
| pandas_select_columns | Project a Pandas DataFrame onto existing columns, logging warnings for missing ones. |
| tar_safe_extract | Safely extract a tar file, preventing path traversal attacks. |
Data#
API#
- utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS#
None
- utils.file_utils.check_disallowed_kwargs(
- kwargs: dict,
- disallowed_keys: list[str],
- raise_error: bool = True,
) → None#
Check if any of the disallowed keys are in the provided kwargs. Used for read/write kwargs in stages.
Args:
- kwargs: The dictionary to check.
- disallowed_keys: The keys that are not allowed.
- raise_error: Whether to raise an error if any of the disallowed keys are in the kwargs.
Raises:
- ValueError: If any of the disallowed keys are in the kwargs and raise_error is True.
- Warning: If any of the disallowed keys are in the kwargs and raise_error is False.
Returns: None
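A minimal usage sketch (the import path utils.file_utils and the example kwargs are assumptions for illustration):

```python
from utils.file_utils import check_disallowed_kwargs  # import path is an assumption

read_kwargs = {"columns": ["id", "text"], "storage_options": {"anon": True}}

try:
    # "storage_options" is present in read_kwargs, so this raises ValueError.
    check_disallowed_kwargs(read_kwargs, disallowed_keys=["storage_options"], raise_error=True)
except ValueError as err:
    print(f"Rejected kwargs: {err}")

# With raise_error=False the same check only emits a warning and returns None.
check_disallowed_kwargs(read_kwargs, disallowed_keys=["storage_options"], raise_error=False)
```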
- utils.file_utils.check_output_mode(
- mode: Literal['overwrite', 'append', 'error', 'ignore'],
- fs: fsspec.AbstractFileSystem,
- path: str,
- append_mode_implemented: bool = False,
)#
Validate and act on the write mode for an output directory.
Modes:
- “overwrite”: delete the existing output_dir recursively if it exists.
- “append”: no-op here; raises if append is not implemented.
- “error”: raise FileExistsError if output_dir already exists.
- “ignore”: no-op.
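A sketch of how the modes behave, assuming the module is importable as utils.file_utils and using an illustrative local path:

```python
import os

import fsspec

from utils.file_utils import check_output_mode  # import path is an assumption

fs = fsspec.filesystem("file")      # local filesystem
output_dir = "/tmp/curator_output"  # illustrative path
os.makedirs(output_dir, exist_ok=True)

try:
    # "error": raises FileExistsError because output_dir already exists.
    check_output_mode("error", fs, output_dir)
except FileExistsError:
    print(f"{output_dir} already exists")

# "overwrite": deletes the existing output_dir recursively before writing.
check_output_mode("overwrite", fs, output_dir)
```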
- utils.file_utils.create_or_overwrite_dir(
- path: str,
- fs: fsspec.AbstractFileSystem | None = None,
- storage_options: dict[str, str] | None = None,
)#
Creates a directory if it does not exist and overwrites it if it does.
Warning: This function will delete all files in the directory if it already exists.
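A short sketch; the paths and bucket name are illustrative, and the import path is an assumption:

```python
from utils.file_utils import create_or_overwrite_dir  # import path is an assumption

# Local path: creates the directory, or deletes its contents and recreates it if it exists.
create_or_overwrite_dir("/tmp/scratch_dir")

# Remote path (illustrative bucket): credentials are passed via storage_options.
create_or_overwrite_dir("s3://my-bucket/scratch", storage_options={"anon": False})
```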
- utils.file_utils.delete_dir(
- path: str,
- fs: fsspec.AbstractFileSystem | None = None,
- storage_options: dict[str, str] | None = None,
)#
- utils.file_utils.filter_files_by_extension(
- files_list: list[str],
- keep_extensions: str | list[str],
)#
- utils.file_utils.get_all_file_paths_and_size_under(
- path: str,
- recurse_subdirectories: bool = False,
- keep_extensions: str | list[str] | None = None,
- storage_options: dict[str, str] | None = None,
- fs: fsspec.AbstractFileSystem | None = None,
)#
Get all file paths and their sizes under a given path.
Args:
- path: The path to get the file paths from.
- recurse_subdirectories: Whether to recurse subdirectories.
- keep_extensions: The extensions to keep.
- storage_options: The storage options to use.
- fs: The filesystem to use.
Returns: A list of tuples (file_path, file_size).
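An illustrative call, assuming the import path and that extensions are given with a leading dot (the exact spelling expected by keep_extensions is not documented here):

```python
from utils.file_utils import get_all_file_paths_and_size_under  # import path is an assumption

# Recursively list JSONL files and their sizes under a local directory (path is illustrative).
files_and_sizes = get_all_file_paths_and_size_under(
    "/data/corpus",
    recurse_subdirectories=True,
    keep_extensions=".jsonl",  # extension spelling is an assumption
)
for file_path, file_size in files_and_sizes:
    print(f"{file_path}: {file_size} bytes")
```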
- utils.file_utils.get_all_file_paths_under(
- path: str,
- recurse_subdirectories: bool = False,
- keep_extensions: str | list[str] | None = None,
- storage_options: dict[str, str] | None = None,
- fs: fsspec.AbstractFileSystem | None = None,
)#
Get all file paths under a given path.
Args:
- path: The path to get the file paths from.
- recurse_subdirectories: Whether to recurse subdirectories.
- keep_extensions: The extensions to keep.
- storage_options: The storage options to use.
- fs: The filesystem to use.
Returns: A list of file paths.
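A similar sketch against an object store, with an illustrative bucket and assumed import path:

```python
from utils.file_utils import get_all_file_paths_under  # import path is an assumption

# List Parquet files one level deep under an S3 prefix; anonymous access is
# enabled through storage_options for public data.
paths = get_all_file_paths_under(
    "s3://my-bucket/dataset/",
    recurse_subdirectories=False,
    keep_extensions=".parquet",  # extension spelling is an assumption
    storage_options={"anon": True},
)
print(paths)
```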
- utils.file_utils.get_fs(
- path: str,
- storage_options: dict[str, str] | None = None,
)#
- utils.file_utils.infer_dataset_name_from_path(path: str) → str#
Infer a dataset name from a path, handling both local and cloud storage paths.
Args:
- path: Local path or cloud storage URL (e.g. s3://, abfs://)
Returns: Inferred dataset name from the path
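Illustrative calls (the import path is an assumption, and the exact naming rules depend on the implementation):

```python
from utils.file_utils import infer_dataset_name_from_path  # import path is an assumption

# Works for local paths and cloud storage URLs alike.
print(infer_dataset_name_from_path("/data/my_dataset"))
print(infer_dataset_name_from_path("s3://my-bucket/raw/common_crawl/"))
```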
- utils.file_utils.infer_protocol_from_paths(
- paths: collections.abc.Iterable[str],
)#
Infer a protocol from a list of paths, if any.
Returns the first detected protocol scheme (e.g., “s3”, “gcs”, “gs”, “abfs”) or None for local paths.
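A quick sketch of the documented behavior (import path is an assumption):

```python
from utils.file_utils import infer_protocol_from_paths  # import path is an assumption

# Cloud paths yield the first detected scheme, e.g. "s3"; purely local paths yield None.
print(infer_protocol_from_paths(["s3://my-bucket/a.jsonl", "s3://my-bucket/b.jsonl"]))
print(infer_protocol_from_paths(["/data/a.jsonl", "/data/b.jsonl"]))
```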
- utils.file_utils.is_not_empty(
- path: str,
- fs: fsspec.AbstractFileSystem | None = None,
- storage_options: dict[str, str] | None = None,
)#
- utils.file_utils.pandas_select_columns(
- df: pandas.DataFrame,
- columns: list[str] | None,
- file_path: str,
)#
Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.
Returns the projected DataFrame. If no requested columns exist, returns None.
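A small sketch of the projection behavior (import path and the file_path value are assumptions):

```python
import pandas as pd

from utils.file_utils import pandas_select_columns  # import path is an assumption

df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})

# "score" does not exist in df, so it is dropped with a logged warning; "id" and "text" are kept.
projected = pandas_select_columns(df, ["id", "text", "score"], file_path="example.jsonl")
if projected is not None:
    print(projected.columns.tolist())
else:
    print("none of the requested columns exist")
```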
- utils.file_utils.tar_safe_extract(tar: tarfile.TarFile, path: str) → None#
Safely extract a tar file, preventing path traversal attacks.
Args:
- tar: The TarFile object to extract
- path: The destination path for extraction
Raises:
- ValueError: If any member has an unsafe path
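A minimal sketch, assuming the import path and an illustrative archive name:

```python
import tarfile

from utils.file_utils import tar_safe_extract  # import path is an assumption

# Members whose resolved paths escape the destination directory raise ValueError
# instead of being written outside /tmp/extracted.
with tarfile.open("archive.tar.gz", "r:gz") as tar:
    tar_safe_extract(tar, "/tmp/extracted")
```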