utils.file_utils

Module Contents

Functions

check_disallowed_kwargs

Check whether any of the disallowed keys are present in the provided kwargs. Used for read/write kwargs in stages.

check_output_mode

Validate and act on the write mode for an output directory.

create_or_overwrite_dir

Creates a directory if it does not exist and overwrites it if it does. Warning: This function will delete all files in the directory if it exists.

delete_dir

filter_files_by_extension

get_all_file_paths_and_size_under

Get all file paths and their sizes under a given path.

get_all_file_paths_under

Get all file paths under a given path.

get_fs

infer_dataset_name_from_path

Infer a dataset name from a path, handling both local and cloud storage paths.

infer_protocol_from_paths

Infer a protocol from a list of paths, if any.

is_not_empty

pandas_select_columns

Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.

tar_safe_extract

Safely extract a tar file, preventing path traversal attacks.

Data

FILETYPE_TO_DEFAULT_EXTENSIONS

API

utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS

None

utils.file_utils.check_disallowed_kwargs(
kwargs: dict,
disallowed_keys: list[str],
raise_error: bool = True,
) → None

Check whether any of the disallowed keys are present in the provided kwargs. Used for read/write kwargs in stages.

Args:
    kwargs: The dictionary to check.
    disallowed_keys: The keys that are not allowed.
    raise_error: Whether to raise an error if any of the disallowed keys are in the kwargs.

Raises:
    ValueError: If any of the disallowed keys are in the kwargs and raise_error is True.
    Warning: If any of the disallowed keys are in the kwargs and raise_error is False.

Returns:
    None
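The contract described above can be illustrated with a short, self-contained sketch. This is not the module's implementation: the name `check_disallowed_kwargs_sketch`, the message wording, and the use of `warnings.warn` for the non-raising path are all assumptions made for illustration.

```python
import warnings


def check_disallowed_kwargs_sketch(
    kwargs: dict,
    disallowed_keys: list[str],
    raise_error: bool = True,
) -> None:
    """Hypothetical stand-in that mirrors the documented contract."""
    # Collect offending keys in the order they appear in disallowed_keys.
    found = [key for key in disallowed_keys if key in kwargs]
    if not found:
        return
    message = f"Unsupported keys in kwargs: {', '.join(found)}"
    if raise_error:
        raise ValueError(message)
    # Assumption: the non-raising path is modeled with warnings.warn.
    warnings.warn(message, stacklevel=2)
```

A stage that controls some reader or writer arguments itself could call such a check before forwarding user-supplied kwargs, so that conflicting keys are reported early rather than silently overridden.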

utils.file_utils.check_output_mode(
mode: Literal["overwrite", "append", "error", "ignore"],
fs: fsspec.AbstractFileSystem,
path: str,
append_mode_implemented: bool = False,
) → None

Validate and act on the write mode for an output directory.

Modes:

  • “overwrite”: delete existing output_dir recursively if it exists.

  • “append”: no-op here; raises if append is not implemented.

  • “error”: raise FileExistsError if output_dir already exists.

  • “ignore”: no-op.
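The four modes can be sketched against the local filesystem using only the standard library. This is an illustrative analog, not the library's code: the real function operates on an `fsspec.AbstractFileSystem`, while the hypothetical `check_output_mode_local` below uses `os` and `shutil`. The exception type raised for an unimplemented append and the handling of unknown modes are assumptions.

```python
import os
import shutil


def check_output_mode_local(
    mode: str,
    path: str,
    append_mode_implemented: bool = False,
) -> None:
    """Hypothetical local-disk analog of the documented mode handling."""
    if mode == "overwrite":
        # Delete the existing output directory recursively if it exists.
        if os.path.isdir(path):
            shutil.rmtree(path)
    elif mode == "append":
        # No-op, unless the caller does not support appending.
        # Assumption: NotImplementedError is used here for illustration.
        if not append_mode_implemented:
            raise NotImplementedError("append mode is not implemented")
    elif mode == "error":
        if os.path.exists(path):
            raise FileExistsError(f"Output directory already exists: {path}")
    elif mode == "ignore":
        pass
    else:
        # Assumption: unknown modes are rejected explicitly.
        raise ValueError(f"Unknown mode: {mode!r}")
```

Only "overwrite" modifies anything on disk; the other modes either do nothing or raise, which keeps the check safe to run before any data is written.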

utils.file_utils.create_or_overwrite_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None,
) → None

Creates a directory if it does not exist and overwrites it if it does.

Warning: This function will delete all files in the directory if it exists.
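To make the destructive behavior concrete, here is a minimal local-disk sketch. It is an assumption-laden illustration rather than the module's implementation: the hypothetical `create_or_overwrite_dir_local` ignores the `fs` and `storage_options` parameters and works only with the standard library.

```python
import os
import shutil


def create_or_overwrite_dir_local(path: str) -> None:
    """Hypothetical local-disk analog: always ends with an empty directory."""
    if os.path.isdir(path):
        # Everything under the existing directory is removed.
        shutil.rmtree(path)
    os.makedirs(path)
```

Because any existing contents are removed unconditionally, a function like this should only be pointed at directories that hold regenerable output.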

utils.file_utils.delete_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None,
) → None

utils.file_utils.filter_files_by_extension(
files_list: list[str],
keep_extensions: str | list[str],
) → list[str]

utils.file_utils.get_all_file_paths_and_size_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None,
) → list[tuple[str, int]]

Get all file paths and their sizes under a given path.

Args:
    path: The path to get the file paths from.
    recurse_subdirectories: Whether to recurse subdirectories.
    keep_extensions: The extensions to keep.
    storage_options: The storage options to use.
    fs: The filesystem to use.

Returns:
    A list of tuples (file_path, file_size).
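The listing behavior can be approximated locally with `os.walk`. The sketch below is a stand-in under stated assumptions, not the library's implementation: `list_files_with_sizes` is a hypothetical name, it handles only local paths (no `fs` or `storage_options`), and its choices to match extensions case-insensitively, to accept extensions with or without a leading dot, and to return sorted results are assumptions.

```python
import os


def list_files_with_sizes(path, recurse_subdirectories=False, keep_extensions=None):
    """Hypothetical local analog returning (file_path, file_size) tuples."""
    if isinstance(keep_extensions, str):
        keep_extensions = [keep_extensions]
    suffixes = None
    if keep_extensions is not None:
        # Assumption: "jsonl" and ".jsonl" are treated the same, ignoring case.
        suffixes = tuple("." + ext.lstrip(".").lower() for ext in keep_extensions)

    results = []
    for dirpath, dirnames, filenames in os.walk(path):
        if not recurse_subdirectories:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames.clear()
        for name in filenames:
            if suffixes is not None and not name.lower().endswith(suffixes):
                continue
            full_path = os.path.join(dirpath, name)
            results.append((full_path, os.path.getsize(full_path)))
    # Assumption: sorted output, so that results are deterministic.
    return sorted(results)
```

Returning sizes together with paths lets a caller group files into roughly equal-sized partitions without a second pass over the filesystem.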

utils.file_utils.get_all_file_paths_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None,
) → list[str]

Get all file paths under a given path.

Args:
    path: The path to get the file paths from.
    recurse_subdirectories: Whether to recurse subdirectories.
    keep_extensions: The extensions to keep.
    storage_options: The storage options to use.
    fs: The filesystem to use.

Returns:
    A list of file paths.

utils.file_utils.get_fs(
path: str,
storage_options: dict[str, str] | None = None,
) → fsspec.AbstractFileSystem

utils.file_utils.infer_dataset_name_from_path(path: str) → str

Infer a dataset name from a path, handling both local and cloud storage paths.

Args:
    path: Local path or cloud storage URL (e.g. s3://, abfs://).

Returns:
    Inferred dataset name from the path.

utils.file_utils.infer_protocol_from_paths(
paths: collections.abc.Iterable[str],
) → str | None

Infer a protocol from a list of paths, if any.

Returns the first detected protocol scheme (e.g., “s3”, “gcs”, “gs”, “abfs”) or None for local paths.
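The "first detected protocol" rule can be sketched as a single pass over the paths. This is a minimal illustration under assumptions, not the library's implementation: `infer_protocol_sketch` is a hypothetical name, and detecting a scheme by splitting on `://` is an assumed strategy.

```python
from typing import Iterable, Optional


def infer_protocol_sketch(paths: Iterable[str]) -> Optional[str]:
    """Hypothetical stand-in: return the first URL scheme found, else None."""
    for path in paths:
        scheme, separator, _ = path.partition("://")
        # A path counts as remote only if it has a non-empty scheme before "://".
        if separator and scheme:
            return scheme
    return None
```

For example, `infer_protocol_sketch(["/data/a.jsonl", "s3://bucket/b.jsonl"])` skips the local path and reports the scheme of the second entry, while a list of purely local paths yields `None`.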

utils.file_utils.is_not_empty(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None,
) → bool

utils.file_utils.pandas_select_columns(
df: pandas.DataFrame,
columns: list[str] | None,
file_path: str,
) → pandas.DataFrame | None

Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.

Returns the projected DataFrame. If no requested columns exist, returns None.

utils.file_utils.tar_safe_extract(tar: tarfile.TarFile, path: str) → None

Safely extract a tar file, preventing path traversal attacks.

Args:
    tar: The TarFile object to extract.
    path: The destination path for extraction.

Raises:
    ValueError: If any member has an unsafe path.
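A common way to implement this kind of check is to resolve each member's target path and confirm that it stays inside the destination directory before extracting anything. The sketch below shows that technique with the standard library only. It is an illustration under assumptions, not the module's actual code: `tar_safe_extract_sketch` is a hypothetical name, and the check covers member names only, not symlink or hardlink targets.

```python
import os
import tarfile


def tar_safe_extract_sketch(tar: tarfile.TarFile, path: str) -> None:
    """Hypothetical stand-in: reject members that would land outside `path`."""
    destination = os.path.realpath(path)
    for member in tar.getmembers():
        # Absolute names and ".." components resolve outside the destination.
        target = os.path.realpath(os.path.join(destination, member.name))
        if os.path.commonpath([destination, target]) != destination:
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    # All members were validated first, so nothing is written on failure.
    if hasattr(tarfile, "data_filter"):
        # Newer Python versions provide extraction filters as an extra safeguard.
        tar.extractall(destination, filter="data")
    else:
        tar.extractall(destination)
```

Validating every member before the first write means a malicious archive leaves the destination untouched instead of partially extracted.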