utils.file_utils#

Module Contents#

Functions#

expand_outdir_and_mkdir

filter_files_by_extension

Given a list of files, filter it to only include files matching the given extension(s).

get_all_files_paths_under

This function returns a list of all the files under a specified directory.

get_batched_files

This function returns a batch of files that still remain to be processed.

get_remaining_files

This function returns a list of the files that still remain to be read.

merge_counts

mkdir

parse_str_of_num_bytes

remove_path_extension

reshard_jsonl

Reshards a directory of jsonl files to have a new (approximate) file size for each shard.

separate_by_metadata

Saves the dataframe to subfolders named after a metadata value.

write_dataframe_by_meta

write_record

Data#

API#

utils.file_utils.NEMO_CURATOR_HOME#

‘get(…)’

utils.file_utils.expand_outdir_and_mkdir(outdir: str) → str#
utils.file_utils.filter_files_by_extension(
files_list: list[str],
keep_extensions: str | list[str],
) → list[str]#

Given a list of files, filter it to only include files matching the given extension(s).

Args:
    files_list: List of files.
    keep_extensions: A string (e.g., “json”) or a list of strings (e.g., [“json”, “parquet”]) representing which file types to keep from files_list.
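
A minimal usage sketch (not part of the original docstring), assuming the module is importable under the utils.file_utils path shown here; the file names are illustrative:

from utils.file_utils import filter_files_by_extension

files = ["data/a.jsonl", "data/b.parquet", "data/readme.txt"]

# Keep only the jsonl and parquet files; a single string such as "jsonl" also works.
kept = filter_files_by_extension(files, keep_extensions=["jsonl", "parquet"])
print(kept)  # ['data/a.jsonl', 'data/b.parquet']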

utils.file_utils.get_all_files_paths_under(
root: str,
recurse_subdirectories: bool = True,
followlinks: bool = False,
keep_extensions: str | list[str] | None = None,
) → list[str]#

This function returns a list of all the files under a specified directory.

Args:
    root: The path to the directory to read.
    recurse_subdirectories: Whether to recurse into subdirectories. Please note that this can be slow for a large number of files.
    followlinks: Whether to follow symbolic links.
    keep_extensions: A string or list of strings representing a file type or multiple file types to include in the output, e.g., “jsonl” or [“jsonl”, “parquet”].
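
A hedged example of collecting files recursively; the directory path is illustrative:

from utils.file_utils import get_all_files_paths_under

# Recursively collect every .jsonl file under the dataset root.
# recurse_subdirectories=True can be slow when the tree holds many files.
paths = get_all_files_paths_under(
    "/data/my_dataset",
    recurse_subdirectories=True,
    followlinks=False,
    keep_extensions="jsonl",
)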

utils.file_utils.get_batched_files(
input_file_path: str,
output_file_path: str,
input_file_type: str,
batch_size: int = 64,
) → list[list[str]]#

This function returns a batch of files that still remain to be processed.

Args:
    input_file_path: The path of the input files.
    output_file_path: The path of the output files.
    input_file_type: The type of the input files.
    batch_size: The number of files to be processed at once.

Returns:
    A batch of files that are not in the output directory.
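
A sketch of batch-wise processing, relying on the documented behavior that files already present in the output directory are skipped; process() is a hypothetical user-defined step:

from utils.file_utils import get_batched_files

for batch in get_batched_files(
    input_file_path="/data/input",
    output_file_path="/data/output",
    input_file_type="jsonl",
    batch_size=64,
):
    # Each batch is a list of input file paths not yet present in the output.
    process(batch)  # hypothetical processing function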

utils.file_utils.get_remaining_files(
input_file_path: str,
output_file_path: str,
input_file_type: str,
output_file_type: str | None = None,
num_files: int = -1,
) → list[str]#

This function returns a list of the files that still remain to be read.

Args:
    input_file_path: The path of the input files.
    output_file_path: The path of the output files.
    input_file_type: The type of the input files.
    output_file_type: The type of the output files.
    num_files: The max number of files to be returned. If -1, all files are returned.

Returns:
    A list of files that still remain to be read.
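
A minimal sketch of resuming a partially finished run; the paths are illustrative:

from utils.file_utils import get_remaining_files

remaining = get_remaining_files(
    input_file_path="/data/input",
    output_file_path="/data/output",
    input_file_type="jsonl",
    num_files=-1,  # -1 returns all remaining files
)
print(f"{len(remaining)} files left to process")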

utils.file_utils.merge_counts(first: dict, second: dict) → dict#
utils.file_utils.mkdir(d: str) → None#
utils.file_utils.parse_str_of_num_bytes(s: str, return_str: bool = False) → str | int#
utils.file_utils.remove_path_extension(path: str) → str#
utils.file_utils.reshard_jsonl(
input_dir: str,
output_dir: str,
output_file_size: str = '100M',
start_index: int = 0,
file_prefix: str = '',
) → None#

Reshards a directory of jsonl files to have a new (approximate) file size for each shard.

Args:
    input_dir: The input directory containing jsonl files.
    output_dir: The output directory where the resharded jsonl files will be written.
    output_file_size: Approximate size of output files. Must be specified as a string with the unit K, M, or G for kilobytes, megabytes, or gigabytes.
    start_index: Starting index for naming the output files. Note: the indices may not be contiguous if the sharding process would otherwise produce an empty file in a slot.
    file_prefix: Prefix to prepend to the output file number.
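
A hedged example of resharding to roughly 256 MB shards; the exact output naming is an assumption based on the start_index and file_prefix parameters:

from utils.file_utils import reshard_jsonl

# Rewrite the jsonl shards under /data/raw into ~256 MB files whose names
# combine file_prefix with a running index starting at start_index.
reshard_jsonl(
    input_dir="/data/raw",
    output_dir="/data/resharded",
    output_file_size="256M",
    start_index=0,
    file_prefix="curated_",
)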

utils.file_utils.separate_by_metadata(
input_data: dask.dataframe.DataFrame | str,
output_dir: str,
metadata_field: str,
remove_metadata: bool = False,
output_type: str = 'jsonl',
input_type: str = 'jsonl',
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
filename_col: str = 'file_name',
) → dict#

Saves the dataframe to subfolders named after a metadata value.

Args:
    input_data: Either a DataFrame or a string representing the path to the input directory. If a DataFrame is provided, it must have a filename_col for the shard.
    output_dir: The base directory under which all metadata-based subdirectories will be created.
    metadata_field: The metadata field to split on.
    remove_metadata: Whether to remove the metadata from the dataframe when saving it.
    output_type: File type the dataset will be written to. Supported file formats include “jsonl” (default), “pickle”, or “parquet”.
    include_values: A list of strings representing specific values to be selected or included. If provided, only the items matching these values are kept.
    exclude_values: A list of strings representing specific values to be excluded or ignored. If provided, any items matching these values are skipped.
    filename_col: The column name in the DataFrame that contains the filename. Default is “file_name”.

Returns:
    A delayed dictionary mapping each metadata value to the count of entries with that value.
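
A sketch of splitting a dataset by a metadata field; the “language” field is hypothetical, and the call assumes the returned delayed dictionary can be materialized with Dask’s compute():

from utils.file_utils import separate_by_metadata

counts = separate_by_metadata(
    input_data="/data/all",       # path to a directory of jsonl shards
    output_dir="/data/by_language",
    metadata_field="language",    # hypothetical metadata field
    remove_metadata=True,
    output_type="jsonl",
    input_type="jsonl",
)
# The result is delayed; compute it to get {metadata value: entry count}.
print(counts.compute())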

utils.file_utils.write_dataframe_by_meta(
df: pandas.DataFrame,
output_dir: str,
metadata_field: str,
remove_metadata: bool = False,
output_type: str = 'jsonl',
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
filename_col: str = 'file_name',
) → dict#
utils.file_utils.write_record(
input_dir: str,
file_name: str,
line: str,
field: str,
output_dir: str,
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
) → str | None#