nemo_curator.utils.file_utils
Module Contents
Functions
Data
FILETYPE_TO_DEFAULT_EXTENSIONS
API
Gather the extension of a given path.
Args:
    path: The path to get the extension from.
Returns:
    The extension of the path.
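A minimal sketch of extension gathering, using only the standard library. The function name `gather_extension_sketch` is hypothetical, and returning the extension lowercased without its leading dot is an assumption; the library's own helper may normalize differently.

```python
import os


def gather_extension_sketch(path: str) -> str:
    """Return the extension of `path` (hypothetical helper, not the library's code).

    Assumption: the extension is returned lowercased and without the leading dot.
    """
    # os.path.splitext splits on the last dot of the final path component,
    # so "b.tar.gz" yields ".gz" and a name without a dot yields "".
    _, ext = os.path.splitext(path)
    return ext.lstrip(".").lower()
```

Under these assumptions, `gather_extension_sketch("data/part-0.JSONL")` gives `"jsonl"`, and a path with no extension gives an empty string.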
Gather file records from a given path.
Args:
    path: The path to get the file paths from.
    recurse_subdirectories: Whether to recurse subdirectories.
    keep_extensions: The extensions to keep.
    storage_options: The storage options to use.
    fs: The filesystem to use.
    include_size: Whether to include the size of the files.
Returns:
    A list of tuples (file_path, file_size).
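The record-gathering behaviour described above can be approximated for local directories with `os.walk`. This is a sketch under stated assumptions, not the library's implementation: the name `gather_file_records_local` is hypothetical, `storage_options` and `fs` are omitted because only the local filesystem is handled, and returning `None` as the size when `include_size` is false is a guess.

```python
import os


def gather_file_records_local(path, recurse_subdirectories=True,
                              keep_extensions=None, include_size=True):
    """Local-only stand-in that returns a sorted list of (file_path, file_size)."""
    keep = None
    if keep_extensions is not None:
        if isinstance(keep_extensions, str):
            keep_extensions = [keep_extensions]
        # Normalize so that "jsonl", ".jsonl" and "JSONL" all match.
        keep = {e.lstrip(".").lower() for e in keep_extensions}

    records = []
    for root, dirs, files in os.walk(path):
        if not recurse_subdirectories:
            dirs.clear()  # emptying dirs in place stops os.walk from descending
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if keep is not None and ext not in keep:
                continue
            full_path = os.path.join(root, name)
            # Assumption: size is None when the caller does not ask for it.
            size = os.path.getsize(full_path) if include_size else None
            records.append((full_path, size))
    return sorted(records, key=lambda record: record[0])
```

Sorting by path keeps the output deterministic, which matters when the file list is later split into tasks.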
Check if a path is safe for extraction (no path traversal).
Parameters:
The path to check
The base directory for extraction
Returns: bool
True if the path is safe, False otherwise
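One common way to implement such a check is to resolve both paths and confirm that the target stays under the base directory. The sketch below illustrates that technique; the name `is_safe_path_sketch` is hypothetical and the library's own check may differ in detail.

```python
import os


def is_safe_path_sketch(path: str, base_dir: str) -> bool:
    """Return True if `path`, joined onto `base_dir`, stays inside `base_dir`."""
    base = os.path.realpath(base_dir)
    # os.path.join discards `base` when `path` is absolute, so absolute member
    # names resolve outside the base directory and are rejected below.
    target = os.path.realpath(os.path.join(base, path))
    # commonpath compares whole components, so "/out2" is not treated as
    # being inside "/out" (a plain startswith check would get this wrong).
    return os.path.commonpath([base, target]) == base
```

Resolving with `realpath` before comparing also collapses `..` segments and follows symlinks that already exist on disk.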
Check if any of the disallowed keys are in the provided kwargs. Used for read/write kwargs in stages.
Args:
    kwargs: The dictionary to check.
    disallowed_keys: The keys that are not allowed.
    raise_error: Whether to raise an error if any of the disallowed keys are in the kwargs.
Raises:
    ValueError: If any of the disallowed keys are in the kwargs and raise_error is True.
    Warning: If any of the disallowed keys are in the kwargs and raise_error is False.
Returns:
    None
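The raise-or-warn behaviour can be sketched as follows. The name `check_kwargs_sketch` is hypothetical, and emitting the warning through the standard `warnings` module is an assumption; the library may report it through a logger instead.

```python
import warnings


def check_kwargs_sketch(kwargs, disallowed_keys, raise_error=True):
    """Raise or warn when `kwargs` contains any key from `disallowed_keys`."""
    found = sorted(set(kwargs) & set(disallowed_keys))
    if not found:
        return
    message = f"Unsupported keys in kwargs: {', '.join(found)}"
    if raise_error:
        raise ValueError(message)
    # Assumption: a Python warning stands in for whatever the library emits.
    warnings.warn(message, stacklevel=2)
```

A stage that manages `columns` itself could call `check_kwargs_sketch(read_kwargs, ["columns"])` before forwarding `read_kwargs` to a reader.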
Validate and act on the write mode for an output directory.
Modes:
- “overwrite”: delete existing output_dir recursively if it exists.
- “append”: no-op here; raises if append is not implemented.
- “error”: raise FileExistsError if output_dir already exists.
- “ignore”: no-op.
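The four modes can be sketched for a local output directory as below. This is an illustration only: the name `apply_write_mode_sketch` and the `append_supported` flag are hypothetical, and raising `NotImplementedError` for an unsupported append is an assumption about which exception fits.

```python
import os
import shutil


def apply_write_mode_sketch(mode: str, output_dir: str, append_supported: bool = False) -> None:
    """Validate `mode` and act on a local `output_dir` accordingly."""
    if mode == "overwrite":
        # Delete the existing directory and everything under it.
        if os.path.isdir(output_dir):
            shutil.rmtree(output_dir)
    elif mode == "append":
        # Nothing to do on disk; the writer must support appending.
        if not append_supported:
            raise NotImplementedError("append mode is not implemented")
    elif mode == "error":
        if os.path.exists(output_dir):
            raise FileExistsError(f"Output directory already exists: {output_dir}")
    elif mode == "ignore":
        pass
    else:
        raise ValueError(f"Unknown write mode: {mode!r}")
```

Rejecting unknown modes explicitly avoids silently treating a typo such as "overwrit" as a no-op.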
Creates a directory if it does not exist and overwrites it if it does. Warning: This function will delete all files in the directory if it exists.
Get all file paths and their sizes under a given path.
Args:
    path: The path to get the file paths from.
    recurse_subdirectories: Whether to recurse subdirectories.
    keep_extensions: The extensions to keep.
    storage_options: The storage options to use.
    fs: The filesystem to use.
Returns:
    A list of tuples (file_path, file_size).
Get all file paths under a given path.
Args:
    path: The path to get the file paths from.
    recurse_subdirectories: Whether to recurse subdirectories.
    keep_extensions: The extensions to keep.
    storage_options: The storage options to use.
    fs: The filesystem to use.
Returns:
    A list of file paths.
Infer a dataset name from a path, handling both local and cloud storage paths.
Args:
    path: Local path or cloud storage URL (e.g. s3://, abfs://).
Returns:
    Inferred dataset name from the path.
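The exact inference rule is not stated here, so the following sketch uses one plausible heuristic: strip any URL scheme, take the last path component, drop its extension, and lowercase the result. The name `infer_dataset_name_sketch` and the heuristic itself are assumptions, not the library's documented behaviour.

```python
import posixpath


def infer_dataset_name_sketch(path: str) -> str:
    """Guess a dataset name from a local path or a cloud storage URL."""
    # Drop a scheme such as "s3://" or "abfs://" if one is present.
    _, separator, remainder = path.partition("://")
    location = remainder if separator else path
    # Take the last component, ignoring any trailing slash.
    last = posixpath.basename(location.rstrip("/"))
    stem, _ = posixpath.splitext(last)
    # Assumption: names are lowercased and file extensions are dropped.
    return (stem or last).lower()
```

With this heuristic, a directory URL and a single-file path both reduce to a short lowercase label.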
Infer a protocol from a list of paths, if any.
Returns the first detected protocol scheme (e.g., “s3”, “gcs”, “gs”, “abfs”) or None for local paths.
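A minimal sketch of first-match protocol detection, assuming a protocol is whatever precedes `://` in a path. The name `infer_protocol_sketch` is hypothetical.

```python
from typing import Iterable, Optional


def infer_protocol_sketch(paths: Iterable[str]) -> Optional[str]:
    """Return the scheme of the first path that has one, or None if all are local."""
    for path in paths:
        scheme, separator, _ = path.partition("://")
        # A non-empty scheme followed by "://" marks a remote path.
        if separator and scheme:
            return scheme
    return None
```

For example, `infer_protocol_sketch(["s3://bucket/a.jsonl"])` yields `"s3"`, while a list of plain local paths yields `None`.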
Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.
Returns the projected DataFrame. If no requested columns exist, returns None.
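The projection can be sketched with plain pandas indexing. The name `select_existing_columns_sketch` is hypothetical, and logging through the standard `logging` module is an assumption about how the warning is reported.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def select_existing_columns_sketch(df: pd.DataFrame, columns):
    """Keep only the requested columns that exist in `df`; None if none exist."""
    existing = [column for column in columns if column in df.columns]
    missing = [column for column in columns if column not in df.columns]
    if missing:
        # Assumption: one warning listing every missing column.
        logger.warning("Requested columns not found: %s", missing)
    if not existing:
        return None
    # Indexing with a list returns a DataFrame and preserves the requested order.
    return df[existing]
```

Returning `None` rather than an empty DataFrame lets a caller skip a file that contains none of the requested fields.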
Safely extract a tar file, preventing path traversal attacks.
Parameters:
The TarFile object to extract
The destination path for extraction
Raises:
ValueError: If any member has an unsafe path
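A sketch of the validate-then-extract pattern described above, using the standard `tarfile` module. The name `safe_extract_sketch` is hypothetical. It checks member names only and does not inspect symlink or hardlink targets, which a hardened implementation would also need to consider.

```python
import os
import tarfile


def safe_extract_sketch(tar: tarfile.TarFile, dest: str) -> None:
    """Extract `tar` into `dest`, refusing members that would escape `dest`."""
    base = os.path.realpath(dest)
    members = tar.getmembers()
    # Validate every member before writing anything, so a bad archive
    # leaves the destination untouched.
    for member in members:
        target = os.path.realpath(os.path.join(base, member.name))
        if os.path.commonpath([base, target]) != base:
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(base, members=members)
```

A caller would open the archive with `tarfile.open(...)` in a `with` block and pass the resulting object together with the destination directory.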