nemo_curator.utils.file_utils

Module Contents

Functions

Name                               Description
_gather_extention                  Gather the extension of a given path.
_gather_file_records               Gather file records from a given path.
_is_safe_path                      Check if a path is safe for extraction (no path traversal).
_split_files_as_per_blocksize      -
check_disallowed_kwargs            Check if any of the disallowed keys are in provided kwargs.
check_output_mode                  Validate and act on the write mode for an output directory.
create_or_overwrite_dir            Creates a directory if it does not exist and overwrites it if it does.
delete_dir                         -
filter_files_by_extension          -
get_all_file_paths_and_size_under  Get all file paths and their sizes under a given path.
get_all_file_paths_under           Get all file paths under a given path.
get_fs                             -
infer_dataset_name_from_path       Infer a dataset name from a path, handling both local and cloud storage paths.
infer_protocol_from_paths          Infer a protocol from a list of paths, if any.
is_not_empty                       -
pandas_select_columns              Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.
tar_safe_extract                   Safely extract a tar file, preventing path traversal attacks.

Data

FILETYPE_TO_DEFAULT_EXTENSIONS

API

nemo_curator.utils.file_utils._gather_extention(
path: str
) -> str

Gather the extension of a given path.

Args:

  • path: The path to get the extension from.

Returns: The extension of the path.

nemo_curator.utils.file_utils._gather_file_records(
path: str,
recurse_subdirectories: bool,
keep_extensions: str | list[str] | None,
storage_options: dict[str, str] | None,
fs: fsspec.AbstractFileSystem | None,
include_size: bool
) -> list[tuple[str, int]]

Gather file records from a given path.

Args:

  • path: The path to get the file paths from.
  • recurse_subdirectories: Whether to recurse subdirectories.
  • keep_extensions: The extensions to keep.
  • storage_options: The storage options to use.
  • fs: The filesystem to use.
  • include_size: Whether to include the size of the files.

Returns: A list of tuples (file_path, file_size).

nemo_curator.utils.file_utils._is_safe_path(
path: str,
base_path: str
) -> bool

Check if a path is safe for extraction (no path traversal).

Parameters:

path
str

The path to check

base_path
str

The base directory for extraction

Returns: bool

True if the path is safe, False otherwise
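
The check described above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the name is_safe_path_sketch and the realpath/commonpath approach are assumptions chosen to show one common way of detecting path traversal.

```python
import os


def is_safe_path_sketch(path: str, base_path: str) -> bool:
    """Return True if `path`, resolved against `base_path`, stays inside it."""
    base = os.path.realpath(base_path)
    # Join first so relative member names are resolved against the base.
    # An absolute `path` replaces the base entirely and is rejected below.
    target = os.path.realpath(os.path.join(base, path))
    # The target is inside the base only if the base is their common prefix.
    return os.path.commonpath([base, target]) == base


print(is_safe_path_sketch("data/file.txt", "/tmp/extract"))
print(is_safe_path_sketch("../../etc/passwd", "/tmp/extract"))
```

On a POSIX system the first call should accept the nested relative path and the second should reject the traversal attempt.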

nemo_curator.utils.file_utils._split_files_as_per_blocksize(
sorted_file_sizes: list[tuple[str, int]],
max_byte_per_chunk: int
) -> list[list[str]]

nemo_curator.utils.file_utils.check_disallowed_kwargs(
kwargs: dict,
disallowed_keys: list[str],
raise_error: bool = True
) -> None

Check if any of the disallowed keys are in the provided kwargs. Used for read/write kwargs in stages.

Args:

  • kwargs: The dictionary to check.
  • disallowed_keys: The keys that are not allowed.
  • raise_error: Whether to raise an error if any of the disallowed keys are in the kwargs.

Raises:

  • ValueError: If any of the disallowed keys are in the kwargs and raise_error is True.
  • Warning: If any of the disallowed keys are in the kwargs and raise_error is False.

Returns: None
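
A minimal sketch of the documented behaviour follows, assuming the warning is issued through the standard warnings module. The function name check_disallowed_kwargs_sketch and the message text are hypothetical; only the raise-or-warn contract comes from the description above.

```python
import warnings


def check_disallowed_kwargs_sketch(
    kwargs: dict, disallowed_keys: list[str], raise_error: bool = True
) -> None:
    # Collect every disallowed key that the caller actually passed.
    found = [key for key in disallowed_keys if key in kwargs]
    if not found:
        return
    message = f"Unsupported keys in kwargs: {', '.join(found)}"  # assumed wording
    if raise_error:
        raise ValueError(message)
    # Assumption: the non-raising path emits a standard Python warning.
    warnings.warn(message, stacklevel=2)


# Hypothetical read kwargs for a stage that manages "columns" itself.
read_kwargs = {"lines": True, "columns": ["text"]}
try:
    check_disallowed_kwargs_sketch(read_kwargs, ["columns"])
except ValueError as err:
    print(f"rejected: {err}")
```

Passing raise_error=False would downgrade the same condition to a warning, which suits callers that prefer to ignore the offending keys rather than fail.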

nemo_curator.utils.file_utils.check_output_mode(
mode: typing.Literal['overwrite', 'append', 'error', 'ignore'],
fs: fsspec.AbstractFileSystem,
path: str,
append_mode_implemented: bool = False
) -> None

Validate and act on the write mode for an output directory.

Modes:

  • “overwrite”: delete existing output_dir recursively if it exists.
  • “append”: no-op here; raises if append is not implemented.
  • “error”: raise FileExistsError if output_dir already exists.
  • “ignore”: no-op.
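
The four modes listed above can be sketched against the local filesystem, with os and shutil standing in for the fsspec filesystem argument. The name check_output_mode_sketch, the NotImplementedError for unsupported append, and the ValueError for unknown modes are assumptions; the source only states that append "raises" when it is not implemented.

```python
import os
import shutil
import tempfile
from typing import Literal


def check_output_mode_sketch(
    mode: Literal["overwrite", "append", "error", "ignore"],
    path: str,
    append_mode_implemented: bool = False,
) -> None:
    if mode == "overwrite":
        # Remove any existing output directory recursively.
        if os.path.isdir(path):
            shutil.rmtree(path)
    elif mode == "append":
        # Nothing to do here, but refuse if the writer cannot append.
        if not append_mode_implemented:
            raise NotImplementedError("append mode is not implemented")  # assumed type
    elif mode == "error":
        if os.path.exists(path):
            raise FileExistsError(f"Output directory {path} already exists")
    elif mode == "ignore":
        pass
    else:
        raise ValueError(f"Invalid mode: {mode}")  # assumed handling


out_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out_dir)
check_output_mode_sketch("ignore", out_dir)     # directory is left untouched
check_output_mode_sketch("overwrite", out_dir)  # directory is removed
print(os.path.exists(out_dir))
```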

nemo_curator.utils.file_utils.create_or_overwrite_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> None

Creates a directory if it does not exist and overwrites it if it does.

Warning: This function will delete all files in the directory if it exists.

nemo_curator.utils.file_utils.delete_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> None

nemo_curator.utils.file_utils.filter_files_by_extension(
files_list: list[str],
keep_extensions: str | list[str]
) -> list[str]

nemo_curator.utils.file_utils.get_all_file_paths_and_size_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None
) -> list[tuple[str, int]]

Get all file paths and their sizes under a given path.

Args:

  • path: The path to get the file paths from.
  • recurse_subdirectories: Whether to recurse subdirectories.
  • keep_extensions: The extensions to keep.
  • storage_options: The storage options to use.
  • fs: The filesystem to use.

Returns: A list of tuples (file_path, file_size).
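
To make the shape of the result concrete, here is a local-only sketch built on os.walk. It is an illustration, not the library's code: the name list_files_with_size_sketch, the sorted output, and the normalisation of extensions with or without a leading dot are all assumptions, and the storage_options and fs parameters are left out.

```python
import os
import tempfile
from typing import Optional, Union


def list_files_with_size_sketch(
    path: str,
    recurse_subdirectories: bool = False,
    keep_extensions: Optional[Union[str, list[str]]] = None,
) -> list[tuple[str, int]]:
    if isinstance(keep_extensions, str):
        keep_extensions = [keep_extensions]
    suffixes = None
    if keep_extensions is not None:
        # Assumption: "jsonl" and ".jsonl" are treated the same way.
        suffixes = tuple("." + ext.lstrip(".").lower() for ext in keep_extensions)

    records = []
    for root, dirs, files in os.walk(path):
        if not recurse_subdirectories:
            dirs.clear()  # stop os.walk from descending any further
        for name in files:
            if suffixes is None or name.lower().endswith(suffixes):
                full_path = os.path.join(root, name)
                records.append((full_path, os.path.getsize(full_path)))
    return sorted(records)  # assumed ordering, for reproducible output


demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "part-0.jsonl"), "wb") as handle:
    handle.write(b"{}\n")
for file_path, file_size in list_files_with_size_sketch(demo_root, keep_extensions="jsonl"):
    print(os.path.basename(file_path), file_size)
```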

nemo_curator.utils.file_utils.get_all_file_paths_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None
) -> list[str]

Get all file paths under a given path.

Args:

  • path: The path to get the file paths from.
  • recurse_subdirectories: Whether to recurse subdirectories.
  • keep_extensions: The extensions to keep.
  • storage_options: The storage options to use.
  • fs: The filesystem to use.

Returns: A list of file paths.

nemo_curator.utils.file_utils.get_fs(
path: str,
storage_options: dict[str, str] | None = None
) -> fsspec.AbstractFileSystem

nemo_curator.utils.file_utils.infer_dataset_name_from_path(
path: str
) -> str

Infer a dataset name from a path, handling both local and cloud storage paths.

Args:

  • path: Local path or cloud storage URL (e.g. s3://, abfs://).

Returns: Inferred dataset name from the path.

nemo_curator.utils.file_utils.infer_protocol_from_paths(
paths: collections.abc.Iterable[str]
) -> str | None

Infer a protocol from a list of paths, if any.

Returns the first detected protocol scheme (e.g., “s3”, “gcs”, “gs”, “abfs”) or None for local paths.
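
A small sketch of this behaviour, assuming a protocol is whatever precedes "://" in a path. The name infer_protocol_sketch is hypothetical, and scanning past local paths until some path carries a scheme is one possible reading of "first detected", not a confirmed detail of the library.

```python
from collections.abc import Iterable
from typing import Optional


def infer_protocol_sketch(paths: Iterable[str]) -> Optional[str]:
    for path in paths:
        scheme, separator, _ = path.partition("://")
        # A path counts as remote only if it has a non-empty scheme prefix.
        if separator and scheme:
            return scheme
    return None  # every path looked local


print(infer_protocol_sketch(["s3://bucket/a.jsonl", "s3://bucket/b.jsonl"]))
print(infer_protocol_sketch(["/data/a.jsonl"]))
```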

nemo_curator.utils.file_utils.is_not_empty(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> bool

nemo_curator.utils.file_utils.pandas_select_columns(
df: pandas.DataFrame,
columns: list[str] | None,
file_path: str
) -> pandas.DataFrame | None

Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.

Returns the projected DataFrame. If no requested columns exist, returns None.

nemo_curator.utils.file_utils.tar_safe_extract(
tar: tarfile.TarFile,
path: str
) -> None

Safely extract a tar file, preventing path traversal attacks.

Parameters:

tar
tarfile.TarFile

The TarFile object to extract

path
str

The destination path for extraction

Raises:

  • ValueError: If any member has an unsafe path
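
The validate-then-extract pattern described above can be sketched with the standard tarfile module. The names tar_safe_extract_sketch and build_tar are hypothetical, the error message is assumed, and this sketch checks member names only; it does not inspect symlink or hardlink targets.

```python
import io
import os
import tarfile
import tempfile


def tar_safe_extract_sketch(tar: tarfile.TarFile, path: str) -> None:
    base = os.path.realpath(path)
    # Validate every member before extracting anything.
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(base, member.name))
        if os.path.commonpath([base, target]) != base:
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)


def build_tar(names: list[str]) -> bytes:
    """Build a small in-memory archive with one regular file per name (demo helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            payload = b"hello"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


destination = tempfile.mkdtemp()
with tarfile.open(fileobj=io.BytesIO(build_tar(["../evil.txt"]))) as archive:
    try:
        tar_safe_extract_sketch(archive, destination)
    except ValueError as err:
        print(f"rejected: {err}")
```

Checking all members up front means a malicious archive is rejected before any file is written, rather than partway through extraction.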

nemo_curator.utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS = {'parquet': ['.parquet'], 'jsonl': ['.jsonl', '.json'], 'megatron': ['.bin', '.i...