---
layout: overview
slug: nemo-curator/nemo_curator/utils/file_utils
title: nemo_curator.utils.file_utils
---

## Module Contents

### Functions

| Name                                                                                                    | Description                                                                          |
| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`_gather_extention`](#nemo_curator-utils-file_utils-_gather_extention)                                 | Gather the extension of a given path.                                                |
| [`_gather_file_records`](#nemo_curator-utils-file_utils-_gather_file_records)                           | Gather file records from a given path.                                               |
| [`_is_safe_path`](#nemo_curator-utils-file_utils-_is_safe_path)                                         | Check if a path is safe for extraction (no path traversal).                          |
| [`_split_files_as_per_blocksize`](#nemo_curator-utils-file_utils-_split_files_as_per_blocksize)         | -                                                                                    |
| [`check_disallowed_kwargs`](#nemo_curator-utils-file_utils-check_disallowed_kwargs)                     | Check if any of the disallowed keys are in provided kwargs.                          |
| [`check_output_mode`](#nemo_curator-utils-file_utils-check_output_mode)                                 | Validate and act on the write mode for an output directory.                          |
| [`create_or_overwrite_dir`](#nemo_curator-utils-file_utils-create_or_overwrite_dir)                     | Creates a directory if it does not exist and overwrites it if it does.               |
| [`delete_dir`](#nemo_curator-utils-file_utils-delete_dir)                                               | -                                                                                    |
| [`filter_files_by_extension`](#nemo_curator-utils-file_utils-filter_files_by_extension)                 | -                                                                                    |
| [`get_all_file_paths_and_size_under`](#nemo_curator-utils-file_utils-get_all_file_paths_and_size_under) | Get all file paths and their sizes under a given path.                               |
| [`get_all_file_paths_under`](#nemo_curator-utils-file_utils-get_all_file_paths_under)                   | Get all file paths under a given path.                                               |
| [`get_fs`](#nemo_curator-utils-file_utils-get_fs)                                                       | -                                                                                    |
| [`infer_dataset_name_from_path`](#nemo_curator-utils-file_utils-infer_dataset_name_from_path)           | Infer a dataset name from a path, handling both local and cloud storage paths.       |
| [`infer_protocol_from_paths`](#nemo_curator-utils-file_utils-infer_protocol_from_paths)                 | Infer a protocol from a list of paths, if any.                                       |
| [`is_not_empty`](#nemo_curator-utils-file_utils-is_not_empty)                                           | -                                                                                    |
| [`pandas_select_columns`](#nemo_curator-utils-file_utils-pandas_select_columns)                         | Project a Pandas DataFrame onto existing columns, logging warnings for missing ones. |
| [`tar_safe_extract`](#nemo_curator-utils-file_utils-tar_safe_extract)                                   | Safely extract a tar file, preventing path traversal attacks.                        |

### Data

[`FILETYPE_TO_DEFAULT_EXTENSIONS`](#nemo_curator-utils-file_utils-FILETYPE_TO_DEFAULT_EXTENSIONS)

### API

<Anchor id="nemo_curator-utils-file_utils-_gather_extention">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils._gather_extention(
        path: str
    ) -> str
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Gather the extension of a given path.

  **Parameters:**

  <ParamField path="path" type="str">
    The path to get the extension from.
  </ParamField>

  **Returns:** `str`

  The extension of the path.
</Indent>

<Anchor id="nemo_curator-utils-file_utils-_gather_file_records">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils._gather_file_records(
        path: str,
        recurse_subdirectories: bool,
        keep_extensions: str | list[str] | None,
        storage_options: dict[str, str] | None,
        fs: fsspec.AbstractFileSystem | None,
        include_size: bool
    ) -> list[tuple[str, int]]
    ```
  </CodeBlock>
</Anchor>

<Indent>
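  Gather file records from a given path.

  **Parameters:**

  <ParamField path="path" type="str">
    The path to get the file paths from.
  </ParamField>

  <ParamField path="recurse_subdirectories" type="bool">
    Whether to recurse subdirectories.
  </ParamField>

  <ParamField path="keep_extensions" type="str | list[str] | None">
    The extensions to keep.
  </ParamField>

  <ParamField path="storage_options" type="dict[str, str] | None">
    The storage options to use.
  </ParamField>

  <ParamField path="fs" type="fsspec.AbstractFileSystem | None">
    The filesystem to use.
  </ParamField>

  <ParamField path="include_size" type="bool">
    Whether to include the size of the files.
  </ParamField>

  **Returns:** `list[tuple[str, int]]`

  A list of tuples (file\_path, file\_size).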
</Indent>

<Anchor id="nemo_curator-utils-file_utils-_is_safe_path">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils._is_safe_path(
        path: str,
        base_path: str
    ) -> bool
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Check if a path is safe for extraction (no path traversal).

  **Parameters:**

  <ParamField path="path" type="str">
    The path to check
  </ParamField>

  <ParamField path="base_path" type="str">
    The base directory for extraction
  </ParamField>

  **Returns:** `bool`

  True if the path is safe, False otherwise
</Indent>
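The check described above can be sketched with the standard library alone. This is a minimal illustration of a path traversal check, not the module's actual implementation; the name `is_safe_path_sketch` and the choice of `os.path.realpath` plus `os.path.commonpath` are assumptions made for the example.

```python
import os


def is_safe_path_sketch(path: str, base_path: str) -> bool:
    """Return True if `path`, resolved against `base_path`, stays inside it.

    Hypothetical helper for illustration; not part of nemo_curator.
    """
    base = os.path.realpath(base_path)
    # If `path` is absolute, os.path.join discards `base`; the comparison
    # below then rejects it unless it happens to point inside `base`.
    target = os.path.realpath(os.path.join(base, path))
    # commonpath compares whole path components, so "/data/out2" is not
    # mistaken for a child of "/data/out".
    return os.path.commonpath([base, target]) == base


print(is_safe_path_sketch("docs/readme.txt", "/tmp/extract"))
print(is_safe_path_sketch("../../etc/passwd", "/tmp/extract"))
```

Resolving both paths before comparing them means `..` segments and symbolic links are taken into account, which a plain string prefix test would miss.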

<Anchor id="nemo_curator-utils-file_utils-_split_files_as_per_blocksize">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils._split_files_as_per_blocksize(
        sorted_file_sizes: list[tuple[str, int]],
        max_byte_per_chunk: int
    ) -> list[list[str]]
    ```
  </CodeBlock>
</Anchor>

<Indent />
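This function has no description, so the following is only one plausible reading of its signature: a greedy grouping of size-sorted files into chunks that stay under a byte budget. The function name `split_files_by_blocksize_sketch` and the exact grouping rule are assumptions, not the library's confirmed behavior.

```python
def split_files_by_blocksize_sketch(
    sorted_file_sizes: list[tuple[str, int]],
    max_byte_per_chunk: int,
) -> list[list[str]]:
    """Greedily group (path, size) pairs into chunks under a byte budget.

    Hypothetical illustration; the real grouping rule may differ.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for file_path, file_size in sorted_file_sizes:
        # Close the current chunk when adding this file would exceed the budget.
        if current and current_size + file_size > max_byte_per_chunk:
            chunks.append(current)
            current, current_size = [], 0
        # A file larger than the budget still gets a chunk of its own.
        current.append(file_path)
        current_size += file_size
    if current:
        chunks.append(current)
    return chunks


files = [("a.jsonl", 40), ("b.jsonl", 40), ("c.jsonl", 40), ("d.jsonl", 200)]
print(split_files_by_blocksize_sketch(files, max_byte_per_chunk=100))
```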

<Anchor id="nemo_curator-utils-file_utils-check_disallowed_kwargs">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.check_disallowed_kwargs(
        kwargs: dict,
        disallowed_keys: list[str],
        raise_error: bool = True
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Check whether any of the disallowed keys are present in the provided kwargs.
  Used for read/write kwargs in stages.

  **Parameters:**

  <ParamField path="kwargs" type="dict">
    The dictionary to check.
  </ParamField>

  <ParamField path="disallowed_keys" type="list[str]">
    The keys that are not allowed.
  </ParamField>

  <ParamField path="raise_error" type="bool">
    Whether to raise an error if any of the disallowed keys are in the kwargs.
  </ParamField>

  **Returns:** `None`

  **Raises:**

  * `ValueError`: If any of the disallowed keys are in the kwargs and `raise_error` is True.

  If `raise_error` is False, a warning is issued instead of an error.
</Indent>
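The raise-or-warn behavior described above can be sketched as follows. This is a standalone illustration; the name `check_disallowed_kwargs_sketch`, the message text, and the use of `warnings.warn` (rather than a logger) are assumptions.

```python
import warnings


def check_disallowed_kwargs_sketch(
    kwargs: dict,
    disallowed_keys: list[str],
    raise_error: bool = True,
) -> None:
    """Raise or warn when `kwargs` contains any key from `disallowed_keys`.

    Hypothetical illustration; not the nemo_curator implementation.
    """
    found = [key for key in disallowed_keys if key in kwargs]
    if not found:
        return
    msg = f"Unsupported keys in kwargs: {', '.join(found)}"
    if raise_error:
        raise ValueError(msg)
    # Assumption: the non-raising path surfaces the problem as a Python warning.
    warnings.warn(msg, stacklevel=2)


# A stage that sets `columns` itself might reject a user-supplied `columns`.
read_kwargs = {"engine": "pyarrow"}
check_disallowed_kwargs_sketch(read_kwargs, disallowed_keys=["columns"])
```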

<Anchor id="nemo_curator-utils-file_utils-check_output_mode">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.check_output_mode(
        mode: typing.Literal['overwrite', 'append', 'error', 'ignore'],
        fs: fsspec.AbstractFileSystem,
        path: str,
        append_mode_implemented: bool = False
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Validate and act on the write mode for an output directory.

  Modes:

  * "overwrite": delete the existing directory at `path` recursively if it exists.
  * "append": no-op here; raises if append is not implemented.
  * "error": raise FileExistsError if the directory at `path` already exists.
  * "ignore": no-op.
</Indent>
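The four modes listed above can be sketched for a local directory using only the standard library. The documented function takes an `fsspec` filesystem; this sketch swaps in `pathlib` and `shutil` to stay self-contained. The name `check_output_mode_local` and the `NotImplementedError` raised for unsupported append are assumptions, since the source only says that it raises.

```python
import shutil
import tempfile
from pathlib import Path


def check_output_mode_local(
    mode: str,
    path: str,
    append_mode_implemented: bool = False,
) -> None:
    """Local-filesystem analogue of the mode handling described above.

    Hypothetical illustration; not the nemo_curator implementation.
    """
    target = Path(path)
    if mode == "overwrite":
        if target.exists():
            shutil.rmtree(target)  # recursive delete
    elif mode == "append":
        if not append_mode_implemented:
            # Assumption: the exact exception type is not stated in the docs.
            raise NotImplementedError("append mode is not implemented")
    elif mode == "error":
        if target.exists():
            raise FileExistsError(f"Output directory already exists: {path}")
    elif mode == "ignore":
        pass
    else:
        raise ValueError(f"Unknown mode: {mode!r}")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    out.mkdir()
    check_output_mode_local("ignore", str(out))     # directory left untouched
    check_output_mode_local("overwrite", str(out))  # directory removed
    print(out.exists())
```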

<Anchor id="nemo_curator-utils-file_utils-create_or_overwrite_dir">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.create_or_overwrite_dir(
        path: str,
        fs: fsspec.AbstractFileSystem | None = None,
        storage_options: dict[str, str] | None = None
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Creates a directory if it does not exist and overwrites it if it does.

  Warning: This function will delete all files in the directory if it exists.
</Indent>

<Anchor id="nemo_curator-utils-file_utils-delete_dir">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.delete_dir(
        path: str,
        fs: fsspec.AbstractFileSystem | None = None,
        storage_options: dict[str, str] | None = None
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-file_utils-filter_files_by_extension">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.filter_files_by_extension(
        files_list: list[str],
        keep_extensions: str | list[str]
    ) -> list[str]
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-file_utils-get_all_file_paths_and_size_under">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.get_all_file_paths_and_size_under(
        path: str,
        recurse_subdirectories: bool = False,
        keep_extensions: str | list[str] | None = None,
        storage_options: dict[str, str] | None = None,
        fs: fsspec.AbstractFileSystem | None = None
    ) -> list[tuple[str, int]]
    ```
  </CodeBlock>
</Anchor>

<Indent>
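  Get all file paths and their sizes under a given path.

  **Parameters:**

  <ParamField path="path" type="str">
    The path to get the file paths from.
  </ParamField>

  <ParamField path="recurse_subdirectories" type="bool">
    Whether to recurse subdirectories.
  </ParamField>

  <ParamField path="keep_extensions" type="str | list[str] | None">
    The extensions to keep.
  </ParamField>

  <ParamField path="storage_options" type="dict[str, str] | None">
    The storage options to use.
  </ParamField>

  <ParamField path="fs" type="fsspec.AbstractFileSystem | None">
    The filesystem to use.
  </ParamField>

  **Returns:** `list[tuple[str, int]]`

  A list of tuples (file\_path, file\_size).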
</Indent>

<Anchor id="nemo_curator-utils-file_utils-get_all_file_paths_under">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.get_all_file_paths_under(
        path: str,
        recurse_subdirectories: bool = False,
        keep_extensions: str | list[str] | None = None,
        storage_options: dict[str, str] | None = None,
        fs: fsspec.AbstractFileSystem | None = None
    ) -> list[str]
    ```
  </CodeBlock>
</Anchor>

<Indent>
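  Get all file paths under a given path.

  **Parameters:**

  <ParamField path="path" type="str">
    The path to get the file paths from.
  </ParamField>

  <ParamField path="recurse_subdirectories" type="bool">
    Whether to recurse subdirectories.
  </ParamField>

  <ParamField path="keep_extensions" type="str | list[str] | None">
    The extensions to keep.
  </ParamField>

  <ParamField path="storage_options" type="dict[str, str] | None">
    The storage options to use.
  </ParamField>

  <ParamField path="fs" type="fsspec.AbstractFileSystem | None">
    The filesystem to use.
  </ParamField>

  **Returns:** `list[str]`

  A list of file paths.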
</Indent>
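A local-only sketch of the listing behavior described above, using `os.walk` in place of an `fsspec` filesystem. The name `list_files_under_sketch`, the case-insensitive extension match, and the sorted output are assumptions made for the example.

```python
import os
import tempfile
from typing import Optional, Union


def list_files_under_sketch(
    path: str,
    recurse_subdirectories: bool = False,
    keep_extensions: Optional[Union[str, list]] = None,
) -> list:
    """List files under `path`, optionally recursing and filtering by extension.

    Hypothetical illustration; not the nemo_curator implementation.
    """
    if isinstance(keep_extensions, str):
        keep_extensions = [keep_extensions]
    suffixes = None
    if keep_extensions is not None:
        # Accept both "jsonl" and ".jsonl" by normalizing to a dotted suffix.
        suffixes = tuple("." + ext.lstrip(".").lower() for ext in keep_extensions)
    results = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if suffixes is None or name.lower().endswith(suffixes):
                results.append(os.path.join(root, name))
        if not recurse_subdirectories:
            dirs.clear()  # emptying dirs in place stops os.walk from descending
    return sorted(results)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.jsonl", "b.txt", os.path.join("sub", "c.jsonl")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("x")
    print(len(list_files_under_sketch(tmp, keep_extensions="jsonl")))
    print(len(list_files_under_sketch(tmp, recurse_subdirectories=True)))
```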

<Anchor id="nemo_curator-utils-file_utils-get_fs">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.get_fs(
        path: str,
        storage_options: dict[str, str] | None = None
    ) -> fsspec.AbstractFileSystem
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-file_utils-infer_dataset_name_from_path">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.infer_dataset_name_from_path(
        path: str
    ) -> str
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Infer a dataset name from a path, handling both local and cloud storage paths.

  **Parameters:**

  <ParamField path="path" type="str">
    Local path or cloud storage URL (e.g. `s3://`, `abfs://`)
  </ParamField>

  **Returns:** `str`

  Inferred dataset name from the path
</Indent>

<Anchor id="nemo_curator-utils-file_utils-infer_protocol_from_paths">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.infer_protocol_from_paths(
        paths: collections.abc.Iterable[str]
    ) -> str | None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Infer a protocol from a list of paths, if any.

  Returns the first detected protocol scheme (e.g., "s3", "gcs", "gs", "abfs")
  or None for local paths.
</Indent>
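The "first detected protocol, else None" rule described above can be sketched in a few lines. The name `infer_protocol_sketch` and the simple `://` split are assumptions; the real function may rely on `fsspec` utilities instead.

```python
from collections.abc import Iterable
from typing import Optional


def infer_protocol_sketch(paths: Iterable[str]) -> Optional[str]:
    """Return the scheme of the first path that has one, or None.

    Hypothetical illustration; not the nemo_curator implementation.
    """
    for path in paths:
        if "://" in path:
            # Everything before the first "://" is treated as the protocol.
            return path.split("://", 1)[0]
    return None


print(infer_protocol_sketch(["/data/a.jsonl", "s3://bucket/b.jsonl"]))
print(infer_protocol_sketch(["/data/a.jsonl"]))
```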

<Anchor id="nemo_curator-utils-file_utils-is_not_empty">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.is_not_empty(
        path: str,
        fs: fsspec.AbstractFileSystem | None = None,
        storage_options: dict[str, str] | None = None
    ) -> bool
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-file_utils-pandas_select_columns">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.pandas_select_columns(
        df: pandas.DataFrame,
        columns: list[str] | None,
        file_path: str
    ) -> pandas.DataFrame | None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.

  Returns the projected DataFrame. If no requested columns exist, returns None.
</Indent>

<Anchor id="nemo_curator-utils-file_utils-tar_safe_extract">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.tar_safe_extract(
        tar: tarfile.TarFile,
        path: str
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Safely extract a tar file, preventing path traversal attacks.

  **Parameters:**

  <ParamField path="tar" type="tarfile.TarFile">
    The TarFile object to extract
  </ParamField>

  <ParamField path="path" type="str">
    The destination path for extraction
  </ParamField>

  **Raises:**

  * `ValueError`: If any member has an unsafe path
</Indent>
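A minimal sketch of the validate-then-extract pattern described above, using only the standard library. The names `tar_safe_extract_sketch` and `make_demo_tar` are hypothetical, and checking every member before extracting anything is an assumption about ordering rather than a documented guarantee.

```python
import io
import os
import tarfile
import tempfile


def tar_safe_extract_sketch(tar: tarfile.TarFile, path: str) -> None:
    """Extract `tar` into `path`, refusing archives with escaping members.

    Hypothetical illustration; not the nemo_curator implementation.
    """
    base = os.path.realpath(path)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(base, member.name))
        # Reject any member whose resolved location falls outside `base`.
        if os.path.commonpath([base, target]) != base:
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)


def make_demo_tar(names: list) -> tarfile.TarFile:
    """Build a small in-memory archive with one-byte files (demo helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 1
            archive.addfile(info, io.BytesIO(b"x"))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r")


with tempfile.TemporaryDirectory() as dest:
    tar_safe_extract_sketch(make_demo_tar(["ok/a.txt"]), dest)
    print(os.path.isfile(os.path.join(dest, "ok", "a.txt")))
```

Validating all members up front means a malicious entry late in the archive cannot leave a partially extracted tree behind.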

<Anchor id="nemo_curator-utils-file_utils-FILETYPE_TO_DEFAULT_EXTENSIONS">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS = {'parquet': ['.parquet'], 'jsonl': ['.jsonl', '.json'], 'megatron': ['.bin', '.i...
    ```
  </CodeBlock>
</Anchor>
