# nemo_curator.utils.file_utils

## Module Contents

### Functions

| Name                                                                                                    | Description                                                                          |
| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`_gather_extention`](#nemo_curator-utils-file_utils-_gather_extention)                                 | Gather the extension of a given path.                                                |
| [`_gather_file_records`](#nemo_curator-utils-file_utils-_gather_file_records)                           | Gather file records from a given path.                                               |
| [`_is_safe_path`](#nemo_curator-utils-file_utils-_is_safe_path)                                         | Check if a path is safe for extraction (no path traversal).                          |
| [`_split_files_as_per_blocksize`](#nemo_curator-utils-file_utils-_split_files_as_per_blocksize)         | -                                                                                    |
| [`check_disallowed_kwargs`](#nemo_curator-utils-file_utils-check_disallowed_kwargs)                     | Check if any of the disallowed keys are in the provided kwargs.                      |
| [`check_output_mode`](#nemo_curator-utils-file_utils-check_output_mode)                                 | Validate and act on the write mode for an output directory.                          |
| [`create_or_overwrite_dir`](#nemo_curator-utils-file_utils-create_or_overwrite_dir)                     | Creates a directory if it does not exist and overwrites it if it does.               |
| [`delete_dir`](#nemo_curator-utils-file_utils-delete_dir)                                               | -                                                                                    |
| [`filter_files_by_extension`](#nemo_curator-utils-file_utils-filter_files_by_extension)                 | -                                                                                    |
| [`get_all_file_paths_and_size_under`](#nemo_curator-utils-file_utils-get_all_file_paths_and_size_under) | Get all file paths and their sizes under a given path.                               |
| [`get_all_file_paths_under`](#nemo_curator-utils-file_utils-get_all_file_paths_under)                   | Get all file paths under a given path.                                               |
| [`get_fs`](#nemo_curator-utils-file_utils-get_fs)                                                       | -                                                                                    |
| [`infer_dataset_name_from_path`](#nemo_curator-utils-file_utils-infer_dataset_name_from_path)           | Infer a dataset name from a path, handling both local and cloud storage paths.       |
| [`infer_protocol_from_paths`](#nemo_curator-utils-file_utils-infer_protocol_from_paths)                 | Infer a protocol from a list of paths, if any.                                       |
| [`is_not_empty`](#nemo_curator-utils-file_utils-is_not_empty)                                           | -                                                                                    |
| [`pandas_select_columns`](#nemo_curator-utils-file_utils-pandas_select_columns)                         | Project a Pandas DataFrame onto existing columns, logging warnings for missing ones. |
| [`parse_bytes_string_to_int`](#nemo_curator-utils-file_utils-parse_bytes_string_to_int)                 | Parse a byte string into an integer (adapted from dask.utils.parse\_bytes).          |
| [`tar_safe_extract`](#nemo_curator-utils-file_utils-tar_safe_extract)                                   | Safely extract a tar file, preventing path traversal attacks.                        |

### Data

[`FILETYPE_TO_DEFAULT_EXTENSIONS`](#nemo_curator-utils-file_utils-FILETYPE_TO_DEFAULT_EXTENSIONS)

### API

```python
nemo_curator.utils.file_utils._gather_extention(
    path: str
) -> str
```

Gather the extension of a given path.

**Parameters:**

* `path`: The path to get the extension from.

**Returns:** `str`

The extension of the path.

```python
nemo_curator.utils.file_utils._gather_file_records(
    path: str,
    recurse_subdirectories: bool,
    keep_extensions: str | list[str] | None,
    storage_options: dict[str, str] | None,
    fs: fsspec.AbstractFileSystem | None,
    include_size: bool
) -> list[tuple[str, int]]
```

Gather file records from a given path.

**Parameters:**

* `path`: The path to get the file paths from.
* `recurse_subdirectories`: Whether to recurse subdirectories.
* `keep_extensions`: The extensions to keep.
* `storage_options`: The storage options to use.
* `fs`: The filesystem to use.
* `include_size`: Whether to include the size of the files.

**Returns:** `list[tuple[str, int]]`

A list of tuples (file\_path, file\_size).

```python
nemo_curator.utils.file_utils._is_safe_path(
    path: str,
    base_path: str
) -> bool
```

Check if a path is safe for extraction (no path traversal).

**Parameters:**

* `path`: The path to check.
* `base_path`: The base directory for extraction.

**Returns:** `bool`

True if the path is safe, False otherwise
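
The check described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the name `is_safe_path_sketch` is hypothetical, and resolving both paths with `os.path.realpath` before comparing them is an assumption about how such a check is typically done.

```python
import os


def is_safe_path_sketch(path: str, base_path: str) -> bool:
    """Return True if `path`, resolved against `base_path`, stays inside it.

    Hypothetical stand-in for `_is_safe_path`; the real logic may differ.
    """
    base = os.path.realpath(base_path)
    # Joining with an absolute `path` discards `base`, so absolute member
    # names are rejected by the comparison below as well.
    target = os.path.realpath(os.path.join(base, path))
    # commonpath equals `base` only if `target` is `base` or lies beneath it.
    return os.path.commonpath([base, target]) == base


inside = is_safe_path_sketch("data/file.txt", "/tmp/extract")
traversal = is_safe_path_sketch("../../etc/passwd", "/tmp/extract")
absolute = is_safe_path_sketch("/etc/passwd", "/tmp/extract")
print(inside, traversal, absolute)
```

Comparing resolved paths rather than raw strings avoids false positives such as `/tmp/extract_other` being accepted because it shares a string prefix with `/tmp/extract`.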

```python
nemo_curator.utils.file_utils._split_files_as_per_blocksize(
    sorted_file_sizes: list[tuple[str, int]],
    max_byte_per_chunk: int
) -> list[list[str]]
```

```python
nemo_curator.utils.file_utils.check_disallowed_kwargs(
    kwargs: dict,
    disallowed_keys: list[str],
    raise_error: bool = True
) -> None
```

Check if any of the disallowed keys are in the provided kwargs. Used for read/write kwargs in stages.

**Parameters:**

* `kwargs`: The dictionary to check.
* `disallowed_keys`: The keys that are not allowed.
* `raise_error`: Whether to raise an error if any of the disallowed keys are in the kwargs.

**Raises:**

* `ValueError`: If any of the disallowed keys are in the kwargs and `raise_error` is True.
* `Warning`: If any of the disallowed keys are in the kwargs and `raise_error` is False.

**Returns:** `None`
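
The raise-or-warn behavior described above might look roughly like the following. This is a sketch under assumptions: `check_disallowed_kwargs_sketch` is a hypothetical name, the message text is invented, and the use of the `warnings` module is a guess, since the documentation does not say how the warning is emitted.

```python
import warnings


def check_disallowed_kwargs_sketch(
    kwargs: dict, disallowed_keys: list[str], raise_error: bool = True
) -> None:
    """Hypothetical stand-in mirroring the documented contract."""
    found = [key for key in disallowed_keys if key in kwargs]
    if not found:
        return
    message = f"Unsupported keys in kwargs: {', '.join(found)}"  # assumed wording
    if raise_error:
        raise ValueError(message)
    # Assumption: the real function may use a logger instead of `warnings`.
    warnings.warn(message, stacklevel=2)


read_kwargs = {"columns": ["text"], "engine": "pyarrow"}
check_disallowed_kwargs_sketch(read_kwargs, ["filesystem"])  # no disallowed key present
try:
    check_disallowed_kwargs_sketch(read_kwargs, ["engine"])
except ValueError as err:
    print(f"rejected: {err}")
```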

```python
nemo_curator.utils.file_utils.check_output_mode(
    mode: typing.Literal['overwrite', 'append', 'error', 'ignore'],
    fs: fsspec.AbstractFileSystem,
    path: str,
    append_mode_implemented: bool = False
) -> None
```

Validate and act on the write mode for an output directory.

Modes:

* "overwrite": recursively delete `path` (the output directory) if it exists.
* "append": no-op here; raises if append is not implemented (`append_mode_implemented` is False).
* "error": raise `FileExistsError` if `path` already exists.
* "ignore": no-op.
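
The four modes can be illustrated with a local-filesystem stand-in. This is a sketch, not the library's code: it uses `os` and `shutil` in place of the `fs: fsspec.AbstractFileSystem` argument, the name `check_output_mode_sketch` is hypothetical, and the `NotImplementedError` and `ValueError` exception types are assumptions (only `FileExistsError` is named in the documentation).

```python
import os
import shutil
import tempfile


def check_output_mode_sketch(
    mode: str, path: str, append_mode_implemented: bool = False
) -> None:
    """Local-filesystem illustration of the documented write modes."""
    if mode == "overwrite":
        if os.path.exists(path):
            shutil.rmtree(path)  # recursive delete of the existing directory
    elif mode == "append":
        if not append_mode_implemented:
            # Assumed exception type; the docs only say that it raises.
            raise NotImplementedError("append mode is not implemented")
    elif mode == "error":
        if os.path.exists(path):
            raise FileExistsError(f"Output directory {path} already exists")
    elif mode == "ignore":
        pass
    else:
        raise ValueError(f"Unknown mode: {mode!r}")  # assumed behavior


out_dir = os.path.join(tempfile.mkdtemp(), "out")
os.makedirs(out_dir)
check_output_mode_sketch("ignore", out_dir)     # directory is left untouched
check_output_mode_sketch("overwrite", out_dir)  # directory is removed
print(os.path.exists(out_dir))
```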

```python
nemo_curator.utils.file_utils.create_or_overwrite_dir(
    path: str,
    fs: fsspec.AbstractFileSystem | None = None,
    storage_options: dict[str, str] | None = None
) -> None
```

Creates a directory if it does not exist and overwrites it if it does.

**Warning:** This function deletes all files in the directory if it exists.

```python
nemo_curator.utils.file_utils.delete_dir(
    path: str,
    fs: fsspec.AbstractFileSystem | None = None,
    storage_options: dict[str, str] | None = None
) -> None
```

```python
nemo_curator.utils.file_utils.filter_files_by_extension(
    files_list: list[str],
    keep_extensions: str | list[str]
) -> list[str]
```

```python
nemo_curator.utils.file_utils.get_all_file_paths_and_size_under(
    path: str,
    recurse_subdirectories: bool = False,
    keep_extensions: str | list[str] | None = None,
    storage_options: dict[str, str] | None = None,
    fs: fsspec.AbstractFileSystem | None = None,
    sort_by_size: bool = True
) -> list[tuple[str, int]]
```

Get all file paths and their sizes under a given path.

**Parameters:**

* `path`: The path to get the file paths from.
* `recurse_subdirectories`: Whether to recurse subdirectories.
* `keep_extensions`: The extensions to keep.
* `storage_options`: The storage options to use.
* `fs`: The filesystem to use.
* `sort_by_size`: Whether to sort the files by size. If False, the files are sorted by path instead.

**Returns:** `list[tuple[str, int]]`

A list of tuples (file\_path, file\_size).
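
To make the return shape and the `sort_by_size` switch concrete, here is a local-only sketch built on `os.walk`. It is not the library's implementation: `list_files_with_size_sketch` is a hypothetical name, `storage_options` and `fs` are left out, and both the ascending sort order and the extension matching that ignores a leading dot are assumptions.

```python
import os
import tempfile


def list_files_with_size_sketch(
    path, recurse_subdirectories=False, keep_extensions=None, sort_by_size=True
):
    """Return (file_path, file_size) tuples for files under a local directory."""
    if isinstance(keep_extensions, str):
        keep_extensions = [keep_extensions]
    # Assumption: "jsonl" and ".jsonl" are treated as the same extension.
    wanted = None if keep_extensions is None else {e.lstrip(".") for e in keep_extensions}
    records = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            extension = os.path.splitext(name)[1].lstrip(".")
            if wanted is None or extension in wanted:
                full_path = os.path.join(root, name)
                records.append((full_path, os.path.getsize(full_path)))
        if not recurse_subdirectories:
            break  # only the top-level directory was requested
    sort_key = (lambda rec: rec[1]) if sort_by_size else (lambda rec: rec[0])
    return sorted(records, key=sort_key)  # ascending order is an assumption


# Build a small throwaway directory tree to run the sketch against.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for parts, payload in [
    (("b.jsonl",), b"xxxx"),
    (("a.jsonl",), b"x"),
    (("sub", "c.jsonl"), b"xx"),
    (("notes.txt",), b"x"),
]:
    with open(os.path.join(root, *parts), "wb") as handle:
        handle.write(payload)

records = list_files_with_size_sketch(
    root, recurse_subdirectories=True, keep_extensions="jsonl"
)
print([(os.path.basename(p), size) for p, size in records])
```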

```python
nemo_curator.utils.file_utils.get_all_file_paths_under(
    path: str,
    recurse_subdirectories: bool = False,
    keep_extensions: str | list[str] | None = None,
    storage_options: dict[str, str] | None = None,
    fs: fsspec.AbstractFileSystem | None = None
) -> list[str]
```

Get all file paths under a given path.

**Parameters:**

* `path`: The path to get the file paths from.
* `recurse_subdirectories`: Whether to recurse subdirectories.
* `keep_extensions`: The extensions to keep.
* `storage_options`: The storage options to use.
* `fs`: The filesystem to use.

**Returns:** `list[str]`

A list of file paths.

```python
nemo_curator.utils.file_utils.get_fs(
    path: str,
    storage_options: dict[str, str] | None = None
) -> fsspec.AbstractFileSystem
```

```python
nemo_curator.utils.file_utils.infer_dataset_name_from_path(
    path: str
) -> str
```

Infer a dataset name from a path, handling both local and cloud storage paths.

**Parameters:**

* `path`: Local path or cloud storage URL (e.g. `s3://`, `abfs://`).

**Returns:** `str`

The inferred dataset name.

```python
nemo_curator.utils.file_utils.infer_protocol_from_paths(
    paths: collections.abc.Iterable[str]
) -> str | None
```

Infer a protocol from a list of paths, if any.

Returns the first detected protocol scheme (e.g., "s3", "gcs", "gs", "abfs")
or None for local paths.
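
A minimal sketch of "first detected scheme wins" is shown below. The function name `infer_protocol_sketch` is hypothetical, and splitting on `://` is an assumption; the library may rely on an fsspec helper instead, and its handling of edge cases such as `file://` URLs is not documented here.

```python
from collections.abc import Iterable
from typing import Optional


def infer_protocol_sketch(paths: Iterable[str]) -> Optional[str]:
    """Return the scheme of the first path that has one, else None."""
    for path in paths:
        scheme, separator, _rest = path.partition("://")
        if separator and scheme:
            return scheme
    return None


mixed = infer_protocol_sketch(["/data/a.jsonl", "s3://bucket/b.jsonl", "gs://other/c.jsonl"])
local = infer_protocol_sketch(["/data/a.jsonl", "relative/b.jsonl"])
print(mixed, local)
```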

```python
nemo_curator.utils.file_utils.is_not_empty(
    path: str,
    fs: fsspec.AbstractFileSystem | None = None,
    storage_options: dict[str, str] | None = None
) -> bool
```

```python
nemo_curator.utils.file_utils.pandas_select_columns(
    df: pandas.DataFrame,
    columns: list[str] | None,
    file_path: str
) -> pandas.DataFrame | None
```

Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.

Returns the projected DataFrame. If no requested columns exist, returns None.

```python
nemo_curator.utils.file_utils.parse_bytes_string_to_int(
    size: float | str
) -> int
```

Parse a byte string into a number of bytes. Taken from dask.utils.parse\_bytes:
[https://github.com/dask/dask/blob/3801bedc7c71c83f37e836af71f740974c0434b3/dask/utils.py#L1585](https://github.com/dask/dask/blob/3801bedc7c71c83f37e836af71f740974c0434b3/dask/utils.py#L1585)

Examples from the original docstring, where the function is named `parse_bytes`:

| Call                     | Result                                                         |
| ------------------------ | -------------------------------------------------------------- |
| `parse_bytes('100')`     | `100`                                                          |
| `parse_bytes('100 MB')`  | `100000000`                                                    |
| `parse_bytes('100M')`    | `100000000`                                                    |
| `parse_bytes('5kB')`     | `5000`                                                         |
| `parse_bytes('5.4 kB')`  | `5400`                                                         |
| `parse_bytes('1kiB')`    | `1024`                                                         |
| `parse_bytes('1e6')`     | `1000000`                                                      |
| `parse_bytes('1e6 kB')`  | `1000000000`                                                   |
| `parse_bytes('MB')`      | `1000000`                                                      |
| `parse_bytes(123)`       | `123`                                                          |
| `parse_bytes('5 foos')`  | Raises `ValueError: Could not interpret 'foos' as a byte unit` |
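
The parsing rule behind these examples (a numeric prefix followed by an optional unit suffix) can be sketched as follows. This is a simplified illustration, not the library's code: `parse_bytes_sketch` is a hypothetical name, and the unit table covers only a subset of the suffixes a full implementation would accept.

```python
from typing import Union

# Assumed subset of units: decimal (kB, MB, ...) and binary (kiB, MiB, ...).
UNIT_FACTORS = {
    "": 1, "b": 1,
    "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12,
    "kb": 10**3, "mb": 10**6, "gb": 10**9, "tb": 10**12,
    "kib": 2**10, "mib": 2**20, "gib": 2**30, "tib": 2**40,
}


def parse_bytes_sketch(size: Union[float, str]) -> int:
    """Convert values such as '100 MB' or '1kiB' to an integer byte count."""
    if isinstance(size, (int, float)):
        return int(size)
    text = size.replace(" ", "")
    if not any(char.isdigit() for char in text):
        text = "1" + text  # a bare unit such as 'MB' means one of that unit
    # Walk back from the end to find where the alphabetic unit suffix starts.
    index = len(text)
    while index > 0 and text[index - 1].isalpha():
        index -= 1
    number, unit = text[:index], text[index:]
    try:
        value = float(number)
    except ValueError as err:
        raise ValueError(f"Could not interpret '{number}' as a number") from err
    try:
        factor = UNIT_FACTORS[unit.lower()]
    except KeyError as err:
        raise ValueError(f"Could not interpret '{unit}' as a byte unit") from err
    return int(value * factor)


for example in ["100", "100 MB", "1kiB", "1e6 kB", "MB", 123]:
    print(repr(example), "->", parse_bytes_sketch(example))
```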

```python
nemo_curator.utils.file_utils.tar_safe_extract(
    tar: tarfile.TarFile,
    path: str
) -> None
```

Safely extract a tar file, preventing path traversal attacks.

**Parameters:**

* `tar`: The TarFile object to extract.
* `path`: The destination path for extraction.

**Raises:**

* `ValueError`: If any member has an unsafe path
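
The validate-then-extract pattern described above can be sketched with `tarfile` from the standard library. This is an illustration under assumptions, not the library's implementation: `tar_safe_extract_sketch` and `build_archive` are hypothetical names, the error message is invented, and the sketch checks member names only (link targets inside the archive are not inspected).

```python
import io
import os
import tarfile
import tempfile


def tar_safe_extract_sketch(tar: tarfile.TarFile, path: str) -> None:
    """Reject archives whose member names would land outside `path`."""
    base = os.path.realpath(path)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(base, member.name))
        if os.path.commonpath([base, target]) != base:
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    # Use the stricter "data" extraction filter where this Python provides it.
    if hasattr(tarfile, "data_filter"):
        tar.extractall(path, filter="data")
    else:
        tar.extractall(path)


def build_archive(member_name: str) -> tarfile.TarFile:
    """Create a one-file in-memory archive for the demonstration."""
    payload = b"hello"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r")


dest = tempfile.mkdtemp()
tar_safe_extract_sketch(build_archive("docs/a.txt"), dest)
print(os.path.exists(os.path.join(dest, "docs", "a.txt")))

try:
    tar_safe_extract_sketch(build_archive("../escape.txt"), dest)
except ValueError as err:
    print(f"rejected: {err}")
```

Validating every member before extracting anything means a malicious archive is rejected without leaving a partially extracted tree behind.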

```python
nemo_curator.utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS = {'parquet': ['.parquet'], 'jsonl': ['.jsonl', '.json'], 'megatron': ['.bin', '.i...
```