***
layout: overview
slug: nemo-curator/nemo\_curator/utils/file\_utils
title: nemo\_curator.utils.file\_utils
--------------------------------------
## Module Contents
### Functions
| Name | Description |
| ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [`_gather_extention`](#nemo_curator-utils-file_utils-_gather_extention) | Gather the extension of a given path. |
| [`_gather_file_records`](#nemo_curator-utils-file_utils-_gather_file_records) | Gather file records from a given path. |
| [`_is_safe_path`](#nemo_curator-utils-file_utils-_is_safe_path) | Check if a path is safe for extraction (no path traversal). |
| [`_split_files_as_per_blocksize`](#nemo_curator-utils-file_utils-_split_files_as_per_blocksize) | - |
| [`check_disallowed_kwargs`](#nemo_curator-utils-file_utils-check_disallowed_kwargs) | Check if any of the disallowed keys are in provided kwargs |
| [`check_output_mode`](#nemo_curator-utils-file_utils-check_output_mode) | Validate and act on the write mode for an output directory. |
| [`create_or_overwrite_dir`](#nemo_curator-utils-file_utils-create_or_overwrite_dir) | Creates a directory if it does not exist and overwrites it if it does. |
| [`delete_dir`](#nemo_curator-utils-file_utils-delete_dir) | - |
| [`filter_files_by_extension`](#nemo_curator-utils-file_utils-filter_files_by_extension) | - |
| [`get_all_file_paths_and_size_under`](#nemo_curator-utils-file_utils-get_all_file_paths_and_size_under) | Get all file paths and their sizes under a given path. |
| [`get_all_file_paths_under`](#nemo_curator-utils-file_utils-get_all_file_paths_under) | Get all file paths under a given path. |
| [`get_fs`](#nemo_curator-utils-file_utils-get_fs) | - |
| [`infer_dataset_name_from_path`](#nemo_curator-utils-file_utils-infer_dataset_name_from_path) | Infer a dataset name from a path, handling both local and cloud storage paths. |
| [`infer_protocol_from_paths`](#nemo_curator-utils-file_utils-infer_protocol_from_paths) | Infer a protocol from a list of paths, if any. |
| [`is_not_empty`](#nemo_curator-utils-file_utils-is_not_empty) | - |
| [`pandas_select_columns`](#nemo_curator-utils-file_utils-pandas_select_columns) | Project a Pandas DataFrame onto existing columns, logging warnings for missing ones. |
| [`tar_safe_extract`](#nemo_curator-utils-file_utils-tar_safe_extract) | Safely extract a tar file, preventing path traversal attacks. |
### Data
[`FILETYPE_TO_DEFAULT_EXTENSIONS`](#nemo_curator-utils-file_utils-FILETYPE_TO_DEFAULT_EXTENSIONS)
### API
```python
nemo_curator.utils.file_utils._gather_extention(
path: str
) -> str
```
Gather the extension of a given path.
Args:
path: The path to get the extension from.
Returns:
The extension of the path.
```python
nemo_curator.utils.file_utils._gather_file_records(
path: str,
recurse_subdirectories: bool,
keep_extensions: str | list[str] | None,
storage_options: dict[str, str] | None,
fs: fsspec.AbstractFileSystem | None,
include_size: bool
) -> list[tuple[str, int]]
```
Gather file records from a given path.
Args:
path: The path to get the file paths from.
recurse\_subdirectories: Whether to recurse subdirectories.
keep\_extensions: The extensions to keep.
storage\_options: The storage options to use.
fs: The filesystem to use.
include\_size: Whether to include the size of the files.
Returns:
A list of tuples (file\_path, file\_size).
```python
nemo_curator.utils.file_utils._is_safe_path(
path: str,
base_path: str
) -> bool
```
Check if a path is safe for extraction (no path traversal).
**Parameters:**
The path to check
The base directory for extraction
**Returns:** `bool`
True if the path is safe, False otherwise
```python
nemo_curator.utils.file_utils._split_files_as_per_blocksize(
sorted_file_sizes: list[tuple[str, int]],
max_byte_per_chunk: int
) -> list[list[str]]
```
```python
nemo_curator.utils.file_utils.check_disallowed_kwargs(
kwargs: dict,
disallowed_keys: list[str],
raise_error: bool = True
) -> None
```
Check if any of the disallowed keys are in provided kwargs
Used for read/write kwargs in stages.
Args:
kwargs: The dictionary to check
disallowed\_keys: The keys that are not allowed.
raise\_error: Whether to raise an error if any of the disallowed keys are in the kwargs.
Raises:
ValueError: If any of the disallowed keys are in the kwargs and raise\_error is True.
Warning: If any of the disallowed keys are in the kwargs and raise\_error is False.
Returns:
None
```python
nemo_curator.utils.file_utils.check_output_mode(
mode: typing.Literal['overwrite', 'append', 'error', 'ignore'],
fs: fsspec.AbstractFileSystem,
path: str,
append_mode_implemented: bool = False
) -> None
```
Validate and act on the write mode for an output directory.
Modes:
* "overwrite": delete existing `output_dir` recursively if it exists.
* "append": no-op here; raises if append is not implemented.
* "error": raise FileExistsError if `output_dir` already exists.
* "ignore": no-op.
```python
nemo_curator.utils.file_utils.create_or_overwrite_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> None
```
Creates a directory if it does not exist and overwrites it if it does.
Warning: This function will delete all files in the directory if it exists.
```python
nemo_curator.utils.file_utils.delete_dir(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> None
```
```python
nemo_curator.utils.file_utils.filter_files_by_extension(
files_list: list[str],
keep_extensions: str | list[str]
) -> list[str]
```
```python
nemo_curator.utils.file_utils.get_all_file_paths_and_size_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None
) -> list[tuple[str, int]]
```
Get all file paths and their sizes under a given path.
Args:
path: The path to get the file paths from.
recurse\_subdirectories: Whether to recurse subdirectories.
keep\_extensions: The extensions to keep.
storage\_options: The storage options to use.
fs: The filesystem to use.
Returns:
A list of tuples (file\_path, file\_size).
```python
nemo_curator.utils.file_utils.get_all_file_paths_under(
path: str,
recurse_subdirectories: bool = False,
keep_extensions: str | list[str] | None = None,
storage_options: dict[str, str] | None = None,
fs: fsspec.AbstractFileSystem | None = None
) -> list[str]
```
Get all file paths under a given path.
Args:
path: The path to get the file paths from.
recurse\_subdirectories: Whether to recurse subdirectories.
keep\_extensions: The extensions to keep.
storage\_options: The storage options to use.
fs: The filesystem to use.
Returns:
A list of file paths.
```python
nemo_curator.utils.file_utils.get_fs(
path: str,
storage_options: dict[str, str] | None = None
) -> fsspec.AbstractFileSystem
```
```python
nemo_curator.utils.file_utils.infer_dataset_name_from_path(
path: str
) -> str
```
Infer a dataset name from a path, handling both local and cloud storage paths.
Args:
path: Local path or cloud storage URL (e.g. s3://, abfs\://)
Returns:
Inferred dataset name from the path
```python
nemo_curator.utils.file_utils.infer_protocol_from_paths(
paths: collections.abc.Iterable[str]
) -> str | None
```
Infer a protocol from a list of paths, if any.
Returns the first detected protocol scheme (e.g., "s3", "gcs", "gs", "abfs")
or None for local paths.
```python
nemo_curator.utils.file_utils.is_not_empty(
path: str,
fs: fsspec.AbstractFileSystem | None = None,
storage_options: dict[str, str] | None = None
) -> bool
```
```python
nemo_curator.utils.file_utils.pandas_select_columns(
df: pandas.DataFrame,
columns: list[str] | None,
file_path: str
) -> pandas.DataFrame | None
```
Project a Pandas DataFrame onto existing columns, logging warnings for missing ones.
Returns the projected DataFrame. If no requested columns exist, returns None.
Safely extract a tar file, preventing path traversal attacks.
**Parameters:**
The TarFile object to extract
The destination path for extraction
**Raises:**
* `ValueError`: If any member has an unsafe path
```python
nemo_curator.utils.file_utils.FILETYPE_TO_DEFAULT_EXTENSIONS = {'parquet': ['.parquet'], 'jsonl': ['.jsonl', '.json'], 'megatron': ['.bin', '.i...
```