# nemo_curator.utils.split_large_files

## Module Contents

### Functions

| Name                                                                                             | Description                                                                               |
| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| [`_basename_and_ext`](#nemo_curator-utils-split_large_files-_basename_and_ext)                   | Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl). |
| [`_flush_jsonl_chunk`](#nemo_curator-utils-split_large_files-_flush_jsonl_chunk)                 | -                                                                                         |
| [`_join_out_path`](#nemo_curator-utils-split_large_files-_join_out_path)                         | Join output directory and filename using the target filesystem (local or remote).         |
| [`_split_table`](#nemo_curator-utils-split_large_files-_split_table)                             | -                                                                                         |
| [`_storage_options`](#nemo_curator-utils-split_large_files-_storage_options)                     | -                                                                                         |
| [`_write_table_to_file`](#nemo_curator-utils-split_large_files-_write_table_to_file)             | -                                                                                         |
| [`main`](#nemo_curator-utils-split_large_files-main)                                             | -                                                                                         |
| [`parse_args`](#nemo_curator-utils-split_large_files-parse_args)                                 | -                                                                                         |
| [`split_jsonl_file_by_size`](#nemo_curator-utils-split_large_files-split_jsonl_file_by_size)     | -                                                                                         |
| [`split_parquet_file_by_size`](#nemo_curator-utils-split_large_files-split_parquet_file_by_size) | -                                                                                         |

### API

```python
nemo_curator.utils.split_large_files._basename_and_ext(
    path: str
) -> tuple[str, str]
```

Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl).
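As an illustration of the behavior described above, the following is a minimal standard-library sketch, not the library's implementation. The helper name `basename_and_ext_sketch` is hypothetical. It assumes that the basename is the last `/`-separated component and that the extension is whatever follows the final dot.

```python
import posixpath


def basename_and_ext_sketch(path: str) -> tuple[str, str]:
    """Hypothetical stand-in: split a local path or URI into (stem, extension)."""
    # posixpath treats "/" as the separator, which also suits URIs such as s3://...
    name = posixpath.basename(path.rstrip("/"))
    stem, ext = posixpath.splitext(name)
    return stem, ext


print(basename_and_ext_sketch("s3://bucket/key/file.jsonl"))  # ('file', '.jsonl')
print(basename_and_ext_sketch("/data/input/file.parquet"))    # ('file', '.parquet')
```

Whether the real helper returns the extension with or without the leading dot is not stated in this reference, so treat that detail as an assumption.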

```python
nemo_curator.utils.split_large_files._flush_jsonl_chunk(
    lines: list[bytes],
    output_file: str,
    storage_options: dict[str, typing.Any]
) -> None
```

```python
nemo_curator.utils.split_large_files._join_out_path(
    output_path: str,
    filename: str,
    storage_options: dict[str, typing.Any]
) -> str
```

Join output directory and filename using the target filesystem (local or remote).
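The idea of joining paths according to the target filesystem can be sketched as follows. This is a simplified, hypothetical stand-in (`join_out_path_sketch`) that only inspects the path string; the real function also takes `storage_options` and presumably consults the filesystem itself, which this sketch does not attempt.

```python
import os


def join_out_path_sketch(output_path: str, filename: str) -> str:
    """Hypothetical stand-in: join a directory (local or URI) with a filename."""
    if "://" in output_path:
        # Remote URIs always use "/" regardless of the local operating system.
        return output_path.rstrip("/") + "/" + filename
    # Local paths use the separator of the current platform.
    return os.path.join(output_path, filename)


print(join_out_path_sketch("s3://bucket/out/", "file_0.jsonl"))  # s3://bucket/out/file_0.jsonl
```

Keeping URI joins separate from `os.path.join` avoids backslashes appearing in remote paths on Windows.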

```python
nemo_curator.utils.split_large_files._split_table(
    table: pyarrow.Table,
    target_size: int
) -> list[pyarrow.Table]
```
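This reference does not document how `_split_table` chooses its split points. One plausible approach, shown here purely as an assumption, is to derive the number of chunks from the table's in-memory size and then cut the rows evenly. The helper `plan_row_slices` is hypothetical and works on plain integers; with PyArrow, each `(offset, length)` pair could be passed to `Table.slice`, and the byte size could come from `Table.nbytes`.

```python
import math


def plan_row_slices(num_rows: int, table_nbytes: int, target_size: int) -> list[tuple[int, int]]:
    """Hypothetical sketch: return (offset, length) pairs for size-based row slicing."""
    if num_rows == 0:
        return []
    # Enough chunks that each one is at most roughly target_size bytes on average.
    num_chunks = max(1, math.ceil(table_nbytes / target_size))
    rows_per_chunk = max(1, math.ceil(num_rows / num_chunks))
    return [
        (offset, min(rows_per_chunk, num_rows - offset))
        for offset in range(0, num_rows, rows_per_chunk)
    ]


print(plan_row_slices(num_rows=10, table_nbytes=1000, target_size=400))
# [(0, 4), (4, 4), (8, 2)]
```

This even split assumes rows are of similar size; heavily skewed rows would produce chunks that deviate from the target.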

```python
nemo_curator.utils.split_large_files._storage_options(
    storage_options: dict[str, typing.Any] | None
) -> dict[str, typing.Any]
```
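The signature suggests that this helper normalizes an optional mapping into a concrete dictionary. A minimal sketch of that pattern, under the assumption that `None` maps to an empty dictionary, might look like this; `storage_options_sketch` is a hypothetical name.

```python
from typing import Any, Optional


def storage_options_sketch(storage_options: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical stand-in: turn None into {} and copy any provided options."""
    # Copying avoids mutating the caller's dictionary later on.
    return dict(storage_options) if storage_options else {}


print(storage_options_sketch(None))           # {}
print(storage_options_sketch({"anon": True})) # {'anon': True}
```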

```python
nemo_curator.utils.split_large_files._write_table_to_file(
    table: pyarrow.Table,
    output_file: str,
    storage_options: dict[str, typing.Any]
) -> None
```

```python
nemo_curator.utils.split_large_files.main(
    args: argparse.ArgumentParser | None = None
) -> None
```

```python
nemo_curator.utils.split_large_files.parse_args(
    args: argparse.ArgumentParser | None = None
) -> argparse.Namespace
```

```python
nemo_curator.utils.split_large_files.split_jsonl_file_by_size(
    input_file: str,
    output_path: str,
    target_size_mb: int,
    storage_options: dict[str, typing.Any] | None = None
) -> None
```
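To make the size-based splitting idea concrete, here is a local-only sketch written with the standard library. It is not the library's implementation: `split_jsonl_sketch`, its byte-based `target_size_bytes` parameter, and the `{stem}_{index}{ext}` output naming are all assumptions chosen for illustration, and remote filesystems and `storage_options` are left out.

```python
import os
import tempfile


def split_jsonl_sketch(input_file: str, output_dir: str, target_size_bytes: int) -> list[str]:
    """Hypothetical local-only sketch: split a JSONL file into size-bounded parts."""
    stem, ext = os.path.splitext(os.path.basename(input_file))
    written: list[str] = []
    chunk: list[bytes] = []
    chunk_size = 0

    def flush() -> None:
        nonlocal chunk, chunk_size
        if not chunk:
            return
        out_file = os.path.join(output_dir, f"{stem}_{len(written)}{ext}")
        with open(out_file, "wb") as f:
            f.writelines(chunk)
        written.append(out_file)
        chunk, chunk_size = [], 0

    with open(input_file, "rb") as f:
        for line in f:
            # Start a new part before this line would push the chunk past the target.
            if chunk and chunk_size + len(line) > target_size_bytes:
                flush()
            chunk.append(line)
            chunk_size += len(line)
    flush()  # write whatever remains
    return written


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.jsonl")
    with open(src, "wb") as f:
        for i in range(10):
            f.write(b'{"id": %d}\n' % i)  # each record is 10 bytes
    parts = split_jsonl_sketch(src, tmp, target_size_bytes=30)
    print([os.path.basename(p) for p in parts])
    # ['data_0.jsonl', 'data_1.jsonl', 'data_2.jsonl', 'data_3.jsonl']
```

Splitting on line boundaries keeps every JSON record intact, at the cost of parts that are only approximately the target size. A single record larger than the target ends up in a part of its own.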

```python
nemo_curator.utils.split_large_files.split_parquet_file_by_size(
    input_file: str,
    output_path: str,
    target_size_mb: int,
    storage_options: dict[str, typing.Any] | None = None
) -> None
```