# nemo_curator.utils.split_large_files

## Module Contents

### Functions

| Name                                                                                             | Description                                                                               |
| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| [`_basename_and_ext`](#nemo_curator-utils-split_large_files-_basename_and_ext)                   | Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl). |
| [`_flush_jsonl_chunk`](#nemo_curator-utils-split_large_files-_flush_jsonl_chunk)                 | -                                                                                         |
| [`_join_out_path`](#nemo_curator-utils-split_large_files-_join_out_path)                         | Join output directory and filename using the target filesystem (local or remote).         |
| [`_split_table`](#nemo_curator-utils-split_large_files-_split_table)                             | -                                                                                         |
| [`_storage_options`](#nemo_curator-utils-split_large_files-_storage_options)                     | -                                                                                         |
| [`_write_table_to_file`](#nemo_curator-utils-split_large_files-_write_table_to_file)             | -                                                                                         |
| [`main`](#nemo_curator-utils-split_large_files-main)                                             | -                                                                                         |
| [`parse_args`](#nemo_curator-utils-split_large_files-parse_args)                                 | -                                                                                         |
| [`split_jsonl_file_by_size`](#nemo_curator-utils-split_large_files-split_jsonl_file_by_size)     | -                                                                                         |
| [`split_parquet_file_by_size`](#nemo_curator-utils-split_large_files-split_parquet_file_by_size) | -                                                                                         |

### API

<Anchor id="nemo_curator-utils-split_large_files-_basename_and_ext">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._basename_and_ext(
        path: str
    ) -> tuple[str, str]
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl).
</Indent>
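
As an illustration of the behavior described above, the following is a minimal standard-library sketch, not the library's implementation. The name `basename_and_ext_sketch` is hypothetical, and returning the extension with its leading dot is an assumption.

```python
import posixpath


def basename_and_ext_sketch(path: str) -> tuple[str, str]:
    """Hypothetical helper: split a local path or URI into (stem, extension)."""
    # Drop a scheme such as "s3://" if present; plain local paths pass through.
    without_scheme = path.split("://", 1)[-1]
    # Take the last path component, ignoring any trailing slash.
    name = posixpath.basename(without_scheme.rstrip("/"))
    # splitext keeps the leading dot on the extension (assumed convention).
    stem, ext = posixpath.splitext(name)
    return stem, ext


print(basename_and_ext_sketch("s3://bucket/key/file.jsonl"))  # ('file', '.jsonl')
print(basename_and_ext_sketch("/tmp/data.parquet"))           # ('data', '.parquet')
```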

<Anchor id="nemo_curator-utils-split_large_files-_flush_jsonl_chunk">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._flush_jsonl_chunk(
        lines: list[bytes],
        output_file: str,
        storage_options: dict[str, typing.Any]
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-split_large_files-_join_out_path">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._join_out_path(
        output_path: str,
        filename: str,
        storage_options: dict[str, typing.Any]
    ) -> str
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Join output directory and filename using the target filesystem (local or remote).
</Indent>
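
The idea of joining paths differently for local and remote targets can be sketched with the standard library alone. This is an assumption-laden illustration: `join_out_path_sketch` is a hypothetical name, it ignores `storage_options`, and it detects remote targets by looking for `://` rather than consulting a real filesystem object.

```python
import os
import posixpath


def join_out_path_sketch(output_path: str, filename: str) -> str:
    """Hypothetical helper: join a directory and a filename for local or URI targets."""
    if "://" in output_path:
        # URIs always use forward slashes, regardless of the host OS.
        return posixpath.join(output_path, filename)
    # Local paths follow the conventions of the host OS.
    return os.path.join(output_path, filename)


print(join_out_path_sketch("s3://bucket/out", "file_0.jsonl"))  # s3://bucket/out/file_0.jsonl
```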

<Anchor id="nemo_curator-utils-split_large_files-_split_table">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._split_table(
        table: pyarrow.Table,
        target_size: int
    ) -> list[pyarrow.Table]
    ```
  </CodeBlock>
</Anchor>

<Indent />
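
One plausible way to split a table by size is to derive a row count per chunk from the table's total byte size. The sketch below shows only that arithmetic in plain Python. `plan_row_slices` is a hypothetical name, and the even-rows-per-chunk strategy is an assumption, not necessarily what `_split_table` does.

```python
import math


def plan_row_slices(num_rows: int, total_bytes: int, target_size: int) -> list[tuple[int, int]]:
    """Hypothetical helper: return (offset, length) pairs of roughly target_size bytes each."""
    if num_rows == 0:
        return []
    # Number of chunks needed so that each stays near the target size.
    num_chunks = max(1, math.ceil(total_bytes / target_size))
    # Spread rows evenly; assumes rows are of similar size.
    rows_per_chunk = math.ceil(num_rows / num_chunks)
    return [
        (offset, min(rows_per_chunk, num_rows - offset))
        for offset in range(0, num_rows, rows_per_chunk)
    ]


print(plan_row_slices(num_rows=1000, total_bytes=250, target_size=100))
# [(0, 334), (334, 334), (668, 332)]
```

With PyArrow, each pair could be passed to `pyarrow.Table.slice(offset, length)`, and `table.nbytes` could supply `total_bytes`. Note that in-memory size and on-disk Parquet size generally differ because of compression and encoding.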

<Anchor id="nemo_curator-utils-split_large_files-_storage_options">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._storage_options(
        storage_options: dict[str, typing.Any] | None
    ) -> dict[str, typing.Any]
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-split_large_files-_write_table_to_file">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files._write_table_to_file(
        table: pyarrow.Table,
        output_file: str,
        storage_options: dict[str, typing.Any]
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-split_large_files-main">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files.main(
        args: argparse.ArgumentParser | None = None
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-split_large_files-parse_args">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files.parse_args(
        args: argparse.ArgumentParser | None = None
    ) -> argparse.Namespace
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-utils-split_large_files-split_jsonl_file_by_size">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files.split_jsonl_file_by_size(
        input_file: str,
        output_path: str,
        target_size_mb: int,
        storage_options: dict[str, typing.Any] | None = None
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />
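
The signature suggests that a JSONL file is cut into pieces of about `target_size_mb` megabytes. The following is a minimal local-only sketch of size-based splitting on line boundaries, under stated assumptions: `split_jsonl_sketch` is a hypothetical name, the `{stem}_{index}{ext}` output naming is invented for illustration, the size is given in bytes to keep the demo small, and remote storage is not handled.

```python
import os
import tempfile


def split_jsonl_sketch(input_file: str, output_dir: str, target_size_bytes: int) -> list[str]:
    """Hypothetical helper: split a local JSONL file without cutting any record in half."""
    os.makedirs(output_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(input_file))
    written: list[str] = []
    chunk: list[bytes] = []
    chunk_bytes = 0

    def flush() -> None:
        nonlocal chunk, chunk_bytes
        if not chunk:
            return
        # Output naming is an assumption made for this sketch.
        out = os.path.join(output_dir, f"{stem}_{len(written)}{ext}")
        with open(out, "wb") as f:
            f.writelines(chunk)
        written.append(out)
        chunk, chunk_bytes = [], 0

    with open(input_file, "rb") as f:
        for line in f:
            # Start a new chunk before the current one would exceed the target.
            if chunk and chunk_bytes + len(line) > target_size_bytes:
                flush()
            chunk.append(line)
            chunk_bytes += len(line)
    flush()  # Write whatever remains.
    return written


# Demo: ten 10-byte records split with a 30-byte target.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.jsonl")
    records = [b'{"id": %d}\n' % i for i in range(10)]
    with open(src, "wb") as f:
        f.writelines(records)
    parts = split_jsonl_sketch(src, os.path.join(tmp, "out"), target_size_bytes=30)
    rejoined = b""
    for p in parts:
        with open(p, "rb") as f:
            rejoined += f.read()
    print(len(parts))  # 4
```

Buffering whole lines means a single record larger than the target still lands in one file, so chunk sizes are approximate rather than strict upper bounds.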

<Anchor id="nemo_curator-utils-split_large_files-split_parquet_file_by_size">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.split_large_files.split_parquet_file_by_size(
        input_file: str,
        output_path: str,
        target_size_mb: int,
        storage_options: dict[str, typing.Any] | None = None
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />