
Copyright © 2026, NVIDIA Corporation.


nemo_curator.utils.split_large_files

Module Contents

Functions

| Name | Description |
| --- | --- |
| _basename_and_ext | Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl). |
| _flush_jsonl_chunk | - |
| _join_out_path | Join output directory and filename using the target filesystem (local or remote). |
| _split_table | - |
| _storage_options | - |
| _write_table_to_file | - |
| main | - |
| parse_args | - |
| split_jsonl_file_by_size | - |
| split_parquet_file_by_size | - |

API

nemo_curator.utils.split_large_files._basename_and_ext(
path: str
) -> tuple[str, str]

Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl).
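
The behaviour described above can be illustrated with a minimal standard-library sketch. The function name `basename_and_ext_sketch` is hypothetical, and the exact return format of the real helper (for example, whether the extension keeps its leading dot) is an assumption, not something this page confirms.

```python
import posixpath
from urllib.parse import urlsplit


def basename_and_ext_sketch(path: str) -> tuple[str, str]:
    """Hypothetical stand-in: return (stem, extension) for a local path or URI."""
    # For URIs such as s3://bucket/key/file.jsonl, keep only the path component.
    if "://" in path:
        path = urlsplit(path).path
    # Normalise Windows separators so the last component is found either way.
    name = posixpath.basename(path.replace("\\", "/"))
    # Assumption: the extension is returned with its leading dot.
    stem, ext = posixpath.splitext(name)
    return stem, ext


print(basename_and_ext_sketch("s3://bucket/key/file.jsonl"))  # ('file', '.jsonl')
print(basename_and_ext_sketch("/data/shard.parquet"))  # ('shard', '.parquet')
```

Handling the URI case separately keeps the bucket name out of the basename, so local paths and remote URIs yield the same kind of result.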

nemo_curator.utils.split_large_files._flush_jsonl_chunk(
lines: list[bytes],
output_file: str,
storage_options: dict[str, typing.Any]
) -> None

nemo_curator.utils.split_large_files._join_out_path(
output_path: str,
filename: str,
storage_options: dict[str, typing.Any]
) -> str

Join output directory and filename using the target filesystem (local or remote).
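
As a rough illustration of why the join depends on the target, the sketch below separates remote URIs from local paths using only the standard library. The name `join_out_path_sketch` is hypothetical, and the sketch omits the `storage_options` argument. The real helper is described as using the target filesystem, which this simplified version does not do.

```python
import os
import posixpath


def join_out_path_sketch(output_path: str, filename: str) -> str:
    """Hypothetical stand-in: join a directory and filename for local or remote targets."""
    if "://" in output_path:
        # Remote URIs use forward slashes regardless of the host OS.
        return posixpath.join(output_path, filename)
    # Local paths use the separator of the current platform.
    return os.path.join(output_path, filename)


print(join_out_path_sketch("s3://bucket/out", "part_0.jsonl"))
print(join_out_path_sketch("out", "part_0.jsonl"))
```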

nemo_curator.utils.split_large_files._split_table(
table: pyarrow.Table,
target_size: int
) -> list[pyarrow.Table]

nemo_curator.utils.split_large_files._storage_options(
storage_options: dict[str, typing.Any] | None
) -> dict[str, typing.Any]

nemo_curator.utils.split_large_files._write_table_to_file(
table: pyarrow.Table,
output_file: str,
storage_options: dict[str, typing.Any]
) -> None

nemo_curator.utils.split_large_files.main(
args: argparse.ArgumentParser | None = None
) -> None

nemo_curator.utils.split_large_files.parse_args(
args: argparse.ArgumentParser | None = None
) -> argparse.Namespace

nemo_curator.utils.split_large_files.split_jsonl_file_by_size(
input_file: str,
output_path: str,
target_size_mb: int,
storage_options: dict[str, typing.Any] | None = None
) -> None

nemo_curator.utils.split_large_files.split_parquet_file_by_size(
input_file: str,
output_path: str,
target_size_mb: int,
storage_options: dict[str, typing.Any] | None = None
) -> None
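
The two public functions carry no description on this page, so the following is only a sketch of what size-based splitting of JSONL data could look like, not the library's implementation. The helper name `chunk_lines_by_size` is hypothetical, and treating one megabyte as 1024 * 1024 bytes is an assumption.

```python
def chunk_lines_by_size(lines: list[bytes], target_size_mb: int) -> list[list[bytes]]:
    """Hypothetical helper: group JSONL lines into chunks of roughly target_size_mb."""
    # Assumption: 1 MB is taken to be 1024 * 1024 bytes.
    target_bytes = target_size_mb * 1024 * 1024
    chunks: list[list[bytes]] = []
    current: list[bytes] = []
    current_size = 0
    for line in lines:
        # Start a new chunk when adding this line would exceed the target,
        # but never emit an empty chunk (an oversized line gets its own chunk).
        if current and current_size + len(line) > target_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        chunks.append(current)
    return chunks


# Five 400 KiB lines with a 1 MB target: two lines fit per chunk.
line = b"x" * (400 * 1024 - 1) + b"\n"
print([len(chunk) for chunk in chunk_lines_by_size([line] * 5, 1)])  # [2, 2, 1]
```

Splitting on whole lines keeps every record intact in exactly one output file. With the library installed, the documented entry points take the same arguments in both cases: `input_file`, `output_path`, `target_size_mb`, and an optional `storage_options` dictionary.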