
Copyright © 2026, NVIDIA Corporation.

nemo_curator.core.utils

Module Contents

Functions

Name                             Description
_logger_custom_deserializer      -
_logger_custom_serializer        -
check_ray_responsive             -
get_free_port                    Checks if start_port is free.
ignore_ray_head_node             Return True if CURATOR_IGNORE_RAY_HEAD_NODE is set to a truthy value.
init_cluster                     Initialize a new local Ray cluster or connect to an existing one.
split_table_by_group_max_bytes   Split an Arrow table by approximate byte size without splitting group rows.

API

nemo_curator.core.utils._logger_custom_deserializer(
_: None
) -> loguru.Logger

nemo_curator.core.utils._logger_custom_serializer(
_: loguru.Logger
) -> None

nemo_curator.core.utils.check_ray_responsive(
timeout_s: int = RAY_CLUSTER_START_VERIFICAT...
) -> bool

nemo_curator.core.utils.get_free_port(
start_port: int,
get_next_free_port: bool = True
) -> int

Checks whether start_port is free. If it is not free and get_next_free_port is True, returns the next free port, searching from start_port. If get_next_free_port is False, raises an error when start_port itself is not free.
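The behavior described above can be illustrated with a small standard-library sketch. This is not the NeMo Curator implementation: the names is_port_free and find_free_port, the loopback host, the upper port bound, and the choice of RuntimeError are all assumptions made for illustration.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(start_port: int, get_next_free_port: bool = True) -> int:
    """Hypothetical stand-in that mirrors the documented get_free_port contract."""
    for port in range(start_port, 65536):
        if is_port_free(port):
            return port
        if not get_next_free_port:
            # Assumption: the real function's error type is not documented here.
            raise RuntimeError(f"Port {start_port} is already in use")
    raise RuntimeError(f"No free port found at or above {start_port}")
```

Note that a check like this is inherently racy: another process may claim the port between the check and its later use, so callers should still handle bind failures.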

nemo_curator.core.utils.ignore_ray_head_node() -> bool

Return True if CURATOR_IGNORE_RAY_HEAD_NODE is set to a truthy value.

Used by both the pipeline executors (to skip the head node when scheduling stage actors) and the inference-server backends (to emit a worker-only bundle-label selector on placement groups).
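As a rough illustration of how such a flag can gate head-node scheduling, the sketch below reads the environment variable and filters a list of node IDs. The set of strings treated as truthy, and the helper names env_flag_enabled and schedulable_nodes, are assumptions; this reference does not list the values NeMo Curator actually accepts.

```python
import os

# Assumption: illustrative only. The values NeMo Curator treats as truthy
# are not specified in this reference.
ASSUMED_TRUTHY_VALUES = {"1", "true", "yes", "on"}


def env_flag_enabled(name: str) -> bool:
    """Return True if the environment variable holds an assumed-truthy value."""
    value = os.environ.get(name, "")
    return value.strip().lower() in ASSUMED_TRUTHY_VALUES


def schedulable_nodes(node_ids: list[str], head_node_id: str) -> list[str]:
    """Hypothetical helper: drop the head node when the flag is enabled."""
    if env_flag_enabled("CURATOR_IGNORE_RAY_HEAD_NODE"):
        return [node for node in node_ids if node != head_node_id]
    return list(node_ids)
```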

nemo_curator.core.utils.init_cluster(
ray_port: int,
ray_temp_dir: str,
ray_dashboard_port: int,
ray_metrics_port: int,
ray_client_server_port: int,
ray_dashboard_host: str,
num_gpus: int | None = None,
num_cpus: int | None = None,
object_store_memory: int | None = None,
enable_object_spilling: bool = False,
block: bool = True,
ip_address: str | None = None,
stdouterr_capture_file: str | None = None
) -> subprocess.Popen

Initialize a new local Ray cluster or connect to an existing one.
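Because the function returns a subprocess.Popen, one plausible reading is that it launches the `ray start` command as a child process. The sketch below only shows how the documented parameters could map onto `ray start` flags; it does not run anything. The helper build_ray_start_command is hypothetical, the mapping is an assumption rather than the library's actual behavior, and enable_object_spilling, stdouterr_capture_file, and the connect-to-existing-cluster path are deliberately left out.

```python
from __future__ import annotations


def build_ray_start_command(
    ray_port: int,
    ray_temp_dir: str,
    ray_dashboard_port: int,
    ray_metrics_port: int,
    ray_client_server_port: int,
    ray_dashboard_host: str,
    num_gpus: int | None = None,
    num_cpus: int | None = None,
    object_store_memory: int | None = None,
    block: bool = True,
    ip_address: str | None = None,
) -> list[str]:
    """Hypothetical helper: assemble an argv list for `ray start --head`."""
    cmd = [
        "ray", "start", "--head",
        f"--port={ray_port}",
        f"--temp-dir={ray_temp_dir}",
        f"--dashboard-host={ray_dashboard_host}",
        f"--dashboard-port={ray_dashboard_port}",
        f"--metrics-export-port={ray_metrics_port}",
        f"--ray-client-server-port={ray_client_server_port}",
    ]
    # Optional resources are only passed when explicitly set.
    if ip_address is not None:
        cmd.append(f"--node-ip-address={ip_address}")
    if num_gpus is not None:
        cmd.append(f"--num-gpus={num_gpus}")
    if num_cpus is not None:
        cmd.append(f"--num-cpus={num_cpus}")
    if object_store_memory is not None:
        cmd.append(f"--object-store-memory={object_store_memory}")
    if block:
        cmd.append("--block")
    return cmd


# Example values are placeholders, not defaults taken from NeMo Curator.
cmd = build_ray_start_command(
    ray_port=6379,
    ray_temp_dir="/tmp/ray",
    ray_dashboard_port=8265,
    ray_metrics_port=8080,
    ray_client_server_port=10001,
    ray_dashboard_host="127.0.0.1",
    num_cpus=4,
)
```

A list like this could then be handed to subprocess.Popen, which would match the documented return type.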

nemo_curator.core.utils.split_table_by_group_max_bytes(
table: pyarrow.Table,
group_column: str,
max_batch_bytes: int | None
) -> list[pyarrow.Table]

Split an Arrow table by approximate byte size without splitting group rows.

Each unique value in group_column is kept in a single output table. If a single group exceeds max_batch_bytes, it is still emitted as one chunk.

Note: null values in group_column are grouped together (consecutive nulls are not split). Callers should ensure the column is non-nullable or handle nulls upstream.