Copyright © 2026, NVIDIA Corporation.

nemo_curator.backends.ray_data.adapter

Module Contents

Classes

Name | Description
RayDataStageAdapter | Adapts ProcessingStage to Ray Data operations.

Functions

Name | Description
create_actor_from_stage | Create a StageProcessor class with the proper stage name for display.
create_task_from_stage | Create a named Ray Data stage adapter function.

API

class nemo_curator.backends.ray_data.adapter.RayDataStageAdapter(
stage: nemo_curator.stages.base.ProcessingStage
)

Bases: BaseStageAdapter

Adapts ProcessingStage to Ray Data operations.

This adapter converts stages to work with Ray Data datasets by:

  1. Working directly with Task objects (no dictionary conversion)
  2. Using Ray Data’s map_batches for parallel processing:
     a. If the stage has both gpus and cpus specified, then we use actors
     b. If stage.setup is overridden, then we use actors
     c. Else we use tasks

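The actor-versus-task rule in the list above can be sketched in plain Python. This is only an illustration of the decision, not the adapter's actual code: `Resources`, `BaseStage`, the example stages, and `uses_actors` are hypothetical stand-ins, and the attribute names `cpus` and `gpus` are assumed from the wording of the list.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for a stage's resource request."""
    cpus: float = 1.0
    gpus: float = 0.0


class BaseStage:
    """Hypothetical stand-in for ProcessingStage."""
    resources = Resources()

    def setup(self) -> None:
        """Default no-op; subclasses override this to do one-time work."""


class TokenizeStage(BaseStage):
    """Stateless CPU stage: no GPU request, no setup override."""


class ModelStage(BaseStage):
    """Stage with one-time setup, e.g. loading a model."""

    def setup(self) -> None:
        self.model = "loaded"


class GpuStage(BaseStage):
    """Stage that requests both CPUs and GPUs."""
    resources = Resources(cpus=4.0, gpus=1.0)


def uses_actors(stage: BaseStage) -> bool:
    """Return True when the rule in the list would pick actors over tasks."""
    res = stage.resources
    # Rule (a): both gpus and cpus are specified.
    has_cpu_and_gpu = res.gpus > 0 and res.cpus > 0
    # Rule (b): the stage class overrides the base setup method.
    overrides_setup = type(stage).setup is not BaseStage.setup
    # Rule (c): anything else runs as tasks.
    return has_cpu_and_gpu or overrides_setup
```

Under these assumptions, `uses_actors(TokenizeStage())` is false, while `ModelStage` and `GpuStage` both select actors. Actors are the natural choice for rules (a) and (b) because they keep state such as a loaded model alive across batches.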
_batch_size = self.stage.batch_size

batch_size: int | None

Get the batch size for this stage.

nemo_curator.backends.ray_data.adapter.RayDataStageAdapter._process_batch_internal(
batch: dict[str, typing.Any]
) -> dict[str, typing.Any]

Internal method that handles the actual batch processing logic.

Parameters:

batch
dict[str, Any]

Dictionary with arrays/lists representing a batch of Task objects

Returns: dict[str, Any]

Dictionary with arrays/lists representing processed Task objects
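To make the batch shape concrete, here is a minimal sketch of a function with the same dict-in, dict-out signature. It assumes the Task objects sit in a single column named `"item"`; that column name, the `Task` stand-in, and `make_batch_fn` are illustrative assumptions, since the page does not describe the adapter's actual column layout.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    """Hypothetical stand-in for a Curator Task object."""
    task_id: str
    data: str


def make_batch_fn(
    process: Callable[[Task], Task], column: str = "item"
) -> Callable[[dict[str, Any]], dict[str, Any]]:
    """Build a batch function that applies `process` to every Task in a column."""

    def process_batch(batch: dict[str, Any]) -> dict[str, Any]:
        # The batch maps column names to arrays/lists; pull the Task objects out.
        tasks = list(batch[column])
        # Process each Task and return the results in the same columnar layout.
        return {column: [process(task) for task in tasks]}

    return process_batch


upper = make_batch_fn(lambda t: Task(t.task_id, t.data.upper()))
out = upper({"item": [Task("a", "hello"), Task("b", "world")]})
print([t.data for t in out["item"]])  # prints ['HELLO', 'WORLD']
```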

nemo_curator.backends.ray_data.adapter.RayDataStageAdapter.process_dataset(
dataset: ray.data.Dataset,
ignore_head_node: bool = False
) -> ray.data.Dataset

Process a Ray Data dataset through this stage.

Parameters:

dataset
Dataset

Ray Data dataset containing Task objects

Returns: Dataset

Processed Ray Data dataset

nemo_curator.backends.ray_data.adapter.create_actor_from_stage(
stage: nemo_curator.stages.base.ProcessingStage
) -> type[nemo_curator.backends.ray_data.adapter.RayDataStageAdapter]

Create a StageProcessor class with the proper stage name for display.
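One common way to give a class a display-friendly name is to create a subclass dynamically with the built-in `type()`. The sketch below shows that general technique, not the library's implementation: `StageAdapter`, `Dedup`, and `make_named_adapter` are hypothetical, and the stage's `name` attribute is assumed.

```python
class StageAdapter:
    """Hypothetical stand-in for RayDataStageAdapter."""

    def __init__(self, stage):
        self.stage = stage

    def __call__(self, batch: dict) -> dict:
        return self.stage.process_batch(batch)


class Dedup:
    """Hypothetical stage with a display name."""
    name = "dedup"

    def process_batch(self, batch: dict) -> dict:
        return batch


def make_named_adapter(stage) -> type[StageAdapter]:
    """Create a StageAdapter subclass whose __name__ is the stage's name."""
    # Fall back to the stage's class name if it has no `name` attribute.
    name = getattr(stage, "name", type(stage).__name__)
    # type(name, bases, namespace) builds a new class at runtime.
    return type(name, (StageAdapter,), {})


adapter_cls = make_named_adapter(Dedup())
print(adapter_cls.__name__)  # prints dedup
```

Tools that label operators by class name would then show `dedup` rather than a generic adapter name.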

nemo_curator.backends.ray_data.adapter.create_task_from_stage(
stage: nemo_curator.stages.base.ProcessingStage
) -> collections.abc.Callable[[dict[str, typing.Any]], dict[str, typing.Any]]

Create a named Ray Data stage adapter function.

This creates a standalone function that wraps the stage processing logic with a clean name that doesn’t include the class qualification.

Parameters:

stage
ProcessingStage

Processing stage to adapt

Returns: Callable[[dict[str, Any]], dict[str, Any]]

A function that can be used directly with Ray Data’s map_batches
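The "clean name" idea can be sketched as a closure whose `__name__` and `__qualname__` are overwritten. This is an assumption about the general approach, not the library's code: `Dedup` and `make_named_task` are hypothetical, and the stage is assumed to expose `name` and `process_batch`.

```python
from typing import Any, Callable


class Dedup:
    """Hypothetical stage with a display name."""
    name = "dedup"

    def process_batch(self, batch: dict[str, Any]) -> dict[str, Any]:
        return batch


def make_named_task(stage) -> Callable[[dict[str, Any]], dict[str, Any]]:
    """Wrap a stage in a standalone function named after the stage."""

    def stage_fn(batch: dict[str, Any]) -> dict[str, Any]:
        return stage.process_batch(batch)

    # Without this, the qualified name would be
    # "make_named_task.<locals>.stage_fn".
    stage_fn.__name__ = stage.name
    stage_fn.__qualname__ = stage.name
    return stage_fn


fn = make_named_task(Dedup())
print(fn.__qualname__)  # prints dedup
```

A function shaped like `fn` takes one batch dictionary and returns one, which matches the callable form that `ray.data.Dataset.map_batches` accepts.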