
Copyright © 2026, NVIDIA Corporation.

nemo_curator.tasks.interleaved

Interleaved task type and schema for row-wise interleaved multimodal records.

Schema columns fall into two categories:

Reserved columns (RESERVED_COLUMNS) — managed by pipeline stages:

================== ============= =========== ===============================================
Column             Type          Category    Description
================== ============= =========== ===============================================
sample_id          string (req)  Identity    Unique document/sample identifier
position           int32 (req)   Identity    Position within sample (-1 for metadata rows)
modality           string (req)  Identity    Row modality. Built-in values are text, image,
                                             and metadata; extensible to audio, table,
                                             generated_image, etc.
content_type       string        Content     MIME type (e.g. text/plain, image/jpeg)
text_content       string        Content     Text payload for text rows
binary_content     large_binary  Content     Image bytes (populated by materialization)
source_ref         string        Internal    JSON locator {path, member, byte_offset,
                                             byte_size, frame_index}. path alone =
                                             direct/remote read; member = tar extract;
                                             byte_offset/size = range read (fastest).
                                             path accepts local or remote (s3://) URIs.
materialize_error  string        Internal    Error message if materialization failed
================== ============= =========== ===============================================
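The three locator shapes listed for source_ref can be told apart by which keys are set. The following is a minimal sketch, assuming the locator is a JSON object with the five keys named in the table. `pick_read_strategy` is a hypothetical helper, not part of nemo_curator, and the precedence it applies when several keys are set is an assumption.

```python
import json


def pick_read_strategy(source_ref: str) -> str:
    """Hypothetical helper: classify a source_ref locator by the keys it sets."""
    loc = json.loads(source_ref)
    # Assumption: a complete byte range wins, since the table calls it the fastest path.
    if loc.get("byte_offset") is not None and loc.get("byte_size") is not None:
        return "range_read"
    # Otherwise a member name means the payload sits inside a tar archive.
    if loc.get("member") is not None:
        return "tar_extract"
    # A bare path is read directly, whether local or remote (s3://).
    if loc.get("path") is not None:
        return "direct_read"
    return "unresolvable"


direct = json.dumps({"path": "s3://bucket/images/cat.jpg"})
in_tar = json.dumps({"path": "shard-000.tar", "member": "0001.jpg"})
ranged = json.dumps(
    {"path": "shard-000.tar", "member": "0001.jpg", "byte_offset": 1024, "byte_size": 2048}
)

for ref in (direct, in_tar, ranged):
    print(pick_read_strategy(ref))
```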

User columns (passthrough) — extra fields from source data added via the fields parameter on the reader. These flow through the pipeline untouched.
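To make the reserved/user split concrete, the sketch below models one sample as plain Python dicts and separates the two column groups. It is illustrative only: the row values and the `url` user column are invented for the example, and `user_columns` is a hypothetical helper rather than a nemo_curator function.

```python
# Reserved column names, copied from the table above.
RESERVED = frozenset({
    "sample_id", "position", "modality", "content_type",
    "text_content", "binary_content", "source_ref", "materialize_error",
})

# One sample: a metadata row at position -1, then content rows in order.
# "url" is a hypothetical user column carried alongside the reserved ones.
rows = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata",
     "url": "https://example.com/a"},
    {"sample_id": "doc-0", "position": 0, "modality": "text",
     "content_type": "text/plain", "text_content": "A cat on a mat.",
     "url": "https://example.com/a"},
    {"sample_id": "doc-0", "position": 1, "modality": "image",
     "content_type": "image/jpeg",
     "source_ref": '{"path": "s3://bucket/cat.jpg"}',
     "url": "https://example.com/a"},
]


def user_columns(rows):
    """Return the column names that are not reserved, in sorted order."""
    seen = set()
    for row in rows:
        seen.update(row.keys())
    return sorted(seen - RESERVED)


print(user_columns(rows))
```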

Module Contents

Classes

Name              Description
InterleavedBatch  Task carrying row-wise multimodal records.

Data

INTERLEAVED_SCHEMA

RESERVED_COLUMNS

API

class nemo_curator.tasks.interleaved.InterleavedBatch(
task_id: str,
dataset_name: str,
data: pyarrow.Table | pandas.DataFrame = (lambda: pa.Table.from_pyli...,
_stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
_metadata: dict[str, typing.Any] = dict(),
REQUIRED_COLUMNS: frozenset[str] = frozenset(name for name, f ...
)
Dataclass

Bases: Task[Table | DataFrame]

Task carrying row-wise multimodal records.

See module docstring for the full schema reference (reserved vs user columns).

REQUIRED_COLUMNS
frozenset[str]
data
Table | DataFrame
num_items
int

Number of unique samples (distinct sample_id values).

nemo_curator.tasks.interleaved.InterleavedBatch.add_rows(
rows: pyarrow.Table | pandas.DataFrame | list[dict],
sample_id: str | None = None,
auto_position: bool = True
) -> nemo_curator.tasks.interleaved.InterleavedBatch

Add rows to this task.

Parameters:

rows
pa.Table | pd.DataFrame | list[dict]

New rows to append. Must contain required columns unless overridden by sample_id / auto_position.

sample_id
str | None (defaults to None)

If provided, assign this sample_id to all new rows.

auto_position
bool (defaults to True)

If True, auto-assign position values continuing from the existing maximum per sample.
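The auto_position rule can be sketched on plain dicts. This is a hedged illustration of the described behaviour, not the library's implementation: `append_with_auto_position` is a hypothetical function, and starting a previously unseen sample at position 0 is an assumption.

```python
def append_with_auto_position(existing, new_rows, sample_id=None):
    """Hypothetical stand-in for the behaviour described for add_rows."""
    # Record the highest position already used by each sample.
    last_pos = {}
    for row in existing:
        sid = row["sample_id"]
        last_pos[sid] = max(last_pos.get(sid, -1), row["position"])

    out = list(existing)
    for row in new_rows:
        row = dict(row)  # copy so the caller's dicts are not mutated
        if sample_id is not None:
            row["sample_id"] = sample_id  # override applies to every new row
        sid = row["sample_id"]
        # Continue from the existing maximum; assumption: new samples start at 0.
        last_pos[sid] = last_pos.get(sid, -1) + 1
        row["position"] = last_pos[sid]
        out.append(row)
    return out


existing = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata"},
    {"sample_id": "doc-0", "position": 0, "modality": "text"},
    {"sample_id": "doc-0", "position": 1, "modality": "image"},
]
new_rows = [
    {"modality": "text", "text_content": "caption"},
    {"modality": "image"},
]

result = append_with_auto_position(existing, new_rows, sample_id="doc-0")
print([r["position"] for r in result])
```

Passing sample_id is what lets the new rows omit that required column, matching the "unless overridden" wording above.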

nemo_curator.tasks.interleaved.InterleavedBatch.build_source_ref(
path: str | None,
member: str | None,
byte_offset: int | None = None,
byte_size: int | None = None,
frame_index: int | None = None
) -> str
staticmethod

Build a source_ref JSON locator string.
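A locator of this kind can be produced with the standard json module. The sketch below mirrors the parameter names of build_source_ref, but `build_locator` is a hypothetical stand-in, and whether the real method keeps null keys or omits them is an assumption here.

```python
import json


def build_locator(path, member, byte_offset=None, byte_size=None, frame_index=None):
    """Hypothetical equivalent of a source_ref builder.

    Assumption: unset fields are serialized as JSON null rather than dropped.
    """
    return json.dumps({
        "path": path,
        "member": member,
        "byte_offset": byte_offset,
        "byte_size": byte_size,
        "frame_index": frame_index,
    })


ref = build_locator("s3://bucket/shard-000.tar", "0001.jpg",
                    byte_offset=1024, byte_size=2048)
print(ref)
```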

nemo_curator.tasks.interleaved.InterleavedBatch.count(
modality: str | None = None
) -> int

Return row count, optionally filtered by modality.

Examples::

    task.count()                  # total rows
    task.count(modality="image")  # image rows only
    task.count(modality="text")   # text rows only
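Because count works on rows while num_items works on distinct sample_id values, the two can differ for the same batch. A minimal sketch on plain dicts, with `count_rows` and `num_samples` as hypothetical helpers that only imitate the documented semantics:

```python
rows = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata"},
    {"sample_id": "doc-0", "position": 0, "modality": "text"},
    {"sample_id": "doc-0", "position": 1, "modality": "image"},
    {"sample_id": "doc-1", "position": 0, "modality": "image"},
    {"sample_id": "doc-1", "position": 1, "modality": "text"},
]


def count_rows(rows, modality=None):
    """Row count, optionally restricted to one modality."""
    if modality is None:
        return len(rows)
    return sum(1 for row in rows if row["modality"] == modality)


def num_samples(rows):
    """Number of distinct sample_id values."""
    return len({row["sample_id"] for row in rows})


print(count_rows(rows), count_rows(rows, modality="image"), num_samples(rows))
```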

nemo_curator.tasks.interleaved.InterleavedBatch.delete_rows(
mask: pandas.Series
) -> nemo_curator.tasks.interleaved.InterleavedBatch

Delete rows where mask is True.

Parameters:

mask
pd.Series

Boolean Series aligned to the data. True marks a row for deletion.
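The mask convention (True means delete, not keep) is easy to invert by accident. The sketch below shows it on plain lists; in pandas the same filter is written `df[~mask]`. `drop_masked` is a hypothetical helper, and rejecting a mask of the wrong length is an assumption about how misalignment would be handled.

```python
def drop_masked(rows, mask):
    """Keep only the rows whose mask entry is False."""
    if len(mask) != len(rows):
        raise ValueError("mask must have exactly one entry per row")
    return [row for row, drop in zip(rows, mask) if not drop]


rows = [
    {"sample_id": "doc-0", "modality": "text", "materialize_error": None},
    {"sample_id": "doc-0", "modality": "image", "materialize_error": "404 not found"},
    {"sample_id": "doc-1", "modality": "image", "materialize_error": None},
]

# Mark rows whose materialization failed for deletion.
mask = [row["materialize_error"] is not None for row in rows]
kept = drop_masked(rows, mask)
print(len(kept))
```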

nemo_curator.tasks.interleaved.InterleavedBatch.get_columns() -> list[str]

nemo_curator.tasks.interleaved.InterleavedBatch.parse_source_ref(
source_value: str | None
) -> dict[str, str | int | None]
staticmethod

Parse a source_ref JSON string into a locator dict.
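Parsing is the inverse step and can be sketched with json.loads. `parse_locator` below is a hypothetical stand-in: returning an all-None dict for a missing or empty value is an assumption suggested by the `str | None` parameter type, not confirmed behaviour.

```python
import json

LOCATOR_KEYS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def parse_locator(source_value):
    """Hypothetical parser returning a dict with all five locator keys."""
    # Assumption: None or an empty string yields every key set to None.
    if not source_value:
        return {key: None for key in LOCATOR_KEYS}
    loaded = json.loads(source_value)
    # Keys absent from the JSON are filled with None so callers see a fixed shape.
    return {key: loaded.get(key) for key in LOCATOR_KEYS}


parsed = parse_locator('{"path": "shard-000.tar", "member": "0001.jpg"}')
empty = parse_locator(None)
print(parsed["member"], empty["path"])
```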

nemo_curator.tasks.interleaved.InterleavedBatch.to_pandas() -> pandas.DataFrame

nemo_curator.tasks.interleaved.InterleavedBatch.to_pyarrow() -> pyarrow.Table

nemo_curator.tasks.interleaved.InterleavedBatch.validate() -> bool

nemo_curator.tasks.interleaved.InterleavedBatch.with_parsed_source_ref_columns(
prefix: str = '_src_'
) -> pandas.DataFrame

Return a DataFrame copy with parsed source_ref columns added.

Columns: {prefix}path, {prefix}member, {prefix}byte_offset, {prefix}byte_size, {prefix}frame_index.
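The resulting column names follow directly from the prefix. A minimal sketch on plain dicts, assuming each source_ref holds a JSON object; `add_parsed_columns` is a hypothetical helper that imitates the copy-and-expand behaviour described above and does not use pandas.

```python
import json

LOCATOR_FIELDS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def add_parsed_columns(rows, prefix="_src_"):
    """Return copies of rows with one prefixed column per locator field."""
    out = []
    for row in rows:
        raw = row.get("source_ref")
        loc = json.loads(raw) if raw else {}  # rows without a locator get None values
        expanded = dict(row)  # copy so the input rows stay unchanged
        for field in LOCATOR_FIELDS:
            expanded[f"{prefix}{field}"] = loc.get(field)
        out.append(expanded)
    return out


rows = [
    {"sample_id": "doc-0", "modality": "image",
     "source_ref": '{"path": "shard-000.tar", "member": "0001.jpg"}'},
    {"sample_id": "doc-0", "modality": "text", "source_ref": None},
]

expanded = add_parsed_columns(rows)
print(sorted(k for k in expanded[0] if k.startswith("_src_")))
```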

nemo_curator.tasks.interleaved.INTERLEAVED_SCHEMA = pa.schema([pa.field('sample_id', pa.string(), nullable=False), pa.field('positio...
nemo_curator.tasks.interleaved.RESERVED_COLUMNS: frozenset[str] = frozenset(INTERLEAVED_SCHEMA.names)