---
layout: overview
slug: nemo-curator/nemo_curator/tasks/interleaved
title: nemo_curator.tasks.interleaved
---

Interleaved task type and schema for row-wise interleaved multimodal records.

Schema columns fall into two categories:

**Reserved columns** (`RESERVED_COLUMNS`) -- managed by pipeline stages:

| Column              | Type           | Category | Description |
| ------------------- | -------------- | -------- | ----------- |
| `sample_id`         | string (req)   | Identity | Unique document/sample identifier |
| `position`          | int32 (req)    | Identity | Position within sample (-1 for metadata rows) |
| `modality`          | string (req)   | Identity | Row modality -- built-in values are `text`, `image`, and `metadata`; extensible to `audio`, `table`, `generated_image`, etc. |
| `content_type`      | string         | Content  | MIME type (e.g. `text/plain`, `image/jpeg`) |
| `text_content`      | string         | Content  | Text payload for text rows |
| `binary_content`    | large\_binary  | Content  | Image bytes (populated by materialization) |
| `source_ref`        | string         | Internal | JSON locator `&#123;path, member, byte_offset, byte_size, frame_index&#125;`. `path` alone = direct/remote read; + `member` = tar extract; + `byte_offset/size` = range read (fastest). `path` accepts local or remote (`s3://`) URIs. |
| `materialize_error` | string         | Internal | Error message if materialization failed |

**User columns** (passthrough) -- extra fields from source data added via the
`fields` parameter on the reader. These flow through the pipeline untouched.
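
The split between reserved and user columns can be illustrated with plain row dictionaries. This is only a sketch: the `RESERVED` and `REQUIRED` sets are transcribed by hand from the table above rather than imported from the library, and the `url` field and all row values are hypothetical.

```python
# Reserved column names, transcribed from the table above. This is an
# assumption for illustration; the authoritative set is RESERVED_COLUMNS.
RESERVED = {
    "sample_id", "position", "modality", "content_type",
    "text_content", "binary_content", "source_ref", "materialize_error",
}
# The columns marked "(req)" in the table.
REQUIRED = {"sample_id", "position", "modality"}

# One hypothetical sample: a metadata row, a text row, and an image row.
# "url" is an illustrative user (passthrough) column, not part of the schema.
rows = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata",
     "url": "https://example.com/a"},
    {"sample_id": "doc-0", "position": 0, "modality": "text",
     "content_type": "text/plain", "text_content": "A short caption.",
     "url": "https://example.com/a"},
    {"sample_id": "doc-0", "position": 1, "modality": "image",
     "content_type": "image/jpeg", "binary_content": None,
     "url": "https://example.com/a"},
]

# Required columns that each row is missing (empty sets here).
missing = [REQUIRED - set(row) for row in rows]

# Anything that is not a reserved column is treated as a user column.
user_columns = sorted({key for row in rows for key in row} - RESERVED)
print(user_columns)
```

Here the image row leaves `binary_content` empty, matching the table's note that image bytes are populated later by materialization.
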

## Module Contents

### Classes

| Name                                                                   | Description                                |
| ---------------------------------------------------------------------- | ------------------------------------------ |
| [`InterleavedBatch`](#nemo_curator-tasks-interleaved-InterleavedBatch) | Task carrying row-wise multimodal records. |

### Data

[`INTERLEAVED_SCHEMA`](#nemo_curator-tasks-interleaved-INTERLEAVED_SCHEMA)

[`RESERVED_COLUMNS`](#nemo_curator-tasks-interleaved-RESERVED_COLUMNS)

### API

<Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch">
  <CodeBlock links={{"nemo_curator.utils.performance_utils.StagePerfStats":"/nemo-curator/nemo_curator/utils/performance_utils#nemo_curator-utils-performance_utils-StagePerfStats"}} showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.tasks.interleaved.InterleavedBatch(
        task_id: str,
        dataset_name: str,
        data: pyarrow.Table | pandas.DataFrame = (lambda: pa.Table.from_pyli...,
        _stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
        _metadata: dict[str, typing.Any] = dict(),
        REQUIRED_COLUMNS: frozenset[str] = frozenset(name for name, f ...
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [Task\[Table | DataFrame\]](/nemo-curator/nemo_curator/tasks/tasks#nemo_curator-tasks-tasks-Task)

  Task carrying row-wise multimodal records.

  See the module overview at the top of this page for the full schema reference (reserved vs. user columns).

  <ParamField path="REQUIRED_COLUMNS" type="frozenset[str]" />

  <ParamField path="data" type="Table | DataFrame" />

  <ParamField path="num_items" type="int">
    Number of unique samples (distinct `sample_id` values).
  </ParamField>

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-add_rows">
    <CodeBlock links={{"nemo_curator.tasks.interleaved.InterleavedBatch":"#nemo_curator-tasks-interleaved-InterleavedBatch"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.add_rows(
          rows: pyarrow.Table | pandas.DataFrame | list[dict],
          sample_id: str | None = None,
          auto_position: bool = True
      ) -> nemo_curator.tasks.interleaved.InterleavedBatch
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Add rows to this task.

    **Parameters:**

    <ParamField path="rows" type="pa.Table | pd.DataFrame | list[dict]">
      New rows to append. Must contain required columns unless
      overridden by *sample\_id* / *auto\_position*.
    </ParamField>

    <ParamField path="sample_id" type="str | None" default="None">
      If provided, assign this `sample_id` to all new rows.
    </ParamField>

    <ParamField path="auto_position" type="bool" default="True">
      If `True`, auto-assign `position` values
      continuing from the existing maximum per sample.
    </ParamField>
  </Indent>
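
The effect of `sample_id` and `auto_position` can be sketched in plain Python on row dictionaries. The helper `append_rows` below is hypothetical and is not the library's implementation. It also assumes that a sample with no existing rows starts numbering at 0, which the description above does not specify.

```python
def append_rows(existing, new_rows, sample_id=None):
    """Return existing + new_rows, numbering new rows after each sample's maximum position."""
    # Next free position per sample. A metadata row at -1 leaves 0 as the next slot.
    next_position = {}
    for row in existing:
        sid = row["sample_id"]
        next_position[sid] = max(next_position.get(sid, 0), row["position"] + 1)

    combined = list(existing)
    for row in new_rows:
        row = dict(row)  # copy so the caller's input is not mutated
        if sample_id is not None:
            row["sample_id"] = sample_id  # override, as the sample_id parameter does
        sid = row["sample_id"]
        row["position"] = next_position.get(sid, 0)  # assumption: new samples start at 0
        next_position[sid] = row["position"] + 1
        combined.append(row)
    return combined


existing = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata"},
    {"sample_id": "doc-0", "position": 0, "modality": "text"},
    {"sample_id": "doc-0", "position": 1, "modality": "image"},
]
new_rows = [{"modality": "text"}, {"modality": "image"}]

combined = append_rows(existing, new_rows, sample_id="doc-0")
positions = [row["position"] for row in combined]
print(positions)  # [-1, 0, 1, 2, 3]
```

The new rows omit both `sample_id` and `position`, which is the case the `rows` parameter describes as overridden by the other two arguments.
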

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-build_source_ref">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.build_source_ref(
          path: str | None,
          member: str | None,
          byte_offset: int | None = None,
          byte_size: int | None = None,
          frame_index: int | None = None
      ) -> str
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      staticmethod
    </Badge>

    Build a `source_ref` JSON locator string.
  </Indent>

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-count">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.count(
          modality: str | None = None
      ) -> int
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Return row count, optionally filtered by modality.

    Examples:

    * `task.count()`: total rows
    * `task.count(modality="image")`: image rows only
    * `task.count(modality="text")`: text rows only
  </Indent>
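
The difference between a row count and `num_items` (distinct `sample_id` values) can be sketched on plain row dictionaries. The helper `count_rows` and the sample data are hypothetical. The sketch also assumes that an unfiltered count includes metadata rows.

```python
# Two hypothetical samples with six rows in total.
rows = [
    {"sample_id": "doc-0", "position": -1, "modality": "metadata"},
    {"sample_id": "doc-0", "position": 0, "modality": "text"},
    {"sample_id": "doc-0", "position": 1, "modality": "image"},
    {"sample_id": "doc-1", "position": 0, "modality": "text"},
    {"sample_id": "doc-1", "position": 1, "modality": "image"},
    {"sample_id": "doc-1", "position": 2, "modality": "image"},
]


def count_rows(rows, modality=None):
    """Count rows, optionally keeping only those with the given modality."""
    if modality is None:
        return len(rows)
    return sum(1 for row in rows if row["modality"] == modality)


total_rows = count_rows(rows)                     # every row, metadata included
image_rows = count_rows(rows, modality="image")   # image rows only
num_samples = len({row["sample_id"] for row in rows})  # analogue of num_items

print(total_rows, image_rows, num_samples)  # 6 3 2
```
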

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-delete_rows">
    <CodeBlock links={{"nemo_curator.tasks.interleaved.InterleavedBatch":"#nemo_curator-tasks-interleaved-InterleavedBatch"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.delete_rows(
          mask: pandas.Series
      ) -> nemo_curator.tasks.interleaved.InterleavedBatch
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Delete rows where *mask* is `True`.

    **Parameters:**

    <ParamField path="mask" type="pd.Series">
      Boolean Series aligned to the data. `True` marks a row
      for deletion.
    </ParamField>
  </Indent>

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-get_columns">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.get_columns() -> list[str]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-parse_source_ref">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.parse_source_ref(
          source_value: str | None
      ) -> dict[str, str | int | None]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      staticmethod
    </Badge>

    Parse a `source_ref` JSON string into a locator dict.
  </Indent>
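
A `source_ref` round trip can be sketched with the standard `json` module. The helpers `make_locator` and `read_strategy` are hypothetical stand-ins, not the library's `build_source_ref` and `parse_source_ref`. Whether the real methods keep unset fields as `null` or omit them is an assumption here. The strategy rules follow the `source_ref` description in the schema table.

```python
import json


def make_locator(path, member=None, byte_offset=None, byte_size=None, frame_index=None):
    """Serialize locator fields to a JSON string (unset fields kept as null, by assumption)."""
    return json.dumps({
        "path": path,
        "member": member,
        "byte_offset": byte_offset,
        "byte_size": byte_size,
        "frame_index": frame_index,
    })


def read_strategy(source_ref):
    """Pick the read strategy implied by the fields that are set."""
    locator = json.loads(source_ref)
    if locator.get("byte_offset") is not None and locator.get("byte_size") is not None:
        return "range read"
    if locator.get("member") is not None:
        return "tar extract"
    return "direct read"


# The bucket and file names below are made up for illustration.
direct = make_locator("s3://bucket/images/a.jpg")
in_tar = make_locator("s3://bucket/shard-000.tar", member="a.jpg")
ranged = make_locator("s3://bucket/shard-000.tar", member="a.jpg",
                      byte_offset=1024, byte_size=2048)

print(read_strategy(direct), read_strategy(in_tar), read_strategy(ranged))
```
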

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-to_pandas">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.to_pandas() -> pandas.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-to_pyarrow">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.to_pyarrow() -> pyarrow.Table
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-validate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.validate() -> bool
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-tasks-interleaved-InterleavedBatch-with_parsed_source_ref_columns">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.tasks.interleaved.InterleavedBatch.with_parsed_source_ref_columns(
          prefix: str = '_src_'
      ) -> pandas.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Return a DataFrame copy with parsed `source_ref` columns added.

    Columns: `&#123;prefix&#125;path`, `&#123;prefix&#125;member`, `&#123;prefix&#125;byte_offset`,
    `&#123;prefix&#125;byte_size`, `&#123;prefix&#125;frame_index`.
  </Indent>
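
The column expansion can be sketched on plain row dictionaries instead of a DataFrame. The helper `expand_source_ref` is hypothetical and only mirrors the column naming described above. Filling the new columns with `None` for rows that have no `source_ref` is an assumption.

```python
import json

LOCATOR_FIELDS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def expand_source_ref(rows, prefix="_src_"):
    """Return copies of rows with one prefixed column per locator field."""
    expanded = []
    for row in rows:
        raw = row.get("source_ref")
        locator = json.loads(raw) if raw else {}  # assumption: missing ref gives all None
        new_row = dict(row)  # copy, so the input rows stay unchanged
        for field in LOCATOR_FIELDS:
            new_row[f"{prefix}{field}"] = locator.get(field)
        expanded.append(new_row)
    return expanded


# Hypothetical rows: an image row with a tar locator and a text row without one.
rows = [
    {"sample_id": "doc-0", "position": 1, "modality": "image",
     "source_ref": json.dumps({"path": "s3://bucket/shard-000.tar", "member": "doc-0.jpg"})},
    {"sample_id": "doc-0", "position": 0, "modality": "text", "source_ref": None},
]

expanded = expand_source_ref(rows)
print(sorted(key for key in expanded[0] if key.startswith("_src_")))
```
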
</Indent>

<Anchor id="nemo_curator-tasks-interleaved-INTERLEAVED_SCHEMA">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.tasks.interleaved.INTERLEAVED_SCHEMA = pa.schema([pa.field('sample_id', pa.string(), nullable=False), pa.field('positio...
    ```
  </CodeBlock>
</Anchor>

<Anchor id="nemo_curator-tasks-interleaved-RESERVED_COLUMNS">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.tasks.interleaved.RESERVED_COLUMNS: frozenset[str] = frozenset(INTERLEAVED_SCHEMA.names)
    ```
  </CodeBlock>
</Anchor>
