---
layout: overview
slug: nemo-curator/nemo_curator/tasks/interleaved
title: nemo_curator.tasks.interleaved
---

Interleaved task type and schema for row-wise interleaved multimodal records.

Schema columns fall into two categories:

**Reserved columns** (`RESERVED_COLUMNS`), managed by pipeline stages:

| Column | Type | Category | Description |
| ------ | ---- | -------- | ----------- |
| `sample_id` | string (req) | Identity | Unique document/sample identifier |
| `position` | int32 (req) | Identity | Position within sample (-1 for metadata rows) |
| `modality` | string (req) | Identity | Row modality; built-in values are `text`, `image`, and `metadata`, extensible to `audio`, `table`, `generated_image`, etc. |
| `content_type` | string | Content | MIME type (e.g. `text/plain`, `image/jpeg`) |
| `text_content` | string | Content | Text payload for text rows |
| `binary_content` | large\_binary | Content | Image bytes (populated by materialization) |
| `source_ref` | string | Internal | JSON locator `{path, member, byte_offset, byte_size, frame_index}`. `path` alone = direct/remote read; adding `member` = tar extract; adding `byte_offset`/`byte_size` = range read (fastest). `path` accepts local or remote (`s3://`) URIs. |
| `materialize_error` | string | Internal | Error message if materialization failed |

**User columns** (passthrough): extra fields from source data added via the `fields` parameter on the reader. These flow through the pipeline untouched.

## Module Contents

### Classes

| Name                                                                   | Description                                |
| ---------------------------------------------------------------------- | ------------------------------------------ |
| [`InterleavedBatch`](#nemo_curator-tasks-interleaved-InterleavedBatch) | Task carrying row-wise multimodal records. |

### Data

[`INTERLEAVED_SCHEMA`](#nemo_curator-tasks-interleaved-INTERLEAVED_SCHEMA)

[`RESERVED_COLUMNS`](#nemo_curator-tasks-interleaved-RESERVED_COLUMNS)

### API

```python
class nemo_curator.tasks.interleaved.InterleavedBatch(
    task_id: str,
    dataset_name: str,
    data: pyarrow.Table | pandas.DataFrame = (lambda: pa.Table.from_pyli...,
    _stage_perf: list[nemo_curator.utils.performance_utils.StagePerfStats] = list(),
    _metadata: dict[str, typing.Any] = dict(),
    REQUIRED_COLUMNS: frozenset[str] = frozenset(name for name, f ...
)
```

Dataclass

**Bases:** [Task\[Table | DataFrame\]](/nemo-curator/nemo_curator/tasks/tasks#nemo_curator-tasks-tasks-Task)

Task carrying row-wise multimodal records.

See module docstring for the full schema reference (reserved vs user columns).

Number of unique samples (distinct `sample_id` values).

```python
nemo_curator.tasks.interleaved.InterleavedBatch.add_rows(
    rows: pyarrow.Table | pandas.DataFrame | list[dict],
    sample_id: str | None = None,
    auto_position: bool = True
) -> nemo_curator.tasks.interleaved.InterleavedBatch
```

Add rows to this task.

**Parameters:**

- `rows`: New rows to append. Must contain required columns unless overridden by *sample\_id* / *auto\_position*.
- `sample_id`: If provided, assign this `sample_id` to all new rows.
- `auto_position`: If `True`, auto-assign `position` values continuing from the existing maximum per sample.

```python
nemo_curator.tasks.interleaved.InterleavedBatch.build_source_ref(
    path: str | None,
    member: str | None,
    byte_offset: int | None = None,
    byte_size: int | None = None,
    frame_index: int | None = None
) -> str
```

staticmethod

Build a `source_ref` JSON locator string.

```python
nemo_curator.tasks.interleaved.InterleavedBatch.count(
    modality: str | None = None
) -> int
```

Return row count, optionally filtered by modality.
Examples:

    task.count()                  # total rows
    task.count(modality="image")  # image rows only
    task.count(modality="text")   # text rows only

```python
nemo_curator.tasks.interleaved.InterleavedBatch.delete_rows(
    mask: pandas.Series
) -> nemo_curator.tasks.interleaved.InterleavedBatch
```

Delete rows where *mask* is `True`.

**Parameters:**

- `mask`: Boolean Series aligned to the data. `True` marks a row for deletion.

```python
nemo_curator.tasks.interleaved.InterleavedBatch.get_columns() -> list[str]
```

```python
nemo_curator.tasks.interleaved.InterleavedBatch.parse_source_ref(
    source_value: str | None
) -> dict[str, str | int | None]
```

staticmethod

Parse a `source_ref` JSON string into a locator dict.

```python
nemo_curator.tasks.interleaved.InterleavedBatch.to_pandas() -> pandas.DataFrame
```

```python
nemo_curator.tasks.interleaved.InterleavedBatch.to_pyarrow() -> pyarrow.Table
```

```python
nemo_curator.tasks.interleaved.InterleavedBatch.validate() -> bool
```

```python
nemo_curator.tasks.interleaved.InterleavedBatch.with_parsed_source_ref_columns(
    prefix: str = '_src_'
) -> pandas.DataFrame
```

Return a DataFrame copy with parsed `source_ref` columns added.

Columns: `{prefix}path`, `{prefix}member`, `{prefix}byte_offset`, `{prefix}byte_size`, `{prefix}frame_index`.

```python
nemo_curator.tasks.interleaved.INTERLEAVED_SCHEMA = pa.schema([pa.field('sample_id', pa.string(), nullable=False), pa.field('positio...
```

```python
nemo_curator.tasks.interleaved.RESERVED_COLUMNS: frozenset[str] = frozenset(INTERLEAVED_SCHEMA.names)
```
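
The `source_ref` locator format described in the schema reference can be illustrated with the standard library alone. The sketch below is not the library's implementation: `build_source_ref_sketch`, `parse_source_ref_sketch`, and `read_mode` are hypothetical stand-ins that only mirror the documented field names and the three documented read cases. The example URI, the handling of `None` input, the `"none"` label, and the precedence of range reads over tar extraction are all assumptions.

```python
import json
from typing import Optional

# The five locator keys listed for the `source_ref` column.
SOURCE_REF_KEYS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def build_source_ref_sketch(
    path: Optional[str],
    member: Optional[str] = None,
    byte_offset: Optional[int] = None,
    byte_size: Optional[int] = None,
    frame_index: Optional[int] = None,
) -> str:
    """Hypothetical stand-in: serialize the five locator fields to JSON."""
    return json.dumps(
        {
            "path": path,
            "member": member,
            "byte_offset": byte_offset,
            "byte_size": byte_size,
            "frame_index": frame_index,
        }
    )


def parse_source_ref_sketch(source_value: Optional[str]) -> dict:
    """Hypothetical stand-in: parse JSON and fill absent keys with None."""
    # Assumption: None or an empty string yields an all-None locator.
    raw = json.loads(source_value) if source_value else {}
    return {key: raw.get(key) for key in SOURCE_REF_KEYS}


def read_mode(locator: dict) -> str:
    """Classify a locator using the three cases from the schema table."""
    if locator["path"] is None:
        return "none"  # assumption: nothing to read without a path
    if locator["byte_offset"] is not None and locator["byte_size"] is not None:
        return "range read"  # documented as the fastest case
    if locator["member"] is not None:
        return "tar extract"
    return "direct read"


# Placeholder URI, chosen only for illustration.
ref = build_source_ref_sketch("s3://bucket/shard-0000.tar", member="0001.jpg")
locator = parse_source_ref_sketch(ref)
mode = read_mode(locator)
```

Here `locator` carries all five keys, with the unset ones as `None`, and `mode` is `"tar extract"` because a `member` is present without a byte range. In real pipelines, the `InterleavedBatch.build_source_ref` and `InterleavedBatch.parse_source_ref` static methods are the entry points to use.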