nemo_curator.tasks.interleaved
Interleaved task type and schema for row-wise interleaved multimodal records.
Schema columns fall into two categories:
Reserved columns (RESERVED_COLUMNS) — managed by pipeline stages:
================== ============= =========== ===============================================
Column             Type          Category    Description
================== ============= =========== ===============================================
sample_id          string (req)  Identity    Unique document/sample identifier
position           int32 (req)   Identity    Position within sample (-1 for metadata rows)
modality           string (req)  Identity    Row modality; built-in values are text,
                                             image, and metadata; extensible to
                                             audio, table, generated_image, etc.
content_type       string        Content     MIME type (e.g. text/plain, image/jpeg)
text_content       string        Content     Text payload for text rows
binary_content     large_binary  Content     Image bytes (populated by materialization)
source_ref         string        Internal    JSON locator {path, member, byte_offset,
                                             byte_size, frame_index}. path alone =
                                             direct/remote read; member = tar extract;
                                             byte_offset/size = range read (fastest).
                                             path accepts local or remote (s3://) URIs.
materialize_error  string        Internal    Error message if materialization failed
================== ============= =========== ===============================================
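To make the reserved columns concrete, the following is a minimal sketch of what the rows of one sample might look like, written as plain dictionaries rather than an Arrow table or DataFrame. The helper `make_row`, the sample identifier, and the `s3://` path are all hypothetical and serve only as illustration; only the column names come from the table above.

```python
import json

# Reserved column names, copied from the schema table.
RESERVED = (
    "sample_id", "position", "modality", "content_type",
    "text_content", "binary_content", "source_ref", "materialize_error",
)


def make_row(sample_id, position, modality, **fields):
    """Hypothetical helper: build a row with every reserved column present."""
    row = dict.fromkeys(RESERVED)  # unset columns default to None
    row.update(sample_id=sample_id, position=position, modality=modality, **fields)
    return row


rows = [
    # Metadata rows use position -1.
    make_row("doc-001", -1, "metadata"),
    make_row("doc-001", 0, "text",
             content_type="text/plain", text_content="A cat on a mat."),
    # Image bytes are not loaded yet; source_ref says where to find them.
    make_row("doc-001", 1, "image",
             content_type="image/jpeg",
             source_ref=json.dumps({"path": "s3://bucket/shard-0000.tar",
                                    "member": "doc-001.jpg"})),
]
```

Here `binary_content` stays empty on the image row until a materialization step fills it in.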
User columns (passthrough) — extra fields from source data added via the
fields parameter on the reader. These flow through the pipeline untouched.
Module Contents
Classes
Data
API
Bases: Task[Table | DataFrame]
Task carrying row-wise multimodal records.
See module docstring for the full schema reference (reserved vs user columns).
Number of unique samples (distinct sample_id values).
Add rows to this task.
Parameters:
New rows to append. Must contain required columns unless overridden by sample_id / auto_position.
sample_id: If provided, assign this sample_id to all new rows.
auto_position: If True, auto-assign position values
continuing from the existing maximum per sample.
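The auto-positioning rule can be sketched in plain Python. This is not the library's implementation: `assign_positions` is a hypothetical helper, rows are plain dictionaries, and the choice that a sample with no content rows starts at position 0 is an assumption.

```python
from collections import defaultdict


def assign_positions(existing, new_rows, sample_id=None):
    """Hypothetical helper: return copies of new_rows with positions filled in."""
    # Next free position per sample: one past the existing maximum.
    # Assumption: a sample with only a metadata row (-1) or no rows starts at 0.
    next_pos = defaultdict(int)
    for row in existing:
        sid = row["sample_id"]
        next_pos[sid] = max(next_pos[sid], row["position"] + 1)

    out = []
    for row in new_rows:
        row = dict(row)  # do not mutate the caller's rows
        if sample_id is not None:
            row["sample_id"] = sample_id  # override, as described above
        sid = row["sample_id"]
        row["position"] = next_pos[sid]
        next_pos[sid] += 1
        out.append(row)
    return out


existing = [
    {"sample_id": "a", "position": -1, "modality": "metadata"},
    {"sample_id": "a", "position": 0, "modality": "text"},
    {"sample_id": "a", "position": 1, "modality": "image"},
]
added = assign_positions(existing, [{"modality": "text"}, {"modality": "image"}],
                         sample_id="a")
```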
Build a source_ref JSON locator string.
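As an illustration of what such a locator string might contain, here is a minimal sketch using the standard `json` module. The function name `build_source_ref`, its signature, and the decision to emit unset keys as `null` are assumptions; only the five key names come from the schema description.

```python
import json


def build_source_ref(path, member=None, byte_offset=None,
                     byte_size=None, frame_index=None):
    """Hypothetical sketch: serialize a locator dict to a JSON string."""
    return json.dumps({
        "path": path,                # local or remote (s3://) URI
        "member": member,            # tar member name, if path is an archive
        "byte_offset": byte_offset,  # start of the payload for a range read
        "byte_size": byte_size,      # length of the payload for a range read
        "frame_index": frame_index,
    })


ref = build_source_ref("s3://bucket/shard-0000.tar",
                       member="doc-001.jpg", byte_offset=1024, byte_size=2048)
```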
Return row count, optionally filtered by modality.
Examples::
    task.count()                  # total rows
    task.count(modality="image")  # image rows only
    task.count(modality="text")   # text rows only
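The counting semantics, together with the distinct-sample count described earlier, can be sketched over plain dictionaries. The helpers `count_rows` and `num_samples` are hypothetical stand-ins and do not reproduce the task class itself.

```python
def count_rows(rows, modality=None):
    """Hypothetical sketch: total row count, or count for one modality."""
    if modality is None:
        return len(rows)
    return sum(1 for row in rows if row["modality"] == modality)


def num_samples(rows):
    """Hypothetical sketch: number of distinct sample_id values."""
    return len({row["sample_id"] for row in rows})


rows = [
    {"sample_id": "a", "modality": "metadata"},
    {"sample_id": "a", "modality": "text"},
    {"sample_id": "a", "modality": "image"},
    {"sample_id": "b", "modality": "image"},
]
```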
Delete rows where mask is True.
Parameters:
Boolean Series aligned to the data. True marks a row
for deletion.
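The mask semantics can be shown with a dependency-free analogue, where a list of booleans plays the role of the boolean Series. The helper `delete_rows` is hypothetical, and the length check is an assumption about how misaligned masks should be handled.

```python
def delete_rows(rows, mask):
    """Hypothetical sketch: drop every row whose mask entry is True."""
    if len(mask) != len(rows):
        # Assumption: a mask that is not aligned to the data is an error.
        raise ValueError("mask must have one entry per row")
    return [row for row, drop in zip(rows, mask) if not drop]


rows = [
    {"sample_id": "a", "modality": "text"},
    {"sample_id": "a", "modality": "image"},
    {"sample_id": "b", "modality": "image"},
]
# Mark all image rows for deletion.
mask = [row["modality"] == "image" for row in rows]
kept = delete_rows(rows, mask)
```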
Parse a source_ref JSON string into a locator dict.
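A minimal sketch of the parsing step, plus one way a reader might pick an access strategy from the result, follows. Both `parse_source_ref` and `read_strategy` are hypothetical; treating an empty string as an empty locator and preferring a range read over tar extraction are assumptions based on the schema note that range reads are fastest.

```python
import json

LOCATOR_KEYS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def parse_source_ref(source_ref):
    """Hypothetical sketch: JSON string -> dict with all five locator keys."""
    data = json.loads(source_ref) if source_ref else {}
    return {key: data.get(key) for key in LOCATOR_KEYS}


def read_strategy(locator):
    """Hypothetical sketch: choose how the payload would be fetched."""
    if locator["byte_offset"] is not None and locator["byte_size"] is not None:
        return "range_read"   # offset and size known
    if locator["member"] is not None:
        return "tar_extract"  # path is an archive, member names the file
    return "direct_read"      # path alone


loc = parse_source_ref('{"path": "data/shard.tar", "member": "img.jpg"}')
```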
Return a DataFrame copy with parsed source_ref columns added.
Columns: {prefix}path, {prefix}member, {prefix}byte_offset,
{prefix}byte_size, {prefix}frame_index.
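The column expansion can be sketched without pandas by adding prefixed keys to copies of each row. The helper name `with_parsed_source_ref` and the default prefix `"source_"` are assumptions made for illustration; the five suffixes match the column list above.

```python
import json

LOCATOR_KEYS = ("path", "member", "byte_offset", "byte_size", "frame_index")


def with_parsed_source_ref(rows, prefix="source_"):
    """Hypothetical sketch: return row copies with {prefix}<key> fields added."""
    out = []
    for row in rows:
        ref = row.get("source_ref")
        data = json.loads(ref) if ref else {}
        new_row = dict(row)  # copy, leaving the input rows untouched
        for key in LOCATOR_KEYS:
            new_row[f"{prefix}{key}"] = data.get(key)
        out.append(new_row)
    return out


rows = [
    {"sample_id": "a", "source_ref": '{"path": "s.tar", "member": "x.jpg"}'},
    {"sample_id": "a", "source_ref": None},
]
expanded = with_parsed_source_ref(rows, prefix="ref_")
```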