nemo_curator.stages.interleaved.io.writers.webdataset
Module Contents
Classes
Functions
Data
API
Bases: BaseInterleavedWriter
Write an InterleavedBatch as a MINT-1T-style WebDataset tar shard.
Each sample is reconstructed from its row-based representation:
- metadata rows supply passthrough fields embedded in the JSON.
- text rows are assembled into the "texts" list (None at gaps).
- image rows are assembled into the "images" list and written as individual tar members; binary_content must be populated (either by the upstream pipeline or via materialize_on_write=True).
The JSON member key is urllib.parse.quote(sample_id, safe="") so that
roundtripping via InterleavedWebdatasetReaderStage with
sample_id_field="sample_id" recovers the original sample_id.
Only "metadata", "text", and "image" modalities are supported.
Any other modality raises ValueError at write time.
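The row-to-sample reconstruction described above can be sketched as follows. This is a minimal illustration, not the writer's actual code: the row layout (dicts with "modality", "position", "text_content", and "image_name" keys) and the helper name assemble_sample are assumptions, and the sketch assumes that "texts" and "images" are position-aligned lists, as in MINT-1T.

```python
SUPPORTED_MODALITIES = {"metadata", "text", "image"}


def assemble_sample(rows):
    """Rebuild the "texts"/"images" lists of one sample from row dicts.

    Hypothetical row layout: each row has a "modality" key; content rows
    also carry an integer "position" plus "text_content" or "image_name".
    """
    # Reject unsupported modalities before building anything.
    for row in rows:
        if row["modality"] not in SUPPORTED_MODALITIES:
            raise ValueError(f"Unsupported modality: {row['modality']!r}")

    content = [r for r in rows if r["modality"] != "metadata"]
    length = max((r["position"] for r in content), default=-1) + 1

    # Both lists start as all-None so that gaps stay None.
    texts = [None] * length
    images = [None] * length
    for row in content:
        if row["modality"] == "text":
            texts[row["position"]] = row["text_content"]
        else:
            images[row["position"]] = row["image_name"]
    return {"texts": texts, "images": images}


rows = [
    {"modality": "metadata"},
    {"modality": "text", "position": 0, "text_content": "hello"},
    {"modality": "image", "position": 1, "image_name": "a.jpg"},
]
sample = assemble_sample(rows)
```

In this sketch a row with any other modality (for example "audio") raises ValueError before any list is built.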
Percent-encode a sample_id so it is safe as a tar member name stem.
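As an illustration of the encoding, the sketch below applies urllib.parse.quote with safe="" and decodes the result again. The helper name encode_sample_id and the example ID are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_sample_id(sample_id: str) -> str:
    """Percent-encode every reserved character, including "/"."""
    return quote(sample_id, safe="")


original = "https://example.com/a b?x=1"
encoded = encode_sample_id(original)
# encoded == "https%3A%2F%2Fexample.com%2Fa%20b%3Fx%3D1"
decoded = unquote(encoded)
```

Because safe="" leaves no "/" in the result, the encoded ID cannot introduce directory components into a tar member name, and unquote restores the original string.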
Return a file extension for content_type, falling back to "bin".
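One way to implement such a lookup is a small mapping with a default. The mapping contents and the name extension_for below are assumptions for illustration; the actual function may recognize a different set of content types.

```python
# Hypothetical mapping; the real function may cover other content types.
_EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
    "image/tiff": "tiff",
}


def extension_for(content_type):
    """Return an extension without the leading dot, or "bin" if unknown."""
    if not content_type:
        return "bin"
    # Drop parameters such as "; charset=..." and normalize case.
    base = content_type.split(";")[0].strip().lower()
    return _EXTENSIONS.get(base, "bin")
```

With this sketch, extension_for("image/png") gives "png", while a missing or unrecognized content type falls back to "bin".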
Return True for Python/pandas null-ish scalars.
Build per-modality passthrough lists from content rows.
For each passthrough column not already in meta_keys:
- Pure per-image (non-null only in image rows): emitted as a list with one entry per image row (None preserved for sparse nulls).
- Pure per-text (non-null only in text rows): emitted as a list with one entry per text row (None preserved for sparse nulls).
- Mixed (non-null in both image and text rows): emitted as a position-aligned list with one entry per content row in position order (None where the value is absent). When read back without declaring the field in per_image_fields/per_text_fields, the reader treats the entire list as a sample-level passthrough on the metadata row.
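The three cases above can be sketched for a single passthrough column as follows. The row layout (dicts sorted by position with a "modality" key) and the name build_passthrough are assumptions, and returning None for a column that is null everywhere is a choice made for this sketch rather than documented behavior.

```python
def build_passthrough(content_rows, column):
    """Classify one column as per-image, per-text, or mixed.

    content_rows is a hypothetical list of row dicts already sorted by
    position; a missing key is treated as a null value.
    """
    image_vals = [r.get(column) for r in content_rows if r["modality"] == "image"]
    text_vals = [r.get(column) for r in content_rows if r["modality"] == "text"]
    in_images = any(v is not None for v in image_vals)
    in_texts = any(v is not None for v in text_vals)

    if in_images and in_texts:
        # Mixed: one entry per content row, in position order.
        return [r.get(column) for r in content_rows]
    if in_images:
        # Pure per-image: one entry per image row.
        return image_vals
    if in_texts:
        # Pure per-text: one entry per text row.
        return text_vals
    return None  # Assumption: a column that is null everywhere is skipped.


rows = [
    {"modality": "text", "position": 0, "lang": "en"},
    {"modality": "image", "position": 1, "alt": "cat", "score": 0.9},
    {"modality": "text", "position": 2, "score": 0.1},
    {"modality": "image", "position": 3},
]
alt = build_passthrough(rows, "alt")      # ["cat", None]
lang = build_passthrough(rows, "lang")    # ["en", None]
score = build_passthrough(rows, "score")  # [None, 0.9, 0.1, None]
```

The per-image and per-text lists have one entry per row of that modality, while the mixed list has one entry per content row.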
Return a Python list from series, replacing any null-like value with None.
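A standard-library sketch of the null check and the list conversion is shown below. It only recognizes None and float NaN, and it accepts any iterable rather than a pandas Series; a pandas-based version could use pandas.isna to also cover pd.NA and NaT. The names is_null and to_list are hypothetical.

```python
import math


def is_null(value):
    """Return True for None and float NaN (a subset of null-ish scalars)."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def to_list(values):
    """Copy values into a list, replacing null-like entries with None."""
    return [None if is_null(v) else v for v in values]


cleaned = to_list([1, float("nan"), None, "a"])
# cleaned == [1, None, None, "a"]
```

Normalizing to None before serialization matters because json.dumps would otherwise emit NaN, which is not valid JSON.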
Write one sample (JSON + image binaries) into tf.
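Writing one sample into an open tarfile.TarFile might look like the sketch below. The JSON member name follows the percent-encoded key described earlier; the image member naming scheme (key, index, extension) and the helper names are assumptions for illustration only.

```python
import io
import json
import tarfile
from urllib.parse import quote


def add_member(tf, name, payload):
    """Add an in-memory bytes payload to the tar as a regular file."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))


def write_sample(tf, sample_id, sample, images):
    """Write the JSON member, then one member per (extension, bytes) image.

    The "<key>.<index>.<ext>" image naming is a hypothetical scheme.
    """
    key = quote(sample_id, safe="")
    add_member(tf, f"{key}.json", json.dumps(sample).encode("utf-8"))
    for index, (ext, data) in enumerate(images):
        add_member(tf, f"{key}.{index}.{ext}", data)


buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tf:
    write_sample(
        tf,
        "doc/1",
        {"texts": ["hi", None], "images": [None, "doc%2F1.0.png"]},
        [("png", b"\x89PNG")],
    )

# Read the shard back to list the members that were written.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tf:
    names = tf.getnames()
```

Setting TarInfo.size before calling addfile is required; otherwise the member is written with zero length.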