nemo_curator.stages.interleaved.io.writers.webdataset


Module Contents

Classes

Name | Description
InterleavedWebdatasetWriterStage | Write an InterleavedBatch as a MINT-1T-style WebDataset tar shard.

Functions

Name | Description
_escape_key | Percent-encode a sample_id so it is safe as a tar member name stem.
_ext_from_content_type | Return a file extension for content_type, falling back to "bin".
_is_null | Return True for Python/pandas null-ish scalars.
_per_modality_passthrough | Build per-modality passthrough lists from content rows.
_series_to_nullable_list | Return a Python list from series, replacing any null-like value with None.
_write_sample | Write one sample (JSON + image binaries) into tf.

Data

_CONTENT_TYPE_TO_EXT

API

class nemo_curator.stages.interleaved.io.writers.webdataset.InterleavedWebdatasetWriterStage(
path: str,
file_extension: str = 'tar',
write_kwargs: dict[str, typing.Any] = dict(),
materialize_on_write: bool = True,
name: str = 'interleaved_webdataset_wri...,
mode: typing.Literal['ignore', 'overwrite', 'append', 'error'] = 'ignore',
append_mode_implemented: bool = False,
on_materialize_error: typing.Literal['error', 'warn', 'drop_row', 'drop_sample'] = 'error',
schema: pyarrow.Schema | None = None,
schema_overrides: dict[str, pyarrow.DataType] | None = None
)
Dataclass

Bases: BaseInterleavedWriter

Write an InterleavedBatch as a MINT-1T-style WebDataset tar shard.

Each sample is reconstructed from its row-based representation:

  • metadata rows supply passthrough fields embedded in the JSON.
  • text rows are assembled into the "texts" list (None at gaps).
  • image rows are assembled into the "images" list and written as individual tar members; binary_content must be populated (either by the upstream pipeline or via materialize_on_write=True).

The JSON member key is urllib.parse.quote(sample_id, safe=""), so that a round trip through InterleavedWebdatasetReaderStage with sample_id_field="sample_id" recovers the original sample_id.

Only "metadata", "text", and "image" modalities are supported. Any other modality raises ValueError at write time.
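
The reconstruction described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the row keys (modality, position, text, binary, ext, fields), the image member naming scheme, and the choice to store member names in the "images" list are all assumptions made for illustration. Only the JSON key construction, the "texts"/"images" lists, and the ValueError for unsupported modalities come from the description above.

```python
import io
import json
import tarfile
from urllib.parse import quote

SUPPORTED_MODALITIES = frozenset({"metadata", "text", "image"})


def add_member(tf: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Append an in-memory payload to the archive as a regular file."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))


def write_sample_sketch(tf: tarfile.TarFile, sample_id: str, rows: list) -> None:
    """Write one sample; all row keys used here are hypothetical."""
    for row in rows:
        if row["modality"] not in SUPPORTED_MODALITIES:
            raise ValueError(f"unsupported modality: {row['modality']!r}")

    key = quote(sample_id, safe="")  # key construction as documented above
    doc = {}
    for row in rows:
        if row["modality"] == "metadata":
            doc.update(row.get("fields", {}))  # passthrough fields go into the JSON

    # Content rows in position order; each list gets None where the slot
    # belongs to the other modality.
    content = sorted(
        (r for r in rows if r["modality"] != "metadata"),
        key=lambda r: r["position"],
    )
    texts = [None] * len(content)
    images = [None] * len(content)
    for slot, row in enumerate(content):
        if row["modality"] == "text":
            texts[slot] = row["text"]
        else:
            if row.get("binary") is None:
                raise ValueError("image row has no binary content")
            member = f"{key}.{slot}.{row['ext']}"  # assumed naming scheme
            add_member(tf, member, row["binary"])
            images[slot] = member  # assumed: entry is the tar member name

    doc["texts"] = texts
    doc["images"] = images
    add_member(tf, f"{key}.json", json.dumps(doc).encode("utf-8"))
```

Writing into an in-memory archive, for example tarfile.open(fileobj=io.BytesIO(), mode="w"), is a convenient way to inspect the resulting member names without touching disk.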

_SUPPORTED_MODALITIES
frozenset[str] = frozenset({'metadata', 'text', 'image'})
file_extension
str = 'tar'
name
str = 'interleaved_webdataset_writer'
nemo_curator.stages.interleaved.io.writers.webdataset.InterleavedWebdatasetWriterStage._write_dataframe(
df: pandas.DataFrame,
file_path: str,
_write_kwargs: dict[str, typing.Any]
) -> None
nemo_curator.stages.interleaved.io.writers.webdataset._escape_key(
sample_id: str
) -> str

Percent-encode a sample_id so it is safe as a tar member name stem.
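
As a rough illustration, the encoding can be sketched with urllib.parse.quote and safe="", the same call the class documentation gives for the JSON member key. The function name below is hypothetical, and the real helper may differ in detail.

```python
from urllib.parse import quote, unquote


def escape_key_sketch(sample_id: str) -> str:
    """Percent-encode every reserved character, including '/'."""
    return quote(sample_id, safe="")


escaped = escape_key_sketch("crawl/page 1")
restored = unquote(escaped)  # decoding recovers the original identifier
```

Passing safe="" matters because quote leaves "/" unescaped by default, and a slash inside a tar member name would be read as a directory separator.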

nemo_curator.stages.interleaved.io.writers.webdataset._ext_from_content_type(
content_type: str | None
) -> str

Return a file extension for content_type, falling back to "bin".
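
A minimal sketch of this lookup, assuming a plain dictionary lookup with a default. The mapping below includes only the entries of _CONTENT_TYPE_TO_EXT that are fully visible on this page, and the names are hypothetical. Whether the real helper normalizes case or strips parameters such as "; charset=..." is not stated here, so the sketch does neither.

```python
from typing import Optional

# Partial mapping for illustration; the real table has further entries.
KNOWN_EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/tiff": "tiff",
}


def ext_from_content_type_sketch(content_type: Optional[str]) -> str:
    """Return the mapped extension, or "bin" for None or unknown types."""
    if content_type is None:
        return "bin"
    return KNOWN_EXTENSIONS.get(content_type, "bin")
```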

nemo_curator.stages.interleaved.io.writers.webdataset._is_null(
value: object
) -> bool

Return True for Python/pandas null-ish scalars.

nemo_curator.stages.interleaved.io.writers.webdataset._per_modality_passthrough(
passthrough_cols: list[str],
meta_keys: set[str],
image_rows: pandas.DataFrame,
text_rows: pandas.DataFrame,
content_rows: pandas.DataFrame
) -> dict[str, typing.Any]

Build per-modality passthrough lists from content rows.

For each passthrough column not already in meta_keys:

  • Pure per-image (non-null only in image rows): emitted as a list with one entry per image row (None preserved for sparse nulls).
  • Pure per-text (non-null only in text rows): emitted as a list with one entry per text row (None preserved for sparse nulls).
  • Mixed (non-null in both image and text rows): emitted as a position-aligned list with one entry per content row in position order (None where the value is absent). When read back without declaring the field in per_image_fields / per_text_fields, the reader treats the entire list as a sample-level passthrough on the metadata row.
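
The three cases above can be sketched on plain dictionaries instead of DataFrames. This is an illustrative simplification, not the library's code: the function name and row layout are hypothetical, the rows are assumed to be already sorted by position, and columns that are null in every content row are simply omitted, which the description above does not specify.

```python
def per_modality_passthrough_sketch(passthrough_cols, meta_keys, content_rows):
    """Classify each column as per-image, per-text, or mixed."""
    out = {}
    for col in passthrough_cols:
        if col in meta_keys:
            continue  # already emitted as a sample-level metadata field
        image_vals = [r.get(col) for r in content_rows if r["modality"] == "image"]
        text_vals = [r.get(col) for r in content_rows if r["modality"] == "text"]
        in_images = any(v is not None for v in image_vals)
        in_texts = any(v is not None for v in text_vals)
        if in_images and in_texts:
            # Mixed: one entry per content row, in position order.
            out[col] = [r.get(col) for r in content_rows]
        elif in_images:
            out[col] = image_vals  # one entry per image row
        elif in_texts:
            out[col] = text_vals  # one entry per text row
    return out


rows = [
    {"modality": "text", "position": 0, "lang": "en", "score": 0.5},
    {"modality": "image", "position": 1, "width": 640},
    {"modality": "image", "position": 2, "score": 0.9},
]
result = per_modality_passthrough_sketch(
    ["url", "lang", "width", "score"], {"url"}, rows
)
```

Here "lang" is pure per-text, "width" is pure per-image with a sparse None for the second image, and "score" is mixed, so it gets one entry per content row.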
nemo_curator.stages.interleaved.io.writers.webdataset._series_to_nullable_list(
series: pandas.Series
) -> list

Return a Python list from series, replacing any null-like value with None.
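
A standard-library-only sketch of the idea, combined with a simplified null check. It treats only None and float NaN as null, which is an assumption: the real helpers operate on pandas objects and presumably also cover pandas-specific nulls such as pandas.NA and NaT, which this sketch does not handle. Both function names are hypothetical.

```python
import math


def is_null_sketch(value: object) -> bool:
    """True for None and float NaN; pandas-specific nulls are not covered."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def to_nullable_list_sketch(values) -> list:
    """Copy values into a list, replacing null-like entries with None."""
    return [None if is_null_sketch(v) else v for v in values]


cleaned = to_nullable_list_sketch([1.0, float("nan"), None, "a", 0])
```

Normalizing to None before serialization matters because json.dumps would otherwise emit NaN, which is not valid JSON.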

nemo_curator.stages.interleaved.io.writers.webdataset._write_sample(
tf: tarfile.TarFile,
sample_df: pandas.DataFrame,
sample_id: str,
passthrough_cols: list[str]
) -> None

Write one sample (JSON + image binaries) into tf.

nemo_curator.stages.interleaved.io.writers.webdataset._CONTENT_TYPE_TO_EXT: dict[str, str] = {'image/jpeg': 'jpg', 'image/png': 'png', 'image/tiff': 'tiff', 'image/webp': 'w...