nemo_curator.stages.interleaved.io.readers.parquet

View as Markdown

Module Contents

Classes

NameDescription
InterleavedParquetReaderStageRead interleaved Parquet files into an InterleavedBatch.

API

class nemo_curator.stages.interleaved.io.readers.parquet.InterleavedParquetReaderStage(
read_kwargs: dict[str, typing.Any] = dict(),
schema: pyarrow.Schema | None = None,
schema_overrides: dict[str, pyarrow.DataType] | None = None,
name: str = 'interleaved_parquet_reader',
fields: tuple[str, ...] | None = None,
max_batch_bytes: int | None = None
)
Dataclass

Bases: BaseInterleavedReader

Read interleaved Parquet files into an InterleavedBatch.

fields lists extra (passthrough) column names to read beyond the reserved schema columns. Any fields entry that is absent from a given file is null-filled, consistent with how the WebDataset reader handles fields. Reserved columns are always read regardless of fields.

When max_batch_bytes is set, the combined table is split into multiple batches so that no single batch exceeds the byte limit. Each split’s source_files metadata lists only the parquet files that contributed rows to that batch.

fields
tuple[str, ...] | None = None
max_batch_bytes
int | None = None
name
str = 'interleaved_parquet_reader'
nemo_curator.stages.interleaved.io.readers.parquet.InterleavedParquetReaderStage.__post_init__() -> None
nemo_curator.stages.interleaved.io.readers.parquet.InterleavedParquetReaderStage._columns_to_read(
file_schema: pyarrow.Schema
) -> list[str] | None

Return the column list to pass to pq.read_table.

When fields is None (the default) returns None, which tells PyArrow to read all columns — non-lossy by default, consistent with the WebDataset reader.

When fields is set, returns reserved columns plus those extra columns that exist in the file; missing declared fields are null-filled after the read.

nemo_curator.stages.interleaved.io.readers.parquet.InterleavedParquetReaderStage._null_fill_missing_columns(
table: pyarrow.Table
) -> pyarrow.Table

Null-fill any reserved or extra fields columns absent from table.

Handles both reserved columns (typed from INTERLEAVED_SCHEMA) and user-requested passthrough fields (pa.null() typed, resolved later by _align_output). A single set() pass avoids duplicate schema introspection.

nemo_curator.stages.interleaved.io.readers.parquet.InterleavedParquetReaderStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.InterleavedBatch | list[nemo_curator.tasks.InterleavedBatch]