nemo_curator.stages.interleaved.io.readers.parquet
Module Contents
Classes
API
Bases: BaseInterleavedReader
Read interleaved Parquet files into an InterleavedBatch.
fields lists extra (passthrough) column names to read beyond the reserved
schema columns. Any fields entry that is absent from a given file is
null-filled, consistent with how the WebDataset reader handles fields.
Reserved columns are always read regardless of fields.
When max_batch_bytes is set, the combined table is split into multiple
batches so that no single batch exceeds the byte limit. Each split’s
source_files metadata lists only the parquet files that contributed
rows to that batch.
Return the column list to pass to pq.read_table.
When fields is None (the default), returns None, which tells
PyArrow to read all columns. The read is therefore non-lossy by default,
consistent with the WebDataset reader.
When fields is set, returns reserved columns plus those extra columns that exist in the file; missing declared fields are null-filled after the read.
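The selection rule described above can be sketched in plain Python. The names columns_to_read and RESERVED_COLUMNS are hypothetical, and the reserved column names are placeholders rather than the real contents of INTERLEAVED_SCHEMA. The sketch also assumes that reserved columns, like extra fields, are requested only when the file contains them, since absent columns are null-filled after the read.

```python
# Placeholder names; the real reserved columns come from INTERLEAVED_SCHEMA.
RESERVED_COLUMNS = ("sample_id", "position", "modality")


def columns_to_read(file_columns, fields):
    """Hypothetical helper: build the `columns` argument for pq.read_table."""
    if fields is None:
        # None asks PyArrow to read every column in the file.
        return None
    present = set(file_columns)
    selected = []
    for name in (*RESERVED_COLUMNS, *fields):
        # Request only columns that exist in the file, without duplicates.
        if name in present and name not in selected:
            selected.append(name)
    return selected


file_cols = ["sample_id", "position", "modality", "url", "license"]
read_all = columns_to_read(file_cols, None)
read_some = columns_to_read(file_cols, ["url", "caption"])
```

Here read_some omits "caption" because the file does not contain it; that column would be added as nulls after the read. It also omits "license" because it was not requested.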
Null-fill any reserved or extra fields columns absent from table.
Handles both reserved columns (typed from INTERLEAVED_SCHEMA) and user-requested passthrough fields (pa.null() typed, resolved later by _align_output). A single set() pass avoids duplicate schema introspection.