nemo_curator.stages.interleaved.utils.schema
Centralized schema utilities for interleaved IO readers and writers.
All Arrow-based readers and writers share these functions for type reconciliation and schema alignment (null-fill + reorder).
Pad, reorder, and cast the table so that it matches the target schema exactly.
Reserved INTERLEAVED_SCHEMA columns are cast with safe=False so that
explicit large↔small type overrides work (e.g. large_string→string
for Parquet compatibility). Passthrough (user-defined) columns are always
cast with safe=True so that overflow errors surface rather than silently
corrupting data (e.g. large_string→string on a >2 GB column).
Build a schema with canonical types for reserved columns and inferred types for passthrough.
Avoids unsafe downcasts (e.g. large_string→string) that cause offset overflow on large tables read via the pyarrow backend.
Return the effective schema from user-supplied schema or overrides.
Priority: schema > overrides merged on top of INTERLEAVED_SCHEMA > None.
If both schema and overrides are provided, overrides are ignored
and a warning is emitted. Returns None if both are None.