
Copyright © 2026, NVIDIA Corporation.

nemo_curator.stages.interleaved.utils.schema


Centralized schema utilities for interleaved IO readers and writers.

All Arrow-based readers and writers share these functions for type reconciliation and schema alignment (null-fill + reorder).

Module Contents

Functions

| Name | Description |
| --- | --- |
| align_table | Pad, reorder, and cast table to match target exactly. |
| reconcile_schema | Build a schema with canonical types for reserved columns and inferred types for passthrough. |
| resolve_schema | Return the effective schema from user-supplied schema or overrides. |

Data

_LARGE_COMPAT

API

nemo_curator.stages.interleaved.utils.schema.align_table(
table: pyarrow.Table,
target: pyarrow.Schema
) -> pyarrow.Table

Pad, reorder, and cast table to match target exactly.

  • Columns in target absent from table are added as null arrays.
  • Columns in table absent from target are dropped.
  • Column order matches target.
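The three rules above can be modeled without pyarrow. The following is a minimal pure-Python sketch, not the library's implementation: it represents a table as a dict of column lists, skips the casting step entirely, and uses a hypothetical helper name (`align_columns`) and hypothetical column names (`sample_id`, `text`, `extra`).

```python
from typing import Any


def align_columns(
    table: dict[str, list[Any]], target: list[str], num_rows: int
) -> dict[str, list[Any]]:
    """Pure-Python model of pad + drop + reorder (type casting is not modeled)."""
    aligned: dict[str, list[Any]] = {}
    for name in target:  # iterating over target fixes the output column order
        if name in table:
            aligned[name] = list(table[name])
        else:
            aligned[name] = [None] * num_rows  # null-fill a column missing from table
    # Columns in table but not in target are never copied, so they are dropped.
    return aligned


# Hypothetical input: "extra" is not in the target, "sample_id" is missing.
table = {"extra": [1, 2], "text": ["a", "b"]}
aligned = align_columns(table, ["sample_id", "text"], num_rows=2)
print(aligned)
```

Building the result by iterating over the target, rather than mutating the input, handles all three rules in a single pass.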

Reserved INTERLEAVED_SCHEMA columns allow safe=False casts so that explicit large↔small type overrides work (e.g. large_string→string for Parquet compatibility). Passthrough (user-defined) columns always use safe=True, so overflow errors surface instead of silently corrupting data (e.g. large_string→string on a >2 GB column).
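The ">2 GB" figure follows from the Arrow layout: `string` arrays index their character data with 32-bit signed offsets, while `large_string` uses 64-bit offsets. The sketch below models why a checked narrowing matters. It is an illustration only: the function name is hypothetical, and the two's-complement wraparound in the unchecked branch is an assumption about what silent corruption could look like, not a description of pyarrow's actual behavior.

```python
INT32_MAX = 2**31 - 1  # largest byte offset a 32-bit signed offset can hold


def narrow_offsets(offsets: list[int], safe: bool = True) -> list[int]:
    """Model narrowing 64-bit offsets to 32-bit, with or without a range check."""
    narrowed = []
    for off in offsets:
        if off > INT32_MAX:
            if safe:
                # Checked path: the problem surfaces as an error.
                raise OverflowError(f"offset {off} does not fit in 32 bits")
            # Unchecked path (assumed model): wrap like a signed 32-bit integer.
            off = (off + 2**31) % 2**32 - 2**31
        narrowed.append(off)
    return narrowed


small = narrow_offsets([0, 5, 12])             # fits, unchanged
wrapped = narrow_offsets([0, 2**31], safe=False)  # silently becomes negative
print(small, wrapped)
```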

nemo_curator.stages.interleaved.utils.schema.reconcile_schema(
inferred: pyarrow.Schema
) -> pyarrow.Schema

Build a schema with canonical types for reserved columns and inferred types for passthrough.

Avoids unsafe downcasts (e.g. large_string -> string) that cause offset overflow on large tables read via the pyarrow backend.
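One plausible reading of this behavior, based on the visible first entry of `_LARGE_COMPAT`, is sketched below in plain Python with type names as strings. Everything here is an assumption for illustration: the reserved column names and canonical types are hypothetical, only the single `(large_string, string)` pair shown in the source is modeled, and the real function may treat columns differently (for example, reserved columns missing from the inferred schema).

```python
# Hypothetical reserved columns and their canonical types.
CANONICAL = {"sample_id": "string", "position": "int32"}

# (inferred type, canonical type) -> type to keep. Only the first pair
# visible in the source is modeled; the rest of the mapping is elided there.
LARGE_COMPAT = {("large_string", "string"): "large_string"}


def reconcile(inferred: dict[str, str]) -> dict[str, str]:
    """Model: canonical types for reserved columns, inferred types otherwise."""
    result = {}
    for name, inferred_type in inferred.items():
        if name not in CANONICAL:
            result[name] = inferred_type  # passthrough column keeps its inferred type
            continue
        canonical = CANONICAL[name]
        # Keep the large variant instead of downcasting to the canonical small one.
        result[name] = LARGE_COMPAT.get((inferred_type, canonical), canonical)
    return result


reconciled = reconcile(
    {"sample_id": "large_string", "position": "int64", "url": "large_string"}
)
print(reconciled)
```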

nemo_curator.stages.interleaved.utils.schema.resolve_schema(
schema: pyarrow.Schema | None,
overrides: dict[str, pyarrow.DataType] | None
) -> pyarrow.Schema | None

Return the effective schema from user-supplied schema or overrides.

Priority: schema > overrides merged on top of INTERLEAVED_SCHEMA > None.

If both schema and overrides are provided, overrides is ignored and a warning is emitted. Returns None if both are None.

nemo_curator.stages.interleaved.utils.schema._LARGE_COMPAT: dict[tuple[DataType, DataType], DataType] = {(pa.large_string(), pa.string()): pa.large_string(), (pa.large_binary(), pa.bin...