nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess
CPU postprocess stage: parse model output, align images, build interleaved rows.
Module Contents
Classes
API
Bases: ProcessingStage[InterleavedBatch, InterleavedBatch]
CPU stage: parse raw model output and build the final interleaved schema.
Reads page images from binary_content and raw Nemotron-Parse output
from text_content, then constructs one row per element (text, image,
table, metadata) in the interleaved schema.
Floater reordering (Pictures/Captions) is applied automatically for
Nemotron-Parse v1.1 and skipped for v1.2+, based on the model_path
stored in task metadata by the inference stage.
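The version check described above could look roughly like the following sketch. The helper name needs_floater_reordering, the regular expression, and the example model paths are illustrative assumptions and not part of the documented API. The handling of paths without a recognizable version, and of versions below v1.1, is also assumed here.

```python
import re


def needs_floater_reordering(model_path: str) -> bool:
    """Hypothetical helper: decide whether floater reordering applies.

    Returns True for Nemotron-Parse v1.1 and False for v1.2 or later.
    """
    match = re.search(r"v(\d+)\.(\d+)", model_path)
    if match is None:
        # Assumption: with no recognizable version in the path, fall back
        # to the v1.1 behavior and reorder.
        return True
    version = (int(match.group(1)), int(match.group(2)))
    # Assumption: anything below v1.2 is treated like v1.1.
    return version < (1, 2)
```

Comparing (major, minor) as a tuple of integers rather than as a string keeps a version such as v1.10 on the "v1.2+" side of the check.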
Parameters
proc_size
Default model processor size (height, width). Overridden at
runtime by task._metadata["proc_size"] when available.
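A minimal sketch of the override rule, assuming the task metadata behaves like a plain dict and that the stored value is a two-element (height, width) sequence. The function name resolve_proc_size and the sizes used in the usage lines are made up for illustration.

```python
from typing import Any, Mapping, Tuple


def resolve_proc_size(
    default: Tuple[int, int], metadata: Mapping[str, Any]
) -> Tuple[int, int]:
    """Hypothetical helper: prefer metadata["proc_size"] over the default."""
    value = metadata.get("proc_size")
    if value is None:
        # No runtime value available: keep the stage's configured default.
        return default
    height, width = value  # assumed order: (height, width)
    return int(height), int(width)


# Illustrative sizes only, not the stage's real defaults.
print(resolve_proc_size((2000, 1600), {}))
print(resolve_proc_size((2000, 1600), {"proc_size": [1024, 768]}))
```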
min_crop_px
Minimum pixel dimension for image crops. Smaller crops (typically
degenerate bboxes) are filtered out.
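The filter can be pictured with the sketch below. It assumes bounding boxes are (x0, y0, x1, y1) pixel coordinates and that a crop is kept only when both its width and height reach min_crop_px. The helper name keep_crop and the threshold of 10 are hypothetical, and the stage itself may apply the rule differently.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1) in pixels


def keep_crop(bbox: BBox, min_crop_px: int) -> bool:
    """Hypothetical check: True when both sides are at least min_crop_px."""
    x0, y0, x1, y1 = bbox
    width, height = x1 - x0, y1 - y0
    # Zero-area or inverted boxes yield non-positive sizes and are dropped.
    return width >= min_crop_px and height >= min_crop_px


def filter_crops(bboxes: List[BBox], min_crop_px: int) -> List[BBox]:
    """Keep only the boxes large enough to produce a usable image crop."""
    return [b for b in bboxes if keep_crop(b, min_crop_px)]


boxes = [(0, 0, 200, 150), (10, 10, 10, 300), (5, 5, 8, 9)]
print(filter_crops(boxes, min_crop_px=10))
```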