nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess

CPU postprocess stage: parse model output, align images, build interleaved rows.

Module Contents

Classes

Name                            Description
NemotronParsePostprocessStage   CPU stage: parse raw model output and build the final interleaved schema.

API

class nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess.NemotronParsePostprocessStage(
proc_size: tuple[int, int] = (2048, 1664),
min_crop_px: int = DEFAULT_MIN_CROP_PX,
name: str = 'nemotron_parse_postprocess',
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=2.0...
)
Dataclass

Bases: ProcessingStage[InterleavedBatch, InterleavedBatch]

CPU stage: parse raw model output and build the final interleaved schema.

Reads page images from binary_content and raw Nemotron-Parse output from text_content, then constructs one row per element (text, image, table, metadata) in the interleaved schema.

Floater reordering (Pictures/Captions) is applied automatically for Nemotron-Parse v1.1 and skipped for v1.2+, based on the model_path stored in task metadata by the inference stage.
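The version check described above can be sketched as follows. This is a minimal illustration, not the stage's actual code: the helper name `needs_floater_reordering`, the regular expression, and the example model paths are all assumptions, and the real stage may derive the version from `model_path` differently.

```python
import re

# Hypothetical pattern: assumes the version appears in the path as "v<major>.<minor>".
_VERSION_RE = re.compile(r"v(\d+)\.(\d+)", re.IGNORECASE)


def needs_floater_reordering(model_path: str) -> bool:
    """Return True only when the path names Nemotron-Parse v1.1.

    Hypothetical helper: mirrors the documented rule (reorder for v1.1,
    skip for v1.2+) but is not taken from the library source.
    """
    match = _VERSION_RE.search(model_path)
    if match is None:
        # Assumption: with no recognizable version, leave the order untouched.
        return False
    version = (int(match.group(1)), int(match.group(2)))
    return version == (1, 1)


# Illustrative paths only; substitute the model_path stored in task metadata.
print(needs_floater_reordering("nvidia/NVIDIA-Nemotron-Parse-v1.1"))
print(needs_floater_reordering("nvidia/NVIDIA-Nemotron-Parse-v1.2"))
```

Keying the decision on metadata written by the inference stage means the postprocess stage needs no separate version parameter, which appears to be the intent of the description above.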

Parameters

proc_size: Default model processor size (height, width). Overridden at runtime by task._metadata["proc_size"] when available.

min_crop_px: Minimum pixel dimension for image crops. Smaller crops (typically degenerate bboxes) are filtered out.
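To make the roles of the two parameters concrete, here is a rough sketch of how a bounding box might be mapped from processor space back to the page image and then filtered by size. The helpers `scale_bbox` and `keep_crop` are hypothetical, as are the `(x0, y0, x1, y1)` box layout, the plain linear rescaling (the real pipeline may account for padding), the rule that a crop is dropped when either side is too small, and the threshold of 10 used in the example. The actual value of `DEFAULT_MIN_CROP_PX` is not shown in this page.

```python
def scale_bbox(bbox, proc_size, image_size):
    """Map (x0, y0, x1, y1) from processor pixels to page-image pixels.

    Both sizes are (height, width), matching the proc_size convention.
    Assumes a plain linear rescale with no padding offsets.
    """
    proc_h, proc_w = proc_size
    img_h, img_w = image_size
    sx, sy = img_w / proc_w, img_h / proc_h
    x0, y0, x1, y1 = bbox
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))


def keep_crop(bbox, min_crop_px):
    """Keep a crop only if both its width and height reach min_crop_px."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0) >= min_crop_px and (y1 - y0) >= min_crop_px


proc_size = (2048, 1664)   # documented default (height, width)
image_size = (1024, 832)   # hypothetical page image, half the processor size

thin = scale_bbox((100, 200, 300, 204), proc_size, image_size)
print(thin, keep_crop(thin, min_crop_px=10))   # 2 px tall after scaling: dropped
```

A near-zero height or width usually indicates a degenerate box in the model output rather than a real figure, which is why such crops are filtered instead of being emitted as image rows.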

min_crop_px
int = DEFAULT_MIN_CROP_PX
name
str = 'nemotron_parse_postprocess'
proc_size
tuple[int, int] = (2048, 1664)
resources
Resources
nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess.NemotronParsePostprocessStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess.NemotronParsePostprocessStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.pdf.nemotron_parse.postprocess.NemotronParsePostprocessStage.process(
task: nemo_curator.tasks.InterleavedBatch
) -> nemo_curator.tasks.InterleavedBatch | None