nemo_curator.stages.interleaved.pdf.nemotron_parse.inference
GPU inference stage for Nemotron-Parse.
API
Bases: ProcessingStage[InterleavedBatch, InterleavedBatch]
GPU stage: run Nemotron-Parse inference on pre-rendered page images.
Reads PNG page images from binary_content, runs model inference, and
writes raw Nemotron-Parse output into text_content.
Supports two inference backends:
"vllm" (recommended): vLLM offline mode with continuous batching. Batching is handled internally by vLLM via max_num_seqs.
"hf": HuggingFace Transformers with manual micro-batching via inference_batch_size.
Parameters
model_path
HuggingFace model ID or local path (e.g. nvidia/NVIDIA-Nemotron-Parse-v1.2).
text_in_pic
Whether to predict text inside pictures. When True, uses the
<predict_text_in_pic> prompt token; when False (default), uses
<predict_no_text_in_pic>. Only applies to Nemotron-Parse v1.2+.
task_prompt
Override the full prompt string. When set, text_in_pic is ignored.
backend
Inference backend: "vllm" or "hf".
inference_batch_size
Pages per GPU forward pass (HF backend only).
max_num_seqs
Maximum concurrent sequences (vLLM backend only).
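The manual micro-batching described for the "hf" backend can be pictured as slicing the page list into chunks of at most inference_batch_size. The following is a minimal sketch of that idea only; the helper name iter_micro_batches is hypothetical and is not part of this module.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_micro_batches(pages: Sequence[T], inference_batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `inference_batch_size` pages.

    Hypothetical helper: illustrates micro-batching, not the stage's own code.
    """
    if inference_batch_size < 1:
        raise ValueError("inference_batch_size must be >= 1")
    for start in range(0, len(pages), inference_batch_size):
        yield pages[start:start + inference_batch_size]


# Five pages with a batch size of two give two full batches and one partial batch.
batches = list(iter_micro_batches(["p0", "p1", "p2", "p3", "p4"], 2))
```

With the "vllm" backend no such loop is needed, since the text above states that vLLM batches internally up to max_num_seqs.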
Process each image individually when batch inference fails.
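A per-image fallback of this kind might look like the sketch below. The function name infer_with_fallback, the callables batch_fn and single_fn, and the choice to record an empty string for a page that still fails are all assumptions made for illustration; the source does not say what the stage writes for a page that fails individually.

```python
from typing import Callable, List, Sequence


def infer_with_fallback(
    images: Sequence[bytes],
    batch_fn: Callable[[Sequence[bytes]], List[str]],
    single_fn: Callable[[bytes], str],
    on_error_text: str = "",
) -> List[str]:
    """Try one batched call; if it raises, retry each image on its own.

    Hypothetical sketch: `on_error_text` is an assumed placeholder for pages
    that fail even when processed individually.
    """
    try:
        return list(batch_fn(images))
    except Exception:
        results: List[str] = []
        for image in images:
            try:
                results.append(single_fn(image))
            except Exception:
                results.append(on_error_text)
        return results


# Stand-in inference functions: the batch call fails if any page is b"bad".
def flaky_batch(images: Sequence[bytes]) -> List[str]:
    if b"bad" in images:
        raise RuntimeError("batch failed")
    return [image.decode() for image in images]


def single(image: bytes) -> str:
    if image == b"bad":
        raise ValueError("cannot parse page")
    return image.decode()


results = infer_with_fallback([b"a", b"bad", b"c"], flaky_batch, single)
```

The point of the pattern is that one problematic page costs only its own output rather than the output of every page in the batch.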
Tear down and reinitialize the vLLM engine (mirrors Cosmos Curate’s _reset pattern).
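As a rough illustration of a teardown-and-reinitialize pattern, the sketch below keeps a factory around so the engine can be dropped and rebuilt. EngineHolder and its factory argument are hypothetical names; the actual reset logic of this stage is not shown in the source and likely involves additional GPU cleanup.

```python
import gc
from typing import Any, Callable, List


class EngineHolder:
    """Hypothetical wrapper that can discard and rebuild an engine object."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self.engine = factory()

    def reset(self) -> None:
        # Drop the only reference held here so the old engine can be collected.
        self.engine = None
        gc.collect()
        # Assumption: a real GPU engine would also need its device memory
        # released at this point before a new engine is built.
        self.engine = self._factory()


# Stand-in factory that records every engine it builds.
built: List[object] = []


def fake_factory() -> object:
    built.append(object())
    return built[-1]


holder = EngineHolder(fake_factory)
first_engine = holder.engine
holder.reset()
```

After reset(), holder.engine is a freshly built object and the factory has been called twice.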
Initialize model once per node (serially) to avoid torch.compile race conditions.
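One common way to serialize initialization across worker processes on the same node is an exclusive file lock. The sketch below assumes a Unix host (it uses fcntl) and a hypothetical lock file path; the source does not state which mechanism this stage actually uses.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def node_init_lock(lock_path: str) -> Iterator[None]:
    """Hold an exclusive lock on `lock_path` for the duration of the block."""
    with open(lock_path, "a") as handle:
        # Blocks until every other process holding the lock has released it.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def load_model_serially(load_fn: Callable[[], T], lock_path: str) -> T:
    """Run `load_fn` while holding the node-wide lock (hypothetical helper)."""
    with node_init_lock(lock_path):
        return load_fn()


# Hypothetical lock file location; any path shared by workers on the node works.
lock_file = os.path.join(tempfile.gettempdir(), "nemotron_parse_init.lock")
model = load_model_serially(lambda: "model-loaded", lock_file)
```

Because only one process at a time can hold the lock, model loading (and any compilation it triggers) runs one worker after another rather than concurrently.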
Build the Nemotron-Parse task prompt with the appropriate text-in-pic token.
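The interaction between text_in_pic and task_prompt described under Parameters can be sketched as follows. The function signature, the base_prompt argument, and the position of the text-in-pic token at the end of the prompt are assumptions for illustration; only the two token strings and the override rule come from the text above.

```python
from typing import Optional

TEXT_IN_PIC_TOKEN = "<predict_text_in_pic>"
NO_TEXT_IN_PIC_TOKEN = "<predict_no_text_in_pic>"


def build_task_prompt(
    base_prompt: str,
    text_in_pic: bool = False,
    task_prompt: Optional[str] = None,
) -> str:
    """Return the prompt string for one page.

    Hypothetical sketch: `base_prompt` stands in for the rest of the prompt,
    and appending the token at the end is an assumed ordering.
    """
    if task_prompt is not None:
        # An explicit override wins; text_in_pic is ignored.
        return task_prompt
    token = TEXT_IN_PIC_TOKEN if text_in_pic else NO_TEXT_IN_PIC_TOKEN
    return base_prompt + token


default_prompt = build_task_prompt("<base>")
with_text = build_task_prompt("<base>", text_in_pic=True)
overridden = build_task_prompt("<base>", text_in_pic=True, task_prompt="<custom>")
```

Here "<base>" is a placeholder, not a real Nemotron-Parse prompt prefix.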