nemo_curator.stages.interleaved.pdf.nemotron_parse.composite
Composite stage that bundles the full PDF -> Nemotron-Parse -> interleaved pipeline.
Module Contents
Classes
API
Bases: CompositeStage[_EmptyTask, InterleavedBatch]
Composite reader: partition -> preprocess -> infer -> postprocess.
Decomposes into four execution stages:
- PDFPartitioningStage: read manifest, create FileGroupTasks
- PDFPreprocessStage: extract PDFs, render pages to images
- NemotronParseInferenceStage: GPU model inference
- NemotronParsePostprocessStage: parse output, align, crop
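The decomposition above can be illustrated with a minimal sketch. The `Stage`, `CompositeSketch`, and `run` names below are hypothetical stand-ins written for this example, not the NeMo Curator classes; only the four stage names are taken from the list above, and the lambdas are placeholders for the real work each stage does.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-in for an execution stage (NOT the library's class).
@dataclass
class Stage:
    name: str
    fn: Callable[[list], list]

    def process(self, items: list) -> list:
        return self.fn(items)


# Hypothetical stand-in for a composite stage: it performs no work itself
# and only expands into an ordered list of concrete stages.
@dataclass
class CompositeSketch:
    stages: list = field(default_factory=list)

    def decompose(self) -> list:
        return list(self.stages)


def run(composite: CompositeSketch, seed: list) -> list:
    """Feed the output of each decomposed stage into the next one."""
    items = seed
    for stage in composite.decompose():
        items = stage.process(items)
    return items


pipeline = CompositeSketch(
    stages=[
        # partition: ignore the empty input, emit one group of PDF names
        Stage("PDFPartitioningStage", lambda _: [["a.pdf", "b.pdf"]]),
        # preprocess: one placeholder "page" per PDF
        Stage(
            "PDFPreprocessStage",
            lambda groups: [f"{pdf}#page0" for group in groups for pdf in group],
        ),
        # infer: pair each page with a placeholder model output
        Stage(
            "NemotronParseInferenceStage",
            lambda pages: [(page, "<raw model output>") for page in pages],
        ),
        # postprocess: turn raw output into structured records
        Stage(
            "NemotronParsePostprocessStage",
            lambda outs: [{"page": page, "text": text} for page, text in outs],
        ),
    ]
)

stage_names = [stage.name for stage in pipeline.decompose()]
result = run(pipeline, seed=[])
```

The point of the sketch is the ordering: the composite only declares the sequence, and each stage consumes exactly what the previous one produced.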
Parameters
manifest_path
Path to JSONL manifest listing PDFs.
zip_base_dir
Root of CC-MAIN zip archive hierarchy.
pdf_dir
Directory containing PDF files.
jsonl_base_dir
Root directory for JSONL-based PDF datasets (e.g. GitHub PDFs).
model_path
HuggingFace model ID or local path.
backend
Inference backend: "vllm" (recommended) or "hf".
pdfs_per_task
Number of PDFs per processing task.
max_pdfs
Maximum PDFs to process (for testing).
dpi
PDF rendering resolution.
max_pages
Maximum pages to render per PDF.
inference_batch_size
Pages per GPU forward pass (HF only).
max_num_seqs
Maximum concurrent sequences (vLLM only).
text_in_pic
Whether to predict text inside pictures (v1.2+ prompt control).
min_crop_px
Minimum pixel dimension for image crops.
dataset_name
Name assigned to output tasks.
file_name_field
JSONL field containing a single PDF filename.
file_names_field
JSONL field containing a list of PDF filenames (CC-MAIN style).
url_field
JSONL field containing the source URL.
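To make the manifest-related parameters more concrete, here is a minimal sketch of how a JSONL manifest might be read and grouped into tasks. It is an illustration under stated assumptions, not the library's implementation: the helper names `iter_pdf_names` and `group_pdfs`, the default field names (`"file_name"`, `"file_names"`, `"url"`), and the rule that the list field takes precedence over the single-name field are all assumptions made for this example.

```python
import json
from itertools import islice


def iter_pdf_names(lines, file_name_field="file_name",
                   file_names_field="file_names", url_field="url"):
    """Yield (pdf_name, url) pairs from JSONL manifest lines.

    Hypothetical helper: the default field names are assumptions.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        url = record.get(url_field)  # None when the record has no URL
        if file_names_field in record:
            # CC-MAIN style: one record lists several PDFs
            for name in record[file_names_field]:
                yield name, url
        elif file_name_field in record:
            # one PDF per record
            yield record[file_name_field], url


def group_pdfs(pairs, pdfs_per_task, max_pdfs=None):
    """Cap the stream at max_pdfs, then chunk it into lists of pdfs_per_task."""
    it = iter(pairs) if max_pdfs is None else islice(pairs, max_pdfs)
    while True:
        chunk = list(islice(it, pdfs_per_task))
        if not chunk:
            return
        yield chunk


# Made-up manifest content for illustration only.
manifest_lines = [
    '{"file_name": "a.pdf", "url": "https://example.com/a.pdf"}',
    '{"file_names": ["b.pdf", "c.pdf"], "url": "https://example.com/bc"}',
    '{"file_name": "d.pdf"}',
]

pairs = list(iter_pdf_names(manifest_lines))
tasks = list(group_pdfs(pairs, pdfs_per_task=2, max_pdfs=3))
```

With `pdfs_per_task=2` and `max_pdfs=3`, the four PDFs in the sample manifest are cut down to three and split into one task of two PDFs and one task of one.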