nemo_curator.stages.interleaved.pdf.nemotron_parse.composite


Composite stage that bundles the full PDF -> Nemotron-Parse -> interleaved pipeline.

Module Contents

Classes

| Name | Description |
| --- | --- |
| NemotronParsePDFReader | Composite reader: partition -> preprocess -> infer -> postprocess. |

API

class nemo_curator.stages.interleaved.pdf.nemotron_parse.composite.NemotronParsePDFReader(
manifest_path: str | None = None,
zip_base_dir: str | None = None,
pdf_dir: str | None = None,
jsonl_base_dir: str | None = None,
model_path: str = DEFAULT_MODEL_PATH,
backend: str = 'vllm',
pdfs_per_task: int = 10,
max_pdfs: int | None = None,
dpi: int = 300,
max_pages: int = 50,
inference_batch_size: int = 4,
max_num_seqs: int = 64,
text_in_pic: bool = False,
enforce_eager: bool = False,
min_crop_px: int = 10,
dataset_name: str = 'pdf_dataset',
file_name_field: str = 'file_name',
file_names_field: str = 'cc_pdf_file_names',
url_field: str = 'url'
)
Dataclass

Bases: CompositeStage[_EmptyTask, InterleavedBatch]

Composite reader: partition -> preprocess -> infer -> postprocess.

Decomposes into four execution stages:

  1. PDFPartitioningStage: read manifest, create FileGroupTasks
  2. PDFPreprocessStage: extract PDFs, render pages to images
  3. NemotronParseInferenceStage: GPU model inference
  4. NemotronParsePostprocessStage: parse output, align, crop
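The four-stage decomposition listed above can be pictured with a minimal, self-contained sketch. The `Stage` and `CompositeReaderSketch` classes here are hypothetical stand-ins written for illustration; they are not the NeMo Curator classes, and the real `decompose()` returns `ProcessingStage` instances whose construction is not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Hypothetical stand-in for a processing stage (not a NeMo Curator class)."""

    name: str
    needs_gpu: bool = False


@dataclass
class CompositeReaderSketch:
    """Illustrative composite that expands into four ordered stages."""

    pdfs_per_task: int = 10
    dpi: int = 300
    backend: str = "vllm"

    def decompose(self) -> list[Stage]:
        # Order mirrors the documented pipeline:
        # partition -> preprocess -> infer -> postprocess.
        # Only the inference step is marked as GPU-bound in this sketch.
        return [
            Stage("partition"),
            Stage("preprocess"),
            Stage("infer", needs_gpu=True),
            Stage("postprocess"),
        ]


stages = CompositeReaderSketch().decompose()
stage_names = [stage.name for stage in stages]
gpu_stage_names = [stage.name for stage in stages if stage.needs_gpu]
print(" -> ".join(stage_names))
```

The point of the pattern is that callers configure one object, while the executor sees the expanded list and can schedule each stage separately.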

Parameters

- manifest_path: Path to JSONL manifest listing PDFs.
- zip_base_dir: Root of CC-MAIN zip archive hierarchy.
- pdf_dir: Directory containing PDF files.
- jsonl_base_dir: Root directory for JSONL-based PDF datasets (e.g. GitHub PDFs).
- model_path: HuggingFace model ID or local path.
- backend: Inference backend: "vllm" (recommended) or "hf".
- pdfs_per_task: Number of PDFs per processing task.
- max_pdfs: Maximum PDFs to process (for testing).
- dpi: PDF rendering resolution.
- max_pages: Maximum pages to render per PDF.
- inference_batch_size: Pages per GPU forward pass (HF only).
- max_num_seqs: Maximum concurrent sequences (vLLM only).
- text_in_pic: Whether to predict text inside pictures (v1.2+ prompt control).
- min_crop_px: Minimum pixel dimension for image crops.
- dataset_name: Name assigned to output tasks.
- file_name_field: JSONL field containing a single PDF filename.
- file_names_field: JSONL field containing a list of PDF filenames (CC-MAIN style).
- url_field: JSONL field containing the source URL.
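To make the interaction between `pdfs_per_task` and `max_pdfs` concrete, here is a small sketch of how a list of PDF names might be capped and then grouped into tasks. The `group_pdfs` helper is hypothetical, and the assumption that the cap is applied before grouping, in input order, is ours; the actual partitioning stage may behave differently.

```python
from __future__ import annotations


def group_pdfs(
    file_names: list[str],
    pdfs_per_task: int = 10,
    max_pdfs: int | None = None,
) -> list[list[str]]:
    """Cap the input at max_pdfs, then split it into groups of pdfs_per_task.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    # Assumption: the cap is applied first, keeping the original order.
    selected = file_names if max_pdfs is None else file_names[:max_pdfs]
    return [
        selected[start : start + pdfs_per_task]
        for start in range(0, len(selected), pdfs_per_task)
    ]


names = [f"doc_{i}.pdf" for i in range(25)]
groups = group_pdfs(names, pdfs_per_task=10, max_pdfs=23)
group_sizes = [len(group) for group in groups]
print(group_sizes)  # [10, 10, 3]
```

Under these assumptions, a smaller `pdfs_per_task` yields more, lighter tasks, while `max_pdfs` bounds the total work for a quick test run.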

backend: str = 'vllm'
dataset_name: str = 'pdf_dataset'
dpi: int = 300
enforce_eager: bool = False
file_name_field: str = 'file_name'
file_names_field: str = 'cc_pdf_file_names'
inference_batch_size: int = 4
jsonl_base_dir: str | None = None
manifest_path: str | None = None
max_num_seqs: int = 64
max_pages: int = 50
max_pdfs: int | None = None
min_crop_px: int = 10
model_path: str = DEFAULT_MODEL_PATH
pdf_dir: str | None = None
pdfs_per_task: int = 10
text_in_pic: bool = False
url_field: str = 'url'
zip_base_dir: str | None = None
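Because `inference_batch_size` is documented as HF-only and `max_num_seqs` as vLLM-only, it can help to assemble constructor arguments per backend. The sketch below uses a hypothetical `reader_kwargs` helper and a placeholder path; the defaults are copied from the attribute list above. Whether `__post_init__` enforces further constraints (for example, requiring a particular input source) is not stated on this page, so treat the resulting dictionary as a starting point.

```python
from __future__ import annotations

VALID_BACKENDS = ("vllm", "hf")


def reader_kwargs(backend: str = "vllm", **overrides: object) -> dict[str, object]:
    """Build keyword arguments using the documented defaults.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {VALID_BACKENDS}, got {backend!r}")
    kwargs: dict[str, object] = {
        "backend": backend,
        "dpi": 300,
        "max_pages": 50,
        "pdfs_per_task": 10,
    }
    # Include only the tuning knob that the chosen backend is documented to use.
    if backend == "vllm":
        kwargs["max_num_seqs"] = 64
    else:
        kwargs["inference_batch_size"] = 4
    # Caller-supplied values take precedence over the defaults.
    kwargs.update(overrides)
    return kwargs


# "/data/pdfs" is a placeholder path, not one taken from the documentation.
vllm_kwargs = reader_kwargs("vllm", pdf_dir="/data/pdfs")
hf_kwargs = reader_kwargs("hf", inference_batch_size=8)
print(sorted(vllm_kwargs))
```

With the package installed, a dictionary like this could presumably be unpacked into the constructor as `NemotronParsePDFReader(**vllm_kwargs)`, since every key matches a field in the signature above.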
nemo_curator.stages.interleaved.pdf.nemotron_parse.composite.NemotronParsePDFReader.__post_init__() -> None
nemo_curator.stages.interleaved.pdf.nemotron_parse.composite.NemotronParsePDFReader.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.interleaved.pdf.nemotron_parse.composite.NemotronParsePDFReader.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.pdf.nemotron_parse.composite.NemotronParsePDFReader.outputs() -> tuple[list[str], list[str]]