nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess

CPU preprocess stage: extract PDFs and render pages to images.

Module Contents

Classes

Name                 Description
PDFPreprocessStage   CPU stage: extract PDFs and render pages to images.

API

class nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage(
zip_base_dir: str | None = None,
pdf_dir: str | None = None,
jsonl_base_dir: str | None = None,
dpi: int = 300,
max_pages: int = 50,
name: str = 'pdf_preprocess',
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=2.0...,
_RENDER_TIMEOUT_S: int = 60
)
Dataclass

Bases: ProcessingStage[FileGroupTask, InterleavedBatch]

CPU stage: extract PDFs and render pages to images.

Each entry in the input FileGroupTask.data is a JSON string with at minimum a file_name key and optionally a url key.
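As an illustration of that input shape, the following sketch builds and decodes such entries with the standard library. Only the `file_name` and `url` key names come from the description above; the file names and the URL are made-up placeholders.

```python
import json

# Hypothetical entries; only the key names come from the stage description.
entries = [
    {"file_name": "0000001.pdf", "url": "https://example.com/report.pdf"},
    {"file_name": "0000002.pdf"},  # url is optional
]

# Each element of FileGroupTask.data is a JSON string, one per entry.
data = [json.dumps(entry) for entry in entries]

# A consumer decodes each string and reads the required key.
decoded = [json.loads(item) for item in data]
file_names = [entry["file_name"] for entry in decoded]
urls = [entry.get("url") for entry in decoded]  # None when url is absent
```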

PDF bytes are obtained in one of three ways:

  • Zip archive mode (zip_base_dir is set): PDFs are extracted from CC-MAIN-style zip archives using extract_pdf_from_zip.
  • Directory mode (pdf_dir is set): PDFs are read directly from <pdf_dir>/<file_name>.
  • JSONL mode (jsonl_base_dir is set): PDFs are decoded from base64 content fields in JSONL files (e.g. GitHub PDF datasets). Entries must include jsonl_file and either byte_offset (preferred, O(1) seek) or line_idx (legacy, O(N) scan).
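To see why byte_offset is the preferred lookup in JSONL mode, the sketch below contrasts a direct seek with a line-by-line scan. It is a standalone illustration, not the library's code: the record key `content`, the helper names, and the sample payloads are assumptions made for this example.

```python
import base64
import json
import os
import tempfile


def write_jsonl(path, payloads):
    """Write one base64 record per payload; return each line's byte offset."""
    offsets = []
    with open(path, "wb") as fh:
        for payload in payloads:
            offsets.append(fh.tell())
            record = {"content": base64.b64encode(payload).decode("ascii")}
            fh.write(json.dumps(record).encode("utf-8") + b"\n")
    return offsets


def read_by_offset(path, byte_offset):
    """O(1) lookup: jump straight to the record and decode it."""
    with open(path, "rb") as fh:
        fh.seek(byte_offset)
        record = json.loads(fh.readline())
    return base64.b64decode(record["content"])


def read_by_line_idx(path, line_idx):
    """O(N) lookup: read every line until the requested index is reached."""
    with open(path, "rb") as fh:
        for idx, line in enumerate(fh):
            if idx == line_idx:
                return base64.b64decode(json.loads(line)["content"])
    return None  # index past the end of the file


# Placeholder payloads standing in for real PDF bytes.
tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, "pdfs.jsonl")
payloads = [b"%PDF-1.4 first", b"%PDF-1.4 second", b"%PDF-1.4 third"]
offsets = write_jsonl(jsonl_path, payloads)
```

Both readers return the same bytes; the difference is that the offset path touches one line, while the index path reads every line before the target.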

Produces an InterleavedBatch with one row per page, where binary_content holds the PNG-encoded page image and text_content is empty (to be filled by the GPU inference stage).

Parameters

zip_base_dir: Root of CC-MAIN zip archive hierarchy.
pdf_dir: Directory containing loose PDF files.
jsonl_base_dir: Root directory for JSONL-based PDF datasets (e.g. GitHub PDFs).
dpi: Resolution for PDF page rendering.
max_pages: Maximum number of pages to render per PDF.
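For a sense of what the dpi setting implies, the sketch below converts a page size in PDF points (1 point = 1/72 inch) to pixel dimensions. The helper name is hypothetical, and the renderer the stage uses may round differently, so treat the numbers as approximate.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def rendered_size_px(width_pt, height_pt, dpi=300):
    """Approximate pixel size of a page rendered at the given dpi."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter_px = rendered_size_px(612, 792)  # uses the default of 300 dpi
```

At the default of 300 dpi a US Letter page comes out at roughly 2550 x 3300 pixels, so lowering dpi is the main lever for reducing image size per page.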

_RENDER_TIMEOUT_S
int = 60
dpi
int = 300
jsonl_base_dir
str | None = None
max_pages
int = 50
name
str = 'pdf_preprocess'
pdf_dir
str | None = None
resources
Resources
zip_base_dir
str | None = None
nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage._batch_fetch_jsonl(
entries: list[dict]
) -> dict[str, bytes | None]

Fetch PDF bytes for all JSONL-mode entries using one file open per JSONL.

Groups entries by jsonl_file, then calls extract_pdfs_from_jsonl_batch so each source file is opened exactly once. Entries without byte_offset fall back to the single-entry path.

Returns a dict mapping entry index (position in entries) -> pdf_bytes.
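The grouping step can be sketched as follows. This is an illustrative reconstruction rather than the method's actual body: the helper name and return shape are assumptions, and the call into extract_pdfs_from_jsonl_batch is left out because its signature is not shown here.

```python
from collections import defaultdict


def group_by_jsonl_file(entries):
    """Split entries into per-file batches and a fallback list.

    Returns (batched, fallback) where batched maps each jsonl_file to a list
    of (entry_index, byte_offset) pairs, and fallback holds the indices of
    entries that have no byte_offset.
    """
    batched = defaultdict(list)
    fallback = []
    for idx, entry in enumerate(entries):
        if "byte_offset" in entry:
            batched[entry["jsonl_file"]].append((idx, entry["byte_offset"]))
        else:
            fallback.append(idx)  # handled one at a time, e.g. via line_idx
    return dict(batched), fallback


# Hypothetical entries for illustration only.
sample_entries = [
    {"file_name": "a.pdf", "jsonl_file": "part-0.jsonl", "byte_offset": 0},
    {"file_name": "b.pdf", "jsonl_file": "part-1.jsonl", "byte_offset": 512},
    {"file_name": "c.pdf", "jsonl_file": "part-0.jsonl", "byte_offset": 2048},
    {"file_name": "d.pdf", "jsonl_file": "part-1.jsonl", "line_idx": 7},
]
batched, fallback = group_by_jsonl_file(sample_entries)
```

With this grouping, each source file needs to be opened only once for all of its offset-based entries, while the entry that carries only line_idx is routed to the slower single-entry path.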

nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage._get_pdf_bytes(
file_name: str,
entry: dict | None = None
) -> bytes | None
nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage._render_with_timeout(
pdf_bytes: bytes,
file_name: str
) -> list

Render PDF with a process-based timeout.

SIGALRM cannot be used here because Xenna runs stage workers in non-main threads and signal.signal() is restricted to the main thread. Instead we fork a child process (inheriting pdf_bytes via copy-on-write) and kill it if it exceeds the timeout, which reliably escapes any C-extension hang.
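A minimal sketch of this fork-and-kill pattern, using only the standard library, is shown below. It is not the stage's implementation: `run_with_timeout` and `fake_render` are hypothetical names, `fake_render` stands in for the real page renderer, and returning None on timeout is an assumption made for the example. The "fork" start method is only available on POSIX systems.

```python
import multiprocessing as mp
import time


def fake_render(pdf_bytes, delay_s):
    """Hypothetical stand-in for a renderer that may hang in C code."""
    time.sleep(delay_s)
    return [f"page-{i}" for i in range(len(pdf_bytes))]


def _child(conn, func, args):
    """Run func in the child and send the outcome back through the pipe."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report failures instead of dying silently
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args, timeout_s):
    """Run func(*args) in a forked child; return None on timeout or error."""
    ctx = mp.get_context("fork")  # child inherits args without pickling
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads
    try:
        # Wait on the pipe, not on join(), so large results cannot deadlock.
        if not parent_conn.poll(timeout_s):
            return None  # timed out
        try:
            status, payload = parent_conn.recv()
        except EOFError:
            return None  # child exited without sending a result
        return payload if status == "ok" else None
    finally:
        if proc.is_alive():
            proc.kill()  # SIGKILL works even if the child is stuck in C code
        proc.join()
        parent_conn.close()


fast_result = run_with_timeout(fake_render, (b"ab", 0.0), 5.0)
slow_result = run_with_timeout(fake_render, (b"ab", 30.0), 0.3)
```

Because the timeout is enforced from the parent by killing a separate process, it works from any thread and does not rely on the hung code cooperating. In the stage itself the limit would correspond to _RENDER_TIMEOUT_S (60 seconds by default).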

nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess.PDFPreprocessStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.InterleavedBatch | None