nemo_curator.stages.interleaved.pdf.nemotron_parse.preprocess
CPU preprocess stage: extract PDFs and render pages to images.
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, InterleavedBatch]
CPU stage: extract PDFs and render pages to images.
Each entry in the input FileGroupTask.data is a JSON string with at
minimum a file_name key and optionally a url key.
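As a minimal sketch of that input contract, the snippet below parses one entry and checks for the required key. The helper name parse_entry is hypothetical and not part of the module; only the file_name and url keys come from the description above.

```python
import json


def parse_entry(raw: str) -> dict:
    """Decode one JSON entry and enforce the documented minimum schema."""
    entry = json.loads(raw)
    if "file_name" not in entry:
        raise ValueError("entry is missing the required 'file_name' key")
    # 'url' is optional; normalise its absence to None for downstream code.
    entry.setdefault("url", None)
    return entry
```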
PDF bytes are obtained in one of three ways:
- Zip archive mode (zip_base_dir is set): PDFs are extracted from CC-MAIN-style zip archives using :func:extract_pdf_from_zip.
- Directory mode (pdf_dir is set): PDFs are read directly from <pdf_dir>/<file_name>.
- JSONL mode (jsonl_base_dir is set): PDFs are decoded from base64 content fields in JSONL files (e.g. GitHub PDF datasets). Entries must include jsonl_file and either byte_offset (preferred, O(1) seek) or line_idx (legacy, O(N) scan).
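The difference between the two JSONL lookups can be sketched as follows. This is an illustration only, not the module's implementation: the function names are hypothetical, and the sketch assumes each JSONL line is a JSON object whose content field holds the base64-encoded PDF.

```python
import base64
import json


def read_pdf_by_offset(jsonl_path: str, byte_offset: int) -> bytes:
    """Seek straight to the record; cost does not depend on its position."""
    with open(jsonl_path, "rb") as f:
        f.seek(byte_offset)
        record = json.loads(f.readline())
    return base64.b64decode(record["content"])


def read_pdf_by_line(jsonl_path: str, line_idx: int) -> bytes:
    """Scan from the top of the file; cost grows with line_idx."""
    with open(jsonl_path, "rb") as f:
        for i, line in enumerate(f):
            if i == line_idx:
                return base64.b64decode(json.loads(line)["content"])
    raise IndexError(f"line {line_idx} not found in {jsonl_path}")


def write_demo_jsonl(path: str, payloads: list[bytes]) -> list[int]:
    """Write one base64 record per line and return each record's byte offset."""
    offsets = []
    with open(path, "wb") as f:
        for payload in payloads:
            offsets.append(f.tell())  # offset of the line about to be written
            row = {"content": base64.b64encode(payload).decode("ascii")}
            f.write(json.dumps(row).encode("utf-8") + b"\n")
    return offsets
```

Recording the offset at write time, as write_demo_jsonl does, is what makes the seek-based path possible later; without it only the line scan is available.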
Produces an :class:InterleavedBatch with one row per page, where
binary_content holds the PNG-encoded page image and text_content
is empty (to be filled by the GPU inference stage).
Parameters
- zip_base_dir: Root of the CC-MAIN zip archive hierarchy.
- pdf_dir: Directory containing loose PDF files.
- jsonl_base_dir: Root directory for JSONL-based PDF datasets (e.g. GitHub PDFs).
- dpi: Resolution for PDF page rendering.
- max_pages: Maximum number of pages to render per PDF.
Fetch PDF bytes for all JSONL-mode entries using one file open per JSONL.
Groups entries by jsonl_file, then calls extract_pdfs_from_jsonl_batch so each source file is opened exactly once. Entries without byte_offset fall back to the single-entry path.
Returns a dict mapping entry index (position in entries) -> pdf_bytes.
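The grouping step can be sketched roughly as below. The signature of extract_pdfs_from_jsonl_batch is not shown on this page, so the sketch takes two hypothetical callables (read_batch and read_single) in its place; their argument and return shapes are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical reader shapes, assumed for this sketch only.
BatchReader = Callable[[str, list[int]], list[bytes]]  # (jsonl_file, offsets) -> PDFs
SingleReader = Callable[[dict], bytes]                 # entry -> PDF


def fetch_jsonl_pdfs(
    entries: list[dict],
    read_batch: BatchReader,
    read_single: SingleReader,
) -> dict[int, bytes]:
    """Return {entry index: pdf_bytes}, calling read_batch once per JSONL file."""
    by_file: dict[str, list[tuple[int, int]]] = defaultdict(list)
    results: dict[int, bytes] = {}

    for idx, entry in enumerate(entries):
        if entry.get("byte_offset") is not None:
            by_file[entry["jsonl_file"]].append((idx, entry["byte_offset"]))
        else:
            # No byte_offset: fall back to the single-entry path.
            results[idx] = read_single(entry)

    for jsonl_file, pairs in by_file.items():
        indices = [i for i, _ in pairs]
        offsets = [o for _, o in pairs]
        # One call (and so one file open) per source file.
        for idx, pdf in zip(indices, read_batch(jsonl_file, offsets)):
            results[idx] = pdf
    return results
```

Keying the result by position in entries lets the caller match PDFs back to their entries even though the batch reads happen in per-file order rather than input order.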
Render PDF with a process-based timeout.
SIGALRM cannot be used here because Xenna runs stage workers in non-main threads and signal.signal() is restricted to the main thread. Instead, we fork a child process (which inherits pdf_bytes via copy-on-write) and kill it if it exceeds the timeout, which reliably escapes any C-extension hang.
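A minimal sketch of the fork-and-kill pattern, using only the standard library, might look like this. It is not the stage's actual code: run_with_timeout and _child are hypothetical names, func stands in for whatever renderer the stage calls, and the "fork" start method is assumed to be available (POSIX only).

```python
import multiprocessing as mp


def _child(conn, func, payload):
    """Run func in the forked child and ship the outcome back over the pipe."""
    try:
        conn.send(("ok", func(payload)))
    except Exception as exc:
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, payload, timeout_s: float):
    """Return func(payload), or None if the child hangs or dies without replying."""
    ctx = mp.get_context("fork")  # child inherits payload via copy-on-write
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, payload))
    proc.start()
    child_conn.close()  # parent keeps only the read end
    try:
        if parent_conn.poll(timeout_s):
            try:
                status, value = parent_conn.recv()
            except EOFError:
                # Child exited (e.g. crashed) before sending anything.
                proc.join()
                return None
            proc.join()
            if status == "err":
                raise RuntimeError(value)
            return value
        # Timed out: SIGKILL works even if the child is stuck inside C code.
        proc.kill()
        proc.join()
        return None
    finally:
        parent_conn.close()
```

Because the parent only waits on a pipe and sends a kill signal, nothing here needs signal.signal(), so the helper can be called from a non-main thread.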