nemo_curator.stages.interleaved.pdf.nemotron_parse.utils
Utility functions for Nemotron-Parse PDF processing.
Provides output parsing, image canvas construction, bbox cropping, and element reordering used by the preprocess / postprocess stages.
Module Contents
Functions
Data
API
Convert a pypdfium2 bitmap to an RGB PIL image using OpenCV.
Group each Caption with its nearest Picture by bbox proximity.
Render a single PDF page; returns None on any error.
Return the render scale capped so the output fits within max_wh pixels.
Mirrors NeMo-Retriever’s _compute_render_scale_to_fit: uses the
standard fit-to-box formula min(target_w/page_w, target_h/page_h) and
clamps to a minimum of 1e-3 to avoid degenerate renders. When max_wh is
None the base_scale is returned unchanged.
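The rule described above can be sketched as follows. This is an illustration, not the module's code: the function name and argument order are hypothetical, and it assumes that "capped" means the result never exceeds base_scale.

```python
from typing import Optional, Tuple

MIN_SCALE = 1e-3  # lower clamp mentioned in the description above


def render_scale_to_fit_sketch(
    page_w: float,
    page_h: float,
    base_scale: float,
    max_wh: Optional[Tuple[int, int]],
) -> float:
    """Hypothetical stand-in for the fit-to-box scale computation."""
    if max_wh is None:
        # No size limit requested: keep the caller's scale.
        return base_scale
    target_w, target_h = max_wh
    # Largest scale at which both page dimensions still fit the box.
    fit = min(target_w / page_w, target_h / page_h)
    # Assumption: the fit scale only ever lowers base_scale, never raises it.
    return max(min(base_scale, fit), MIN_SCALE)
```

With a US Letter page (612 x 792 points) and a base scale of 2.0, the 1664x2048 box leaves the scale unchanged. A page ten times larger in each dimension is scaled down to 2048 / 7920.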
Replicate the model processor’s resize-then-center-pad to build the canvas.
This lets us crop bboxes directly in the model’s coordinate space.
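The geometry of a resize-then-center-pad step can be sketched in plain Python. The helper name is hypothetical, and the sketch assumes the resize preserves aspect ratio and that padding is split evenly on both sides. The real processor may round or pad differently.

```python
from typing import Tuple


def canvas_placement(
    image_wh: Tuple[int, int],
    target_hw: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, offset_x, offset_y) for a centered fit.

    Hypothetical helper: image_wh is (width, height) as in PIL,
    target_hw is the processor's (height, width).
    """
    img_w, img_h = image_wh
    target_h, target_w = target_hw
    # Aspect-preserving scale so the image fits inside the canvas.
    scale = min(target_w / img_w, target_h / img_h)
    new_w = min(target_w, max(1, round(img_w * scale)))
    new_h = min(target_h, max(1, round(img_h * scale)))
    # Center the resized image; any odd leftover pixel goes right/bottom.
    offset_x = (target_w - new_w) // 2
    offset_y = (target_h - new_h) // 2
    return new_w, new_h, offset_x, offset_y
```

For a 100x200 image and a (2048, 1664) target, the resized image is 1024x2048 and sits 320 pixels from the left edge of the canvas.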
Convert Nemotron-Parse page outputs into interleaved-schema rows.
Parameters:
Unique identifier for this PDF.
Source URL of the PDF.
Original PDF filename.
Rendered page images.
Raw Nemotron-Parse output per page.
Model processor’s expected (height, width).
If True, re-insert Pictures/Captions in reading order (needed for v1.1). If False, preserve raw model output order (v1.2+).
Minimum pixel dimension for image crops.
Crop a region from the padded canvas using normalized bbox coordinates.
Returns None if the crop is too small (likely a degenerate bbox).
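A minimal sketch of the bbox-to-pixel conversion and the size check, assuming bboxes are normalized to the canvas as [x1, y1, x2, y2]. The function name and the default threshold of 10 pixels are illustrative, not taken from the module.

```python
from typing import Optional, Sequence, Tuple


def bbox_to_crop_box(
    bbox: Sequence[float],
    canvas_wh: Tuple[int, int],
    min_crop_px: int = 10,  # illustrative default, not the module's value
) -> Optional[Tuple[int, int, int, int]]:
    """Map a normalized bbox to a pixel box, or None if it is too small."""
    x1, y1, x2, y2 = bbox
    width, height = canvas_wh
    # Scale to pixels and clamp to the canvas bounds.
    left = max(0, min(width, round(x1 * width)))
    top = max(0, min(height, round(y1 * height)))
    right = max(0, min(width, round(x2 * width)))
    bottom = max(0, min(height, round(y2 * height)))
    # Reject degenerate or tiny regions.
    if right - left < min_crop_px or bottom - top < min_crop_px:
        return None
    return left, top, right, bottom
```

The returned tuple has the (left, upper, right, lower) layout that PIL's Image.crop accepts.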
Extract a base64-encoded PDF from a JSONL file.
Used for GitHub-style PDF datasets where each line contains a JSON object
with a content field holding a base64-encoded PDF.
Prefers byte_offset (O(1) seek) over line_idx (O(N) linear scan).
Returns None when both are absent.
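The lookup order described above might look like the following sketch. The function name and signature are hypothetical. Only the content field, byte_offset, and line_idx come from the description.

```python
import base64
import json
from typing import Optional


def read_pdf_from_jsonl(
    path: str,
    byte_offset: Optional[int] = None,
    line_idx: Optional[int] = None,
) -> Optional[bytes]:
    """Hypothetical sketch: decode the base64 PDF stored on one JSONL line."""
    if byte_offset is None and line_idx is None:
        return None
    with open(path, "rb") as f:
        if byte_offset is not None:
            # O(1): jump straight to the start of the record.
            f.seek(byte_offset)
            line = f.readline()
        else:
            # O(N): scan line by line until the requested index.
            line = b""
            for i, raw in enumerate(f):
                if i == line_idx:
                    line = raw
                    break
    if not line.strip():
        return None  # offset past EOF or index out of range
    record = json.loads(line)
    return base64.b64decode(record["content"])
```

The byte offset must point at the first byte of a line. An offset in the middle of a record would yield invalid JSON.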
Extract a PDF file from a CC-MAIN zip archive.
Returns None if extraction fails.
Extract multiple PDFs from a JSONL file in a single file open.
Opens the file once and seeks to each byte offset in sorted order. Returns a dict mapping byte_offset -> pdf_bytes (None on error).
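A sketch of the single-open batch read, under the same assumptions about the record layout (a JSON object per line with a base64 content field). The function name and the exact set of exceptions treated as errors are illustrative.

```python
import base64
import json
from typing import Dict, Iterable, Optional


def read_pdfs_from_jsonl(
    path: str,
    byte_offsets: Iterable[int],
) -> Dict[int, Optional[bytes]]:
    """Hypothetical sketch: map each byte offset to decoded PDF bytes."""
    results: Dict[int, Optional[bytes]] = {}
    with open(path, "rb") as f:
        # Sorted offsets keep the reads moving forward through the file.
        for offset in sorted(set(byte_offsets)):
            try:
                f.seek(offset)
                record = json.loads(f.readline())
                results[offset] = base64.b64decode(record["content"])
            except (OSError, ValueError, KeyError, TypeError):
                # Bad offset, malformed JSON, missing field, or bad base64.
                results[offset] = None
    return results
```

A failure on one record is stored as None and does not abort the remaining reads.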
Serialize a PIL Image to bytes.
Insert floater elements (Pictures/Captions) next to the closest anchor.
Anchored elements keep their original model output order. Pictures and Captions are first paired, then each pair is inserted after the anchored element whose bbox center-y is closest.
This is needed for Nemotron-Parse v1.1 which emits Picture/Caption at the end of the page output rather than in reading order. v1.2+ outputs them in correct reading order so this reordering can be skipped.
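The insertion step can be illustrated with a simplified sketch. It places each floater independently after the anchor with the closest bbox center-y and leaves out the Picture/Caption pairing described above, so it is not equivalent to the module's behavior. All names are hypothetical.

```python
from typing import Dict, List

FLOATER_CLASSES = {"Picture", "Caption"}


def center_y(element: Dict) -> float:
    """Vertical center of a normalized [x1, y1, x2, y2] bbox."""
    _, y1, _, y2 = element["bbox"]
    return (y1 + y2) / 2


def reorder_floaters_sketch(elements: List[Dict]) -> List[Dict]:
    """Insert each floater after the vertically nearest anchored element."""
    anchors = [e for e in elements if e["class"] not in FLOATER_CLASSES]
    floaters = [e for e in elements if e["class"] in FLOATER_CLASSES]
    if not anchors:
        return list(elements)  # nothing to attach to; keep model order
    attached: Dict[int, List[Dict]] = {i: [] for i in range(len(anchors))}
    for floater in floaters:
        fy = center_y(floater)
        nearest = min(
            range(len(anchors)),
            key=lambda i: abs(center_y(anchors[i]) - fy),
        )
        attached[nearest].append(floater)
    ordered: List[Dict] = []
    for i, anchor in enumerate(anchors):
        ordered.append(anchor)          # anchors keep their original order
        ordered.extend(attached[i])     # floaters follow their anchor
    return ordered
```

Placing floaters independently can separate a caption from its picture when they fall nearest to different anchors, which is the case the pairing step avoids.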
Parse Nemotron-Parse raw output into structured elements.
Each element is a dict with keys class, text, and bbox
(normalized [x1, y1, x2, y2]).
Render PDF pages to PIL images using pypdfium2.
Follows the same pattern as NeMo-Retriever to avoid two pdfium pitfalls:
- Explicitly close each page/bitmap after use so the weakref finalizer never fires (avoids SIGABRT in _close_impl during GC).
- Use bitmap.to_numpy().copy() + OpenCV for BGR->RGB conversion instead of pdfium’s rev_byteorder flag, which triggers a non-thread-safe code path in CFX_AggDeviceDriver::GetDIBits().
The render scale is capped per page via _render_scale_to_fit so that
no rendered image exceeds max_size pixels (default: 1664x2048 =
Nemotron-Parse processor size). This bounds the bitmap size regardless of
how large the PDF page dimensions are, eliminating decompression-bomb
errors downstream and keeping render time predictable.
Map a CC-MAIN PDF filename to its zip archive path and member name.
The CC-MAIN-2021-31-PDF-UNTRUNCATED dataset organises PDFs into zip archives using a two-level numeric grouping:
<zip_base_dir>/0000-0999/0001.zip → contains 0001000.pdf .. 0001999.pdf
<zip_base_dir>/1000-1999/1234.zip → contains 1234000.pdf .. 1234999.pdf
Parameters:
PDF filename (e.g. "0001234.pdf").
Root directory containing the zip archive hierarchy.
Returns: tuple[str, str]
Tuple of (zip_path, member_name).
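The layout above implies a simple arithmetic mapping, sketched below. The function name is hypothetical. The sketch assumes a purely numeric filename stem and that the member name inside the archive is the bare PDF filename.

```python
import posixpath
from typing import Tuple


def cc_main_zip_location(pdf_name: str, zip_base_dir: str) -> Tuple[str, str]:
    """Hypothetical sketch of the filename -> (zip_path, member_name) mapping."""
    member_name = posixpath.basename(pdf_name)
    stem = posixpath.splitext(member_name)[0]
    pdf_num = int(stem)                      # "0001234" -> 1234
    zip_num = pdf_num // 1000                # 1000 PDFs per zip archive
    range_lo = (zip_num // 1000) * 1000      # 1000 zips per directory
    range_dir = f"{range_lo:04d}-{range_lo + 999:04d}"
    zip_path = posixpath.join(zip_base_dir, range_dir, f"{zip_num:04d}.zip")
    return zip_path, member_name
```

For "0001234.pdf" under /data this yields ("/data/0000-0999/0001.zip", "0001234.pdf"), matching the first example in the layout above.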