nemo_curator.stages.interleaved.pdf.nemotron_parse.utils

Utility functions for Nemotron-Parse PDF processing.

Provides output parsing, image canvas construction, bbox cropping, and element reordering used by the preprocess / postprocess stages.

Module Contents

Functions

Name | Description
--- | ---
_bbox_center_y | -
_bitmap_to_rgb | Convert a pypdfium2 bitmap to an RGB PIL image using OpenCV.
_pair_pictures_and_captions | Group each Caption with its nearest Picture by bbox proximity.
_render_page | Render a single PDF page; returns None on any error.
_render_scale_to_fit | Return the render scale capped so the output fits within max_wh pixels.
build_canvas | Replicate the model processor’s resize-then-center-pad to build the canvas.
build_interleaved_rows | Convert Nemotron-Parse page outputs into interleaved-schema rows.
crop_to_bbox | Crop a region from the padded canvas using normalized bbox coordinates.
extract_pdf_from_jsonl | Extract a base64-encoded PDF from a JSONL file.
extract_pdf_from_zip | Extract a PDF file from a CC-MAIN zip archive.
extract_pdfs_from_jsonl_batch | Extract multiple PDFs from a JSONL file in a single file open.
image_to_bytes | Serialize a PIL Image to bytes.
interleave_floaters | Insert floater elements (Pictures/Captions) next to the closest anchor.
parse_nemotron_output | Parse Nemotron-Parse raw output into structured elements.
render_pdf_pages | Render PDF pages to PIL images using pypdfium2.
resolve_cc_pdf_zip_path | Map a CC-MAIN PDF filename to its zip archive path and member name.

Data

DEFAULT_MAX_PAGES

DEFAULT_MIN_CROP_PX

API

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils._bbox_center_y(
bbox: list[float] | None
) -> float

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils._bitmap_to_rgb(
bitmap: typing.Any
) -> PIL.Image.Image

Convert a pypdfium2 bitmap to an RGB PIL image using OpenCV.

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils._pair_pictures_and_captions(
floaters: list[dict[str, typing.Any]]
) -> list[list[dict[str, typing.Any]]]

Group each Caption with its nearest Picture by bbox proximity.
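
The grouping can be sketched as follows. This is an illustration only, not the module's code: the proximity metric (squared distance between bbox centers), the handling of captions with no Picture, and the name pair_pictures_and_captions are all assumptions.

```python
from typing import Any


def _center(bbox: list[float]) -> tuple[float, float]:
    """Center point of a normalized [x1, y1, x2, y2] bbox."""
    return ((bbox[0] + bbox[2]) / 2.0, (bbox[1] + bbox[3]) / 2.0)


def pair_pictures_and_captions(
    floaters: list[dict[str, Any]],
) -> list[list[dict[str, Any]]]:
    """Hypothetical stand-in: attach each Caption to its nearest Picture."""
    pictures = [f for f in floaters if f["class"] == "Picture"]
    groups = [[p] for p in pictures]  # one group per Picture, in input order
    for cap in (f for f in floaters if f["class"] == "Caption"):
        if not pictures:
            # Assumption: a Caption with no Picture forms its own group.
            groups.append([cap])
            continue
        cx, cy = _center(cap["bbox"])
        nearest = min(
            range(len(pictures)),
            key=lambda i: (_center(pictures[i]["bbox"])[0] - cx) ** 2
            + (_center(pictures[i]["bbox"])[1] - cy) ** 2,
        )
        groups[nearest].append(cap)
    return groups
```

The return shape matches the documented list of groups, where each inner list holds one Picture followed by its Captions.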

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils._render_page(
doc: typing.Any,
page_num: int,
base_scale: float,
max_size: tuple[int, int] | None
) -> PIL.Image.Image | None

Render a single PDF page; returns None on any error.

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils._render_scale_to_fit(
page: typing.Any,
base_scale: float,
max_wh: tuple[int, int] | None
) -> float

Return the render scale capped so the output fits within max_wh pixels.

Mirrors NeMo-Retriever’s _compute_render_scale_to_fit: uses the standard fit-to-box formula min(target_w/page_w, target_h/page_h) and clamps to a minimum of 1e-3 to avoid degenerate renders. When max_wh is None the base_scale is returned unchanged.
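
A minimal sketch of the formula described above. It takes the page size directly rather than a pdfium page object, and it assumes the fit scale acts as an upper bound on base_scale; render_scale_to_fit and MIN_SCALE are illustrative names, not the module's.

```python
from typing import Optional

MIN_SCALE = 1e-3  # floor mentioned in the description above


def render_scale_to_fit(
    page_wh: tuple[float, float],
    base_scale: float,
    max_wh: Optional[tuple[int, int]],
) -> float:
    """Hypothetical stand-in that works on a (width, height) page size."""
    if max_wh is None:
        return base_scale  # no cap requested
    page_w, page_h = page_wh
    target_w, target_h = max_wh
    # Fit-to-box: the largest scale at which both dimensions still fit.
    fit = min(target_w / page_w, target_h / page_h)
    # Assumption: never render larger than base_scale, never below the floor.
    return max(min(base_scale, fit), MIN_SCALE)


# US Letter page (612 x 792 pt) requested at 300 dpi, capped to 1664 x 2048 px.
letter = (612.0, 792.0)
scale = render_scale_to_fit(letter, 300 / 72, (1664, 2048))
print(scale, round(letter[0] * scale), round(letter[1] * scale))
```

In this example the height is the binding dimension, so the scale drops from 300/72 to 2048/792 and the rendered page fills the full 2048 px height.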

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.build_canvas(
page_img: PIL.Image.Image,
proc_size: tuple[int, int]
) -> PIL.Image.Image

Replicate the model processor’s resize-then-center-pad to build the canvas.

This lets us crop bboxes directly in the model’s coordinate space.
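
The geometry of a resize-then-center-pad step can be sketched without any imaging library. The version below assumes an aspect-preserving resize, round-to-nearest sizing, and equal padding on both sides; the real processor may round or pad differently. canvas_layout is a hypothetical helper.

```python
def canvas_layout(
    page_wh: tuple[int, int],
    proc_size: tuple[int, int],
) -> tuple[int, int, int, int]:
    """Return (new_w, new_h, pad_left, pad_top) for a page placed on the canvas."""
    page_w, page_h = page_wh
    canvas_h, canvas_w = proc_size  # proc_size is (height, width)
    # Aspect-preserving scale so the page fits inside the canvas.
    scale = min(canvas_w / page_w, canvas_h / page_h)
    new_w = min(canvas_w, max(1, round(page_w * scale)))
    new_h = min(canvas_h, max(1, round(page_h * scale)))
    # Center the resized page; leftover pixels become padding.
    pad_left = (canvas_w - new_w) // 2
    pad_top = (canvas_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top


print(canvas_layout((1000, 2000), (2048, 1664)))
```

With these offsets, the resized page would be pasted at (pad_left, pad_top) on a blank canvas of size (width, height) = (proc_size[1], proc_size[0]).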

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.build_interleaved_rows(
sample_id: str,
url: str,
pdf_name: str,
page_images: list[PIL.Image.Image],
page_outputs: list[str],
proc_size: tuple[int, int] = (2048, 1664),
reorder_floaters: bool = True,
min_crop_px: int = DEFAULT_MIN_CROP_PX
) -> list[dict[str, typing.Any]]

Convert Nemotron-Parse page outputs into interleaved-schema rows.

Parameters:

sample_id
str

Unique identifier for this PDF.

url
str

Source URL of the PDF.

pdf_name
str

Original PDF filename.

page_images
list[Image.Image]

Rendered page images.

page_outputs
list[str]

Raw Nemotron-Parse output per page.

proc_size
tuple[int, int], defaults to (2048, 1664)

Model processor’s expected (height, width).

reorder_floaters
bool, defaults to True

If True, re-insert Pictures/Captions in reading order (needed for v1.1). If False, preserve raw model output order (v1.2+).

min_crop_px
int, defaults to DEFAULT_MIN_CROP_PX

Minimum pixel dimension for image crops.

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.crop_to_bbox(
canvas: PIL.Image.Image,
bbox: list[float] | None,
proc_size: tuple[int, int],
min_crop_px: int = DEFAULT_MIN_CROP_PX
) -> PIL.Image.Image | None

Crop a region from the padded canvas using normalized bbox coordinates.

Returns None if the crop is too small (likely a degenerate bbox).
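
The coordinate conversion behind such a crop might look like the sketch below. It assumes bbox values are fractions in [0, 1] of the canvas width and height, clamps out-of-range values, and truncates to integer pixels; bbox_to_pixel_box is a hypothetical helper, not the module's function.

```python
from typing import Optional


def bbox_to_pixel_box(
    bbox: Optional[list[float]],
    proc_size: tuple[int, int],
    min_crop_px: int = 10,
) -> Optional[tuple[int, int, int, int]]:
    """Map a normalized [x1, y1, x2, y2] bbox to (left, top, right, bottom) pixels."""
    if not bbox or len(bbox) != 4:
        return None
    canvas_h, canvas_w = proc_size  # proc_size is (height, width)
    # Clamp to the canvas so slightly out-of-range predictions still crop.
    x1, y1, x2, y2 = (min(max(v, 0.0), 1.0) for v in bbox)
    left, top = int(x1 * canvas_w), int(y1 * canvas_h)
    right, bottom = int(x2 * canvas_w), int(y2 * canvas_h)
    # Reject degenerate boxes that are too small in either dimension.
    if right - left < min_crop_px or bottom - top < min_crop_px:
        return None
    return left, top, right, bottom


print(bbox_to_pixel_box([0.25, 0.5, 0.75, 1.0], (2048, 1664)))
```

The returned tuple uses the same (left, upper, right, lower) order that PIL's Image.crop expects.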

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.extract_pdf_from_jsonl(
jsonl_file: str,
line_idx: int | None = None,
byte_offset: int | None = None
) -> bytes | None

Extract a base64-encoded PDF from a JSONL file.

Used for GitHub-style PDF datasets where each line contains a JSON object with a content field holding a base64-encoded PDF.

Prefer byte_offset (O(1) seek) over line_idx (O(N) linear scan). When both are absent, returns None.
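
To illustrate why a byte offset is cheaper than a line index, the sketch below builds an offset index once and then seeks straight to a record. Only the content field comes from the description above; the helper names and the error handling are assumptions.

```python
import base64
import json
from typing import Optional


def index_offsets(jsonl_file: str) -> list[int]:
    """One linear pass that records the starting byte of every line."""
    offsets = []
    with open(jsonl_file, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_pdf_at_offset(jsonl_file: str, byte_offset: int) -> Optional[bytes]:
    """Seek directly to one record and decode its base64 'content' field."""
    try:
        with open(jsonl_file, "rb") as f:
            f.seek(byte_offset)
            record = json.loads(f.readline())
        return base64.b64decode(record["content"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, malformed JSON or base64, or no 'content' field.
        return None
```

Storing the offsets alongside the dataset manifest means later reads never have to scan earlier lines.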

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.extract_pdf_from_zip(
file_name: str,
zip_base_dir: str
) -> bytes | None

Extract a PDF file from a CC-MAIN zip archive.

Returns None if extraction fails.
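
A minimal sketch of the return-None-on-failure pattern using the standard zipfile module. It takes an already resolved archive path and member name, and the exact set of exceptions caught is an assumption; read_zip_member is a hypothetical helper.

```python
import zipfile
from typing import Optional


def read_zip_member(zip_path: str, member_name: str) -> Optional[bytes]:
    """Read one member from a zip archive, or return None if anything fails."""
    try:
        with zipfile.ZipFile(zip_path) as zf:
            return zf.read(member_name)
    except (OSError, KeyError, zipfile.BadZipFile):
        # OSError: archive missing or unreadable; KeyError: member not present;
        # BadZipFile: the file is not a valid zip archive.
        return None
```

Callers can then skip a missing PDF without wrapping every call in their own try/except.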

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.extract_pdfs_from_jsonl_batch(
jsonl_file: str,
offsets: list[int]
) -> dict[int, bytes | None]

Extract multiple PDFs from a JSONL file in a single file open.

Opens the file once and seeks to each byte offset in sorted order. Returns a dict mapping byte_offset -> pdf_bytes (None on error).
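
The single-open, sorted-seek pattern can be sketched as below. The per-offset error handling and the name read_pdfs_at_offsets are assumptions; only the content field and the offset-to-bytes mapping come from the description above.

```python
import base64
import json
from typing import Optional


def read_pdfs_at_offsets(
    jsonl_file: str, offsets: list[int]
) -> dict[int, Optional[bytes]]:
    """Open the file once, visit offsets in ascending order, decode each record."""
    results: dict[int, Optional[bytes]] = {off: None for off in offsets}
    try:
        with open(jsonl_file, "rb") as f:
            # Ascending order keeps the reads moving forward through the file.
            for off in sorted(set(offsets)):
                try:
                    f.seek(off)
                    record = json.loads(f.readline())
                    results[off] = base64.b64decode(record["content"])
                except (OSError, ValueError, KeyError, TypeError):
                    results[off] = None  # one bad record does not stop the batch
    except OSError:
        pass  # file could not be opened: every entry stays None
    return results
```

Failures are recorded per offset, so a single corrupt line leaves the other results intact.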

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.image_to_bytes(
image: PIL.Image.Image,
fmt: str = 'PNG'
) -> bytes

Serialize a PIL Image to bytes.

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.interleave_floaters(
anchored: list[dict[str, typing.Any]],
floaters: list[dict[str, typing.Any]]
) -> list[dict[str, typing.Any]]

Insert floater elements (Pictures/Captions) next to the closest anchor.

Anchored elements keep their original model output order. Pictures and Captions are first paired, then each pair is inserted after the anchored element whose bbox center-y is closest.

This is needed for Nemotron-Parse v1.1 which emits Picture/Caption at the end of the page output rather than in reading order. v1.2+ outputs them in correct reading order so this reordering can be skipped.
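
The insertion step can be sketched as follows, assuming floaters have already been paired into groups and that a group's position is taken from its first element. Both choices, and the helper names, are illustrative assumptions.

```python
from typing import Any, Optional


def center_y(bbox: Optional[list[float]]) -> float:
    """Vertical center of a normalized [x1, y1, x2, y2] bbox (0.0 if missing)."""
    if not bbox:
        return 0.0
    return (bbox[1] + bbox[3]) / 2.0


def interleave_groups(
    anchored: list[dict[str, Any]],
    floater_groups: list[list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Insert each floater group after the anchor with the closest center-y."""
    if not anchored:
        return [el for group in floater_groups for el in group]
    inserts: dict[int, list[dict[str, Any]]] = {}
    for group in floater_groups:
        gy = center_y(group[0]["bbox"])
        idx = min(
            range(len(anchored)),
            key=lambda i: abs(center_y(anchored[i]["bbox"]) - gy),
        )
        inserts.setdefault(idx, []).extend(group)
    out: list[dict[str, Any]] = []
    for i, el in enumerate(anchored):
        out.append(el)  # anchors keep their original order
        out.extend(inserts.get(i, []))
    return out
```

For example, a Picture centered at y = 0.41 lands after a text block centered at y = 0.20 rather than one centered at y = 0.75.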

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.parse_nemotron_output(
raw_text: str
) -> list[dict[str, typing.Any]]

Parse Nemotron-Parse raw output into structured elements.

Each element is a dict with keys class, text, and bbox (normalized [x1, y1, x2, y2]).
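
As an illustration of the element shape, the sketch below types the dict and splits a page into anchored and floater elements. The sample values and the Title and Text class names are made up for the example; only the three keys and the Picture/Caption classes appear in this reference.

```python
from typing import TypedDict

# "class" is a Python keyword, so the functional TypedDict syntax is needed.
Element = TypedDict("Element", {"class": str, "text": str, "bbox": list[float]})

# Hypothetical parsed output for one page (values are invented).
elements: list[Element] = [
    {"class": "Title", "text": "Results", "bbox": [0.10, 0.05, 0.90, 0.10]},
    {"class": "Text", "text": "We observe ...", "bbox": [0.10, 0.12, 0.90, 0.40]},
    {"class": "Picture", "text": "", "bbox": [0.20, 0.45, 0.80, 0.70]},
    {"class": "Caption", "text": "Figure 1", "bbox": [0.20, 0.71, 0.80, 0.74]},
]

FLOATER_CLASSES = {"Picture", "Caption"}
anchored = [e for e in elements if e["class"] not in FLOATER_CLASSES]
floaters = [e for e in elements if e["class"] in FLOATER_CLASSES]
print(len(anchored), len(floaters))
```

The two resulting lists have the shape that interleave_floaters takes as its anchored and floaters arguments.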

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.render_pdf_pages(
pdf_bytes: bytes,
dpi: int = 300,
max_pages: int = DEFAULT_MAX_PAGES,
max_size: tuple[int, int] | None = (1664, 2048)
) -> list[PIL.Image.Image]

Render PDF pages to PIL images using pypdfium2.

Follows the same pattern as NeMo-Retriever to avoid two pdfium pitfalls:

  1. Explicitly close each page/bitmap after use so the weakref finalizer never fires (avoids SIGABRT in _close_impl during GC).
  2. Use bitmap.to_numpy().copy() + OpenCV for BGR->RGB conversion instead of pdfium’s rev_byteorder flag, which triggers a non-thread-safe code path in CFX_AggDeviceDriver::GetDIBits().

The render scale is capped per page via _render_scale_to_fit so that no rendered image exceeds max_size pixels (default: 1664x2048, width by height, matching the Nemotron-Parse processor size). This bounds the bitmap size regardless of the PDF page dimensions, which eliminates decompression-bomb errors downstream and keeps render time predictable.

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.resolve_cc_pdf_zip_path(
file_name: str,
zip_base_dir: str
) -> tuple[str, str]

Map a CC-MAIN PDF filename to its zip archive path and member name.

The CC-MAIN-2021-31-PDF-UNTRUNCATED dataset organises PDFs into zip archives using a two-level numeric grouping:

  <zip_base_dir>/0000-0999/0001.zip → contains 0001000.pdf .. 0001999.pdf
  <zip_base_dir>/1000-1999/1234.zip → contains 1234000.pdf .. 1234999.pdf

Parameters:

file_name
str

PDF filename (e.g. "0001234.pdf").

zip_base_dir
str

Root directory containing the zip archive hierarchy.

Returns: tuple[str, str]

Tuple of (zip_path, member_name).
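
The mapping implied by the layout above can be sketched as follows. It assumes seven-digit filenames whose first four digits name the archive, and that the member name is the bare filename; resolve_zip_path is a hypothetical stand-in, and posixpath is used so the example prints the same paths on every platform.

```python
import posixpath


def resolve_zip_path(file_name: str, zip_base_dir: str) -> tuple[str, str]:
    """Map e.g. '0001234.pdf' to ('<base>/0000-0999/0001.zip', '0001234.pdf')."""
    member_name = posixpath.basename(file_name)
    stem = posixpath.splitext(member_name)[0]
    zip_id = stem[:4]  # first four digits select the archive
    # Archives are grouped into directories of 1000 consecutive archive ids.
    range_start = (int(zip_id) // 1000) * 1000
    group_dir = f"{range_start:04d}-{range_start + 999:04d}"
    zip_path = posixpath.join(zip_base_dir, group_dir, f"{zip_id}.zip")
    return zip_path, member_name


print(resolve_zip_path("0001234.pdf", "/data/cc_pdfs"))
```

Here /data/cc_pdfs is a placeholder root directory chosen for the example.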

nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.DEFAULT_MAX_PAGES = 50
nemo_curator.stages.interleaved.pdf.nemotron_parse.utils.DEFAULT_MIN_CROP_PX = 10