nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning

Partitioning stage for PDF processing pipelines.

Module Contents

Classes

Name	Description
`PDFPartitioningStage`	Read a JSONL manifest and produce FileGroupTasks for downstream processing.

API

class nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage(
    manifest_path: str,
    pdfs_per_task: int = 10,
    max_pdfs: int | None = None,
    dataset_name: str = 'pdf_dataset',
    file_name_field: str = 'file_name',
    file_names_field: str = 'cc_pdf_file_names',
    url_field: str = 'url',
    name: str = 'pdf_partitioning',
    resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=0.5...
)

Dataclass

Bases: ProcessingStage[_EmptyTask, FileGroupTask]

Read a JSONL manifest and produce FileGroupTasks for downstream processing.

Each line in the JSONL file must contain at least a file_name field (the field name is configurable via file_name_field). An optional url field is preserved for provenance tracking.

For CC-MAIN-2021-31-PDF-UNTRUNCATED datasets, the manifest can also use the cc_pdf_file_names field (a list of filenames per URL entry) along with url. Each filename is expanded into an individual entry.

Example JSONL formats:

Simple: one PDF per line

{"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}

CC-MAIN: multiple PDFs per URL

{"cc_pdf_file_names": ["0001234.pdf", "0001235.pdf"], "url": "http://…"}
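Manifests in either format can be generated with the Python standard library. The following is a minimal sketch: the helper name `to_jsonl` and the sample entries are illustrative assumptions and are not part of nemo_curator.

```python
import json


def to_jsonl(entries: list[dict]) -> str:
    """Serialize entries as JSONL: one JSON object per line.

    Hypothetical helper for illustration; not part of nemo_curator.
    """
    return "\n".join(json.dumps(entry) for entry in entries)


# Simple format: one PDF per line.
simple_entries = [
    {"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"},
]

# CC-MAIN format: several PDFs that share one source URL.
cc_main_entries = [
    {"cc_pdf_file_names": ["0001234.pdf", "0001235.pdf"], "url": "http://example.com/doc"},
]

print(to_jsonl(simple_entries))
print(to_jsonl(cc_main_entries))
```

Writing the returned string to a file gives a path that could be passed as manifest_path.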

Parameters

manifest_path: Path to a JSONL file listing PDFs to process.
pdfs_per_task: Number of PDFs to pack into each FileGroupTask.
max_pdfs: If set, limit the total number of PDFs to process.
dataset_name: Name assigned to output tasks.
file_name_field: JSONL field containing a single PDF filename.
file_names_field: JSONL field containing a list of PDF filenames (CC-MAIN style).
url_field: JSONL field containing the source URL.

dataset_name

str = 'pdf_dataset'

file_name_field

str = 'file_name'

file_names_field

str = 'cc_pdf_file_names'

manifest_path

str

max_pdfs

int | None = None

name

str = 'pdf_partitioning'

pdfs_per_task

int = 10

resources

Resources

url_field

str = 'url'

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage._parse_manifest() -> list[str]

Read manifest and return list of JSON-serialized entries.
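As a rough illustration of the expansion described in the class docstring, the sketch below turns manifest lines into one JSON-serialized entry per PDF. The function name `parse_manifest_lines`, the shape of each serialized entry, the precedence when both fields are present, and the error raised for a line with neither field are all assumptions, not the behavior of the actual method.

```python
import json
from collections.abc import Iterable


def parse_manifest_lines(
    lines: Iterable[str],
    file_name_field: str = "file_name",
    file_names_field: str = "cc_pdf_file_names",
    url_field: str = "url",
) -> list[str]:
    """Return one JSON-serialized entry per PDF (illustrative sketch)."""
    entries: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        url = record.get(url_field)  # optional, kept for provenance
        if file_names_field in record:
            # CC-MAIN style: expand the list into individual entries.
            # Assumption: this field takes precedence if both are present.
            names = record[file_names_field]
        elif file_name_field in record:
            names = [record[file_name_field]]
        else:
            # Assumption: a line without any filename field is an error.
            raise ValueError(f"manifest line has no filename field: {line}")
        for name in names:
            # Assumption: each entry carries the filename and its URL.
            entries.append(json.dumps({"file_name": name, "url": url}))
    return entries


sample = [
    '{"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}',
    '{"cc_pdf_file_names": ["0001235.pdf", "0001236.pdf"], "url": "http://example.com/other"}',
]
for entry in parse_manifest_lines(sample):
    print(entry)
```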

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.process(
    _: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.FileGroupTask]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.xenna_stage_spec() -> dict[str, typing.Any]