Copyright © 2026, NVIDIA Corporation.

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning

Partitioning stage for PDF processing pipelines.

Module Contents

Classes

Name                  Description
PDFPartitioningStage  Read a JSONL manifest and produce FileGroupTasks for downstream processing.

API

class nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage(
    manifest_path: str,
    pdfs_per_task: int = 10,
    max_pdfs: int | None = None,
    dataset_name: str = 'pdf_dataset',
    file_name_field: str = 'file_name',
    file_names_field: str = 'cc_pdf_file_names',
    url_field: str = 'url',
    name: str = 'pdf_partitioning',
    resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=0.5...
)
Dataclass

Bases: ProcessingStage[_EmptyTask, FileGroupTask]

Read a JSONL manifest and produce FileGroupTasks for downstream processing.

Each line in the JSONL file must contain at least a file_name field. An optional url field is preserved for provenance tracking.

For CC-MAIN-2021-31-PDF-UNTRUNCATED datasets, the manifest can also use the cc_pdf_file_names field (a list of filenames per URL entry) along with url. Each filename is expanded into an individual entry.

Example JSONL formats:

Simple: one PDF per line

{"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}

CC-MAIN: multiple PDFs per URL

{"cc_pdf_file_names": ["0001234.pdf", "0001235.pdf"], "url": "http://…"}
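
The expansion of both manifest styles into one entry per PDF can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation. The helper name `expand_manifest_lines`, the shape of the emitted entries, the skipping of blank lines, and the precedence of `cc_pdf_file_names` over `file_name` when both are present are all assumptions.

```python
import json


def expand_manifest_lines(
    lines,
    file_name_field="file_name",
    file_names_field="cc_pdf_file_names",
    url_field="url",
):
    """Hypothetical helper: return one JSON-serialized entry per PDF."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        record = json.loads(line)
        url = record.get(url_field)  # optional, kept for provenance
        if file_names_field in record:
            # CC-MAIN style: a list of filenames for a single URL
            names = record[file_names_field]
        else:
            # Simple style: exactly one filename on the line
            names = [record[file_name_field]]
        for name in names:
            entries.append(json.dumps({file_name_field: name, url_field: url}))
    return entries


manifest = [
    '{"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}',
    '{"cc_pdf_file_names": ["0001235.pdf", "0001236.pdf"], "url": "http://example.com/a"}',
]
for entry in expand_manifest_lines(manifest):
    print(entry)
```

In this sketch each CC-MAIN filename inherits the URL of the line it came from, so the two manifest lines above yield three entries.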

Parameters

manifest_path: Path to a JSONL file listing PDFs to process.
pdfs_per_task: Number of PDFs to pack into each FileGroupTask.
max_pdfs: If set, limit the total number of PDFs to process.
dataset_name: Name assigned to output tasks.
file_name_field: JSONL field containing a single PDF filename.
file_names_field: JSONL field containing a list of PDF filenames (CC-MAIN style).
url_field: JSONL field containing the source URL.
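
How `max_pdfs` and `pdfs_per_task` might interact can be sketched as follows. The function `group_entries` is hypothetical, and it assumes that the `max_pdfs` cap is applied before grouping and that the final group may hold fewer than `pdfs_per_task` items. The source does not confirm either detail.

```python
def group_entries(entries, pdfs_per_task=10, max_pdfs=None):
    """Hypothetical helper: split entries into groups of at most pdfs_per_task."""
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    if max_pdfs is not None:
        entries = entries[:max_pdfs]  # assumption: cap applied before grouping
    return [
        entries[start:start + pdfs_per_task]
        for start in range(0, len(entries), pdfs_per_task)
    ]


names = [f"{i:07d}.pdf" for i in range(25)]
print([len(g) for g in group_entries(names)])               # [10, 10, 5]
print([len(g) for g in group_entries(names, max_pdfs=12)])  # [10, 2]
```

Under these assumptions, 25 manifest entries with the default `pdfs_per_task=10` would produce three groups, each of which would correspond to one FileGroupTask.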

dataset_name: str = 'pdf_dataset'
file_name_field: str = 'file_name'
file_names_field: str = 'cc_pdf_file_names'
manifest_path: str
max_pdfs: int | None = None
name: str = 'pdf_partitioning'
pdfs_per_task: int = 10
resources: Resources
url_field: str = 'url'

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage._parse_manifest() -> list[str]

Read manifest and return list of JSON-serialized entries.

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.process(
    _: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.FileGroupTask]

nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning.PDFPartitioningStage.xenna_stage_spec() -> dict[str, typing.Any]