nemo_curator.stages.interleaved.pdf.nemotron_parse.partitioning
Partitioning stage for PDF processing pipelines.
Bases: ProcessingStage[_EmptyTask, FileGroupTask]
Read a JSONL manifest and produce FileGroupTasks for downstream processing.
Each line in the JSONL file must contain at least a file_name field.
An optional url field is preserved for provenance tracking.
For CC-MAIN-2021-31-PDF-UNTRUNCATED datasets, the manifest can also use the
cc_pdf_file_names field (a list of filenames per URL entry) along with
url. Each filename is expanded into an individual entry.
Example JSONL formats::
    {"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}
    {"cc_pdf_file_names": ["0001234.pdf", "0001235.pdf"], "url": "http://…"}
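The expansion described above can be sketched as follows. This is a minimal illustration, not the stage's actual implementation: the helper name `expand_manifest_lines` is hypothetical, the default field names are taken from the examples above, and the choice to prefer the list field when both fields are present is an assumption.

```python
import json
from typing import Dict, Iterable, List


def expand_manifest_lines(
    lines: Iterable[str],
    file_name_field: str = "file_name",
    file_names_field: str = "cc_pdf_file_names",
    url_field: str = "url",
) -> List[Dict[str, str]]:
    """Hypothetical helper: flatten manifest lines into one entry per PDF."""
    entries: List[Dict[str, str]] = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        url = record.get(url_field)

        # Assumption: the list-style (CC-MAIN) field wins if both are present.
        if file_names_field in record:
            names = record[file_names_field]
        elif file_name_field in record:
            names = [record[file_name_field]]
        else:
            raise ValueError(f"manifest line has no file name field: {raw!r}")

        # Each filename becomes its own entry; the URL is carried along
        # for provenance when it exists.
        for name in names:
            entry = {file_name_field: name}
            if url is not None:
                entry[url_field] = url
            entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = [
        '{"file_name": "0001234.pdf", "url": "http://example.com/doc.pdf"}',
        '{"cc_pdf_file_names": ["a.pdf", "b.pdf"], "url": "http://example.com/x"}',
    ]
    for entry in expand_manifest_lines(sample):
        print(entry)
```

With this sketch, a line listing two filenames under `cc_pdf_file_names` yields two entries that share the same `url`.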
manifest_path: Path to a JSONL file listing PDFs to process.
pdfs_per_task: Number of PDFs to pack into each FileGroupTask.
max_pdfs: If set, limit the total number of PDFs to process.
dataset_name: Name assigned to output tasks.
file_name_field: JSONL field containing a single PDF filename.
file_names_field: JSONL field containing a list of PDF filenames (CC-MAIN style).
url_field: JSONL field containing the source URL.
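To illustrate how pdfs_per_task and max_pdfs might interact, here is a small sketch of the packing step. The helper name `pack_entries` is hypothetical, plain lists stand in for FileGroupTask objects, and applying the max_pdfs limit before grouping is an assumption rather than documented behavior.

```python
from typing import Dict, List, Optional, Sequence


def pack_entries(
    entries: Sequence[Dict[str, str]],
    pdfs_per_task: int,
    max_pdfs: Optional[int] = None,
) -> List[List[Dict[str, str]]]:
    """Hypothetical helper: group entries into fixed-size batches."""
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")

    # Assumption: the total limit is applied first, then entries are grouped.
    selected = list(entries) if max_pdfs is None else list(entries[:max_pdfs])

    # Consecutive slices of size pdfs_per_task; the last group may be smaller.
    return [
        selected[i : i + pdfs_per_task]
        for i in range(0, len(selected), pdfs_per_task)
    ]


if __name__ == "__main__":
    entries = [{"file_name": f"{i}.pdf"} for i in range(7)]
    groups = pack_entries(entries, pdfs_per_task=3, max_pdfs=5)
    print([len(g) for g in groups])  # prints [3, 2]
```

Under these assumptions, seven manifest entries with pdfs_per_task=3 and max_pdfs=5 produce two groups, one of three PDFs and one of two.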
Read the manifest and return a list of JSON-serialized entries.