
Copyright © 2026, NVIDIA Corporation.


Read Existing Data

Use Curator’s JsonlReader and ParquetReader to read existing datasets into a pipeline, then optionally add processing stages.

Example: Read JSONL and Filter

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.heuristic import WordCountFilter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline for processing existing JSONL files
pipeline = Pipeline(name="jsonl_data_processing")

# Read JSONL files
reader = JsonlReader(
    file_paths="/path/to/data",
    files_per_partition=4,
    fields=["text", "url"],  # Only read specific columns
)
pipeline.add_stage(reader)

# Add filtering stage: keep documents with 50 to 1000 words
word_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=1000),
    text_field="text",
)
pipeline.add_stage(word_filter)

# Add more stages to pipeline...

# Execute pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
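
To try the pipeline on a small input first, it can help to see what the read-and-filter steps amount to conceptually. The following is a minimal plain-Python sketch that uses only the standard library. It is not how Curator implements JsonlReader or WordCountFilter: the helper names read_jsonl and within_word_count, the whitespace-based word count, and the inclusive bounds are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path, fields=None):
    """Yield one dict per non-empty line, keeping only `fields` when given."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield {key: record.get(key) for key in fields} if fields else record


def within_word_count(text, min_words=50, max_words=1000):
    """Assumed rule: count whitespace-separated words, bounds inclusive."""
    return min_words <= len(text.split()) <= max_words


with tempfile.TemporaryDirectory() as tmp:
    # Write a two-record sample file (hypothetical data).
    sample = Path(tmp) / "sample.jsonl"
    rows = [
        {"text": "too short", "url": "https://example.com/a", "lang": "en"},
        {"text": " ".join(["word"] * 60), "url": "https://example.com/b", "lang": "en"},
    ]
    sample.write_text("\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8")

    # Select two columns, then keep only documents inside the word-count range.
    kept = [
        doc
        for doc in read_jsonl(sample, fields=["text", "url"])
        if within_word_count(doc["text"])
    ]

print([doc["url"] for doc in kept])  # ['https://example.com/b']
```

The first record has two words and is dropped. The second has 60 words and is kept, with the lang column already discarded by the field selection.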

Reader Configuration

Common Parameters

Both JsonlReader and ParquetReader support these configuration options:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| file_paths | str \| list[str] | File paths or glob patterns to read | Required |
| files_per_partition | int \| None | Number of files per partition. Overrides blocksize if both are provided. | None |
| blocksize | int \| str \| None | Target partition size (e.g., "128MB"). Ignored if files_per_partition is provided. | None |
| fields | list[str] \| None | Column names to read (column selection) | None (all columns) |
| read_kwargs | dict[str, Any] \| None | Extra arguments for the underlying reader | None |
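
The precedence between files_per_partition and blocksize in the table can be sketched in a few lines. This is an illustration of the documented rule only, not Curator's partitioning code. The function names choose_strategy and chunk_by_count are hypothetical.

```python
from __future__ import annotations


def choose_strategy(
    files_per_partition: int | None = None,
    blocksize: int | str | None = None,
) -> str:
    """Return which setting applies: files_per_partition wins over blocksize."""
    if files_per_partition is not None:
        return "files_per_partition"
    if blocksize is not None:
        return "blocksize"
    return "default"


def chunk_by_count(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Split paths into consecutive groups of at most files_per_partition files."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    return [
        paths[start : start + files_per_partition]
        for start in range(0, len(paths), files_per_partition)
    ]


# Ten hypothetical input files, grouped four at a time.
paths = [f"part-{index:03d}.jsonl" for index in range(10)]
groups = chunk_by_count(paths, files_per_partition=4)

print(choose_strategy(files_per_partition=4, blocksize="128MB"))  # files_per_partition
print([len(group) for group in groups])  # [4, 4, 2]
```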

Parquet-Specific Features

ParquetReader provides these optimizations:

  • PyArrow Engine: Uses the pyarrow engine by default for better performance
  • Storage Options: Supports cloud storage via storage_options in read_kwargs
  • Schema Handling: Automatic schema inference and validation
  • Columnar Efficiency: Optimized for reading specific columns
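
As a rough sketch of how storage_options might be nested inside read_kwargs, the fragment below builds the keyword arguments as a plain dictionary. The bucket path and the s3fs-style "anon" option are placeholders chosen for illustration, and the commented-out import assumes ParquetReader is exposed from the same module as JsonlReader. Check both against your installed version.

```python
# Hypothetical values: the bucket path and the storage options are
# placeholders, not settings taken from the Curator documentation.
parquet_reader_config = {
    "file_paths": "s3://my-bucket/curated/",
    "fields": ["text", "url"],  # read only the columns you need
    "read_kwargs": {
        # Forwarded to the underlying reader; "anon": False asks an
        # s3fs-style filesystem to use configured credentials.
        "storage_options": {"anon": False},
    },
}

# With NeMo Curator installed, the dictionary would be unpacked as keyword
# arguments (import path assumed, see the note above):
# from nemo_curator.stages.text.io.reader import ParquetReader
# reader = ParquetReader(**parquet_reader_config)
```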

Performance Tips

  • Use the fields parameter to read only the required columns for better performance
  • Set files_per_partition based on your cluster size and memory constraints
  • Use blocksize for fine-grained control over partition sizes
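
To get a feel for how a blocksize target translates into partitions, the sketch below greedily groups files by size. It is an assumption-laden illustration, not Curator's algorithm: the helpers parse_size and group_by_size are hypothetical, "MB" is treated as 10^6 bytes, and the file sizes are made up.

```python
import re

# Assumed unit table: decimal units for KB/MB/GB, binary for KiB/MiB/GiB.
_UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}


def parse_size(size):
    """Convert an int or a string such as "128MB" into a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if not match or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2).upper()])


def group_by_size(file_sizes, blocksize):
    """Greedily pack files, in order, into groups of at most blocksize bytes.

    A single file larger than the target still gets a group of its own.
    """
    target = parse_size(blocksize)
    groups, current, current_bytes = [], [], 0
    for path, size in file_sizes.items():
        if current and current_bytes + size > target:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Hypothetical file sizes in bytes.
sizes = {
    "a.parquet": 60_000_000,
    "b.parquet": 50_000_000,
    "c.parquet": 90_000_000,
    "d.parquet": 200_000_000,
}
print(group_by_size(sizes, "128MB"))
# [['a.parquet', 'b.parquet'], ['c.parquet'], ['d.parquet']]
```

Running a calculation like this against your own file listing gives a quick estimate of how many partitions a given blocksize would produce before you start a cluster job.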

Output Integration

Both readers produce DocumentBatch tasks that integrate seamlessly with:

  • Processing Stages: Apply filters, transformations, and quality checks
  • Writer Stages: Export to JSONL, Parquet, or other formats
  • Analysis Tools: Convert to Pandas/PyArrow for inspection and debugging