Read Existing Data

Use Curator’s JsonlReader and ParquetReader to read existing datasets into a pipeline, then optionally add processing stages.

Example: Read JSONL and Filter

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import WordCountFilter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline for processing existing JSONL files
pipeline = Pipeline(name="jsonl_data_processing")

# Read JSONL files
reader = JsonlReader(
    file_paths="/path/to/data",
    files_per_partition=4,
    fields=["text", "url"],  # Only read specific columns
)
pipeline.add_stage(reader)

# Add filtering stage
word_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=1000),
    text_field="text",
)
pipeline.add_stage(word_filter)

# Add more stages to pipeline...

# Execute pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()
```

Reader Configuration

Common Parameters

Both JsonlReader and ParquetReader support these configuration options:

| Parameter | Type | Description | Default |
|---|---|---|---|
| `file_paths` | `str \| list[str]` | File paths or glob patterns to read | Required |
| `files_per_partition` | `int \| None` | Number of files per partition; overrides `blocksize` if both are provided | `None` |
| `blocksize` | `int \| str \| None` | Target partition size (e.g., `"128MB"`); ignored if `files_per_partition` is provided | `None` |
| `fields` | `list[str] \| None` | Column names to read (column selection) | `None` (all columns) |
| `read_kwargs` | `dict[str, Any] \| None` | Extra arguments for the underlying reader | `None` |
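To make the precedence between `files_per_partition` and `blocksize` concrete, here is a small illustrative sketch of partition planning. This mirrors the documented behavior (`files_per_partition` wins when both are set), but it is not Curator's actual scheduler:

```python
def plan_partitions(files, files_per_partition=None, blocksize_bytes=None, file_sizes=None):
    """Illustrative partition planning: files_per_partition overrides blocksize."""
    if files_per_partition is not None:
        # Fixed number of files per partition
        return [
            files[i : i + files_per_partition]
            for i in range(0, len(files), files_per_partition)
        ]
    if blocksize_bytes is not None and file_sizes is not None:
        # Greedily pack files until the target block size is reached
        partitions, current, current_size = [], [], 0
        for f, size in zip(files, file_sizes):
            if current and current_size + size > blocksize_bytes:
                partitions.append(current)
                current, current_size = [], 0
            current.append(f)
            current_size += size
        if current:
            partitions.append(current)
        return partitions
    return [files]  # neither knob set: a single partition

files = [f"shard_{i}.jsonl" for i in range(10)]
print(plan_partitions(files, files_per_partition=4))  # partitions of 4, 4, and 2 files
```

With ten input files and `files_per_partition=4`, the reader would produce three partitions; the last one is smaller.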

Parquet-Specific Features

ParquetReader provides these optimizations:

  • PyArrow Engine: Uses pyarrow engine by default for better performance
  • Storage Options: Supports cloud storage via storage_options in read_kwargs
  • Schema Handling: Automatic schema inference and validation
  • Columnar Efficiency: Optimized for reading specific columns
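For cloud storage, credentials are passed through `read_kwargs` as `storage_options`, which the reader forwards to the underlying pyarrow/fsspec layer. The option names below follow fsspec's S3 conventions and are shown as an assumption, not an exhaustive list:

```python
# Forwarded by ParquetReader to the underlying reader;
# "anon" is an fsspec S3 filesystem option (anonymous vs. credentialed access).
read_kwargs = {
    "storage_options": {
        "anon": False,  # use configured credentials rather than anonymous access
    }
}
```

You would then construct the reader with `ParquetReader(file_paths="s3://...", read_kwargs=read_kwargs)`, keeping credentials out of the pipeline definition itself.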

Performance Tips

  • Use the fields parameter to read only the columns you need for better performance
  • Set files_per_partition based on your cluster size and memory constraints
  • Use blocksize for fine-grained control over partition sizes
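When sizing partitions, it helps to translate a `blocksize` string into bytes and estimate the resulting partition count. The helper below is an illustrative parser, not Curator's internal one:

```python
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}

def blocksize_to_bytes(blocksize):
    """Parse a size like '128MB' (or a plain int of bytes) into bytes."""
    if isinstance(blocksize, int):
        return blocksize
    for suffix, factor in _UNITS.items():
        if blocksize.upper().endswith(suffix):
            return int(float(blocksize[: -len(suffix)]) * factor)
    return int(blocksize)  # bare number of bytes as a string

# A 2 GB dataset with a 128MB target yields roughly 16 partitions
n_partitions = (2 * 1024**3) // blocksize_to_bytes("128MB")
print(n_partitions)  # 16
```

Rules of thumb like this help keep partitions large enough to amortize task overhead but small enough to fit in worker memory.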

Output Integration

Both readers produce DocumentBatch tasks that integrate seamlessly with:

  • Processing Stages: Apply filters, transformations, and quality checks
  • Writer Stages: Export to JSONL, Parquet, or other formats
  • Analysis Tools: Convert to Pandas/PyArrow for inspection and debugging
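As a minimal stdlib sketch of what the exported JSONL records look like (Curator ships dedicated writer stages for this; the snippet below only models the output format):

```python
import io
import json

# Toy stand-in for a filtered batch of documents
docs = [
    {"text": "A long enough document that survived filtering...", "url": "https://example.com/a"},
    {"text": "Another curated record", "url": "https://example.com/b"},
]

buf = io.StringIO()
for doc in docs:
    buf.write(json.dumps(doc) + "\n")  # JSONL: one JSON object per line

print(buf.getvalue())
```

Each line round-trips through `json.loads`, which is what makes JSONL convenient for both downstream training pipelines and quick inspection with pandas or PyArrow.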