Read Existing Data
Use Curator’s JsonlReader and ParquetReader to read existing datasets into a pipeline, then optionally add processing stages.
Example: Read JSONL and Filter
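In Curator, JsonlReader produces DocumentBatch tasks that downstream filter stages consume. The read-then-filter flow can be sketched in plain Python (the file path, field names, and length threshold here are illustrative, not Curator's API):

```python
import json
import tempfile

# Write a small JSONL file (one JSON object per line) to read back.
records = [
    {"id": 1, "text": "a short doc"},
    {"id": 2, "text": "a much longer document that passes the length filter"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# Read the JSONL back and keep only documents above a minimum length,
# mimicking a reader stage followed by a filter stage.
with open(path) as f:
    docs = [json.loads(line) for line in f]
kept = [d for d in docs if len(d["text"]) >= 20]
print([d["id"] for d in kept])  # ids of documents that survive the filter
```

In a real pipeline, the filtering would be expressed as a processing stage added after the reader rather than an inline list comprehension.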
Reader Configuration
Common Parameters
Both JsonlReader and ParquetReader support these configuration options:
- `files_per_partition`: number of input files grouped into each partition
- `blocksize`: target partition size, for fine-grained control over partitioning
- `fields`: subset of columns to read
- `read_kwargs`: extra keyword arguments passed to the underlying read call (for example, `storage_options` for cloud storage)
Parquet-Specific Features
ParquetReader provides these optimizations:
- PyArrow Engine: Uses the `pyarrow` engine by default for better performance
- Storage Options: Supports cloud storage via `storage_options` in `read_kwargs`
- Schema Handling: Automatic schema inference and validation
- Columnar Efficiency: Optimized for reading specific columns
Performance Tips
- Use the `fields` parameter to read only required columns for better performance
- Set `files_per_partition` based on your cluster size and memory constraints
- Use `blocksize` for fine-grained control over partition sizes
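The effect of `files_per_partition` can be pictured as simple file grouping: smaller values produce more, smaller partitions (and more parallelism), larger values produce fewer, bigger ones. The grouping helper below is an illustration of the idea, not Curator's implementation:

```python
def group_files(files, files_per_partition):
    """Split a file list into partitions of at most files_per_partition files."""
    return [
        files[i : i + files_per_partition]
        for i in range(0, len(files), files_per_partition)
    ]

files = [f"shard_{i}.jsonl" for i in range(10)]
# files_per_partition=4 -> 3 partitions (4 + 4 + 2 files); lowering it
# increases partition count and parallelism at the cost of per-partition
# scheduling overhead.
partitions = group_files(files, 4)
print(len(partitions))  # 3
```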
Output Integration
Both readers produce DocumentBatch tasks that integrate seamlessly with:
- Processing Stages: Apply filters, transformations, and quality checks
- Writer Stages: Export to JSONL, Parquet, or other formats
- Analysis Tools: Convert to Pandas/PyArrow for inspection and debugging
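The inspection step in the last bullet amounts to pivoting row-oriented documents into columns, which is what constructing a pandas DataFrame or a PyArrow Table from a batch does. A stdlib sketch of that pivot (the batch contents are hypothetical):

```python
# A batch of documents as a reader stage might produce them (row-oriented).
batch = [
    {"id": 1, "text": "first"},
    {"id": 2, "text": "second"},
]

# Pivot rows into columns -- the same shape that pandas.DataFrame(batch)
# or pyarrow.Table.from_pylist(batch) would expose for inspection.
columns = {key: [doc[key] for doc in batch] for key in batch[0]}
print(columns["id"])  # [1, 2]
```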