Use Curator’s JsonlReader and ParquetReader to read existing datasets into a pipeline, then optionally add processing stages.
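As a rough illustration of what the `files_per_partition` and `fields` options described in this section control, the sketch below groups JSONL files into partitions and keeps only selected columns. It uses only the Python standard library and is not Curator's implementation; the helper names `partition_files`, `read_partition`, and `demo` are hypothetical, and the actual readers may group files and select columns differently.

```python
import json
import tempfile
from pathlib import Path


def partition_files(paths, files_per_partition):
    """Group a sorted list of file paths into partitions of a fixed size."""
    paths = sorted(paths)
    return [
        paths[i:i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]


def read_partition(paths, fields=None):
    """Read JSONL files into a list of records, keeping only `fields` if given."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                if fields is not None:
                    # Assumption: missing keys become None rather than raising.
                    row = {key: row.get(key) for key in fields}
                records.append(row)
    return records


def demo():
    """Write five one-line JSONL files, then read them two files at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(5):
            row = {"id": i, "text": f"document {i}", "url": f"https://example.com/{i}"}
            path = Path(tmp) / f"part_{i}.jsonl"
            path.write_text(json.dumps(row) + "\n", encoding="utf-8")
        files = [str(p) for p in Path(tmp).glob("*.jsonl")]
        partitions = partition_files(files, files_per_partition=2)
        return [read_partition(p, fields=["id", "text"]) for p in partitions]


if __name__ == "__main__":
    for batch in demo():
        print(len(batch), batch)
```

In this sketch, a smaller `files_per_partition` yields more, smaller batches, and passing `fields` drops unneeded columns as each row is read. Both are the kind of trade-off the tuning options in this section are meant to expose.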
Both JsonlReader and ParquetReader support these configuration options:
ParquetReader provides these optimizations:
- Uses the pyarrow engine by default for better performance
- Supports storage_options in read_kwargs

To tune performance:

- Use the fields parameter to read only the required columns
- Set files_per_partition based on your cluster size and memory constraints
- Use blocksize for fine-grained control over partition sizes

Both readers produce DocumentBatch tasks that integrate seamlessly with: