This guide covers the core concepts for loading and managing text data from local files in NVIDIA NeMo Curator.
NeMo Curator uses a pipeline-based architecture for handling large-scale text data processing. Data flows through processing stages that transform tasks, enabling distributed processing of local files.
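To make the stage-and-task idea concrete, here is a minimal sketch of data flowing through stages in plain Python. It is not NeMo Curator code: `Task`, `run_pipeline`, `strip_whitespace`, and `drop_empty` are hypothetical names introduced only for illustration, and the sketch runs in a single process rather than distributing work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical unit of work: a batch of documents."""
    documents: List[Dict[str, str]]


# A stage is modeled here as any function that maps one task to another.
Stage = Callable[[Task], Task]


def run_pipeline(task: Task, stages: List[Stage]) -> Task:
    """Pass the task through each stage in order."""
    for stage in stages:
        task = stage(task)
    return task


def strip_whitespace(task: Task) -> Task:
    # Trim leading and trailing whitespace from every document's text.
    return Task([{**doc, "text": doc["text"].strip()} for doc in task.documents])


def drop_empty(task: Task) -> Task:
    # Remove documents whose text is empty.
    return Task([doc for doc in task.documents if doc["text"]])
```

Calling `run_pipeline(Task([...]), [strip_whitespace, drop_empty])` applies the two stages in sequence; a real pipeline would schedule such stages across workers.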
The system provides two primary readers for text data: JsonlReader for JSON Lines files and ParquetReader for Parquet files.
Both readers support optimization through:
Use blocksize or files_per_partition to optimize DocumentBatch sizes during distributed processing.
Partitioning Strategy: Specify either files_per_partition or blocksize. If files_per_partition is provided, blocksize is ignored.
Use the fields parameter to read only the required columns.
The default dtype_backend="pyarrow" setting gives optimal performance and memory efficiency. If you encounter compatibility issues with certain data types or schemas, you can override these defaults through read_kwargs.
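The options above can be sketched as follows. This is an illustration of the stated rules, not the library's implementation: `ReaderOptions`, `plan_partitions`, and `DEFAULT_READ_KWARGS` are hypothetical names, blocksize is assumed to be a plain byte count, and the sketch raises an error when neither partitioning option is set, which may differ from the readers' actual defaults.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# Default named in the text above; the rest of this sketch is illustrative.
DEFAULT_READ_KWARGS: Dict[str, Any] = {"dtype_backend": "pyarrow"}


@dataclass
class ReaderOptions:
    """Hypothetical container mirroring the reader parameters described above."""
    files_per_partition: Optional[int] = None
    blocksize: Optional[int] = None  # assumed to be bytes in this sketch
    fields: Optional[List[str]] = None
    read_kwargs: Dict[str, Any] = field(default_factory=dict)

    def effective_read_kwargs(self) -> Dict[str, Any]:
        # Caller-supplied read_kwargs override the defaults key by key.
        return {**DEFAULT_READ_KWARGS, **self.read_kwargs}

    def project(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Keep only the requested columns; keep everything if fields is unset.
        if self.fields is None:
            return dict(record)
        return {key: record[key] for key in self.fields if key in record}


def plan_partitions(files: List[Tuple[str, int]], options: ReaderOptions) -> List[List[str]]:
    """Group (path, size_in_bytes) pairs into partitions of file paths."""
    if options.files_per_partition is not None:
        # files_per_partition takes precedence; blocksize is ignored.
        n = options.files_per_partition
        paths = [path for path, _ in files]
        return [paths[i:i + n] for i in range(0, len(paths), n)]
    if options.blocksize is None:
        raise ValueError("set files_per_partition or blocksize")
    partitions: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Start a new partition when this file would exceed the byte budget.
        if current and current_size + size > options.blocksize:
            partitions.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        partitions.append(current)
    return partitions
```

In this sketch, passing `read_kwargs={"dtype_backend": "numpy_nullable"}` replaces the default backend (`"numpy_nullable"` is the other value pandas accepts for `dtype_backend`), assuming the reader forwards these keyword arguments to pandas.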
NeMo Curator provides flexible export options for processed datasets.
You cannot combine different reader types (JsonlReader and ParquetReader) in the same pipeline stage. To read several file types in one stage, create a new CustomReader from the underlying BaseReader that chooses how to read each file based on its extension.
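The extension-based dispatch such a custom reader would need can be sketched independently of the library. The code below does not subclass BaseReader, whose interface is not described here; `load_jsonl`, `load_parquet`, `LOADERS`, and `read_mixed` are hypothetical names, and the Parquet branch assumes pandas with a Parquet engine is installed.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def load_jsonl(path: Path) -> List[Record]:
    # Parse one JSON object per non-blank line.
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_parquet(path: Path) -> List[Record]:
    # Imported lazily so JSONL-only use does not require pandas.
    import pandas as pd
    return pd.read_parquet(path).to_dict(orient="records")


# Map each supported file extension to the function that reads it.
LOADERS: Dict[str, Callable[[Path], List[Record]]] = {
    ".jsonl": load_jsonl,
    ".parquet": load_parquet,
}


def read_mixed(paths: List[str]) -> List[Record]:
    """Read every file with the loader registered for its extension."""
    records: List[Record] = []
    for raw in paths:
        path = Path(raw)
        loader = LOADERS.get(path.suffix.lower())
        if loader is None:
            raise ValueError(f"unsupported file extension: {path.suffix!r}")
        records.extend(loader(path))
    return records
```

A real CustomReader would wrap this selection logic in whatever methods BaseReader requires, so the stage still emits a single, uniform batch type.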
This page focuses on loading text data from local files using JsonlReader and ParquetReader. Both readers support remote storage locations (Amazon S3, Azure) when you provide remote file paths.
For downloading and processing data from remote sources such as ArXiv, Common Crawl, and Wikipedia, refer to the Data Acquisition Concepts page.
The data acquisition process produces standardized output that integrates seamlessly with the pipeline-based loading concepts described on this page.