Data Export Concepts (Image)
This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.
Key Topics
- Saving curated images and metadata
- Understanding output format structure
- Configuring output sharding
- Preparing data for downstream training or analysis
Saving Results
After the pipeline has processed your images, you can save the curated images and their metadata with the ImageWriterStage.
Example:
Key Parameters:
- output_dir: Directory where tar archives and metadata files are written
- images_per_tar: Number of images per tar file for optimal sharding
- remove_image_data: Whether to remove image data from memory after writing
- deterministic_name: Ensures reproducible file naming based on input content
Behavior:
- The writer stage creates tar files with curated images
- Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
- Adjust images_per_tar to balance I/O, parallelism, and storage efficiency
- Smaller values create more files but enable better parallelism
- Larger values reduce file count but may impact loading performance
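To make the sharding trade-off concrete, the following is a minimal standard-library sketch of how an images_per_tar setting determines the number of tar files. It does not use or reproduce the ImageWriterStage implementation: the write_shards helper, the shard-NNNNNN.tar naming, and the placeholder image bytes are all assumptions made for illustration, and the Parquet metadata files are omitted.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_shards(images, output_dir, images_per_tar):
    """Write (name, bytes) pairs into tar shards of at most images_per_tar members.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(images), images_per_tar):
        chunk = images[start:start + images_per_tar]
        # Assumed naming scheme; the real writer's file names may differ.
        shard_path = output_dir / f"shard-{len(shard_paths):06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for name, payload in chunk:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(shard_path)
    return shard_paths


# Placeholder bytes stand in for encoded JPEG data.
images = [(f"{i:06d}.jpg", b"fake-jpeg-bytes") for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    small = write_shards(images, Path(tmp) / "small", images_per_tar=3)
    large = write_shards(images, Path(tmp) / "large", images_per_tar=8)
    # Ten images: 3 per tar gives 4 shards, 8 per tar gives 2 shards.
    shard_counts = (len(small), len(large))

print(shard_counts)
```

With the same ten inputs, the smaller setting yields twice as many files, which is the kind of difference to weigh against reader parallelism and per-file overhead in your storage system.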
Output Format
The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:
Output Structure:
Format Details:
- Tar contents: JPEG images with sequential or ID-based filenames
- Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
- Naming: Deterministic or random naming based on configuration
- Sharding: Configurable number of images per tar file for optimal performance
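The difference between deterministic and random naming can be sketched as follows. This is an illustrative assumption, not the scheme ImageWriterStage actually uses: the shard_name function, the images- prefix, and the choice to hash sorted image IDs with SHA-256 are all hypothetical.

```python
import hashlib
import uuid


def shard_name(image_ids, deterministic_name=True):
    """Return a tar file name for a group of image IDs.

    Hypothetical helper: the hashing inputs and name layout are assumptions.
    """
    if deterministic_name:
        # Sorting makes the name independent of the order the IDs arrive in.
        joined = "\n".join(sorted(image_ids)).encode("utf-8")
        token = hashlib.sha256(joined).hexdigest()[:16]
    else:
        # A random token changes on every call, even for identical input.
        token = uuid.uuid4().hex[:16]
    return f"images-{token}.tar"


same_a = shard_name(["img_2", "img_1"])
same_b = shard_name(["img_1", "img_2"])
random_a = shard_name(["img_1", "img_2"], deterministic_name=False)
random_b = shard_name(["img_1", "img_2"], deterministic_name=False)

print(same_a == same_b)      # deterministic names match across runs
print(random_a == random_b)  # random names do not
```

Deterministic names make reruns on the same input overwrite or match earlier output, which simplifies comparing and resuming exports. Random names avoid collisions when unrelated runs share an output directory.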
Preparing for Downstream Use
- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
- Test loading the exported dataset before large-scale training.
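As one way to test loading before large-scale training, the sketch below opens a tar shard, confirms that each member begins with the JPEG start-of-image marker, and compares member names against an expected list. The check_shard helper and the demo file names are hypothetical; in practice the expected names would come from the Parquet metadata, which is not read here in order to keep the example standard-library only.

```python
import io
import tarfile
import tempfile
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # every JPEG stream begins with this start-of-image marker


def check_shard(tar_path, expected_names):
    """Report missing, unexpected, and non-JPEG members of one tar shard."""
    seen = set()
    not_jpeg = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            seen.add(member.name)
            # Read only the first two bytes to check the JPEG signature.
            if tar.extractfile(member).read(2) != JPEG_SOI:
                not_jpeg.append(member.name)
    expected = set(expected_names)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
        "not_jpeg": sorted(not_jpeg),
    }


def _add(tar, name, payload):
    """Add an in-memory payload to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a tiny demo shard: one plausible JPEG header, one corrupt entry.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "demo.tar"
    with tarfile.open(shard, "w") as tar:
        _add(tar, "a.jpg", JPEG_SOI + b"rest-of-image")
        _add(tar, "b.jpg", b"not an image")
    report = check_shard(shard, expected_names=["a.jpg", "b.jpg", "c.jpg"])

print(report)
```

A signature check like this only catches gross corruption. Fully decoding a sample of images with the same loader your training pipeline uses is a stronger test.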