Data Export Concepts (Image)
This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.
Key Topics
Saving curated images and metadata
Understanding output format structure
Configuring output sharding
Preparing data for downstream training or analysis
Saving Results
After the pipeline has processed your images, you can save the curated images and their metadata using the ImageWriterStage.
Example:
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to the pipeline.
# `pipeline` is assumed to be a pipeline object created earlier.
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,  # Remove image data from memory after writing
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
Key Parameters:
output_dir: Directory where tar archives and metadata files are written
images_per_tar: Number of images per tar file for optimal sharding
remove_image_data: Whether to remove image data from memory after writing
deterministic_name: Ensures reproducible file naming based on input content
Behavior:
The writer stage creates tar files with curated images
Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
Adjust images_per_tar to balance I/O, parallelism, and storage efficiency:
Smaller values create more files but enable better parallelism
Larger values reduce file count but may impact loading performance
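To get a rough feel for how images_per_tar affects file count, the following sketch estimates the number of tar shards for a given dataset size. The helper estimate_shard_count is hypothetical and not part of NeMo Curator. It assumes every shard is filled to images_per_tar before the next one starts, so the number of files the writer actually produces may differ.

```python
import math


def estimate_shard_count(num_images: int, images_per_tar: int) -> int:
    """Estimate how many tar shards a dataset needs.

    Assumption: shards are filled completely before a new one is started.
    The real writer may split shards differently (for example per batch).
    """
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(num_images / images_per_tar)


# Compare a few candidate settings for a 250,000-image dataset.
for images_per_tar in (100, 1000, 10000):
    shards = estimate_shard_count(250_000, images_per_tar)
    print(f"images_per_tar={images_per_tar}: about {shards} tar files")
```

Under these assumptions, each tenfold increase in images_per_tar cuts the file count by a factor of ten, which is the trade-off between parallelism and file count described above.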
Output Format
The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:
Output Structure:
output/
├── images-{hash}-000000.tar # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
Format Details:
Tar contents: JPEG images with sequential or ID-based filenames
Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
Naming: Deterministic or random file names, depending on the deterministic_name setting
Sharding: Configurable number of images per tar file for optimal performance
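Because each tar archive is expected to have a Parquet file with the same base name, a quick consistency check on the output directory can catch incomplete writes. The sketch below uses only the Python standard library. The helper find_unpaired_shards is a hypothetical name, and the check relies only on the .tar and .parquet extensions shown in the structure above.

```python
import tempfile
from pathlib import Path


def find_unpaired_shards(output_dir: str) -> tuple[list[str], list[str]]:
    """Return (tars_without_parquet, parquets_without_tar) as sorted base names."""
    root = Path(output_dir)
    tar_stems = {p.stem for p in root.glob("*.tar")}
    parquet_stems = {p.stem for p in root.glob("*.parquet")}
    return sorted(tar_stems - parquet_stems), sorted(parquet_stems - tar_stems)


if __name__ == "__main__":
    # Demo on a throwaway directory with empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in (
            "images-demo-000000.tar",
            "images-demo-000000.parquet",
            "images-demo-000001.tar",  # Deliberately missing its Parquet file
        ):
            (Path(tmp) / name).touch()
        missing_parquet, missing_tar = find_unpaired_shards(tmp)
        print("tar without parquet:", missing_parquet)
        print("parquet without tar:", missing_tar)
```

In practice you would pass the output_dir you gave to the writer stage. Two empty lists mean every shard has both files.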
Preparing for Downstream Use
Ensure your exported dataset matches the requirements of your training or analysis pipeline.
Use consistent naming and metadata fields for compatibility.
Document any filtering or processing steps for reproducibility.
Test loading the exported dataset before large-scale training.