Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

Key Topics

  • Saving curated images and metadata

  • Understanding output format structure

  • Configuring output sharding

  • Preparing data for downstream training or analysis

Saving Results

After your processing stages have run, save the curated images and their metadata by adding an ImageWriterStage to the pipeline.

Example:

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

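# Add the writer stage to an existing pipeline.
# `pipeline` is assumed to be defined earlier, with reader and processing stages already added.
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,  # Remove image data from memory after writing
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))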

Key Parameters:

  • output_dir: Directory where tar archives and metadata files are written

  • images_per_tar: Target number of images per tar file, which sets the shard size

  • remove_image_data: Whether to remove image data from memory after writing

  • verbose: Whether to emit more detailed log output while writing

  • deterministic_name: Ensures reproducible file naming based on input content

Behavior:

  • The writer stage creates tar files with curated images

  • Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives

  • Adjust images_per_tar to balance I/O, parallelism, and storage efficiency

  • Smaller values create more files, which gives downstream loaders more shards to read in parallel

  • Larger values reduce file count, but having fewer, larger shards can limit parallel loading
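To make the trade-off concrete, the following minimal sketch estimates how many tar shards a given images_per_tar value would produce. The helper name estimate_shards and the 250,000-image dataset size are hypothetical, and the calculation assumes each tar is filled before the next one starts; if the writer shards each processing batch separately, the real file count may be higher, so treat the result as a rough lower bound.

```python
import math


def estimate_shards(total_images: int, images_per_tar: int) -> int:
    """Return how many tar shards are needed to hold total_images."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(total_images / images_per_tar)


# Compare a few candidate shard sizes for a hypothetical 250,000-image dataset.
for size in (100, 1000, 10000):
    print(f"images_per_tar={size}: about {estimate_shards(250_000, size)} shards")
```

With these example numbers, a shard size of 100 yields thousands of small files, while 10,000 yields only a few dozen, which may be fewer than the number of workers you plan to load with.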

Output Format

The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:

Output Structure:

output/
├── images-{hash}-000000.tar    # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet

Format Details:

  • Tar contents: JPEG images with sequential or ID-based filenames

  • Metadata storage: Separate Parquet files containing image paths, IDs, scores, and processing metadata

  • Naming: Deterministic or random file names, depending on the deterministic_name setting

  • Sharding: The images_per_tar setting controls how many images go into each tar file
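Because each tar archive shares its base name with its Parquet metadata file, the two can be matched by file stem. The sketch below uses only the Python standard library; pair_shards and list_images are hypothetical helper names that are not part of NeMo Curator, and the code assumes the layout shown above, with the tar and Parquet files in the same directory.

```python
from pathlib import Path
import tarfile


def pair_shards(output_dir: str) -> list[tuple[Path, Path]]:
    """Pair every .tar file in output_dir with the .parquet file that shares its stem."""
    pairs = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        if not parquet_path.exists():
            raise FileNotFoundError(f"No metadata file for {tar_path.name}")
        pairs.append((tar_path, parquet_path))
    return pairs


def list_images(tar_path: Path) -> list[str]:
    """Return the names of the regular files stored in one tar shard."""
    with tarfile.open(tar_path) as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]
```

Calling pair_shards on the writer's output_dir returns the shard pairs in sorted order, and list_images on any of the tar paths shows which image files that shard holds. Reading the Parquet metadata itself needs a Parquet-capable library such as pandas or pyarrow, which is left out here to keep the sketch dependency-free.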

Preparing for Downstream Use

  • Ensure your exported dataset matches the requirements of your training or analysis pipeline.

  • Use consistent naming and metadata fields for compatibility.

  • Document any filtering or processing steps for reproducibility.

  • Test loading the exported dataset before large-scale training.
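One way to test an export before committing to a long training run is a quick structural check over the output directory. The sketch below is an illustration rather than a NeMo Curator utility: check_export is a hypothetical name, and the size check assumes that each tar holds only image files and that images_per_tar acts as an upper bound per shard.

```python
from pathlib import Path
import tarfile


def check_export(output_dir: str, images_per_tar: int) -> list[str]:
    """Return a list of human-readable problems found in an exported dataset."""
    problems = []
    tar_paths = sorted(Path(output_dir).glob("*.tar"))
    if not tar_paths:
        problems.append("no tar shards found")
    for tar_path in tar_paths:
        # Every shard is expected to have a Parquet file with the same stem.
        if not tar_path.with_suffix(".parquet").exists():
            problems.append(f"{tar_path.name}: missing Parquet metadata file")
        try:
            with tarfile.open(tar_path) as tar:
                count = sum(1 for member in tar if member.isfile())
        except tarfile.TarError as err:
            problems.append(f"{tar_path.name}: unreadable ({err})")
            continue
        if count == 0:
            problems.append(f"{tar_path.name}: contains no files")
        elif count > images_per_tar:
            problems.append(
                f"{tar_path.name}: {count} files exceeds images_per_tar={images_per_tar}"
            )
    return problems
```

An empty list means the structural checks passed. It does not confirm that the images decode correctly or that the metadata fields match what your training code expects, so a short trial run with your actual data loader is still worthwhile.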