Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

Key Topics

Saving metadata to Parquet files
Exporting filtered datasets as tar archives
Configuring output sharding
Understanding output format structure
Preparing data for downstream training or analysis

Saving Results

After the processing stages have run, use ImageWriterStage to save the curated images and their metadata.

Example:

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to an existing Pipeline instance (here named `pipeline`)
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))

The writer stage creates tar files containing the curated images.
Metadata (if updated during the curation pipeline) is stored in separate Parquet files alongside the tar archives.
images_per_tar sets how many images go into each tar file, which controls shard size.
deterministic_name=True makes output file names reproducible by deriving them from the input content.
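To illustrate what "reproducible file naming based on input content" can mean in practice, here is a minimal sketch that derives a shard name from a hash of the image IDs. The helper deterministic_shard_name, the use of SHA-256, and the 12-character hash length are assumptions made for this illustration; they are not NeMo Curator's actual naming scheme.

```python
import hashlib


def deterministic_shard_name(image_ids, shard_index, prefix="images"):
    """Hypothetical helper: build a shard base name from the image IDs it holds.

    Sorting the IDs first means the same set of inputs always yields the
    same name, regardless of the order in which they were processed.
    """
    digest = hashlib.sha256()
    for image_id in sorted(image_ids):
        digest.update(image_id.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return f"{prefix}-{digest.hexdigest()[:12]}-{shard_index:06d}"


# The same inputs in a different order produce the same name.
name_a = deterministic_shard_name(["img_2", "img_1"], shard_index=0)
name_b = deterministic_shard_name(["img_1", "img_2"], shard_index=0)
print(name_a == name_b)
```

Because the name depends only on the inputs, rerunning the same job would overwrite the same files instead of accumulating new ones, which is the property that makes output easy to diff between runs.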

Pipeline-Based Filtering

Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don’t meet the configured thresholds, so only curated images reach the final ImageWriterStage.

Example Pipeline Flow:

from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Complete pipeline with filtering
pipeline = Pipeline(name="image_curation")

# Load images
pipeline.add_stage(FilePartitioningStage(...))
pipeline.add_stage(ImageReaderStage(...))

# Generate embeddings
pipeline.add_stage(ImageEmbeddingStage(...))

# Filter by quality (removes images with low aesthetic scores)
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))

# Filter NSFW content (removes images with high NSFW scores)
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Save curated results
pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))

Filtering is built into the stages, so no separate filtering step is needed.
Only images that pass every filter reach the output.
Each filter stage has its own configurable threshold (score_threshold in the example above).
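The effect of chaining the two filters can be sketched in plain Python, independent of the library. The ScoredImage class, its field names, and the exact boundary comparisons (>= for aesthetic, < for NSFW) are assumptions for this illustration; the real stages may treat a score equal to the threshold differently.

```python
from dataclasses import dataclass


@dataclass
class ScoredImage:
    # Hypothetical record type; field names are illustrative only.
    image_id: str
    aesthetic_score: float
    nsfw_score: float


def apply_filters(images, aesthetic_threshold=0.5, nsfw_threshold=0.5):
    """Keep images with a high enough aesthetic score and a low enough NSFW score."""
    # Stage 1: drop images whose aesthetic score is below the threshold.
    kept = [img for img in images if img.aesthetic_score >= aesthetic_threshold]
    # Stage 2: drop images whose NSFW score is at or above the threshold.
    kept = [img for img in kept if img.nsfw_score < nsfw_threshold]
    return kept


sample = [
    ScoredImage("a", aesthetic_score=0.9, nsfw_score=0.1),  # passes both filters
    ScoredImage("b", aesthetic_score=0.2, nsfw_score=0.1),  # fails the aesthetic filter
    ScoredImage("c", aesthetic_score=0.8, nsfw_score=0.7),  # fails the NSFW filter
]
surviving_ids = [img.image_id for img in apply_filters(sample)]
print(surviving_ids)
```

An image must pass every stage to survive, so tightening either threshold can only shrink the set that reaches the writer.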

Output Format

The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:

Output Structure:

output/
├── images-{hash}-000000.tar    # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet

Format Details:

Tar contents: JPEG images with sequential or ID-based filenames
Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
Naming: Deterministic or random file names, depending on deterministic_name
Sharding: The number of images per tar file is set by images_per_tar
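Since each tar archive and its Parquet file share a base name, a quick consistency check is to pair them by stem. The sketch below uses only the standard library; pair_shards is a hypothetical helper, and the file names in the demo are made up to match the naming pattern shown above.

```python
import tempfile
from pathlib import Path


def pair_shards(output_dir):
    """Return (paired, unpaired) base names of .tar and .parquet files."""
    output_dir = Path(output_dir)
    tar_stems = {p.stem for p in output_dir.glob("*.tar")}
    parquet_stems = {p.stem for p in output_dir.glob("*.parquet")}
    paired = sorted(tar_stems & parquet_stems)
    unpaired = sorted(tar_stems ^ parquet_stems)  # present on one side only
    return paired, unpaired


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "images-abc123-000000.tar",
        "images-abc123-000000.parquet",
        "images-abc123-000001.tar",  # deliberately left without a Parquet file
    ):
        (Path(tmp) / name).touch()
    paired, unpaired = pair_shards(tmp)

print("paired:", paired)
print("unpaired:", unpaired)
```

An unpaired tar is not necessarily an error, since metadata may only be written when the pipeline updated it, but an unpaired Parquet file usually deserves a closer look.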

Configuring Output Sharding

The ImageWriterStage parameters control how images get distributed across output tar files.

Example:

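from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Configure output sharding on an existing Pipeline instance (`pipeline`)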
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=5000,  # Images per tar file
    remove_image_data=True,
    deterministic_name=True,
))

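Adjust images_per_tar to balance I/O, parallelism, and storage efficiency.
Smaller values create more tar files, which gives downstream readers more shards to process in parallel.
Larger values reduce the file count, but each shard is bigger, which may slow loading when there are fewer shards than workers.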
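A rough way to reason about the trade-off is to estimate how many tar files a given images_per_tar produces. The sketch below assumes the writer fills each tar completely before starting the next one; if batches are written independently, the real count can be higher. The helper name and the 120,000-image dataset size are illustrative only.

```python
import math


def estimate_shard_count(num_images, images_per_tar):
    """Estimate the number of tar files, assuming every tar is filled before the next."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(num_images / images_per_tar)


num_images = 120_000  # hypothetical number of images that survive filtering
for images_per_tar in (1000, 5000):
    shards = estimate_shard_count(num_images, images_per_tar)
    print(f"images_per_tar={images_per_tar}: about {shards} tar files")
```

With these numbers, 1,000 images per tar yields about 120 shards and 5,000 yields about 24, so a loader with 64 workers would leave some of them idle in the second configuration.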

Preparing for Downstream Use

Ensure your exported dataset matches the requirements of your training or analysis pipeline.
Use consistent naming and metadata fields for compatibility.
Document any filtering or processing steps for reproducibility.
Test loading the exported dataset before large-scale training.
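A load test can start as a simple smoke check that every tar opens and contains non-empty image files. The sketch below uses Python's standard tarfile module; smoke_test_tar is a hypothetical helper, the .jpg suffix is an assumption based on the JPEG output described above, and the demo archive holds placeholder bytes instead of real images.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def smoke_test_tar(tar_path, expected_suffix=".jpg"):
    """Open a tar, check every regular file, and return how many were found."""
    count = 0
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            if not member.name.lower().endswith(expected_suffix):
                raise ValueError(f"unexpected file in archive: {member.name}")
            if member.size == 0:
                raise ValueError(f"empty file in archive: {member.name}")
            count += 1
    return count


# Demo: build a tiny archive with placeholder payloads and check it.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = Path(tmp) / "images-demo-000000.tar"
    with tarfile.open(tar_path, "w") as archive:
        for name in ("000000.jpg", "000001.jpg"):
            payload = b"placeholder bytes, not a real JPEG"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    member_count = smoke_test_tar(tar_path)

print(member_count)
```

This only confirms that the archives are structurally readable; decoding a sample of images with the training pipeline's own loader is still worth doing before a large run.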