# Data Export Concepts (Image)
This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.
## Key Topics

- Saving metadata to Parquet files
- Exporting filtered datasets as tar archives
- Configuring output sharding
- Understanding output format structure
- Preparing data for downstream training or analysis
## Saving Results

After processing through the pipeline, you can save the curated images and metadata using the `ImageWriterStage`.
Example:

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```
- The writer stage creates tar files containing the curated images.
- Metadata (if updated during the curation pipeline) is stored in separate Parquet files alongside the tar archives.
- The number of images per tar file is configurable for optimal sharding.
- `deterministic_name=True` ensures reproducible file naming based on input content.
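The idea behind deterministic naming can be illustrated with a small, library-free sketch: hashing the input identifiers yields a stable prefix, so re-running the same inputs reproduces the same file names. The hashing scheme below is illustrative only, not NeMo Curator's actual implementation.

```python
import hashlib

def shard_name(input_ids: list[str], shard_index: int) -> str:
    """Derive a stable tar-file name from the input content identifiers."""
    digest = hashlib.sha256("\n".join(sorted(input_ids)).encode()).hexdigest()[:8]
    return f"images-{digest}-{shard_index:06d}.tar"

# The same inputs always map to the same name, regardless of listing order
a = shard_name(["img_001.jpg", "img_002.jpg"], 0)
b = shard_name(["img_002.jpg", "img_001.jpg"], 0)
assert a == b
```

Because the name depends only on the inputs, re-running curation over an unchanged dataset overwrites the same shards rather than accumulating duplicates.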
## Pipeline-Based Filtering

Filtering happens automatically within the pipeline stages. Each filter stage (aesthetic, NSFW) removes images that don't meet the configured thresholds, so only curated images reach the final `ImageWriterStage`.
Example Pipeline Flow:

```python
from nemo_curator.pipeline.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Complete pipeline with filtering
pipeline = Pipeline(name="image_curation")

# Load images
pipeline.add_stage(FilePartitioningStage(...))
pipeline.add_stage(ImageReaderStage(...))

# Generate embeddings
pipeline.add_stage(ImageEmbeddingStage(...))

# Filter by quality (removes low aesthetic scores)
pipeline.add_stage(ImageAestheticFilterStage(score_threshold=0.5))

# Filter NSFW content (removes high NSFW scores)
pipeline.add_stage(ImageNSFWFilterStage(score_threshold=0.5))

# Save curated results
pipeline.add_stage(ImageWriterStage(output_dir="/output/curated"))
```
- Filtering is built into the stages; no separate filtering step is needed.
- Only images passing all filters reach the output.
- Thresholds are configurable per stage.
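The thresholding logic the filter stages apply can be sketched without the library: keep an image only if its aesthetic score is at or above the threshold and its NSFW score is below it. The field names and comparison directions below are assumptions for illustration, not NeMo Curator's exact internals.

```python
def passes_filters(record: dict,
                   aesthetic_threshold: float = 0.5,
                   nsfw_threshold: float = 0.5) -> bool:
    """Return True if the image record survives both filter stages."""
    return (record["aesthetic_score"] >= aesthetic_threshold
            and record["nsfw_score"] < nsfw_threshold)

records = [
    {"id": "a", "aesthetic_score": 0.9, "nsfw_score": 0.1},  # kept
    {"id": "b", "aesthetic_score": 0.2, "nsfw_score": 0.1},  # low aesthetic -> dropped
    {"id": "c", "aesthetic_score": 0.8, "nsfw_score": 0.7},  # high NSFW -> dropped
]
curated = [r for r in records if passes_filters(r)]
assert [r["id"] for r in curated] == ["a"]
```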
## Output Format

The `ImageWriterStage` creates tar archives containing curated images with accompanying metadata files.
Output Structure:

```text
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```
Format Details:

- Tar contents: JPEG images with sequential or ID-based filenames
- Metadata storage: separate Parquet files containing image paths, IDs, and processing metadata
- Naming: deterministic or random naming based on configuration
- Sharding: configurable number of images per tar file for optimal performance
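To make the shard layout concrete, here is a self-contained sketch that writes a tiny tar shard with Python's standard `tarfile` module and lists its members. A real shard produced by `ImageWriterStage` would additionally have a sibling `.parquet` file with one metadata row per image; the two-member payload below is fabricated purely for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
shard = tmp / "images-abc123-000000.tar"

# Write two fake JPEG payloads into the shard, one member per image
with tarfile.open(shard, "w") as tar:
    for name in ("000000.jpg", "000001.jpg"):
        payload = b"\xff\xd8\xff" + name.encode()  # placeholder bytes, not a real JPEG
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Inspect the shard the same way a downstream loader would
with tarfile.open(shard) as tar:
    members = tar.getnames()
assert members == ["000000.jpg", "000001.jpg"]
```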
## Configuring Output Sharding

The `ImageWriterStage` parameters control how images are distributed across output tar files.
Example:

```python
# Configure output sharding
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=5000,  # Images per tar file
    remove_image_data=True,
    deterministic_name=True,
))
```
Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency:

- Smaller values create more files but enable better parallelism.
- Larger values reduce file count but may impact loading performance.
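The trade-off is easy to quantify: the number of output shards is the dataset size divided by `images_per_tar`, rounded up. A quick back-of-the-envelope helper (illustrative only):

```python
import math

def shard_count(total_images: int, images_per_tar: int) -> int:
    """Number of tar files the writer will produce."""
    return math.ceil(total_images / images_per_tar)

# 1M curated images: 5000 per tar yields 200 shards; 1000 per tar yields 1000
assert shard_count(1_000_000, 5000) == 200
assert shard_count(1_000_000, 1000) == 1000
assert shard_count(10_001, 5000) == 3  # last shard is partially filled
```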
## Preparing for Downstream Use

- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
- Test loading the exported dataset before large-scale training.
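As a concrete smoke test for that last point, here is a stdlib-only check that every tar shard in an output directory opens cleanly and is non-empty. The directory layout and `.tar` extension follow the structure shown above; the demo shard is fabricated, so adapt the check to your actual loader.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def smoke_test_shards(output_dir: str) -> dict:
    """Open every tar shard under output_dir and count its members."""
    counts = {}
    for shard in sorted(Path(output_dir).glob("*.tar")):
        with tarfile.open(shard) as tar:
            members = tar.getnames()
        if not members:
            raise ValueError(f"empty shard: {shard.name}")
        counts[shard.name] = len(members)
    return counts

# Demo on a throwaway directory containing one fabricated shard
outdir = Path(tempfile.mkdtemp())
with tarfile.open(outdir / "images-demo-000000.tar", "w") as tar:
    info = tarfile.TarInfo("000000.jpg")
    payload = b"\xff\xd8\xff"  # placeholder bytes, not a real JPEG
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

counts = smoke_test_shards(str(outdir))
assert counts == {"images-demo-000000.tar": 1}
```

Running a check like this before kicking off training catches truncated or missing shards while they are still cheap to regenerate.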