
Copyright © 2026, NVIDIA Corporation.

Saving and Exporting Image Datasets


After processing and filtering your image datasets with NeMo Curator's pipeline stages, you can save the results and export the curated data for downstream use. Saving is itself a pipeline stage, so it is configured and added in the same way as the processing stages that precede it.

Saving Results with ImageWriterStage

The ImageWriterStage is the primary method for saving curated images and metadata to tar archives with accompanying Parquet files. This stage is typically the final step in your image curation pipeline.

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add ImageWriterStage to your pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_images",  # Output directory for tar files and metadata
    images_per_tar=1000,                  # Number of images per tar file
    remove_image_data=True,               # Remove image data from memory after writing
    verbose=True,                         # Enable progress logging
))

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_dir | str | Required | Output directory for tar files and metadata |
| images_per_tar | int | 1000 | Number of images per tar file (controls sharding) |
| verbose | bool | False | Enable verbose logging for debugging |
| deterministic_name | bool | True | Use deterministic hash-based naming for output files |
| remove_image_data | bool | False | Remove image data from memory after writing (saves memory) |
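
To get a feel for how images_per_tar affects sharding, the sketch below computes a lower bound on the number of tar files for a given image count. It assumes only that each tar holds at most images_per_tar images. The helper name min_tar_count is hypothetical and not part of NeMo Curator, and the actual number of files may be higher depending on how the pipeline batches images.

```python
import math


def min_tar_count(num_images: int, images_per_tar: int = 1000) -> int:
    """Lower bound on the number of tar shards needed for num_images.

    Hypothetical helper for planning only; it does not call NeMo Curator.
    """
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Each shard holds at most images_per_tar images, so round up.
    return math.ceil(num_images / images_per_tar)


print(min_tar_count(2500))       # 3
print(min_tar_count(2500, 500))  # 5
```

Smaller values of images_per_tar produce more, smaller shards, which can help parallel readers downstream; larger values produce fewer files to manage.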

Output Format

The ImageWriterStage creates:

  • Tar Archives: .tar files containing JPEG images
  • Parquet Files: .parquet files with metadata for each corresponding tar file
  • Deterministic Naming: Files named with content-based hashes for reproducibility
  • Preserved Metadata: All scores and metadata from processing stages stored in Parquet files

Output Structure:

output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
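
Because each tar file shares its base name with a Parquet file, a downstream job can pair them by file stem. The following is a minimal sketch using only the Python standard library. The function names pair_shards and unpaired_shards are hypothetical, and the sketch assumes the flat directory layout shown above.

```python
from pathlib import Path


def pair_shards(output_dir):
    """Map each shard stem to its (tar, parquet) paths.

    A value of None marks a missing file for that stem.
    """
    root = Path(output_dir)
    tars = {p.stem: p for p in root.glob("*.tar")}
    parquets = {p.stem: p for p in root.glob("*.parquet")}
    stems = sorted(tars.keys() | parquets.keys())
    return {stem: (tars.get(stem), parquets.get(stem)) for stem in stems}


def unpaired_shards(output_dir):
    """Return the stems that lack either the tar or the Parquet file."""
    return [
        stem
        for stem, (tar, parquet) in pair_shards(output_dir).items()
        if tar is None or parquet is None
    ]
```

Calling unpaired_shards on the output directory after a run is a quick sanity check that every archive has its metadata file.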

Each tar file contains JPEG images with sequential or ID-based filenames, while metadata (including aesthetic_score, nsfw_score, and other processing data) is stored in the accompanying Parquet files.
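
To inspect what a shard contains, you can list its members with Python's standard tarfile module. This is a sketch rather than part of NeMo Curator: the function name list_jpeg_members is hypothetical, and it assumes the images carry a .jpg or .jpeg extension.

```python
import tarfile


def list_jpeg_members(tar_path):
    """Return the sorted names of JPEG files stored in a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            # Assumption: JPEG images use a .jpg or .jpeg extension.
            if member.isfile() and member.name.lower().endswith((".jpg", ".jpeg"))
        )
```

The matching Parquet file can then be loaded with any Parquet reader (for example pandas.read_parquet, if pandas and a Parquet engine are installed) to look up scores for those images.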


For more details on stage parameters and customization options, see the ImageWriterStage documentation and the Complete Tutorial.