Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

Key Topics

  • Saving curated images and metadata
  • Understanding output format structure
  • Configuring output sharding
  • Preparing data for downstream training or analysis

Saving Results

After images have passed through the pipeline, save the curated images and their metadata with the ImageWriterStage.

Example:

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))

Key Parameters:

  • output_dir: Directory where tar archives and metadata files are written
  • images_per_tar: Number of images written to each tar file; this sets the shard size
  • remove_image_data: Whether to remove image data from memory after writing
  • deterministic_name: Ensures reproducible file naming based on input content

Behavior:

  • The writer stage creates tar files with curated images
  • Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
  • Adjust images_per_tar to balance I/O, parallelism, and storage efficiency
  • Smaller values create more files but enable better parallelism
  • Larger values reduce file count but can slow loading, because fewer shards are available to read in parallel
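
To make the images_per_tar trade-off concrete, the sketch below estimates how many tar files a given setting produces. The helper estimate_shards and the dataset size are hypothetical and not part of NeMo Curator. The result is only a rough lower bound, since the writer may start new tar files per batch or per worker.

```python
import math


def estimate_shards(num_images: int, images_per_tar: int) -> int:
    """Return the minimum number of tar files needed to hold num_images."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(num_images / images_per_tar)


# Hypothetical dataset of 1,250,000 curated images.
for per_tar in (100, 1000, 10000):
    print(f"images_per_tar={per_tar}: at least {estimate_shards(1_250_000, per_tar)} tar files")
```

With these illustrative numbers, a setting of 100 yields thousands of small files, while 10000 yields only a few hundred shards to spread across readers.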

Output Format

The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:

Output Structure:

output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet

Format Details:

  • Tar contents: JPEG images with sequential or ID-based filenames
  • Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
  • Naming: Deterministic or random naming based on configuration
  • Sharding: The number of images per tar file is configurable through images_per_tar
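
Because each tar archive shares its base name with one Parquet file, a consumer can match them by file stem. The following is a minimal sketch using only the Python standard library. The function pair_shards is a hypothetical helper, not a NeMo Curator API, and it assumes the flat directory layout shown above.

```python
from pathlib import Path


def pair_shards(output_dir: str) -> list[tuple[Path, Path]]:
    """Pair each .tar shard with the .parquet file that shares its stem."""
    pairs = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        if not parquet_path.exists():
            # A tar without metadata suggests an incomplete or interrupted export.
            raise FileNotFoundError(f"No metadata file for {tar_path.name}")
        pairs.append((tar_path, parquet_path))
    return pairs
```

Each returned pair can then be handed to a reader of your choice, for example tarfile.open for the images and pandas.read_parquet for the metadata. The metadata column names are not assumed here; inspect one Parquet file to confirm them.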

Preparing for Downstream Use

  • Ensure your exported dataset matches the requirements of your training or analysis pipeline.
  • Use consistent naming and metadata fields for compatibility.
  • Document any filtering or processing steps for reproducibility.
  • Test loading the exported dataset before large-scale training.
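
One way to test loading before large-scale training is to open every tar archive and count the images it contains. The sketch below uses only the Python standard library. The helper count_images_per_shard is hypothetical, and the .jpg/.jpeg member suffixes are an assumption based on the JPEG contents described above, so adjust them if your archives use different names.

```python
import tarfile
from pathlib import Path

# Assumption: image members inside each tar end in .jpg or .jpeg.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def count_images_per_shard(output_dir: str) -> dict[str, int]:
    """Open every tar in output_dir and count its image members."""
    counts = {}
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        # tarfile.open raises tarfile.ReadError if the archive is corrupt.
        with tarfile.open(tar_path, "r") as archive:
            counts[tar_path.name] = sum(
                1
                for member in archive.getmembers()
                if member.isfile()
                and Path(member.name).suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Comparing these counts against the configured images_per_tar and against the row counts of the matching Parquet files is a quick way to catch truncated or empty shards.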