
Data Export Concepts (Image)


This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

Key Topics

  • Saving curated images and metadata
  • Understanding output format structure
  • Configuring output sharding
  • Preparing data for downstream training or analysis

Saving Results

After the pipeline has processed your data, you can save the curated images and metadata using the ImageWriterStage.

Example:

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,      # Images per tar file
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

Key Parameters:

  • output_dir: Directory where tar archives and metadata files are written
  • images_per_tar: Number of images written to each tar shard; controls the size and count of output files
  • remove_image_data: Whether to remove image data from memory after writing
  • deterministic_name: Ensures reproducible file naming based on input content

Behavior:

  • The writer stage creates tar files with curated images
  • Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
  • Adjust images_per_tar to balance I/O, parallelism, and storage efficiency
  • Smaller values create more files but enable better parallelism
  • Larger values reduce file count but may impact loading performance
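The sharding tradeoff above can be sketched with a bit of arithmetic. The helper and the dataset sizes below are illustrative, not NeMo Curator defaults:

```python
import math

def shard_layout(num_images: int, images_per_tar: int) -> tuple[int, int]:
    """Return (number of tar files, images in the final tar)."""
    num_tars = math.ceil(num_images / images_per_tar)
    last_tar = num_images - (num_tars - 1) * images_per_tar
    return num_tars, last_tar

# 250,000 curated images at 1,000 images per tar -> 250 full shards
print(shard_layout(250_000, 1_000))  # (250, 1000)

# Smaller shards mean more files but more units of parallelism for loaders
print(shard_layout(250_000, 100))    # (2500, 100)
```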

Output Format

The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:

Output Structure:

```
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
└── images-{hash}-000001.parquet
```

Format Details:

  • Tar contents: JPEG images with sequential or ID-based filenames
  • Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
  • Naming: Deterministic or random file names, depending on configuration
  • Sharding: Configurable number of images per tar file for optimal performance
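As a rough sketch of this on-disk layout, the stdlib snippet below builds a toy shard tar in a temporary directory and lists its members. The shard and member names are illustrative, not the exact names ImageWriterStage produces; the sibling Parquet metadata would be read separately (for example with `pandas.read_parquet`):

```python
import io
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "images-abc123-000000.tar"

    # Build a toy shard: two fake JPEG payloads (real shards hold encoded images)
    with tarfile.open(shard, "w") as tar:
        for i in range(2):
            payload = b"\xff\xd8\xff" + bytes(16)  # JPEG SOI marker + padding
            info = tarfile.TarInfo(name=f"{i:06d}.jpg")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

    # Inspect the shard: member names map to rows in the sibling .parquet file
    with tarfile.open(shard, "r") as tar:
        names = tar.getnames()

print(names)  # ['000000.jpg', '000001.jpg']
```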

Preparing for Downstream Use

  • Ensure your exported dataset matches the requirements of your training or analysis pipeline.
  • Use consistent naming and metadata fields for compatibility.
  • Document any filtering or processing steps for reproducibility.
  • Test loading the exported dataset before large-scale training.
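The last point can be automated with a quick sanity check before training. The helper below is hypothetical (not part of the NeMo Curator API): it walks a shard directory, confirms every tar has a sibling Parquet file, and verifies each member starts with the JPEG magic bytes.

```python
import io
import tarfile
import tempfile
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker

def validate_export(output_dir: str) -> list[str]:
    """Return a list of problems found in an exported shard directory."""
    problems = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        # Each tar should ship with a sibling Parquet metadata file
        if not tar_path.with_suffix(".parquet").exists():
            problems.append(f"missing metadata for {tar_path.name}")
        with tarfile.open(tar_path) as tar:
            for member in tar.getmembers():
                head = tar.extractfile(member).read(2)
                if head != JPEG_SOI:
                    problems.append(f"{tar_path.name}:{member.name} is not a JPEG")
    return problems

# Demo on a toy shard that is missing its .parquet sidecar
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "images-abc123-000000.tar"
    with tarfile.open(shard, "w") as tar:
        payload = JPEG_SOI + bytes(8)
        info = tarfile.TarInfo(name="000000.jpg")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    report = validate_export(tmp)

print(report)  # ['missing metadata for images-abc123-000000.tar']
```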