---
description: >-
  Core concepts for saving and exporting curated image datasets including
  metadata and resharding
categories:
  - concepts-architecture
tags:
  - data-export
  - tar-files
  - parquet
  - resharding
  - metadata
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: image-only
---

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

* Saving curated images and metadata
* Understanding output format structure
* Configuring output sharding
* Preparing data for downstream training or analysis

## Saving Results

After processing through the pipeline, you can save the curated images and metadata using the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

**Key Parameters:**

* `output_dir`: Directory where tar archives and metadata files are written
* `images_per_tar`: Number of images per tar file for optimal sharding
* `remove_image_data`: Whether to remove image data from memory after writing
* `deterministic_name`: Ensures reproducible file naming based on input content

**Behavior:**

* The writer stage creates tar files with curated images
* Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
* Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency
  * Smaller values create more files but enable better parallelism
  * Larger values reduce file count but may impact loading performance

## Output Format

The `ImageWriterStage` creates tar archives containing curated images with accompanying metadata files:

**Output Structure:**

```bash
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

**Format Details:**

* **Tar contents**: JPEG images with sequential or ID-based filenames
* **Metadata storage**: Separate Parquet files containing image paths, IDs, and processing metadata
* **Naming**: Deterministic or random naming based on configuration
* **Sharding**: Configurable number of images per tar file for optimal performance

## Preparing for Downstream Use

* Ensure your exported dataset matches the requirements of your training or analysis pipeline.
* Use consistent naming and metadata fields for compatibility.
* Document any filtering or processing steps for reproducibility.
* Test loading the exported dataset before large-scale training.
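
One way to test an export before training is a quick structural check of the output directory. The sketch below is not part of NeMo Curator: `check_export` is a hypothetical helper, and it assumes the layout shown under Output Structure, where every `.tar` shard has a `.parquet` file with the same stem. It uses only the Python standard library, so it counts the files in each tar but does not read the Parquet contents.

```python
import tarfile
from pathlib import Path


def check_export(output_dir):
    """Pair each tar shard with its Parquet sidecar and count tar members.

    Hypothetical helper: assumes `<stem>.tar` and `<stem>.parquet` sit side
    by side in `output_dir`. Returns a dict mapping shard stem to the number
    of regular files in the tar. Raises FileNotFoundError if a tar shard has
    no matching .parquet file.
    """
    counts = {}
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        sidecar = tar_path.with_suffix(".parquet")
        if not sidecar.exists():
            raise FileNotFoundError(f"missing metadata file for {tar_path.name}")
        with tarfile.open(tar_path) as tar:
            # Count only regular files, skipping any directory entries.
            counts[tar_path.stem] = sum(1 for m in tar.getmembers() if m.isfile())
    return counts
```

Calling `check_export("/output/curated_dataset")` returns one entry per shard; with `images_per_tar=1000`, no shard should report more than 1000 files. If a Parquet reader such as pandas or PyArrow is available, comparing each count against the row count of the matching metadata file would be a natural next check.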