---
description: >-
  Core concepts for saving and exporting curated image datasets including
  metadata and resharding
categories:
  - concepts-architecture
tags:
  - data-export
  - tar-files
  - parquet
  - resharding
  - metadata
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: image-only
---

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

* Saving curated images and metadata
* Understanding output format structure
* Configuring output sharding
* Preparing data for downstream training or analysis

## Saving Results

After the pipeline has processed your data, save the curated images and their metadata with the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to an existing pipeline
# (assumes `pipeline` was constructed earlier with the preceding stages)
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

**Key Parameters:**

* `output_dir`: Directory where tar archives and Parquet metadata files are written
* `images_per_tar`: Target number of images per tar file; this sets the shard size
* `remove_image_data`: If `True`, removes image data from memory after it has been written
* `verbose`: Enables more detailed logging
* `deterministic_name`: If `True`, derives file names from the input content so that reruns on the same data produce the same names

**Behavior:**

* The writer stage creates tar files containing the curated images
* Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside the tar archives
* Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency:
  * Smaller values create more files but allow more shards to be read in parallel
  * Larger values reduce file count but produce bigger shards, which can limit parallel loading
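To make the trade-off concrete, the following sketch estimates how many tar files a given `images_per_tar` would produce. It is plain arithmetic, not part of NeMo Curator: the helper `estimate_shards` is hypothetical, and it assumes images are packed consecutively into shards. The real shard count may differ depending on how the pipeline batches images.

```python
import math


def estimate_shards(num_images: int, images_per_tar: int) -> int:
    """Estimate the number of tar files, assuming each holds at most images_per_tar images."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(num_images / images_per_tar)


# Compare shard counts for a hypothetical dataset of 250,000 curated images.
for images_per_tar in (100, 1000, 10000):
    shards = estimate_shards(250_000, images_per_tar)
    print(f"images_per_tar={images_per_tar}: about {shards} tar files")
```

A shard count well above the number of data-loading workers leaves room for parallel reads, while a very high count increases per-file overhead.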

## Output Format

The `ImageWriterStage` creates tar archives containing curated images with accompanying metadata files:

**Output Structure:**

```text
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
└── images-{hash}-000001.parquet
```

**Format Details:**

* **Tar contents**: JPEG images with sequential or ID-based filenames
* **Metadata storage**: Separate Parquet files containing image paths, IDs, and processing metadata
* **Naming**: Deterministic or random, depending on the `deterministic_name` setting
* **Sharding**: The number of images per tar file is set by `images_per_tar`
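Because each tar archive shares its file stem with its Parquet file, the two can be matched by name. The sketch below uses only the Python standard library; `pair_shards` is a hypothetical helper, not a NeMo Curator function, and it assumes the layout shown above (one `.parquet` file per `.tar` file in the same directory).

```python
from pathlib import Path


def pair_shards(output_dir):
    """Return (tar_path, parquet_path) pairs matched by file stem.

    parquet_path is None when a tar archive has no matching Parquet file.
    """
    pairs = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        pairs.append((tar_path, parquet_path if parquet_path.exists() else None))
    return pairs
```

Each Parquet file in a pair can then be loaded with a Parquet reader such as `pandas.read_parquet`. The exact metadata columns depend on the stages in your pipeline, so inspect the columns before relying on specific names.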

## Preparing for Downstream Use

* Ensure your exported dataset matches the requirements of your training or analysis pipeline.
* Use consistent naming and metadata fields for compatibility.
* Document any filtering or processing steps for reproducibility.
* Test loading the exported dataset before large-scale training.
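
As a starting point for the load test in the last item, the sketch below opens a single tar shard with the standard library and counts its file members, failing on any empty file. The helper `count_tar_members` is hypothetical and checks only that the archive is readable; it does not decode the images or validate them against the Parquet metadata.

```python
import tarfile


def count_tar_members(tar_path):
    """Count regular files in a tar shard, raising ValueError on an empty file."""
    count = 0
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            if member.size == 0:
                raise ValueError(f"empty file in {tar_path}: {member.name}")
            count += 1
    return count
```

Running this over a few shards before training catches truncated or empty archives early, when they are still cheap to regenerate.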
