---
description: >-
  Save metadata, export filtered datasets, and configure output sharding for
  downstream use after image curation
categories:
  - how-to-guides
tags:
  - data-export
  - parquet
  - tar-files
  - filtering
  - sharding
  - metadata
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: image-only
---

# Saving and Exporting Image Datasets

After you process and filter image datasets with NeMo Curator's pipeline stages, you can save the results and export the curated data for downstream use. Because writing is itself a pipeline stage, you control the output location and shard size through that stage's parameters.

## Saving Results with ImageWriterStage

`ImageWriterStage` is the primary way to save curated results: it writes images to tar archives and their metadata to accompanying Parquet files. It is typically the final stage in an image curation pipeline.

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add ImageWriterStage as the final stage of your pipeline
# (`pipeline` is assumed to be a pipeline object you created earlier)
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_images",    # Output directory for tar files and metadata
    images_per_tar=1000,                    # Number of images per tar file
    remove_image_data=True,                 # Remove image data from memory after writing
    verbose=True,                           # Enable progress logging
))
```

### Parameters

| Parameter            | Type | Default  | Description                                                |
| -------------------- | ---- | -------- | ---------------------------------------------------------- |
| `output_dir`         | str  | Required | Output directory for tar files and metadata                |
| `images_per_tar`     | int  | 1000     | Number of images per tar file (controls sharding)          |
| `verbose`            | bool | False    | Enable verbose logging for debugging                       |
| `deterministic_name` | bool | True     | Use deterministic hash-based naming for output files       |
| `remove_image_data`  | bool | False    | Remove image data from memory after writing (saves memory) |
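
To make the sharding arithmetic concrete, the following is a minimal sketch (not part of NeMo Curator) that estimates how many tar/Parquet pairs a given `images_per_tar` value implies. `min_shard_count` is a hypothetical helper introduced here for illustration. It assumes every tar file is filled to `images_per_tar` before the next one starts; if shards are written per batch or per worker, the real count can be higher, so treat the result as a lower bound.

```python
def min_shard_count(num_images: int, images_per_tar: int = 1000) -> int:
    """Lower bound on the number of tar files for a dataset (hypothetical helper).

    Assumes each tar holds at most `images_per_tar` images.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    # Integer ceiling division: a partially filled last shard still counts.
    return (num_images + images_per_tar - 1) // images_per_tar


if __name__ == "__main__":
    print(min_shard_count(25_500))       # default of 1000 images per tar
    print(min_shard_count(25_500, 500))  # smaller shards, more files
```

Smaller values of `images_per_tar` produce more, smaller files, which may suit parallel readers; larger values reduce the file count at the cost of bigger individual archives.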

## Output Format

The `ImageWriterStage` creates:

* **Tar Archives**: `.tar` files containing JPEG images
* **Parquet Files**: `.parquet` files with metadata for each corresponding tar file
* **Deterministic Naming**: Files named with content-based hashes for reproducibility (when `deterministic_name=True`, the default)
* **Preserved Metadata**: All scores and metadata from processing stages stored in Parquet files

**Output Structure:**

```text
output/
├── images-{hash}-000000.tar     # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
└── images-{hash}-000001.parquet
```

Each tar file contains JPEG images with sequential or ID-based filenames. The accompanying Parquet file stores the metadata for those images, including `aesthetic_score`, `nsfw_score`, and other data produced by earlier processing stages.
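
For downstream use, you often need to match each tar archive with its Parquet file. The sketch below uses only the Python standard library and relies on the naming convention shown above, where a tar file and its metadata share the same file stem. `pair_shards` and `list_images` are hypothetical helpers, not NeMo Curator APIs, and the output path is the example directory used earlier on this page.

```python
import tarfile
from pathlib import Path


def pair_shards(output_dir: str) -> list[tuple[Path, Path]]:
    """Return (tar, parquet) path pairs that share a file stem."""
    pairs = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        # Skip tar files whose metadata file is missing.
        if parquet_path.exists():
            pairs.append((tar_path, parquet_path))
    return pairs


def list_images(tar_path: Path) -> list[str]:
    """List the names of the regular files stored in one tar shard."""
    with tarfile.open(tar_path, "r") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


if __name__ == "__main__":
    # Assumed output location; replace with your own `output_dir`.
    for tar_path, parquet_path in pair_shards("/output/curated_images"):
        print(tar_path.name, parquet_path.name, len(list_images(tar_path)))
```

Once the pairs are known, each Parquet file can be loaded with a Parquet reader such as `pandas.read_parquet` (which requires `pyarrow` or `fastparquet`) to filter on columns like `aesthetic_score`. The exact columns present depend on which stages ran in your pipeline.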

***

For more details on stage parameters and customization options, see the [ImageWriterStage documentation](/curate-images/process-data) and the [Complete Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py).
