---
description: >-
  Step-by-step guide to setting up and running your first image curation
  pipeline with NeMo Curator
categories:
  - getting-started
tags:
  - image-curation
  - installation
  - quickstart
  - gpu-accelerated
  - embedding
  - classification
  - tar-archives
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: tutorial
modality: image-only
---

# Get Started with Image Curation

This guide provides step-by-step instructions for setting up NeMo Curator's image curation capabilities. Follow these instructions to prepare your environment and execute your first image curation pipeline.

## Prerequisites

Ensure your environment meets the following prerequisites for NeMo Curator image curation modules:

* Python 3.10, 3.11, or 3.12
  * packaging >= 22.0
* Ubuntu 22.04/20.04
* NVIDIA GPU (required for all image modules)
  * Volta™ or higher (compute capability 7.0+)
  * CUDA 12 (or above)

If `uv` is not installed, refer to the [Installation Guide](/admin/installation) for setup instructions, or install it quickly with:

```bash
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
source $HOME/.local/bin/env
```

---

## Installation Options

You can install NeMo Curator using one of the following methods.

Install the image modules from PyPI:

```bash
uv pip install "nemo-curator[image_cuda12]"
```

Or install the latest version directly from GitHub using uv:

```bash
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
uv sync --extra image_cuda12
```

Then activate the environment and run your code:

```bash
source .venv/bin/activate
python your_script.py
```

NeMo Curator is also available as a standalone container:

```bash
# Pull the container
docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}

# Run the container
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
```

For details on container environments and configurations, see [Container Environments](/reference/infra/container-environments).
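Before installing, it can help to confirm the GPU requirement programmatically. The sketch below queries `nvidia-smi` for the device's compute capability and compares it against the 7.0 minimum listed above. It is a minimal, hypothetical helper (not part of NeMo Curator); the `compute_cap` query field is available in recent NVIDIA drivers, so treat that flag as an assumption for older driver versions.

```python
import shutil
import subprocess


def parse_compute_cap(raw: str) -> tuple[int, int]:
    """Parse `nvidia-smi --query-gpu=compute_cap` output such as '8.6' into (8, 6)."""
    major, minor = raw.strip().splitlines()[0].split(".")
    return int(major), int(minor)


def gpu_meets_requirements(min_cap: tuple[int, int] = (7, 0)) -> bool:
    """Return True if an NVIDIA GPU with sufficient compute capability is visible."""
    if shutil.which("nvidia-smi") is None:
        return False  # No NVIDIA driver tooling on PATH
    raw = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    # Tuple comparison: (8, 6) >= (7, 0) is True, (6, 1) >= (7, 0) is False
    return parse_compute_cap(raw) >= min_cap
```

Comparing `(major, minor)` tuples avoids the float pitfall where a hypothetical capability "7.10" would compare lower than "7.2".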
## Download Sample Configuration

NeMo Curator provides a working image curation example in the [Image Curation Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py). You can adapt this pipeline for your own datasets.

## Set Up Data Directory

Create directories to store your image datasets and models:

```bash
mkdir -p ~/nemo_curator/data/tar_archives
mkdir -p ~/nemo_curator/data/curated
mkdir -p ~/nemo_curator/models
```

For this example, you'll need:

* **Tar Archives**: JPEG images in `.tar` files (text and JSON files are ignored during loading)
* **Model Directory**: CLIP and classifier model weights (downloaded automatically on first run)

## Basic Image Curation Example

Here's a simple example to get started with NeMo Curator's image curation pipeline.

**CPU Memory Considerations**

Image loading and decoding happen in CPU memory before GPU processing. If you encounter out-of-memory errors during the `ImageReaderStage`, reduce:

* `batch_size`: Number of images per batch (reduce to 32-50 for systems with limited RAM)
* `num_threads`: Parallel decoding threads (reduce to 4 for systems with limited RAM)
* `num_cpus`: Ray Client CPU allocation (reduce to 8-16 for systems with limited RAM)

The example below uses conservative defaults suitable for most systems. On high-memory systems, you can increase these values for better performance.
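If you don't have a dataset on hand yet, you can sketch the expected input layout with the standard-library `tarfile` module: JPEG members packed into a `.tar` shard under `~/nemo_curator/data/tar_archives`. The helper below is illustrative only; the single-byte payloads are placeholders, not decodable JPEGs, so substitute real JPEG-encoded bytes before running the pipeline, and treat the `000000.jpg`-style member naming as an assumption about your dataset's convention.

```python
import io
import tarfile


def write_sample_shard(path: str, n_images: int = 3) -> None:
    """Pack placeholder JPEG payloads into a .tar shard.

    Each member is named like '000000.jpg'; text and JSON members
    would be ignored by the loader, so only .jpg files are written here.
    """
    with tarfile.open(path, "w") as tar:
        for i in range(n_images):
            payload = b"\xff\xd8\xff"  # JPEG SOI marker only -- NOT a valid image
            info = tarfile.TarInfo(name=f"{i:06d}.jpg")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


# Example usage: creates one shard in the input directory layout above
# write_sample_shard("~/nemo_curator/data/tar_archives/sample-000000.tar")
```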
To configure Ray with limited CPU resources:

```python
from nemo_curator.core.client import RayClient

ray_client = RayClient(num_cpus=8)  # Adjust based on available CPU cores
ray_client.start()
```

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Create image curation pipeline
pipeline = Pipeline(name="image_curation", description="Basic image curation with quality filtering")

# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="~/nemo_curator/data/tar_archives",  # Path to your tar archive directory
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    batch_size=50,
    verbose=True,
    num_threads=4,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings for images
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="~/nemo_curator/models",  # Directory containing model weights
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Stage 4: Filter by aesthetic quality (keep images with score >= 0.5)
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="~/nemo_curator/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    verbose=True,
))

# Stage 5: Filter NSFW content (remove images with score >= 0.5)
pipeline.add_stage(ImageNSFWFilterStage(
    model_dir="~/nemo_curator/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    verbose=True,
))

# Stage 6: Save curated images to new tar archives
pipeline.add_stage(ImageWriterStage(
    output_dir="~/nemo_curator/data/curated",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
))

# Execute the pipeline
executor = XennaExecutor()
pipeline.run(executor)
```

## Expected Output

After running the pipeline, you'll have:

```text
~/nemo_curator/data/curated/
├── images-{hash}-000000.tar      # Curated images (first shard)
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar      # Curated images (second shard)
├── images-{hash}-000001.parquet  # Metadata for corresponding tar
└── ...                           # Additional shards as needed
```

**Output Format Details:**

* **Tar Files**: Contain high-quality `.jpg` files that passed both aesthetic and NSFW filtering
* **Parquet Files**: Contain metadata for each corresponding tar file, including image paths, IDs, and processing scores
* **Naming Convention**: Files use hash-based prefixes (e.g., `images-a1b2c3d4e5f6-000000.tar`) for uniqueness across distributed processing
* **Scores**: Processing metadata includes `aesthetic_score` and `nsfw_score`, stored in the Parquet files

## Alternative: Using the Complete Tutorial

For a more comprehensive example with data download and more configuration options, see:

```bash
# Download the complete tutorial
wget -O ~/nemo_curator/image_curation_example.py https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/tutorials/image/getting-started/image_curation_example.py

# Run with your data
python ~/nemo_curator/image_curation_example.py \
  --input-wds-dataset-dir ~/nemo_curator/data/tar_archives \
  --output-dataset-dir ~/nemo_curator/data/curated \
  --model-dir ~/nemo_curator/models \
  --aesthetic-threshold 0.5 \
  --nsfw-threshold 0.5
```

## Next Steps

Explore the [Image Curation documentation](/curate-images) for more advanced processing techniques:

* **[Tar Archive Loading](/curate-images/load-data/tar-archives)** - Learn
about loading JPEG images from tar files
* **[CLIP Embeddings](/curate-images/process-data/embeddings/clip-embedder)** - Understand embedding generation
* **[Quality Filtering](/curate-images/process-data/filters)** - Advanced aesthetic and NSFW filtering
* **[Complete Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)** - Full working example with data download
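As a final sanity check on curated output, the per-image scores stored in the Parquet sidecars can be re-applied offline. The sketch below mirrors the keep/drop direction from the example pipeline (keep when `aesthetic_score >= 0.5`, drop when `nsfw_score >= 0.5`); the row dicts are illustrative stand-ins for Parquet rows, and the field names follow the Output Format Details above.

```python
def passes_filters(row: dict,
                   aesthetic_threshold: float = 0.5,
                   nsfw_threshold: float = 0.5) -> bool:
    """Keep a row only if it is aesthetic enough and not flagged as NSFW."""
    return (row["aesthetic_score"] >= aesthetic_threshold
            and row["nsfw_score"] < nsfw_threshold)


# Illustrative metadata rows, shaped like the Parquet sidecar contents
rows = [
    {"id": "a", "aesthetic_score": 0.72, "nsfw_score": 0.01},  # kept
    {"id": "b", "aesthetic_score": 0.31, "nsfw_score": 0.02},  # dropped: low aesthetic
    {"id": "c", "aesthetic_score": 0.80, "nsfw_score": 0.90},  # dropped: NSFW
]
kept = [r["id"] for r in rows if passes_filters(r)]
# kept == ["a"]
```

Re-running this logic with different thresholds on the saved scores lets you tune filtering strictness without re-running the GPU inference stages.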