***
description: >-
Step-by-step guide to setting up and running your first image curation
pipeline with NeMo Curator
categories:
* getting-started
tags:
* image-curation
* installation
* quickstart
* gpu-accelerated
* embedding
* classification
* tar-archives
personas:
* data-scientist-focused
* mle-focused
difficulty: beginner
content\_type: tutorial
modality: image-only
***
# Get Started with Image Curation
This guide provides step-by-step instructions for setting up NeMo Curator’s image curation capabilities. Follow these instructions to prepare your environment and execute your first image curation pipeline.
## Prerequisites
Ensure your environment meets the following prerequisites for NeMo Curator image curation modules:
* Python 3.10, 3.11, or 3.12
* packaging >= 22.0
* Ubuntu 22.04/20.04
* NVIDIA GPU (required for all image modules)
* Volta™ or higher (compute capability 7.0+)
* CUDA 12 (or above)
If `uv` is not installed, refer to the [Installation Guide](/admin/installation) for setup instructions, or install it quickly with:
```bash
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
source $HOME/.local/bin/env
```
***
## Installation Options
You can install NeMo Curator using one of the following methods:
Install the image modules from PyPI:
```bash
uv pip install "nemo-curator[image_cuda12]"
```
Install the latest version directly from GitHub using uv:
```bash
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
uv sync --extra image_cuda12
```
Activate the environment and run your code:
```bash
source .venv/bin/activate
python your_script.py
```
NeMo Curator is available as a standalone container:
```bash
# Pull the container
docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}
# Run the container
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
```
For details on container environments and configurations, see [Container Environments](/reference/infra/container-environments).
## Download Sample Configuration
NeMo Curator provides a working image curation example in the [Image Curation Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py). You can adapt this pipeline for your own datasets.
## Set Up Data Directory
Create directories to store your image datasets and models:
```bash
mkdir -p ~/nemo_curator/data/tar_archives
mkdir -p ~/nemo_curator/data/curated
mkdir -p ~/nemo_curator/models
```
For this example, you'll need:
* **Tar Archives**: JPEG images in `.tar` files (text and JSON files are ignored during loading)
* **Model Directory**: CLIP and classifier model weights (downloaded automatically on first run)
## Basic Image Curation Example
Here's a simple example to get started with NeMo Curator's image curation pipeline:
**CPU Memory Considerations**
Image loading and decoding happens in CPU memory before GPU processing. If you encounter out-of-memory errors during the `ImageReaderStage`, reduce:
* `batch_size`: Number of images per batch (reduce to 32-50 for systems with limited RAM)
* `num_threads`: Parallel decoding threads (reduce to 4 for systems with limited RAM)
* `num_cpus`: Ray Client CPU allocation (reduce to 8-16 for systems with limited RAM)
The example below uses conservative defaults suitable for most systems. For high-memory systems, you can increase these values for better performance.
To configure Ray with limited CPU resources:
```python
from nemo_curator.core.client import RayClient
ray_client = RayClient(num_cpus=8) # Adjust based on available CPU cores
ray_client.start()
```
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage
# Create image curation pipeline
pipeline = Pipeline(name="image_curation", description="Basic image curation with quality filtering")
# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
file_paths="~/nemo_curator/data/tar_archives", # Path to your tar archive directory
files_per_partition=1,
file_extensions=[".tar"],
))
# Stage 2: Read images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
batch_size=50,
verbose=True,
num_threads=4,
num_gpus_per_worker=0.25,
))
# Stage 3: Generate CLIP embeddings for images
pipeline.add_stage(ImageEmbeddingStage(
model_dir="~/nemo_curator/models", # Directory containing model weights
model_inference_batch_size=32,
num_gpus_per_worker=0.25,
remove_image_data=False,
verbose=True,
))
# Stage 4: Filter by aesthetic quality (keep images with score >= 0.5)
pipeline.add_stage(ImageAestheticFilterStage(
model_dir="~/nemo_curator/models",
score_threshold=0.5,
model_inference_batch_size=32,
num_gpus_per_worker=0.25,
verbose=True,
))
# Stage 5: Filter NSFW content (remove images with score >= 0.5)
pipeline.add_stage(ImageNSFWFilterStage(
model_dir="~/nemo_curator/models",
score_threshold=0.5,
model_inference_batch_size=32,
num_gpus_per_worker=0.25,
verbose=True,
))
# Stage 6: Save curated images to new tar archives
pipeline.add_stage(ImageWriterStage(
output_dir="~/nemo_curator/data/curated",
images_per_tar=1000,
remove_image_data=True,
verbose=True,
))
# Execute the pipeline
executor = XennaExecutor()
pipeline.run(executor)
```
## Expected Output
After running the pipeline, you'll have:
```text
~/nemo_curator/data/curated/
├── images-{hash}-000000.tar # Curated images (first shard)
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar # Curated images (second shard)
├── images-{hash}-000001.parquet # Metadata for corresponding tar
├── ... # Additional shards as needed
```
**Output Format Details:**
* **Tar Files**: Contain high-quality `.jpg` files that passed both aesthetic and NSFW filtering
* **Parquet Files**: Contain metadata for each corresponding tar file, including image paths, IDs, and processing scores
* **Naming Convention**: Files use hash-based prefixes (e.g., `images-a1b2c3d4e5f6-000000.tar`) for uniqueness across distributed processing
* **Scores**: Processing metadata includes `aesthetic_score` and `nsfw_score` stored in the Parquet files
## Alternative: Using the Complete Tutorial
For a more comprehensive example with data download and more configuration options, see:
```bash
# Download the complete tutorial
wget -O ~/nemo_curator/image_curation_example.py https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/tutorials/image/getting-started/image_curation_example.py
# Run with your data
python ~/nemo_curator/image_curation_example.py \
--input-wds-dataset-dir ~/nemo_curator/data/tar_archives \
--output-dataset-dir ~/nemo_curator/data/curated \
--model-dir ~/nemo_curator/models \
--aesthetic-threshold 0.5 \
--nsfw-threshold 0.5
```
## Next Steps
Explore the [Image Curation documentation](/curate-images) for more advanced processing techniques:
* **[Tar Archive Loading](/curate-images/load-data/tar-archives)** - Learn about loading JPEG images from tar files
* **[CLIP Embeddings](/curate-images/process-data/embeddings/clip-embedder)** - Understand embedding generation
* **[Quality Filtering](/curate-images/process-data/filters)** - Advanced aesthetic and NSFW filtering
* **[Complete Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)** - Full working example with data download