Get Started with Image Curation
This guide provides step-by-step instructions for setting up NeMo Curator’s image curation capabilities. Follow these instructions to prepare your environment and execute your first image curation pipeline.
Prerequisites
Ensure your environment meets the following prerequisites for NeMo Curator image curation modules:
- Python 3.10, 3.11, or 3.12
- packaging >= 22.0
- Ubuntu 22.04/20.04
- NVIDIA GPU (required for all image modules)
- Volta™ or higher (compute capability 7.0+)
- CUDA 12 (or above)
If uv is not installed, refer to the Installation Guide for setup instructions, or install it quickly with:
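For example, you can install uv with its official standalone installer, or with pip inside an existing Python environment:

```shell
# Official uv standalone installer:
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or, inside an existing Python environment:
pip install uv
```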
Installation Options
You can install NeMo Curator using one of the following methods:
PyPI Installation
Source Installation
Docker Container
Install the image modules from PyPI:
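A typical install command looks like the following; the `image` extra name is an assumption and may differ across NeMo Curator versions, so verify it against the Installation Guide:

```shell
# Install NeMo Curator with the image-curation extras
# (confirm the exact extra name for your version):
pip install "nemo-curator[image]"

# Or, using uv:
uv pip install "nemo-curator[image]"
```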
Download Sample Configuration
NeMo Curator provides a working image curation example in the Image Curation Tutorial. You can adapt this pipeline for your own datasets.
Set Up Data Directory
Create directories to store your image datasets and models:
For this example, you’ll need:
- Tar Archives: JPEG images in `.tar` files (text and JSON files are ignored during loading)
- Model Directory: CLIP and classifier model weights (downloaded automatically on first run)
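A minimal layout might look like the following; the paths under `~/image_curation` are placeholders, not names the tutorial requires:

```shell
# Placeholder directory layout -- adjust paths to your environment.
mkdir -p ~/image_curation/tar_data   # input .tar archives of JPEG images
mkdir -p ~/image_curation/models     # CLIP and classifier weights cache
mkdir -p ~/image_curation/output     # curated tars + Parquet metadata
```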
Basic Image Curation Example
Here’s a simple example to get started with NeMo Curator’s image curation pipeline:
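The sketch below outlines the shape of such a pipeline. Only `ImageReaderStage` and its `batch_size`/`num_threads` parameters are named in this guide; the import paths, pipeline wrapper, and filtering-stage names are placeholders — match them to the Image Curation Tutorial for your installed NeMo Curator version.

```python
# Sketch only -- stage and module names below are assumptions, not the
# verified NeMo Curator API. Consult the Image Curation Tutorial for the
# exact imports in your version.
from nemo_curator.pipeline import Pipeline          # assumed module path
from nemo_curator.stages.image import (             # assumed module path
    ImageReaderStage,                               # named in this guide
    AestheticFilterStage,                           # placeholder name
    NSFWFilterStage,                                # placeholder name
)

pipeline = Pipeline(
    stages=[
        # Read JPEGs from .tar archives; conservative CPU-memory defaults.
        ImageReaderStage(input_dir="~/image_curation/tar_data",
                         batch_size=32, num_threads=4),
        # Score and filter on aesthetics and NSFW content.
        AestheticFilterStage(),
        NSFWFilterStage(),
    ],
)
pipeline.run(output_dir="~/image_curation/output")
```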
CPU Memory Considerations
Image loading and decoding happens in CPU memory before GPU processing. If you encounter out-of-memory errors during the ImageReaderStage, reduce:
- batch_size: Number of images per batch (reduce to 32-50 for systems with limited RAM)
- num_threads: Parallel decoding threads (reduce to 4 for systems with limited RAM)
- num_cpus: Ray client CPU allocation (reduce to 8-16 for systems with limited RAM)
The example below uses conservative defaults suitable for most systems. For high-memory systems, you can increase these values for better performance.
To configure Ray with limited CPU resources:
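One way to cap the CPU allocation is to initialize Ray explicitly before building the pipeline; `ray.init(num_cpus=...)` is standard Ray API, though whether NeMo Curator picks up a pre-initialized Ray session may depend on your version:

```python
import ray

# Cap Ray at 8 CPUs -- the num_cpus knob referenced above.
# Call this before constructing or running the pipeline.
ray.init(num_cpus=8)
```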
Expected Output
After running the pipeline, your output directory will contain tar archives of the curated images alongside matching Parquet metadata files.
Output Format Details:
- Tar Files: Contain high-quality `.jpg` files that passed both aesthetic and NSFW filtering
- Parquet Files: Contain metadata for each corresponding tar file, including image paths, IDs, and processing scores
- Naming Convention: Files use hash-based prefixes (e.g., `images-a1b2c3d4e5f6-000000.tar`) for uniqueness across distributed processing
- Scores: Processing metadata includes `aesthetic_score` and `nsfw_score`, stored in the Parquet files
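To see how the score columns can be used downstream, here is a small pandas sketch. The `aesthetic_score` and `nsfw_score` column names come from the output description above; the sample rows and the 0.5 thresholds are made up for illustration (in practice you would load a real shard with `pd.read_parquet`):

```python
import pandas as pd

# Toy stand-in for one output Parquet shard's metadata.
meta = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "aesthetic_score": [0.82, 0.41, 0.67],
    "nsfw_score": [0.01, 0.03, 0.92],
})

# Keep images that are both visually appealing and safe
# (thresholds here are illustrative, not NeMo Curator defaults).
keep = meta[(meta["aesthetic_score"] > 0.5) & (meta["nsfw_score"] < 0.5)]
print(keep["id"].tolist())  # -> ['a1']
```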
Alternative: Using the Complete Tutorial
For a more comprehensive example with data download and more configuration options, see:
Next Steps
Explore the Image Curation documentation for more advanced processing techniques:
- Tar Archive Loading - Learn about loading JPEG images from tar files
- CLIP Embeddings - Understand embedding generation
- Quality Filtering - Advanced aesthetic and NSFW filtering
- Complete Tutorial - Full working example with data download