Deploy Image Curation on Slurm
This workflow covers the full image curation pipeline on Slurm, including model download, embedding generation, classification, filtering, and deduplication.
For details on image container environments and Slurm environment variables, see Container Environments.
Prerequisites
- Create required directories for AWS credentials, NeMo Curator configuration, and local workspace.
- Prepare configuration files:
  - AWS Credentials: $HOME/.aws/credentials (for S3 access)
  - NeMo Curator Configuration
  - Image Processing Configuration
  See configuration files in the repository.
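As a concrete sketch of this setup (the workspace and NeMo Curator config paths below are assumptions; substitute your own locations), the directories and a minimal AWS credentials file can be created with:

```shell
# Create the expected directories. WORKSPACE and the nemo_curator config
# path are assumed names -- adjust them to your cluster layout.
WORKSPACE="${WORKSPACE:-$HOME/nemo_curator_workspace}"
mkdir -p "$HOME/.aws" "$HOME/.config/nemo_curator" "$WORKSPACE"

# Minimal AWS credentials file for S3 access; the values are placeholders.
# Skipped if a credentials file already exists.
if [ ! -f "$HOME/.aws/credentials" ]; then
  cat > "$HOME/.aws/credentials" <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF
fi
```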
Model Download
- Copy the following script, which downloads all required image processing models, onto the Slurm cluster.
- Update the SBATCH parameters and paths to match your username and environment.
- Run the script.
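The fields below are the kind of SBATCH settings such a script typically exposes; every value shown is an example for illustration, not the script shipped in the repository:

```shell
#!/bin/bash
# Example SBATCH header -- the partition, account, time limit, and output
# path are placeholders to update for your cluster.
#SBATCH --job-name=curator-model-download
#SBATCH --partition=YOUR_PARTITION
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
```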
Image Processing Pipeline
The workflow consists of three main Slurm scripts, to be run in order:
- curator_image_embed.sh: Generates embeddings and applies classifications to images.
- curator_image_filter.sh: Filters images based on quality, aesthetic, and NSFW scores.
- curator_image_dedup.sh: Performs semantic deduplication using image embeddings.
- Update all # Update Me! sections in the scripts for your environment (paths, usernames, S3 buckets, etc.).
- Submit each job with sbatch.
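Since the three stages must run in order, one way to submit them is to chain the jobs with Slurm dependencies (the --dependency chaining is optional; plain sequential sbatch calls work too if you prefer to watch each stage finish before submitting the next):

```shell
# Submit each stage so it starts only after the previous one succeeds.
# --parsable makes sbatch print just the job ID, so it can be captured.
embed_id=$(sbatch --parsable curator_image_embed.sh)
filter_id=$(sbatch --parsable --dependency=afterok:"$embed_id" curator_image_filter.sh)
sbatch --dependency=afterok:"$filter_id" curator_image_dedup.sh
```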
Monitoring and Logs
- Check job status.
- View logs.
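These are the standard Slurm commands for both tasks; note the log filename below assumes Slurm's default output pattern, which your scripts may override with an --output directive:

```shell
# All of your queued and running jobs.
squeue -u "$USER"

# State and elapsed time of a specific job (also works after it finishes).
sacct -j <jobid> --format=JobID,JobName,State,Elapsed

# Follow a job's log as it runs; Slurm writes to slurm-<jobid>.out by default.
tail -f slurm-<jobid>.out
```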
Performance Considerations
- GPU Memory: Image processing requires significant GPU memory. Consider using nodes with high-memory GPUs (40GB+ VRAM) for large batch sizes.
- Tar Archive Format: Ensure your input data is in tar archive format (.tar files containing JPEG images).
- Network I/O: Image data can be large. Consider local caching or high-bandwidth storage for better performance.
- Clustering Scale: For datasets with millions of images, increase n_clusters to 50,000+ to improve deduplication performance.