This workflow covers the full image curation pipeline on Slurm, including model download, embedding generation, classification, filtering, and deduplication.
For details on image container environments and Slurm environment variables, see Container Environments.
Create required directories for AWS credentials, NeMo Curator configuration, and local workspace:
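A minimal sketch of this step. Only $HOME/.aws is named in this guide; the NeMo Curator configuration directory and the workspace directory shown here are assumed names, so substitute whatever paths your scripts expect.

```shell
# Create the directories used later in this workflow.
# -p makes each command safe to re-run if the directory already exists.
mkdir -p "$HOME/.aws"                          # AWS credentials (named in this guide)
mkdir -p "$HOME/.config/nemo_curator"          # assumed NeMo Curator config location
mkdir -p "$HOME/nemo_curator_local_workspace"  # assumed local workspace name

# Keep the credentials directory private to your user.
chmod 700 "$HOME/.aws"
```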
Prepare configuration files:
$HOME/.aws/credentials (for S3 access)
See configuration files in the repository.
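As an illustration, the credentials file can be seeded with a placeholder template like the one below. The [default] profile and the two key names follow the standard AWS shared credentials format; the values are placeholders to replace with your own keys, and the guard against overwriting an existing file is a precaution added for this sketch.

```shell
cred_file="$HOME/.aws/credentials"
mkdir -p "$(dirname "$cred_file")"

# Never overwrite credentials that are already in place.
if [ ! -e "$cred_file" ]; then
  # The subshell keeps the restrictive umask from leaking into your session,
  # so the file is created readable by your user only.
  (
    umask 077
    cat > "$cred_file" <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF
  )
fi
```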
Copy the following script, which downloads all required image processing models, to the Slurm cluster.
Update the SBATCH parameters and paths to match your username and environment.
Run the script.
The workflow consists of three main Slurm scripts, to be run in order:
1. curator_image_embed.sh: Generates embeddings and applies classifications to images.
2. curator_image_filter.sh: Filters images based on quality, aesthetic, and NSFW scores.
3. curator_image_dedup.sh: Performs semantic deduplication using image embeddings.
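Because the scripts must run in order, one option is to chain them with Slurm job dependencies so that each stage starts only after the previous one succeeds. The sketch below assumes the scripts have already been edited for your environment. The submit_chain helper and the SBATCH_BIN override are hypothetical names introduced here for illustration; --parsable and --dependency=afterok are standard sbatch options.

```shell
# Submit Slurm scripts in the order given; each job waits for the previous
# one to finish successfully (afterok) before it starts.
# SBATCH_BIN is a hypothetical override, useful for dry runs.
submit_chain() {
  local sbatch_bin="${SBATCH_BIN:-sbatch}"
  local prev_id=""
  local job_id script
  for script in "$@"; do
    if [ -z "$prev_id" ]; then
      job_id="$("$sbatch_bin" --parsable "$script")" || return 1
    else
      job_id="$("$sbatch_bin" --parsable --dependency="afterok:${prev_id}" "$script")" || return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job id.
    job_id="${job_id%%;*}"
    echo "Submitted ${script} as job ${job_id}"
    prev_id="$job_id"
  done
}

# Usage (on a Slurm login node):
# submit_chain curator_image_embed.sh curator_image_filter.sh curator_image_dedup.sh
```

If a stage fails, Slurm does not start the jobs that depend on it, which avoids filtering or deduplicating incomplete embeddings.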
Update all # Update Me! sections in the scripts for your environment (paths, usernames, S3 buckets, etc.).
Submit each script with sbatch:
Check job status:
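One way to check on your jobs is squeue filtered to your own user. The check_jobs wrapper is a hypothetical helper added for this sketch; it only guards against running the command on a machine without Slurm installed.

```shell
# List your own pending and running jobs.
# Returns non-zero with a message when squeue is not available.
check_jobs() {
  if ! command -v squeue >/dev/null 2>&1; then
    echo "squeue not found: run this on a Slurm login node" >&2
    return 1
  fi
  squeue -u "$(id -un)"
}

# Usage:
# check_jobs
```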
View logs:
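The sketch below finds the most recently modified log file in a directory. It assumes Slurm's default output naming, slurm-<jobid>.out; if your scripts set a custom output path in their SBATCH parameters, adjust the pattern to match. The latest_slurm_log helper and the log directory in the usage line are hypothetical.

```shell
# Print the path of the newest Slurm log in a directory (default: current dir).
# Prints nothing when no matching log exists.
latest_slurm_log() {
  local log_dir="${1:-.}"
  # ls -t sorts by modification time, newest first.
  ls -t "$log_dir"/slurm-*.out 2>/dev/null | head -n 1
}

# Usage: follow the newest log while the job is writing to it.
# tail -f "$(latest_slurm_log "$HOME/logs")"
```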
Input data: .tar files containing JPEG images.
Set n_clusters to 50,000+ to improve deduplication performance.