Setup & Deployment

Deploy Image Curation on Slurm

This workflow covers the full image curation pipeline on Slurm, including model download, embedding generation, classification, filtering, and deduplication.

For details on image container environments and Slurm environment variables, see Container Environments.

Prerequisites

  • Create required directories for AWS credentials, NeMo Curator configuration, and local workspace:

    $mkdir -p $HOME/.aws
    $mkdir -p $HOME/.config/nemo_curator
    $mkdir -p $HOME/nemo_curator_local_workspace

  • Prepare configuration files:

    $HOME/.aws/credentials (for S3 access)
    $HOME/.config/nemo_curator/config.yaml (NeMo Curator configuration, mounted into the container)

    See the example configuration files in the repository.
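For reference, $HOME/.aws/credentials follows the standard AWS shared-credentials INI format. A minimal example with placeholder values:

```ini
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
```

The pipeline scripts mount this file into the container so that S3 reads and writes authenticate as this profile.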


Model Download

  1. Copy the following script, which downloads all required image processing models, to the Slurm cluster.

    $#!/bin/bash
    $
    $#SBATCH --job-name=download_image_models
    $#SBATCH -p defq
    $#SBATCH --nodes=1
    $#SBATCH --ntasks-per-node=1
    $#SBATCH --exclusive
    $#SBATCH --gres=gpu:1
    $
    $# Update Me!
    $#SBATCH --output=/home/<username>/logs/%x_%j.log
    $USER_DIR="/home/${USER}"
    $CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
    $#
    $
    $LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
    $LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
    $NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
    $CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT}"
    $
    $export NEMO_CURATOR_RAY_SLURM_JOB=1
    $export NEMO_CURATOR_LOCAL_DOCKER_JOB=1
    $
    $# Download Image Processing Models
    $srun \
    > --mpi=none \
    > --container-writable \
    > --no-container-remap-root \
    > --export=NEMO_CURATOR_RAY_SLURM_JOB,NEMO_CURATOR_LOCAL_DOCKER_JOB \
    > --container-image "${CONTAINER_IMAGE}" \
    > --container-mounts "${CONTAINER_MOUNTS}" \
    > -- python3 -c "
    >import timm
    >from nemo_curator.image.embedders import TimmImageEmbedder
    >from nemo_curator.image.classifiers import AestheticClassifier, NsfwClassifier
    >
    ># Download and cache CLIP model
    >embedder = TimmImageEmbedder('vit_large_patch14_clip_quickgelu_224.openai', pretrained=True)
    >
    ># Download aesthetic and NSFW classifiers
    >aesthetic = AestheticClassifier()
    >nsfw = NsfwClassifier()
    >
    >print('Image models downloaded successfully')
    >"
  2. Update the SBATCH parameters and paths to match your username and environment.

  3. Run the script.

    $sbatch 1_curator_download_image_models.sh

Image Processing Pipeline

The workflow consists of three main Slurm scripts, to be run in order:

  1. curator_image_embed.sh: Generates embeddings and applies classifications to images.
  2. curator_image_filter.sh: Filters images based on quality, aesthetic, and NSFW scores.
  3. curator_image_dedup.sh: Performs semantic deduplication using image embeddings.

curator_image_embed.sh - Generates embeddings and applies classifications to images.

$#!/bin/bash
$
$#SBATCH --job-name=image-embed
$#SBATCH -p defq
$#SBATCH --nodes=4
$#SBATCH --ntasks-per-node=1
$#SBATCH --exclusive
$#SBATCH --gres=gpu:4
$#SBATCH --time=08:00:00
$
$# Update Me!
$#SBATCH --output=/home/<username>/logs/%x_%j.log
$#SBATCH --error=/home/<username>/logs/%x_%j.log
$USER_DIR="/home/${USER}"
$CONTAINER_IMAGE="${USER_DIR}/path-to/curator.sqsh"
$INPUT_TAR_PATH="s3://your-bucket/raw-images/{00000..00999}.tar"
$OUTPUT_DATA_PATH="s3://your-bucket/embedded-images/"
$#
$
$LOCAL_WORKSPACE="${USER_DIR}/nemo_curator_local_workspace"
$LOCAL_WORKSPACE_MOUNT="${LOCAL_WORKSPACE}:/config"
$NEMO_CONFIG_MOUNT="${HOME}/.config/nemo_curator/config.yaml:/nemo_curator/config/nemo_curator.yaml"
$AWS_MOUNT="${HOME}/.aws:/root/.aws"
$CONTAINER_MOUNTS="${LOCAL_WORKSPACE_MOUNT},${NEMO_CONFIG_MOUNT},${AWS_MOUNT}"
$
$export NEMO_CURATOR_RAY_SLURM_JOB=1
$
$srun \
> --mpi=none \
> --container-writable \
> --no-container-remap-root \
> --export=NEMO_CURATOR_RAY_SLURM_JOB \
> --container-image "${CONTAINER_IMAGE}" \
> --container-mounts "${CONTAINER_MOUNTS}" \
> -- python3 -c "
>from nemo_curator.datasets import ImageTextPairDataset
>from nemo_curator.image.embedders import TimmImageEmbedder
>from nemo_curator.image.classifiers import AestheticClassifier, NsfwClassifier
>from nemo_curator.utils.distributed_utils import get_client
>
># Initialize distributed client
>client = get_client(cluster_type='gpu')
>
># Load dataset
>dataset = ImageTextPairDataset.from_webdataset('${INPUT_TAR_PATH}', id_col='key')
>
># Generate embeddings
>embedder = TimmImageEmbedder(
> 'vit_large_patch14_clip_quickgelu_224.openai',
> pretrained=True,
> batch_size=1024,
> num_threads_per_worker=16,
> normalize_embeddings=True,
> autocast=False
>)
>dataset = embedder(dataset)
>
># Apply aesthetic classification
>aesthetic_classifier = AestheticClassifier()
>dataset = aesthetic_classifier(dataset)
>
># Apply NSFW classification
>nsfw_classifier = NsfwClassifier()
>dataset = nsfw_classifier(dataset)
>
># Save results
>dataset.to_webdataset('${OUTPUT_DATA_PATH}')
>client.close()
>"
  1. Update all # Update Me! sections in the scripts for your environment (paths, usernames, S3 buckets, etc.).
  2. Submit each job with sbatch, in order:

    $sbatch curator_image_embed.sh
    $sbatch curator_image_filter.sh
    $sbatch curator_image_dedup.sh
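The filtering step is driven by score thresholds on the metadata produced during embedding and classification. As a standalone sketch of that selection logic using pandas (the column names aesthetic_score and nsfw_score and both threshold values are illustrative assumptions, not the pipeline's defaults):

```python
import pandas as pd

# Toy metadata resembling what the embedding/classification step produces.
# Column names and threshold values are illustrative assumptions.
metadata = pd.DataFrame({
    "key": ["img_000", "img_001", "img_002", "img_003"],
    "aesthetic_score": [0.72, 0.31, 0.88, 0.55],
    "nsfw_score": [0.02, 0.01, 0.64, 0.05],
})

AESTHETIC_MIN = 0.5   # keep images scoring at least this aesthetic value
NSFW_MAX = 0.2        # drop images scoring above this NSFW probability

keep = (metadata["aesthetic_score"] >= AESTHETIC_MIN) & (metadata["nsfw_score"] <= NSFW_MAX)
filtered = metadata[keep]
print(filtered["key"].tolist())  # → ['img_000', 'img_003']
```

In the real pipeline a predicate of this shape is applied to the dataset's metadata before the surviving records are written back out as WebDataset shards.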

Monitoring and Logs

  1. Check job status:

    $squeue -u $USER
  2. View logs:

    $tail -f /path/to/logs/<jobname>_<jobid>.log

Performance Considerations

  • GPU Memory: Image processing requires significant GPU memory. Consider using nodes with high-memory GPUs (40GB+ VRAM) for large batch sizes.
  • Tar Archive Format: Ensure your input data is in tar archive format (.tar files containing JPEG images).
  • Network I/O: Image data can be large. Consider local caching or high-bandwidth storage for better performance.
  • Clustering Scale: For datasets with millions of images, increase n_clusters to 50,000+ to improve deduplication performance.
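To make the deduplication step concrete: the embeddings are L2-normalized (note normalize_embeddings=True in the embedder), so cosine similarity reduces to a dot product, and items whose similarity exceeds a threshold are treated as semantic duplicates. A self-contained numpy sketch of that idea (the 0.98 threshold and the greedy keep-first policy are illustrative assumptions, not NeMo Curator's implementation, which clusters first precisely to avoid this all-pairs comparison):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: the fourth row is a near-duplicate of the first by construction.
base = rng.normal(size=(3, 8))
emb = np.vstack([base, base[0] + 0.001 * rng.normal(size=8)])

# L2-normalize so the dot product equals cosine similarity.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T

THRESHOLD = 0.98  # illustrative; real pipelines tune this per dataset
keep = np.ones(len(emb), dtype=bool)
for i in range(len(emb)):
    if not keep[i]:
        continue
    # Drop any later item that is nearly identical to item i.
    for j in range(i + 1, len(emb)):
        if sim[i, j] > THRESHOLD:
            keep[j] = False

print(keep)
```

Clustering first (the n_clusters setting above) bounds this comparison to items within the same cluster, which is why larger datasets need more clusters to keep each cluster's pairwise work tractable.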