Container Environments

Deploy NeMo Curator in containerized environments for reproducible, scalable data curation pipelines with pre-configured dependencies and optimized runtime settings.

Overview

NeMo Curator provides official Docker containers with all dependencies pre-installed and optimized for production workloads. Containers offer:

  • Reproducible Environments: Consistent software stack across development, testing, and production
  • Simplified Deployment: No manual dependency installation or environment configuration
  • GPU Acceleration: Pre-configured CUDA, cuDNN, and NVIDIA libraries for optimal performance
  • Multi-Modal Support: Built-in support for text, image, video, and audio curation
  • Cloud-Ready: Compatible with Kubernetes, Docker Swarm, and cloud container orchestrators

When to use containers:

  • Production deployments requiring consistency and reliability
  • Multi-node cluster processing with identical environments
  • CI/CD pipelines for automated data curation workflows
  • Quick prototyping without local environment setup
  • GPU-accelerated processing in cloud environments

Available Containers

Main NeMo Curator Container

The primary container includes comprehensive support for all curation modalities:

Container registry: nvcr.io/nvidia/nemo-curator:{{ container_version }}

Supported modalities:

  • ✅ Text curation (CPU/GPU)
  • ✅ Image curation (GPU required)
  • ✅ Video curation (GPU required, FFmpeg included)
  • ✅ Audio curation (GPU required for ASR)
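The GPU requirements listed above determine how the container should be started. The following is a minimal Python sketch, assuming Docker with the NVIDIA Container Toolkit is available on the host. The function name `docker_run_args` and the `GPU_REQUIRED` mapping are illustrative and not part of NeMo Curator, and the image tag is left as a placeholder because this page does not state a concrete version. The sketch only builds the argument list; nothing is executed.

```python
IMAGE_REPO = "nvcr.io/nvidia/nemo-curator"  # registry path from this page

# GPU requirements as listed above. Text can run on CPU or GPU;
# audio is marked True because ASR needs a GPU.
GPU_REQUIRED = {"text": False, "image": True, "video": True, "audio": True}


def docker_run_args(modality: str, tag: str, use_gpu: bool = False) -> list[str]:
    """Build (but do not execute) a `docker run` argument list.

    `use_gpu` lets CPU-capable modalities (text) opt in to GPU workers.
    """
    if modality not in GPU_REQUIRED:
        raise ValueError(f"unknown modality: {modality!r}")
    args = ["docker", "run", "--rm", "-it"]
    if GPU_REQUIRED[modality] or use_gpu:
        # `--gpus all` exposes every host GPU to the container.
        args += ["--gpus", "all"]
    args.append(f"{IMAGE_REPO}:{tag}")
    return args


if __name__ == "__main__":
    tag = "<container_version>"  # placeholder: substitute a real release tag
    print(" ".join(docker_run_args("video", tag)))
```

To run the command, pass the returned list to `subprocess.run(args, check=True)`. Keeping the construction separate from execution makes the GPU decision easy to inspect and test.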

Pre-installed components:

  • NeMo Curator with all optional dependencies ([all] extras)
  • CUDA 12.8.1 with cuDNN
  • Python 3.12 with uv package manager
  • FFmpeg 8+ with NVENC support (for video processing)
  • Ray, Dask, and other distributed computing frameworks
  • NVIDIA-optimized Python packages

Curator Environment

| Property | Value |
| --- | --- |
| Python Version | 3.12 |
| CUDA Version | 12.8.1 (configurable) |
| Operating System | Ubuntu 24.04 (configurable) |
| Base Image | nvidia/cuda:${CUDA_VER}-cudnn-devel-${LINUX_VER} |
| Package Manager | uv (Ultrafast Python package installer) |
| Installation | NeMo Curator installed with all optional dependencies ([all] extras) using uv with NVIDIA index |
| Environment Path | Virtual environment at /opt/venv. Activate with source /opt/venv/env.sh after entering the container. |
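Because the virtual environment is activated by sourcing /opt/venv/env.sh, a non-interactive command needs to run inside a shell that sources that script first. Here is a small sketch of one way to wrap a command, assuming `bash` is available in the container; the helper name `in_venv` is hypothetical.

```python
import shlex

VENV_ENV_SCRIPT = "/opt/venv/env.sh"  # activation script path from this page


def in_venv(command: list[str]) -> list[str]:
    """Wrap `command` so bash sources the venv script before running it."""
    inner = f"source {shlex.quote(VENV_ENV_SCRIPT)} && {shlex.join(command)}"
    return ["bash", "-c", inner]


if __name__ == "__main__":
    # Arguments that could be appended after the image name in `docker run`.
    print(in_venv(["python", "--version"]))
```

`shlex.join` quotes each argument, so commands containing spaces or shell metacharacters survive the trip through `bash -c` unchanged.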

Container Build Arguments

The main container accepts these build-time arguments for environment customization:

| Argument | Default | Description |
| --- | --- | --- |
| CUDA_VER | 12.8.1 | CUDA version |
| LINUX_VER | ubuntu24.04 | Base OS version |
| CURATOR_ENV | ci | Curator environment type |
| NVIDIA_BUILD_ID | <unknown> | NVIDIA build identifier |
| NVIDIA_BUILD_REF | - | NVIDIA build reference |
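As an illustration of how these arguments combine, the sketch below resolves the base image name from `CUDA_VER` and `LINUX_VER` and assembles a `docker build` argument list with `--build-arg` overrides. It rests on some assumptions: the Dockerfile path and output tag are supplied by the caller because this page does not name them, the helper names are hypothetical, and the two NVIDIA build metadata arguments are left out for brevity.

```python
from string import Template

# Defaults copied from the build-argument table.
DEFAULT_BUILD_ARGS = {
    "CUDA_VER": "12.8.1",
    "LINUX_VER": "ubuntu24.04",
    "CURATOR_ENV": "ci",
}

# Base image pattern from the Curator Environment table.
BASE_IMAGE = Template("nvidia/cuda:${CUDA_VER}-cudnn-devel-${LINUX_VER}")


def base_image(**overrides: str) -> str:
    """Resolve the base image name for the given build-argument overrides."""
    return BASE_IMAGE.substitute({**DEFAULT_BUILD_ARGS, **overrides})


def docker_build_args(dockerfile: str, image_tag: str,
                      context: str = ".", **overrides: str) -> list[str]:
    """Build (but do not execute) a `docker build` argument list."""
    unknown = set(overrides) - set(DEFAULT_BUILD_ARGS)
    if unknown:
        raise ValueError(f"unsupported build arguments: {sorted(unknown)}")
    args = ["docker", "build", "-f", dockerfile, "-t", image_tag]
    for name, value in {**DEFAULT_BUILD_ARGS, **overrides}.items():
        args += ["--build-arg", f"{name}={value}"]
    args.append(context)  # the build context must come last
    return args


if __name__ == "__main__":
    print(base_image())
    # "path/to/Dockerfile" and "my-curator:dev" are placeholders.
    print(" ".join(docker_build_args("path/to/Dockerfile", "my-curator:dev")))
```

Rejecting unknown argument names catches typos early. Docker itself only warns when a `--build-arg` is not consumed by the Dockerfile.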

Environment Usage Examples

Text Curation

Uses the default container environment with CPU or GPU workers depending on the module.

Image Curation

Requires GPU-enabled workers in the container environment.