Installation Guide

This guide covers installing NeMo Curator with support for all modalities and verifying your installation is working correctly.

Before You Start

System Requirements

For comprehensive system requirements and production deployment specifications, refer to Production Deployment Requirements.

Quick Start Requirements:

  • OS: Ubuntu 24.04/22.04/20.04 (recommended)
  • Python: 3.10, 3.11, or 3.12
  • Memory: 16GB+ RAM for basic text processing
  • GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
  • CUDA 12 (required for audio_cuda12, video_cuda12, image_cuda12, and text_cuda12 extras)
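
The interpreter and OS requirements above can be checked before installing anything. The following is a minimal sketch, not part of NeMo Curator: the helper name `check_quick_start` is hypothetical, and it only covers the Python version and the operating system family (it does not inspect RAM, VRAM, or the CUDA version).

```python
import platform
import sys

# Supported interpreter versions, taken from the requirements list above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}


def check_quick_start(python_version, system):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration only.
    """
    problems = []
    major, minor = tuple(python_version[:2])
    if (major, minor) not in SUPPORTED_PYTHON:
        problems.append(f"Python {major}.{minor} is not one of 3.10, 3.11, 3.12")
    if system != "Linux":
        problems.append(f"{system} detected; Ubuntu is the recommended OS")
    return problems


if __name__ == "__main__":
    found = check_quick_start(sys.version_info, platform.system())
    for problem in found:
        print(f"⚠ {problem}")
    if not found:
        print("✓ Python version and OS family match the quick start requirements")
```

Returning a list of problems rather than raising on the first one lets you see every mismatch in a single run.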

Development vs Production

| Use Case | Requirements | See |
| --- | --- | --- |
| Local Development | Minimum specs listed above | Continue below |
| Production Clusters | Detailed hardware, network, storage specs | Deployment Requirements |
| Multi-node Setup | Advanced infrastructure planning | Deployment Options |

Installation Methods

Choose one of the following installation methods based on your needs:

Docker is the recommended installation method for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the Container Installation tab below.

Install NeMo Curator from the Python Package Index using uv for proper dependency resolution.

  1. Install uv:

    $ curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
    $ source $HOME/.local/bin/env
  2. Create and activate a virtual environment:

    $ uv venv
    $ source .venv/bin/activate
  3. Install NeMo Curator:

    $ uv pip install torch wheel_stub psutil setuptools setuptools_scm
    $ echo "transformers==4.55.2" > override.txt
    $ uv pip install --no-build-isolation "nemo-curator[all]" --override override.txt
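
After the install finishes, it can be useful to confirm that the pin written to override.txt took effect. The sketch below uses only the standard library's `importlib.metadata`. The helper names `parse_pin` and `pin_is_satisfied` are hypothetical, and the sketch assumes the override line has the simple `name==version` form used above.

```python
from importlib import metadata


def parse_pin(line):
    """Split a 'name==version' override line into (name, version)."""
    name, _, version = line.strip().partition("==")
    return name, version


def pin_is_satisfied(line, installed_version):
    """Return True when the installed version equals the pinned one."""
    _, pinned = parse_pin(line)
    return installed_version == pinned


if __name__ == "__main__":
    pin = "transformers==4.55.2"  # same line that step 3 writes to override.txt
    name, _ = parse_pin(pin)
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"⚠ {name} is not installed in this environment")
    else:
        status = "✓" if pin_is_satisfied(pin, installed) else "⚠"
        print(f"{status} {name} {installed} (pinned: {pin})")
```

Run it inside the activated virtual environment so that it inspects the same packages that uv installed.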

Install FFmpeg and Encoders (Required for Video)

Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.

Use the maintained script in the repository to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script enables --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.

$ curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
$ chmod +x install_ffmpeg.sh
$ sudo bash install_ffmpeg.sh

FFmpeg build requires CUDA toolkit (nvcc): If you encounter ERROR: failed checking for nvcc during FFmpeg installation, ensure that the CUDA toolkit is installed and nvcc is available on your PATH. You can verify with nvcc --version. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
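
To check the result of the FFmpeg build, you can look for the two encoders mentioned above in the output of `ffmpeg -encoders`. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and the parser assumes each encoder row has the form `<flags> <name> <description>`, which may differ between FFmpeg versions.

```python
import shutil
import subprocess

# Encoders named in this section.
WANTED_ENCODERS = ("libopenh264", "h264_nvenc")


def encoder_names(encoders_output):
    """Collect the second whitespace-separated token of every line.

    Assumes rows look like '<flags> <name> <description>'; header lines
    contribute harmless extra tokens that never match an encoder name.
    """
    names = set()
    for line in encoders_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_encoders(encoders_output, wanted=WANTED_ENCODERS):
    """Return the wanted encoders that do not appear in the output."""
    found = encoder_names(encoders_output)
    return [name for name in wanted if name not in found]


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("⚠ nvcc not found on PATH (the FFmpeg build script checks for it)")
    if shutil.which("ffmpeg") is None:
        print("⚠ ffmpeg not found on PATH")
    else:
        result = subprocess.run(
            ["ffmpeg", "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=False,
        )
        missing = missing_encoders(result.stdout)
        for name in missing:
            print(f"⚠ encoder not available: {name}")
        if not missing:
            print("✓ libopenh264 and h264_nvenc are available")
```

A listed encoder only shows that FFmpeg was compiled with it. `h264_nvenc` still needs a working NVIDIA driver and GPU at run time.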


Package Extras

NeMo Curator provides several installation extras to install only the components you need:

| Extra | Installation Command | Description |
| --- | --- | --- |
| text_cpu | uv pip install nemo-curator[text_cpu] | CPU-only text processing and filtering |
| text_cuda12 | uv pip install nemo-curator[text_cuda12] | GPU-accelerated text processing with RAPIDS |
| audio_cpu | uv pip install nemo-curator[audio_cpu] | CPU-only audio curation with NeMo Toolkit ASR |
| audio_cuda12 | uv pip install nemo-curator[audio_cuda12] | GPU-accelerated audio curation. When using uv, requires transformers==4.55.2 override. |
| image_cpu | uv pip install nemo-curator[image_cpu] | CPU-only image processing |
| image_cuda12 | uv pip install nemo-curator[image_cuda12] | GPU-accelerated image processing with NVIDIA DALI |
| video_cpu | uv pip install nemo-curator[video_cpu] | CPU-only video processing |
| video_cuda12 | uv pip install --no-build-isolation nemo-curator[video_cuda12] | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using uv. |
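
The special cases in the table (the `--no-build-isolation` flag for video_cuda12 and the transformers override for audio_cuda12) are easy to forget when scripting installs. The sketch below shows one way to encode them. The helper `install_command` is hypothetical, and passing the override through `--override override.txt` is an assumption carried over from the `nemo-curator[all]` install command earlier in this guide.

```python
import shlex

# Extras listed in the table above.
EXTRAS = {
    "text_cpu", "text_cuda12", "audio_cpu", "audio_cuda12",
    "image_cpu", "image_cuda12", "video_cpu", "video_cuda12",
}
# Special handling noted in the table's command and description columns.
NEEDS_NO_BUILD_ISOLATION = {"video_cuda12"}
NEEDS_TRANSFORMERS_OVERRIDE = {"audio_cuda12"}


def install_command(extra, override_file="override.txt"):
    """Build the uv argument list for one extra (hypothetical helper)."""
    if extra not in EXTRAS:
        raise ValueError(f"unknown extra: {extra}")
    cmd = ["uv", "pip", "install"]
    if extra in NEEDS_NO_BUILD_ISOLATION:
        cmd.append("--no-build-isolation")
    cmd.append(f"nemo-curator[{extra}]")
    if extra in NEEDS_TRANSFORMERS_OVERRIDE:
        # Assumes override_file contains the line: transformers==4.55.2
        cmd += ["--override", override_file]
    return cmd


if __name__ == "__main__":
    for extra in sorted(EXTRAS):
        # shlex.join quotes the bracketed requirement so it is safe to paste.
        print(shlex.join(install_command(extra)))
```

Building an argument list instead of a string avoids shell globbing on the square brackets when the command is passed to `subprocess.run`.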

Development Dependencies: For development tools (pre-commit, ruff, pytest), use uv sync --group dev --group linting --group test instead of pip extras. Development dependencies are managed as dependency groups, not optional dependencies.


Installation Verification

After installation, verify that NeMo Curator is working correctly:

1. Basic Import Test

# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")

# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")

2. GPU Availability Check

If you installed GPU support, verify GPU access:

# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
        print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠ No GPU detected")

    # Check cuDF for GPU deduplication
    import cudf
    print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
    print(f"⚠ Some GPU modules not available: {e}")

3. Run a Quickstart Tutorial

Try a modality-specific quickstart to see NeMo Curator in action: