Installation Guide#

This guide covers installing NeMo Curator with support for all modalities and verifying that the installation works correctly.

Before You Start#

System Requirements#

For comprehensive system requirements and production deployment specifications, refer to Production Deployment Requirements.

Quick Start Requirements:

  • OS: Ubuntu 24.04/22.04/20.04 (recommended)

  • Python: 3.10, 3.11, or 3.12

  • Memory: 16GB+ RAM for basic text processing

  • GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration

  • CUDA 12 (required for audio_cuda12, video_cuda12, image_cuda12, and text_cuda12 extras)
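As a rough pre-flight check of the quick start requirements above, the following sketch uses only the Python standard library. The names python_supported and total_ram_gb are illustrative helpers, not part of NeMo Curator, and the memory probe assumes os.sysconf is available (Linux and most Unix-like systems).

```python
import os
import platform
import sys

# Values taken from the quick start requirements listed above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_RAM_GB = 16


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is supported."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def total_ram_gb():
    """Return total physical memory in GB, or None if it cannot be read."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # os.sysconf or these keys are unavailable on this platform
    return page_size * page_count / 1e9


if __name__ == "__main__":
    print(f"OS: {platform.system()} {platform.release()}")
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} supported: {python_supported()}")
    ram = total_ram_gb()
    if ram is None:
        print("RAM: could not be determined on this platform")
    else:
        print(f"RAM: {ram:.1f} GB (guide suggests {MIN_RAM_GB}GB+)")
```

The script only reports what it finds. It does not check for a GPU or CUDA, which the GPU availability check later in this guide covers.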

Development vs Production#

| Use Case | Requirements | See |
|---|---|---|
| Local Development | Minimum specs listed above | Continue below |
| Production Clusters | Detailed hardware, network, storage specs | Deployment Requirements |
| Multi-node Setup | Advanced infrastructure planning | Deployment Options |


Installation Methods#

Choose one of the following installation methods based on your needs:

Tip

Docker is the recommended installation method for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the container installation instructions below.

Install NeMo Curator from the Python Package Index using uv for proper dependency resolution.

  1. Install uv:

    curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
    source $HOME/.local/bin/env
    
  2. Create and activate a virtual environment:

    uv venv
    source .venv/bin/activate
    
  3. Install NeMo Curator:

    uv pip install torch wheel_stub psutil setuptools setuptools_scm
    echo "transformers==4.55.2" > override.txt
    uv pip install --no-build-isolation "nemo-curator[all]" --override override.txt
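To confirm that the override in step 3 took effect, one option is to compare the installed transformers version against the pin. This is a minimal sketch using importlib.metadata from the standard library; check_pin is a hypothetical helper name, not a NeMo Curator or uv API.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected):
    """Return (matches, installed_version); installed_version is None if missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed


if __name__ == "__main__":
    # The pin below is the one written to override.txt in step 3.
    matches, installed = check_pin("transformers", "4.55.2")
    if installed is None:
        print("transformers is not installed in this environment")
    elif matches:
        print(f"transformers {installed} matches the override")
    else:
        print(f"transformers {installed} does not match the 4.55.2 override")
```

Run it with the virtual environment from step 2 activated, so that the lookup inspects the same environment uv installed into.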
    

Install the latest development version directly from GitHub:

# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator

# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install with all extras using uv
uv sync --all-extras --all-groups

NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.

# Pull the container from NGC
docker pull nvcr.io/nvidia/nemo-curator:26.02

# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:26.02

Important

After entering the container, activate the virtual environment before running any NeMo Curator commands:

source /opt/venv/env.sh

The container uses a virtual environment at /opt/venv. If you see No module named nemo_curator, the environment has not been activated.
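If it is unclear whether the environment is active, a quick check from Python is to look at the interpreter prefix. The sketch below assumes that activating the environment makes sys.prefix point at /opt/venv (or a directory inside it); running_in_venv is an illustrative helper, not part of NeMo Curator.

```python
import sys
from pathlib import Path

# Location stated in the text above; adjust if your image differs.
CONTAINER_VENV = Path("/opt/venv")


def running_in_venv(prefix=None, venv=CONTAINER_VENV):
    """Return True if the interpreter prefix is the venv or lies inside it."""
    prefix = Path(prefix if prefix is not None else sys.prefix)
    return prefix == venv or venv in prefix.parents


if __name__ == "__main__":
    if running_in_venv():
        print(f"Interpreter is running from {CONTAINER_VENV}")
    else:
        print(f"Interpreter prefix is {sys.prefix}; the environment may not be activated")
```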

Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:

# Build the container locally
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
docker build -t nemo-curator:latest -f docker/Dockerfile .

# Run the container with GPU support
docker run --gpus all -it --rm nemo-curator:latest

Benefits:

  • Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)

  • Consistent runtime across different systems

  • Ideal for production deployments

Install FFmpeg and Encoders (Required for Video)#

Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.

Use the maintained script in the repository (written for Debian/Ubuntu) to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script configures the build with --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.

curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.sh

Confirm that FFmpeg is on your PATH and that at least one H.264 encoder is available:

ffmpeg -hide_banner -version | head -n 5
ffmpeg -hide_banner -encoders | grep -E "h264_nvenc|libopenh264|libx264"

If encoders are missing, reinstall FFmpeg with the required options or use the Debian/Ubuntu script above.
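The same encoder check can be scripted, for example as a guard at the start of a video job. The sketch below calls the ffmpeg -encoders command shown above through subprocess and looks for the three encoder names. The helper names parse_encoders and available_h264_encoders are illustrative, and the parser assumes the usual listing layout of a flags column followed by the encoder name.

```python
import shutil
import subprocess

H264_ENCODERS = ("h264_nvenc", "libopenh264", "libx264")


def parse_encoders(listing, wanted=H264_ENCODERS):
    """Return the wanted encoder names found in `ffmpeg -encoders` output."""
    found = set()
    for line in listing.splitlines():
        fields = line.split()
        # Encoder rows are assumed to look like: " V....D libx264  description"
        if len(fields) >= 2 and fields[1] in wanted:
            found.add(fields[1])
    return [name for name in wanted if name in found]


def available_h264_encoders():
    """Return the available H.264 encoders, or None if ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        return None
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_encoders(result.stdout)


if __name__ == "__main__":
    encoders = available_h264_encoders()
    if encoders is None:
        print("ffmpeg was not found on PATH")
    elif encoders:
        print("H.264 encoders available: " + ", ".join(encoders))
    else:
        print("ffmpeg is installed, but no H.264 encoder was found")
```

Splitting the parsing from the subprocess call keeps the parser testable without FFmpeg installed.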

Note

The FFmpeg build requires the CUDA toolkit (nvcc): If you encounter ERROR: failed checking for nvcc during FFmpeg installation, make sure the CUDA toolkit is installed and nvcc is on your PATH. You can verify this with nvcc --version. The NeMo Curator container ships with FFmpeg and NVENC support pre-installed.


Package Extras#

NeMo Curator provides several installation extras to install only the components you need:

Table 2 Available Package Extras#

| Extra | Installation Command | Description |
|---|---|---|
| text_cpu | uv pip install "nemo-curator[text_cpu]" | CPU-only text processing and filtering |
| text_cuda12 | uv pip install "nemo-curator[text_cuda12]" | GPU-accelerated text processing with RAPIDS |
| audio_cpu | uv pip install "nemo-curator[audio_cpu]" | CPU-only audio curation with NeMo Toolkit ASR |
| audio_cuda12 | uv pip install "nemo-curator[audio_cuda12]" | GPU-accelerated audio curation. When using uv, requires the transformers==4.55.2 override. |
| image_cpu | uv pip install "nemo-curator[image_cpu]" | CPU-only image processing |
| image_cuda12 | uv pip install "nemo-curator[image_cuda12]" | GPU-accelerated image processing with NVIDIA DALI |
| video_cpu | uv pip install "nemo-curator[video_cpu]" | CPU-only video processing |
| video_cuda12 | uv pip install --no-build-isolation "nemo-curator[video_cuda12]" | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using uv. |

Note

Development Dependencies: For development tools (pre-commit, ruff, pytest), use uv sync --group dev --group linting --group test instead of pip extras. Development dependencies are managed as dependency groups, not optional dependencies.
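To see which extras an installed copy of the package declares, the distribution metadata can be read with importlib.metadata. This is a sketch under the assumption that the distribution is named nemo-curator, as in the install commands above; declared_extras is a hypothetical helper. It lists the extras the package offers, not the ones that were actually installed.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(distribution="nemo-curator"):
    """Return the sorted extras declared by an installed distribution."""
    try:
        meta = metadata(distribution)
    except PackageNotFoundError:
        return []  # the distribution is not installed in this environment
    # "Provides-Extra" appears once per declared extra in the core metadata.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    extras = declared_extras()
    if extras:
        print("Declared extras: " + ", ".join(extras))
    else:
        print("nemo-curator is not installed or declares no extras")
```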


Installation Verification#

After installation, verify that NeMo Curator is working correctly:

1. Basic Import Test#

# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")

# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")

2. GPU Availability Check#

If you installed GPU support, verify GPU access:

# Check GPU availability through PyTorch
try:
    import torch
    if torch.cuda.is_available():
        print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
        print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠ No GPU detected")
except ImportError as e:
    print(f"⚠ PyTorch not available: {e}")

# Check cuDF for GPU deduplication (independent of the PyTorch check)
try:
    import cudf  # noqa: F401
    print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
    print(f"⚠ cuDF not available: {e}")

3. Run a Quickstart Tutorial#

Try a modality-specific quickstart to see NeMo Curator in action: