Installation Guide

This guide covers installing NeMo Curator with support for all modalities and verifying your installation is working correctly.

Before You Start

System Requirements

For comprehensive system requirements and production deployment specifications, see Production Deployment Requirements.

Quick Start Requirements:

  • OS: Ubuntu 24.04/22.04/20.04 (recommended)
  • Python: 3.10, 3.11, or 3.12
  • Memory: 16GB+ RAM for basic text processing
  • GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
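
The quick-start requirements above can be checked before installing. The following is a minimal sketch and not part of NeMo Curator: the helper names `python_supported` and `total_ram_gb` are hypothetical, and the RAM lookup assumes a host where `os.sysconf` exposes `SC_PHYS_PAGES` (typically Linux).

```python
import os
import sys

# Versions listed in the Quick Start Requirements above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_RAM_GB = 16  # threshold given for basic text processing


def python_supported(version=None):
    """Return True if (major, minor) is one of the supported versions."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def total_ram_gb():
    """Return total physical memory in GB, or None if it cannot be read.

    Assumption: os.sysconf with SC_PHYS_PAGES is available (Linux).
    """
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1e9


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} supported: {python_supported()}")
    ram = total_ram_gb()
    if ram is None:
        print("RAM: could not be determined on this platform")
    else:
        print(f"RAM: {ram:.1f} GB (at least {MIN_RAM_GB} GB: {ram >= MIN_RAM_GB})")
```

The script only reports what it finds; it does not check the OS release or GPU memory.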

Development vs Production

| Use Case | Requirements | See |
|---|---|---|
| Local Development | Minimum specs listed above | Continue below |
| Production Clusters | Detailed hardware, network, storage specs | Deployment Requirements |
| Multi-node Setup | Advanced infrastructure planning | Deployment Options |

Installation Methods

Choose one of the following installation methods based on your needs:

Install FFmpeg and Encoders (Required for Video)

Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.

Use the maintained script in the repository to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script configures the build with --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.

curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.sh
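
After the build finishes, it can be useful to confirm that the FFmpeg on your PATH actually lists the encoders mentioned above. The sketch below is an illustration and not part of NeMo Curator: the function names are hypothetical, and it relies only on the standard `ffmpeg -hide_banner -encoders` listing, where each encoder row carries the name as its second column.

```python
import shutil
import subprocess

# Encoders named in this guide for clip transcoding.
WANTED_ENCODERS = ("libopenh264", "h264_nvenc")


def parse_encoder_names(encoders_output):
    """Collect encoder names from `ffmpeg -encoders` output.

    Encoder rows look like ' V....D libopenh264  <description>',
    so the name is the second whitespace-separated token.
    """
    names = set()
    for line in encoders_output.splitlines():
        tokens = line.split()
        if len(tokens) >= 2:
            names.add(tokens[1])
    return names


def check_ffmpeg_encoders(wanted=WANTED_ENCODERS):
    """Return {encoder: bool}, or None if ffmpeg is missing or fails."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return None
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    available = parse_encoder_names(result.stdout)
    return {name: name in available for name in wanted}


if __name__ == "__main__":
    status = check_ffmpeg_encoders()
    if status is None:
        print("ffmpeg not found on PATH (or it failed to run)")
    else:
        for name, ok in status.items():
            print(f"{name}: {'available' if ok else 'missing'}")
```

A listed encoder only means it was compiled in. h264_nvenc still needs a working NVIDIA driver and GPU at run time.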

InternVideo2 Support (Optional for Video)

Video processing includes optional support for InternVideo2. To install InternVideo2, run these commands before installing NeMo Curator based on whether you install via PyPI or from source:

# Clone and set up InternVideo2
git clone https://github.com/OpenGVLab/InternVideo.git
cd InternVideo
git checkout 09d872e5093296c6f36b8b3a91fc511b76433bf7

# Download and apply NeMo Curator patch
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/external/intern_video2_multimodal.patch -o intern_video2_multimodal.patch
patch -p1 < intern_video2_multimodal.patch
cd ..

# Add InternVideo2 to the environment
uv add InternVideo/InternVideo2/multi_modality

Package Extras

NeMo Curator provides several installation extras to install only the components you need:

Available Package Extras

| Extra | Installation Command | Description |
|---|---|---|
| text_cpu | uv pip install nemo-curator[text_cpu] | CPU-only text processing and filtering |
| text_cuda12 | uv pip install nemo-curator[text_cuda12] | GPU-accelerated text processing with RAPIDS |
| audio_cpu | uv pip install nemo-curator[audio_cpu] | CPU-only audio curation with NeMo Toolkit ASR |
| audio_cuda12 | uv pip install nemo-curator[audio_cuda12] | GPU-accelerated audio curation. When using uv, requires transformers==4.55.2 override. |
| image_cpu | uv pip install nemo-curator[image_cpu] | CPU-only image processing |
| image_cuda12 | uv pip install nemo-curator[image_cuda12] | GPU-accelerated image processing with NVIDIA DALI |
| video_cpu | uv pip install nemo-curator[video_cpu] | CPU-only video processing |
| video_cuda12 | uv pip install --no-build-isolation nemo-curator[video_cuda12] | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using uv. |
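
The pattern in the table is regular enough to generate, which can help in setup scripts that pick an extra at run time. The sketch below is illustrative only: `install_command` is a hypothetical helper, the set of extras is copied from the table above, and the quoting of the requirement is an addition (not shown in the table) so that shells such as zsh do not treat the square brackets as a glob pattern.

```python
# Extras listed in the table above.
EXTRAS = {
    "text_cpu", "text_cuda12",
    "audio_cpu", "audio_cuda12",
    "image_cpu", "image_cuda12",
    "video_cpu", "video_cuda12",
}

# Per the table, only video_cuda12 is installed with --no-build-isolation.
NO_BUILD_ISOLATION = {"video_cuda12"}


def install_command(extra):
    """Build the `uv pip install` command line for one extra."""
    if extra not in EXTRAS:
        raise ValueError(f"unknown extra: {extra!r}")
    parts = ["uv", "pip", "install"]
    if extra in NO_BUILD_ISOLATION:
        parts.append("--no-build-isolation")
    # Quote the requirement so the shell does not expand the brackets.
    parts.append(f'"nemo-curator[{extra}]"')
    return " ".join(parts)


if __name__ == "__main__":
    for name in sorted(EXTRAS):
        print(install_command(name))
```

The helper does not cover the transformers==4.55.2 override noted for audio_cuda12; that still has to be configured separately.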

Development Dependencies: For development tools (pre-commit, ruff, pytest), use uv sync --group dev --group linting --group test instead of pip extras. Development dependencies are managed as dependency groups, not optional dependencies.


Installation Verification

After installation, verify that NeMo Curator is working correctly:

1. Basic Import Test

# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")

# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")
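
If the import test fails, it helps to know whether the package is missing or merely failing to import. The following sketch uses only the Python standard library to inspect the environment without importing anything heavy. The helper names are hypothetical, and the distribution name `nemo-curator` is taken from the install commands in this guide.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(distribution="nemo-curator"):
    """Return the installed distribution version, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def module_status(names=("nemo_curator", "torch", "cudf")):
    """Map each top-level module name to whether it can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print(f"nemo-curator distribution: {installed_version() or 'not installed'}")
    for name, found in module_status().items():
        print(f"{name}: {'found' if found else 'not found'}")
```

A module reported as found can still fail at import time, for example when a CUDA library is missing, so this complements the import test rather than replacing it.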

2. GPU Availability Check

If you installed GPU support, verify GPU access:

# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
        print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠ No GPU detected")

    # Check cuDF for GPU deduplication
    import cudf
    print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
    print(f"⚠ Some GPU modules not available: {e}")
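
When torch is not installed or reports no GPU, querying the driver directly can show whether the problem is in the Python environment or on the host. The sketch below is an illustration, not part of NeMo Curator: the function names are hypothetical, and it assumes `nvidia-smi` is on PATH, using its `--query-gpu=name,memory.total` option with `--format=csv,noheader,nounits` (memory reported in MiB).

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse 'name, memory_in_MiB' rows into (name, GiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so commas in the name are preserved.
        name, _, mem_mib = line.rpartition(",")
        gpus.append((name.strip(), int(mem_mib) / 1024))
    return gpus


def query_gpus():
    """Return a list of (name, GiB) tuples, or None if the query fails."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return None
    result = subprocess.run(
        [smi, "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi not available; no driver-level GPU information")
    else:
        for name, gib in gpus:
            print(f"{name}: {gib:.1f} GiB")
```

The reported total is often slightly below the card's nominal size, so compare it against the 16GB+ guideline with some tolerance.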

3. Run a Quickstart Tutorial

Try a modality-specific quickstart to see NeMo Curator in action: