Installation Guide#
This guide covers installing NeMo Curator with support for all modalities and verifying that the installation works correctly.
Before You Start#
System Requirements#
For comprehensive system requirements and production deployment specifications, refer to Production Deployment Requirements.
Quick Start Requirements:
OS: Ubuntu 24.04/22.04/20.04 (recommended)
Python: 3.10, 3.11, or 3.12
Memory: 16GB+ RAM for basic text processing
GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
CUDA 12 (required for the audio_cuda12, video_cuda12, image_cuda12, and text_cuda12 extras)
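The quick start requirements above can be checked with a short preflight script. This is a minimal sketch, not part of NeMo Curator: the helper names `python_supported` and `total_ram_gb` are hypothetical, and the RAM lookup assumes a POSIX system where `os.sysconf` is available.

```python
import os
import sys

# Supported (major, minor) pairs, taken from the requirements list above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_RAM_GB = 16  # "16GB+ RAM for basic text processing"


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter's (major, minor) is a supported version."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def total_ram_gb():
    """Return total physical memory in GB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # os.sysconf or these names are unavailable on this platform
    return page_size * page_count / 1e9


if __name__ == "__main__":
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} supported: {python_supported()}")
    ram = total_ram_gb()
    if ram is None:
        print("RAM: could not be determined on this platform")
    else:
        print(f"RAM: {ram:.1f} GB (meets {MIN_RAM_GB} GB minimum: {ram >= MIN_RAM_GB})")
```

GPU and CUDA checks are left out here because they depend on packages that are installed later in this guide.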
Development vs Production#
| Use Case | Requirements | See |
|---|---|---|
| Local Development | Minimum specs listed above | Continue below |
| Production Clusters | Detailed hardware, network, storage specs | |
| Multi-node Setup | Advanced infrastructure planning | |
Installation Methods#
Choose one of the following installation methods based on your needs:
Tip
Docker is the recommended installation method for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the Container Installation tab below.
Install NeMo Curator from the Python Package Index using uv for proper dependency resolution.
Install uv:
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
source $HOME/.local/bin/env
Create and activate a virtual environment:
uv venv
source .venv/bin/activate
Install NeMo Curator:
uv pip install torch wheel_stub psutil setuptools setuptools_scm
echo "transformers==4.55.2" > override.txt
uv pip install --no-build-isolation "nemo-curator[all]" --override override.txt
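After the install finishes, it can be useful to confirm that the pin written to override.txt actually took effect. The following is a rough sketch using only the standard library; `parse_pins` and `check_pins` are hypothetical helper names, and the parser assumes simple `name==version` lines like the one created above.

```python
from importlib import metadata
from pathlib import Path


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins):
    """Map each pinned name to (expected, installed); installed is None if absent."""
    result = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        result[name] = (expected, installed)
    return result


if __name__ == "__main__":
    path = Path("override.txt")
    if path.exists():
        for name, (expected, installed) in check_pins(parse_pins(path.read_text())).items():
            status = "ok" if installed == expected else "MISMATCH"
            print(f"{name}: expected {expected}, installed {installed} [{status}]")
    else:
        print("override.txt not found in the current directory")
```

Run it from the directory that contains override.txt, inside the activated virtual environment.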
Install the latest development version directly from GitHub:
# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install with all extras using uv
uv sync --all-extras --all-groups
NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.
# Pull the container from NGC
docker pull nvcr.io/nvidia/nemo-curator:26.02
# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:26.02
Important
After entering the container, activate the virtual environment before running any NeMo Curator commands:
source /opt/venv/env.sh
The container uses a virtual environment at /opt/venv. If you see No module named nemo_curator, the environment has not been activated.
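To tell whether the current interpreter can see the package at all, a quick check like the one below may help. It is a sketch built on the standard library only; `module_available` is a hypothetical helper name, and the script makes no assumption about how the container's environment script configures the interpreter.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is running, so a wrong environment is easy to spot.
    print(f"Interpreter: {sys.executable}")
    if module_available("nemo_curator"):
        print("nemo_curator is importable")
    else:
        print("nemo_curator not found; the virtual environment may not be activated")
```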
Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:
# Build the container locally
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
docker build -t nemo-curator:latest -f docker/Dockerfile .
# Run the container with GPU support
docker run --gpus all -it --rm nemo-curator:latest
Benefits:
Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
Consistent runtime across different systems
Ideal for production deployments
Install FFmpeg and Encoders (Required for Video)#
Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.
Use the maintained script in the repository to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script enables --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.
Script source: docker/common/install_ffmpeg.sh
curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.sh
Confirm that FFmpeg is on your PATH and that at least one H.264 encoder is available:
ffmpeg -hide_banner -version | head -n 5
ffmpeg -encoders | grep -E "h264_nvenc|libopenh264|libx264" | cat
If encoders are missing, reinstall FFmpeg with the required options or use the Debian/Ubuntu script above.
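The same encoder check can be scripted, for example as a preflight step before launching a video pipeline. This is a sketch under stated assumptions: the function names are hypothetical, and the parser assumes that each line of `ffmpeg -encoders` output lists the encoder name as a whitespace-separated token.

```python
import shutil
import subprocess

# H.264 encoders named in the verification command above.
WANTED = ("h264_nvenc", "libopenh264", "libx264")


def find_h264_encoders(encoders_output):
    """Return wanted encoder names that appear as whole tokens in the output."""
    found = []
    for line in encoders_output.splitlines():
        tokens = line.split()
        for name in WANTED:
            if name in tokens and name not in found:
                found.append(name)
    return found


def installed_h264_encoders():
    """Return the available H.264 encoders, or None if ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        return None
    proc = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return find_h264_encoders(proc.stdout)


if __name__ == "__main__":
    encoders = installed_h264_encoders()
    if encoders is None:
        print("ffmpeg not found on PATH")
    elif not encoders:
        print("ffmpeg found, but no H.264 encoder is available")
    else:
        print("H.264 encoders available: " + ", ".join(encoders))
```

Matching whole tokens rather than substrings avoids false positives from encoder descriptions that merely mention H.264.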
Note
FFmpeg build requires CUDA toolkit (nvcc): If you encounter ERROR: failed checking for nvcc during FFmpeg installation, ensure that the CUDA toolkit is installed and nvcc is available on your PATH. You can verify with nvcc --version. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
Package Extras#
NeMo Curator provides several installation extras to install only the components you need:
| Extra | Installation Command | Description |
|---|---|---|
| text_cpu | | CPU-only text processing and filtering |
| text_cuda12 | | GPU-accelerated text processing with RAPIDS |
| audio_cpu | | CPU-only audio curation with NeMo Toolkit ASR |
| audio_cuda12 | | GPU-accelerated audio curation. When using |
| image_cpu | | CPU-only image processing |
| image_cuda12 | | GPU-accelerated image processing with NVIDIA DALI |
| video_cpu | | CPU-only video processing |
| video_cuda12 | | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using |
Note
Development Dependencies: For development tools (pre-commit, ruff, pytest), use uv sync --group dev --group linting --group test instead of pip extras. Development dependencies are managed as dependency groups, not optional dependencies.
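Several extras from the table above can be combined using the standard `package[extra1,extra2]` requirement syntax. The sketch below builds such a string and rejects misspelled extra names; `requirement_for` is a hypothetical helper, and the set of known extras is copied from this page (plus `all`, used in the PyPI instructions), so it may not match every release.

```python
# Extras listed on this page; "all" appears in the PyPI install command.
KNOWN_EXTRAS = {
    "all",
    "text_cpu", "text_cuda12",
    "audio_cpu", "audio_cuda12",
    "image_cpu", "image_cuda12",
    "video_cpu", "video_cuda12",
}


def requirement_for(extras, package="nemo-curator"):
    """Build a requirement string such as 'nemo-curator[text_cpu,video_cpu]'."""
    selected = set(extras)
    unknown = sorted(selected - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"Unknown extras: {', '.join(unknown)}")
    if not selected:
        return package
    return f"{package}[{','.join(sorted(selected))}]"


if __name__ == "__main__":
    # Quote the result when passing it to a shell, since brackets can be globbed.
    print(requirement_for(["text_cuda12", "video_cuda12"]))
```

The resulting string can replace `"nemo-curator[all]"` in the install command shown earlier when only some modalities are needed.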
Installation Verification#
After installation, verify that NeMo Curator is working correctly:
1. Basic Import Test#
# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")
# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")
2. GPU Availability Check#
If you installed GPU support, verify GPU access:
# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
        print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("⚠ No GPU detected")
    # Check cuDF for GPU deduplication
    import cudf
    print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
    print(f"⚠ Some GPU modules not available: {e}")
3. Run a Quickstart Tutorial#
Try a modality-specific quickstart to see NeMo Curator in action:
Text Curation Quickstart - Set up and run your first text curation pipeline
Audio Curation Quickstart - Get started with audio dataset curation
Image Curation Quickstart - Curate image-text datasets for generative models
Video Curation Quickstart - Split, encode, and curate video clips at scale