Installation Guide#
This guide covers installing NeMo Curator and verifying your installation is working correctly. For configuration after installation, see Configuration.
System Requirements#
For comprehensive system requirements and production deployment specifications, see Production Deployment Requirements.
Quick Start Requirements:
OS: Ubuntu 22.04/20.04 (recommended)
Python: 3.10 or 3.12 (Python 3.11 is not supported)
Memory: 16GB+ RAM for basic text processing
GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
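The Python constraint above can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `is_supported_python` is illustrative and not part of NeMo Curator, and the supported set is copied from the list above.

```python
import sys

# Supported (major, minor) versions, taken from the requirements list above.
SUPPORTED_PYTHON = {(3, 10), (3, 12)}


def is_supported_python(version=None):
    """Return True if the (major, minor) pair is in the supported set.

    Hypothetical helper for illustration; not part of NeMo Curator.
    Defaults to the running interpreter when no version is given.
    """
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {current}: {status}")
```

Running this with the interpreter you plan to install into shows whether a new environment is needed first.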
Development vs Production#
Use Case | Requirements | See |
---|---|---|
Local Development | Minimum specs listed above | Continue below |
Production Clusters | Detailed hardware, network, storage specs | |
Multi-node Setup | Advanced infrastructure planning | |
Installation Methods#
Choose one of the following installation methods based on your needs:
PyPI: the simplest way to install NeMo Curator is from the Python Package Index:
CPU-only installation:
pip install nemo-curator
GPU-accelerated installation:
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"
Full installation with all modules:
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]"
Source: install the latest development version directly from GitHub:
# Clone the repository
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
# Install with desired extras
pip install --extra-index-url https://pypi.nvidia.com ".[all]"
Benefits:
Access to latest features and bug fixes
Ability to modify source code for custom needs
Easier contribution to the project
Container: NeMo Curator is available as a standalone container:
Warning
Container Availability: The standalone NeMo Curator container is currently in development. Check the NGC Catalog for the latest availability and container path.
# Pull the container (path will be updated when available)
docker pull nvcr.io/nvidia/nemo-curator:latest
# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:latest
# For custom installations inside container
pip uninstall nemo-curator
rm -r /opt/NeMo-Curator
git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com "/opt/NeMo-Curator[all]"
Benefits:
Pre-configured environment with all dependencies
Consistent runtime across different systems
Ideal for production deployments
Package Extras#
NeMo Curator provides several installation extras to install only the components you need:
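Extra | Installation Command | Description |
---|---|---|
Base | pip install nemo-curator | CPU-only text curation modules |
dev | | Development tools (pre-commit, ruff, pytest) |
cuda12x | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]" | CPU + GPU text curation with RAPIDS |
image | | CPU + GPU text and image curation |
bitext | | Bilingual text curation modules |
all | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]" | All stable modules (recommended) |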
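When installation is scripted, the extra can be composed programmatically. The sketch below builds a pip argument list for a chosen extra and is an illustration only: `build_install_command` is a hypothetical helper, not part of NeMo Curator, and whether a given extra needs the NVIDIA index is left to the caller through the assumed `use_nvidia_index` parameter.

```python
import sys

# Index URL used by the install commands in this guide.
NVIDIA_INDEX = "https://pypi.nvidia.com"


def build_install_command(extra=None, use_nvidia_index=True):
    """Build a pip argument list for nemo-curator with an optional extra.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    requirement = f"nemo-curator[{extra}]" if extra else "nemo-curator"
    # Calling pip through the running interpreter targets the active environment.
    command = [sys.executable, "-m", "pip", "install"]
    if extra and use_nvidia_index:
        command += ["--extra-index-url", NVIDIA_INDEX]
    command.append(requirement)
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_install_command("cuda12x")))
```

The returned list can be passed to `subprocess.run(command, check=True)`. Passing arguments as a list bypasses the shell, so the square brackets in the extra need no quoting.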
Nightly Dependencies#
For cutting-edge RAPIDS features, use nightly builds:
# Nightly RAPIDS with all modules
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all_nightly]"
# Nightly RAPIDS with image modules
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[image_nightly]"
Warning
Nightly builds may be unstable and are not recommended for production use.
Installation Verification#
After installation, verify that NeMo Curator is working correctly:
1. Basic Import Test#
# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")
# Test core modules
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
print("✓ Core modules imported successfully")
2. GPU Availability Check#
If you installed GPU support, verify GPU access:
# Check GPU availability
try:
    import cudf
    import dask_cudf
    print("✓ GPU modules available")

    # Test GPU memory
    import cupy
    mempool = cupy.get_default_memory_pool()
    print(f"✓ GPU memory pool initialized: {mempool.total_bytes() / 1e9:.1f} GB")
except ImportError as e:
    print(f"⚠ GPU modules not available: {e}")
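To see which GPU packages are missing without importing them (importing can be slow or fail on machines without a GPU), the standard library can locate modules by name. This is a minimal sketch: `missing_modules` is an illustrative helper, not part of NeMo Curator, and the module names are the ones used in the check above. It only confirms that a package is installed, not that a GPU is usable.

```python
import importlib.util

# Module names taken from the GPU availability check above.
GPU_MODULES = ("cudf", "dask_cudf", "cupy")


def missing_modules(names):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(GPU_MODULES)
    if missing:
        print(f"Missing GPU modules: {', '.join(missing)}")
    else:
        print("All GPU modules are installed")
```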
3. CLI Tools Verification#
Test that command-line tools are properly installed:
# Check if CLI tools are available
text_cleaning --help
add_id --help
gpu_exact_dups --help
# Test specific functionality
echo '{"id": "doc1", "text": "Hello world"}' | text_cleaning --input-format jsonl
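The same availability check can be done from Python, which is convenient in setup scripts or CI. A minimal sketch, assuming the command names listed above; `find_cli_tools` is an illustrative helper, not part of NeMo Curator. It reports whether each command resolves on `PATH` and does not run the tools.

```python
import shutil

# Command names taken from the CLI verification step above.
CLI_TOOLS = ("text_cleaning", "add_id", "gpu_exact_dups")


def find_cli_tools(names):
    """Map each command name to its resolved path, or None if not on PATH.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_cli_tools(CLI_TOOLS).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A command reported as not found usually means the package was installed into a different environment than the one currently active.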
4. Dask Cluster Test#
Verify distributed computing capabilities:
from nemo_curator.utils.distributed_utils import get_client
# Test local cluster creation
client = get_client(cluster_type="local", n_workers=2)
print(f"✓ Dask cluster created: {client}")
# Test basic distributed operation
import dask.dataframe as dd
import pandas as pd
df = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)
result = df.x.sum().compute()
print(f"✓ Distributed computation successful: {result}")
client.close()
Common Installation Issues#
CUDA/GPU Issues#
Problem: GPU modules not available after installation
ImportError: No module named 'cudf'
Solutions:
Ensure you installed with the correct extra: nemo-curator[cuda12x] or nemo-curator[all]
Verify CUDA is properly installed:
nvidia-smi
Check CUDA version compatibility (CUDA 12.0+ required)
Install RAPIDS manually:
pip install --extra-index-url https://pypi.nvidia.com cudf-cu12
Python Version Issues#
Problem: Installation fails with Python version errors
ERROR: Package 'nemo_curator' requires a different Python: 3.9.0 not in '>=3.10'
Solutions:
Upgrade to Python 3.10 or 3.12
Use conda to manage Python versions:
conda create -n curator python=3.12
Avoid Python 3.11 (not supported due to RAPIDS compatibility)
Network/Registry Issues#
Problem: Cannot access NVIDIA PyPI registry
ERROR: Could not find a version that satisfies the requirement cudf-cu12
Solutions:
Ensure you’re using the NVIDIA registry:
--extra-index-url https://pypi.nvidia.com
Check network connectivity to PyPI and the NVIDIA registry
Try installing with --trusted-host pypi.nvidia.com
Use the container installation as an alternative
Memory Issues#
Problem: Installation fails due to insufficient memory
MemoryError: Unable to allocate array
Solutions:
Increase system memory or swap space
Install packages individually rather than using the [all] extra
Use the --no-cache-dir flag: pip install --no-cache-dir "nemo-curator[all]"
Consider the container installation
Next Steps#
Choose your next step based on your goals:
For Local Development & Learning#
Try a tutorial: Start with Get Started guides
Configure your environment: See Configuration Guide for basic setup
For Production Deployment#
Review requirements: See Production Deployment Requirements
Choose deployment method: See Deployment Options
Configure for production: See Configuration Guide for advanced settings
See also
Configuration Guide - Configure NeMo Curator for your environment
Container Environments - Container-specific setup
Deployment Requirements - Production deployment prerequisites