Installation Guide#
This guide covers installing NeMo Curator and verifying your installation is working correctly. For configuration after installation, see Configuration.
System Requirements#
For comprehensive system requirements and production deployment specifications, see Production Deployment Requirements.
Quick Start Requirements:
- OS: Ubuntu 22.04/20.04 (recommended) 
- Python: 3.10 or 3.12 (Python 3.11 is not supported) 
- Memory: 16GB+ RAM for basic text processing 
- GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration 
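The Python constraint above can be checked before installing anything. The following is a minimal sketch, assuming only the versions listed in this guide (3.10 and 3.12); the `python_supported` helper and `SUPPORTED_MINORS` constant are hypothetical names for illustration, not part of NeMo Curator.

```python
import sys

# Minor versions listed as supported in this guide (assumption: no others).
SUPPORTED_MINORS = {(3, 10), (3, 12)}


def python_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is one this guide lists."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


# Report on the interpreter that is running this script.
current = f"{sys.version_info[0]}.{sys.version_info[1]}"
if python_supported():
    print(f"Python {current} is listed as supported")
else:
    print(f"Python {current} is not listed as supported; use 3.10 or 3.12")
```

Run it with the same interpreter you plan to install into, since a virtual environment may use a different Python than the system default.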
Development vs Production#
| Use Case | Requirements | See | 
|---|---|---|
| Local Development | Minimum specs listed above | Continue below | 
| Production Clusters | Detailed hardware, network, storage specs | Production Deployment Requirements |
| Multi-node Setup | Advanced infrastructure planning | Deployment Options |
Installation Methods#
Choose one of the following installation methods based on your needs:
The simplest way to install NeMo Curator is from the Python Package Index (PyPI):
CPU-only installation:
pip install nemo-curator
GPU-accelerated installation:
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]"
Full installation with all modules:
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]"
Install the latest development version directly from GitHub:
# Clone the repository
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
# Install with desired extras
pip install --extra-index-url https://pypi.nvidia.com ".[all]"
Benefits:
- Access to latest features and bug fixes 
- Ability to modify source code for custom needs 
- Easier contribution to the project 
NeMo Curator is available as a standalone container:
Warning
Container Availability: The standalone NeMo Curator container is currently in development. Check the NGC Catalog for the latest availability and container path.
# Pull the container (path will be updated when available)
docker pull nvcr.io/nvidia/nemo-curator:latest
# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:latest
# For custom installations inside the container
pip uninstall -y nemo-curator
rm -r /opt/NeMo-Curator
git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com "/opt/NeMo-Curator[all]"
Benefits:
- Pre-configured environment with all dependencies 
- Consistent runtime across different systems 
- Ideal for production deployments 
Package Extras#
NeMo Curator provides several installation extras to install only the components you need:
| Extra | Installation Command | Description | 
|---|---|---|
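| Base | pip install nemo-curator | CPU-only text curation modules |
| dev | pip install "nemo-curator[dev]" | Development tools (pre-commit, ruff, pytest) |
| cuda12x | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[cuda12x]" | CPU + GPU text curation with RAPIDS |
| image | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[image]" | CPU + GPU text and image curation |
| bitext | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[bitext]" | Bilingual text curation modules |
| all | pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all]" | All stable modules (recommended) |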
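When installation is scripted, the command for a given extra can be assembled as an argument list, which avoids shell quoting problems with the square brackets. This is a minimal sketch; `pip_install_args` is a hypothetical helper, and it assumes the extras and index URL shown in this guide.

```python
import sys

NVIDIA_INDEX = "https://pypi.nvidia.com"


def pip_install_args(extra=None, nvidia_index=True):
    """Build the pip argument list for nemo-curator with an optional extra."""
    requirement = "nemo-curator" if extra is None else f"nemo-curator[{extra}]"
    # Use "python -m pip" so the install targets the running interpreter.
    args = [sys.executable, "-m", "pip", "install"]
    if nvidia_index:
        args += ["--extra-index-url", NVIDIA_INDEX]
    args.append(requirement)
    return args


# Print the command instead of running it; pass the list to
# subprocess.run(args, check=True) to actually install.
print(" ".join(pip_install_args("cuda12x")))
```

Because the list is passed to the process directly rather than through a shell, no quoting of `nemo-curator[cuda12x]` is needed.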
Nightly Dependencies#
For cutting-edge RAPIDS features, use nightly builds:
# Nightly RAPIDS with all modules
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[all_nightly]"
# Nightly RAPIDS with image modules
pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[image_nightly]"
Warning
Nightly builds may be unstable and are not recommended for production use.
Installation Verification#
After installation, verify that NeMo Curator is working correctly:
1. Basic Import Test#
# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")
# Test core modules
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
print("✓ Core modules imported successfully")
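If the import test fails, it can help to see which modules are missing without triggering a full import of each one. The sketch below uses the standard library's `importlib.util.find_spec`; the `missing_modules` helper is a hypothetical name, and the module list simply mirrors the imports used in this guide.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # For dotted names, find_spec imports the parent package and
            # raises if that parent is absent; treat this as "missing".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Modules referenced by the verification steps in this guide.
wanted = ["nemo_curator", "dask", "pandas", "cudf", "dask_cudf", "cupy"]
for name in missing_modules(wanted):
    print(f"not found: {name}")
```

A missing `cudf`, `dask_cudf`, or `cupy` is expected on a CPU-only installation.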
2. GPU Availability Check#
If you installed GPU support, verify GPU access:
# Check GPU availability
try:
    import cudf
    import dask_cudf
    print("✓ GPU modules available")
    
    # Test GPU memory
    import cupy
    mempool = cupy.get_default_memory_pool()
    print(f"✓ CuPy memory pool initialized ({mempool.total_bytes() / 1e9:.1f} GB currently held by the pool)")
except ImportError as e:
    print(f"⚠ GPU modules not available: {e}")
3. CLI Tools Verification#
Test that command-line tools are properly installed:
# Check if CLI tools are available
text_cleaning --help
add_id --help
gpu_exact_dups --help
# Test specific functionality
echo '{"id": "doc1", "text": "Hello world"}' | text_cleaning --input-format jsonl
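The same availability check can be scripted, which is convenient in CI. This is a minimal sketch using `shutil.which` to look up each tool on `PATH`; `locate_tools` is a hypothetical helper, and the tool names are the ones listed above.

```python
import shutil

# Command-line entry points named in this guide.
CLI_TOOLS = ["text_cleaning", "add_id", "gpu_exact_dups"]


def locate_tools(names):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in locate_tools(CLI_TOOLS).items():
    status = path if path is not None else "not found on PATH"
    print(f"{tool}: {status}")
```

If a tool is reported as not found inside a virtual environment, confirm the environment is activated so that its scripts directory is on `PATH`.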
4. Dask Cluster Test#
Verify distributed computing capabilities:
from nemo_curator.utils.distributed_utils import get_client
# Test local cluster creation
client = get_client(cluster_type="local", n_workers=2)
print(f"✓ Dask cluster created: {client}")
# Test basic distributed operation
import dask.dataframe as dd
import pandas as pd
df = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)
result = df.x.sum().compute()
print(f"✓ Distributed computation successful: {result}")
client.close()
Common Installation Issues#
CUDA/GPU Issues#
Problem: GPU modules not available after installation
ImportError: No module named 'cudf'
Solutions:
- Ensure you installed with the correct extra: nemo-curator[cuda12x] or nemo-curator[all]
- Verify that the NVIDIA driver and CUDA are properly installed: nvidia-smi
- Check CUDA version compatibility (CUDA 12.0+ required)
- Install RAPIDS manually: pip install --extra-index-url https://pypi.nvidia.com cudf-cu12
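The `nvidia-smi` check can also be run from Python to list the visible GPUs and their memory. The sketch below is illustrative: `parse_gpu_lines` and `query_gpus` are hypothetical helpers, and it relies on the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`.

```python
import shutil
import subprocess


def parse_gpu_lines(text):
    """Parse 'name, memory' CSV lines into (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma in the GPU name is preserved.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def query_gpus():
    """Return visible GPUs, or an empty list if nvidia-smi is unusable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


for name, memory in query_gpus():
    print(f"{name}: {memory}")
```

An empty result means either that `nvidia-smi` is not on `PATH` or that the driver could not enumerate a device, which points to a driver problem rather than a Python packaging problem.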
Python Version Issues#
Problem: Installation fails with Python version errors
ERROR: Package 'nemo_curator' requires a different Python: 3.9.0 not in '>=3.10'
Solutions:
- Upgrade to Python 3.10 or 3.12 
- Use conda to manage Python versions: conda create -n curator python=3.12
- Avoid Python 3.11 (not supported due to RAPIDS compatibility) 
Network/Registry Issues#
Problem: Cannot access NVIDIA PyPI registry
ERROR: Could not find a version that satisfies the requirement cudf-cu12
Solutions:
- Ensure you're using the NVIDIA registry: --extra-index-url https://pypi.nvidia.com
- Check network connectivity to PyPI and the NVIDIA registry
- Try installing with --trusted-host pypi.nvidia.com
- Use container installation as an alternative
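To separate network problems from packaging problems, connectivity to the NVIDIA index can be probed directly. This is a minimal sketch using only the standard library; `index_reachable` is a hypothetical helper, and it treats any HTTP response, including an error status, as proof that the host is reachable.

```python
import http.client
import urllib.error
import urllib.request

NVIDIA_INDEX_URL = "https://pypi.nvidia.com"


def index_reachable(url=NVIDIA_INDEX_URL, timeout=5.0):
    """Return True if a request to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, TLS error, or bad URL.
        return False


# Example usage (performs a real network request):
#   print("NVIDIA index reachable:", index_reachable())
#   print("PyPI reachable:", index_reachable("https://pypi.org/simple/"))
```

If the probe fails behind a corporate proxy, check the `HTTPS_PROXY` environment variable, which both `urllib` and pip honor.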
Memory Issues#
Problem: Installation fails due to insufficient memory
MemoryError: Unable to allocate array
Solutions:
- Increase system memory or swap space 
- Install packages individually rather than using the [all] extra
- Use the --no-cache-dir flag: pip install --no-cache-dir "nemo-curator[all]"
- Consider container installation 
Next Steps#
Choose your next step based on your goals:
For Local Development & Learning#
- Try a tutorial: Start with Get Started guides 
- Configure your environment: See Configuration Guide for basic setup 
For Production Deployment#
- Review requirements: See Production Deployment Requirements 
- Choose deployment method: See Deployment Options 
- Configure for production: See Configuration Guide for advanced settings 
See also
- Configuration Guide - Configure NeMo Curator for your environment 
- Container Environments - Container-specific setup 
- Deployment Requirements - Production deployment prerequisites