Configuration Guide#

Configure NeMo Curator for your deployment environment, including infrastructure settings, storage access, credentials, and environment variables. This section covers operational configuration for deployment and management.

For technical API documentation and development guidance, see the Infrastructure Reference.


Configuration Areas#

This section covers the three main areas of operational configuration for NeMo Curator deployments. Each area addresses different aspects of system setup and management, from infrastructure deployment to data access and runtime settings.

Deployment Environments

Configure NeMo Curator for different deployment scenarios, including Slurm, Kubernetes, and local environments. See Deployment Environment Configuration.

Storage & Credentials

Configure cloud storage access, API keys, and security credentials for data processing and model access. See Storage & Credentials Configuration.

Environment Variables

Comprehensive reference of all environment variables used by NeMo Curator across different deployment scenarios. See Environment Variables Reference.

Module-Specific Configuration#

Module-specific configuration handles processing pipeline settings for different data modalities. These configurations complement the deployment settings above and focus on algorithm parameters, model configurations, and processing behavior rather than infrastructure concerns.

For configuration of specific processing modules (deduplication, classifiers, filters), see the relevant modality sections:

Text Processing

Configuration for text deduplication, classification, and filtering modules. See About Text Curation.

Image Processing

Configuration for image classifiers, embedders, and filtering. See About Image Curation.

Configuration Hierarchy#

NeMo Curator reads settings from several levels. When the same setting is specified at more than one level, the higher-priority level wins, so a deployment can override a file-based setting without editing the file.

Configuration precedence, from highest to lowest priority:

  1. Command-line arguments - Direct parameter overrides

  2. Environment variables - Runtime configuration

  3. Configuration files - YAML/JSON configuration files

  4. Default values - Built-in defaults
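
The precedence order above can be sketched with Python's `collections.ChainMap`, which returns a key's value from the first mapping that defines it. This is a minimal illustration of the layering idea, not NeMo Curator's actual resolution code; the function `resolve_settings` and the sample keys (`log_level`, `cache_dir`, `cluster_type`) are hypothetical.

```python
from collections import ChainMap


def resolve_settings(cli_args, env_vars, file_config, defaults):
    """Merge four configuration layers, highest priority first.

    Hypothetical helper for illustration; not part of the NeMo Curator API.
    """
    # ChainMap looks up each key in the mappings from left to right,
    # so earlier layers override later ones.
    return dict(ChainMap(cli_args, env_vars, file_config, defaults))


if __name__ == "__main__":
    settings = resolve_settings(
        cli_args={"log_level": "DEBUG"},
        env_vars={"log_level": "WARNING", "cache_dir": "/tmp/cache"},
        file_config={"cache_dir": "./cache", "cluster_type": "gpu"},
        defaults={"log_level": "INFO", "cache_dir": "./cache", "cluster_type": "cpu"},
    )
    print(settings)
```

In this sketch, `log_level` comes from the command-line layer, `cache_dir` from the environment layer, and `cluster_type` from the configuration file, because each is taken from the highest-priority layer that sets it.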

Configuration File Locations#

Table 15 Configuration File Search Order#

Location              Description
--------------------  ----------------------------------------
./config/             Current working directory config folder
~/.nemo_curator/      User-specific configuration directory
/etc/nemo_curator/    System-wide configuration directory
Package defaults      Built-in default configurations
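
A search over these locations could look like the following sketch. It assumes that the first directory containing the requested file wins and that a miss means falling back to the package defaults; the helper `find_config_file` is hypothetical and is not a NeMo Curator function.

```python
from pathlib import Path

# Directories from the search-order table, in the order listed there.
SEARCH_DIRS = [
    Path("./config"),
    Path.home() / ".nemo_curator",
    Path("/etc/nemo_curator"),
]


def find_config_file(filename, search_dirs=SEARCH_DIRS):
    """Return the first existing path for `filename`, or None.

    Hypothetical helper; None means "use the package defaults".
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_config_file("deployment.yaml")` would return `config/deployment.yaml` if that file exists in the working directory, even when a copy also exists under `/etc/nemo_curator/`.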

Example Configuration Structure#

# Typical deployment configuration layout
config/
├── deployment.yaml          # Deployment-specific settings
├── storage.yaml             # Storage and credential configuration  
├── logging.yaml             # Logging configuration
└── modules/
    ├── deduplication.yaml   # Module-specific configs
    ├── classification.yaml
    └── filtering.yaml
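
Before launching a job, it can help to confirm that a configuration directory matches this layout. The sketch below only checks that the files named in the example exist; the helper `missing_config_files` is hypothetical, and a real deployment may not need every file.

```python
from pathlib import Path

# Relative paths taken from the example layout above.
EXPECTED_FILES = [
    "deployment.yaml",
    "storage.yaml",
    "logging.yaml",
    "modules/deduplication.yaml",
    "modules/classification.yaml",
    "modules/filtering.yaml",
]


def missing_config_files(config_root="config"):
    """List layout files that are absent under `config_root` (hypothetical helper)."""
    root = Path(config_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```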

Quick Start Examples#

These examples demonstrate common configuration patterns for different deployment scenarios. Each example includes the essential environment variables and settings needed to get NeMo Curator running in that specific environment.

# Set basic environment variables
export DASK_CLUSTER_TYPE="cpu"
export NEMO_CURATOR_LOG_LEVEL="INFO"
export NEMO_CURATOR_CACHE_DIR="./cache"

# Production Slurm environment
export DASK_CLUSTER_TYPE="gpu"
export DASK_PROTOCOL="ucx"
export RMM_WORKER_POOL_SIZE="80GiB"
export NEMO_CURATOR_LOG_LEVEL="WARNING"

# AWS S3 configuration
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"
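
A pipeline script can verify that the variables from the S3 example are set before it starts reading data. The following is a minimal sketch using only the standard library; the helper `missing_env_vars` is hypothetical, and treating an empty value as missing is an assumption made here.

```python
import os

# Variable names taken from the AWS S3 example above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_AWS_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All S3 variables are set.")
```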