Configuration Guide#
Configure NeMo Curator for your deployment environment, including infrastructure settings, storage access, credentials, and environment variables. This section covers operational configuration for deploying and managing NeMo Curator.
For technical API documentation and development guidance, see the Infrastructure Reference.
Configuration Areas#
Operational configuration for NeMo Curator deployments falls into three areas: infrastructure deployment, data access, and runtime settings.
- Configure NeMo Curator for different deployment scenarios, including Slurm, Kubernetes, and local environments.
- Configure cloud storage access, API keys, and security credentials for data processing and model access.
- Consult the reference of all environment variables used by NeMo Curator across deployment scenarios.
Module-Specific Configuration#
Module-specific configuration controls processing pipeline settings for each data modality. It complements the deployment settings above and covers algorithm parameters, model configurations, and processing behavior rather than infrastructure.
For configuration of specific processing modules (deduplication, classifiers, filters), see the relevant modality sections:
- Configuration for text deduplication, classification, and filtering modules.
- Configuration for image classifiers, embedders, and filtering.
Configuration Hierarchy#
NeMo Curator reads settings from several levels. When the same setting is defined at more than one level, the higher-priority source overrides the lower one, so conflicts resolve the same way in every deployment environment.
NeMo Curator uses the following configuration precedence (highest to lowest priority):
1. Command-line arguments - Direct parameter overrides
2. Environment variables - Runtime configuration
3. Configuration files - YAML/JSON configuration files
4. Default values - Built-in defaults
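The precedence order above can be sketched in shell with nested parameter expansion. This is a minimal illustration, not NeMo Curator's actual resolution logic: `resolve_setting` is a hypothetical helper, and treating an empty value the same as an unset one is an assumption made for the example.

```shell
# Resolve one setting using the documented precedence:
# command-line value > environment variable > config-file value > built-in default.
# resolve_setting is a hypothetical helper, not part of NeMo Curator.
resolve_setting() {
    cli_value="$1"      # value passed on the command line (may be empty)
    env_value="$2"      # value of the environment variable (may be empty)
    file_value="$3"     # value read from a YAML/JSON file (may be empty)
    default_value="$4"  # built-in default

    # ${a:-b} expands to b when a is unset or empty, so the first
    # non-empty value in priority order wins.
    printf '%s\n' "${cli_value:-${env_value:-${file_value:-$default_value}}}"
}

# Example: no command-line override, so the environment variable (if set)
# takes priority over the file value "DEBUG" and the default "INFO".
resolve_setting "" "${NEMO_CURATOR_LOG_LEVEL:-}" "DEBUG" "INFO"
```

If `NEMO_CURATOR_LOG_LEVEL` is exported, its value is printed; otherwise the file value is used, and the default applies only when every other source is empty.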
Configuration File Locations#
| Location | Description |
|---|---|
| | Current working directory config folder |
| | User-specific configuration directory |
| | System-wide configuration directory |
| Package defaults | Built-in default configurations |
Example Configuration Structure#
# Typical deployment configuration layout
config/
├── deployment.yaml          # Deployment-specific settings
├── storage.yaml             # Storage and credential configuration
├── logging.yaml             # Logging configuration
└── modules/
    ├── deduplication.yaml   # Module-specific configs
    ├── classification.yaml
    └── filtering.yaml
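To experiment with the layout above, the directory skeleton can be created with a few shell commands. This is only a sketch: `scaffold_config` is a hypothetical helper, and the files it creates are empty placeholders whose contents still need to be written.

```shell
# Create the example layout under a base directory (defaults to the current directory).
# scaffold_config is a hypothetical helper; it creates empty placeholder files only.
scaffold_config() {
    base_dir="${1:-.}"
    mkdir -p "$base_dir/config/modules"
    # Top-level deployment, storage, and logging files.
    for name in deployment storage logging; do
        touch "$base_dir/config/$name.yaml"
    done
    # Module-specific files.
    for name in deduplication classification filtering; do
        touch "$base_dir/config/modules/$name.yaml"
    done
}

# Usage: build the layout in a scratch directory and list what was created.
scratch_dir="$(mktemp -d)"
scaffold_config "$scratch_dir"
find "$scratch_dir/config" -type f | sort
```

Because `touch` does not truncate existing files, rerunning the helper against a directory that already holds configuration leaves that content untouched.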
Quick Start Examples#
These examples show common configuration patterns: a basic setup, a production Slurm environment, and AWS S3 access. Each one lists the essential environment variables needed to get NeMo Curator running in that environment.
# Set basic environment variables
export DASK_CLUSTER_TYPE="cpu"
export NEMO_CURATOR_LOG_LEVEL="INFO"
export NEMO_CURATOR_CACHE_DIR="./cache"
# Production Slurm environment
export DASK_CLUSTER_TYPE="gpu"
export DASK_PROTOCOL="ucx"
export RMM_WORKER_POOL_SIZE="80GiB"
export NEMO_CURATOR_LOG_LEVEL="WARNING"
# AWS S3 configuration
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"
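Before launching a job, it can help to confirm that the variables from the examples above are actually set. The following is a minimal sketch under that assumption: `check_required_env` is a hypothetical helper, not a NeMo Curator command, and it reports variable names only, never their values.

```shell
# Report which of the named environment variables are unset or empty.
# check_required_env is a hypothetical helper, not a NeMo Curator command.
check_required_env() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        # Print names only so that secrets never reach the logs.
        echo "Missing required environment variables:$missing" >&2
        return 1
    fi
    return 0
}

# Usage: verify the S3 settings before starting a job.
if check_required_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; then
    echo "S3 environment looks complete"
else
    echo "S3 environment is incomplete"
fi
```

The same helper can be called with other names from the examples, such as `DASK_CLUSTER_TYPE` or `NEMO_CURATOR_LOG_LEVEL`.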