Deployment Environment Configuration#

Configure NeMo Curator for different deployment environments including local development, Slurm clusters, and Kubernetes. This guide focuses on deployment-specific settings and operational concerns.

For technical API documentation and implementation details, see the Infrastructure Reference.

Tip

Applying These Configurations: This guide shows you how to configure NeMo Curator for different environments. To learn how to deploy and run NeMo Curator in these environments, see the corresponding deployment guides.


Deployment Scenarios#

Local Development Environment#

Basic configuration for single-machine development and testing.

# Environment variables for local CPU development
export DASK_CLUSTER_TYPE="cpu"
export DASK_N_WORKERS="4"
export DASK_THREADS_PER_WORKER="2"
export DASK_MEMORY_LIMIT="4GB"
export NEMO_CURATOR_LOG_LEVEL="INFO"
export NEMO_CURATOR_CACHE_DIR="./cache"

# Environment variables for local GPU development
export DASK_CLUSTER_TYPE="gpu"
export DASK_PROTOCOL="tcp"
export RMM_WORKER_POOL_SIZE="4GB"
export CUDF_SPILL="1"
export NEMO_CURATOR_LOG_LEVEL="DEBUG"
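
If you prefer to set this up from Python, the sketch below maps the variables above onto a get_client call (introduced later in this guide). Treat the mapping as illustrative; the exact variables NeMo Curator reads internally may differ.

import os

from nemo_curator.utils.distributed_utils import get_client

# Illustrative only: map the shell variables above onto get_client arguments.
client = get_client(
    cluster_type=os.environ.get("DASK_CLUSTER_TYPE", "cpu"),
    n_workers=int(os.environ.get("DASK_N_WORKERS", "4")),
    threads_per_worker=int(os.environ.get("DASK_THREADS_PER_WORKER", "2")),
    memory_limit=os.environ.get("DASK_MEMORY_LIMIT", "4GB"),
)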

Production Slurm Environment#

Optimized configuration for Slurm-managed GPU clusters.

# Production Slurm environment variables
export DEVICE="gpu"
export PROTOCOL="ucx"  # Use UCX for multi-GPU communication
export INTERFACE="ib0"  # InfiniBand interface if available
export CPU_WORKER_MEMORY_LIMIT="0"  # No memory limit
export RAPIDS_NO_INITIALIZE="0"
export CUDF_SPILL="0"  # Disable spilling for performance
export RMM_SCHEDULER_POOL_SIZE="1GB"
export RMM_WORKER_POOL_SIZE="80GiB"  # 80-90% of GPU memory
export LIBCUDF_CUFILE_POLICY="ON"  # Enable GPUDirect Storage

# High-performance Slurm configuration
export DEVICE="gpu"
export PROTOCOL="ucx"
export INTERFACE="ib0"
export UCX_MEMTYPE_CACHE="n"  # Disable UCX memory type cache
export UCX_TLS="rc,cuda_copy,cuda_ipc"  # Optimized transport layers
export RMM_WORKER_POOL_SIZE="90GiB"  # Maximum GPU memory allocation
export CUDF_SPILL="0"
export LIBCUDF_CUFILE_POLICY="ON"
export NEMO_CURATOR_LOG_LEVEL="WARNING"  # Reduce logging overhead

Use the second, high-performance configuration when you need maximum performance on large clusters.
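
Inside a Slurm job, the client and workers typically rendezvous through a scheduler file on shared storage (see the Slurm integration variables later in this guide). A minimal sketch, assuming a Dask scheduler is already running and has written its connection details to a shared path:

from nemo_curator.utils.distributed_utils import get_client

# Assumes the scheduler start script wrote scheduler.json to shared storage
# (for example $LOGDIR/scheduler.json).
client = get_client(scheduler_file="/shared/scheduler.json")
print(f"Workers available: {len(client.scheduler_info()['workers'])}")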

Kubernetes Environment#

Configuration for Kubernetes deployments with Dask Operator.

# kubernetes-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nemo-curator-config
data:
  DASK_CLUSTER_TYPE: "kubernetes"
  PROTOCOL: "tcp"
  RMM_WORKER_POOL_SIZE: "16GB"
  CUDF_SPILL: "1"
  NEMO_CURATOR_LOG_LEVEL: "INFO"
  NEMO_CURATOR_CACHE_DIR: "/shared/cache"

# gpu-kubernetes-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nemo-curator-gpu-config
data:
  DEVICE: "gpu"
  PROTOCOL: "ucx"
  RMM_WORKER_POOL_SIZE: "32GB"
  CUDF_SPILL: "0"
  RAPIDS_NO_INITIALIZE: "0"
  LIBCUDF_CUFILE_POLICY: "OFF"  # Usually not available in K8s
  NEMO_CURATOR_LOG_LEVEL: "WARNING"
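
These ConfigMaps are typically injected into the Dask pods as environment variables (for example via envFrom in the pod spec). Once injected, a client in the same namespace can connect to the scheduler service by address. The service name below is a placeholder and should match your Dask deployment:

import os

from nemo_curator.utils.distributed_utils import get_client

# "dask-scheduler" is a placeholder service name; adjust to your deployment.
scheduler_address = os.environ.get("DASK_SCHEDULER_ADDRESS", "tcp://dask-scheduler:8786")
client = get_client(scheduler_address=scheduler_address)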

Dask Cluster Configuration#

Cluster Connection Methods#

from nemo_curator.utils.distributed_utils import get_client

# Connect to existing scheduler
client = get_client(scheduler_address="tcp://scheduler:8786")

# Using scheduler file (common in Slurm)
client = get_client(scheduler_file="/shared/scheduler.json")

# Create local CPU cluster
client = get_client(
    cluster_type="cpu",
    n_workers=4,
    threads_per_worker=2,
    memory_limit="4GB"
)

# Create local GPU cluster
client = get_client(
    cluster_type="gpu",
    rmm_pool_size="8GB",
    enable_spilling=True
)
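
When a script finishes, shut the connection down cleanly so workers release memory (and GPU pools) promptly. client.close() disconnects the client; Client.shutdown() also stops the scheduler and workers, which is appropriate for clusters created locally by get_client.

# Clean up at the end of a job
client.close()       # Disconnect this client from the cluster
# client.shutdown()  # Additionally stop the scheduler and workers (local clusters)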

Cluster Sizing Guidelines#

Table 16 Recommended Cluster Configurations#

| Use Case | Workers | Memory per Worker | GPU Memory Pool |
|---|---|---|---|
| Development | 1-2 | 4-8 GB | 2-4 GB |
| Small Production | 4-8 | 16-32 GB | 16-32 GB |
| Large Production | 16-64 | 32-128 GB | 64-90 GB |
| Massive Scale | 64+ | 128+ GB | 80-90 GB |
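
When scripting cluster startup, the table can be condensed into simple presets. The preset values below are taken from the table; the helper itself is a convenience sketch, not part of NeMo Curator.

from nemo_curator.utils.distributed_utils import get_client

# Presets mirror Table 16; adjust worker counts and memory to your hardware.
CLUSTER_PRESETS = {
    "development":      {"n_workers": 2,  "memory_limit": "8GB"},
    "small_production": {"n_workers": 8,  "memory_limit": "32GB"},
    "large_production": {"n_workers": 32, "memory_limit": "64GB"},
}

preset = CLUSTER_PRESETS["development"]
client = get_client(cluster_type="cpu", threads_per_worker=2, **preset)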


GPU Memory Management#

RMM Pool Configuration#

Configure RAPIDS Memory Manager for optimal GPU memory usage:

# Conservative setup (development)
export RMM_WORKER_POOL_SIZE="4GB"
export CUDF_SPILL="1"

Recommended for development and testing environments.

# Aggressive setup (production)
export RMM_WORKER_POOL_SIZE="80GiB"  # 80-90% of GPU memory
export CUDF_SPILL="0"  # Disable spilling for performance

Optimized for production environments with dedicated GPU resources.
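
The same trade-off can be expressed when creating a local GPU cluster from Python: a small pool with spilling enabled for development, or a large pool with spilling disabled for production. This mirrors the get_client GPU example shown earlier.

from nemo_curator.utils.distributed_utils import get_client

# Conservative: small pool, spilling enabled (development)
client = get_client(cluster_type="gpu", rmm_pool_size="4GB", enable_spilling=True)

# Aggressive: large pool, spilling disabled (production, dedicated GPUs)
# client = get_client(cluster_type="gpu", rmm_pool_size="80GiB", enable_spilling=False)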

Memory Pool Sizing#

Table 17 RMM Pool Sizing Guidelines#

| GPU Memory | Conservative Pool | Balanced Pool | Aggressive Pool |
|---|---|---|---|
| 16 GB | 8 GB | 12 GB | 14 GB |
| 32 GB | 16 GB | 24 GB | 28 GB |
| 80 GB (A100) | 40 GB | 64 GB | 72 GB |
| 128 GB (H100) | 64 GB | 96 GB | 115 GB |
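
These guidelines roughly correspond to reserving about 50% (conservative), 75-80% (balanced), or ~90% (aggressive) of total GPU memory for the RMM pool. A purely illustrative helper that derives a pool size string from total GPU memory:

# Illustrative helper: derive an RMM pool size from total GPU memory.
POOL_FRACTIONS = {"conservative": 0.50, "balanced": 0.75, "aggressive": 0.90}

def rmm_pool_size(gpu_memory_gib: int, profile: str = "balanced") -> str:
    return f"{int(gpu_memory_gib * POOL_FRACTIONS[profile])}GiB"

print(rmm_pool_size(80, "aggressive"))  # "72GiB" for an 80 GB A100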


Networking Configuration#

Protocol Selection#

Table 18 Network Protocol Recommendations#

| Deployment Type | Recommended Protocol | Performance | Requirements |
|---|---|---|---|
| Single Machine | TCP | Good | None |
| Multi-Node CPU | TCP | Good | Standard networking |
| Multi-Node GPU | UCX | Excellent | UCX-enabled cluster |
| InfiniBand Cluster | UCX | Excellent | InfiniBand + UCX |

Network Interface Selection#

# Ethernet (most common)
export INTERFACE="eth0"

# InfiniBand (high-performance clusters)
export INTERFACE="ib0"

# Auto-detect (let Dask choose)
export INTERFACE=""  # Empty string for auto-detection
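
For illustration only, this is roughly how the protocol and interface choices above map onto a Dask-CUDA cluster. NeMo Curator's own cluster start scripts normally handle this for you, so treat this as background rather than a required step.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Background sketch: protocol and interface flow into the Dask-CUDA cluster.
cluster = LocalCUDACluster(
    protocol="ucx",        # UCX for multi-GPU / InfiniBand communication
    interface="ib0",       # Network interface, e.g. "eth0" or "ib0"
    rmm_pool_size="16GB",  # Per-worker RMM pool
)
client = Client(cluster)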

Logging and Monitoring#

Deployment-Specific Logging#

# Development / debugging
export NEMO_CURATOR_LOG_LEVEL="DEBUG"
export NEMO_CURATOR_LOG_DIR="./logs"
export DASK_LOGGING__DISTRIBUTED="debug"

# Production
export NEMO_CURATOR_LOG_LEVEL="WARNING"
export NEMO_CURATOR_LOG_DIR="/shared/logs"
export DASK_LOGGING__DISTRIBUTED="warning"

Log Directory Structure#

# Typical production log structure
/shared/logs/
├── scheduler.log          # Dask scheduler logs
├── worker-*.log          # Individual worker logs
├── nemo-curator.log      # Application logs
└── performance/          # Performance profiles
    ├── scheduler.html
    └── worker-*.html
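
If your driver script emits its own logs, you can point them at the same directory. A minimal sketch using Python's standard logging module (the file name is arbitrary):

import logging
import os

log_dir = os.environ.get("NEMO_CURATOR_LOG_DIR", "./logs")
os.makedirs(log_dir, exist_ok=True)

logging.basicConfig(
    filename=os.path.join(log_dir, "nemo-curator.log"),
    level=os.environ.get("NEMO_CURATOR_LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)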

Environment-Specific Optimizations#

# Slurm job integration
export SLURM_JOB_ID="${SLURM_JOB_ID}"
export LOGDIR="${SLURM_SUBMIT_DIR}/logs"
export SCHEDULER_FILE="${LOGDIR}/scheduler.json"

# Slurm-aware resource allocation
export DASK_N_WORKERS="${SLURM_NTASKS}"
export DASK_MEMORY_LIMIT="${SLURM_MEM_PER_NODE}MB"

# Kubernetes pod integration
export K8S_NAMESPACE="${MY_POD_NAMESPACE}"
export K8S_POD_NAME="${MY_POD_NAME}"
export DASK_SCHEDULER_ADDRESS="tcp://dask-scheduler:8786"

# Kubernetes resource limits
export DASK_MEMORY_LIMIT="${MEMORY_LIMIT}"
export RMM_WORKER_POOL_SIZE="${GPU_MEMORY_LIMIT}"
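
A driver script can use these variables to pick the right connection method at runtime: a scheduler file under Slurm, a scheduler address under Kubernetes. Sketch only; the variable names follow the snippets above.

import os

from nemo_curator.utils.distributed_utils import get_client

# Prefer a Slurm-style scheduler file if present, else a Kubernetes service address.
if os.environ.get("SCHEDULER_FILE"):
    client = get_client(scheduler_file=os.environ["SCHEDULER_FILE"])
elif os.environ.get("DASK_SCHEDULER_ADDRESS"):
    client = get_client(scheduler_address=os.environ["DASK_SCHEDULER_ADDRESS"])
else:
    client = get_client(cluster_type="cpu")  # Fall back to a local cluster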

Validation and Testing#

from nemo_curator.utils.distributed_utils import get_client

# Test cluster connection
client = get_client()
print(f"✓ Connected to cluster: {client}")
print(f"✓ Workers: {len(client.scheduler_info()['workers'])}")
print(f"✓ Dashboard: {client.dashboard_link}")
# Test GPU availability and configuration
try:
    import cudf
    df = cudf.DataFrame({"test": [1, 2, 3]})
    print("✓ GPU processing available")
    
    # Test RMM configuration
    import rmm
    print(f"✓ RMM pool size: {rmm.get_current_device_resource()}")
except ImportError as e:
    print(f"⚠ GPU not available: {e}")