Production Deployment Requirements

This page lists the system, hardware, software, network, and storage requirements for deploying NeMo Curator in production environments.

System Requirements

  • Operating System: Ubuntu 22.04 or 20.04 (recommended)

  • Python: Python 3.10, 3.11, or 3.12

    • packaging >= 22.0
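The version constraints above can be checked before deployment with a short preflight script. The following is a minimal sketch using only the standard library; the helper names `python_supported` and `packaging_supported` are hypothetical and are not part of NeMo Curator.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the requirements listed above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_PACKAGING = (22, 0)


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is supported."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def packaging_supported(installed=None):
    """Return True if `packaging` is installed at version 22.0 or later.

    `installed` is a version string; when omitted, it is looked up from
    the current environment.
    """
    if installed is None:
        try:
            installed = version("packaging")
        except PackageNotFoundError:
            return False
    # Compare only the leading numeric major.minor components.
    match = re.match(r"(\d+)(?:\.(\d+))?", installed)
    if match is None:
        return False
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= MIN_PACKAGING


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("packaging supported:", packaging_supported())
```

Running the script on a target node prints one line per check, which makes it easy to include in a provisioning step.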

Hardware Requirements

CPU Requirements

  • Multi-core CPU with sufficient cores for parallel processing

  • Memory: At least 16 GB RAM for text processing

    • For large datasets: 32 GB or more RAM recommended

    • Memory requirements scale with dataset size and number of workers
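Because memory requirements grow with the number of workers, it can help to estimate how many workers a node can host before launching a job. The sketch below is illustrative only: the function `max_workers_for_memory`, the 4 GB per-worker budget, and the 4 GB reserved for the operating system are assumptions, not values documented for NeMo Curator.

```python
def max_workers_for_memory(total_ram_gb, per_worker_gb, reserved_gb=4.0):
    """Estimate how many workers fit in RAM.

    total_ram_gb:  physical memory on the node
    per_worker_gb: assumed memory budget for each worker
    reserved_gb:   assumed headroom left for the OS and other processes
    """
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    usable_gb = total_ram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return int(usable_gb // per_worker_gb)


if __name__ == "__main__":
    for ram_gb in (16, 32, 64):
        workers = max_workers_for_memory(ram_gb, per_worker_gb=4)
        print(f"{ram_gb} GB RAM -> {workers} workers at 4 GB each")
```

Measure the real per-worker footprint on a representative sample of the dataset and substitute it for the assumed budget.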

Software Dependencies

Core Dependencies

  • Python 3.10, 3.11, or 3.12 with the packages required for distributed computing

  • RAPIDS libraries (cuDF) for GPU-accelerated deduplication operations

Network Requirements

  • Reliable network connectivity between nodes

  • High-bandwidth network for large dataset transfers

  • InfiniBand recommended for multi-node GPU clusters

Storage Requirements

  • Capacity: Provision 3-5x the size of the input datasets to cover:

    • Input data storage

    • Intermediate processing files

    • Output data storage

  • Performance: High-throughput storage system recommended

    • SSD storage preferred for frequently accessed data

    • Parallel filesystem for multi-node access
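The 3-5x capacity guideline above can be turned into a quick check against the free space on a target volume. This is a minimal sketch using only the standard library; the helper names and the default multiplier of 5 (the conservative end of the range) are assumptions for illustration.

```python
import shutil


def required_storage_bytes(input_bytes, multiplier=5.0):
    """Storage needed for input, intermediate, and output data combined."""
    return int(input_bytes * multiplier)


def has_enough_space(path, input_bytes, multiplier=5.0):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_storage_bytes(input_bytes, multiplier)


if __name__ == "__main__":
    two_tb = 2 * 10**12  # example input dataset size: 2 TB
    low = required_storage_bytes(two_tb, multiplier=3.0)
    high = required_storage_bytes(two_tb, multiplier=5.0)
    print(f"Provision between {low / 10**12:.0f} TB and {high / 10**12:.0f} TB")
    print("Enough space in current directory:", has_enough_space(".", two_tb))
```

For a 2 TB input dataset, the guideline works out to between 6 TB and 10 TB of total storage.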

Deployment-Specific Requirements

  • Resource quotas configured for GPU and memory allocation

Performance Considerations

Memory Management

  • Monitor memory usage across distributed workers

  • Configure appropriate memory limits per worker

  • Use memory-efficient data formats (e.g., Parquet)
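One simple way to monitor a worker's memory against a configured limit is to read the process's peak resident set size. The following is a sketch using the standard-library `resource` module, which is available on Unix-like systems such as Ubuntu; the helper names `peak_rss_bytes` and `over_limit` are hypothetical and not part of NeMo Curator.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    return peak if sys.platform == "darwin" else peak * 1024


def over_limit(limit_bytes, usage_bytes=None):
    """Return True if usage exceeds the per-worker limit.

    When `usage_bytes` is omitted, the current process's peak RSS is used.
    """
    if usage_bytes is None:
        usage_bytes = peak_rss_bytes()
    return usage_bytes > limit_bytes


if __name__ == "__main__":
    limit = 4 * 1024**3  # assumed per-worker limit of 4 GiB
    print(f"Peak RSS: {peak_rss_bytes() / 1024**2:.1f} MiB")
    print("Over limit:", over_limit(limit))
```

Calling a helper like this inside each worker and logging the result gives a per-worker view that can be aggregated across the cluster.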

GPU Optimization

  • Ensure the installed CUDA driver is compatible with the installed RAPIDS version

  • Configure RAPIDS Memory Manager (RMM) GPU memory pools for optimal performance

  • Monitor GPU utilization and memory usage
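GPU utilization and memory usage can be sampled by querying `nvidia-smi` in CSV mode. The sketch below assumes `nvidia-smi` is on the `PATH` of the node; the helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical. Parsing is kept separate from the subprocess call so it can be exercised without a GPU.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    Assumes every field is numeric; some GPUs report "[N/A]" for certain
    fields, which this sketch does not handle.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "utilization_pct": int(util),
                "memory_used_mib": int(used),
                "memory_total_mib": int(total),
            }
        )
    return stats


def query_gpu_stats():
    """Run nvidia-smi and return one dict per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_stats():
        print(gpu)
```

Sampling these values on an interval during a run shows whether GPUs are idle or close to their memory capacity.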

Network Optimization

  • Use high-bandwidth interconnects for multi-node deployments

  • Choose the network protocol (TCP or UCX) that suits the cluster interconnect

  • Optimize data transfer patterns to minimize network overhead