---
description: >-
  Comprehensive system, hardware, and software requirements for deploying NeMo Curator in production environments
categories:
  - reference
tags:
  - requirements
  - system-requirements
  - hardware
  - software
  - gpu
  - storage
personas:
  - admin-focused
  - devops-focused
difficulty: reference
content_type: reference
modality: universal
---

# Production Deployment Requirements

This page details the comprehensive system, hardware, and software requirements for deploying NeMo Curator in production environments.

## System Requirements

* **Operating System**: Ubuntu 22.04/20.04 (recommended)
* **Python**: Python 3.10, 3.11, or 3.12
  * packaging >= 22.0

## Hardware Requirements

### CPU Requirements

* Multi-core CPU with sufficient cores for parallel processing
* **Memory**: Minimum 16GB RAM recommended for text processing
  * For large datasets: 32GB+ RAM recommended
  * Memory requirements scale with dataset size and number of workers

### GPU Requirements (Optional but Recommended)

* **GPU**: NVIDIA GPU with Volta™ architecture or higher
  * Compute capability 7.0+ required
* **Memory**: Minimum 16GB VRAM for GPU-accelerated operations
  * For video processing: 21GB+ VRAM (reducible with optimization)
  * For large-scale deduplication: 32GB+ VRAM recommended
* **CUDA**: CUDA 12.0 or above with compatible drivers

## Software Dependencies

### Core Dependencies

* Python 3.10+ with required packages for distributed computing
* RAPIDS libraries (cuDF) for GPU-accelerated deduplication operations

### Container Support (Recommended)

* **Docker** or **Podman** for containerized deployment
* Access to NVIDIA NGC registry for official containers

## Network Requirements

* Reliable network connectivity between nodes
* High-bandwidth network for large dataset transfers
* InfiniBand recommended for multi-node GPU clusters

## Storage Requirements

* **Capacity**: Storage capacity should be 3-5x the size of input datasets
  * Input data storage
  * Intermediate processing files
  * Output data storage
* **Performance**: High-throughput storage system recommended
  * SSD storage preferred for frequently accessed data
  * Parallel filesystem for multi-node access

## Deployment-Specific Requirements

* Resource quotas configured for GPU and memory allocation

## Performance Considerations

### Memory Management

* Monitor memory usage across distributed workers
* Configure appropriate memory limits per worker
* Use memory-efficient data formats (e.g., Parquet)

### GPU Optimization

* Ensure CUDA drivers are compatible with RAPIDS versions
* Configure GPU memory pools (RMM) for optimal performance
* Monitor GPU utilization and memory usage

### Network Optimization

* Use high-bandwidth interconnects for multi-node deployments
* Configure appropriate network protocols (TCP vs UCX)
* Optimize data transfer patterns to minimize network overhead
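
## Example Preflight Check

Several requirements on this page are numeric thresholds (Python 3.10 to 3.12, compute capability 7.0+, storage at 3-5x the input size), so they can be checked before a deployment. The following is a minimal sketch using only the Python standard library. It is not part of NeMo Curator: every function and constant name here is hypothetical, and the compute capability is assumed to be supplied by the caller (for example, read from the GPU vendor tooling) rather than detected.

```python
import shutil
import sys

# Thresholds copied from this page; all names below are hypothetical helpers.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta architecture or higher
STORAGE_FACTOR_RANGE = (3, 5)    # 3-5x the size of the input datasets


def python_supported(version=None):
    """Return True if (major, minor) is one of the versions listed on this page."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def compute_capability_ok(major, minor):
    """Return True if the caller-supplied compute capability is 7.0 or higher."""
    return (major, minor) >= MIN_COMPUTE_CAPABILITY


def storage_needed_bytes(input_bytes):
    """Return the (low, high) storage estimate for a given input dataset size."""
    low, high = STORAGE_FACTOR_RANGE
    return input_bytes * low, input_bytes * high


def storage_ok(path, input_bytes):
    """Return True if free space at `path` covers at least the low (3x) estimate."""
    low, _ = storage_needed_bytes(input_bytes)
    return shutil.disk_usage(path).free >= low


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print("Python version supported:", python_supported())
    print("Storage OK for a 1 GiB input:", storage_ok(".", one_gib))
```

The checks are kept as small pure functions so they can be reused in a job launcher or a CI step. RAM and VRAM checks are left out because reading them portably needs platform-specific or third-party tooling.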