Cluster Sizing Guide
This sizing guide provides hardware and network recommendations for Cable Validation deployments based on cluster size. The recommendations are derived from performance analysis of enterprise deployments and observed resource-utilization patterns.
Important Note: The Cable Validation Tool (CVT) manages both switches and hosts in modern deployments. Legacy naming in the codebase (e.g., "SwitchAgentMgr", "switch_ip") reflects the tool's origins, when CVT handled only switches; these names now apply to all managed devices (switches, hosts, HCAs, etc.).
Key Factors:
Device Overload Threshold: Individual devices (switches/hosts) can handle ~5-10 concurrent REST API calls
Network Bandwidth: 10G MGMT interface provides ~800-900 MB/s practical throughput
CPU Utilization: Target 15-25 load average for optimal performance
Memory Requirements: ~55-115 MB per 1,000 devices for topology, batching, and device metadata (see the memory breakdown below)
Batch Processing: Optimal batch sizes scale with worker count
Mixed Workloads: Switches and hosts may have different response characteristics
🎯 Simple 3-Variable Configuration
CVT performance can be optimized with just three environment variables:
```bash
# 1. Agent Deployment (~200MB image + local container operations)
export CVT_DEPLOYMENT_MAX_WORKERS=60

# 2. Everything Else (validation, connectivity, DNS, etc.)
export CVT_MAX_WORKERS=150

# 3. Batching Control (when to split large deployments)
export CVT_BATCHING_THRESHOLD=10000
```
Note: Agent deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. Higher worker counts are possible because the process isn't purely bandwidth-limited.
See the Simple Tuning Guide for detailed configuration guidance.
| Cluster Size | Recommended CPUs | Recommended Memory | Recommended MAX_WORKERS | DEPLOYMENT_MAX_WORKERS | MGMT Bandwidth | Expected Time | Notes |
|---|---|---|---|---|---|---|---|
| Small Clusters (1-1,000 devices) | | | | | | | |
| 100 devices | 4-8 cores | 4-8 GB | 30-50 | 20 | 1G | 30-60 seconds | Single server, basic setup |
| 500 devices | 8-16 cores | 8-16 GB | 50-75 | 20-40 | 1G | 1-2 minutes | Development/test environment |
| 1,000 devices | 16-32 cores | 16-32 GB | 50-100 | 20-40 | 1G | 2-3 minutes | Small production deployment |
| Medium Clusters (1,000-10,000 devices) | | | | | | | |
| 2,500 devices | 32-64 cores | 32-64 GB | 75-100 | 40 | 10G | 2-3 minutes | Regional deployment |
| 5,000 devices | 64-128 cores | 64-128 GB | 100-150 | 40-60 | 10G | 3-5 minutes | Large regional deployment |
| 7,500 devices | 96-192 cores | 96-192 GB | 125-175 | 60 | 10G | 4-6 minutes | Multi-site deployment |
| 10,000 devices | 128-256 cores | 128-256 GB | 150-200 | 60 | 10G | 5-8 minutes | Enterprise deployment |
| Large Clusters (10,000-25,000 devices) | | | | | | | |
| 15,000 devices | 192-384 cores | 192-384 GB | 175-225 | 60-80 | 25G+ | 6-10 minutes | Large enterprise |
| 20,000 devices | 256-512 cores | 256-512 GB | 200-250 | 80 | 25G+ | 8-12 minutes | Hyperscale deployment |
| 25,000 devices | 320-640 cores | 320-640 GB | 225-275 | 80-100 | 40G+ | 10-15 minutes | Hyperscale deployment |
| Hyperscale Clusters (25,000+ devices) | | | | | | | |
| 30,000 devices | 384-768 cores | 384-768 GB | 250-300 | 100-120 | 40G+ | 12-18 minutes | Hyperscale datacenter |
| 35,000 devices | 448-896 cores | 448-896 GB | 275-325 | 120-140 | 40G+ | 15-20 minutes | Hyperscale datacenter |
| 40,000 devices | 512-1024 cores | 512 GB-1 TB | 300-350 | 140-160 | 40G+ | 18-25 minutes | Massive hyperscale |
Small Clusters (1-1,000 devices)
Characteristics:
Single server deployment
Basic network infrastructure
Development/test environments
Device Mix: Primarily switches, some hosts/HCAs
Sizing Logic:
CPU: 1 core per 25-50 devices
Memory: 10-20 MB per device for topology data
Workers: Conservative scaling to avoid device overload
Network: 1G sufficient for small clusters
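These ratios can be turned into a quick back-of-the-envelope check. The sketch below simply restates the small-cluster ratios above in shell arithmetic; the device count is an example, and the results are rough ranges to sanity-check against the sizing table, not CVT settings:

```bash
# Back-of-the-envelope sizing for a small cluster, using the ratios above.
DEVICES=1000

# CPU: 1 core per 25-50 devices
echo "Cores:  $((DEVICES / 50))-$((DEVICES / 25))"                    # 20-40 cores

# Memory: 10-20 MB per device for topology data
echo "Memory: $((DEVICES * 10 / 1024))-$((DEVICES * 20 / 1024)) GB"   # ~9-19 GB
```

The same arithmetic applies to the larger tiers, using the per-tier ratios listed in their sections.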
Medium Clusters (1,000-10,000 devices)
Characteristics:
Production deployments
10G management networks
Regional or multi-site deployments
Device Mix: Mixed switches and hosts, HCAs in compute clusters
Sizing Logic:
CPU: 1 core per 40-80 devices (better efficiency at scale)
Memory: 8-15 MB per device (shared topology data)
Workers: Balanced scaling considering device capacity
Network: 10G required for concurrent processing
Host Considerations: Hosts may respond differently than switches
Large Clusters (10,000-25,000 devices)
Characteristics:
Enterprise-scale deployments
High-performance requirements
25G+ management networks
Device Mix: Large numbers of compute hosts + infrastructure switches
Sizing Logic:
CPU: 1 core per 60-100 devices (enterprise efficiency)
Memory: 5-12 MB per device (optimized topology handling)
Workers: Approaching device overload thresholds
Network: 25G+ to handle concurrent load
Mixed Response: Account for different device response characteristics
Hyperscale Clusters (25,000+ devices)
Characteristics:
Massive datacenter deployments
Enterprise-grade hardware (e.g., 448-core servers)
40G+ management networks
Device Mix: Thousands of compute hosts + infrastructure switches
Sizing Logic:
CPU: 1 core per 80-120 devices (maximum efficiency)
Memory: 3-10 MB per device (highly optimized)
Workers: At or near device overload limits
Network: 40G+ essential for performance
Device Diversity: Must handle switches, hosts, HCAs, storage devices
Bandwidth Requirements by Cluster Size
| Cluster Size | Concurrent Workers | Peak Bandwidth Required | Network Recommendation | Device Types |
|---|---|---|---|---|
| 1,000 | 50 workers | ~50-100 MB/s | 1G (sufficient) | Switches + some hosts |
| 5,000 | 125 workers | ~200-400 MB/s | 1G (tight) / 10G (recommended) | Mixed switches/hosts |
| 10,000 | 175 workers | ~400-700 MB/s | 10G (required) | Balanced switches/hosts |
| 25,000 | 250 workers | ~600-900 MB/s | 10G (tight) / 25G (recommended) | Majority hosts + switches |
| 40,000 | 350 workers | ~800-1200 MB/s | 25G (minimum) / 40G (optimal) | Large compute + storage |
Bandwidth Calculation Logic:
Per Worker: ~2-4 MB/s during active validation startup
Peak Usage: During initial topology push to all devices
Sustained Usage: Much lower during normal validation operation
Burst Patterns: High bandwidth during startup, lower during monitoring
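As a worked example of this logic, multiplying the per-worker figure by the concurrent worker count gives the peak bandwidth to plan for (a sketch; the worker count below is the 10,000-device figure from the table above):

```bash
# Peak bandwidth estimate: concurrent workers x ~2-4 MB/s per worker.
WORKERS=175

echo "Peak bandwidth: $((WORKERS * 2))-$((WORKERS * 4)) MB/s"   # 350-700 MB/s
# Roughly matches the ~400-700 MB/s row for 10,000 devices, and fits within
# a 10G link's ~800-900 MB/s practical throughput.
```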
Key Insights:
Bandwidth is BURSTY: High during startup, low during validation
10G Limit: Starts getting tight around 10,000 devices
25G Sweet Spot: Good performance for 25,000-40,000 devices
40G Future-Proof: Optimal for large hyperscale deployments
Memory Breakdown by Component
| Component | Memory per 1,000 Devices | Notes |
|---|---|---|
| Topology Data | 20-40 MB | Device definitions, links, mixed switches/hosts |
| Batch Processing | 15-30 MB | Temporary data during processing |
| Connection Pools | 5-10 MB | HTTP session management |
| Results Storage | 10-20 MB | Validation results and reports |
| Device Metadata | 5-15 MB | Host-specific data, HCA mappings |
| Total | 55-115 MB | Per 1,000 devices (switches + hosts) |
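The per-1,000-device totals scale roughly linearly, so a first-pass estimate of CVT's own memory footprint is simple arithmetic (a sketch; the device count is an example, and this covers CVT's working set only, not the whole-server figures in the sizing table above):

```bash
# Estimate CVT's working memory from the 55-115 MB per 1,000 devices total.
DEVICES=25000

echo "CVT memory: $((DEVICES * 55 / 1000))-$((DEVICES * 115 / 1000)) MB"   # 1375-2875 MB
```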
Quick Performance Tuning
For detailed tuning instructions and troubleshooting, see the Simple Tuning Guide which provides:
Easy-to-follow configuration decisions
Monitoring guidance and success criteria
Troubleshooting common issues
System tuning for large deployments
CPU Optimization
Target Load: 15-25 average load during processing
NUMA Awareness: Use dual-socket servers for 20,000+ devices
Worker Scaling: Adjust CVT_MAX_WORKERS based on CPU cores and observed load
Device Mix: Account for different CPU requirements of switches vs hosts
Monitoring: If load stays low, increase workers; if timeout errors increase, reduce workers
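A minimal way to act on that monitoring guidance is to compare the current load average against the 15-25 target (a Linux sketch; it only prints a suggestion, and any change to CVT_MAX_WORKERS should be validated against timeout and error rates as described above):

```bash
# Compare the 1-minute load average against the 15-25 target and suggest a
# direction for CVT_MAX_WORKERS. Informational only; adjust and re-test.
LOAD=$(cut -d' ' -f1 /proc/loadavg)

if awk -v l="$LOAD" 'BEGIN { exit !(l < 15) }'; then
    echo "Load $LOAD < 15: CPU underutilized, consider raising CVT_MAX_WORKERS"
elif awk -v l="$LOAD" 'BEGIN { exit !(l > 25) }'; then
    echo "Load $LOAD > 25: above target, consider lowering CVT_MAX_WORKERS"
else
    echo "Load $LOAD is within the 15-25 target range"
fi
```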
Memory Optimization
Batching: Use CVT_BATCHING_THRESHOLD to control memory usage on large deployments
Connection Pooling: Automatically scales with worker count
Garbage Collection: Monitor for large deployments (20,000+ devices)
Device Metadata: Additional memory for host-specific data (HCA mappings, etc.)
Network Optimization
Bandwidth Planning: Set CVT_DEPLOYMENT_MAX_WORKERS based on management network capacity
Connection Reuse: Essential for large deployments (handled automatically)
Bandwidth Monitoring: Watch for saturation at scale
Device Response Variance: Hosts may respond differently than switches
Burst Patterns: High bandwidth during startup, lower during validation operation
Switches vs Hosts Performance Characteristics
| Device Type | Typical Response Time | Concurrent Call Limit | Special Considerations |
|---|---|---|---|
| Network Switches | 1-3 seconds | 5-10 concurrent | REST API on switch OS |
| Compute Hosts | 2-5 seconds | 3-8 concurrent | Agent on host OS, may be busier |
| Storage Devices | 1-4 seconds | 5-12 concurrent | Usually dedicated management |
| HCA Devices | 1-2 seconds | 8-15 concurrent | Lightweight agent |
Small Deployment (<1,000 devices)
```bash
# Network: 1G management interface
# Server: 8-32 cores
export CVT_DEPLOYMENT_MAX_WORKERS=20
export CVT_MAX_WORKERS=50

# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)
```
Medium Deployment (1,000-10,000 devices)
```bash
# Network: 10G management interface
# Server: 64-128 cores
export CVT_DEPLOYMENT_MAX_WORKERS=40
export CVT_MAX_WORKERS=100

# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)
```
Large Deployment (10,000-30,000 devices)
```bash
# Network: 25G+ management interface
# Server: 192-384 cores
export CVT_DEPLOYMENT_MAX_WORKERS=60
export CVT_MAX_WORKERS=150
export CVT_BATCHING_THRESHOLD=5000
```
Hyperscale Deployment (30,000+ devices)
```bash
# Network: 40G+ management interface
# Server: 448+ cores
export CVT_DEPLOYMENT_MAX_WORKERS=80
export CVT_MAX_WORKERS=200
export CVT_BATCHING_THRESHOLD=3000
```
CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment):
Based on network bandwidth and local container operations (~200MB per device image)
Deployment includes: image fetch, save to disk, load image, container creation
1G network: 20 workers (~4GB concurrent + local ops)
10G network: 60 workers (~12GB concurrent + local ops)
25G+ network: 100-160 workers (higher bandwidth + parallel local operations)
CVT_MAX_WORKERS (Validation Operations):
Based on server CPU cores and device capacity
8-32 cores: 30-50 workers
32-128 cores: 75-150 workers
128+ cores: 150-300 workers
Watch for device timeout errors and reduce if needed
CVT_BATCHING_THRESHOLD (Batch Processing):
<5,000 devices: 10000 (default, single batch)
5,000-20,000 devices: 5000 (light batching)
20,000+ devices: 3000 (aggressive batching)
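The decision rules above map onto a couple of simple shell checks. The sketch below only echoes the recommended ranges from this section; the management-network speed is an example value to fill in for your environment:

```bash
# Echo the recommended ranges for the two worker variables, based on server
# cores and management-network speed (values restate the guidance above).
CORES=$(nproc)
MGMT_GBPS=10   # management network speed in Gbit/s (example)

if   [ "$CORES" -le 32 ];  then echo "CVT_MAX_WORKERS: 30-50"
elif [ "$CORES" -le 128 ]; then echo "CVT_MAX_WORKERS: 75-150"
else                            echo "CVT_MAX_WORKERS: 150-300"
fi

case "$MGMT_GBPS" in
    1)  echo "CVT_DEPLOYMENT_MAX_WORKERS: 20" ;;
    10) echo "CVT_DEPLOYMENT_MAX_WORKERS: 60" ;;
    *)  echo "CVT_DEPLOYMENT_MAX_WORKERS: 100-160 (25G+)" ;;
esac
```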
Vertical Scaling Limits
Single Server: Effective up to ~40,000 devices (switches + hosts)
CPU Bound: Beyond 40,000 devices, consider distributed processing
Memory Bound: Rarely an issue with modern servers (hosts require slightly more memory)
Network Bound: Primary constraint for large deployments
Device Mix: Higher host percentage may require more resources
Horizontal Scaling Options
Multiple Collectors: Split clusters across multiple servers
Geographic Distribution: Regional collectors for global deployments
Load Balancing: Distribute devices across multiple validation instances
Typical Cluster Compositions
| Cluster Type | Switches % | Hosts % | Notes | Sizing Impact |
|---|---|---|---|---|
| Infrastructure-Heavy | 80% | 20% | Network-focused deployment | Lower memory, higher network load |
| Compute-Heavy | 30% | 70% | HPC/AI clusters | Higher memory, variable response times |
| Balanced | 50% | 50% | Mixed enterprise deployment | Standard sizing applies |
| Storage-Heavy | 40% | 60% | Storage clusters with many storage hosts | Higher memory, faster responses |
Sizing Adjustments by Device Mix
Infrastructure-Heavy Clusters (80% switches):
CPU: Use lower end of range
Memory: Use lower end of range
Workers: Can be more aggressive
Network: Higher bandwidth needs per device
Compute-Heavy Clusters (70% hosts):
CPU: Use higher end of range
Memory: Use higher end of range (HCA mappings, host metadata)
Workers: More conservative (hosts may be busier)
Network: Variable load patterns
Balanced Clusters (50/50 mix):
CPU: Use middle of range
Memory: Use middle of range
Workers: Standard recommendations apply
Network: Standard bandwidth planning
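If the switch/host split is known up front, a small check can point to which end of the ranges to use (a sketch; the 60% cut-offs are illustrative approximations of the compute-heavy and infrastructure-heavy profiles above, and the device counts are examples):

```bash
# Pick a sizing profile from the device mix (illustrative thresholds).
SWITCHES=3000
HOSTS=7000
TOTAL=$((SWITCHES + HOSTS))

if   [ $((HOSTS * 100 / TOTAL)) -ge 60 ]; then
    echo "Compute-heavy: higher end of CPU/memory ranges, conservative workers"
elif [ $((SWITCHES * 100 / TOTAL)) -ge 60 ]; then
    echo "Infrastructure-heavy: lower end of CPU/memory ranges, more aggressive workers"
else
    echo "Balanced: standard recommendations apply"
fi
```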
Monitoring and Alerting
Key Metrics to Monitor
Server Load: Target 15-25 during processing
Memory Usage: Should stay well below allocated
Network Utilization: Watch for bandwidth saturation (especially during agent deployment)
Device Response Times: Primary performance indicator
Error Rates: Timeout and connection errors
Validation Completion Time: Compare against expected times in sizing table
Success Indicators
Good Signs (can increase CVT_MAX_WORKERS):
✅ Server load increases during validation (better CPU utilization)
✅ Validation completes faster than baseline
✅ No significant increase in timeout errors
✅ Network bandwidth stays below 80%
Warning Signs (reduce CVT_MAX_WORKERS):
⚠️ Many "Timeout while trying to start validation" errors
⚠️ "Connection refused" errors from devices
⚠️ Server load stays low (underutilization)
⚠️ Network bandwidth hits 90%+
Scaling Triggers
Scale Up Workers: Load < 10, low error rates, fast completion
Scale Down Workers: High error rates, device timeouts, network saturation
Increase Deployment Workers: Network utilization < 50% during agent deployment
Decrease Deployment Workers: Network bandwidth > 80% during agent deployment
Adjust Batching: Memory usage > 80% (reduce CVT_BATCHING_THRESHOLD)
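For the deployment-phase triggers, the management interface's byte counters give a quick utilization estimate (a minimal Linux sketch; the interface name, link speed, and the assumption that the agent image leaves the CVT server via this interface's transmit side are placeholders to adapt):

```bash
# Estimate management-interface utilization during agent deployment (Linux).
IFACE=eth0        # example interface name
LINK_MBPS=10000   # example: 10G management network

TX1=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)
sleep 5
TX2=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)

UTIL=$(( (TX2 - TX1) * 8 / 5 / 1000000 * 100 / LINK_MBPS ))
echo "Management network utilization: ~${UTIL}%"
# <50% during deployment: room to raise CVT_DEPLOYMENT_MAX_WORKERS
# >80% during deployment: lower CVT_DEPLOYMENT_MAX_WORKERS
```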
Performance Expectations by Cluster Size
| Cluster Size | Expected Validation Time | Expected Load Average | Target Worker Count |
|---|---|---|---|
| 1,000 devices | 1-3 minutes | 8-12 | 50-100 |
| 5,000 devices | 3-5 minutes | 12-18 | 100-150 |
| 10,000 devices | 5-8 minutes | 15-22 | 150-200 |
| 25,000 devices | 10-15 minutes | 18-25 | 225-275 |
| 40,000 devices | 18-25 minutes | 20-28 | 300-350 |
Document Relationship
This Cluster Sizing Guide provides comprehensive hardware and infrastructure planning:
CPU, memory, and network capacity planning
Expected performance at different scales
Device type considerations (switches, hosts, HCAs)
Detailed sizing methodology
For day-to-day performance tuning, refer to the Simple Tuning Guide:
Simple 3-variable configuration
Quick-start configurations by deployment size
Troubleshooting and monitoring guidance
Practical tuning adjustments