NVIDIA UFM Cable Validation Tool v1.7.1

Cluster Sizing Guide

This sizing guide provides hardware and network recommendations for Cable Validation deployments based on cluster size. Recommendations are based on performance analysis of enterprise deployments and optimal resource utilization patterns.

Important Note: Cable Validation Tool (CVT) handles both switches and hosts in modern deployments. The legacy naming in the codebase (e.g., "SwitchAgentMgr", "switch_ip") reflects CVT's origins as a switch-only tool; those identifiers now apply to all managed devices (switches, hosts, HCAs, etc.).

Key Factors:

  • Device Overload Threshold: Individual devices (switches/hosts) can handle ~5-10 concurrent REST API calls

  • Network Bandwidth: 10G MGMT interface provides ~800-900 MB/s practical throughput

  • CPU Utilization: Target a load average of 15-25 for optimal performance

  • Memory Requirements: ~50-100 MB per 1,000 devices for topology and batch processing (see the quick estimate after this list)

  • Batch Processing: Optimal batch sizes scale with worker count

  • Mixed Workloads: Switches and hosts may have different response characteristics
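
For a quick capacity check, the per-1,000-device memory figure above can be scaled to a target cluster with plain shell arithmetic. The snippet below is illustrative only, not a CVT command, and the 10,000-device count is a placeholder:

# Rough memory estimate: ~50-100 MB per 1,000 devices (figure from the list above)
DEVICES=10000   # placeholder cluster size
echo "Estimated CVT working memory: $((DEVICES / 1000 * 50))-$((DEVICES / 1000 * 100)) MB"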

🎯 Simple 3-Variable Configuration

CVT performance can be optimized with just three environment variables:

# 1. Agent Deployment (~200MB image + local container operations)
export CVT_DEPLOYMENT_MAX_WORKERS=60

# 2. Everything Else (validation, connectivity, DNS, etc.)
export CVT_MAX_WORKERS=150

# 3. Batching Control (when to split large deployments)
export CVT_BATCHING_THRESHOLD=10000

Note: Agent deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. Higher worker counts are possible because the process isn't purely bandwidth-limited.

See the Simple Tuning Guide for detailed configuration guidance.
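
Before starting a run, you can confirm which of the three variables are set in the current shell. This is plain bash, not a CVT subcommand; unset variables simply fall back to CVT's defaults:

# Print the three tuning variables, flagging any that are unset
for v in CVT_DEPLOYMENT_MAX_WORKERS CVT_MAX_WORKERS CVT_BATCHING_THRESHOLD; do
  printf '%s=%s\n' "$v" "${!v:-<unset, CVT default applies>}"
done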

| Cluster Size | Recommended CPUs | Recommended Memory | Recommended MAX_WORKERS | DEPLOYMENT_MAX_WORKERS | MGMT Bandwidth | Expected Time | Notes |
|---|---|---|---|---|---|---|---|
| Small Clusters (1-1,000 devices) | | | | | | | |
| 100 devices | 4-8 cores | 4-8 GB | 30-50 | 20 | 1G | 30-60 seconds | Single server, basic setup |
| 500 devices | 8-16 cores | 8-16 GB | 50-75 | 20-40 | 1G | 1-2 minutes | Development/test environment |
| 1,000 devices | 16-32 cores | 16-32 GB | 50-100 | 20-40 | 1G | 2-3 minutes | Small production deployment |
| Medium Clusters (1,000-10,000 devices) | | | | | | | |
| 2,500 devices | 32-64 cores | 32-64 GB | 75-100 | 40 | 10G | 2-3 minutes | Regional deployment |
| 5,000 devices | 64-128 cores | 64-128 GB | 100-150 | 40-60 | 10G | 3-5 minutes | Large regional deployment |
| 7,500 devices | 96-192 cores | 96-192 GB | 125-175 | 60 | 10G | 4-6 minutes | Multi-site deployment |
| 10,000 devices | 128-256 cores | 128-256 GB | 150-200 | 60 | 10G | 5-8 minutes | Enterprise deployment |
| Large Clusters (10,000-25,000 devices) | | | | | | | |
| 15,000 devices | 192-384 cores | 192-384 GB | 175-225 | 60-80 | 25G+ | 6-10 minutes | Large enterprise |
| 20,000 devices | 256-512 cores | 256-512 GB | 200-250 | 80 | 25G+ | 8-12 minutes | Hyperscale deployment |
| 25,000 devices | 320-640 cores | 320-640 GB | 225-275 | 80-100 | 40G+ | 10-15 minutes | Hyperscale deployment |
| Hyperscale Clusters (25,000+ devices) | | | | | | | |
| 30,000 devices | 384-768 cores | 384-768 GB | 250-300 | 100-120 | 40G+ | 12-18 minutes | Hyperscale datacenter |
| 35,000 devices | 448-896 cores | 448-896 GB | 275-325 | 120-140 | 40G+ | 15-20 minutes | Hyperscale datacenter |
| 40,000 devices | 512-1024 cores | 512 GB-1 TB | 300-350 | 140-160 | 40G+ | 18-25 minutes | Massive hyperscale |

Small Clusters (1-1,000 devices)

Characteristics:

  • Single server deployment

  • Basic network infrastructure

  • Development/test environments

  • Device Mix: Primarily switches, some hosts/HCAs

Sizing Logic:

  • CPU: 1 core per 25-50 devices

  • Memory: 10-20 MB per device for topology data

  • Workers: Conservative scaling to avoid device overload

  • Network: 1G sufficient for small clusters

Medium Clusters (1,000-10,000 devices)

Characteristics:

  • Production deployments

  • 10G management networks

  • Regional or multi-site deployments

  • Device Mix: Mixed switches and hosts, HCAs in compute clusters

Sizing Logic:

  • CPU: 1 core per 40-80 devices (better efficiency at scale; see the worked example after this list)

  • Memory: 8-15 MB per device (shared topology data)

  • Workers: Balanced scaling considering device capacity

  • Network: 10G required for concurrent processing

  • Host Considerations: Hosts may respond differently than switches
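
As a worked example of the CPU ratio, applying 1 core per 40-80 devices to a 5,000-device cluster lands close to the 64-128 cores listed in the sizing table. The division below is illustrative shell arithmetic:

# 5,000 devices at 1 core per 40-80 devices -> ~62-125 cores (table lists 64-128)
DEVICES=5000
echo "Recommended CPU range: $((DEVICES / 80))-$((DEVICES / 40)) cores"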

Large Clusters (10,000-25,000 devices)

Characteristics:

  • Enterprise-scale deployments

  • High-performance requirements

  • 25G+ management networks

  • Device Mix: Large numbers of compute hosts + infrastructure switches

Sizing Logic:

  • CPU: 1 core per 60-100 devices (enterprise efficiency)

  • Memory: 5-12 MB per device (optimized topology handling)

  • Workers: Approaching device overload thresholds

  • Network: 25G+ to handle concurrent load

  • Mixed Response: Account for different device response characteristics

Hyperscale Clusters (25,000+ devices)

Characteristics:

  • Massive datacenter deployments

  • Enterprise-grade hardware (e.g., 448-core servers)

  • 40G+ management networks

  • Device Mix: Thousands of compute hosts + infrastructure switches

Sizing Logic:

  • CPU: 1 core per 80-120 devices (maximum efficiency)

  • Memory: 3-10 MB per device (highly optimized)

  • Workers: At or near device overload limits

  • Network: 40G+ essential for performance

  • Device Diversity: Must handle switches, hosts, HCAs, storage devices

Bandwidth Requirements by Cluster Size

| Cluster Size | Concurrent Workers | Peak Bandwidth Required | Network Recommendation | Device Types |
|---|---|---|---|---|
| 1,000 | 50 workers | ~50-100 MB/s | 1G (sufficient) | Switches + some hosts |
| 5,000 | 125 workers | ~200-400 MB/s | 1G (tight) / 10G (recommended) | Mixed switches/hosts |
| 10,000 | 175 workers | ~400-700 MB/s | 10G (required) | Balanced switches/hosts |
| 25,000 | 250 workers | ~600-900 MB/s | 10G (tight) / 25G (recommended) | Majority hosts + switches |
| 40,000 | 350 workers | ~800-1200 MB/s | 25G (minimum) / 40G (optimal) | Large compute + storage |

Bandwidth Calculation Logic:

  • Per Worker: ~2-4 MB/s during active validation startup (see the utilization sketch after this list)

  • Peak Usage: During initial topology push to all devices

  • Sustained Usage: Much lower during normal validation operation

  • Burst Patterns: High bandwidth during startup, lower during monitoring
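
To see how these figures translate into link utilization, the sketch below multiplies a worker count by the per-worker rate and compares it against an assumed usable fraction of line rate. The 70% factor is an assumption chosen to roughly match the ~800-900 MB/s practical 10G figure quoted earlier in this guide:

# Illustrative peak-bandwidth check for a 10,000-device run on a 10G MGMT link
WORKERS=175                                        # from the bandwidth table above
PER_WORKER_MBS_LOW=2; PER_WORKER_MBS_HIGH=4        # ~2-4 MB/s per worker at startup
LINK_GBIT=10
USABLE_MBS=$((LINK_GBIT * 1000 / 8 * 70 / 100))    # ~875 MB/s (assumed 70% of line rate)
echo "Peak: $((WORKERS * PER_WORKER_MBS_LOW))-$((WORKERS * PER_WORKER_MBS_HIGH)) MB/s of ~${USABLE_MBS} MB/s usable"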

Key Insights:

  • Bandwidth is BURSTY: High during startup, low during validation

  • 10G Limit: Starts getting tight around 10,000 devices

  • 25G Sweet Spot: Good performance for 25,000-40,000 devices

  • 40G Future-Proof: Optimal for large hyperscale deployments

Memory Breakdown by Component

| Component | Memory per 1,000 Devices | Notes |
|---|---|---|
| Topology Data | 20-40 MB | Device definitions, links, mixed switches/hosts |
| Batch Processing | 15-30 MB | Temporary data during processing |
| Connection Pools | 5-10 MB | HTTP session management |
| Results Storage | 10-20 MB | Validation results and reports |
| Device Metadata | 5-15 MB | Host-specific data, HCA mappings |
| Total | 55-115 MB | Per 1,000 devices (switches + hosts) |


Quick Performance Tuning

For detailed tuning instructions and troubleshooting, see the Simple Tuning Guide, which provides:

  • Easy-to-follow configuration decisions

  • Monitoring guidance and success criteria

  • Troubleshooting common issues

  • System tuning for large deployments

CPU Optimization

  • Target Load: 15-25 average load during processing

  • NUMA Awareness: Use dual-socket servers for 20,000+ devices

  • Worker Scaling: Adjust CVT_MAX_WORKERS based on CPU cores and observed load

  • Device Mix: Account for different CPU requirements of switches vs hosts

  • Monitoring: If load stays low, increase workers; if timeout errors increase, reduce workers

Memory Optimization

  • Batching: Use CVT_BATCHING_THRESHOLD to control memory usage on large deployments

  • Connection Pooling: Automatically scales with worker count

  • Garbage Collection: Monitor for large deployments (20,000+ devices)

  • Device Metadata: Additional memory for host-specific data (HCA mappings, etc.)

Network Optimization

  • Bandwidth Planning: Set CVT_DEPLOYMENT_MAX_WORKERS based on management network capacity

  • Connection Reuse: Essential for large deployments (handled automatically)

  • Bandwidth Monitoring: Watch for saturation at scale

  • Device Response Variance: Hosts may respond differently than switches

  • Burst Patterns: High bandwidth during startup, lower during validation operation

Switches vs Hosts Performance Characteristics

| Device Type | Typical Response Time | Concurrent Call Limit | Special Considerations |
|---|---|---|---|
| Network Switches | 1-3 seconds | 5-10 concurrent | REST API on switch OS |
| Compute Hosts | 2-5 seconds | 3-8 concurrent | Agent on host OS, may be busier |
| Storage Devices | 1-4 seconds | 5-12 concurrent | Usually dedicated management |
| HCA Devices | 1-2 seconds | 8-15 concurrent | Lightweight agent |

Small Deployment (<1,000 devices)

# Network: 1G management interface
# Server: 8-32 cores
export CVT_DEPLOYMENT_MAX_WORKERS=20
export CVT_MAX_WORKERS=50
# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Medium Deployment (1,000-10,000 devices)

# Network: 10G management interface
# Server: 64-128 cores
export CVT_DEPLOYMENT_MAX_WORKERS=40
export CVT_MAX_WORKERS=100
# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Large Deployment (10,000-30,000 devices)

# Network: 25G+ management interface
# Server: 192-384 cores
export CVT_DEPLOYMENT_MAX_WORKERS=60
export CVT_MAX_WORKERS=150
export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ devices)

# Network: 40G+ management interface
# Server: 448+ cores
export CVT_DEPLOYMENT_MAX_WORKERS=80
export CVT_MAX_WORKERS=200
export CVT_BATCHING_THRESHOLD=3000
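
If you manage several clusters, the four starting points above can be wrapped in a small shell helper. This is a hypothetical convenience function, not part of CVT; it simply exports the values shown in the blocks above based on a device count:

# Hypothetical helper: export starting values by cluster size (mirrors the blocks above)
cvt_starting_config() {
  local devices=$1
  if   [ "$devices" -le 1000 ];  then export CVT_DEPLOYMENT_MAX_WORKERS=20 CVT_MAX_WORKERS=50  CVT_BATCHING_THRESHOLD=10000
  elif [ "$devices" -le 10000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=40 CVT_MAX_WORKERS=100 CVT_BATCHING_THRESHOLD=10000
  elif [ "$devices" -le 30000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=60 CVT_MAX_WORKERS=150 CVT_BATCHING_THRESHOLD=5000
  else                                export CVT_DEPLOYMENT_MAX_WORKERS=80 CVT_MAX_WORKERS=200 CVT_BATCHING_THRESHOLD=3000
  fi
}
cvt_starting_config 25000   # example: large-deployment starting values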

CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment):

  • Based on network bandwidth and local container operations (~200MB per device image)

  • Deployment includes: image fetch, save to disk, load image, container creation

  • 1G network: 20 workers (~4GB concurrent + local ops)

  • 10G network: 60 workers (~12GB concurrent + local ops)

  • 25G+ network: 100-160 workers (higher bandwidth + parallel local operations)
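
The arithmetic behind these figures is straightforward: each deployment worker transfers a ~200 MB agent image, so the concurrent in-flight data is roughly workers × 200 MB. The sketch below is illustrative and assumes a purely bandwidth-bound transfer, which (as noted above) deployment is not:

# Concurrent in-flight data for agent deployment (illustrative)
IMAGE_MB=200; WORKERS=20; LINK_MBS=110   # ~practical 1G throughput assumed
echo "In-flight data: ~$((WORKERS * IMAGE_MB)) MB (at least $((WORKERS * IMAGE_MB / LINK_MBS)) s if purely bandwidth-bound)"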

CVT_MAX_WORKERS (Validation Operations):

  • Based on server CPU cores and device capacity

  • 8-32 cores: 30-50 workers

  • 32-128 cores: 75-150 workers

  • 128+ cores: 150-300 workers

  • Watch for device timeout errors and reduce if needed

CVT_BATCHING_THRESHOLD (Batch Processing):

  • <5,000 devices: 10000 (default, single batch)

  • 5,000-20,000 devices: 5000 (light batching)

  • 20,000+ devices: 3000 (aggressive batching)
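
Assuming the threshold acts as an approximate upper bound on devices per batch (an assumption made here for illustration; see the Simple Tuning Guide for the exact behavior), the batch count for a given cluster is a simple ceiling division:

# Approximate batch count = ceil(devices / threshold) — illustrative only
DEVICES=25000; CVT_BATCHING_THRESHOLD=3000
echo "Approximate batches: $(( (DEVICES + CVT_BATCHING_THRESHOLD - 1) / CVT_BATCHING_THRESHOLD ))"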

Vertical Scaling Limits

  • Single Server: Effective up to ~40,000 devices (switches + hosts)

  • CPU Bound: Beyond 40,000 devices, consider distributed processing

  • Memory Bound: Rarely an issue with modern servers (hosts require slightly more memory)

  • Network Bound: Primary constraint for large deployments

  • Device Mix: Higher host percentage may require more resources

Horizontal Scaling Options

  • Multiple Collectors: Split clusters across multiple servers

  • Geographic Distribution: Regional collectors for global deployments

  • Load Balancing: Distribute devices across multiple validation instances

Typical Cluster Compositions

| Cluster Type | Switches % | Hosts % | Notes | Sizing Impact |
|---|---|---|---|---|
| Infrastructure-Heavy | 80% | 20% | Network-focused deployment | Lower memory, higher network load |
| Compute-Heavy | 30% | 70% | HPC/AI clusters | Higher memory, variable response times |
| Balanced | 50% | 50% | Mixed enterprise deployment | Standard sizing applies |
| Storage-Heavy | 40% | 60% | Storage clusters with many storage hosts | Higher memory, faster responses |


Sizing Adjustments by Device Mix

Infrastructure-Heavy Clusters (80% switches):

  • CPU: Use lower end of range

  • Memory: Use lower end of range

  • Workers: Can be more aggressive

  • Network: Higher bandwidth needs per device

Compute-Heavy Clusters (70% hosts):

  • CPU: Use higher end of range

  • Memory: Use higher end of range (HCA mappings, host metadata)

  • Workers: More conservative (hosts may be busier; see the weighted-response sketch at the end of this section)

  • Network: Variable load patterns

Balanced Clusters (50/50 mix):

  • CPU: Use middle of range

  • Memory: Use middle of range

  • Workers: Standard recommendations apply

  • Network: Standard bandwidth planning
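
One way to reason about the "hosts may be busier" adjustment is to weight the typical response times from the Switches vs Hosts table by the cluster's device mix. The percentages and midpoints below are illustrative:

# Mix-weighted average response time (illustrative, midpoints of the table ranges)
SWITCH_PCT=30; HOST_PCT=70               # compute-heavy mix
SWITCH_MS=2000; HOST_MS=3500             # midpoints of 1-3 s and 2-5 s
echo "Weighted average response: $(( (SWITCH_PCT * SWITCH_MS + HOST_PCT * HOST_MS) / 100 )) ms"

A higher weighted response time is the main reason compute-heavy clusters warrant more conservative worker counts.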

Monitoring and Alerting

Key Metrics to Monitor

  1. Server Load: Target 15-25 during processing

  2. Memory Usage: Should stay well below allocated

  3. Network Utilization: Watch for bandwidth saturation (especially during agent deployment)

  4. Device Response Times: Primary performance indicator

  5. Error Rates: Timeout and connection errors

  6. Validation Completion Time: Compare against expected times in sizing table
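
These metrics can be spot-checked with standard Linux tools while a validation is running; nothing below is CVT-specific (sar requires the sysstat package):

# Manual health check during a run
uptime            # load average — target roughly 15-25 under load
free -g           # memory headroom in GB
sar -n DEV 1 3    # per-interface throughput, three one-second samples (sysstat)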

Success Indicators

Good Signs (can increase CVT_MAX_WORKERS):

  • ✅ Server load increases during validation (better CPU utilization)

  • ✅ Validation completes faster than baseline

  • ✅ No significant increase in timeout errors

  • ✅ Network bandwidth stays below 80%

Warning Signs (reduce CVT_MAX_WORKERS):

  • ⚠️ Many "Timeout while trying to start validation" errors (see the log check after this list)

  • ⚠️ "Connection refused" errors from devices

  • ⚠️ Server load stays low (underutilization)

  • ⚠️ Network bandwidth hits 90%+
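
A quick way to quantify these warning signs is to count the error strings in the CVT log. The log path below is a placeholder; substitute your deployment's actual log location:

# Count timeout and connection errors (hypothetical log path)
CVT_LOG=/path/to/cvt.log
grep -c "Timeout while trying to start validation" "$CVT_LOG"
grep -c "Connection refused" "$CVT_LOG"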

Scaling Triggers

  • Scale Up Workers: Load < 10, low error rates, fast completion

  • Scale Down Workers: High error rates, device timeouts, network saturation

  • Increase Deployment Workers: Network utilization < 50% during agent deployment

  • Decrease Deployment Workers: Network bandwidth > 80% during agent deployment

  • Adjust Batching: Memory usage > 80% (reduce CVT_BATCHING_THRESHOLD)

Performance Expectations by Cluster Size

| Cluster Size | Expected Validation Time | Expected Load Average | Target Worker Count |
|---|---|---|---|
| 1,000 devices | 1-3 minutes | 8-12 | 50-100 |
| 5,000 devices | 3-5 minutes | 12-18 | 100-150 |
| 10,000 devices | 5-8 minutes | 15-22 | 150-200 |
| 25,000 devices | 10-15 minutes | 18-25 | 225-275 |
| 40,000 devices | 18-25 minutes | 20-28 | 300-350 |

Document Relationship

This Cluster Sizing Guide provides comprehensive hardware and infrastructure planning:

  • CPU, memory, and network capacity planning

  • Expected performance at different scales

  • Device type considerations (switches, hosts, HCAs)

  • Detailed sizing methodology

For day-to-day performance tuning, refer to the Simple Tuning Guide:

  • Simple 3-variable configuration

  • Quick-start configurations by deployment size

  • Troubleshooting and monitoring guidance

  • Practical tuning adjustments

© Copyright 2025, NVIDIA. Last updated on Nov 12, 2025