NVIDIA UFM Cable Validation Tool v1.7.1

Cluster Sizing Guide

This sizing guide provides hardware and network recommendations for Cable Validation deployments based on cluster size. Recommendations are based on performance analysis of enterprise deployments and optimal resource utilization patterns.

Important Note: Cable Validation Tool (CVT) handles both switches and hosts in modern deployments. The legacy naming in the codebase (e.g., "SwitchAgentMgr", "switch_ip") reflects CVT's origins as a switch-only tool; those identifiers now apply to all managed devices (switches, hosts, HCAs, etc.).

Key Factors:

  • Device Overload Threshold: Individual devices (switches/hosts) can handle ~5-10 concurrent REST API calls

  • Network Bandwidth: 10G MGMT interface provides ~800-900 MB/s practical throughput

  • CPU Utilization: Target a load average of 15-25 for optimal performance

  • Memory Requirements: ~50-100 MB per 1,000 devices for topology and batch processing (see the quick estimate after this list)

  • Batch Processing: Optimal batch sizes scale with worker count

  • Mixed Workloads: Switches and hosts may have different response characteristics
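
For a quick capacity check, the per-1,000-device memory figure above can be scaled to a target cluster with plain shell arithmetic. The snippet below is illustrative only, not a CVT command, and the 10,000-device count is a placeholder:

# Rough memory estimate: ~50-100 MB per 1,000 devices (figure from the list above)
DEVICES=10000   # placeholder cluster size
echo "Estimated CVT working memory: $((DEVICES / 1000 * 50))-$((DEVICES / 1000 * 100)) MB"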

🎯 Simple 3-Variable Configuration

CVT performance can be optimized with just three environment variables:

# 1. Agent Deployment (~200MB image + local container operations)
export CVT_DEPLOYMENT_MAX_WORKERS=60

# 2. Everything Else (validation, connectivity, DNS, etc.)
export CVT_MAX_WORKERS=150

# 3. Batching Control (when to split large deployments)
export CVT_BATCHING_THRESHOLD=10000

Note: Agent deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. Higher worker counts are possible because the process isn't purely bandwidth-limited.

See the Simple Tuning Guide for detailed configuration guidance.
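
Before starting a run, you can confirm which of the three variables are set in the current shell. This is plain bash, not a CVT subcommand; unset variables simply fall back to CVT's defaults:

# Print the three tuning variables, flagging any that are unset
for v in CVT_DEPLOYMENT_MAX_WORKERS CVT_MAX_WORKERS CVT_BATCHING_THRESHOLD; do
  printf '%s=%s\n' "$v" "${!v:-<unset, CVT default applies>}"
done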

| Cluster Size | Recommended CPUs | Recommended Memory | Recommended MAX_WORKERS | DEPLOYMENT_MAX_WORKERS | MGMT Bandwidth | Expected Time | Notes |
|---|---|---|---|---|---|---|---|
| Small Clusters (1-1,000 devices) | | | | | | | |
| 100 devices | 4-8 cores | 4-8 GB | 30-50 | 20 | 1G | 30-60 seconds | Single server, basic setup |
| 500 devices | 8-16 cores | 8-16 GB | 50-75 | 20-40 | 1G | 1-2 minutes | Development/test environment |
| 1,000 devices | 16-32 cores | 16-32 GB | 50-100 | 20-40 | 1G | 2-3 minutes | Small production deployment |
| Medium Clusters (1,000-10,000 devices) | | | | | | | |
| 2,500 devices | 32-64 cores | 32-64 GB | 75-100 | 40 | 10G | 2-3 minutes | Regional deployment |
| 5,000 devices | 64-128 cores | 64-128 GB | 100-150 | 40-60 | 10G | 3-5 minutes | Large regional deployment |
| 7,500 devices | 96-192 cores | 96-192 GB | 125-175 | 60 | 10G | 4-6 minutes | Multi-site deployment |
| 10,000 devices | 128-256 cores | 128-256 GB | 150-200 | 60 | 10G | 5-8 minutes | Enterprise deployment |
| Large Clusters (10,000-25,000 devices) | | | | | | | |
| 15,000 devices | 192-384 cores | 192-384 GB | 175-225 | 60-80 | 25G+ | 6-10 minutes | Large enterprise |
| 20,000 devices | 256-512 cores | 256-512 GB | 200-250 | 80 | 25G+ | 8-12 minutes | Hyperscale deployment |
| 25,000 devices | 320-640 cores | 320-640 GB | 225-275 | 80-100 | 40G+ | 10-15 minutes | Hyperscale deployment |
| Hyperscale Clusters (25,000+ devices) | | | | | | | |
| 30,000 devices | 384-768 cores | 384-768 GB | 250-300 | 100-120 | 40G+ | 12-18 minutes | Hyperscale datacenter |
| 35,000 devices | 448-896 cores | 448-896 GB | 275-325 | 120-140 | 40G+ | 15-20 minutes | Hyperscale datacenter |
| 40,000 devices | 512-1024 cores | 512 GB-1 TB | 300-350 | 140-160 | 40G+ | 18-25 minutes | Massive hyperscale |

Small Clusters (1-1,000 devices)

Characteristics:

  • Single server deployment

  • Basic network infrastructure

  • Development/test environments

  • Device Mix: Primarily switches, some hosts/HCAs

Sizing Logic:

  • CPU: 1 core per 25-50 devices

  • Memory: 10-20 MB per device for topology data

  • Workers: Conservative scaling to avoid device overload

  • Network: 1G sufficient for small clusters

Medium Clusters (1,000-10,000 devices)

Characteristics:

  • Production deployments

  • 10G management networks

  • Regional or multi-site deployments

  • Device Mix: Mixed switches and hosts, HCAs in compute clusters

Sizing Logic:

  • CPU: 1 core per 40-80 devices (better efficiency at scale; see the worked example after this list)

  • Memory: 8-15 MB per device (shared topology data)

  • Workers: Balanced scaling considering device capacity

  • Network: 10G required for concurrent processing

  • Host Considerations: Hosts may respond differently than switches
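
As a worked example of the CPU ratio, applying 1 core per 40-80 devices to a 5,000-device cluster lands close to the 64-128 cores listed in the sizing table. The division below is illustrative shell arithmetic:

# 5,000 devices at 1 core per 40-80 devices -> ~62-125 cores (table lists 64-128)
DEVICES=5000
echo "Recommended CPU range: $((DEVICES / 80))-$((DEVICES / 40)) cores"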

Large Clusters (10,000-25,000 devices)

Characteristics:

  • Enterprise-scale deployments

  • High-performance requirements

  • 25G+ management networks

  • Device Mix: Large numbers of compute hosts + infrastructure switches

Sizing Logic:

  • CPU: 1 core per 60-100 devices (enterprise efficiency)

  • Memory: 5-12 MB per device (optimized topology handling)

  • Workers: Approaching device overload thresholds

  • Network: 25G+ to handle concurrent load

  • Mixed Response: Account for different device response characteristics

Hyperscale Clusters (25,000+ devices)

Characteristics:

  • Massive datacenter deployments

  • Enterprise-grade hardware (e.g., 448-core servers)

  • 40G+ management networks

  • Device Mix: Thousands of compute hosts + infrastructure switches

Sizing Logic:

  • CPU: 1 core per 80-120 devices (maximum efficiency)

  • Memory: 3-10 MB per device (highly optimized)

  • Workers: At or near device overload limits

  • Network: 40G+ essential for performance

  • Device Diversity: Must handle switches, hosts, HCAs, storage devices

Bandwidth Requirements by Cluster Size

| Cluster Size | Concurrent Workers | Peak Bandwidth Required | Network Recommendation | Device Types |
|---|---|---|---|---|
| 1,000 | 50 workers | ~50-100 MB/s | 1G (sufficient) | Switches + some hosts |
| 5,000 | 125 workers | ~200-400 MB/s | 1G (tight) / 10G (recommended) | Mixed switches/hosts |
| 10,000 | 175 workers | ~400-700 MB/s | 10G (required) | Balanced switches/hosts |
| 25,000 | 250 workers | ~600-900 MB/s | 10G (tight) / 25G (recommended) | Majority hosts + switches |
| 40,000 | 350 workers | ~800-1200 MB/s | 25G (minimum) / 40G (optimal) | Large compute + storage |

Bandwidth Calculation Logic:

  • Per Worker: ~2-4 MB/s during active validation startup (see the utilization sketch after this list)

  • Peak Usage: During initial topology push to all devices

  • Sustained Usage: Much lower during normal validation operation

  • Burst Patterns: High bandwidth during startup, lower during monitoring
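
To see how these figures translate into link utilization, the sketch below multiplies a worker count by the per-worker rate and compares it against an assumed usable fraction of line rate. The 70% factor is an assumption chosen to roughly match the ~800-900 MB/s practical 10G figure quoted earlier in this guide:

# Illustrative peak-bandwidth check for a 10,000-device run on a 10G MGMT link
WORKERS=175                                        # from the bandwidth table above
PER_WORKER_MBS_LOW=2; PER_WORKER_MBS_HIGH=4        # ~2-4 MB/s per worker at startup
LINK_GBIT=10
USABLE_MBS=$((LINK_GBIT * 1000 / 8 * 70 / 100))    # ~875 MB/s (assumed 70% of line rate)
echo "Peak: $((WORKERS * PER_WORKER_MBS_LOW))-$((WORKERS * PER_WORKER_MBS_HIGH)) MB/s of ~${USABLE_MBS} MB/s usable"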

Key Insights:

  • Bandwidth is BURSTY: High during startup, low during validation

  • 10G Limit: Starts getting tight around 10,000 devices

  • 25G Sweet Spot: Good performance for 25,000-40,000 devices

  • 40G Future-Proof: Optimal for large hyperscale deployments

Memory Breakdown by Component

| Component | Memory per 1,000 Devices | Notes |
|---|---|---|
| Topology Data | 20-40 MB | Device definitions, links, mixed switches/hosts |
| Batch Processing | 15-30 MB | Temporary data during processing |
| Connection Pools | 5-10 MB | HTTP session management |
| Results Storage | 10-20 MB | Validation results and reports |
| Device Metadata | 5-15 MB | Host-specific data, HCA mappings |
| Total | 55-115 MB | Per 1,000 devices (switches + hosts) |


Quick Performance Tuning

For detailed tuning instructions and troubleshooting, see the Simple Tuning Guide, which provides:

  • Easy-to-follow configuration decisions

  • Monitoring guidance and success criteria

  • Troubleshooting common issues

  • System tuning for large deployments

CPU Optimization

  • Target Load: 15-25 average load during processing

  • NUMA Awareness: Use dual-socket servers for 20,000+ devices

  • Worker Scaling: Adjust CVT_MAX_WORKERS based on CPU cores and observed load

  • Device Mix: Account for different CPU requirements of switches vs hosts

  • Monitoring: If load stays low, increase workers; if timeout errors increase, reduce workers

Memory Optimization

  • Batching: Use CVT_BATCHING_THRESHOLD to control memory usage on large deployments

  • Connection Pooling: Automatically scales with worker count

  • Garbage Collection: Monitor for large deployments (20,000+ devices)

  • Device Metadata: Additional memory for host-specific data (HCA mappings, etc.)

Network Optimization

  • Bandwidth Planning: Set CVT_DEPLOYMENT_MAX_WORKERS based on management network capacity

  • Connection Reuse: Essential for large deployments (handled automatically)

  • Bandwidth Monitoring: Watch for saturation at scale

  • Device Response Variance: Hosts may respond differently than switches

  • Burst Patterns: High bandwidth during startup, lower during validation operation

Switches vs Hosts Performance Characteristics

| Device Type | Typical Response Time | Concurrent Call Limit | Special Considerations |
|---|---|---|---|
| Network Switches | 1-3 seconds | 5-10 concurrent | REST API on switch OS |
| Compute Hosts | 2-5 seconds | 3-8 concurrent | Agent on host OS, may be busier |
| Storage Devices | 1-4 seconds | 5-12 concurrent | Usually dedicated management |
| HCA Devices | 1-2 seconds | 8-15 concurrent | Lightweight agent |

Small Deployment (<1,000 devices)

# Network: 1G management interface
# Server: 8-32 cores
export CVT_DEPLOYMENT_MAX_WORKERS=20
export CVT_MAX_WORKERS=50
# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Medium Deployment (1,000-10,000 devices)

# Network: 10G management interface
# Server: 64-128 cores
export CVT_DEPLOYMENT_MAX_WORKERS=40
export CVT_MAX_WORKERS=100
# CVT_BATCHING_THRESHOLD=10000 (default, no need to change)

Large Deployment (10,000-30,000 devices)

# Network: 25G+ management interface
# Server: 192-384 cores
export CVT_DEPLOYMENT_MAX_WORKERS=60
export CVT_MAX_WORKERS=150
export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ devices)

# Network: 40G+ management interface
# Server: 448+ cores
export CVT_DEPLOYMENT_MAX_WORKERS=80
export CVT_MAX_WORKERS=200
export CVT_BATCHING_THRESHOLD=3000
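
If you manage several clusters, the four starting points above can be wrapped in a small shell helper. This is a hypothetical convenience function, not part of CVT; it simply exports the values shown in the blocks above based on a device count:

# Hypothetical helper: export starting values by cluster size (mirrors the blocks above)
cvt_starting_config() {
  local devices=$1
  if   [ "$devices" -le 1000 ];  then export CVT_DEPLOYMENT_MAX_WORKERS=20 CVT_MAX_WORKERS=50  CVT_BATCHING_THRESHOLD=10000
  elif [ "$devices" -le 10000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=40 CVT_MAX_WORKERS=100 CVT_BATCHING_THRESHOLD=10000
  elif [ "$devices" -le 30000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=60 CVT_MAX_WORKERS=150 CVT_BATCHING_THRESHOLD=5000
  else                                export CVT_DEPLOYMENT_MAX_WORKERS=80 CVT_MAX_WORKERS=200 CVT_BATCHING_THRESHOLD=3000
  fi
}
cvt_starting_config 25000   # example: large-deployment starting values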

CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment):

  • Based on network bandwidth and local container operations (~200MB per device image)

  • Deployment includes: image fetch, save to disk, load image, container creation

  • 1G network: 20 workers (~4GB concurrent + local ops)

  • 10G network: 60 workers (~12GB concurrent + local ops)

  • 25G+ network: 100-160 workers (higher bandwidth + parallel local operations)
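
The arithmetic behind these figures is straightforward: each deployment worker transfers a ~200 MB agent image, so the concurrent in-flight data is roughly workers × 200 MB. The sketch below is illustrative and assumes a purely bandwidth-bound transfer, which (as noted above) deployment is not:

# Concurrent in-flight data for agent deployment (illustrative)
IMAGE_MB=200; WORKERS=20; LINK_MBS=110   # ~practical 1G throughput assumed
echo "In-flight data: ~$((WORKERS * IMAGE_MB)) MB (at least $((WORKERS * IMAGE_MB / LINK_MBS)) s if purely bandwidth-bound)"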

CVT_MAX_WORKERS (Validation Operations):

  • Based on server CPU cores and device capacity

  • 8-32 cores: 30-50 workers

  • 32-128 cores: 75-150 workers

  • 128+ cores: 150-300 workers

  • Watch for device timeout errors and reduce if needed

CVT_BATCHING_THRESHOLD (Batch Processing):

  • <5,000 devices: 10000 (default, single batch)

  • 5,000-20,000 devices: 5000 (light batching)

  • 20,000+ devices: 3000 (aggressive batching)
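
Assuming the threshold acts as an approximate upper bound on devices per batch (an assumption made here for illustration; see the Simple Tuning Guide for the exact behavior), the batch count for a given cluster is a simple ceiling division:

# Approximate batch count = ceil(devices / threshold) — illustrative only
DEVICES=25000; CVT_BATCHING_THRESHOLD=3000
echo "Approximate batches: $(( (DEVICES + CVT_BATCHING_THRESHOLD - 1) / CVT_BATCHING_THRESHOLD ))"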

Vertical Scaling Limits

  • Single Server: Effective up to ~40,000 devices (switches + hosts)

  • CPU Bound: Beyond 40,000 devices, consider distributed processing

  • Memory Bound: Rarely an issue with modern servers (hosts require slightly more memory)

  • Network Bound: Primary constraint for large deployments

  • Device Mix: Higher host percentage may require more resources

Horizontal Scaling Options

  • Multiple Collectors: Split clusters across multiple servers

  • Geographic Distribution: Regional collectors for global deployments

  • Load Balancing: Distribute devices across multiple validation instances

Typical Cluster Compositions

| Cluster Type | Switches % | Hosts % | Notes | Sizing Impact |
|---|---|---|---|---|
| Infrastructure-Heavy | 80% | 20% | Network-focused deployment | Lower memory, higher network load |
| Compute-Heavy | 30% | 70% | HPC/AI clusters | Higher memory, variable response times |
| Balanced | 50% | 50% | Mixed enterprise deployment | Standard sizing applies |
| Storage-Heavy | 40% | 60% | Storage clusters with many storage hosts | Higher memory, faster responses |


Sizing Adjustments by Device Mix

Infrastructure-Heavy Clusters (80% switches):

  • CPU: Use lower end of range

  • Memory: Use lower end of range

  • Workers: Can be more aggressive

  • Network: Higher bandwidth needs per device

Compute-Heavy Clusters (70% hosts):

  • CPU: Use higher end of range

  • Memory: Use higher end of range (HCA mappings, host metadata)

  • Workers: More conservative (hosts may be busier; see the weighted-response sketch at the end of this section)

  • Network: Variable load patterns

Balanced Clusters (50/50 mix):

  • CPU: Use middle of range

  • Memory: Use middle of range

  • Workers: Standard recommendations apply

  • Network: Standard bandwidth planning
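
One way to reason about the "hosts may be busier" adjustment is to weight the typical response times from the Switches vs Hosts table by the cluster's device mix. The percentages and midpoints below are illustrative:

# Mix-weighted average response time (illustrative, midpoints of the table ranges)
SWITCH_PCT=30; HOST_PCT=70               # compute-heavy mix
SWITCH_MS=2000; HOST_MS=3500             # midpoints of 1-3 s and 2-5 s
echo "Weighted average response: $(( (SWITCH_PCT * SWITCH_MS + HOST_PCT * HOST_MS) / 100 )) ms"

A higher weighted response time is the main reason compute-heavy clusters warrant more conservative worker counts.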

Monitoring and Alerting

Key Metrics to Monitor

  1. Server Load: Target 15-25 during processing

  2. Memory Usage: Should stay well below allocated

  3. Network Utilization: Watch for bandwidth saturation (especially during agent deployment)

  4. Device Response Times: Primary performance indicator

  5. Error Rates: Timeout and connection errors

  6. Validation Completion Time: Compare against expected times in sizing table
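
These metrics can be spot-checked with standard Linux tools while a validation is running; nothing below is CVT-specific (sar requires the sysstat package):

# Manual health check during a run
uptime            # load average — target roughly 15-25 under load
free -g           # memory headroom in GB
sar -n DEV 1 3    # per-interface throughput, three one-second samples (sysstat)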

Success Indicators

Good Signs (can increase CVT_MAX_WORKERS):

  • ✅ Server load increases during validation (better CPU utilization)

  • ✅ Validation completes faster than baseline

  • ✅ No significant increase in timeout errors

  • ✅ Network bandwidth stays below 80%

Warning Signs (reduce CVT_MAX_WORKERS):

  • ⚠️ Many "Timeout while trying to start validation" errors (see the log check after this list)

  • ⚠️ "Connection refused" errors from devices

  • ⚠️ Server load stays low (underutilization)

  • ⚠️ Network bandwidth hits 90%+
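
A quick way to quantify these warning signs is to count the error strings in the CVT log. The log path below is a placeholder; substitute your deployment's actual log location:

# Count timeout and connection errors (hypothetical log path)
CVT_LOG=/path/to/cvt.log
grep -c "Timeout while trying to start validation" "$CVT_LOG"
grep -c "Connection refused" "$CVT_LOG"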

Scaling Triggers

  • Scale Up Workers: Load < 10, low error rates, fast completion

  • Scale Down Workers: High error rates, device timeouts, network saturation

  • Increase Deployment Workers: Network utilization < 50% during agent deployment

  • Decrease Deployment Workers: Network bandwidth > 80% during agent deployment

  • Adjust Batching: Memory usage > 80% (reduce CVT_BATCHING_THRESHOLD)

Performance Expectations by Cluster Size

| Cluster Size | Expected Validation Time | Expected Load Average | Target Worker Count |
|---|---|---|---|
| 1,000 devices | 1-3 minutes | 8-12 | 50-100 |
| 5,000 devices | 3-5 minutes | 12-18 | 100-150 |
| 10,000 devices | 5-8 minutes | 15-22 | 150-200 |
| 25,000 devices | 10-15 minutes | 18-25 | 225-275 |
| 40,000 devices | 18-25 minutes | 20-28 | 300-350 |

Document Relationship

This Cluster Sizing Guide provides comprehensive hardware and infrastructure planning:

  • CPU, memory, and network capacity planning

  • Expected performance at different scales

  • Device type considerations (switches, hosts, HCAs)

  • Detailed sizing methodology

For day-to-day performance tuning, refer to the Simple Tuning Guide:

  • Simple 3-variable configuration

  • Quick-start configurations by deployment size

  • Troubleshooting and monitoring guidance

  • Practical tuning adjustments

© Copyright 2025, NVIDIA. Last updated on Nov 12, 2025