NVIDIA UFM Cable Validation Tool v1.7.1

Simple CVT Performance Tuning Guide

Easy Configuration - Just 3 Variables

Based on extensive testing and real-world deployments, CVT performance can be tuned with just three environment variables:

Core Configuration:

# 1. Agent Deployment (~200MB image + local container operations)
export CVT_DEPLOYMENT_MAX_WORKERS=60
# 2. Everything Else (validation, connectivity, DNS, etc.)  
export CVT_MAX_WORKERS=150
# 3. Batching Control (when to split large deployments)
export CVT_BATCHING_THRESHOLD=10000

CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment)

Question: How much bandwidth can your management network handle?

Note: Deployment runs in several phases: image fetch (~200MB), save to disk, image load, and container creation. Because the process is not purely bandwidth-limited, worker counts can be higher than bandwidth alone would suggest.

Network Speed	Recommended Value	Reasoning
1G	20	20 × 200MB = ~4GB concurrent + local ops
10G	60	60 × 200MB = ~12GB concurrent + local ops
25G+	100-160	Higher bandwidth + parallel local operations
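
The table above can be expressed as a small lookup. This is a minimal sketch, not part of CVT: the function name `recommend_deployment_workers` is hypothetical, the link speed is passed as an integer in Gbit/s, and for 25G+ links it returns the low end (100) of the 100-160 range. Speeds between the listed tiers fall back to the next lower tier.

```shell
# Hypothetical helper (not shipped with CVT): map the management link speed
# in Gbit/s to a CVT_DEPLOYMENT_MAX_WORKERS value from the table.
recommend_deployment_workers() {
    speed_gbps=$1
    if [ "$speed_gbps" -ge 25 ]; then
        echo 100    # low end of the 100-160 range; raise gradually
    elif [ "$speed_gbps" -ge 10 ]; then
        echo 60     # 10G tier
    else
        echo 20     # 1G tier
    fi
}

# Example: a 10G management network
export CVT_DEPLOYMENT_MAX_WORKERS="$(recommend_deployment_workers 10)"
echo "CVT_DEPLOYMENT_MAX_WORKERS=$CVT_DEPLOYMENT_MAX_WORKERS"
```

Starting at the low end of a range and raising the value while watching bandwidth keeps the first run conservative.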

CVT_MAX_WORKERS (Everything Else)

Question: How many CPU cores does your server have?

Server CPU Cores	Recommended Value	Reasoning
8-32 cores	30-50	Conservative scaling
32-128 cores	75-150	Balanced scaling
128+ cores	150-300	Aggressive scaling

But watch for switch overload! If you see many timeout errors, reduce this value.
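
As an illustration, the core-count table can be turned into a starting value. The sketch below makes several assumptions: `recommend_max_workers` is a hypothetical name, it returns the low end of each range, and because the table's tiers overlap at 32 and 128 cores it places those boundary values in the lower tier. Servers with fewer than 8 cores are not covered by the table; the sketch returns the lowest value for them too.

```shell
# Hypothetical helper (not shipped with CVT): suggest a starting
# CVT_MAX_WORKERS value from the number of CPU cores.
recommend_max_workers() {
    cores=$1
    if [ "$cores" -gt 128 ]; then
        echo 150    # low end of 150-300
    elif [ "$cores" -gt 32 ]; then
        echo 75     # low end of 75-150
    else
        echo 30     # low end of 30-50
    fi
}

# Example with an explicit core count; on Linux, "$(nproc)" could be
# passed instead to use the local machine's core count.
export CVT_MAX_WORKERS="$(recommend_max_workers 64)"
echo "CVT_MAX_WORKERS=$CVT_MAX_WORKERS"
```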

CVT_BATCHING_THRESHOLD (When to Batch)

Question: How many switches do you have?

Switch Count	Recommended Value	What Happens
<5,000	10000 (default)	Single batch (faster)
5,000-20,000	5000	Light batching
20,000+	3000	More aggressive batching
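
The same mapping, sketched as a function. The name `recommend_batching_threshold` is hypothetical, and since the table lists 20,000 in both the middle and top rows, this sketch assumes exactly 20,000 switches belongs to the top row.

```shell
# Hypothetical helper (not shipped with CVT): pick CVT_BATCHING_THRESHOLD
# from the number of switches, following the table.
recommend_batching_threshold() {
    switches=$1
    if [ "$switches" -lt 5000 ]; then
        echo 10000   # default: single batch
    elif [ "$switches" -lt 20000 ]; then
        echo 5000    # light batching
    else
        echo 3000    # more aggressive batching
    fi
}

# Example: 12,000 switches
recommend_batching_threshold 12000
```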

Quick Start by Deployment Size:

Small Deployment (<1,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=20
export CVT_MAX_WORKERS=50
# No need to change batching threshold

Medium Deployment (1,000-10,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=40
export CVT_MAX_WORKERS=100
# No need to change batching threshold

Large Deployment (10,000-30,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=60
export CVT_MAX_WORKERS=150
export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ switches)

export CVT_DEPLOYMENT_MAX_WORKERS=80
export CVT_MAX_WORKERS=200
export CVT_BATCHING_THRESHOLD=3000
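
The four profiles above could be selected from a switch count, roughly as follows. This is a sketch under assumptions: `cvt_quick_start` is a hypothetical function, and because the size ranges share their boundary values (10,000 and 30,000), it places a boundary value in the larger profile.

```shell
# Hypothetical helper (not shipped with CVT): print the export lines for
# the quick-start profile that matches a switch count.
cvt_quick_start() {
    switches=$1
    threshold=""                          # empty means "keep the default"
    if [ "$switches" -lt 1000 ]; then     # Small
        deploy=20; workers=50
    elif [ "$switches" -lt 10000 ]; then  # Medium
        deploy=40; workers=100
    elif [ "$switches" -lt 30000 ]; then  # Large
        deploy=60; workers=150; threshold=5000
    else                                  # Hyperscale
        deploy=80; workers=200; threshold=3000
    fi
    echo "export CVT_DEPLOYMENT_MAX_WORKERS=$deploy"
    echo "export CVT_MAX_WORKERS=$workers"
    if [ -n "$threshold" ]; then
        echo "export CVT_BATCHING_THRESHOLD=$threshold"
    fi
}

# Example: print the settings for 32,149 switches
cvt_quick_start 32149
```

To apply the printed settings in the current shell, the output can be evaluated with `eval "$(cvt_quick_start 32149)"`.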

How to Know If Your Settings Are Good

Good Signs:

✅ Server load increases during validation (better CPU utilization)
✅ Validation completes faster than before
✅ No significant increase in timeout errors
✅ Network bandwidth stays below 80%

Warning Signs:

⚠️ Many "Timeout while trying to start validation" errors
⚠️ "Connection refused" errors from switches
⚠️ Server load stays low (underutilization)
⚠️ Network bandwidth hits 90%+

Adjustment Strategy:

Too many timeouts: Reduce CVT_MAX_WORKERS by 25-50
Server underutilized: Increase CVT_MAX_WORKERS by 25-50
Deployment too slow: Increase CVT_DEPLOYMENT_MAX_WORKERS (if network allows)
Memory issues: Reduce CVT_BATCHING_THRESHOLD
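
One way to apply the timeout rule is to count timeout messages in a log and step the worker count down when they exceed a baseline. The sketch below is illustrative only: the function names, the log file path, and the baseline are hypothetical, and it assumes the "25-50" adjustment means an absolute number of workers (default step 25), which the guide does not state explicitly.

```shell
# Count timeout messages in a log file. The path is supplied by the
# caller; this guide does not specify where CVT writes its logs.
count_timeouts() {
    grep -c "Timeout while trying to start validation" "$1" || true
}

# Suggest the next CVT_MAX_WORKERS value.
#   $1 current value   $2 observed timeouts
#   $3 acceptable baseline of timeouts   $4 step (default 25)
next_max_workers() {
    current=$1; timeouts=$2; baseline=$3; step=${4:-25}
    if [ "$timeouts" -gt "$baseline" ]; then
        next=$((current - step))
        if [ "$next" -lt 1 ]; then next=1; fi   # never go below one worker
        echo "$next"
    else
        echo "$current"                          # no change needed
    fi
}

# Example: 150 workers, 40 timeouts observed, baseline of 5
next_max_workers 150 40 5
```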

Expected Performance

Your Customer's 32,149 Switches:

Current Configuration:

export CVT_DEPLOYMENT_MAX_WORKERS=60    # Optimal for 10G network with 200MB image
export CVT_MAX_WORKERS=150              # Good for 448-core server
export CVT_BATCHING_THRESHOLD=10000     # Will use batching (32K > 10K)

Expected Results:

Current: 7m 13s
Optimized: 2-3 minutes (60-75% improvement)
Server Load: Should increase from 5-7 to 15-20
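
The improvement figure can be checked with integer arithmetic. In this sketch (the function name `improvement_pct` is illustrative), 7m 13s is 433 seconds, and finishing in 3 or 2 minutes works out to a reduction of about 58% or 72%, close to the 60-75% quoted above.

```shell
# Percentage reduction in runtime, using integer arithmetic (rounds down).
improvement_pct() {
    before_s=$1
    after_s=$2
    echo $(( (before_s - after_s) * 100 / before_s ))
}

baseline=$((7 * 60 + 13))          # 7m 13s = 433 seconds
improvement_pct "$baseline" 180    # 3-minute run
improvement_pct "$baseline" 120    # 2-minute run
```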

Monitoring and Troubleshooting

Critical Monitoring Points:

Watch for Device Overload:

Timeout Errors: Increase in "Timeout while trying to start validation" messages
Connection Refused: "Connection error" messages from devices
Response Times: Slower device response times
Failure Rate: Higher percentage of failed device connections

Server Utilization Monitoring:

Load Average: Should increase from baseline to target ranges
CPU Usage: Better utilization of available cores
Memory: Watch for any memory pressure during large deployments
Network: Monitor bandwidth utilization on management interface

Success Criteria by Worker Count

Worker Count	Expected Load Average	Expected Time Improvement	Risk Level
50-75	8-12	30-50% faster	Low
100-150	12-18	50-70% faster	Medium
200+	18-25	70%+ faster	High
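
A rough way to compare the observed load average with the table is sketched below. The function names are hypothetical, `/proc/loadavg` is Linux-specific, and worker counts that fall between the table's rows (for example 80 or 175) are treated as unlisted rather than interpolated.

```shell
# Read the 1-minute load average (first field of /proc/loadavg; Linux only).
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}

# Expected load-average range ("low high") for a CVT_MAX_WORKERS value,
# taken from the table. Unlisted worker counts return a non-zero status.
expected_load_range() {
    workers=$1
    if [ "$workers" -ge 200 ]; then
        echo "18 25"
    elif [ "$workers" -ge 100 ] && [ "$workers" -le 150 ]; then
        echo "12 18"
    elif [ "$workers" -ge 50 ] && [ "$workers" -le 75 ]; then
        echo "8 12"
    else
        return 1
    fi
}

# Compare a load value with a range; awk handles the decimal comparison.
classify_load() {
    awk -v l="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (l + 0 < lo + 0) print "below"; else if (l + 0 > hi + 0) print "above"; else print "within"
    }'
}

# Example with a fixed sample value; on Linux, "$(current_load)" could
# replace 6.2 to use the live 1-minute load average.
range=$(expected_load_range 150)
classify_load 6.2 "${range% *}" "${range#* }"
```

A result of "below" corresponds to the underutilization case, where the guide suggests raising CVT_MAX_WORKERS as long as device errors stay flat.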

Red Flags (Scale Back If You See)

⚠️ Significant increase in timeout errors
⚠️ "Connection refused" errors from devices
⚠️ Device response times getting slower
⚠️ Higher failure rates than baseline

Green Lights (Scale Up If You See)

✅ Stable or improved device response times
✅ No increase in connection errors
✅ Server load well below target range
✅ Good success rate maintained

System Tuning for Large Deployments

File Descriptor Limits

# Increase file descriptor limits for high concurrency
ulimit -n 65536
# Make permanent by adding to /etc/security/limits.conf (as root; applies to new login sessions):
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
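
Before starting a large run, it may be worth confirming that the limit is actually in effect in the shell that launches CVT. A minimal sketch, with a hypothetical function name `check_fd_limit`:

```shell
# Report whether an open-file limit meets a required minimum.
#   $1 current limit (as printed by `ulimit -n`)   $2 required minimum
check_fd_limit() {
    current=$1
    required=$2
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
        echo "sufficient"
    else
        echo "too low"
    fi
}

# Check the soft limit of the current shell against 65536
echo "open files: $(ulimit -n) ($(check_fd_limit "$(ulimit -n)" 65536))"
```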

Network Optimization

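# Optimize network settings for high concurrency (run as root)
echo 65536 > /proc/sys/net/core/somaxconn
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
# tcp_tw_recycle was removed in Linux 4.12 and can drop connections from clients behind NAT,
# so set it only on older kernels where the file still exists
if [ -e /proc/sys/net/ipv4/tcp_tw_recycle ]; then echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle; fi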

Memory Settings

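# For very large deployments (20,000+ devices)
echo 1 > /proc/sys/vm/overcommit_memory
# Note: the kernel applies overcommit_ratio only when overcommit_memory is 2; in mode 1 it has no effect
echo 80 > /proc/sys/vm/overcommit_ratio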

Agent Deployment Time Estimates

Deployment Duration by Cluster Size:

Devices	Workers	Network	Estimated Time
1,000	20	1G	1-2 hours
4,000	60	10G	2-4 hours
10,000	80	25G	4-8 hours
25,000+	100-160	40G+	10-20 hours

Note: Agent deployment includes image fetch (~200MB per device), local save, image load, and container creation. With the reduced image size and parallel local operations, deployment is significantly faster than with larger images.
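
To see how much of the estimated time is image transfer, a lower bound can be computed from the image size and link speed. This sketch is not from the guide: `transfer_floor_seconds` is a hypothetical name, and it assumes a 200MB image per device, decimal units (1 Gbit/s = 1000 Mbit/s), and a fully utilized link with no protocol overhead.

```shell
# Lower bound, in seconds, on pure image-transfer time.
#   $1 number of devices   $2 management link speed in Gbit/s
transfer_floor_seconds() {
    devices=$1
    link_gbps=$2
    image_mb=200                                # ~200MB image per device
    total_mbit=$((devices * image_mb * 8))      # megabytes to megabits
    echo $((total_mbit / (link_gbps * 1000)))   # divide by Mbit/s
}

transfer_floor_seconds 1000 1      # 1,000 devices on 1G
transfer_floor_seconds 4000 10     # 4,000 devices on 10G
```

These floors (1600 and 640 seconds) are well under the hours listed in the table, which is consistent with the note that the local save, load, and container-creation phases account for much of the deployment time.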

Bottom Line

Most customers only need to set 2 variables:

CVT_DEPLOYMENT_MAX_WORKERS (based on network bandwidth)
CVT_MAX_WORKERS (based on server CPU and device tolerance)

The third variable (CVT_BATCHING_THRESHOLD) usually works fine at default!
