NVIDIA UFM Cable Validation Tool v1.7.1

Simple CVT Performance Tuning Guide

Based on extensive testing and real-world deployments, CVT performance can be tuned with three environment variables:

Core Configuration:

# 1. Agent Deployment (~200MB image + local container operations)

export CVT_DEPLOYMENT_MAX_WORKERS=60

# 2. Everything Else (validation, connectivity, DNS, etc.)

export CVT_MAX_WORKERS=150

# 3. Batching Control (when to split large deployments)

export CVT_BATCHING_THRESHOLD=10000

CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment)

Question: How much bandwidth can your management network handle?

Note: Deployment includes multiple phases: image fetch (~200MB), save to disk, image load, and container creation. Because only the fetch phase uses the network, the process is not purely bandwidth-limited, and worker counts can be higher than raw bandwidth alone would suggest.

| Network Speed | Recommended Value | Reasoning |
| --- | --- | --- |
| 1G | 20 | 20 × 200MB = ~4GB concurrent + local ops |
| 10G | 60 | 60 × 200MB = ~12GB concurrent + local ops |
| 25G+ | 100-160 | Higher bandwidth + parallel local operations |
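The table above can be expressed as a small lookup. The following is a minimal sketch, not something CVT ships: the function name `recommend_deploy_workers` is hypothetical, the speed is given in Mb/s, and for 25G+ links it assumes the low end of the 100-160 range as a cautious starting point.

```shell
# Hypothetical helper, not part of CVT: map the management-link speed
# (in Mb/s) to a CVT_DEPLOYMENT_MAX_WORKERS value from the table above.
recommend_deploy_workers() {
    speed_mbps=$1
    if [ "$speed_mbps" -ge 25000 ]; then
        echo 100    # assumption: low end of the 100-160 range
    elif [ "$speed_mbps" -ge 10000 ]; then
        echo 60
    else
        echo 20     # 1G (or slower) links
    fi
}

# Example: a 10G management network (10000 Mb/s)
CVT_DEPLOYMENT_MAX_WORKERS=$(recommend_deploy_workers 10000)
export CVT_DEPLOYMENT_MAX_WORKERS
echo "CVT_DEPLOYMENT_MAX_WORKERS=$CVT_DEPLOYMENT_MAX_WORKERS"
```

On Linux, the negotiated speed of an interface is usually readable in Mb/s from /sys/class/net/<iface>/speed, where <iface> is your management interface.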


CVT_MAX_WORKERS (Everything Else)

Question: How many CPU cores does your server have?

| Server CPU Cores | Recommended Value | Reasoning |
| --- | --- | --- |
| 8-32 cores | 30-50 | Conservative scaling |
| 32-128 cores | 75-150 | Balanced scaling |
| 128+ cores | 150-300 | Aggressive scaling |

But watch for switch overload! If you see many timeout errors, reduce this value.
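A starting value can be derived from the core count of the server. This is a sketch under stated assumptions: `recommend_max_workers` is a hypothetical name, it picks the low end of each recommended range (so there is room to scale up if no timeouts appear), and it assigns the overlapping boundary values 32 and 128 to the lower tier.

```shell
# Hypothetical helper, not part of CVT: pick a starting CVT_MAX_WORKERS
# from the core count, using the low end of each range in the table above.
recommend_max_workers() {
    cores=$1
    if [ "$cores" -gt 128 ]; then
        echo 150
    elif [ "$cores" -gt 32 ]; then
        echo 75
    else
        echo 30
    fi
}

# nproc (GNU coreutils) prints the number of available processing units.
CVT_MAX_WORKERS=$(recommend_max_workers "$(nproc)")
export CVT_MAX_WORKERS
echo "CVT_MAX_WORKERS=$CVT_MAX_WORKERS"
```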

CVT_BATCHING_THRESHOLD (When to Batch)

Question: How many switches do you have?

| Switch Count | Recommended Value | What Happens |
| --- | --- | --- |
| <5,000 | 10000 (default) | Single batch (faster) |
| 5,000-20,000 | 5000 | Light batching |
| 20,000+ | 3000 | More aggressive batching |
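To check whether a given fabric would be split at all, the comparison can be sketched as follows. The helper `will_batch` is hypothetical, and it assumes batching starts when the switch count is strictly greater than the threshold; the guide does not state how CVT treats a count exactly equal to the threshold.

```shell
# Hypothetical helper, not part of CVT: report whether a fabric of a given
# size would be split, assuming batching starts when the switch count
# exceeds CVT_BATCHING_THRESHOLD (default 10000 per this guide).
will_batch() {
    switches=$1
    threshold=${2:-${CVT_BATCHING_THRESHOLD:-10000}}
    if [ "$switches" -gt "$threshold" ]; then
        echo "batched"
    else
        echo "single batch"
    fi
}

will_batch 4000 10000     # prints "single batch"
will_batch 32149 10000    # prints "batched"
```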


Small Deployment (<1,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=20

export CVT_MAX_WORKERS=50

# No need to change batching threshold

Medium Deployment (1,000-10,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=40

export CVT_MAX_WORKERS=100

# No need to change batching threshold

Large Deployment (10,000-30,000 switches)

export CVT_DEPLOYMENT_MAX_WORKERS=60

export CVT_MAX_WORKERS=150

export CVT_BATCHING_THRESHOLD=5000

Hyperscale Deployment (30,000+ switches)

export CVT_DEPLOYMENT_MAX_WORKERS=80

export CVT_MAX_WORKERS=200

export CVT_BATCHING_THRESHOLD=3000
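The four presets above can be selected from the switch count. This is only an illustrative sketch: `cvt_preset` is a hypothetical function, and assigning the boundary values (1,000, 10,000, 30,000) to the larger tier is an assumption, since the ranges above overlap at those points.

```shell
# Hypothetical helper, not part of CVT: print the preset exports above for a
# given switch count. Boundary values are assigned to the larger tier.
cvt_preset() {
    switches=$1
    if [ "$switches" -ge 30000 ]; then
        deploy=80; workers=200; threshold=3000
    elif [ "$switches" -ge 10000 ]; then
        deploy=60; workers=150; threshold=5000
    elif [ "$switches" -ge 1000 ]; then
        deploy=40; workers=100; threshold=
    else
        deploy=20; workers=50; threshold=
    fi
    echo "export CVT_DEPLOYMENT_MAX_WORKERS=$deploy"
    echo "export CVT_MAX_WORKERS=$workers"
    # Small and medium presets leave the batching threshold at its default.
    if [ -n "$threshold" ]; then
        echo "export CVT_BATCHING_THRESHOLD=$threshold"
    fi
}

# Apply the preset for a 12,000-switch fabric to the current shell.
eval "$(cvt_preset 12000)"
echo "$CVT_DEPLOYMENT_MAX_WORKERS $CVT_MAX_WORKERS $CVT_BATCHING_THRESHOLD"    # prints "60 150 5000"
```

Printing the exports instead of setting them directly makes it easy to review the values, or to redirect them into a file that is sourced before CVT starts.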

Good Signs:

  • ✅ Server load increases during validation (better CPU utilization)

  • ✅ Validation completes faster than before

  • ✅ No significant increase in timeout errors

  • ✅ Network bandwidth stays below 80%

Warning Signs:

  • ⚠️ Many "Timeout while trying to start validation" errors

  • ⚠️ "Connection refused" errors from switches

  • ⚠️ Server load stays low (underutilization)

  • ⚠️ Network bandwidth hits 90%+

Adjustment Strategy:

  1. Too many timeouts: Reduce CVT_MAX_WORKERS by 25-50

  2. Server underutilized: Increase CVT_MAX_WORKERS by 25-50

  3. Deployment too slow: Increase CVT_DEPLOYMENT_MAX_WORKERS (if network allows)

  4. Memory issues: Reduce CVT_BATCHING_THRESHOLD
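The first adjustment rule can be sketched as a small feedback step. Everything here is an assumption for illustration: the helper names are hypothetical, the log file location depends on your installation, and "by 25-50" is read as a number of workers (not a percentage), using the smaller step of 25.

```shell
# Hypothetical helpers, not part of CVT.

# Count log lines containing the timeout message quoted in this guide.
# $1 is the path to a log file; the location depends on your installation.
count_timeouts() {
    grep -cF "Timeout while trying to start validation" "$1"
}

# Suggest the next CVT_MAX_WORKERS value: step down by 25 workers when
# timeouts exceed the baseline, never going below 1.
next_max_workers() {
    current=$1; timeouts=$2; baseline=$3
    if [ "$timeouts" -gt "$baseline" ]; then
        next=$((current - 25))
        [ "$next" -lt 1 ] && next=1
        echo "$next"
    else
        echo "$current"
    fi
}

next_max_workers 150 40 5    # prints 125
next_max_workers 150 3 5     # prints 150
```

Comparing against a baseline count from a known-good run, rather than against zero, avoids scaling back because of a few devices that always time out.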

Example: Customer Deployment with 32,149 Switches:

Current Configuration:

export CVT_DEPLOYMENT_MAX_WORKERS=60   # Optimal for 10G network with 200MB image

export CVT_MAX_WORKERS=150             # Good for 448-core server

export CVT_BATCHING_THRESHOLD=10000    # Will use batching (32K > 10K)

Expected Results:

  • Current: 7m 13s

  • Optimized: 2-3 minutes (roughly 58-72% less time than 7m 13s)

  • Server Load: Should increase from 5-7 to 15-20
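The improvement percentage follows directly from the two run times. As a quick check, using a hypothetical helper `improvement_pct` and integer arithmetic that rounds down:

```shell
# Percentage reduction in run time, rounded down to a whole percent.
improvement_pct() {
    before_s=$1; after_s=$2
    echo $(( (before_s - after_s) * 100 / before_s ))
}

baseline=$((7 * 60 + 13))          # 7m 13s = 433 seconds
improvement_pct "$baseline" 180    # 3 minutes: prints 58
improvement_pct "$baseline" 120    # 2 minutes: prints 72
```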

Critical Monitoring Points:

Watch for Device Overload:

  1. Timeout Errors: Increase in "Timeout while trying to start validation" messages

  2. Connection Refused: "Connection error" messages from devices

  3. Response Times: Slower device response times

  4. Failure Rate: Higher percentage of failed device connections

Server Utilization Monitoring:

  1. Load Average: Should increase from baseline to target ranges

  2. CPU Usage: Better utilization of available cores

  3. Memory: Watch for any memory pressure during large deployments

  4. Network: Monitor bandwidth utilization on management interface
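Load average is the easiest of these to check from a script. The sketch below assumes a Linux server; `load_status` is a hypothetical helper, and the 15-20 target range is the one given for the 32,149-switch example earlier in this guide.

```shell
# Hypothetical helper, not part of CVT: classify a load average against a
# target range. awk handles the floating-point comparison.
load_status() {
    awk -v cur="$1" -v low="$2" -v high="$3" 'BEGIN {
        if (cur + 0 < low + 0)       print "below target"
        else if (cur + 0 > high + 0) print "above target"
        else                         print "within target"
    }'
}

# /proc/loadavg (Linux) starts with the 1-, 5- and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r load1 _rest < /proc/loadavg
    echo "1-minute load $load1: $(load_status "$load1" 15 20)"
fi
```

A reading that stays below the target during validation suggests there is room to raise CVT_MAX_WORKERS, provided device timeouts are not increasing.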

Success Criteria by Worker Count

| Worker Count | Expected Load Average | Expected Time Improvement | Risk Level |
| --- | --- | --- | --- |
| 50-75 | 8-12 | 30-50% faster | Low |
| 100-150 | 12-18 | 50-70% faster | Medium |
| 200+ | 18-25 | 70%+ faster | High |


Red Flags (Scale Back If You See)

  • ⚠️ Significant increase in timeout errors

  • ⚠️ "Connection refused" errors from devices

  • ⚠️ Device response times getting slower

  • ⚠️ Higher failure rates than baseline

Green Lights (Scale Up If You See)

  • ✅ Stable or improved device response times

  • ✅ No increase in connection errors

  • ✅ Server load well below target range

  • ✅ Good success rate maintained

File Descriptor Limits

# Increase file descriptor limits for high concurrency

ulimit -n 65536

# Make permanent by adding to /etc/security/limits.conf (requires root; applies to new login sessions):

echo "* soft nofile 65536" >> /etc/security/limits.conf

echo "* hard nofile 65536" >> /etc/security/limits.conf
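Before starting a large run, it can help to confirm that the limit is actually in effect in the shell that launches CVT. A minimal sketch, with `fd_limit_ok` as a hypothetical helper and 65536 taken from the setting above:

```shell
# Hypothetical helper, not part of CVT: check that a file descriptor limit
# (as printed by `ulimit -n`) is at least the value this guide recommends.
fd_limit_ok() {
    limit=$1; wanted=${2:-65536}
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$wanted" ]
}

if fd_limit_ok "$(ulimit -n)"; then
    echo "soft nofile limit is sufficient: $(ulimit -n)"
else
    echo "soft nofile limit is too low: $(ulimit -n) (want 65536)"
fi
```

If CVT runs inside a container or under a service manager, the limit that matters is the one applied to that process, which may differ from the limit of an interactive shell.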

Network Optimization

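# Optimize network settings for high concurrency (run as root; values written to /proc/sys do not persist across reboots)

echo 65536 > /proc/sys/net/core/somaxconn

echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

# tcp_tw_recycle is intentionally not set: it broke connections from clients behind NAT and was removed in Linux 4.12, so the file does not exist on current kernels.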

Memory Settings

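# For very large deployments (20,000+ devices)

echo 1 > /proc/sys/vm/overcommit_memory

# Note: the kernel reads overcommit_ratio only when overcommit_memory=2, so with mode 1 above this setting has no effect

echo 80 > /proc/sys/vm/overcommit_ratio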

Deployment Duration by Cluster Size:

| Devices | Workers | Network | Estimated Time |
| --- | --- | --- | --- |
| 1,000 | 20 | 1G | 1-2 hours |
| 4,000 | 60 | 10G | 2-4 hours |
| 10,000 | 80 | 25G | 4-8 hours |
| 25,000+ | 100-160 | 40G+ | 10-20 hours |

Note: Agent deployment includes image fetch (~200MB per device), local save, image load, and container creation. With the reduced image size and parallel local operations, deployment is significantly faster than with larger images.
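To see how much of the deployment time is image transfer, a lower bound can be computed from the image size and link speed. This is a rough sketch with stated assumptions: `min_transfer_seconds` is a hypothetical helper, all images are assumed to pass through a single link running at full line rate, and 1 MB is taken as 8 megabits with no protocol overhead.

```shell
# Hypothetical helper, not part of CVT: lower bound on image transfer time.
min_transfer_seconds() {
    devices=$1; image_mb=$2; link_mbps=$3
    # MB -> megabits (x8), then divide by the link speed in Mb/s
    echo $(( devices * image_mb * 8 / link_mbps ))
}

min_transfer_seconds 1000 200 1000      # 1,000 devices on 1G: prints 1600
min_transfer_seconds 25000 200 40000    # 25,000 devices on 40G: prints 1000
```

Under these assumptions the transfer itself takes well under an hour in both cases, which is consistent with the point that the local save, load, and container creation steps account for much of the estimated time.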

Most customers only need to set 2 variables:

  1. CVT_DEPLOYMENT_MAX_WORKERS (based on network bandwidth)

  2. CVT_MAX_WORKERS (based on server CPU and device tolerance)

The third variable (CVT_BATCHING_THRESHOLD) usually works fine at default!

© Copyright 2025, NVIDIA. Last updated on Nov 12, 2025