Simple CVT Performance Tuning Guide
Based on extensive testing and real-world deployments, CVT performance can be tuned with just three environment variables:
Core Configuration:
# 1. Agent Deployment (~200MB image + local container operations)
export CVT_DEPLOYMENT_MAX_WORKERS=60
# 2. Everything Else (validation, connectivity, DNS, etc.)
export CVT_MAX_WORKERS=150
# 3. Batching Control (when to split large deployments)
export CVT_BATCHING_THRESHOLD=10000
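Assuming CVT reads these variables from its environment at startup (which is how the exports above are meant to be used), a quick sanity check before kicking off a run is:
# Confirm the tuning variables are exported in the current shell
env | grep '^CVT_'
# Expected output:
# CVT_DEPLOYMENT_MAX_WORKERS=60
# CVT_MAX_WORKERS=150
# CVT_BATCHING_THRESHOLD=10000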
CVT_DEPLOYMENT_MAX_WORKERS (Agent Deployment)
Question: How much bandwidth can your management network handle?
Note: Deployment includes multiple phases: image fetch (~200MB), save to disk, load image, and container creation. The process is not purely bandwidth-limited, so we can use higher worker counts.
Network Speed | Recommended Value | Reasoning |
--- | --- | --- |
1G | 20 | 20 × 200MB = ~4GB concurrent + local ops |
10G | 60 | 60 × 200MB = ~12GB concurrent + local ops |
25G+ | 100-160 | Higher bandwidth + parallel local operations |
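The recommended values follow from a simple budget: workers × image size is the amount of image data potentially in flight at once, which should clear the management link quickly. A minimal bash sketch of that arithmetic:
# Data in flight = workers × image size; compare against link capacity
WORKERS=60
IMAGE_MB=200
echo "~$(( WORKERS * IMAGE_MB / 1000 ))GB in flight at CVT_DEPLOYMENT_MAX_WORKERS=${WORKERS}"
# A 10G link moves ~1.25GB/s, so ~12GB of pulls is roughly 10s of pure transfer;
# the remainder of each deployment is local disk and container work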
CVT_MAX_WORKERS (Everything Else)
Question: How many CPU cores does your server have?
Server CPU Cores | Recommended Value | Reasoning |
--- | --- | --- |
8-32 cores | 30-50 | Conservative scaling |
32-128 cores | 75-150 | Balanced scaling |
128+ cores | 150-300 | Aggressive scaling |
But watch for switch overload! If you see many timeout errors, reduce this value.
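As a starting point, the table above can be collapsed into a quick core-count lookup; the values below are rough midpoints drawn from the table, not a CVT-documented formula:
# Pick a starting CVT_MAX_WORKERS from core count (rough table midpoints)
CORES=$(nproc)
if   [ "$CORES" -lt 32 ];  then WORKERS=40
elif [ "$CORES" -lt 128 ]; then WORKERS=100
else                            WORKERS=200
fi
echo "Starting point: export CVT_MAX_WORKERS=${WORKERS}"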
CVT_BATCHING_THRESHOLD (When to Batch)
Question: How many switches do you have?
Switch Count | Recommended Value | What Happens |
--- | --- | --- |
<5,000 | 10000 (default) | Single batch (faster) |
5,000-20,000 | 5000 | Light batching |
20,000+ | 3000 | More aggressive batching |
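Assuming the threshold is the maximum batch size (which is how the profiles below read), the number of batches is ceil(switches / threshold):
# Batches = ceil(switch_count / threshold)
SWITCHES=32149
THRESHOLD=10000
echo "$(( (SWITCHES + THRESHOLD - 1) / THRESHOLD )) batches"   # -> 4 batches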
Small Deployment (<1,000 switches)
export CVT_DEPLOYMENT_MAX_WORKERS=20
export CVT_MAX_WORKERS=50
# No need to change batching threshold
Medium Deployment (1,000-10,000 switches)
export CVT_DEPLOYMENT_MAX_WORKERS=40
export CVT_MAX_WORKERS=100
# No need to change batching threshold
Large Deployment (10,000-30,000 switches)
export CVT_DEPLOYMENT_MAX_WORKERS=60
export CVT_MAX_WORKERS=150
export CVT_BATCHING_THRESHOLD=5000
Hyperscale Deployment (30,000+ switches)
export CVT_DEPLOYMENT_MAX_WORKERS=80
export CVT_MAX_WORKERS=200
export CVT_BATCHING_THRESHOLD=3000
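If you manage clusters of different sizes, the four profiles above can be wrapped in a single helper; cvt_tune is a hypothetical convenience function, not part of CVT:
# Hypothetical helper: apply the matching profile for a given device count
cvt_tune() {
    local devices=$1
    if   [ "$devices" -lt 1000 ];  then export CVT_DEPLOYMENT_MAX_WORKERS=20 CVT_MAX_WORKERS=50
    elif [ "$devices" -lt 10000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=40 CVT_MAX_WORKERS=100
    elif [ "$devices" -lt 30000 ]; then export CVT_DEPLOYMENT_MAX_WORKERS=60 CVT_MAX_WORKERS=150 CVT_BATCHING_THRESHOLD=5000
    else                                export CVT_DEPLOYMENT_MAX_WORKERS=80 CVT_MAX_WORKERS=200 CVT_BATCHING_THRESHOLD=3000
    fi
}
cvt_tune 4000   # applies the medium profile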
Good Signs:
✅ Server load increases during validation (better CPU utilization)
✅ Validation completes faster than before
✅ No significant increase in timeout errors
✅ Network bandwidth stays below 80%
Warning Signs:
⚠️ Many "Timeout while trying to start validation" errors
⚠️ "Connection refused" errors from switches
⚠️ Server load stays low (underutilization)
⚠️ Network bandwidth hits 90%+
Adjustment Strategy:
Too many timeouts: Reduce CVT_MAX_WORKERS by 25-50
Server underutilized: Increase CVT_MAX_WORKERS by 25-50
Deployment too slow: Increase CVT_DEPLOYMENT_MAX_WORKERS (if network allows)
Memory issues: Reduce CVT_BATCHING_THRESHOLD
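Both adjustment signals can be watched from one terminal. The log path below is a placeholder; substitute wherever your CVT instance writes its logs:
# Watch timeout counts and load average every 30s (/var/log/cvt.log is a placeholder path)
watch -n 30 '
  echo "timeouts: $(grep -c "Timeout while trying to start validation" /var/log/cvt.log)"
  uptime
'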
Your Customer's 32,149 Switches:
Current Configuration:
export CVT_DEPLOYMENT_MAX_WORKERS=60   # Optimal for 10G network with 200MB image
export CVT_MAX_WORKERS=150             # Good for 448-core server
export CVT_BATCHING_THRESHOLD=10000 # Will use batching (32K > 10K)
Expected Results:
Current: 7m 13s
Optimized: 2-3 minutes (60-75% improvement)
Server Load: Should increase from 5-7 to 15-20
Critical Monitoring Points:
Watch for Device Overload:
Timeout Errors: Increase in "Timeout while trying to start validation" messages
Connection Refused: "Connection error" messages from devices
Response Times: Slower device response times
Failure Rate: Higher percentage of failed device connections
Server Utilization Monitoring:
Load Average: Should increase from baseline to target ranges
CPU Usage: Better utilization of available cores
Memory: Watch for any memory pressure during large deployments
Network: Monitor bandwidth utilization on management interface
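Standard Linux tools cover all four of those checks; a minimal sketch (replace eth0 with your management interface):
uptime                                        # load average vs. target range
top -b -n 1 | head -5                         # CPU utilization snapshot
free -h                                       # memory pressure
cat /sys/class/net/eth0/statistics/rx_bytes   # raw byte counter; sample twice to derive bandwidth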
Success Criteria by Worker Count
Worker Count | Expected Load Average | Expected Time Improvement | Risk Level |
--- | --- | --- | --- |
50-75 | 8-12 | 30-50% faster | Low |
100-150 | 12-18 | 50-70% faster | Medium |
200+ | 18-25 | 70%+ faster | High |
Red Flags (Scale Back If You See)
⚠️ Significant increase in timeout errors
⚠️ "Connection refused" errors from devices
⚠️ Device response times getting slower
⚠️ Higher failure rates than baseline
Green Lights (Scale Up If You See)
✅ Stable or improved device response times
✅ No increase in connection errors
✅ Server load well below target range
✅ Good success rate maintained
File Descriptor Limits
# Increase file descriptor limits for high concurrency
ulimit -n 65536
# Make permanent by adding to /etc/security/limits.conf:
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
Network Optimization
# Optimize network settings for high concurrency
echo 65536 > /proc/sys/net/core/somaxconn
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
# Do not set tcp_tw_recycle: it broke NAT'd clients and was removed in Linux 4.12
Memory Settings
# For very large deployments (20,000+ devices)
echo 1 > /proc/sys/vm/overcommit_memory
# Note: vm.overcommit_ratio only applies with overcommit_memory=2; in mode 1 it is ignored
echo 80 > /proc/sys/vm/overcommit_ratio
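Writes to /proc/sys are lost on reboot. To persist the network and memory settings, the standard sysctl drop-in mechanism works (values mirror the sections above):
# Persist kernel tuning across reboots
cat <<'EOF' > /etc/sysctl.d/99-cvt-tuning.conf
net.core.somaxconn = 65536
net.ipv4.tcp_tw_reuse = 1
vm.overcommit_memory = 1
EOF
sysctl --system   # reload all sysctl configuration files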
Deployment Duration by Cluster Size:
Devices | Workers | Network | Estimated Time |
--- | --- | --- | --- |
1,000 | 20 | 1G | 1-2 hours |
4,000 | 60 | 10G | 2-4 hours |
10,000 | 80 | 25G | 4-8 hours |
25,000+ | 100-160 | 40G+ | 10-20 hours |
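These estimates behave roughly like waves of parallel deployments: time ≈ ceil(devices / workers) × per-device time. The ~3-minute per-device figure below is an assumption chosen to land inside the table's ranges, not a measured CVT number:
# Rough wall-clock estimate (PER_DEVICE_MIN is an assumed per-device deploy time)
DEVICES=4000
WORKERS=60
PER_DEVICE_MIN=3
WAVES=$(( (DEVICES + WORKERS - 1) / WORKERS ))
echo "~$(( WAVES * PER_DEVICE_MIN / 60 )) hours (${WAVES} waves of ${WORKERS})"   # -> ~3 hours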
Note: Agent deployment includes image fetch (~200MB per device), local save, image load, and container creation. With the reduced image size and parallel local operations, deployment is significantly faster than with larger images.
Most customers only need to set 2 variables:
CVT_DEPLOYMENT_MAX_WORKERS (based on network bandwidth)
CVT_MAX_WORKERS (based on server CPU and device tolerance)
The third variable (CVT_BATCHING_THRESHOLD) usually works fine at default!