UCC: Connection Saturation and High Response Times#
Overview#
USD Content Cache (UCC) serves cached USD assets to render workers over HTTP/HTTPS. When connection limits are exceeded or worker capacity is insufficient, UCC response times degrade, client requests time out, and simulations fail. UCC uses NGINX as its foundation, which has per-worker connection limits that must be sized appropriately for the workload.
Connection saturation occurs when:

- Worker connection limits (`worker_connections`) are undersized for concurrent client requests
- CPU cores allocated to UCC are insufficient for the connection handling workload
- Replica count is too low to distribute connection load across the cluster
- Client concurrency spikes exceed UCC's configured capacity
- HTTP/1.1 is used instead of HTTP/2, requiring one connection per concurrent request
When connection saturation occurs, UCC cannot accept new connections, client requests queue or time out, and P99 response times increase dramatically (from milliseconds to seconds or tens of seconds). This manifests as simulation failures with timeout errors or “connection refused” messages.
The recommended sizing formula for UCC connections is:
worker_connections = (GPU_count * client_concurrency) / replica_count / vCPU_count * safety_margin
For example, with 66 GPUs, 256 client concurrency, 5 replicas, and 32 vCPUs per replica:
worker_connections = (66 * 256) / 5 / 32 * 1.5 ~ 158 (round up when configuring)
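Evaluating the formula as written with these figures can be scripted; a sketch, with awk used only for the fractional arithmetic:

```shell
# Evaluate the UCC sizing formula with the example figures
gpus=66; concurrency=256; replicas=5; vcpus=32; margin="1.5"
wc_raw=$(awk -v g="$gpus" -v c="$concurrency" -v r="$replicas" -v v="$vcpus" -v m="$margin" \
  'BEGIN { printf "%.1f", g * c / r / v * m }')
echo "$wc_raw"   # per-worker connection target; round up when configuring NGINX
```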
Symptoms and Detection Signals#
Visible Symptoms#
High P99 response times - Response times exceeding 5-10 seconds in P99 percentile
Client timeout errors - Render workers reporting connection timeouts or “connection refused”
Simulation failures - Simulations failing with gRPC UNKNOWN errors or HTTP timeout errors
Connection queue buildup - Connections waiting for available worker slots
Log Messages#
Connection Refused#
# Look for timeout or connection errors in render worker logs
# Patterns may include:
# - *refused*
# - *timeout*
# - *connect*
# - *dial*
Timeout Errors#
# Look for timeout errors in render worker logs
# Patterns may include:
# - *timeout*
# - *deadline*
# - *context*
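The patterns above can be combined into a single filter over render worker logs. A sketch against sample lines (the log lines are illustrative; on a live cluster the input would come from `kubectl logs` on the render worker pods):

```shell
# Count hypothetical render-worker log lines matching the error patterns above
matches=$(grep -Eic 'refused|timeout|connect|dial|deadline|context' <<'EOF'
dial tcp 10.0.0.5:8080: connect: connection refused
context deadline exceeded while fetching /assets/scene.usd
cache hit for /assets/texture.png
EOF
)
echo "$matches"   # 2 of the 3 sample lines match
```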
Metric Signals#
The following Prometheus metrics can be used to detect connection saturation before it causes simulation failures. Monitor these metrics proactively to identify capacity issues early.
| Metric | Description |
|---|---|
| `nginx_connections_active{pod=~"usd-content-cache-.*", namespace="ucc"}` | Active connections per NGINX worker. High values approaching the `worker_connections` limit indicate saturation. |
| `nginx_connections_waiting{pod=~"usd-content-cache-.*", namespace="ucc"}` | Connections waiting for available worker slots. Non-zero values indicate connection queueing; high values indicate saturation. This should typically be zero or very low. |
| `nginx_http_request_duration_seconds{quantile="0.99", pod=~"usd-content-cache-.*"}` | P99 request duration. Values exceeding 5-10 seconds indicate severe performance degradation. Compare against baseline (typically <500ms for cache hits). |
| `rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", status=~"5.."}[5m])` | Rate of 5xx errors from UCC. Sharp increases may indicate connection saturation causing request failures. Normal operation should have minimal 5xx errors. |
| `container_network_receive_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}` | Total bytes received by UCC pods. High values approaching NIC capacity may indicate network saturation contributing to connection issues. Compare against VM SKU NIC limits. |
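These signals can be wired into alerting. A sketch of a Prometheus alerting rule for the active-connections threshold (the rule name and the 1,024 figure are illustrative; substitute the configured `worker_connections` value):

```yaml
groups:
  - name: ucc-connection-saturation
    rules:
      - alert: UCCActiveConnectionsNearLimit
        # 1024 stands in for the configured worker_connections value
        expr: nginx_connections_active{pod=~"usd-content-cache-.*",namespace="ucc"} > 0.8 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "UCC active connections above 80% of worker_connections"
```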
Root Cause Analysis#
Known Causes#
Connection saturation in UCC is typically caused by undersized worker connection limits, insufficient CPU cores, or too few replicas to handle the workload.
Undersized Worker Connection Limits#
The worker_connections parameter in NGINX controls the maximum number of simultaneous
connections each worker process can handle. The default value (often 1,024) is insufficient for
high-concurrency simulation workloads. Each NGINX worker process runs on one CPU core, so total
connection capacity is worker_connections * vCPU_count * replica_count.
For example, with default worker_connections=1024, 32 vCPUs, and 5 replicas:
Total capacity = 1,024 * 32 * 5 = 163,840 connections
However, this includes both inbound (client→UCC) and outbound (UCC→S3) connections. A workload with 66 GPUs and 256 client concurrency requires:
Required inbound connections ~ 66 * 256 = 16,896, plus outbound connections to S3
If worker_connections is too low, NGINX rejects new connections once the limit is reached,
causing “connection refused” errors and queueing.
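As a sanity check, the total capacity and the example requirement above can be computed directly, using the figures from this section:

```shell
# Total connection capacity with the NGINX default worker_connections
worker_connections=1024; vcpus=32; replicas=5
capacity=$((worker_connections * vcpus * replicas))
echo "total capacity: $capacity"

# Inbound connections required by the example workload
gpus=66; concurrency=256
required=$((gpus * concurrency))
echo "required inbound: $required"
```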
Check current worker connection configuration:
# Check Helm values for worker_connections
helm get values <ucc-release-name> -n ucc | grep worker_connections
# If not set, check default from ConfigMap
kubectl get configmap -n ucc <ucc-configmap> -o yaml | grep worker_connections
Insufficient CPU Cores#
NGINX spawns one worker process per CPU core. If CPU allocation is too low, even with properly
sized worker_connections, UCC cannot handle the connection load because there are not enough
worker processes to distribute connections across.
Check CPU allocation:
# Check UCC pod CPU limits and requests
kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
# Check actual CPU usage
kubectl top pods -n ucc
Too Few Replicas#
UCC replica count may be too low to distribute connection load. The recommended sizing is based on network bandwidth requirements: provision at least 3.3 Gbps of network bandwidth per GPU, with a minimum of 1 Gbps per GPU.
For example, with 66 GPUs requiring 217.8 Gbps total bandwidth, and VMs with 10 Gbps NICs:
Required replicas ~ 217.8 Gbps / 10 Gbps per node ~ 22 pods
However, network bandwidth is typically the primary sizing factor; connection limits are secondary.
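The bandwidth-based replica calculation above can be sketched in shell (awk rounds the result up to a whole pod count):

```shell
# Replica count from network bandwidth (figures from the example above)
gpus=66; gbps_per_gpu="3.3"; nic_gbps=10
replicas=$(awk -v g="$gpus" -v b="$gbps_per_gpu" -v n="$nic_gbps" \
  'BEGIN { r = g * b / n; print ((r == int(r)) ? r : int(r) + 1) }')
echo "$replicas"   # pods needed to carry the aggregate bandwidth
```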
Check current replica count:
# Check UCC StatefulSet replica count
kubectl get statefulset -n ucc
# Check Helm values
helm get values <ucc-release-name> -n ucc | grep replicas
Other Possible Causes#
**HTTP/1.1 Instead of HTTP/2**

- HTTP/1.1 requires one connection per concurrent request
- HTTP/2 multiplexes multiple requests over a single connection
- Using HTTP/1.1 exhausts connections faster than HTTP/2
- Check client HTTP version support and UCC HTTP/2 configuration
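The connection cost difference can be illustrated with a rough calculation. The 100-streams-per-connection figure below is an illustrative assumption; actual HTTP/2 concurrency depends on the negotiated `SETTINGS_MAX_CONCURRENT_STREAMS`:

```shell
concurrent_requests=$((66 * 256))    # example workload from above
http1_conns=$concurrent_requests     # HTTP/1.1: one connection per request
streams_per_conn=100                 # assumed HTTP/2 stream limit per connection
http2_conns=$(( (concurrent_requests + streams_per_conn - 1) / streams_per_conn ))
echo "HTTP/1.1: $http1_conns connections, HTTP/2: ~$http2_conns connections"
```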
**Load Balancer Session Affinity Disabled**

- Without session affinity, client retries hit different UCC pods
- Retry storms amplify connection pressure across all pods
- Each retry counts as a new connection without affinity
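Session affinity can be enabled directly on the UCC Service. A minimal sketch, assuming a ClusterIP-style Service (the service name and timeout are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: usd-content-cache   # illustrative name; match the deployed Service
  namespace: ucc
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 60    # a value in the 30-60s range is typical
```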
**Cloud Provider SNAT Port Exhaustion**

- Outbound connections to S3 consume SNAT ports
- SNAT exhaustion prevents new upstream connections
- More common in cloud environments with NAT gateways or load balancers
**Network Latency or Packet Loss**

- High network latency increases connection lifetime
- Packet loss triggers retransmissions and connection delays
- Connections remain open longer, consuming worker slots
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
Monitor Connection Metrics and Identify Saturation
Check active and waiting connection counts to determine if saturation is occurring.
# Access UCC metrics endpoint
kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
curl http://localhost:9145/metrics | grep "nginx_connections"

# Query Prometheus metrics:
# - nginx_connections_active{pod=~"usd-content-cache-.*"}
# - nginx_connections_waiting{pod=~"usd-content-cache-.*"}

# Check if active connections approach worker_connections limit
# Alert threshold: active > 0.8 * worker_connections
Analysis:

- Active connections consistently near the `worker_connections` limit indicate saturation.
- Non-zero `nginx_connections_waiting` indicates connection queueing.
- P99 response times >5s correlate with high active connection counts.
- Compare active connections across pods to identify load distribution issues.

Resolution:

- If active connections approach the limit, increase `worker_connections` (see step 2).
- If queueing occurs, scale replicas or increase CPU allocation (see steps 3-4).
- Monitor connection metrics after changes to verify improvements.

Increase Worker Connection Limits
If connection saturation is detected, increase the `worker_connections` parameter in the NGINX configuration.

# Get current Helm values
helm get values <ucc-release-name> -n ucc -o yaml > current-values.yaml

# Edit: Add or update nginx.workerConnections
# Recommended: (GPU_count * concurrency) / replicas / vCPU * 1.5
# Example: (66 * 256) / 5 / 32 * 1.5 ~ 158 (round up)

# Apply updated values
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Verify configuration applied
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep worker_connections

# Monitor connection metrics after upgrade
kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
curl http://localhost:9145/metrics | grep "nginx_connections_active"
Analysis:

- Current `worker_connections` compared to the calculated requirement determines if an increase is needed.
- Post-upgrade, active connections should remain well below the new limit (target <70%).
- Verify no connection queueing (`nginx_connections_waiting` should be zero).

Resolution:

- Set `worker_connections` to the calculated value based on the sizing formula.
- Apply via Helm upgrade: `helm upgrade <release> <chart> -n ucc -f values.yaml`
- Restart UCC pods if configuration hot-reload is not supported.
- Monitor metrics for 24-48 hours to validate capacity improvements.

Scale UCC Replicas to Distribute Load
If connection saturation persists after increasing worker limits, scale the number of UCC replicas to distribute load across more pods.
# Check current replica count
kubectl get statefulset -n ucc

# Calculate required replicas based on network bandwidth
# Required bandwidth = GPU_count * 3.3 Gbps (recommended) or 1 Gbps (minimum)
# Example: 66 GPUs * 3.3 Gbps = 217.8 Gbps
# VM NIC capacity: 10 Gbps → required replicas ~ 22

# Update Helm values: cluster.replicas
# Edit current-values.yaml

# Apply updated replica count
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Wait for new pods to become ready
kubectl get pods -n ucc -w

# Verify load distribution across replicas
kubectl top pods -n ucc
Analysis:

- Current replica count compared to the calculated requirement determines scaling needs.
- Post-scaling, connection load should distribute evenly across pods.
- Network bandwidth per pod should be well below NIC capacity (target <70%).

Resolution:

- Scale replicas to match the calculated requirement (network bandwidth-based sizing).
- Apply via Helm upgrade: `helm upgrade <release> <chart> -n ucc -f values.yaml`
- Monitor connection distribution and response times across all replicas.
- Verify the load balancer distributes traffic evenly across new pods.

Increase CPU Allocation for More Worker Processes
If CPU utilization is high (>80%) and connection saturation persists, increase CPU allocation to spawn more NGINX worker processes (one per core).
# Check current CPU allocation
kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

# Check actual CPU usage
kubectl top pods -n ucc

# If CPU usage consistently >80%, increase CPU limits
# Edit Helm values: cluster.container.resources.limits.cpu
# Example: increase from 16 to 32 vCPUs

# Apply updated CPU allocation
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Verify NGINX spawned more workers
kubectl exec -n ucc <ucc-pod> -- ps aux | grep "nginx: worker process" | wc -l
# Should equal vCPU count
Analysis:

- High CPU utilization (>80%) with connection queueing indicates a CPU bottleneck.
- The number of NGINX worker processes equals the vCPU count (one per core).
- More workers allow more concurrent connection handling.

Resolution:

- Increase CPU allocation to match workload needs.
- Verify the worker process count equals the new vCPU allocation.
- Monitor CPU and connection metrics post-upgrade.
- Consider upgrading the VM SKU if CPU limits are reached.
Other Diagnostic Actions#
Check load balancer distribution: Verify traffic distributes evenly across UCC replicas:
# Check request distribution across pods (from UCC metrics)
kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
curl http://localhost:9145/metrics | grep "nginx_http_requests_total"

# Compare request counts per pod
# Uneven distribution may indicate load balancer issues
Review client session affinity: Check if load balancer has session affinity enabled:
# Check Service configuration
kubectl get svc -n ucc <ucc-service> -o yaml | grep -A 5 "sessionAffinity"

# For cloud provider load balancers, check annotations
kubectl get svc -n ucc <ucc-service> -o yaml | grep -i "affinity\|sticky"
Monitor cloud provider SNAT usage: Check if SNAT port exhaustion is contributing:
# For Azure AKS:
# az monitor metrics list \
#   --resource <load-balancer-resource-id> \
#   --metric "UsedSnatPorts,AllocatedSnatPorts" \
#   --interval PT1M --aggregation Average

# For AWS:
# Check NAT Gateway or load balancer connection tracking metrics
Prevention#
Proactive Monitoring#
Set up alerts for:

- Connection utilization thresholds: Alert when active connections exceed 80% of the `worker_connections` limit
- Connection queueing: Alert when `nginx_connections_waiting` is non-zero for >30 seconds
- P99 response time degradation: Alert when P99 exceeds baseline by 3x (e.g., >1.5s if baseline is 500ms)
- CPU utilization: Alert when CPU usage exceeds 80% for >5 minutes
- 5xx error rate increases: Alert on sharp increases in 5xx response codes
Configuration Best Practices#
- Size worker connections appropriately: Use the sizing formula `(GPU_count * concurrency) / replicas / vCPU * 1.5`
- Provision adequate CPU: Allocate vCPUs to match connection handling needs (one worker process per core)
- Scale replicas for network bandwidth: Provision at least 3.3 Gbps per GPU (minimum 1 Gbps)
- Enable HTTP/2: Configure HTTP/2 on both UCC and clients to multiplex requests over fewer connections
- Enable session affinity: Configure load balancer session affinity (30-60s timeout) to improve retry efficiency
- Monitor connection trends: Track connection usage over time to predict when scaling is needed
- Plan for traffic spikes: Size capacity for peak concurrent simulations, not average load
Capacity Planning#
- Calculate connection requirements: Use the formula above to determine `worker_connections` based on GPU count and client concurrency
- Account for multiple concurrent simulations: Multiply requirements by the concurrent simulation count
- Provision headroom: Add a 50% safety margin to calculated values to handle traffic bursts
- Plan VM SKU upgrades: Select VM SKUs with adequate vCPUs and network bandwidth for the workload
- Test under load: Validate the configuration with a representative workload before production deployment
- Monitor during scale-out: Track metrics during GPU fleet expansion to predict UCC scaling needs