UCC: Connection Saturation and High Response Times#

Overview#

USD Content Cache (UCC) serves cached USD assets to render workers over HTTP/HTTPS. When connection limits are exceeded or worker capacity is insufficient, UCC response times degrade, client requests time out, and simulations fail. UCC uses NGINX as its foundation, which has per-worker connection limits that must be sized appropriately for the workload.

Connection saturation occurs when:

  • Worker connection limits (worker_connections) are undersized for concurrent client requests

  • CPU cores allocated to UCC are insufficient for the connection handling workload

  • Replica count is too low to distribute connection load across the cluster

  • Client concurrency spikes exceed UCC’s configured capacity

  • HTTP/1.1 is used instead of HTTP/2, requiring one connection per concurrent request

When connection saturation occurs, UCC cannot accept new connections, client requests queue or time out, and P99 response times increase dramatically (from milliseconds to seconds or tens of seconds). This manifests as simulation failures with timeout errors or “connection refused” messages.

The recommended sizing formula for UCC connections is:

worker_connections = (GPU_count * client_concurrency) / replica_count / vCPU_count * safety_margin

For example, with 66 GPUs, 256 client concurrency, 5 replicas, and 32 vCPUs per replica:

worker_connections = (66 * 256) / 5 / 32 * 1.5 = 158.4 ~ 160 (rounded up)
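The sizing arithmetic can be double-checked with a few lines of shell (awk handles the floating-point division; the values are the example figures above — substitute your own GPU count, client concurrency, replica count, vCPUs, and safety margin):

```shell
# Evaluate the worker_connections sizing formula with the example values
# from this guide.
gpus=66; concurrency=256; replicas=5; vcpus=32; margin=1.5

awk -v g="$gpus" -v c="$concurrency" -v r="$replicas" -v v="$vcpus" -v m="$margin" \
  'BEGIN {
     wc = g * c / r / v * m              # per-worker connection requirement
     printf "worker_connections >= %d\n", (wc == int(wc)) ? wc : int(wc) + 1
   }'
```

Rounding the result up to a comfortable value (for example 160) leaves slack for connection churn.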

Symptoms and Detection Signals#

Visible Symptoms#

  • High P99 response times - Response times exceeding 5-10 seconds at the P99 percentile

  • Client timeout errors - Render workers reporting connection timeouts or “connection refused”

  • Simulation failures - Simulations failing with gRPC UNKNOWN errors or HTTP timeout errors

  • Connection queue buildup - Connections waiting for available worker slots

Log Messages#

Connection Refused#

Location: Render Worker Pod
Application: OmniClient / Storage Service
Description: Logs indicating connection failures to UCC.
# Look for timeout or connection errors in render worker logs
# Patterns may include:
# - *refused*
# - *timeout*
# - *connect*
# - *dial*

Timeout Errors#

Location: Render Worker Pod
Application: Storage Service
Description: Logs indicating request timeouts from UCC.
# Look for timeout errors in render worker logs
# Patterns may include:
# - *timeout*
# - *deadline*
# - *context*

Metric Signals#

The following Prometheus metrics can be used to detect connection saturation before it causes simulation failures. Monitor these metrics proactively to identify capacity issues early.

  • nginx_connections_active{pod=~"usd-content-cache-.*", namespace="ucc"}

    Active connections per NGINX worker. High values approaching the worker_connections limit indicate connection saturation risk. Alert when exceeding 80% of the configured limit.

  • nginx_connections_waiting{pod=~"usd-content-cache-.*", namespace="ucc"}

    Connections waiting for available worker slots. Non-zero values indicate connection queueing; high values indicate saturation. This should typically be zero or very low.

  • nginx_http_request_duration_seconds{quantile="0.99", pod=~"usd-content-cache-.*"}

    P99 request duration. Values exceeding 5-10 seconds indicate severe performance degradation. Compare against baseline (typically <500ms for cache hits).

  • rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", status=~"5.."}[5m])

    Rate of 5xx errors from UCC. Sharp increases may indicate connection saturation causing request failures. Normal operation should have minimal 5xx errors.

  • container_network_receive_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}

    Total bytes received by UCC pods. High values approaching NIC capacity may indicate network saturation contributing to connection issues. Compare against VM SKU NIC limits.

Root Cause Analysis#

Known Causes#

Connection saturation in UCC is typically caused by undersized worker connection limits, insufficient CPU cores, or too few replicas to handle the workload.

Undersized Worker Connection Limits#

The worker_connections parameter in NGINX controls the maximum number of simultaneous connections each worker process can handle. The default value (often 1,024) is insufficient for high-concurrency simulation workloads. Each NGINX worker process runs on one CPU core, so total connection capacity is worker_connections * vCPU_count * replica_count.

For example, with default worker_connections=1024, 32 vCPUs, and 5 replicas:

Total capacity = 1,024 * 32 * 5 = 163,840 connections

However, this includes both inbound (client→UCC) and outbound (UCC→S3) connections. A workload with 66 GPUs and 256 client concurrency requires:

Required connections ~ 66 * 256 = 16,896 inbound, plus outbound connections to S3

If worker_connections is too low, NGINX rejects new connections once the limit is reached, causing “connection refused” errors and queueing.
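As a rough sketch, the capacity-versus-demand comparison above can be scripted (all numbers are the example values from this section; the 80% threshold mirrors the alerting guidance used elsewhere in this guide):

```shell
# Compare cluster-wide NGINX connection capacity against estimated inbound
# client demand, using the example values from this section.
worker_connections=1024   # NGINX default
vcpus=32; replicas=5
gpus=66; concurrency=256

capacity=$(( worker_connections * vcpus * replicas ))
demand=$(( gpus * concurrency ))

echo "total capacity: $capacity connections"
echo "inbound demand: $demand connections (outbound connections to S3 add to this)"
if [ "$demand" -gt $(( capacity * 8 / 10 )) ]; then
  echo "WARNING: demand exceeds 80% of capacity"
fi
```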

Check current worker connection configuration:

# Check Helm values for worker_connections
helm get values <ucc-release-name> -n ucc | grep worker_connections

# If not set, check default from ConfigMap
kubectl get configmap -n ucc <ucc-configmap> -o yaml | grep worker_connections

Insufficient CPU Cores#

NGINX spawns one worker process per CPU core. If CPU allocation is too low, even with properly sized worker_connections, UCC cannot handle the connection load because there are not enough worker processes to distribute connections across.

Check CPU allocation:

# Check UCC pod CPU limits and requests
kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

# Check actual CPU usage
kubectl top pods -n ucc

Too Few Replicas#

UCC replica count may be too low to distribute connection load. The recommended sizing is based on network bandwidth requirements: provision 3.3 Gbps of network bandwidth per GPU (recommended), with 1 Gbps per GPU as the absolute minimum.

For example, with 66 GPUs requiring 217.8 Gbps total bandwidth, and VMs with 10 Gbps NICs:

Required replicas ~ 217.8 Gbps / 10 Gbps per node ~ 22 pods

However, network bandwidth is typically the primary sizing factor; connection limits are secondary.
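The bandwidth-based replica estimate can be sketched the same way (3.3 Gbps per GPU and a 10 Gbps NIC are the example figures above; substitute your VM SKU's actual NIC limit):

```shell
# Estimate the UCC replica count from network bandwidth requirements,
# using the example values from this section.
gpus=66; gbps_per_gpu=3.3; nic_gbps=10

awk -v g="$gpus" -v b="$gbps_per_gpu" -v n="$nic_gbps" \
  'BEGIN {
     total = g * b                         # total required bandwidth, Gbps
     replicas = int(total / n)
     if (replicas * n < total) replicas++  # round up to whole pods
     printf "required bandwidth: %.1f Gbps -> %d replicas\n", total, replicas
   }'
```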

Check current replica count:

# Check UCC StatefulSet replica count
kubectl get statefulset -n ucc

# Check Helm values
helm get values <ucc-release-name> -n ucc | grep replicas

Other Possible Causes#

  1. HTTP/1.1 Instead of HTTP/2

    • HTTP/1.1 requires one connection per concurrent request

    • HTTP/2 multiplexes multiple requests over a single connection

    • Using HTTP/1.1 exhausts connections faster than HTTP/2

    • Check client HTTP version support and UCC HTTP/2 configuration

  2. Load Balancer Session Affinity Disabled

    • Without session affinity, client retries hit different UCC pods

    • Retry storms amplify connection pressure across all pods

    • Each retry counts as a new connection without affinity

  3. Cloud Provider SNAT Port Exhaustion

    • Outbound connections to S3 consume SNAT ports

    • SNAT exhaustion prevents new upstream connections

    • More common in cloud environments with NAT gateways or load balancers

  4. Network Latency or Packet Loss

    • High network latency increases connection lifetime

    • Packet loss triggers retransmissions and connection delays

    • Connections remain open longer, consuming worker slots
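For the HTTP/2 point above, the relevant NGINX directive is sketched below. Whether UCC exposes this setting directly depends on the Helm chart, so treat the snippet as illustrative rather than as the chart's actual configuration surface:

```nginx
server {
    # NGINX 1.25.1+ uses a standalone "http2 on;" directive;
    # older versions use the listen-flag form shown here.
    listen 443 ssl http2;
    ...
}
```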

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Monitor Connection Metrics and Identify Saturation

    Check active and waiting connection counts to determine if saturation is occurring.

    # Access UCC metrics endpoint
    # (run port-forward in the background or a separate terminal)
    kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145 &
    curl http://localhost:9145/metrics | grep "nginx_connections"
    
    # Query Prometheus metrics:
    # - nginx_connections_active{pod=~"usd-content-cache-.*"}
    # - nginx_connections_waiting{pod=~"usd-content-cache-.*"}
    
    # Check if active connections approach worker_connections limit
    # Alert threshold: active > 0.8 * worker_connections
    
    Analysis:
    - Active connections consistently near worker_connections limit indicate saturation.
    - Non-zero nginx_connections_waiting indicates connection queueing.
    - P99 response times >5s correlate with high active connection counts.
    - Compare active connections across pods to identify load distribution issues.
    Resolution:
    - If active connections approach limit, increase worker_connections (see step 2).
    - If queueing occurs, scale replicas or increase CPU allocation (see steps 3-4).
    - Monitor connection metrics after changes to verify improvements.
  2. Increase Worker Connection Limits

    If connection saturation is detected, increase the worker_connections parameter in NGINX configuration.

    # Get current Helm values
    helm get values <ucc-release-name> -n ucc -o yaml > current-values.yaml
    
    # Edit: Add or update nginx.workerConnections
    # Recommended: (GPU_count * concurrency) / replicas / vCPU * 1.5
    # Example: (66 * 256) / 5 / 32 * 1.5 = 158.4 ~ 160 (rounded up)
    
    # Apply updated values
    helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml
    
    # Verify configuration applied
    kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep worker_connections
    
    # Monitor connection metrics after upgrade
    # (run port-forward in the background or a separate terminal)
    kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145 &
    curl http://localhost:9145/metrics | grep "nginx_connections_active"
    
    Analysis:
    - Current worker_connections compared to calculated requirement determines if increase is needed.
    - Post-upgrade, active connections should remain well below new limit (target <70%).
    - Verify no connection queueing (nginx_connections_waiting should be zero).
    Resolution:
    - Set worker_connections to calculated value based on sizing formula.
    - Apply via Helm upgrade: helm upgrade <release> <chart> -n ucc -f values.yaml
    - Restart UCC pods if configuration hot-reload is not supported.
    - Monitor metrics for 24-48 hours to validate capacity improvements.
  3. Scale UCC Replicas to Distribute Load

    If connection saturation persists after increasing worker limits, scale the number of UCC replicas to distribute load across more pods.

    # Check current replica count
    kubectl get statefulset -n ucc
    
    # Calculate required replicas based on network bandwidth
    # Required bandwidth = GPU_count * 3.3 Gbps (recommended) or 1 Gbps (minimum)
    # Example: 66 GPUs * 3.3 Gbps = 217.8 Gbps
    # VM NIC capacity: 10 Gbps → required replicas ~ 22
    
    # Update Helm values: cluster.replicas
    # Edit current-values.yaml
    
    # Apply updated replica count
    helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml
    
    # Wait for new pods to become ready
    kubectl get pods -n ucc -w
    
    # Verify load distribution across replicas
    kubectl top pods -n ucc
    
    Analysis:
    - Current replica count compared to calculated requirement determines scaling needs.
    - Post-scaling, connection load should distribute evenly across pods.
    - Network bandwidth per pod should be well below NIC capacity (target <70%).
    Resolution:
    - Scale replicas to match calculated requirement (network bandwidth-based sizing).
    - Apply via Helm upgrade: helm upgrade <release> <chart> -n ucc -f values.yaml
    - Monitor connection distribution and response times across all replicas.
    - Verify load balancer distributes traffic evenly across new pods.
  4. Increase CPU Allocation for More Worker Processes

    If CPU utilization is high (>80%) and connection saturation persists, increase CPU allocation to spawn more NGINX worker processes (one per core).

    # Check current CPU allocation
    kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
    
    # Check actual CPU usage
    kubectl top pods -n ucc
    
    # If CPU usage consistently >80%, increase CPU limits
    # Edit Helm values: cluster.container.resources.limits.cpu
    # Example: increase from 16 to 32 vCPUs
    
    # Apply updated CPU allocation
    helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml
    
    # Verify NGINX spawned more workers
    kubectl exec -n ucc <ucc-pod> -- ps aux | grep "nginx: worker process" | wc -l
    # Should equal vCPU count
    
    Analysis:
    - High CPU utilization (>80%) with connection queueing indicates CPU bottleneck.
    - Number of NGINX worker processes equals vCPU count (one per core).
    - More workers allow more concurrent connection handling.
    Resolution:
    - Increase CPU allocation to match workload needs.
    - Verify worker process count equals new vCPU allocation.
    - Monitor CPU and connection metrics post-upgrade.
    - Consider upgrading VM SKU if CPU limits are reached.

Other Diagnostic Actions#

  • Check load balancer distribution: Verify traffic distributes evenly across UCC replicas:

    # Check request distribution across pods (from UCC metrics;
    # run port-forward in the background or a separate terminal)
    kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145 &
    curl http://localhost:9145/metrics | grep "nginx_http_requests_total"
    
    # Compare request counts per pod
    # Uneven distribution may indicate load balancer issues
    
  • Review client session affinity: Check if load balancer has session affinity enabled:

    # Check Service configuration
    kubectl get svc -n ucc <ucc-service> -o yaml | grep -A 5 "sessionAffinity"
    
    # For cloud provider load balancers, check annotations
    kubectl get svc -n ucc <ucc-service> -o yaml | grep -i "affinity\|sticky"
    
  • Monitor cloud provider SNAT usage: Check if SNAT port exhaustion is contributing:

    # For Azure AKS:
    # az monitor metrics list \
    #   --resource <load-balancer-resource-id> \
    #   --metric "UsedSnatPorts,AllocatedSnatPorts" \
    #   --interval PT1M --aggregation Average
    
    # For AWS:
    # Check NAT Gateway or load balancer connection tracking metrics
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Connection utilization thresholds: Alert when active connections exceed 80% of worker_connections limit

  • Connection queueing: Alert when nginx_connections_waiting is non-zero for >30 seconds

  • P99 response time degradation: Alert when P99 exceeds baseline by 3x (e.g., >1.5s if baseline is 500ms)

  • CPU utilization: Alert when CPU usage exceeds 80% for >5 minutes

  • 5xx error rate increases: Alert on sharp increases in 5xx response codes
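As a sketch, the first two alerts above might look like the following Prometheus alerting rules. Metric and label names follow the examples in this guide; the worker_connections value of 1024 is the NGINX default and must be replaced with your configured limit:

```yaml
groups:
  - name: ucc-connection-saturation
    rules:
      - alert: UCCConnectionsNearLimit
        # 1024 is the NGINX default; substitute your configured limit
        expr: nginx_connections_active{pod=~"usd-content-cache-.*"} > 0.8 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "UCC active connections above 80% of worker_connections"
      - alert: UCCConnectionQueueing
        expr: nginx_connections_waiting{pod=~"usd-content-cache-.*"} > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "UCC connections queueing for worker slots"
```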

Configuration Best Practices#

  • Size worker connections appropriately: Use sizing formula: (GPU_count * concurrency) / replicas / vCPU * 1.5

  • Provision adequate CPU: Allocate vCPUs to match connection handling needs (one worker process per core)

  • Scale replicas for network bandwidth: Provision 3.3 Gbps per GPU (recommended), with 1 Gbps per GPU as the minimum

  • Enable HTTP/2: Configure HTTP/2 on both UCC and clients to multiplex requests over fewer connections

  • Enable session affinity: Configure load balancer session affinity (30-60s timeout) to improve retry efficiency

  • Monitor connection trends: Track connection usage over time to predict when scaling is needed

  • Plan for traffic spikes: Size capacity for peak concurrent simulations, not average load
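The session affinity recommendation above can be sketched on a Kubernetes Service as follows (the Service name and namespace are assumptions; adjust to your deployment, and note that cloud provider load balancers may require annotations instead):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: usd-content-cache   # assumed name; match your UCC Service
  namespace: ucc
spec:
  # Pin each client IP to one pod so retries reuse warm connections
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 60    # 30-60s per the guidance above
```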

Capacity Planning#

  • Calculate connection requirements: Use formula above to determine worker_connections based on GPU count and client concurrency

  • Account for multiple concurrent simulations: Multiply requirements by concurrent simulation count

  • Provision headroom: Add 50% safety margin to calculated values to handle traffic bursts

  • Plan VM SKU upgrades: Select VM SKUs with adequate vCPUs and network bandwidth for workload

  • Test under load: Validate configuration with representative workload before production deployment

  • Monitor during scale-out: Track metrics during GPU fleet expansion to predict UCC scaling needs