UCC: Upstream S3 Connection Spikes and High Connect Time#
Overview#
UCC fetches uncached content from upstream S3 buckets during cache MISS scenarios. When the number of concurrent connections to S3 becomes excessive (>50,000-60,000), connection establishment times increase dramatically (P99 connect time from milliseconds to 5-10 seconds), S3 may return 503 (Service Unavailable) errors, and cloud provider SNAT ports may become exhausted.
In the current release, NGINX opens a new TCP connection to S3 for every request. Each connection requires a TCP handshake plus a TLS handshake, consuming both time and SNAT ports. With high cache MISS rates (>3,000 RPS to S3), this quickly exhausts available connections and SNAT capacity.
Additionally, S3 uses DNS-based load balancing with short TTLs (5-60 seconds) to distribute requests. High DNS cache TTL in NGINX prevents discovery of new S3 endpoints as S3 scales, concentrating traffic on a subset of endpoints and worsening connection bottlenecks.
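One way to observe these short TTLs is to inspect `dig` answers for an S3 endpoint. The sketch below parses a hypothetical captured answer (the bucket hostname and IPs are illustrative); in practice, run `dig +noall +answer <bucket>.s3.amazonaws.com` repeatedly and watch both the TTL and the rotating A records:

```shell
# Hypothetical captured output of:
#   dig +noall +answer my-bucket.s3.amazonaws.com
cat <<'EOF' > /tmp/s3_dig_sample.txt
my-bucket.s3.amazonaws.com. 5 IN A 52.216.0.1
my-bucket.s3.amazonaws.com. 5 IN A 52.216.0.2
EOF

# Column 2 is the remaining TTL in seconds. A 5s TTL means NGINX must
# re-resolve frequently, or it will miss new endpoints as S3 scales out.
awk '{print $1, "TTL=" $2 "s", $NF}' /tmp/s3_dig_sample.txt
```

If NGINX's DNS cache TTL is much longer than the TTL shown here, traffic stays pinned to the endpoints resolved at startup.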
Note: Connection reuse to S3 is not available in the current release. Future versions will include a Lua-based connection pooling balancer. Operators experiencing this issue should focus on reducing cache MISS rates and monitoring SNAT capacity.
Symptoms and Detection Signals#
Visible Symptoms#
High upstream connection counts - Connections to S3 exceeding 50,000-60,000 (estimated from SNAT metrics)
Slow S3 connect times - P99 connect time exceeding 5-10 seconds (baseline should be <100ms)
S3 503 errors - S3 returning “Service Unavailable” due to request rate limits
SNAT port exhaustion warnings - Cloud provider reporting high SNAT port usage (>70% of allocated)
Log Messages#
Metric Signals#
| Metric | Description |
|---|---|
| `histogram_quantile(0.99, upstream_connect_time_seconds_bucket)` (from UCC access logs) | P99 upstream connect time to S3. Baseline should be <100ms. Values exceeding 1-2 seconds indicate connection establishment bottlenecks; spikes to 5-10s indicate severe connection saturation. |
| Cloud provider SNAT metrics (Azure: `UsedSnatPorts` via `az monitor metrics`; AWS: NAT Gateway `ActiveConnectionCount`) | Used SNAT ports for outbound connections. Values approaching the allocated limit indicate exhaustion risk. Alert threshold: >70% of allocated ports. |
| `rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", upstream_status="503"}[5m])` | Rate of 503 errors from S3. Sharp increases indicate S3 rate limiting due to excessive request rates. |
Root Cause Analysis#
Known Causes#
Upstream S3 connection spikes are caused by lack of connection reuse in the current release. No operator-side configuration workaround is available.
No Connection Reuse to S3#
The current release cannot reuse connections to dynamically resolved S3 bucket hostnames. Each request opens a new TCP and TLS connection, causing connection count to scale linearly with request rate. High cache MISS rates drive S3 request rates, which drive connection count.
For example, 3,000 requests/sec to S3 with 10-second connection lifetime results in 30,000 concurrent connections. With no reuse, connection count grows until SNAT exhaustion or performance degradation occurs.
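The arithmetic above follows Little's law: concurrent connections are approximately the arrival rate multiplied by the connection lifetime. A quick sketch with the figures from the example (both inputs are illustrative):

```shell
# Little's law estimate: concurrent connections ~= request rate x connection lifetime
rps=3000       # cache-MISS requests/sec reaching S3
lifetime=10    # seconds each connection stays open (handshakes + transfer)
echo "$((rps * lifetime)) concurrent connections"
```

Doubling either the MISS-driven request rate or the connection lifetime doubles the concurrent connection count, which is why both cache effectiveness and network latency to S3 matter.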
No operator workaround exists. Future versions will include connection pooling; operators should focus on reducing cache MISS rates to lower S3 request volume.
High Cache MISS Rate Driving S3 Requests#
Excessive cache MISSes cause UCC to fetch from S3 more frequently than necessary. Each MISS triggers an upstream request and consumes a new connection. Root causes of high MISS rates include undersized metadata cache, cold cache, or cache eviction.
Check cache HIT/MISS ratio:
```shell
# Query cache status from access logs
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c

# Expected for warm cache: >70% HIT rate
# High MISS rates trigger more S3 connections
```
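To turn the raw `sort | uniq -c` counts into a HIT percentage, a small awk pass suffices. The sample counts below are hypothetical; feed in the real output of the command above:

```shell
# Hypothetical `sort | uniq -c` output from the cache-status query
cat <<'EOF' > /tmp/cache_status_counts.txt
   7200 HIT
   2400 MISS
    400 EXPIRED
EOF

# Compute the HIT percentage; below ~70% on a warm cache warrants investigation
awk '{total += $1; if ($2 == "HIT") hits = $1}
     END {printf "HIT rate: %.1f%%\n", 100 * hits / total}' /tmp/cache_status_counts.txt
```

Tracking this percentage over time makes regressions (e.g. after a cache eviction or restart) easy to spot.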
Other Possible Causes#
SNAT Port Allocation Insufficient - Allocated SNAT ports too low for outbound connection volume
S3 Rate Limiting - S3 returning 503 when request rate exceeds bucket limits
Network Latency to S3 - High RTT causing connections to remain open longer
Troubleshooting Steps#
Diagnostic Steps#
Monitor Upstream Connection Count and Connect Time
Measure connection metrics to quantify the issue.
```shell
# Monitor SNAT usage (proxy for connection count)
# Azure:
# az monitor metrics list \
#   --resource <load-balancer-resource-id> \
#   --metric "UsedSnatPorts,AllocatedSnatPorts,SnatConnectionCount"
# AWS:
# CloudWatch metrics for NAT Gateway: ActiveConnectionCount

# Check upstream connect time from logs
kubectl logs -n ucc <ucc-pod> --tail=10000 | \
  grep "upstream_connect_time" | awk -F'"' '{print $32}' | \
  awk '{if ($1 > 0) {sum += $1; count++}} END {if (count > 0) print "Avg:", sum/count}'

# Parse P99 connect time manually or via log aggregation tool
```
Analysis:
- SNAT usage >70% indicates connection exhaustion risk.
- P99 connect times >1-2s indicate saturation (baseline <100ms).

Resolution:
- No configuration fix is available in the current release.
- Focus on reducing the cache MISS rate (step 2).

Reduce Cache MISS Rate to Lower S3 Request Volume
Since connection reuse is not available, reduce the number of upstream S3 requests by improving cache effectiveness.
```shell
# Check current cache HIT/MISS ratio
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c

# If MISS rate >30%, investigate causes:
# - Cold cache: implement pre-warming (see Cache Expiration Stampede runbook)
# - Metadata cache undersized: increase keys_zone (see Metadata Cache runbook)
# - Disk issues: see Data Disk Bandwidth runbook
```
Resolution:
- Improve cache HIT rates to reduce S3 request volume.
- Each 10% improvement in HIT rate reduces S3 connections proportionally.
- See related runbooks for cache effectiveness improvements.

Monitor Cloud Provider SNAT Allocation
Verify SNAT port allocation is sufficient and monitor usage trends.
```shell
# Azure: Check load balancer SNAT metrics
# az monitor metrics list \
#   --resource <load-balancer-resource-id> \
#   --metric "AllocatedSnatPorts,UsedSnatPorts"

# Calculate headroom: (allocated - peak_used) / allocated
# If <30% headroom, request SNAT allocation increase

# AWS: Check NAT Gateway capacity
# Monitor ActiveConnectionCount and ConnectionAttemptCount
```
Resolution:
- If SNAT usage is consistently >70%, coordinate with the cloud provider for an allocation increase.
- Plan for headroom as the workload scales.
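The headroom calculation can be scripted against the cloud metrics. The allocation and peak-usage values below are hypothetical samples; substitute the numbers reported by `az monitor metrics` or CloudWatch:

```shell
# SNAT headroom check (sample values; replace with real cloud metrics)
allocated=64000    # AllocatedSnatPorts
peak_used=48000    # peak UsedSnatPorts over the observation window
headroom=$(( 100 * (allocated - peak_used) / allocated ))
echo "SNAT headroom: ${headroom}%"

# Below the 30% target, request an allocation increase before the next traffic peak
```

Running this periodically and graphing the trend gives early warning well before exhaustion.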
Prevention#
Proactive Monitoring#
Set up alerts for:
Upstream connect time degradation: Alert when P99 connect time exceeds 500ms
S3 503 error rate: Alert when S3 returns 503 errors (rate >1% of requests)
SNAT port utilization: Alert when usage exceeds 70% of allocated ports
Cache MISS rate increases: Alert when MISS rate exceeds 30% for warm cache
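The first two alerts can be expressed as PromQL conditions against the same series used in the Metric Signals section. This is a sketch: the metric names match those shown above, and the thresholds mirror the list; adapt both to your monitoring stack:

```
# P99 connect time above 500ms
histogram_quantile(0.99, upstream_connect_time_seconds_bucket) > 0.5

# S3 503 errors above 1% of requests
rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", upstream_status="503"}[5m])
  / rate(nginx_http_requests_total{pod=~"usd-content-cache-.*"}[5m]) > 0.01
```

Wrapping these in alerting rules with a `for:` duration (e.g. 10 minutes) avoids paging on transient spikes.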
Mitigation Strategies#
Pre-warm cache before production loads: Reduce initial MISS burst (see Cache Expiration Stampede runbook)
Improve cache hit ratios: Address metadata cache and disk sizing (see related runbooks)
Monitor SNAT capacity: Track SNAT usage trends; request increases before exhaustion
Coordinate with cloud provider: Work with support to increase SNAT allocation if needed
Future Improvements#
Connection pooling and reuse features are under development for future releases. These features will significantly reduce upstream connection count and improve performance. Operators experiencing this issue should monitor release notes for availability.