UCC: Metadata Cache Undersizing#

Overview#

UCC uses NGINX’s proxy_cache_path keys_zone parameter to allocate shared memory for tracking cached items. This metadata cache (keys zone) stores information about which URLs are cached, their cache keys, expiration times, and storage locations. When the metadata cache is undersized, NGINX silently evicts the oldest cache metadata using LRU (Least Recently Used), causing cache misses even when content is present on disk.

The metadata cache is configured via the keys_zone parameter. For example:

proxy_cache_path /cache/data
                 levels=1:2
                 keys_zone=ucc_cache:256m
                 inactive=1d
                 max_size=500g;

Here, keys_zone=ucc_cache:256m allocates 256 MiB of shared memory. Each cache entry consumes approximately 200-300 bytes, so 256 MiB tracks roughly 900,000-1,300,000 items. The default 256 MiB is often insufficient for large simulation workloads with millions of USD assets.

Sizing formula:

keys_zone_size = (unique_urls * 250 bytes) * 1.5 headroom

Example: 1.5M URLs * 250 bytes * 1.5 ≈ 562 MB → round up to 1g (512m falls just short of the computed requirement)
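The formula can be evaluated with plain shell arithmetic. A sketch, with the URL count as the only input (250 bytes/entry and the 1.5x headroom factor are the estimates from this section):

```shell
# Estimate keys_zone size for a given number of unique URLs
urls=1500000
bytes=$(( urls * 250 * 3 / 2 ))             # 250 bytes/entry with 1.5x headroom
mib=$(( (bytes + 1048575) / 1048576 ))      # round up to whole MiB
echo "${urls} unique URLs -> keys_zone of at least ${mib} MiB"
```

The 537 MiB this prints is the same quantity as the 562 MB in the example above (binary MiB vs. decimal MB); rounding up to the next standard size gives a 1g keys_zone.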

When undersized, NGINX evicts metadata for the oldest cached items (no error messages logged). Clients requesting evicted items experience cache MISSes despite content being on disk, triggering re-fetches from S3.

Symptoms and Detection Signals#

Visible Symptoms#

  • High cache miss rates despite warm cache - Cache showing MISS for content that was previously cached

  • Frequent S3 re-fetches - Content being fetched from S3 despite existing on disk

  • Silent cache metadata eviction - No error messages; NGINX quietly evicts oldest metadata via LRU

  • Inconsistent cache hit ratios - Hit ratios varying significantly between identical workload runs

Cache Miss Patterns#

Location: UCC Access Logs
Application: NGINX
Description: HTTP access logs showing cache MISS for the same file repeatedly.
# SOURCE: NGINX Access Logs
# Look for patterns like:
upstream_cache_status: "MISS"
# Same file URI appearing multiple times with MISS status
# despite file being previously cached
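A quick way to confirm this pattern is to count repeated MISSes per URI. A sketch, assuming the logs have been captured to a local `access.log` and contain quoted `request_uri` and `upstream_cache_status` fields as shown above (the exact field layout depends on the configured `log_format`):

```shell
# Count MISS occurrences per URI; a high count for the same URI suggests
# its cache metadata is being evicted between requests.
grep 'upstream_cache_status: "MISS"' access.log \
  | grep -o '"request_uri": "[^"]*"' \
  | sort | uniq -c | sort -rn | head -20
```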

Metric Signals#

Metric:

rate(
  nginx_http_requests_total{
    pod=~"usd-content-cache-.*",
    cache_status="MISS"
  }[5m]
)

Description: Rate of cache MISS responses. A sustained MISS rate above 30% on a warm cache indicates the cache is not serving effectively; an undersized metadata cache prevents NGINX from tracking items that are already on disk.

Metric:

rate(
  nginx_http_requests_total{
    pod=~"usd-content-cache-.*",
    upstream_addr!="-"
  }[5m]
)

Description: Rate of requests proxied upstream to S3. A high rate on a warm cache indicates the cache is not serving content effectively.

Root Cause Analysis#

Known Causes#

Metadata cache undersizing is typically caused by using the default 256 MiB keys_zone size for workloads with >1 million cached items, or by applying configuration to the wrong Helm values section.

Default Metadata Cache Size Insufficient#

When asset count exceeds metadata tracking capacity (~1M items for 256 MiB), NGINX silently evicts older cache metadata. Clients requesting evicted items experience cache MISSes and re-fetch from S3.

Configuration Applied to Wrong Helm Values Section#

A common deployment mistake is setting metadataMemorySize in the wrong section of the Helm values file. Configuration must be under nginx.proxyCache.paths[] to take effect.
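A minimal sketch of the correct placement, using the key names referenced in this runbook (surrounding values are illustrative, and this assumes the chart only reads the key at this path):

```yaml
# Correct: metadataMemorySize lives under nginx.proxyCache.paths[]
nginx:
  proxyCache:
    paths:
      - name: s3                    # illustrative backend name
        metadataMemorySize: 512m    # takes effect here

# Ineffective: the same key at the top level is not read by the chart,
# so the default 256m keys_zone remains in effect
# metadataMemorySize: 512m
```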

Other Possible Causes#

  1. Highly Complex URL Structures - Very long URLs consuming more bytes per cache key

  2. Multiple Cache Zones Competing for Memory - Multiple cache paths with separate keys_zones

  3. Workload Access Patterns - Extremely diverse, low-reuse URL sets that churn the LRU list faster than entries can be re-requested

Troubleshooting Steps#

Diagnostic Steps#

  1. Verify Configuration Was Applied Correctly

    Check if metadata cache size was applied and rendered correctly in NGINX config.

    # Check Helm values
    helm get values <ucc-release-name> -n ucc -o yaml | grep -B 5 -A 5 "metadataMemorySize"
    
    # Expected location:
    # nginx:
    #   proxyCache:
    #     paths:
    #       - name: s3
    #         metadataMemorySize: 512m  # or larger
    
    # Check rendered NGINX configuration
    kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"
    
    # Should show: keys_zone=s3:512m (or configured size)
    # If shows 256m, configuration was not applied or was overridden
    
    Analysis:
    - Mismatch between Helm values and rendered config indicates configuration not applied.
    - Missing metadataMemorySize or wrong YAML section causes default 256m to be used.
    Resolution:
    - If mismatch found, correct Helm values location (must be under nginx.proxyCache.paths).
    - Reapply: helm upgrade <release> <chart> -n ucc -f values.yaml
    - Verify rendered config post-upgrade.
  2. Monitor Cache Miss Rates and Decide If Resize Is Needed

    Check cache hit ratios to determine if metadata cache is undersized.

    # Check cache HIT/MISS distribution
    # (the awk field position below depends on the configured log_format)
    kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
      awk -F'"' '{print $26}' | sort | uniq -c
    
    # Calculate hit ratio: HIT / (HIT + MISS) * 100%
    # Warm cache target: >80% HIT rate
    
    # If warm cache shows <70% HIT rate, investigate
    # Metadata cache undersizing is one possible cause
    
    # Estimate unique URL count accessed
    # (the awk field position below depends on the configured log_format)
    kubectl logs -n ucc <ucc-pod> --tail=50000 | \
      grep "request_uri" | awk -F'"' '{print $8}' | sort -u | wc -l
    
    # Compare against metadata cache capacity
    # 256m ~ 1M items; 512m ~ 2M items; 1g ~ 4M items
    
    Analysis:
    - Warm cache with <70% HIT rate may indicate metadata cache issues.
    - If unique URL count exceeds keys_zone capacity, resize is needed.
    Resolution:
    - If unique URLs > capacity, increase metadataMemorySize (step 3).
    - If HIT rate is low for other reasons, see Poor Cache Hit Ratios runbook.
  3. Increase Metadata Cache Size

    Resize keys_zone based on workload asset count.

    # Edit Helm values
    # Under nginx.proxyCache.paths (for each backend):
    # - name: s3
    #   metadataMemorySize: 1g  # Increased from 256m
    #   path: /cache/s3
    
    # Apply via Helm upgrade
    helm upgrade <ucc-release-name> <chart-path> -n ucc -f values.yaml
    
    # Restart pods to pick up new configuration
    kubectl rollout restart statefulset -n ucc <ucc-statefulset>
    
    # Verify new configuration applied
    kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"
    
    Analysis:
    - Recommended sizes (per the sizing formula, with 1.5x headroom): 512m for up to ~1M URLs; 1g-2g for 1M-4M URLs.
    Resolution:
    - Apply via Helm upgrade.
    - Monitor cache HIT rates post-resize to validate improvement (target >80%).
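As a worked example of the hit-ratio check in step 2, the counts reported by `uniq -c` can be turned into a percentage. The counts below are illustrative, not from a real deployment:

```shell
# Hypothetical counts taken from the HIT/MISS distribution in step 2
hits=8200
misses=1800

# Integer hit ratio: HIT / (HIT + MISS) * 100
ratio=$(( hits * 100 / (hits + misses) ))
echo "warm-cache hit ratio: ${ratio}%"   # target is >80%; <70% warrants investigation
```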

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Cache miss rate thresholds: Alert when MISS rate exceeds 30% for warm cache

  • Cache hit ratio degradation: Alert when HIT ratio drops below 70%
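The first alert above can be sketched as a Prometheus alerting rule, assuming the `nginx_http_requests_total` metric and labels shown in the Metric Signals section (group and alert names are illustrative):

```yaml
groups:
  - name: ucc-cache-alerts                # illustrative group name
    rules:
      - alert: UCCWarmCacheMissRateHigh   # illustrative alert name
        expr: |
          sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="MISS"}[5m]))
            /
          sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*"}[5m])) > 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "UCC cache MISS rate above 30% for 15m (possible keys_zone undersizing)"
```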

Configuration Best Practices#

  • Size keys_zone for asset count: Use formula from Overview section

  • Start with 512 MiB - 1 GiB: For workloads with 1M-4M unique URLs

  • Verify configuration location: Ensure metadataMemorySize is under nginx.proxyCache.paths[]

  • Validate post-deployment: Verify rendered NGINX config matches Helm values

Capacity Planning#

  • Estimate unique URL count: Analyze scenes to determine asset count

  • Calculate metadata requirement: Use sizing formula; add 50% headroom

  • Test under peak load: Validate keys_zone size with maximum concurrent simulation count