UCC: Metadata Cache Undersizing
Overview
UCC uses NGINX’s proxy_cache_path keys_zone parameter to allocate shared memory for tracking
cached items. This metadata cache (keys zone) stores information about which URLs are cached, their
cache keys, expiration times, and storage locations. When the metadata cache is undersized, NGINX
silently evicts the oldest cache metadata using LRU (Least Recently Used), causing cache misses
even when content is present on disk.
The metadata cache is configured via the keys_zone parameter. For example:
proxy_cache_path /cache/data
levels=1:2
keys_zone=ucc_cache:256m
inactive=1d
max_size=500g;
Here, keys_zone=ucc_cache:256m allocates 256 MiB of shared memory. Each cache entry consumes
approximately 200-300 bytes, so 256 MiB tracks roughly 900,000-1,300,000 items. The default 256 MiB
is often insufficient for large simulation workloads with millions of USD assets.
Sizing formula:
keys_zone_size = (unique_urls * 250 bytes) * 1.5 headroom
Example: 1.5M URLs * 250 bytes * 1.5 = 562 MB → round up to 1g
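The sizing formula above can be sketched as a quick shell calculation. The inputs mirror the runbook's example values, not measured data; integer arithmetic is used throughout.

```shell
# Sizing sketch for keys_zone, using the formula above.
unique_urls=1500000      # estimated unique URLs in the workload
bytes_per_entry=250      # approximate metadata cost per cache entry
headroom_pct=150         # 1.5x headroom, expressed as a percent

required_bytes=$(( unique_urls * bytes_per_entry * headroom_pct / 100 ))
required_mib=$(( required_bytes / 1024 / 1024 ))
echo "keys_zone needs ~${required_mib} MiB"   # ~536 MiB for 1.5M URLs
```

Since ~536 MiB exceeds a 512m zone, rounding up to 1g leaves headroom for workload growth.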
When undersized, NGINX evicts metadata for the oldest cached items (no error messages logged). Clients requesting evicted items experience cache MISSes despite content being on disk, triggering re-fetches from S3.
Symptoms and Detection Signals
Visible Symptoms
High cache miss rates despite warm cache - Cache showing MISS for content that was previously cached
Frequent S3 re-fetches - Content being fetched from S3 despite existing on disk
Silent cache metadata eviction - No error messages; NGINX quietly evicts oldest metadata via LRU
Inconsistent cache hit ratios - Hit ratios varying significantly between identical workload runs
Cache Miss Patterns
# SOURCE: NGINX Access Logs
# Look for patterns like:
upstream_cache_status: "MISS"
# Same file URI appearing multiple times with MISS status
# despite file being previously cached
Metric Signals

| Metric | Description |
|---|---|
| `rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="MISS"}[5m])` | Rate of cache MISS responses. High MISS rates (>30%) against a warm cache indicate the cache is ineffective; metadata cache undersizing prevents NGINX from tracking items already on disk. |
| `rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", upstream_addr!="-"}[5m])` | Rate of requests proxied to S3. High rates against a warm cache indicate the cache is not serving content effectively. |
Root Cause Analysis
Known Causes
Metadata cache undersizing is typically caused by using the default 256 MiB keys_zone size for workloads with >1 million cached items, or by applying configuration to the wrong Helm values section.
Default Metadata Cache Size Insufficient
When asset count exceeds metadata tracking capacity (~1M items for 256 MiB), NGINX silently evicts older cache metadata. Clients requesting evicted items experience cache MISSes and re-fetch from S3.
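As a rough check, the tracking capacity of a zone can be estimated from its size, assuming ~250 bytes per entry (an approximation, per the Overview):

```shell
# Estimate how many entries a keys_zone of a given size can track.
# Assumes ~250 bytes per entry, per the sizing guidance above.
zone_mib=256
capacity=$(( zone_mib * 1024 * 1024 / 250 ))
echo "${zone_mib}m tracks ~${capacity} entries"   # ~1.07M entries for 256m
```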
Configuration Applied to Wrong Helm Values Section
A common deployment mistake is setting metadataMemorySize in the wrong section of the Helm values
file. Configuration must be under nginx.proxyCache.paths[] to take effect.
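A minimal sketch of the expected placement, using the keys that appear elsewhere in this runbook (your chart's surrounding structure may differ):

```yaml
# Correct location: under nginx.proxyCache.paths[] (not at the top level)
nginx:
  proxyCache:
    paths:
      - name: s3
        path: /cache/s3
        metadataMemorySize: 512m   # default is 256m; size per the formula in the Overview
```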
Other Possible Causes
Highly Complex URL Structures - Very long URLs consuming more bytes per cache key
Multiple Cache Zones Competing for Memory - Multiple cache paths with separate keys_zones
Workload Access Patterns - Extremely diverse URLs preventing effective LRU
Troubleshooting Steps
Diagnostic Steps
Verify Configuration Was Applied Correctly
Check if metadata cache size was applied and rendered correctly in NGINX config.
# Check Helm values
helm get values <ucc-release-name> -n ucc -o yaml | grep -B 5 -A 5 "metadataMemorySize"

# Expected location:
# nginx:
#   proxyCache:
#     paths:
#       - name: s3
#         metadataMemorySize: 512m  # or larger

# Check rendered NGINX configuration
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"

# Should show: keys_zone=s3:512m (or configured size)
# If it shows 256m, configuration was not applied or was overridden
Analysis:
- A mismatch between Helm values and the rendered config indicates the configuration was not applied.
- A missing metadataMemorySize or the wrong YAML section causes the default 256m to be used.

Resolution:
- If a mismatch is found, correct the Helm values location (must be under nginx.proxyCache.paths).
- Reapply: helm upgrade <release> <chart> -n ucc -f values.yaml
- Verify the rendered config post-upgrade.

Monitor Cache Miss Rates and Decide If Resize Is Needed
Check cache hit ratios to determine if metadata cache is undersized.
# Check cache HIT/MISS distribution
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c

# Calculate hit ratio: HIT / (HIT + MISS) * 100%
# Warm cache target: >80% HIT rate
# If warm cache shows <70% HIT rate, investigate
# Metadata cache undersizing is one possible cause

# Estimate unique URL count accessed
kubectl logs -n ucc <ucc-pod> --tail=50000 | \
  grep "request_uri" | awk -F'"' '{print $8}' | sort -u | wc -l

# Compare against metadata cache capacity
# 256m ~ 1M items; 512m ~ 2M items; 1g ~ 4M items
Analysis:
- A warm cache with a <70% HIT rate may indicate metadata cache issues.
- If the unique URL count exceeds keys_zone capacity, a resize is needed.

Resolution:
- If unique URLs > capacity, increase metadataMemorySize (step 3).
- If the HIT rate is low for other reasons, see the Poor Cache Hit Ratios runbook.

Increase Metadata Cache Size
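The comparison can be sketched as a small decision check. The unique URL count would come from the log analysis above; the value used here is illustrative, and the ~250 bytes/entry figure is the approximation from the Overview.

```shell
# Decision sketch: does the observed unique URL count exceed zone capacity?
unique_urls=2200000
zone_mib=512
capacity=$(( zone_mib * 1024 * 1024 / 250 ))   # ~250 bytes/entry assumption

if [ "$unique_urls" -gt "$capacity" ]; then
  echo "resize needed: ${unique_urls} URLs > ~${capacity} tracked entries"
else
  echo "zone size OK"
fi
```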
Resize keys_zone based on workload asset count.
# Edit Helm values
# Under nginx.proxyCache.paths (for each backend):
#   - name: s3
#     metadataMemorySize: 1g   # Increased from 256m
#     path: /cache/s3

# Apply via Helm upgrade
helm upgrade <ucc-release-name> <chart-path> -n ucc -f values.yaml

# Restart pods to pick up new configuration
kubectl rollout restart statefulset -n ucc <ucc-statefulset>

# Verify new configuration applied
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"
Analysis:
- Recommended sizes: 512m for 1M-2M URLs; 1-2g for >2M URLs.

Resolution:
- Apply via Helm upgrade.
- Monitor cache HIT rates post-resize to validate the improvement (target >80%).
Prevention
Proactive Monitoring
Set up alerts for:
Cache miss rate thresholds: Alert when MISS rate exceeds 30% for warm cache
Cache hit ratio degradation: Alert when HIT ratio drops below 70%
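One way to encode the first alert, sketched as a Prometheus rule fragment. The alert name, threshold, and `for` duration are illustrative; the metric and label names follow the queries in the Metric Signals section.

```yaml
groups:
  - name: ucc-cache
    rules:
      - alert: UCCWarmCacheMissRateHigh
        expr: |
          sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="MISS"}[5m]))
            /
          sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*"}[5m])) > 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "UCC warm-cache MISS rate above 30%"
```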
Configuration Best Practices
Size keys_zone for asset count: Use formula from Overview section
Start with 512 MiB - 1 GiB: For workloads with 1M-4M unique URLs
Verify configuration location: Ensure metadataMemorySize is under nginx.proxyCache.paths[]
Validate post-deployment: Verify rendered NGINX config matches Helm values
Capacity Planning
Estimate unique URL count: Analyze scenes to determine asset count
Calculate metadata requirement: Use sizing formula; add 50% headroom
Test under peak load: Validate keys_zone size with maximum concurrent simulation count