UCC: Data Disk Bandwidth Bottlenecks#

Overview#

UCC stores cached content on persistent volumes backed by Persistent Volume Claims (PVCs). During cache population (cache MISS scenarios), UCC writes fetched content from S3 to disk. When the underlying storage class cannot provide sufficient I/O bandwidth, write operations saturate the disk, causing slow cache population, increased scene load times, and potential write stalls.

Cloud provider storage classes provide different IOPS and bandwidth tiers. Lower tiers (e.g., Standard, gp2) provide limited bandwidth that may be insufficient for high-throughput cache writes. Premium tiers with burst capabilities (e.g., managed-csi-premium-burst in Azure, gp3 in AWS) scale bandwidth with PVC size, making larger PVCs essential for performance.

For example, in Azure, a managed-csi-premium-burst disk provides:

  • 250 GB PVC: ~60 MB/s baseline, 170 MB/s burst

  • 2 TiB PVC: ~200 MB/s baseline, 900 MB/s burst

During cache population for large scenes, UCC may write 500-900 MB/s of derived data. If the storage class cannot sustain this bandwidth, writes slow down, cache population takes longer, and scene load times increase.
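The impact of these tiers on cold-cache behavior can be sketched with simple arithmetic. The following illustrative Python snippet (not part of UCC) estimates a lower bound on cache population time for a scene, using the Azure managed-csi-premium-burst figures quoted above; substitute your provider's specifications.

```python
# Illustrative sketch: estimate cold-cache population time for a scene
# given the sustained write bandwidth of the PVC's storage tier.
# Tier figures are the Azure managed-csi-premium-burst numbers quoted
# in this document; they are examples, not provider guarantees.

def population_time_minutes(scene_gb: float, write_mb_s: float) -> float:
    """Lower-bound time to write `scene_gb` of derived data at `write_mb_s`."""
    return (scene_gb * 1024) / write_mb_s / 60

# A 100 GB scene written at different sustained bandwidths:
for label, mb_s in [("250 GB PVC baseline", 60),
                    ("2 TiB PVC baseline", 200),
                    ("2 TiB PVC burst", 900)]:
    print(f"{label:>22}: {population_time_minutes(100, mb_s):.1f} min")
```

The same 100 GB of derived data that takes under two minutes at 900 MB/s takes nearly half an hour at a 60 MB/s baseline, which is the gap this section describes.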

Data disk bandwidth bottlenecks can occur when:

  • PVC storage class tier is insufficient (Standard instead of Premium)

  • PVC size is too small (bandwidth scales with size in cloud providers)

  • Storage class does not support burst I/O during cache population peaks

  • Multiple UCC pods on the same node compete for disk bandwidth

  • Node-level storage bandwidth is exhausted

When data disk bandwidth is saturated, cache population during MISS scenarios becomes the bottleneck, scene load times increase significantly (2-3x), and write operations may stall or fail.

Symptoms and Detection Signals#

Visible Symptoms#

  • Slow cache population - Cache taking 5-10 minutes to populate content that should take 2-3 minutes

  • High scene load times on cold cache - Initial scene loads taking 2-3x longer than expected

  • Write bandwidth saturation - Disk write throughput consistently at storage class limit

  • Degraded performance on cache MISS - Performance gap between HIT and MISS scenarios >3x
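These symptoms can be combined into a rough triage check. The sketch below is a hypothetical helper (the 3x gap and 80% saturation thresholds come from this document's rules of thumb, not from UCC itself):

```python
# Illustrative triage heuristic for the symptoms above. The 3x HIT/MISS
# gap and 80%-of-limit saturation thresholds are this document's rules
# of thumb, not UCC-defined constants.

def suspect_disk_bottleneck(miss_load_s: float, hit_load_s: float,
                            write_mb_s: float, limit_mb_s: float) -> bool:
    """Flag a likely data-disk bottleneck when the MISS/HIT gap exceeds
    3x while write throughput sits above 80% of the storage class limit."""
    gap = miss_load_s / hit_load_s
    saturation = write_mb_s / limit_mb_s
    return gap > 3 and saturation > 0.8

# Example: 8-minute MISS load vs 90 s HIT load, 850 MB/s writes on a
# 900 MB/s disk -> likely disk-bound.
print(suspect_disk_bottleneck(miss_load_s=480, hit_load_s=90,
                              write_mb_s=850, limit_mb_s=900))
```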

Metric Signals#

  • rate(container_fs_writes_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}[5m]) - Disk write bandwidth per pod. Values approaching storage class limits (e.g., >830 MB/s sustained for managed-csi-premium-burst on a 2 TiB PVC) indicate saturation. Compare against storage class specifications.

  • rate(container_fs_reads_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}[5m]) - Disk read bandwidth per pod. High read bandwidth during warm-cache scenarios may indicate metadata cache issues or eviction; read bandwidth should be minimal for warm cache HITs.
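When interpreting these rate() samples, a single spike is rarely meaningful; what matters is sustained operation near the limit. A minimal sketch of that interpretation, assuming samples in MB/s and the 80% threshold used elsewhere in this document:

```python
# Sketch: classify per-pod write-bandwidth samples (MB/s, e.g. from the
# container_fs_writes_bytes_total rate query) against the storage class
# limit. "Sustained" here means most samples above 80% of the limit --
# both thresholds are this document's rules of thumb, not UCC defaults.

def is_saturated(samples_mb_s, limit_mb_s, threshold=0.8, fraction=0.9):
    """True when >= `fraction` of samples exceed `threshold` * limit."""
    above = sum(1 for s in samples_mb_s if s > threshold * limit_mb_s)
    return above / len(samples_mb_s) >= fraction

# 5-minute rate samples from a pod on a 2 TiB burst disk (900 MB/s limit):
samples = [840, 860, 855, 870, 845, 850, 865, 880, 835, 842]
print(is_saturated(samples, limit_mb_s=900))
```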

Root Cause Analysis#

Known Causes#

Data disk bandwidth bottlenecks are typically caused by using insufficient storage class tiers, undersized PVCs, or lack of burst I/O capabilities.

Insufficient Storage Class Tier#

Cloud provider storage classes provide different performance tiers. Lower tiers (Standard, gp2) provide limited IOPS and bandwidth that may be insufficient for UCC cache writes. Premium tiers with burst support (managed-csi-premium-burst, gp3) provide higher baseline and burst bandwidth.

Check current storage class:

# Check PVC storage class
kubectl get pvc -n ucc
kubectl describe pvc -n ucc <pvc-name> | grep "StorageClass"

# Check storage class specifications
kubectl describe storageclass <storage-class-name>

PVC Size Too Small for Bandwidth Scaling#

In cloud providers, storage bandwidth often scales with PVC size. A 250 GB PVC provides significantly less bandwidth than a 2 TiB PVC, even on the same storage class tier.

For managed-csi-premium-burst in Azure:

  • 250 GB: ~60 MB/s baseline, 170 MB/s burst

  • 1 TiB: ~150 MB/s baseline, 600 MB/s burst

  • 2 TiB: ~200 MB/s baseline, 900 MB/s burst

During high-throughput cache writes (500-900 MB/s), only a 2 TiB or larger PVC can sustain the load.
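Given a known peak write rate, the tier table above can be turned into a sizing lookup. This is an illustrative sketch hard-coding the managed-csi-premium-burst figures quoted in this section; other storage classes need their own table.

```python
# Sketch: pick the smallest PVC size whose burst bandwidth covers the
# expected peak write rate. The tier table uses the Azure
# managed-csi-premium-burst figures quoted above; adjust per provider.

TIERS = [  # (size label, size GiB, baseline MB/s, burst MB/s)
    ("250 GB", 250, 60, 170),
    ("1 TiB", 1024, 150, 600),
    ("2 TiB", 2048, 200, 900),
]

def smallest_pvc_for(peak_write_mb_s: float):
    """Return the smallest listed tier whose burst covers the peak rate."""
    for label, _, _, burst in TIERS:
        if burst >= peak_write_mb_s:
            return label
    return None  # no listed tier sustains this rate

print(smallest_pvc_for(500))  # -> 1 TiB
print(smallest_pvc_for(900))  # -> 2 TiB
```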

Other Possible Causes#

  1. No Burst I/O Support - Storage class lacks burst capabilities; baseline insufficient for peaks

  2. Multiple Pods Competing for Node Disk Bandwidth - Pods on same node competing for shared bandwidth

  3. Node Storage Performance Issues - Node hardware or storage backend degradation

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Monitor Disk Bandwidth Usage and Identify Saturation

    Check disk I/O bandwidth to determine if storage class limits are being reached.

    # Check disk write bandwidth from Prometheus
    # Query: rate(container_fs_writes_bytes_total{pod=~"usd-content-cache-.*"}[5m])
    # Alert threshold: >80% of storage class limit
    
    # Check disk I/O from within pod
    kubectl exec -n ucc <ucc-pod> -- iostat -x 5 3
    
    # For Azure, check disk metrics via CLI
    # az monitor metrics list \
    #   --resource <disk-resource-id> \
    #   --metric "Composite Disk Write Bytes/sec" \
    #   --interval PT1M --aggregation Average
    
    # For AWS, check EBS CloudWatch metrics
    # VolumeWriteBytes, VolumeWriteOps
    
    Analysis:
    - Disk writes approaching storage class limits indicate saturation.
    - Sustained writes >700-800 MB/s on a 2 TiB managed-csi-premium-burst PVC indicate operation near the 900 MB/s burst limit.
    Resolution:
    - If saturation detected, upgrade storage class tier or increase PVC size (steps 2-3).
  2. Upgrade to Burst-Capable Storage Class

    If using Standard or non-burst storage classes, upgrade to Premium with burst capabilities.

    # Check current storage class
    kubectl get pvc -n ucc <pvc-name> -o yaml | grep "storageClassName"
    
    # For Azure: upgrade to managed-csi-premium-burst
    # For AWS: upgrade to gp3 (from gp2)
    
    # Note: Cannot change storage class on existing PVC
    # Must redeploy with new storage class
    
    helm get values <ucc-release-name> -n ucc -o yaml > values.yaml
    # Edit: cluster.container.storage.volume.storageClassName
    # Azure: "managed-csi-premium-burst"
    # AWS: "gp3"
    
    helm uninstall <ucc-release-name> -n ucc
    kubectl delete pvc -n ucc <old-pvc-names>
    helm install <ucc-release-name> <chart-path> -n ucc -f values.yaml
    
    Resolution:
    - Upgrade to Premium/burst storage class.
    - Plan migration during maintenance window (requires UCC downtime).
  3. Increase PVC Size to Scale Bandwidth

    If using a burst-capable storage class but PVC is too small, increase PVC size.

    # Check current PVC size
    kubectl get pvc -n ucc
    
    # Check if storage class supports volume expansion
    kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
    
    # If expansion supported, edit PVC
    kubectl edit pvc <pvc-name> -n ucc
    # Change spec.resources.requests.storage to larger value
    # Recommended: 2 TiB minimum
    
    # For storage classes without expansion: redeploy with larger size
    
    Resolution:
    - Increase PVC to 2 TiB minimum for high-throughput cache writes.
    - Monitor disk bandwidth post-resize to verify improved throughput.
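The decision between steps 2 and 3 can be summarized in one hypothetical helper. The class names and tier figures below are the examples used in this document (the burst table is the managed-csi-premium-burst one, applied here as a simplification for any burst class):

```python
# Sketch tying the steps together: given the current storage class and
# PVC size, suggest which remediation applies. Class names and tier
# figures are this document's examples; the burst lookup reuses the
# managed-csi-premium-burst table as a simplification for all classes.

BURST_CLASSES = {"managed-csi-premium-burst", "gp3"}

def remediation(storage_class: str, pvc_gib: int, peak_write_mb_s: float) -> str:
    if storage_class not in BURST_CLASSES:
        return "upgrade storage class (step 2)"
    # burst bandwidth by size, per the tiers quoted in this document
    burst = 170 if pvc_gib <= 250 else 600 if pvc_gib <= 1024 else 900
    if burst < peak_write_mb_s:
        return "increase PVC size (step 3)"
    return "storage provisioning looks sufficient; check node-level contention"

print(remediation("managed-csi", 2048, 700))
print(remediation("gp3", 250, 700))
```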

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Disk write bandwidth thresholds: Alert when write bandwidth exceeds 80% of storage class limit

  • Disk write latency degradation: Alert when P99 write latency exceeds 100ms

  • Cache population time: Alert when cache population takes >2x expected time for known workloads

  • Node disk pressure: Alert when node-level disk bandwidth approaches limits

Configuration Best Practices#

  • Test burst quota under sustained load: Burst capabilities have quotas that can be exhausted during prolonged high-throughput writes; validate storage class can sustain workload duration
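Why burst quotas matter can be illustrated with a simple credit-bucket model. Real provider accounting is more involved, and the credit, baseline, and burst numbers below are assumptions chosen for illustration only:

```python
# Illustrative credit-bucket model of burst exhaustion: burstable disks
# spend credit while writing above baseline. All numbers here (30 GB of
# credit, 200 MB/s baseline, 900 MB/s writes) are assumptions for
# illustration; real provider credit accounting differs.

def minutes_until_burst_exhausted(credit_gb: float, write_mb_s: float,
                                  baseline_mb_s: float) -> float:
    """Time until burst credit runs out at a sustained write rate."""
    overdraw = write_mb_s - baseline_mb_s  # MB/s drawn from the credit bucket
    if overdraw <= 0:
        return float("inf")  # at or below baseline: credit never drains
    return credit_gb * 1024 / overdraw / 60

# Writing at 900 MB/s against a 200 MB/s baseline with 30 GB of credit:
print(f"{minutes_until_burst_exhausted(30, 900, 200):.1f} min")
```

Even generous-sounding credit drains in well under a minute at full overdraw, which is why sustained-load testing (not just a short burst benchmark) is recommended above.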