DDCS: Cache Misses and Performance Degradation#

Overview#

DDCS caches derived data that is computationally expensive to generate. When cache misses occur frequently, render time is spent regenerating derived content before the intended workload can begin, which significantly increases scene load times and reduces overall system performance.

Cache misses and performance degradation can occur when:

  • Cache is cold (scene or assets not preloaded)

  • Cache eviction due to memory or disk pressure

  • Misconfigured cache size parameters

  • Insufficient cache capacity for workload patterns

  • Write buffer pressure causing cache eviction

Symptoms and Detection Signals#

Visible Symptoms#

  • User-facing slowdowns - Scene loads taking significantly longer than expected

  • Inconsistent performance - Performance varies dramatically between “warm” and “cold” scene loads

Log Messages#

Write Buffer Pressure#

Pod: ddcs-*
Location: DDCS Pod
Application: DDCS
Description: Logs indicating writes are behind or are failing.
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*timeout while trying to write to DB*" or
         "*write failed because the database is busy*" or
         "*failed to write*" or
         "*the engine is busy and cannot currently process the request*"
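The message patterns above can be matched against log text directly. The sketch below runs them through grep against a few fabricated sample lines (the log format, timestamps, and values are illustrative only):

```shell
# Fabricated sample log lines demonstrating the write-pressure patterns.
cat > /tmp/ddcs_sample.log <<'EOF'
2024-01-01T00:00:00Z ERROR DDCS/RocksDB timeout while trying to write to DB
2024-01-01T00:00:01Z INFO  DDCS request served
2024-01-01T00:00:02Z ERROR DDCS/RocksDB write failed because the database is busy
EOF

# Count lines matching any of the write-pressure messages.
grep -cE "timeout while trying to write|write failed because the database is busy|failed to write|the engine is busy" /tmp/ddcs_sample.log
# -> 2
```

Against a live cluster, the same pattern can be applied to pod logs (for example via `kubectl logs` on the `ddcs-*` pods) to confirm write-buffer pressure.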

Metric Signals#

The following Prometheus metrics can help pinpoint the cause of the problem. They relate to the write performance of the storage medium, and their counts may climb even during normal operation. When the engine writes, or wants to write, data faster than the medium allows, scene load times will suffer.

Each metric selector below is followed by its description:

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_total_(stops|delays)"
}

Total count of write stalls (stops or delays) in RocksDB. High values indicate RocksDB is throttling writes due to resource pressure, causing cache eviction and performance degradation.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_(stops|delays)_l0_file_count_limit"
}

Count of stalls caused by exceeding the L0 file count limit. High values indicate compaction cannot keep up with write volume, causing RocksDB to stall writes and leading to cache eviction.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"
}

Count of stalls caused by pending compaction bytes exceeding limits. High values indicate compaction backlog is building faster than it can be processed, suggesting disk I/O bottlenecks and causing cache eviction.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_(stops|delays)_memtable_limit"
}

Count of stalls caused by memtable (write buffer) limits being reached. High values indicate write buffers are full and cannot be flushed fast enough, causing write stalls and cache eviction.

ddcs_m_adapter_error_total {
    kind =~
        "rocks_timed_out|rocks_busy|rocks_try_again"
}

Count of errors reported by the database engine. These counts may climb slowly during normal operation; however, a sharp increase across all three kinds at once indicates that the engine cannot keep up.

ddcs_m_adapter_entry_cache_total {
    operation = "miss"
}

Count of cache misses. High values relative to cache hits indicate cache effectiveness issues, suggesting cache is cold, undersized, or experiencing eviction pressure.

ddcs_m_adapter_entry_cache_total {
    operation = "hit"
}

Count of cache hits. Compare against cache misses to calculate hit ratio. Low hit ratios (below 70-80%) indicate cache effectiveness issues.

ddcs_m_adapter_get_total {
    level =~ "in_memory|rocksdb_in_memory"
}

Count of successful gets from in-memory cache. This is the ideal location requiring no disk I/O. Low values relative to disk seeks indicate cache eviction or cold cache conditions.

ddcs_m_adapter_get_total {
    level = "rocksdb_disk_seek"
}

Count of gets requiring disk seeks. High values relative to in-memory gets indicate cache misses requiring disk I/O, causing performance degradation.

ddcs_m_adapter_io_histogram_bucket {
    io_kind = "rocksdb_read"
}

Histogram capturing time spent on read I/O operations. High values indicate slow disk reads, suggesting disk I/O bottlenecks or cache misses requiring disk access.

ddcs_m_adapter_io_histogram_bucket {
    io_kind = "rocksdb_write"
}

Histogram capturing time spent on write I/O operations. High values indicate slow disk writes, suggesting write buffer pressure or disk I/O bottlenecks causing performance degradation.

ddcs_m_adapter_bytes_returned_total { }

Total bytes returned from all levels (cache or disk). High values relative to in-memory cache hits indicate increased disk I/O, suggesting cache eviction or cold cache conditions.
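As a quick sanity check, the hit ratio discussed above can be computed from the two entry-cache counters. The counter values below are illustrative, not real output:

```shell
# Illustrative values scraped from the hit/miss entry-cache counters.
hits=8000
misses=2000

# hit ratio = hits / (hits + misses); below ~0.7 suggests cache effectiveness issues.
ratio=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.2f", h / (h + m) }')
echo "cache hit ratio: $ratio"
# -> cache hit ratio: 0.80
```

A ratio of 0.80 sits at the healthy end of the 70-80% guidance; a value that stays below 0.70 across warm loads points at sizing or eviction problems.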

Root Cause Analysis#

Possible Causes#

Cold Cache (Scene or Assets Not Preloaded)#

A cold cache occurs when scenes or assets have not been preloaded into DDCS. During initial scene loads, GPUs encounter assets for the first time and must generate derived content synchronously, which is then written to DDCS. This synchronous generation and writing process significantly increases scene load times compared to warm cache scenarios where derived data is already available.

Cold cache conditions are expected during initial scene loads or when new assets are introduced. However, if cache miss ratios remain consistently high across multiple scene loads or after cache warm-up procedures, it may indicate that the cache is not sized correctly for the workload or that cache eviction is occurring too frequently.

Cache Eviction Due to Memory/Disk Pressure#

Cache eviction occurs when memory or disk pressure forces DDCS to remove cached data to make room for new writes. Write buffers evict cached key-value pairs when full, and RocksDB may evict data from caches when memory pressure occurs.

Cache size parameters (sys.cache_size for the row cache, sys.block_cache_size for the block cache) may be misconfigured for the workload. Caches that are too small cause frequent eviction and cache misses, while caches that are too large can cause memory pressure and OOM kills.

Storage Medium Has Insufficient Performance#

Insufficient storage performance occurs when the underlying persistent volume cannot keep up with DDCS write and read operations. When storage IOPS or throughput is insufficient, RocksDB cannot flush write buffers or perform compaction fast enough, causing write stalls and cache eviction. This manifests as high stall metrics, slow I/O histograms, and increased disk seek operations.

Storage performance requirements depend on your installation environment and workload. The storage class and persistent volume configuration must provide sufficient IOPS and throughput for the attached volumes. Consult your cloud service provider (CSP) documentation for storage class performance characteristics and scaling options.

Other Possible Causes#

  1. Insufficient Cache Capacity

    • Total cache size insufficient for workload data set

    • Cache not sized for peak scene sizes

    • Multiple concurrent scenes exceeding cache capacity

  2. Write Buffer Configuration Issues

    • Too few write buffers (cf.max_write_buffer_number) causing premature eviction

    • Write buffer size too small for write bursts

    • Write buffer pressure causing cache eviction

  3. Garbage Collection Aggressiveness

    • Garbage collection removing cached data too aggressively

    • garbageCollection.deleteKeyspaceQuantile set too high

    • garbageCollection.minFreeCapacity threshold too high

  4. Workload Pattern Changes

    • New scenes or assets not fitting existing cache patterns

    • Increased scene complexity requiring more cache space

    • Concurrent workload increases exceeding cache capacity

Troubleshooting Steps#

Diagnostic Steps#

  1. Check Cache Hit/Miss Ratios

    Monitor cache hit and miss metrics to determine if cache is cold or experiencing eviction.

    # Query metrics:
    # - ddcs_m_adapter_entry_cache_total{operation="hit"}
    # - ddcs_m_adapter_entry_cache_total{operation="miss"}
    
    Analysis:
    - High miss rates relative to hits indicate cold cache or cache eviction.
    - Hit ratios below 70-80% suggest cache effectiveness issues.
    - Compare hit rates between initial scene loads (cold) and subsequent loads (warm).
    - Consistently high miss ratios after warm-up indicate cache sizing or eviction problems.
    Resolution:
    - If cache is cold, implement warm-up procedures for common scenes.
    - If miss ratios remain high after warm-up, investigate cache sizing (see Cache Eviction section).
    - Monitor ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory"} vs ddcs_m_adapter_get_total{level="rocksdb_disk_seek"} to quantify cache effectiveness.
  2. Identify Cache Eviction Patterns

    Monitor cache effectiveness metrics to identify eviction patterns.

    # Query metrics:
    # - ddcs_m_adapter_entry_cache_total{operation="miss"} vs {operation="hit"}
    # - ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory"} vs {level="rocksdb_disk_seek"}
    
    Analysis:
    - High miss rates relative to hits indicate cache eviction.
    - High rocksdb_disk_seek relative to in_memory|rocksdb_in_memory indicates eviction forcing disk reads.
    - Increasing miss rates over time suggest cache capacity issues.
    Resolution:
    - If eviction is occurring, check cache size configuration (sys.cache_size, sys.block_cache_size; see Cache Eviction Due to Memory/Disk Pressure above).
    - Monitor ddcs_m_adapter_bytes_returned_total to track data volume requiring disk access.
  3. Monitor RocksDB Stall Metrics

    Check RocksDB stall metrics to identify storage performance bottlenecks.

    # Access DDCS metrics endpoint
    kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
    curl http://localhost:3051/metrics | grep "__rocksdb_stalls"
    
    # Query stall metrics:
    # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}
    # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_l0_file_count_limit"}
    # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"}
    # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_memtable_limit"}
    
    Analysis:
    - High stall counts indicate RocksDB throttling writes due to storage performance limits.
    - L0 file count limit stalls suggest compaction cannot keep up with write volume.
    - Pending compaction bytes stalls indicate compaction backlog due to slow disk I/O.
    - Memtable limit stalls indicate write buffers cannot flush fast enough.
    Resolution:
    - If stalls are high, investigate storage class IOPS and throughput capabilities.
    - Review ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"} for slow write operations.
    - Consult CSP documentation to upgrade storage class or increase volume performance.
  4. Check I/O Performance Metrics

    Monitor I/O histogram metrics to quantify storage performance issues.

    # Query I/O metrics:
    # - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_read"}
    # - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}
    
    Analysis:
    - High read I/O latency indicates slow disk reads, suggesting cache misses requiring disk access.
    - High write I/O latency indicates slow disk writes, suggesting write buffer pressure.
    - Compare I/O latencies against storage class performance specifications.
    Resolution:
    - If I/O latencies are high, upgrade storage class or increase volume IOPS/throughput.
    - Monitor ddcs_m_adapter_error_total{kind=~"rocks_timed_out|rocks_busy|rocks_try_again"} for storage-related errors.
    - Review storage class configuration and consider higher performance tiers.
  5. Review Storage Class and Volume Configuration

    Verify storage class provides sufficient IOPS and throughput for the workload.

    # Check PVC storage class
    kubectl get pvc -n ddcs
    kubectl describe pvc -n ddcs <pvc-name>
    
    # Check storage class configuration
    kubectl get storageclass
    kubectl describe storageclass <storage-class-name>
    
    Analysis:
    - Storage class performance characteristics determine available IOPS and throughput.
    - Insufficient storage performance causes RocksDB stalls and cache eviction.
    - Compare storage class specs against workload requirements.
    Resolution:
    - Upgrade to storage class with higher IOPS/throughput if performance is insufficient.
    - Consider provisioned IOPS volumes for consistent performance.
    - Monitor stall metrics after storage changes to validate improvements.
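Once the metrics endpoint output has been captured (for example via the curl command in step 3), the stall counters can be summed with a one-liner. The exposition text and values below are fabricated for illustration:

```shell
# Fabricated Prometheus exposition text (values are illustrative, not real output).
cat > /tmp/stalls.txt <<'EOF'
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_total_stops"} 12
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_total_delays"} 340
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_delays_memtable_limit"} 290
EOF

# Sum all stall counters; a total that rises between scrapes means RocksDB
# is actively throttling writes.
awk '/__rocksdb_stalls/ { sum += $2 } END { print "total stalls:", sum }' /tmp/stalls.txt
# -> total stalls: 642
```

Comparing this sum across two scrapes a few minutes apart shows whether stalls are ongoing or historical.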

Other Diagnostic Actions#

  • Monitor write buffer utilization: Check __rocksdb_stalls_(stops|delays)_memtable_limit metrics for write buffer pressure

  • Review garbage collection settings: Check garbageCollection.deleteKeyspaceQuantile and garbageCollection.minFreeCapacity configuration

  • Analyze workload patterns: Review ddcs_m_adapter_entry_cache_total and ddcs_m_adapter_get_total metrics over time to identify capacity issues

  • Compare warm vs cold performance: Monitor latency metrics during warm and cold loads to quantify cache impact

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Cache hit ratio thresholds: Alert when cache hit ratios drop below 70% for extended periods

  • Cache miss rate increases: Alert on significant increases in cache miss rates

  • Write buffer pressure: Alert when write buffer utilization approaches limits

  • Memory pressure: Alert when pod memory usage approaches limits to prevent eviction
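A hit-ratio alert of the first kind could be sketched as the following Prometheus expression, assuming both operations are scraped as counters on ddcs_m_adapter_entry_cache_total (the 5m window and 0.7 threshold follow this runbook's guidance and should be tuned to your environment):

```promql
# Fires when the 5m hit ratio drops below 0.7.
(
  sum(rate(ddcs_m_adapter_entry_cache_total{operation="hit"}[5m]))
/
  sum(rate(ddcs_m_adapter_entry_cache_total{operation=~"hit|miss"}[5m]))
) < 0.7
```

Pairing this expression with a `for:` clause in the alerting rule avoids paging on brief cold-load dips.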

Capacity Planning#

  • Estimate cache requirements: Calculate cache sizes based on average scene sizes and access patterns

  • Plan for cache growth: Account for cache growth as workloads scale

  • Monitor cache trends: Track cache usage trends to predict when capacity increases are needed

  • Test cache effectiveness: Validate cache configurations under expected production load
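As a starting point for the estimate above, one rough sizing formula is peak concurrent scenes times average derived-data footprint, plus headroom. Every number below is an illustrative assumption to be replaced with measurements from your own workload:

```shell
# Back-of-envelope cache sizing; all inputs are illustrative assumptions.
avg_scene_gib=40        # average derived-data footprint per scene
concurrent_scenes=4     # peak number of concurrently loaded scenes
headroom=1.5            # safety factor for growth and write-buffer churn

awk -v s="$avg_scene_gib" -v c="$concurrent_scenes" -v h="$headroom" \
    'BEGIN { printf "suggested cache capacity: %.0f GiB\n", s * c * h }'
# -> suggested cache capacity: 240 GiB
```

Validate any such estimate against observed cache hit ratios under production load before committing capacity.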