DDCS: Cache Misses and Performance Degradation#
Overview#
DDCS caches derived data that is computationally expensive to generate. When cache misses occur frequently, render time is spent regenerating derived content before the desired workload can run, significantly increasing scene load times and reducing overall system performance.
Cache misses and performance degradation can occur when:
- Cache is cold (scene or assets not preloaded)
- Cache eviction due to memory or disk pressure
- Misconfigured cache size parameters
- Insufficient cache capacity for workload patterns
- Write buffer pressure causing cache eviction
Symptoms and Detection Signals#
Visible Symptoms#
- **User-facing slowdowns**: scene loads take significantly longer than expected
- **Inconsistent performance**: performance varies dramatically between "warm" and "cold" scene loads
Log Messages#
Write Buffer Pressure#
```
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*timeout while trying to write to DB*" or
         "*write failed because the database is busy*" or
         "*failed to write*" or
         "*the engine is busy and cannot currently process the request*"
```
Metric Signals#
The following Prometheus metrics can be used to further determine the cause of the problem. During normal operation these counts may climb slowly; sharp or sustained increases are the signal to watch. The stall and I/O metrics relate to the write performance of the storage medium: when the engine is writing, or wants to write, data faster than the medium allows, scene load times will suffer.
| Metric | Description |
|---|---|
| `ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops\|delays)"}` | Total count of write stalls (stops or delays) in RocksDB. High values indicate RocksDB is throttling writes due to resource pressure, causing cache eviction and performance degradation. |
| `ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops\|delays)_l0_file_count_limit"}` | Count of stalls caused by exceeding the L0 file count limit. High values indicate compaction cannot keep up with write volume, causing RocksDB to stall writes and leading to cache eviction. |
| `ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops\|delays)_pending_compaction_bytes"}` | Count of stalls caused by pending compaction bytes exceeding limits. High values indicate the compaction backlog is building faster than it can be processed, suggesting disk I/O bottlenecks and causing cache eviction. |
| `ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops\|stalls)_memtable_limit"}` | Count of stalls caused by memtable (write buffer) limits being reached. High values indicate write buffers are full and cannot be flushed fast enough, causing write stalls and cache eviction. |
| `ddcs_m_adapter_error_total{kind=~"rocks_timed_out\|rocks_busy\|rocks_try_again"}` | Count of errors reported by the database engine. During normal operation these counts will climb slowly. A sharp increase across these kinds in concert may indicate that the engine cannot keep up. |
| `ddcs_m_adapter_entry_cache_total{operation="miss"}` | Count of cache misses. High values relative to cache hits indicate cache effectiveness issues, suggesting the cache is cold, undersized, or experiencing eviction pressure. |
| `ddcs_m_adapter_entry_cache_total{operation="hit"}` | Count of cache hits. Compare against cache misses to calculate the hit ratio. Low hit ratios (below 70-80%) indicate cache effectiveness issues. |
| `ddcs_m_adapter_get_total{level=~"in_memory\|rocksdb_in_memory"}` | Count of successful gets from the in-memory cache. This is the ideal path, requiring no disk I/O. Low values relative to disk seeks indicate cache eviction or cold cache conditions. |
| `ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}` | Count of gets requiring disk seeks. High values relative to in-memory gets indicate cache misses requiring disk I/O, causing performance degradation. |
| `ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_read"}` | Histogram capturing time spent on read I/O operations. High values indicate slow disk reads, suggesting disk I/O bottlenecks or cache misses requiring disk access. |
| `ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}` | Histogram capturing time spent on write I/O operations. High values indicate slow disk writes, suggesting write buffer pressure or disk I/O bottlenecks causing performance degradation. |
| `ddcs_m_adapter_bytes_returned_total{}` | Total bytes returned from all levels (cache or disk). High values relative to in-memory cache hits indicate increased disk I/O, suggesting cache eviction or cold cache conditions. |
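As a quick health check, the hit ratio and the in-memory fraction can be computed from the counters above. A sketch in PromQL, assuming the counters are exported as monotonically increasing totals (the `15m` window is illustrative):

```promql
# Cache hit ratio over the last 15 minutes; sustained values below ~0.7
# suggest a cold, undersized, or eviction-pressured cache.
  sum(rate(ddcs_m_adapter_entry_cache_total{operation="hit"}[15m]))
/
  sum(rate(ddcs_m_adapter_entry_cache_total{operation=~"hit|miss"}[15m]))

# Fraction of gets served without disk I/O (higher is better).
  sum(rate(ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory"}[15m]))
/
  sum(rate(ddcs_m_adapter_get_total[15m]))
```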
Root Cause Analysis#
Possible Causes#
Cold Cache (Scene or Assets Not Preloaded)#
A cold cache occurs when scenes or assets have not been preloaded into DDCS. During initial scene loads, GPUs encounter assets for the first time and must generate derived content synchronously, which is then written to DDCS. This synchronous generation and writing process significantly increases scene load times compared to warm cache scenarios where derived data is already available.
Cold cache conditions are expected during initial scene loads or when new assets are introduced. However, if cache miss ratios remain consistently high across multiple scene loads or after cache warm-up procedures, it may indicate that the cache is not sized correctly for the workload or that cache eviction is occurring too frequently.
Cache Eviction Due to Memory/Disk Pressure#
Cache eviction occurs when memory or disk pressure forces DDCS to remove cached data to make room for new writes. Write buffers evict cached key-value pairs when full, and RocksDB may evict data from caches when memory pressure occurs.
Cache size parameters (`sys.cache_size` for the row cache, `sys.block_cache_size` for the block cache) may be misconfigured for the workload. Caches that are too small result in frequent eviction and cache misses, while caches that are too large may cause memory pressure and OOM kills.
Storage Medium has Insufficient Performance#
Insufficient storage performance occurs when the underlying persistent volume cannot keep up with DDCS write and read operations. When storage IOPS or throughput is insufficient, RocksDB cannot flush write buffers or perform compaction fast enough, causing write stalls and cache eviction. This manifests as high stall metrics, slow I/O histograms, and increased disk seek operations.
Storage performance requirements depend on your installation environment and workload. The storage class and persistent volume configuration must provide sufficient IOPS and throughput for the attached volumes. Consult your cloud service provider (CSP) documentation for storage class performance characteristics and scaling options.
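To quantify "slow I/O", the read and write histograms can be summarized as latency percentiles. A sketch in PromQL, assuming the histograms follow Prometheus conventions (cumulative buckets with a `le` label):

```promql
# 99th-percentile RocksDB write latency over the last 15 minutes; compare
# against the storage class's advertised latency and throughput figures.
histogram_quantile(
  0.99,
  sum by (le) (rate(ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}[15m]))
)
```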
Other Possible Causes#
**Insufficient Cache Capacity**

- Total cache size insufficient for the workload data set
- Cache not sized for peak scene sizes
- Multiple concurrent scenes exceeding cache capacity

**Write Buffer Configuration Issues**

- Too few write buffers (`cf.max_write_buffer_number`) causing premature eviction
- Write buffer size too small for write bursts
- Write buffer pressure causing cache eviction

**Garbage Collection Aggressiveness**

- Garbage collection removing cached data too aggressively
- `garbageCollection.deleteKeyspaceQuantile` set too high
- `garbageCollection.minFreeCapacity` threshold too high

**Workload Pattern Changes**

- New scenes or assets not fitting existing cache patterns
- Increased scene complexity requiring more cache space
- Concurrent workload increases exceeding cache capacity
Troubleshooting Steps#
Diagnostic Steps#
1. **Check Cache Hit/Miss Ratios**

   Monitor cache hit and miss metrics to determine whether the cache is cold or experiencing eviction.

   ```
   # Query metrics:
   # - ddcs_m_adapter_entry_cache_total{operation="hit"}
   # - ddcs_m_adapter_entry_cache_total{operation="miss"}
   ```

   Analysis:
   - High miss rates relative to hits indicate a cold cache or cache eviction.
   - Hit ratios below 70-80% suggest cache effectiveness issues.
   - Compare hit rates between initial scene loads (cold) and subsequent loads (warm).
   - Consistently high miss ratios after warm-up indicate cache sizing or eviction problems.

   Resolution:
   - If the cache is cold, implement warm-up procedures for common scenes.
   - If miss ratios remain high after warm-up, investigate cache sizing (see the Cache Eviction section).
   - Monitor `ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory"}` vs `ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}` to quantify cache effectiveness.

2. **Check Cache Hit/Miss and Disk Seek Metrics**
   Monitor cache effectiveness metrics to identify eviction patterns.

   ```
   # Query metrics:
   # - ddcs_m_adapter_entry_cache_total{operation="miss"} vs {operation="hit"}
   # - ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory"} vs {level="rocksdb_disk_seek"}
   ```

   Analysis:
   - High miss rates relative to hits indicate cache eviction.
   - High `rocksdb_disk_seek` counts relative to `in_memory|rocksdb_in_memory` indicate eviction forcing disk reads.
   - Increasing miss rates over time suggest cache capacity issues.

   Resolution:
   - If eviction is occurring, check the cache size configuration (see the Cache Eviction section above).
   - Monitor `ddcs_m_adapter_bytes_returned_total` to track the data volume requiring disk access.

3. **Monitor RocksDB Stall Metrics**
   Check RocksDB stall metrics to identify storage performance bottlenecks.

   ```
   # Access the DDCS metrics endpoint
   kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
   curl http://localhost:3051/metrics | grep "__rocksdb_stalls"

   # Query stall metrics:
   # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}
   # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_l0_file_count_limit"}
   # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"}
   # - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|stalls)_memtable_limit"}
   ```

   Analysis:
   - High stall counts indicate RocksDB is throttling writes due to storage performance limits.
   - L0 file count limit stalls suggest compaction cannot keep up with write volume.
   - Pending compaction bytes stalls indicate a compaction backlog due to slow disk I/O.
   - Memtable limit stalls indicate write buffers cannot be flushed fast enough.

   Resolution:
   - If stalls are high, investigate the storage class IOPS and throughput capabilities.
   - Review `ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}` for slow write operations.
   - Consult CSP documentation to upgrade the storage class or increase volume performance.

4. **Check I/O Performance Metrics**
   Monitor I/O histogram metrics to quantify storage performance issues.

   ```
   # Query I/O metrics:
   # - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_read"}
   # - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}
   ```

   Analysis:
   - High read I/O latency indicates slow disk reads, suggesting cache misses requiring disk access.
   - High write I/O latency indicates slow disk writes, suggesting write buffer pressure.
   - Compare I/O latencies against the storage class performance specifications.

   Resolution:
   - If I/O latencies are high, upgrade the storage class or increase volume IOPS/throughput.
   - Monitor `ddcs_m_adapter_error_total{kind=~"rocks_timed_out|rocks_busy|rocks_try_again"}` for storage-related errors.
   - Review the storage class configuration and consider higher performance tiers.
5. **Review Storage Class and Volume Configuration**

   Verify the storage class provides sufficient IOPS and throughput for the workload.

   ```
   # Check the PVC storage class
   kubectl get pvc -n ddcs
   kubectl describe pvc -n ddcs <pvc-name>

   # Check the storage class configuration
   kubectl get storageclass
   kubectl describe storageclass <storage-class-name>
   ```

   Analysis:
   - Storage class performance characteristics determine the available IOPS and throughput.
   - Insufficient storage performance causes RocksDB stalls and cache eviction.
   - Compare storage class specs against workload requirements.

   Resolution:
   - Upgrade to a storage class with higher IOPS/throughput if performance is insufficient.
   - Consider provisioned-IOPS volumes for consistent performance.
   - Monitor stall metrics after storage changes to validate improvements.
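The stall checks above can be condensed into one query that breaks recent stall growth out by cause. A sketch in PromQL, assuming the intrinsic gauge exposes cumulative stall counts (hence `delta` rather than `rate`):

```promql
# Growth in RocksDB stall counts over the last 15 minutes, by stall reason.
# Any series that grows steadily points to the corresponding bottleneck.
sum by (name) (
  delta(ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls.*"}[15m])
)
```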
Other Diagnostic Actions#
- **Monitor write buffer utilization**: check the `__rocksdb_stalls_(stops|stalls)_memtable_limit` metric for write buffer pressure
- **Review garbage collection settings**: check the `garbageCollection.deleteKeyspaceQuantile` and `garbageCollection.minFreeCapacity` configuration
- **Analyze workload patterns**: review `ddcs_m_adapter_entry_cache_total` and `ddcs_m_adapter_get_total` metrics over time to identify capacity issues
- **Compare warm vs cold performance**: monitor latency metrics during warm and cold loads to quantify cache impact
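For the warm-vs-cold comparison above, trending the raw miss rate and overlaying scene-load times makes the pattern visible. A minimal sketch in PromQL:

```promql
# Cache misses per second. Spikes on the first load of a scene are expected
# (cold cache); spikes that recur on repeat loads suggest sizing or eviction.
sum(rate(ddcs_m_adapter_entry_cache_total{operation="miss"}[5m]))
```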
Prevention#
Proactive Monitoring#
Set up alerts for:
- **Cache hit ratio thresholds**: alert when cache hit ratios drop below 70% for extended periods
- **Cache miss rate increases**: alert on significant increases in cache miss rates
- **Write buffer pressure**: alert when write buffer utilization approaches its limits
- **Memory pressure**: alert when pod memory usage approaches its limits, to prevent eviction
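The hit ratio alert above can be written as a single PromQL expression, paired with a `for:` duration in the alerting rule. A sketch (the 0.70 threshold and windows are illustrative):

```promql
# Fires when the cache hit ratio stays below 70%; use e.g. `for: 30m`
# in the alerting rule to require a sustained breach.
  sum(rate(ddcs_m_adapter_entry_cache_total{operation="hit"}[15m]))
/
  sum(rate(ddcs_m_adapter_entry_cache_total{operation=~"hit|miss"}[15m]))
< 0.70
```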
Capacity Planning#
- **Estimate cache requirements**: calculate cache sizes based on average scene sizes and access patterns
- **Plan for cache growth**: account for cache growth as workloads scale
- **Monitor cache trends**: track cache usage trends to predict when capacity increases are needed
- **Test cache effectiveness**: validate cache configurations under expected production load