DDCS: Disk Space Exhaustion#

Overview#

DDCS stores derived data on persistent volumes backed by Persistent Volume Claims (PVCs). When the volume attached to DDCS is too small or fills too quickly, pods may crash, enter a crash loop, or report errors writing to disk. Because DDCS depends on this persistent storage to maintain derived data, disk space exhaustion prevents the service from operating correctly.

The storage limit is controlled by the storageLimit configuration value, which should be set 7-10% below the allocated PVC size to leave headroom for compaction overhead and WAL (Write-Ahead Log) files. When disk space is exhausted, RocksDB cannot write new data, flush write buffers, or perform compaction, causing write stalls, cache eviction, and service failures.
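
The 7-10% rule can be checked mechanically. The sketch below uses hypothetical sizes in GiB; PVC_GIB and STORAGE_LIMIT_GIB stand in for values read from the cluster and the Helm values:

```shell
# Sanity-check the 7-10% rule with hypothetical sizes (GiB)
PVC_GIB=100
STORAGE_LIMIT_GIB=98   # in practice, read from the Helm values

# storageLimit should fall 7-10% below the PVC size
LIMIT_MIN=$(( PVC_GIB * 90 / 100 ))   # 10% headroom
LIMIT_MAX=$(( PVC_GIB * 93 / 100 ))   # 7% headroom

if [ "$STORAGE_LIMIT_GIB" -gt "$LIMIT_MAX" ]; then
    echo "storageLimit ${STORAGE_LIMIT_GIB}Gi leaves too little headroom; want ${LIMIT_MIN}-${LIMIT_MAX}Gi"
fi
```

For a 100Gi PVC this flags any storageLimit above 93Gi, matching the 90-93Gi range recommended throughout this runbook.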

Disk space exhaustion can occur when:

  • Persistent volume allocation is insufficient for workload data growth

  • Data growth outpaces garbage collection and cleanup mechanisms

  • Retention policies are misconfigured or not functioning

  • Garbage collection is not running or is insufficient

  • Multiple DDCS pods compete for limited storage resources

  • RocksDB compaction backlog consumes excessive disk space

  • WAL files accumulate without cleanup

When disk space is exhausted, DDCS pods may fail to start, crash during operation, or report errors when attempting to write data, resulting in service unavailability and degraded rendering performance.

Symptoms and Detection Signals#

Visible Symptoms#

  • DDCS pods crash - Pods failing with disk-related errors

  • Crash loop backoff - Pods repeatedly restarting due to disk space issues

  • Errors writing to disk - Logs indicating write failures or “no space left” errors

  • Service unavailability - DDCS unable to serve requests due to disk issues

  • Write stalls - RocksDB stalling writes due to insufficient disk space

  • Cache eviction - Cache eviction occurring due to disk pressure

Log Messages#

Disk Space Errors#

Pod: ddcs-*
Location: DDCS Pod
Application: DDCS/RocksDB
Description: Logs indicating disk space exhaustion or write failures.
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*no space left on device*" or
         "*No space left*" or
         "*ENOSPC*" or
         "*disk full*" or
         "*failed to write*" or
         "*cannot allocate memory*" or
         "*storage limit exceeded*"

Metric Signals#

The following Prometheus metrics can be used to detect disk space exhaustion before it causes service failures. Monitor these metrics proactively to identify capacity issues early.

kubelet_volume_stats_available_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Available bytes on the persistent volume. Low values indicate disk space exhaustion. Compare against kubelet_volume_stats_capacity_bytes to calculate usage percentage.

kubelet_volume_stats_capacity_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Total capacity of the persistent volume. Compare against available bytes to determine usage percentage. Alert when usage exceeds 80% of capacity.

kubelet_volume_stats_used_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Used bytes on the persistent volume. High values relative to capacity indicate approaching disk space limits.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_total_(stops|delays)"
}

Total count of write stalls (stops or delays) in RocksDB. High values may indicate RocksDB throttling writes due to disk space constraints, causing cache eviction and performance degradation.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"
}

Count of stalls caused by pending compaction bytes exceeding limits. High values may indicate compaction backlog due to insufficient disk space, preventing cleanup of stale data.

ddcs_m_adapter_error_total {
    kind =~
        "rocks_timed_out|rocks_busy|rocks_try_again"
}

Count of errors reported by the database engine. Sharp increases may indicate the engine cannot write due to disk space constraints.

ddcs_m_adapter_io_histogram_bucket {
    io_kind = "rocksdb_write"
}

Histogram capturing time spent on write I/O operations. High values may indicate slow disk writes due to disk space pressure or write stalls.

ddcs_m_adapter_entry_cache_total {
    operation = "miss"
}

Count of cache misses. High values relative to cache hits may indicate cache eviction due to disk space pressure forcing removal of cached data.

ddcs_m_adapter_get_total {
    level = "rocksdb_disk_seek"
}

Count of gets requiring disk seeks. High values relative to in-memory gets may indicate cache eviction due to disk pressure, forcing disk reads instead of cache access.
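
Putting the kubelet volume metrics together, a usage-percentage expression and an 80%-of-capacity alert condition might look like the following PromQL sketch (label values follow the examples above, written as regex matchers; the threshold comes from the 80% guidance in this runbook):

```promql
# Volume usage as a percentage of capacity
100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}
    / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}

# Alert condition: usage above 80% of capacity
100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}
    / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"} > 80
```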

Root Cause Analysis#

Known Causes#

Disk space exhaustion in DDCS is typically caused by insufficient persistent volume allocation, data growth outpacing cleanup mechanisms, or misconfigured retention policies.

Insufficient Persistent Volume Allocation#

PVCs may be undersized for the workload data growth. DDCS stores derived data that can be many times the size of source content. If PVCs are not sized appropriately, they will fill to capacity quickly. The storageLimit configuration should be set to 7-10% less than the PVC size to allow for overhead, but if the PVC itself is too small, even proper configuration cannot prevent exhaustion.

Check PVC sizes and usage:

# Check PVC capacity and usage
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs

# Check actual disk usage within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h

# Check storageLimit configuration
helm get values <ddcs-release-name> -n ddcs | grep storageLimit

# Compare PVC size to storageLimit
# storageLimit should be 7-10% less than PVC size

Data Growth Outpacing Cleanup#

Data growth may outpace garbage collection and cleanup mechanisms. If garbage collection is not running frequently enough, is misconfigured, or cannot free sufficient space, disk usage will continue to grow until exhaustion occurs. RocksDB compaction may also be unable to keep up with write volume, causing accumulation of stale SST files.

Check garbage collection and compaction:

# Check garbage collection configuration
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"

# Check if GC is running
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "garbage\|gc\|cleanup"

# Review RocksDB compaction settings
helm get values <ddcs-release-name> -n ddcs | grep -A 5 "periodic_compaction"

# Monitor disk usage trends
kubectl exec -n ddcs <ddcs-pod-name> -- df -h

Other Possible Causes#

  1. Storage Limit Misconfiguration

    • storageLimit set too high relative to PVC size (should be 7-10% less)

    • storageLimit not accounting for WAL and compaction overhead

    • Multiple DDCS pods sharing storage resources without proper limits

  2. RocksDB Compaction Issues

    • Compaction not running efficiently due to disk space constraints

    • Stale SST files not being removed due to insufficient space

    • Compaction backlog causing storage growth faster than cleanup

    • Pending compaction bytes exceeding limits due to slow disk I/O

  3. Garbage Collection Configuration Issues

    • garbageCollection.minFreeCapacity set too high, preventing GC from running when needed

    • garbageCollection.deleteKeyspaceQuantile set too low, not freeing enough space

    • garbageCollection.checkDbCapacityMs interval too long, delaying GC triggers

    • GC not running due to configuration errors or service issues

  4. Node-Level Storage Issues

    • Node running out of disk space affecting all pods

    • Multiple pods on same node competing for storage

    • Storage backend performance issues preventing efficient cleanup

    • Ephemeral storage limits being exceeded

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Monitor Disk Usage Metrics and Set Up Alerts

    Monitor disk usage metrics to detect exhaustion before it causes service failures.

    # Check PVC capacity and usage
    kubectl get pvc -n ddcs -o wide
    
    # Get detailed PVC information
    kubectl describe pvc -n ddcs <pvc-name>
    
    # Check disk usage from within pods
    kubectl exec -n ddcs <ddcs-pod-name> -- df -h
    
    # Query Prometheus metrics for disk usage
    # kubelet_volume_stats_available_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    # kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    # kubelet_volume_stats_used_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    
    # Calculate usage percentage
    # usage_percent = (used_bytes / capacity_bytes) * 100
    
    # Check DDCS storage usage metrics
    kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
    curl http://localhost:3051/metrics | grep -i "storage\|disk\|size"
    
    Analysis:
    - PVCs showing high usage (>90%) indicate capacity exhaustion risk.
    - Compare kubelet_volume_stats_used_bytes against kubelet_volume_stats_capacity_bytes.
    - DDCS storage metrics approaching storageLimit indicate exhaustion.
    - Rapid increases in disk usage indicate data growth outpacing cleanup.
    Resolution:
    - Set up alerts for PVC usage exceeding 80% of capacity.
    - Monitor DDCS storage metrics and alert when approaching storageLimit.
    - Track disk usage trends to predict when capacity increases are needed.
    - Expand PVCs if the storage class supports volume expansion (see step 2).
  2. Expand Persistent Volume Claims (PVCs) as Needed

    If disk space is exhausted or approaching limits, expand PVCs to provide additional capacity.

    # Check current PVC sizes
    kubectl get pvc -n ddcs
    
    # Check if storage class supports volume expansion
    kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
    
    # If expansion is supported, edit PVC to request larger size
    # Note: Some storage classes require manual PVC edit, others support dynamic expansion
    kubectl edit pvc <pvc-name> -n ddcs
    # Change spec.resources.requests.storage to larger value
    
    # After expansion, update storageLimit in Helm values to match
    # storageLimit should be 7-10% less than PVC size
    helm get values <ddcs-release-name> -n ddcs -o yaml > current-values.yaml
    # Edit: cluster.container.settings.storageLimit
    # Example: If PVC is 100GB, storageLimit should be 90-93GB
    
    # Apply updated values
    helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f current-values.yaml
    
    Analysis:
    - PVCs at or near capacity require expansion.
    - Storage classes with allowVolumeExpansion: true support dynamic expansion.
    - After PVC expansion, storageLimit must be updated to match (7-10% less than PVC size).
    Resolution:
    - Expand PVCs if storage class supports volume expansion.
    - For storage classes without expansion support, create new larger PVCs and migrate data.
    - Update storageLimit in Helm values after expansion (should be 7-10% less than PVC size).
    - Apply updated values: helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
    - Monitor disk usage after expansion to ensure adequate capacity.
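
The expansion flow in the step above can be sketched end to end. The PVC name ddcs-data-0 and the 150Gi target are hypothetical, and the patch only succeeds when the storage class has allowVolumeExpansion: true:

```shell
NEW_PVC_GIB=150   # hypothetical target size

# Request the larger size (same effect as editing spec.resources.requests.storage):
#   kubectl patch pvc ddcs-data-0 -n ddcs \
#     -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_PVC_GIB}Gi\"}}}}"

# Recompute storageLimit for the Helm values: 7-10% below the new PVC size
LIMIT_MIN=$(( NEW_PVC_GIB * 90 / 100 ))
LIMIT_MAX=$(( NEW_PVC_GIB * 93 / 100 ))
echo "set cluster.container.settings.storageLimit between ${LIMIT_MIN}Gi and ${LIMIT_MAX}Gi"
```

Keeping the patch and the storageLimit update together avoids the common mistake of expanding the PVC while leaving the old limit in place.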

Other Diagnostic Actions#

  • Check node-level storage: Verify nodes have sufficient storage capacity:

    kubectl describe nodes | grep -A 5 "Allocated resources"
    kubectl top nodes
    # Check for node-level disk pressure
    kubectl describe nodes | grep -i "disk\|storage\|pressure"
    
  • Review storage class configuration: Verify storage class provides adequate capacity:

    kubectl get storageclass
    kubectl describe storageclass <storage-class-name>
    # Check volume expansion capabilities
    kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • PVC capacity thresholds: Alert when PVC usage exceeds 80% of capacity

  • Storage limit thresholds: Alert when DDCS storage usage approaches storageLimit

  • Garbage collection failures: Alert when GC is not running or failing

  • Rapid disk growth: Alert on rapid increases in disk usage indicating potential issues

  • Node storage pressure: Alert when node-level storage is approaching limits

  • RocksDB write stalls: Alert when write stall metrics indicate disk space pressure

  • Cache eviction rates: Alert when cache miss rates increase due to disk pressure
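
As one example, the write-stall signal above could be expressed as a PromQL alert condition. This is a sketch using the metric names from the detection section; the 10-minute window is illustrative, and delta() is used because the stall counts are exposed through a gauge:

```promql
# Fire when RocksDB reports new write stalls (stops or delays) over the last 10 minutes
delta(ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}[10m]) > 0
```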

Configuration Best Practices#

  • Size PVCs appropriately: Estimate storage requirements based on workload patterns and size PVCs with headroom (account for derived data being many times source content size)

  • Configure storage limits: Set storageLimit to 7-10% less than PVC size to allow for overhead, WAL files, and compaction

  • Enable garbage collection: Ensure GC is configured and running with appropriate thresholds (minFreeCapacity, deleteKeyspaceQuantile, checkDbCapacityMs)

  • Configure periodic compaction: Enable RocksDB periodic compaction to clean up stale SST files

  • Monitor disk usage trends: Track disk usage over time to predict when capacity increases are needed

  • Plan for data growth: Account for data growth as workloads scale and scenes become more complex

  • Use volume expansion: Prefer storage classes that support volume expansion for easier capacity management

Capacity Planning#

  • Estimate storage requirements: Calculate storage needs based on scene sizes, derived data multipliers (often 3-10x source content), and retention requirements

  • Plan for storage growth: Account for storage growth as workloads scale and new assets are introduced

  • Monitor storage trends: Track storage usage trends over time to predict when capacity increases are needed

  • Test retention policies: Validate GC and retention policies under expected production load

  • Review workload patterns: Analyze workload patterns to understand data growth rates and adjust capacity planning accordingly

  • Account for multiple pods: If running multiple DDCS pods, ensure total storage capacity accounts for all pods and their data growth