DDCS: Disk Space Exhaustion#

Overview#

DDCS stores derived data on persistent volumes backed by Persistent Volume Claims (PVCs). When the volume attached to DDCS is too small or fills too quickly, pods may crash, enter a crash loop, or report errors writing to disk. Because DDCS depends on this persistent storage to maintain derived data, disk space exhaustion prevents the service from operating correctly.

The storage limit is controlled by the storageLimit configuration value, which should be set 7-10% below the allocated PVC size to leave headroom for compaction overhead and WAL (Write-Ahead Log) files. When disk space is exhausted, RocksDB cannot write new data, flush write buffers, or perform compaction, causing write stalls, cache eviction, and service failures.
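
The 7-10% rule can be checked mechanically. The sketch below uses hypothetical sizes in GiB; PVC_GIB and STORAGE_LIMIT_GIB stand in for values read from the cluster and the Helm values:

```shell
# Sanity-check the 7-10% rule with hypothetical sizes (GiB)
PVC_GIB=100
STORAGE_LIMIT_GIB=98   # in practice, read from the Helm values

# storageLimit should fall 7-10% below the PVC size
LIMIT_MIN=$(( PVC_GIB * 90 / 100 ))   # 10% headroom
LIMIT_MAX=$(( PVC_GIB * 93 / 100 ))   # 7% headroom

if [ "$STORAGE_LIMIT_GIB" -gt "$LIMIT_MAX" ]; then
    echo "storageLimit ${STORAGE_LIMIT_GIB}Gi leaves too little headroom; want ${LIMIT_MIN}-${LIMIT_MAX}Gi"
fi
```

For a 100Gi PVC this flags any storageLimit above 93Gi, matching the 90-93Gi range recommended throughout this runbook.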

Disk space exhaustion can occur when:

  • Persistent volume allocation is insufficient for workload data growth

  • Data growth outpaces garbage collection and cleanup mechanisms

  • Retention policies are misconfigured or not functioning

  • Garbage collection is not running or is insufficient

  • Multiple DDCS pods compete for limited storage resources

  • RocksDB compaction backlog consumes excessive disk space

  • WAL files accumulate without cleanup

When disk space is exhausted, DDCS pods may fail to start, crash during operation, or report errors when attempting to write data, resulting in service unavailability and degraded rendering performance.

Symptoms and Detection Signals#

Visible Symptoms#

  • DDCS pods crash - Pods failing with disk-related errors

  • Crash loop backoff - Pods repeatedly restarting due to disk space issues

  • Errors writing to disk - Logs indicating write failures or “no space left” errors

  • Service unavailability - DDCS unable to serve requests due to disk issues

  • Write stalls - RocksDB stalling writes due to insufficient disk space

  • Cache eviction - Cache eviction occurring due to disk pressure

Log Messages#

Disk Space Errors#

Pod: ddcs-*
Location: DDCS Pod
Application: DDCS/RocksDB
Description: Logs indicating disk space exhaustion or write failures.
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*no space left on device*" or
         "*No space left*" or
         "*ENOSPC*" or
         "*disk full*" or
         "*failed to write*" or
         "*cannot allocate memory*" or
         "*storage limit exceeded*"

Metric Signals#

The following Prometheus metrics can be used to detect disk space exhaustion before it causes service failures. Monitor these metrics proactively to identify capacity issues early.

kubelet_volume_stats_available_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Available bytes on the persistent volume. Low values indicate disk space exhaustion. Compare against kubelet_volume_stats_capacity_bytes to calculate usage percentage.

kubelet_volume_stats_capacity_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Total capacity of the persistent volume. Compare against available bytes to determine usage percentage. Alert when usage exceeds 80% of capacity.

kubelet_volume_stats_used_bytes {
    persistentvolumeclaim = "ddcs-*",
    namespace = "ddcs"
}

Used bytes on the persistent volume. High values relative to capacity indicate approaching disk space limits.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_total_(stops|delays)"
}

Total count of write stalls (stops or delays) in RocksDB. High values may indicate RocksDB throttling writes due to disk space constraints, causing cache eviction and performance degradation.

ddcs_rocksdb_intrinsic_gauge {
    name =~
        "__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"
}

Count of stalls caused by pending compaction bytes exceeding limits. High values may indicate compaction backlog due to insufficient disk space, preventing cleanup of stale data.

ddcs_m_adapter_error_total {
    kind =~
        "rocks_timed_out|rocks_busy|rocks_try_again"
}

Count of errors reported by the database engine. Sharp increases may indicate the engine cannot write due to disk space constraints.

ddcs_m_adapter_io_histogram_bucket {
    io_kind = "rocksdb_write"
}

Histogram capturing time spent on write I/O operations. High values may indicate slow disk writes due to disk space pressure or write stalls.

ddcs_m_adapter_entry_cache_total {
    operation = "miss"
}

Count of cache misses. High values relative to cache hits may indicate cache eviction due to disk space pressure forcing removal of cached data.

ddcs_m_adapter_get_total {
    level = "rocksdb_disk_seek"
}

Count of gets requiring disk seeks. High values relative to in-memory gets may indicate cache eviction due to disk pressure, forcing disk reads instead of cache access.
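
Putting the kubelet volume metrics together, a usage-percentage expression and an 80%-of-capacity alert condition might look like the following PromQL sketch (label values follow the examples above, written as regex matchers; the threshold comes from the 80% guidance in this runbook):

```promql
# Volume usage as a percentage of capacity
100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}
    / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}

# Alert condition: usage above 80% of capacity
100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}
    / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"} > 80
```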

Root Cause Analysis#

Known Causes#

Disk space exhaustion in DDCS is typically caused by insufficient persistent volume allocation, data growth outpacing cleanup mechanisms, or misconfigured retention policies.

Insufficient Persistent Volume Allocation#

PVCs may be undersized for the workload data growth. DDCS stores derived data that can be many times the size of source content. If PVCs are not sized appropriately, they will fill to capacity quickly. The storageLimit configuration should be set to 7-10% less than the PVC size to allow for overhead, but if the PVC itself is too small, even proper configuration cannot prevent exhaustion.

Check PVC sizes and usage:

# Check PVC capacity and usage
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs

# Check actual disk usage within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h

# Check storageLimit configuration
helm get values <ddcs-release-name> -n ddcs | grep storageLimit

# Compare PVC size to storageLimit
# storageLimit should be 7-10% less than PVC size

Data Growth Outpacing Cleanup#

Data growth may outpace garbage collection and cleanup mechanisms. If garbage collection is not running frequently enough, is misconfigured, or cannot free sufficient space, disk usage will continue to grow until exhaustion occurs. RocksDB compaction may also be unable to keep up with write volume, causing accumulation of stale SST files.

Check garbage collection and compaction:

# Check garbage collection configuration
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"

# Check if GC is running
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "garbage\|gc\|cleanup"

# Review RocksDB compaction settings
helm get values <ddcs-release-name> -n ddcs | grep -A 5 "periodic_compaction"

# Monitor disk usage trends
kubectl exec -n ddcs <ddcs-pod-name> -- df -h

Other Possible Causes#

  1. Storage Limit Misconfiguration

    • storageLimit set too high relative to PVC size (should be 7-10% less)

    • storageLimit not accounting for WAL and compaction overhead

    • Multiple DDCS pods sharing storage resources without proper limits

  2. RocksDB Compaction Issues

    • Compaction not running efficiently due to disk space constraints

    • Stale SST files not being removed due to insufficient space

    • Compaction backlog causing storage growth faster than cleanup

    • Pending compaction bytes exceeding limits due to slow disk I/O

  3. Garbage Collection Configuration Issues

    • garbageCollection.minFreeCapacity set too high, preventing GC from running when needed

    • garbageCollection.deleteKeyspaceQuantile set too low, not freeing enough space

    • garbageCollection.checkDbCapacityMs interval too long, delaying GC triggers

    • GC not running due to configuration errors or service issues

  4. Node-Level Storage Issues

    • Node running out of disk space affecting all pods

    • Multiple pods on same node competing for storage

    • Storage backend performance issues preventing efficient cleanup

    • Ephemeral storage limits being exceeded

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Monitor Disk Usage Metrics and Set Up Alerts

    Monitor disk usage metrics to detect exhaustion before it causes service failures.

    # Check PVC capacity and usage
    kubectl get pvc -n ddcs -o wide
    
    # Get detailed PVC information
    kubectl describe pvc -n ddcs <pvc-name>
    
    # Check disk usage from within pods
    kubectl exec -n ddcs <ddcs-pod-name> -- df -h
    
    # Query Prometheus metrics for disk usage
    # kubelet_volume_stats_available_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    # kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    # kubelet_volume_stats_used_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
    
    # Calculate usage percentage
    # usage_percent = (used_bytes / capacity_bytes) * 100
    
    # Check DDCS storage usage metrics
    kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
    curl http://localhost:3051/metrics | grep -i "storage\|disk\|size"
    
    Analysis:
    - PVCs showing high usage (>90%) indicate capacity exhaustion risk.
    - Compare kubelet_volume_stats_used_bytes against kubelet_volume_stats_capacity_bytes.
    - DDCS storage metrics approaching storageLimit indicate exhaustion.
    - Rapid increases in disk usage indicate data growth outpacing cleanup.
    Resolution:
    - Set up alerts for PVC usage exceeding 80% of capacity.
    - Monitor DDCS storage metrics and alert when approaching storageLimit.
    - Track disk usage trends to predict when capacity increases are needed.
    - Expand PVCs if the storage class supports volume expansion (see step 2).
  2. Expand Persistent Volume Claims (PVCs) as Needed

    If disk space is exhausted or approaching limits, expand PVCs to provide additional capacity.

    # Check current PVC sizes
    kubectl get pvc -n ddcs
    
    # Check if storage class supports volume expansion
    kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
    
    # If expansion is supported, edit PVC to request larger size
    # Note: Some storage classes require manual PVC edit, others support dynamic expansion
    kubectl edit pvc <pvc-name> -n ddcs
    # Change spec.resources.requests.storage to larger value
    
    # After expansion, update storageLimit in Helm values to match
    # storageLimit should be 7-10% less than PVC size
    helm get values <ddcs-release-name> -n ddcs -o yaml > current-values.yaml
    # Edit: cluster.container.settings.storageLimit
    # Example: If PVC is 100GB, storageLimit should be 90-93GB
    
    # Apply updated values
    helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f current-values.yaml
    
    Analysis:
    - PVCs at or near capacity require expansion.
    - Storage classes with allowVolumeExpansion: true support dynamic expansion.
    - After PVC expansion, storageLimit must be updated to match (7-10% less than PVC size).
    Resolution:
    - Expand PVCs if storage class supports volume expansion.
    - For storage classes without expansion support, create new larger PVCs and migrate data.
    - Update storageLimit in Helm values after expansion (should be 7-10% less than PVC size).
    - Apply updated values: helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
    - Monitor disk usage after expansion to ensure adequate capacity.
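
The expansion flow in the step above can be sketched end to end. The PVC name ddcs-data-0 and the 150Gi target are hypothetical, and the patch only succeeds when the storage class has allowVolumeExpansion: true:

```shell
NEW_PVC_GIB=150   # hypothetical target size

# Request the larger size (same effect as editing spec.resources.requests.storage):
#   kubectl patch pvc ddcs-data-0 -n ddcs \
#     -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_PVC_GIB}Gi\"}}}}"

# Recompute storageLimit for the Helm values: 7-10% below the new PVC size
LIMIT_MIN=$(( NEW_PVC_GIB * 90 / 100 ))
LIMIT_MAX=$(( NEW_PVC_GIB * 93 / 100 ))
echo "set cluster.container.settings.storageLimit between ${LIMIT_MIN}Gi and ${LIMIT_MAX}Gi"
```

Keeping the patch and the storageLimit update together avoids the common mistake of expanding the PVC while leaving the old limit in place.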

Other Diagnostic Actions#

  • Check node-level storage: Verify nodes have sufficient storage capacity:

    kubectl describe nodes | grep -A 5 "Allocated resources"
    kubectl top nodes
    # Check for node-level disk pressure
    kubectl describe nodes | grep -i "disk\|storage\|pressure"
    
  • Review storage class configuration: Verify storage class provides adequate capacity:

    kubectl get storageclass
    kubectl describe storageclass <storage-class-name>
    # Check volume expansion capabilities
    kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • PVC capacity thresholds: Alert when PVC usage exceeds 80% of capacity

  • Storage limit thresholds: Alert when DDCS storage usage approaches storageLimit

  • Garbage collection failures: Alert when GC is not running or failing

  • Rapid disk growth: Alert on rapid increases in disk usage indicating potential issues

  • Node storage pressure: Alert when node-level storage is approaching limits

  • RocksDB write stalls: Alert when write stall metrics indicate disk space pressure

  • Cache eviction rates: Alert when cache miss rates increase due to disk pressure
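
As one example, the write-stall signal above could be expressed as a PromQL alert condition. This is a sketch using the metric names from the detection section; the 10-minute window is illustrative, and delta() is used because the stall counts are exposed through a gauge:

```promql
# Fire when RocksDB reports new write stalls (stops or delays) over the last 10 minutes
delta(ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}[10m]) > 0
```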

Configuration Best Practices#

  • Size PVCs appropriately: Estimate storage requirements based on workload patterns and size PVCs with headroom (account for derived data being many times source content size)

  • Configure storage limits: Set storageLimit to 7-10% less than PVC size to allow for overhead, WAL files, and compaction

  • Enable garbage collection: Ensure GC is configured and running with appropriate thresholds (minFreeCapacity, deleteKeyspaceQuantile, checkDbCapacityMs)

  • Configure periodic compaction: Enable RocksDB periodic compaction to clean up stale SST files

  • Monitor disk usage trends: Track disk usage over time to predict when capacity increases are needed

  • Plan for data growth: Account for data growth as workloads scale and scenes become more complex

  • Use volume expansion: Prefer storage classes that support volume expansion for easier capacity management

Capacity Planning#

  • Estimate storage requirements: Calculate storage needs based on scene sizes, derived data multipliers (often 3-10x source content), and retention requirements

  • Plan for storage growth: Account for storage growth as workloads scale and new assets are introduced

  • Monitor storage trends: Track storage usage trends over time to predict when capacity increases are needed

  • Test retention policies: Validate GC and retention policies under expected production load

  • Review workload patterns: Analyze workload patterns to understand data growth rates and adjust capacity planning accordingly

  • Account for multiple pods: If running multiple DDCS pods, ensure total storage capacity accounts for all pods and their data growth