DDCS: RocksDB Corruption or Failures#
Overview#
When DDCS (Derived Data Cache Service) experiences RocksDB corruption or failures, the service may fail to start, logs may report database corruption, or data loss may be observed.
DDCS uses RocksDB, a persistent key-value store that provides durability and performance, as its storage backend for derived data. When corruption occurs, RocksDB cannot read or write data correctly, and the service fails.
RocksDB corruption can occur when:
Unclean shutdowns prevent proper database closure
Disk errors or hardware failures affect persistent volumes
The file system on a persistent volume is corrupted
Disk space runs out during critical operations
The only remediation for corruption is to delete all PVCs and all pods to force a restart; this results in data loss and requires the cache to be rebuilt.
Symptoms and Detection Signals#
Visible Symptoms#
Service fails to start - DDCS pods unable to start due to database errors
Database corruption errors - Logs indicating RocksDB corruption or invalid data
Data loss - Missing or corrupted derived data in cache
Repeated pod restarts - Pods crashing repeatedly due to database errors
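The restart symptom is quick to confirm from a pod listing; a sketch that flags high restart counts, using hypothetical sample rows in place of the live `kubectl get pods` output (the threshold of 5 restarts is an assumption):

```shell
# Live form (namespace/labels as used elsewhere in this runbook):
#   kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Sample "NAME READY STATUS RESTARTS" rows (hypothetical) standing in for live output:
status='ddcs-0 0/1 CrashLoopBackOff 14
ddcs-1 1/1 Running 0'
# Flag any pod whose restart count exceeds the threshold:
echo "$status" | awk '$4 > 5 {print "SUSPECT: " $1 " restarted " $4 " times"}'
# → SUSPECT: ddcs-0 restarted 14 times
```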
Log Messages#
Database Open Failures#
# LEVEL: Error/Fatal
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*failed to open rocksdb database*"
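The query above targets a log aggregator; the same signatures can also be grepped straight out of pod logs. A sketch, with hypothetical sample lines standing in for live `kubectl logs` output:

```shell
# Live form (use --previous to see logs from a crashed container):
#   kubectl logs -n ddcs <ddcs-pod-name> --previous | grep -iE "failed to open rocksdb|corruption|io error"
# Sample log lines (hypothetical wording) standing in for real output:
cat > /tmp/ddcs-sample.log <<'EOF'
2024-05-01T10:00:01Z INFO  starting ddcs
2024-05-01T10:00:02Z FATAL failed to open rocksdb database: Corruption: checksum mismatch in SST file
2024-05-01T10:00:02Z ERROR rocksdb background error: IO error: No space left on device
EOF
# Count lines matching the corruption/I/O-error signatures:
grep -icE "failed to open rocksdb|corruption|io error" /tmp/ddcs-sample.log   # → 2
```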
Metric Signals#
The following Prometheus metrics can be used to detect RocksDB corruption or I/O errors.
| Metric | Description |
|---|---|
| `ddcs_m_adapter_error_total{kind=~"rocks_io_error"}` | Count of I/O errors reported by RocksDB. High values or sharp increases indicate disk I/O failures or corruption that may prevent database operations. |
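The counter can also be checked ad hoc against the Prometheus HTTP API; the endpoint address and the sample response below are assumptions for illustration, with a sample JSON standing in for the live `curl` call:

```shell
# Live form (hypothetical Prometheus address):
#   curl -s "http://prometheus:9090/api/v1/query" \
#     --data-urlencode 'query=increase(ddcs_m_adapter_error_total{kind=~"rocks_io_error"}[10m])'
# Sample response standing in for the curl output:
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"kind":"rocks_io_error"},"value":[1700000000,"12"]}]}}'
# Extract the sample value and keep only the integer part:
count=$(printf '%s' "$response" | sed -n 's/.*"value":\[[0-9.]*,"\([0-9.]*\)"\].*/\1/p')
count=${count%.*}
if [ "${count:-0}" -gt 0 ]; then
  echo "ALERT: ${count} rocks_io_error events in the window"
fi
```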
Root Cause Analysis#
Known Causes#
RocksDB corruption in DDCS is typically caused by unclean shutdowns, disk errors, or software bugs.
Unclean Shutdowns#
Unclean shutdowns occur when DDCS pods are terminated abruptly without allowing RocksDB to flush data and close the database properly. This can happen during node failures, forced pod deletions, or OOM kills. When RocksDB cannot close cleanly, WAL files or SST files may be left in an inconsistent state, causing corruption.
Check for unclean shutdowns:
kubectl describe pod -n ddcs <ddcs-pod-name> | grep -A 10 "Events:"
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide
# Look for termination reasons: OOMKilled, Evicted, NodeShutdown
kubectl get events -n ddcs --sort-by='.lastTimestamp' | grep -i "evict\|kill\|terminate"
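The checks above can be reduced to a script that classifies each pod's last termination reason; a sketch, with hypothetical sample output standing in for the jsonpath query (the label selector matches the commands above):

```shell
# Live form:
#   kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o \
#     jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
# Sample output (hypothetical pod names/reasons) standing in for the command:
pods='ddcs-0 OOMKilled
ddcs-1 Completed
ddcs-2 Error'
# Flag reasons that indicate an unclean shutdown:
unclean=$(echo "$pods" | while read -r name reason; do
  case "$reason" in
    OOMKilled|Evicted|Error|NodeShutdown) echo "UNCLEAN: $name ($reason)" ;;
  esac
done)
echo "$unclean"
```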
Disk Errors or Hardware Failures#
Disk errors or hardware failures on persistent volumes can cause data corruption. This may manifest as I/O errors, read/write failures, or corrupted data blocks. Persistent volumes backed by unreliable storage or experiencing hardware issues are more susceptible to corruption.
Check for disk errors:
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "disk\|io\|error\|fail"
kubectl describe pvc -n ddcs
# Check for volume attachment issues or storage backend problems
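Kernel-level disk errors on the node backing the persistent volume are another signal worth checking; a sketch over sample `dmesg` lines (the sample wording is hypothetical):

```shell
# Live form (requires node access, e.g. via a debug pod):
#   kubectl debug node/<node-name> -it --image=busybox
#   # then inside the debug pod: dmesg | grep -iE "i/o error|fs error|corrupt"
# Sample kernel log lines standing in for real output:
dmesg_sample='[1234.5] EXT4-fs error (device sda1): ext4_find_entry: reading directory lblock 0
[1234.6] blk_update_request: I/O error, dev sda, sector 204800
[1234.7] usb 1-1: new high-speed USB device number 2'
# Count lines indicating disk or file-system trouble:
matches=$(printf '%s\n' "$dmesg_sample" | grep -icE "i/o error|fs error|corrupt")
echo "disk-error lines: $matches"   # → disk-error lines: 2
```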
Other Possible Causes#
File System Corruption
File system corruption on persistent volumes
File system errors preventing proper writes
Mount issues causing data inconsistencies
Insufficient Disk Space
Disk space exhaustion during critical operations
WAL files or SST files corrupted due to space constraints
Compaction failures due to insufficient space
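RocksDB needs free headroom for WAL growth and compaction output, so disk usage is worth checking before writes start failing; a sketch of a threshold check (the 95% limit and the `/data` mount path are assumptions):

```shell
# Live usage reading (hypothetical data path inside the pod):
#   kubectl exec -n ddcs <ddcs-pod-name> -- df -h /data
# Warn when usage crosses the limit (default 95%):
check_disk() {
  used_pct="$1"; limit="${2:-95}"
  if [ "$used_pct" -ge "$limit" ]; then
    echo "WARN: ${used_pct}% used; compaction and WAL writes may start failing"
    return 1
  fi
  echo "OK: ${used_pct}% used"
}
check_disk 97   # → WARN: 97% used; compaction and WAL writes may start failing
check_disk 40   # → OK: 40% used
```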
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
Delete All PVCs and Pods to Force Restart
The only solution for RocksDB corruption is to delete all PVCs and delete all pods to force a restart. This will result in data loss, and the cache will be rebuilt as workloads run.
# List all DDCS PVCs
kubectl get pvc -n ddcs
# Delete all DDCS PVCs
# WARNING: This will result in complete data loss
# (deletion may stay Pending until the pods using the PVCs are gone)
kubectl delete pvc --all -n ddcs
# Delete all DDCS pods
kubectl delete pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Verify pods are running and healthy
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
Analysis:
Deleting PVCs removes all corrupted data.
Deleting pods forces recreation with new PVCs.
Cache will be rebuilt as workloads access DDCS.
Resolution:
Delete all PVCs to remove corrupted data (data will be lost).
Delete all pods to force restart with clean storage.
Monitor pod startup to ensure service is healthy.
Cache will rebuild automatically as workloads run (renders will be slower during rebuild).
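The "monitor pod startup" step can be scripted as a simple poll; a minimal sketch in which the readiness check itself is stubbed out (against a live cluster, `kubectl wait` is the real equivalent):

```shell
# Generic poll: run a readiness check until it succeeds or a timeout (seconds) expires.
wait_for_ready() {
  check_cmd="$1"; timeout="$2"; elapsed=0
  until eval "$check_cmd"; do
    sleep 1
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s"
      return 1
    fi
  done
  echo "ready after ${elapsed}s"
}

# Against a live cluster, kubectl's built-in wait does the same job:
#   kubectl wait --for=condition=Ready pod -n ddcs \
#     -l app.kubernetes.io/instance=ddcs --timeout=300s
wait_for_ready true 10   # → ready after 0s
```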
Proactive Monitoring#
Set up alerts for:
RocksDB I/O errors: Alert when ddcs_m_adapter_error_total{kind=~"rocks_io_error"} increases
Pod termination events: Alert on OOM kills or forced pod terminations
Database corruption errors: Alert on corruption-related log messages
Pod startup failures: Alert when DDCS pods fail to start repeatedly
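The first alert above could be expressed as a Prometheus alerting rule; a sketch of one possible rule, where the group name, window, threshold, and `for` duration are assumptions to tune:

```yaml
groups:
  - name: ddcs-rocksdb
    rules:
      - alert: DDCSRocksDBIOErrors
        # Fires when the I/O-error counter moved at all in the last 10 minutes.
        expr: increase(ddcs_m_adapter_error_total{kind=~"rocks_io_error"}[10m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DDCS is reporting RocksDB I/O errors (possible corruption)"
```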