DDCS: No Active Shards#
Overview#
DDCS (Derived Data Cache Service) is the storage medium for derived data. It uses client-side logic to distribute traffic across multiple pods. When DDCS is enabled (it is enabled for all workflows), at least one healthy pod must be serving traffic; when all pods are unhealthy, rendering stops and write-render-worker pods begin to fail.
Clients discover DDCS peers via DNS (e.g., ddcs.ddcs.svc.cluster.local). Discovery can return IPs that are not reachable, or zero IPs when no DDCS pods are scheduled, either of which causes unexpected failures in the rendering pipeline.
Symptoms and Detection Signals#
Visible Symptoms#
- Application crashes on startup with errors related to DDCS connection failures
- Active sessions unable to open files - previously working sessions suddenly cannot load new assets
- Error messages in application logs indicating DDCS is unavailable
Log Messages#
0 Active Shards Log Message#
```
# LEVEL: Info
# SOURCE: omni.datastore.health
kubernetes.pod_name: write-render-worker* and
message: "*ShardHealthCheck: active shard list changed, have 0 active shards!*"
```
DNS Discovery Failed#
```
# LEVEL: Error
# SOURCE: omni.datastore
kubernetes.pod_name: write-render-worker* and
message: "*Unable to discover grpc datastore service uri*"
```
Metric Signals#
Monitor the following Prometheus metrics or Kubernetes states:
| Metric/State | Signal |
|---|---|
| Pod status | Pods showing `ImagePullBackOff` indicate image authorization issues. |
| Pod scheduling | Pods in `Pending` state indicate PVC attachment or resource scheduling problems. |
| Pod restart counts | High restart counts indicate pods that may be intermittently unavailable, causing connection failures during restart periods. |
| DDCS pod endpoint health | Pods responding to health checks but not to gRPC requests on port 3010 indicate a service-level issue rather than pod failure. |
Root Cause Analysis#
Known Causes#
StatefulSet Scaled to 0#
For various reasons an operator may want to scale the DDCS StatefulSet “to zero”. This may be done through the values configuration, by setting replicas to 0.
In this scenario, the operator has chosen not to schedule any DDCS pods. However, sessions continue to rely on the now-unavailable service, and the symptoms described in this document will appear.
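As a quick sanity check, the configured replica count can be compared against zero before scaling back up. This is a minimal sketch: the StatefulSet name `ddcs`, the `ddcs` namespace, and a target of 3 replicas are assumptions — substitute the values from your own deployment.

```shell
# Sketch: detect the scaled-to-zero condition and print the fix for review.
# In a live cluster, spec_replicas would come from:
#   kubectl get statefulset ddcs -n ddcs -o jsonpath='{.spec.replicas}'
spec_replicas=0   # illustrative value standing in for the kubectl output

if [ "$spec_replicas" -eq 0 ]; then
  fix="kubectl scale statefulset ddcs -n ddcs --replicas=3"
  echo "DDCS is scaled to 0; to restore service, run: $fix"
fi
```

The command is printed rather than executed so the operator can confirm the names and target replica count first.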
PVC Attachment Failures#
Pods may be stuck in Pending state. In this case, `kubectl describe` the pods and PVCs in the namespace; Kubernetes may report that PVCs are failing to attach or cannot be created. Depending on the CSP, there may be a billing or configuration issue; reference the documentation for your Kubernetes platform.
If need be, delete the PVC(s) for DDCS and then delete the pods. Renders will be slower while the cache is rebuilt.
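The cache-reset path can be sketched as two commands. They are printed here for review rather than executed, because deleting the PVCs is destructive (the derived-data cache is rebuilt afterwards). The `ddcs` namespace and the `app.kubernetes.io/instance=ddcs` label are assumptions based on a default installation.

```shell
# Destructive reset: remove DDCS PVCs, then the pods that mount them.
ns=ddcs
sel="app.kubernetes.io/instance=ddcs"

delete_pvcs="kubectl delete pvc -n $ns -l $sel"
delete_pods="kubectl delete pods -n $ns -l $sel"

# Print for review; run each command manually once you have confirmed the
# selector matches only DDCS resources.
echo "$delete_pvcs"
echo "$delete_pods"
```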
Misc Scheduling Issue#
Pods stuck in Pending state are often not scheduled due to insufficient resources.
Ensure DDCS is configured to schedule onto the resources provisioned in your environment. Either allocate more or larger CPU VM SKUs to the Kubernetes node pool, or modify the values configuration to reduce DDCS resource requests. Be mindful that reducing resources will affect overall performance.
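The scheduling failure can be spot-checked arithmetically: a pod stays Pending whenever its CPU request exceeds the allocatable capacity of every node. A minimal sketch, with illustrative millicore values standing in for the numbers reported by `kubectl describe nodes`:

```shell
# Compare a pod's CPU request against a node's allocatable CPU (millicores).
requested_mcpu=4000     # from the pod spec (illustrative)
allocatable_mcpu=3500   # from `kubectl describe nodes` (illustrative)

if [ "$requested_mcpu" -gt "$allocatable_mcpu" ]; then
  verdict="unschedulable: requests ${requested_mcpu}m > allocatable ${allocatable_mcpu}m"
else
  verdict="schedulable"
fi
echo "$verdict"
```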
Other Possible Causes#
DNS Propagation Delays
- DNS not updated after pod termination
- DNS cache TTL longer than pod lifecycle transitions
- Stale DNS entries in client-side DNS cache
Network Connectivity Issues
- Network policies blocking access to DDCS pods. Confirm that network policies allow communication between functions and DDCS pods.
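One way to audit this is to export the NetworkPolicies selecting the DDCS pods and confirm an ingress rule covers the gRPC port. A sketch, with a sample policy fragment standing in for the live output of `kubectl get networkpolicy -n ddcs -o yaml` (the namespace is an assumption):

```shell
# Sketch: check an exported NetworkPolicy for the DDCS gRPC port (3010).
policy_yaml='
ingress:
- ports:
  - port: 3010
    protocol: TCP
'
if grep -q "port: 3010" <<<"$policy_yaml"; then
  np_verdict="policy allows ingress on the DDCS gRPC port"
else
  np_verdict="no rule for port 3010 found; traffic to DDCS may be blocked"
fi
echo "$np_verdict"
```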
Service Endpoint Mismatch
- Kubernetes Service endpoints not updated to reflect current pod states
- Service selector not matching current pod labels
- Endpoints controller not reconciling pod changes
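The endpoint/pod comparison can be scripted as a set difference. A minimal sketch, with sample IPs standing in for the live outputs of the two jsonpath queries shown in the comments:

```shell
# Find pod IPs missing from the Service endpoints.
# Live values would come from:
#   kubectl get endpoints ddcs -n ddcs -o jsonpath='{.subsets[*].addresses[*].ip}'
#   kubectl get pods -n ddcs -o jsonpath='{.items[*].status.podIP}'
endpoint_ips="10.0.1.5 10.0.1.6"         # illustrative
pod_ips="10.0.1.5 10.0.1.6 10.0.1.7"     # illustrative

missing=$(comm -13 <(tr ' ' '\n' <<<"$endpoint_ips" | sort) \
                   <(tr ' ' '\n' <<<"$pod_ips" | sort))
echo "pod IPs absent from Service endpoints: $missing"
```

A non-empty result points at a selector/label mismatch or an endpoints controller that has not reconciled.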
Resource Constraints
- Pods evicted due to node resource pressure
- Pods unable to start due to insufficient node resources
- Storage issues preventing pod startup
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
Check if DDCS StatefulSet is Scaled Down to 0
If DDCS is enabled in the application configuration but the StatefulSet has 0 replicas, no pods will be available to serve requests.
```shell
# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicas}{" replicas configured (current: "}{.status.replicas}{" running, "}{.status.readyReplicas}{" ready)"}{"\n"}{end}'

# Alternative: Simple check of all StatefulSets in ddcs namespace
kubectl get statefulset -n ddcs

# Verify if any DDCS pods exist
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
```
Analysis:
- If the StatefulSet shows 0 replicas configured, this is the root cause.
- If no pods are returned from the pod query, DDCS is not running.
- If DNS SRV records exist but no pods are running, DNS may be stale or the StatefulSet was recently scaled down.

Resolution:
- Scale the StatefulSet back up to the desired replica count (see Resolution Steps section).

Check for Image Pull Authorization Issues
If pods are stuck in `ImagePullBackOff` status, the namespace may not have the appropriate secret to access the NVIDIA container repository.

```shell
# Check pod status for ImagePullBackOff
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Describe pods to see detailed error messages
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Verify image pull secrets exist
kubectl get secrets -n ddcs | grep -E "docker-registry|regcred|ngc"
```
Analysis:
- Pods showing `ImagePullBackOff` status indicate authorization issues.
- Error messages in pod events will clarify why Kubernetes cannot obtain the image.
- Missing or incorrect image pull secrets indicate the root cause.

Resolution:
- Ensure the pull secret configured for the installation has access to the NVIDIA repository.
- Verify the secret name matches what’s configured in Helm values under `image.pullSecrets`.

Check for PVC Attachment Failures
Pods may be stuck in `Pending` state due to Persistent Volume Claim (PVC) attachment failures.

```shell
# Check pod status
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Describe pods to see PVC-related errors
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Check PVC status
kubectl get pvc -n ddcs

# Describe PVCs to see attachment issues
kubectl describe pvc -n ddcs
```
Analysis:
- Pods stuck in `Pending` state with PVC-related errors in events.
- PVCs showing `Pending` or `Failed` status.
- Error messages indicating attachment failures or billing/configuration issues.

Resolution:
- Reference documentation for your Kubernetes platform (AWS EKS, Azure AKS, etc.).
- Check for billing or quota issues with storage.
- If necessary, delete the PVC(s) for DDCS and then delete the pods (renders will be slower while the cache is rebuilt).

Check for Resource Scheduling Issues
Pods stuck in `Pending` state may be unscheduled due to insufficient node resources.

```shell
# Check pod status and scheduling events
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide

# Describe pods to see why they're not scheduled
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Check node resources
kubectl top nodes

# Check if nodes have sufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```
Analysis:
- Pods in `Pending` state with events indicating insufficient CPU/memory.
- Nodes at or near capacity.
- No available nodes matching pod resource requirements.

Resolution:
- Allocate more or larger CPU VM SKUs to the Kubernetes node pool.
- Modify the values configuration to reduce resource consumption (be mindful this affects performance).
Other Diagnostic Actions#
- DNS SRV record count mismatch: Compare the number of SRV records returned by DNS resolution against the number of healthy DDCS pods:

```shell
# Get DNS SRV records
kubectl run -it --rm debug --image=busybox --restart=Never -- \
  nslookup -type=SRV _grpc._tcp.ddcs.ddcs.svc.cluster.local
```

- Stale DNS SRV entries: DNS SRV records may contain endpoints for pods that no longer exist or have been terminated.
- Pod endpoint mismatch: SRV records pointing to pod endpoints (hostname:port) that don’t match current pod endpoints.
- gRPC connection failures: Attempt to connect to each discovered DDCS endpoint from a client pod:

```shell
# From within a workload pod or debug pod (the image's entrypoint is grpcurl)
kubectl run -it --rm debug --image=fullstorydev/grpcurl --restart=Never -- \
  -plaintext <ddcs-pod-ip>:3010 list
```

- Port 3010 unreachable: Network policies or firewall rules blocking access to the DDCS gRPC port.
- Service endpoint verification: Verify Kubernetes Service endpoints match actual pod IPs:

```shell
kubectl get endpoints ddcs -n ddcs -o yaml
```
Prevention#
Proactive Monitoring#
Set up alerts for:
- Pod count vs DNS SRV record count mismatch: Alert when the number of DNS SRV records doesn’t match the number of healthy DDCS pods
- Pods in non-ready state: Alert when pods remain non-ready for more than 5 minutes
- High pod restart rates: Alert on elevated restart counts indicating instability
- Service endpoint mismatches: Alert when service endpoints don’t match SRV record targets
- gRPC connection failures: Monitor application-level metrics for DDCS connection failures
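The first alert above can be prototyped as a simple comparison. A sketch, with the two counts supplied manually; in practice they would come from the DNS SRV query and `kubectl get pods` shown earlier in this runbook:

```shell
# Compare DNS SRV record count with ready DDCS pod count.
check_shard_counts() {
  local srv_count=$1 pod_count=$2
  if [ "$srv_count" -ne "$pod_count" ]; then
    echo "ALERT: $srv_count SRV records vs $pod_count ready pods"
    return 1
  fi
  echo "OK: $srv_count shards discoverable and ready"
}

check_shard_counts 0 3 || true   # mismatch: fires the alert path
check_shard_counts 3 3           # healthy
```

The non-zero return code makes the function easy to wire into a cron-style probe or a blackbox-exporter script.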