DDCS: No Active Shards#

Overview#

When DDCS is enabled (all workflows), at least one healthy pod must be serving traffic. DDCS is the storage medium for derived data; when all pods are unhealthy, rendering stops.

DDCS (Derived Data Cache Service) uses client-side logic to distribute traffic across multiple pods. An unhealthy DDCS therefore surfaces as failures inside write-render-worker pods.

Clients discover DDCS peers with DNS (e.g., ddcs.ddcs.svc.cluster.local). Discovery can return IPs that are not reachable, or zero IPs when no DDCS pods are scheduled. Either case leaves clients without usable peers and stalls the rendering pipeline.
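A quick way to spot a discovery gap is to compare what DNS would return against pods that are actually Ready. For a headless service, DNS A records are derived from the Service's Endpoints object, so counting endpoint addresses approximates the DNS view. This is a sketch; the namespace, service name, and label selector are assumptions taken from the examples later in this document and may differ in your installation.

```shell
# Count endpoint addresses for the DDCS service (mirrors headless-service DNS),
# then count DDCS pods whose Ready condition is True.
# Names here (namespace "ddcs", service "ddcs", label selector) are assumptions.
endpoint_count=$(kubectl get endpoints ddcs -n ddcs \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' 2>/dev/null | grep -c .)
ready_count=$(kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs \
  -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  2>/dev/null | grep -c 'True')
if [ "$endpoint_count" -ne "$ready_count" ]; then
  echo "MISMATCH: $endpoint_count endpoint addresses, but $ready_count pods are Ready"
fi
```

A mismatch in either direction is suspicious: more endpoints than Ready pods suggests stale entries; fewer suggests pods serving traffic that clients cannot discover.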

Symptoms and Detection Signals#

Visible Symptoms#

  • Application crashes on startup with errors related to DDCS connection failures

  • Active sessions unable to open files - previously working sessions suddenly cannot load new assets

  • Error messages in application logs indicating DDCS is unavailable

Log Messages#

0 Active Shards Log Message#

Pod: write-render-worker
Location: Cloud Function Pod
Application: KIT
Description: The client discovered peers but determined all peers were unhealthy.
# LEVEL: Info
# SOURCE: omni.datastore.health
kubernetes.pod_name: write-render-worker* and
message: "*ShardHealthCheck: active shard list changed, have 0 active shards!*"

DNS Discovery Failed#

Pod: write-render-worker
Location: Cloud Function Pod
Application: KIT
Description: The client was configured to discover DDCS pods but there is a problem finding peers.
# LEVEL: Error
# SOURCE: omni.datastore
kubernetes.pod_name: write-render-worker* and
message: "*Unable to discover grpc datastore service uri*"

Metric Signals#

Monitor the following Prometheus metrics or Kubernetes states:

  • kube_pod_status_ready: Pods showing false for extended periods are not ready to accept traffic. Compare DNS entries against ready pods.

  • kube_pod_status_phase: Pods in Pending or Failed phase are not reachable; neither are pods whose containers are in CrashLoopBackOff. These pods may still appear in DNS.

  • kube_pod_container_status_restarts_total: High restart counts indicate pods that may be intermittently unavailable, causing connection failures during restart periods.

  • DDCS pod endpoint health: Pods responding to health checks but not to gRPC requests on port 3010 indicate a service-level issue rather than pod failure.

Root Cause Analysis#

Known Causes#

StatefulSet Scaled to 0#

For various reasons an operator may want to scale the DDCS StatefulSet to zero. This may be done through the values configuration, by setting replicas to 0.

In this scenario, the operator has chosen not to schedule any DDCS pods, but sessions continue to rely on the now-unavailable service. The symptoms described in this document will appear.
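If the scale-down was not intended, the workload can be scaled back up directly. The StatefulSet name and namespace ("ddcs"/"ddcs") are assumptions; match them to your installation.

```shell
# Restore replicas (name, namespace, and count are assumptions).
kubectl scale statefulset ddcs -n ddcs --replicas=3 2>/dev/null \
  || echo "scale failed; verify the StatefulSet name and namespace"

# Confirm pods are scheduled and become Ready:
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs 2>/dev/null || true
```

Also restore the replica count in the values configuration; otherwise the next helm upgrade will scale the service back to zero.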

Not Authorized for NVIDIA Image Repository#

The namespace may not have the appropriate secret to access the NVIDIA container repository.

Check the Kubernetes pods in the namespace. If all pods are stuck with ImagePullBackOff status, kubectl describe the pods. The events will clarify why Kubernetes is unable to obtain the image. If authorization appears to be the issue, ensure that the pull secret configured for the installation has access to the NVIDIA repository.
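If the pull secret is missing or stale, it can be recreated. This sketch assumes NGC conventions (registry nvcr.io, literal username $oauthtoken with an API key as the password) and a hypothetical secret name ngc-secret; use whatever secret name your installation's image.pullSecrets references.

```shell
# Recreate the registry pull secret (secret name and namespace are placeholders).
# NGC uses the literal username "$oauthtoken"; NGC_API_KEY must hold a valid key.
kubectl create secret docker-registry ngc-secret -n ddcs \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f - 2>/dev/null \
  || echo "apply failed; check cluster access and the NGC_API_KEY variable"
```

After updating the secret, delete the stuck pods so Kubernetes retries the image pull with the new credentials.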

PVC Attachment Failures#

Pods may be stuck in Pending state. In this case, kubectl describe the pods and PVCs in the namespace. Kubernetes may report that PVCs are failing to attach or cannot be created. Depending on the CSP, there may be a billing or configuration issue; reference the documentation for your Kubernetes platform.

If need be, delete the PVC(s) for DDCS and then delete the pods. Renders will be slower while the cache is rebuilt.
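The reset described above might look like the following; the namespace and label selector are assumptions. Deleting a bound PVC blocks on a protection finalizer until its pod is gone, so issue the PVC delete without waiting, then delete the pods.

```shell
# Destructive: drops the derived-data cache; renders are slower until it rebuilds.
# Namespace and label selector are assumptions -- adjust to your installation.
kubectl delete pvc -n ddcs -l app.kubernetes.io/instance=ddcs --wait=false 2>/dev/null \
  || echo "pvc delete failed; check the namespace and label selector"
kubectl delete pods -n ddcs -l app.kubernetes.io/instance=ddcs 2>/dev/null \
  || echo "pod delete failed; check the namespace and label selector"
```

The StatefulSet controller recreates the pods, which provision fresh PVCs on startup.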

Misc Scheduling Issue#

Pods stuck in Pending state are often not scheduled due to insufficient resources.

Ensure DDCS is properly configured to be scheduled on the resources provisioned in your environment. Allocate more or larger CPU VM SKUs to the Kubernetes node pool. Or, modify the values configuration to reduce resource consumption. Be mindful that this will affect overall performance.
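As a sketch, a values override that reduces resource consumption might look like the following; the key names are assumptions that depend on the DDCS chart, so confirm them against your chart's values schema before applying.

```yaml
# Hypothetical values fragment -- verify key names against the DDCS chart.
resources:
  requests:
    cpu: "2"        # lower requests so pods fit on the existing node pool
    memory: 8Gi
  limits:
    memory: 8Gi
```

Lower requests make pods schedulable on smaller nodes at the cost of cache performance.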

Other Possible Causes#

  1. DNS Propagation Delays

    • DNS not updated after pod termination

    • DNS cache TTL longer than pod lifecycle transitions

    • Stale DNS entries in client-side DNS cache

  2. Network Connectivity Issues

    • Network policies blocking access to DDCS pods. Confirm that network policies allow communication between functions and DDCS pods.

  3. Service Endpoint Mismatch

    • Kubernetes Service endpoints not updated to reflect current pod states

    • Service selector not matching current pod labels

    • Endpoints controller not reconciling pod changes

  4. Resource Constraints

    • Pods evicted due to node resource pressure

    • Pods unable to start due to insufficient node resources

    • Storage issues preventing pod startup
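For the network-policy case in particular, it helps to list and inspect policies in the DDCS namespace to see whether anything restricts ingress on the gRPC port. The namespace is an assumption:

```shell
# Look for NetworkPolicies that could block client -> DDCS traffic on port 3010.
kubectl get networkpolicy -n ddcs 2>/dev/null || echo "no access or no policies"
kubectl describe networkpolicy -n ddcs 2>/dev/null || true
```

If a policy exists, check that its ingress rules admit traffic from the namespaces where write-render-worker pods run.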

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Check if DDCS StatefulSet is Scaled Down to 0

    If DDCS is enabled in the application configuration but the StatefulSet has 0 replicas, no pods will be available to serve requests.

    # Check StatefulSet replica configuration
    kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicas}{" replicas configured (current: "}{.status.replicas}{" running, "}{.status.readyReplicas}{" ready)"}{"\n"}{end}'
    
    # Alternative: Simple check of all StatefulSets in ddcs namespace
    kubectl get statefulset -n ddcs
    
    # Verify if any DDCS pods exist
    kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
    
    Analysis:
    - If StatefulSet shows 0 replicas configured, this is the root cause.
    - If no pods are returned from the pod query, DDCS is not running.
    - If DNS SRV records exist but no pods are running, DNS may be stale or StatefulSet was recently scaled down.
    Resolution:
    - Scale the StatefulSet back up to the desired replica count (see Resolution Steps section).
  2. Check for Image Pull Authorization Issues

    If pods are stuck in ImagePullBackOff status, the namespace may not have the appropriate secret to access the NVIDIA container repository.

    # Check pod status for ImagePullBackOff
    kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
    
    # Describe pods to see detailed error messages
    kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"
    
    # Verify image pull secrets exist
    kubectl get secrets -n ddcs | grep -E "docker-registry|regcred|ngc"
    
    Analysis:
    - Pods showing ImagePullBackOff status indicate authorization issues.
    - Error messages in pod events will clarify why Kubernetes cannot obtain the image.
    - Missing or incorrect image pull secrets indicate the root cause.
    Resolution:
    - Ensure the pull secret configured for the installation has access to the NVIDIA repository.
    - Verify the secret name matches what’s configured in Helm values under image.pullSecrets.
  3. Check for PVC Attachment Failures

    Pods may be stuck in Pending state due to Persistent Volume Claim (PVC) attachment failures.

    # Check pod status
    kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
    
    # Describe pods to see PVC-related errors
    kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"
    
    # Check PVC status
    kubectl get pvc -n ddcs
    
    # Describe PVCs to see attachment issues
    kubectl describe pvc -n ddcs
    
    Analysis:
    - Pods stuck in Pending state with PVC-related errors in events.
    - PVCs showing Pending or Failed status.
    - Error messages indicating attachment failures or billing/configuration issues.
    Resolution:
    - Reference documentation for your Kubernetes platform (AWS EKS, Azure AKS, etc.).
    - Check for billing or quota issues with storage.
    - If necessary, delete the PVC(s) for DDCS and then delete the pods (renders will be slower while cache is rebuilt).
  4. Check for Resource Scheduling Issues

    Pods stuck in Pending state may be unscheduled due to insufficient node resources.

    # Check pod status and scheduling events
    kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide
    
    # Describe pods to see why they're not scheduled
    kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"
    
    # Check node resources
    kubectl top nodes
    
    # Check if nodes have sufficient resources
    kubectl describe nodes | grep -A 5 "Allocated resources"
    
    Analysis:
    - Pods in Pending state with events indicating insufficient CPU/memory.
    - Nodes at or near capacity.
    - No available nodes matching pod resource requirements.
    Resolution:
    - Allocate more or larger CPU VM SKUs to the Kubernetes node pool.
    - Modify the values configuration to reduce resource consumption (be mindful this affects performance).

Other Diagnostic Actions#

  • DNS SRV record count mismatch: Compare the number of SRV records returned by DNS resolution against the number of healthy DDCS pods:

    # Get DNS SRV records
    kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup -type=SRV _grpc._tcp.ddcs.ddcs.svc.cluster.local
    
  • Stale DNS SRV entries: DNS SRV records may contain endpoints for pods that no longer exist or have been terminated

  • Pod endpoint mismatch: SRV records pointing to pod endpoints (hostname:port) that don’t match current pod endpoints

  • gRPC connection failures: Attempt to connect to each discovered DDCS endpoint from a client pod:

    # From within a workload pod or debug pod
    kubectl run -it --rm debug --image=fullstorydev/grpcurl --restart=Never -- grpcurl -plaintext <ddcs-pod-ip>:3010 list
    
  • Port 3010 unreachable: Network policies or firewall rules blocking access to DDCS gRPC port

  • Service endpoint verification: Verify Kubernetes Service endpoints match actual pod IPs:

    kubectl get endpoints ddcs -n ddcs -o yaml
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Pod count vs DNS SRV record count mismatch: Alert when number of DNS SRV records doesn’t match number of healthy DDCS pods

  • Pods in non-ready state: Alert when pods remain non-ready for more than 5 minutes

  • High pod restart rates: Alert on elevated restart counts indicating instability

  • Service endpoint mismatches: Alert when service endpoints don’t match SRV record targets

  • gRPC connection failures: Monitor application-level metrics for DDCS connection failures
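The pod-readiness alerts above could be sketched as Prometheus alerting rules over kube-state-metrics. The namespace label value and thresholds are assumptions to adapt to your environment:

```yaml
groups:
  - name: ddcs-availability
    rules:
      - alert: DDCSNoReadyPods
        expr: sum(kube_pod_status_ready{condition="true", namespace="ddcs"}) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No DDCS pods are Ready; derived-data rendering will stop."
      - alert: DDCSPodNotReady
        expr: kube_pod_status_ready{condition="false", namespace="ddcs"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DDCS pod {{ $labels.pod }} has been non-ready for 5 minutes."
```

The SRV-count and endpoint-mismatch alerts require a custom exporter or blackbox probe, since DNS answers are not exposed as standard metrics.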