DDCS: Network Bandwidth or Latency Bottlenecks#

Overview#

When DDCS (Derived Data Cache Service) experiences network bandwidth or latency bottlenecks, cache population slows, throughput between DDCS and GPU nodes degrades, and scene load times increase.

DDCS serves derived data to render workers over the network and therefore requires high bandwidth and low latency for efficient data transfer. As a rule of thumb, provision 3.3 Gbps of network bandwidth for each GPU in the cluster, with a minimum of 1 Gbps per GPU.

Network bandwidth or latency bottlenecks can occur when:

  • Insufficient DDCS pods scheduled to handle network load

  • Insufficient network bandwidth between DDCS and GPU nodes

  • Network congestion from noisy neighbors in shared environments

  • Misconfigured Kubernetes networking causing suboptimal routing

  • Cross-availability zone or cross-data center traffic

When network bottlenecks occur, DDCS cannot efficiently serve data to render workers, causing slow cache population, degraded throughput, and increased scene load times.

Symptoms and Detection Signals#

Visible Symptoms#

  • Slow cache population - Cache taking longer than expected to populate with derived data

  • Degraded throughput - Reduced data transfer rates between DDCS and GPU nodes

  • Increased scene load times - Scene loads taking significantly longer than expected

Metric Signals#

The following Prometheus metrics can be used to detect network bandwidth or latency bottlenecks. All three are cumulative counters, so evaluate them as rates over a time window rather than as raw values.

Metric: ddcs_m_adapter_bytes_returned_total

Description: Total bytes returned from all levels (cache or disk). A high rate relative to network capacity may indicate a network bottleneck. Compare against network transmit metrics to determine whether DDCS is saturating the available bandwidth.

Metric: container_network_receive_bytes_total{pod=~"ddcs-.*"}

Description: Total bytes received by DDCS pods. A receive rate approaching the network interface limit indicates saturation. Compare against network interface capacity to identify bottlenecks.

Metric: container_network_transmit_bytes_total{pod=~"ddcs-.*"}

Description: Total bytes transmitted by DDCS pods. A transmit rate approaching the network interface limit indicates saturation. Combined with the receive metric, this helps identify whether DDCS pods are network-bound.
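
Because these metrics are cumulative counters, they are easiest to interpret as rates. A minimal PromQL sketch (the 10 Gbps NIC capacity is an assumption; substitute the interface speed of your nodes):

```promql
# Per-pod transmit throughput in bits/sec over a 5-minute window
rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m]) * 8

# Fraction of an assumed 10 Gbps NIC consumed per DDCS pod
(rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m]) * 8) / 10e9
```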

Root Cause Analysis#

Known Causes#

Network bandwidth or latency bottlenecks in DDCS are typically caused by insufficient DDCS pods scheduled, insufficient network bandwidth, or network congestion.

Insufficient DDCS Pods Scheduled#

The DDCS replica count must match the number of compute nodes calculated from the cluster's network bandwidth requirements. The rule of thumb is to provision at least 3.3 Gbps of bandwidth per GPU. If too few DDCS pods are scheduled, the available pods can become network-bound, causing bottlenecks. Refer to the DDCS: Configure guide for scaling guidance.

Check DDCS pod count:

# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl describe statefulset -n ddcs <ddcs-statefulset-name>

# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps / 10 Gbps per node = ~9 pods required
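
The sizing arithmetic above can be sketched in Python (the function name and defaults are illustrative, not part of DDCS):

```python
import math

def required_ddcs_pods(gpu_count: int, node_bandwidth_gbps: float,
                       gbps_per_gpu: float = 3.3) -> int:
    """Rule of thumb: 3.3 Gbps per GPU, rounded up to whole replicas."""
    return math.ceil(gpu_count * gbps_per_gpu / node_bandwidth_gbps)

# Example from the text: 25 GPUs on 10 Gbps compute nodes
print(required_ddcs_pods(25, 10))  # 82.5 Gbps / 10 Gbps -> 9 pods
```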

Insufficient Network Bandwidth#

Network bandwidth may be insufficient for the workload. DDCS requires high bandwidth to serve derived data efficiently; provision 3.3 Gbps per GPU, with a minimum of 1 Gbps per GPU.

Network Congestion#

In shared network environments, other workloads may consume bandwidth, causing congestion and degraded performance for DDCS traffic.

Other Possible Causes#

  1. Misconfigured Kubernetes Networking

    • Suboptimal routing causing increased latency

    • Network policies limiting throughput

    • Cross-availability zone traffic

  2. Cross-Availability Zone or Cross-Region Traffic

    • Pods in different availability zones causing higher latency

    • Cross-region traffic increasing latency significantly

    • Public internet transit instead of private networking

  3. Node-Level Network Issues

    • Node network interface problems

    • Network driver issues

    • Hardware network limitations

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

  1. Check DDCS Pod Count and Scaling

    Verify there are sufficient DDCS pods scheduled to handle network load. Refer to the DDCS: Configure guide for scaling requirements.

    # Check current DDCS pod count
    kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
    
    # Calculate required DDCS pods
    # Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
    # Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps / 10 Gbps per node = ~9 pods required
    
    # Check network metrics per pod
    # Query: container_network_transmit_bytes_total{pod=~"ddcs-.*"}
    # Query: container_network_receive_bytes_total{pod=~"ddcs-.*"}
    
    Analysis:
    - Pod count below requirements indicates insufficient scaling.
    - High network utilization per pod suggests pods are network-bound.
    - Network metrics approaching interface limits indicate saturation.
    Resolution:
    - Scale DDCS StatefulSet to match calculated pod requirements.
    - Refer to DDCS: Configure for scaling guidance.
    - Update Helm values: helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
    - Monitor network metrics after scaling to verify improvement.
  2. Monitor Network Metrics

    Review network metrics to identify bandwidth saturation or bottlenecks.

    # Query DDCS bytes returned metric
    # ddcs_m_adapter_bytes_returned_total
    
    # Query container network metrics
    # container_network_receive_bytes_total{pod=~"ddcs-.*"}
    # container_network_transmit_bytes_total{pod=~"ddcs-.*"}
    
    # Calculate network utilization
    # Compare transmit/receive bytes against network interface capacity
    # Check if metrics are approaching interface limits
    
    Analysis:
    - A high rate of ddcs_m_adapter_bytes_returned_total relative to network capacity indicates a potential bottleneck.
    - Network transmit/receive rates approaching interface limits indicate saturation.
    - High utilization across nodes suggests network congestion.
    Resolution:
    - If network is saturated, scale DDCS pods to distribute load (see step 1).
    - Upgrade to VM SKUs with higher NIC speeds if bandwidth is insufficient.
    - Investigate network congestion from other workloads.
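
To turn two samples of a cumulative byte counter into a utilization figure, a minimal Python sketch (the sample values and the 10 Gbps NIC capacity are assumptions):

```python
def nic_utilization(bytes_t0: int, bytes_t1: int, interval_s: float,
                    nic_bps: float = 10e9) -> float:
    """Fraction of NIC capacity consumed over the sampling interval."""
    bits_per_sec = (bytes_t1 - bytes_t0) * 8 / interval_s
    return bits_per_sec / nic_bps

# e.g. 75 GB transmitted in 60 s on an assumed 10 Gbps NIC
print(nic_utilization(0, 75_000_000_000, 60))  # -> 1.0 (fully saturated)
```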

Other Diagnostic Actions#

  • Check pod placement: Verify DDCS and GPU pods are optimally placed:

    kubectl get pods -n ddcs -o wide
    kubectl get pods -n <workload-namespace> -o wide
    # Ensure pods are in same availability zone for optimal performance
    
  • Review network policies: Check if network policies are affecting performance:

    kubectl get networkpolicies -n ddcs
    kubectl describe networkpolicy <policy-name> -n ddcs
    
  • Monitor network trends: Track network performance over time:

    # Use cloud provider network metrics
    # Monitor throughput, latency, and error rates
    # Identify patterns and trends
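
For the pod placement check above, one option is a preferred pod affinity on the GPU workload so the scheduler co-locates it with DDCS pods by zone. A sketch, assuming the app.kubernetes.io/instance=ddcs label used by the kubectl selectors in this guide:

```yaml
# Illustrative podAffinity snippet for a GPU workload pod spec;
# label values are assumptions, not DDCS defaults.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: ddcs
          topologyKey: topology.kubernetes.io/zone
```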
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Network bandwidth thresholds: Alert when network utilization exceeds 80% of available bandwidth

  • DDCS pod count: Alert when DDCS pod count is below calculated requirements

  • Network saturation: Alert when the rate of container_network_transmit_bytes_total or container_network_receive_bytes_total approaches interface limits

  • High bytes returned: Alert when the rate of ddcs_m_adapter_bytes_returned_total indicates a potential network bottleneck
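
A Prometheus alerting rule for the 80% bandwidth threshold might look like the following sketch (the 10 Gbps NIC, i.e. 1.25e9 bytes/sec, is an assumption; substitute your actual interface speed):

```yaml
# Illustrative alerting rule; thresholds and labels are assumptions.
groups:
  - name: ddcs-network
    rules:
      - alert: DDCSNetworkSaturation
        expr: >
          rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])
          > 0.8 * 1.25e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DDCS pod {{ $labels.pod }} exceeds 80% of assumed NIC capacity"
```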