DDCS: Network Bandwidth or Latency Bottlenecks#
Overview#
When DDCS (Derived Data Cache Service) experiences network bandwidth or latency bottlenecks, cache population slows, throughput between DDCS and GPU nodes degrades, and scene load times increase.
DDCS serves derived data to render workers over the network and requires high bandwidth and low latency for efficient data transfer. It is generally recommended to provision 3.3 Gbps of network bandwidth for each GPU in a cluster, with a minimum of 1 Gbps per GPU.
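The bandwidth rule of thumb above can be sketched as a small calculation. This is a hypothetical helper for illustration only; the function names and defaults are assumptions, not part of DDCS.

```python
# Hypothetical sizing helper for the rule of thumb above:
# 3.3 Gbps per GPU recommended, 1 Gbps per GPU minimum.

def cluster_bandwidth_gbps(gpu_count: int, per_gpu_gbps: float = 3.3) -> float:
    """Recommended aggregate DDCS network bandwidth for a cluster."""
    return gpu_count * per_gpu_gbps

def minimum_bandwidth_gbps(gpu_count: int) -> float:
    """Absolute minimum bandwidth (1 Gbps per GPU)."""
    return gpu_count * 1.0

# Example: a 25-GPU cluster
print(round(cluster_bandwidth_gbps(25), 1))  # 82.5
print(minimum_bandwidth_gbps(25))            # 25.0
```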
Network bandwidth or latency bottlenecks can occur when:
- Insufficient DDCS pods scheduled to handle network load
- Insufficient network bandwidth between DDCS and GPU nodes
- Network congestion from noisy neighbors in shared environments
- Misconfigured Kubernetes networking causing suboptimal routing
- Cross-availability zone or cross-data center traffic
When network bottlenecks occur, DDCS cannot efficiently serve data to render workers, causing slow cache population, degraded throughput, and increased scene load times.
Symptoms and Detection Signals#
Visible Symptoms#
- Slow cache population - Cache taking longer than expected to populate with derived data
- Degraded throughput - Reduced data transfer rates between DDCS and GPU nodes
- Increased scene load times - Scene loads taking significantly longer than expected
Metric Signals#
The following Prometheus metrics can be used to detect network bandwidth or latency bottlenecks. Review these metrics to determine if networking is a problem.
| Metric | Description |
|---|---|
| `ddcs_m_adapter_bytes_returned_total{}` | Total bytes returned from all levels (cache or disk). High values relative to network capacity may indicate network bottlenecks. Compare against network transmit metrics to identify if DDCS is saturating available bandwidth. |
| `container_network_receive_bytes_total{pod=~"ddcs-.*"}` | Total bytes received by DDCS pods. High values approaching network interface limits indicate network saturation. Compare against network interface capacity to identify bottlenecks. |
| `container_network_transmit_bytes_total{pod=~"ddcs-.*"}` | Total bytes transmitted by DDCS pods. High values approaching network interface limits indicate network saturation. This metric combined with receive bytes helps identify if DDCS pods are network-bound. |
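The metrics above are monotonic counters, so saturation is judged from their per-second rate (e.g. PromQL `rate(metric[5m])`) compared against NIC capacity. The sketch below illustrates that calculation; the sample values and helper names are assumptions, not real measurements.

```python
# Hypothetical sketch: approximate a counter rate from two samples and compare
# it against NIC capacity. Sample values are illustrative only.

def bytes_per_second(sample_start: float, sample_end: float,
                     interval_seconds: float) -> float:
    """Approximate rate() over one window from two monotonic counter samples."""
    return (sample_end - sample_start) / interval_seconds

def utilization(rate_bytes_per_sec: float, nic_gbps: float) -> float:
    """Fraction of NIC capacity consumed (NIC speed given in gigabits/sec)."""
    nic_bytes_per_sec = nic_gbps * 1e9 / 8  # gigabits -> bytes
    return rate_bytes_per_sec / nic_bytes_per_sec

# Two container_network_transmit_bytes_total samples 60 s apart, 10 Gbps NIC
rate = bytes_per_second(1.00e12, 1.06e12, 60)  # 1e9 bytes/sec
print(round(utilization(rate, 10), 2))  # 0.8 -> 80% of a 10 Gbps NIC
```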
Root Cause Analysis#
Known Causes#
Network bandwidth or latency bottlenecks in DDCS are typically caused by insufficient DDCS pods scheduled, insufficient network bandwidth, or network congestion.
Insufficient DDCS Pods Scheduled#
DDCS must be configured with a replica count equal to the number of compute nodes required to meet the cluster's network bandwidth needs. The rule of thumb is to provision at least 3.3 Gbps of bandwidth per GPU. If too few DDCS pods are scheduled, the available pods can become network-bound, causing bottlenecks. Refer to the DDCS: Configure guide for scaling guidance.
Check DDCS pod count:

```shell
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl describe statefulset -n ddcs <ddcs-statefulset-name>

# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 Gbps / 10 Gbps per node = 8.25, rounded up to 9 pods
```
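The sizing formula in the comments above can be expressed as a short function. This is a hypothetical helper, not part of DDCS; the 3.3 Gbps default comes from the rule of thumb in this guide.

```python
import math

# Hypothetical helper implementing the sizing formula above:
# required pods = ceil((GPU count * 3.3 Gbps) / per-node bandwidth)

def required_ddcs_pods(gpu_count: int, node_bandwidth_gbps: float,
                       per_gpu_gbps: float = 3.3) -> int:
    """Number of DDCS pods needed so aggregate NIC bandwidth covers GPU demand."""
    total_gbps = gpu_count * per_gpu_gbps
    return math.ceil(total_gbps / node_bandwidth_gbps)

# Example from this guide: 25 GPUs on 10 Gbps nodes -> 82.5 / 10 = 8.25 -> 9 pods
print(required_ddcs_pods(25, 10))  # 9
```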
Insufficient Network Bandwidth#
Network bandwidth may be insufficient for the workload. DDCS requires high bandwidth to serve derived data efficiently. It is recommended to provision 3.3 Gbps per GPU, with a minimum of 1 Gbps.
Network Congestion#
In shared network environments, other workloads may consume bandwidth, causing congestion and degraded performance for DDCS traffic.
Other Possible Causes#
Misconfigured Kubernetes Networking

- Suboptimal routing causing increased latency
- Network policies limiting throughput
- Cross-availability zone traffic

Cross-Availability Zone or Cross-Region Traffic

- Pods in different availability zones causing higher latency
- Cross-region traffic increasing latency significantly
- Public internet transit instead of private networking

Node-Level Network Issues

- Node network interface problems
- Network driver issues
- Hardware network limitations
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
Check DDCS Pod Count and Scaling
Verify there are sufficient DDCS pods scheduled to handle network load. Refer to the DDCS: Configure guide for scaling requirements.
```shell
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 Gbps / 10 Gbps per node = 8.25, rounded up to 9 pods

# Check network metrics per pod
# Query: container_network_transmit_bytes_total{pod=~"ddcs-.*"}
# Query: container_network_receive_bytes_total{pod=~"ddcs-.*"}
```

Analysis:

- Pod count below requirements indicates insufficient scaling.
- High network utilization per pod suggests pods are network-bound.
- Network metrics approaching interface limits indicate saturation.

Resolution:

- Scale the DDCS StatefulSet to match calculated pod requirements.
- Refer to DDCS: Configure for scaling guidance.
- Update Helm values:

```shell
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
```

- Monitor network metrics after scaling to verify improvement.

Monitor Network Metrics
Review network metrics to identify bandwidth saturation or bottlenecks.
```shell
# Query DDCS bytes returned metric
# ddcs_m_adapter_bytes_returned_total

# Query container network metrics
# container_network_receive_bytes_total{pod=~"ddcs-.*"}
# container_network_transmit_bytes_total{pod=~"ddcs-.*"}

# Calculate network utilization
# Compare transmit/receive bytes against network interface capacity
# Check if metrics are approaching interface limits
```

Analysis:

- High `ddcs_m_adapter_bytes_returned_total` relative to network capacity indicates potential bottlenecks.
- Network transmit/receive bytes approaching interface limits indicate saturation.
- High utilization across nodes suggests network congestion.

Resolution:

- If the network is saturated, scale DDCS pods to distribute load (see step 1).
- Upgrade to VM SKUs with higher NIC speeds if bandwidth is insufficient.
- Investigate network congestion from other workloads.
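The analysis step above amounts to flagging pods whose sustained transmit rate approaches NIC limits. The sketch below illustrates one way to make that decision; the pod names, rates, and 80% threshold are illustrative assumptions, not values from a real cluster.

```python
# Hypothetical sketch: flag DDCS pods whose transmit rate approaches NIC limits.
# Rates would come from rate(container_network_transmit_bytes_total[5m]).

NIC_BYTES_PER_SEC = 10e9 / 8   # 10 Gbps NIC expressed in bytes/sec (assumed)
SATURATION_THRESHOLD = 0.8     # treat >80% sustained utilization as saturated

def saturated_pods(transmit_rates: dict) -> list:
    """Return pods whose transmit rate exceeds the saturation threshold."""
    return [pod for pod, rate in transmit_rates.items()
            if rate / NIC_BYTES_PER_SEC > SATURATION_THRESHOLD]

# Illustrative per-pod transmit rates in bytes/sec
rates = {
    "ddcs-0": 1.10e9,  # ~88% of NIC -> network-bound
    "ddcs-1": 0.40e9,  # ~32%
    "ddcs-2": 1.05e9,  # ~84% -> network-bound
}
print(saturated_pods(rates))  # ['ddcs-0', 'ddcs-2']
```

If most pods appear in the saturated list, scaling out the StatefulSet (step 1) spreads the same aggregate traffic across more NICs.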
Other Diagnostic Actions#
Check pod placement: Verify DDCS and GPU pods are optimally placed:
```shell
kubectl get pods -n ddcs -o wide
kubectl get pods -n <workload-namespace> -o wide
# Ensure pods are in the same availability zone for optimal performance
```
Review network policies: Check if network policies are affecting performance:
```shell
kubectl get networkpolicies -n ddcs
kubectl describe networkpolicy <policy-name> -n ddcs
```
Monitor network trends: Track network performance over time:
```shell
# Use cloud provider network metrics
# Monitor throughput, latency, and error rates
# Identify patterns and trends
```
Prevention#
Proactive Monitoring#
Set up alerts for:
- Network bandwidth thresholds: Alert when network utilization exceeds 80% of available bandwidth
- DDCS pod count: Alert when DDCS pod count is below calculated requirements
- Network saturation: Alert when `container_network_transmit_bytes_total` or `container_network_receive_bytes_total` approach interface limits
- High bytes returned: Alert when `ddcs_m_adapter_bytes_returned_total` indicates potential network bottlenecks
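The 80% bandwidth threshold above translates into a concrete rate limit for an alert on the transmit counter. The sketch below shows that conversion; the function name and NIC speed are assumptions for illustration.

```python
# Hypothetical sketch: convert a NIC speed into the bytes/sec threshold that an
# alert on rate(container_network_transmit_bytes_total[5m]) would compare against.

def alert_threshold_bytes_per_sec(nic_gbps: float, fraction: float = 0.8) -> float:
    """Bytes/sec rate above which the 80% bandwidth alert should fire."""
    return nic_gbps * 1e9 / 8 * fraction  # gigabits -> bytes, then apply fraction

# Example: 10 Gbps NIC -> 1.25e9 bytes/sec capacity, alert above 80% of that
print(alert_threshold_bytes_per_sec(10))  # 1000000000.0
```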