UCC: Network Bandwidth Saturation#

Overview#

UCC serves cached USD assets to render workers over the network. Each GPU node downloads textures, geometry, materials, and derived data from UCC during scene loads. When the aggregate network bandwidth between UCC and the GPU nodes is insufficient, downloads slow down, scene load times increase, and simulations may time out.

Recommended network bandwidth: 10 Gbps minimum per UCC node, 25+ Gbps preferred for workloads with large or complex scenes.

UCC replicas operate independently and do not share cached data. Each replica maintains its own copy of content. The load balancer distributes client requests across replicas, so aggregate network bandwidth scales with replica count.

For example, with 5 UCC replicas on EC2 instances with 10 Gbps NICs:

Aggregate bandwidth = 5 replicas * 10 Gbps = 50 Gbps

If your workload requires higher aggregate bandwidth (e.g., 66 GPUs with high scene complexity), you must scale the number of replicas to distribute network load.
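The sizing arithmetic can be sketched as a quick shell check. The replica count, per-NIC bandwidth, and required bandwidth below are illustrative values, not recommendations:

```shell
# Estimate aggregate UCC bandwidth and the replica count needed
# to cover a target. All inputs are example values.
REPLICAS=5
NIC_GBPS=10          # per-instance NIC bandwidth
REQUIRED_GBPS=120    # estimated peak demand from the GPU fleet

AGGREGATE=$((REPLICAS * NIC_GBPS))
# Ceiling division: replicas needed to meet the required bandwidth
NEEDED=$(( (REQUIRED_GBPS + NIC_GBPS - 1) / NIC_GBPS ))

echo "aggregate=${AGGREGATE} Gbps"
echo "replicas_needed=${NEEDED}"
```

With these example inputs, 5 replicas provide 50 Gbps aggregate, short of the 120 Gbps target, so the check reports 12 replicas needed.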

Network bandwidth saturation occurs when:

  • Too few UCC replicas for the GPU fleet size and scene complexity

  • EC2 instance type has insufficient network bandwidth (underpowered NICs)

  • Load balancer distributes traffic unevenly across replicas

  • UCC pods scheduled across multiple availability zones (cross-AZ bandwidth penalties)

When network bandwidth is saturated, downloads slow down (throughput drops), scene load times increase significantly, and render workers may report slow asset retrieval.

Note: UCC is not designed to scale dynamically. Replica count must be pre-sized based on expected workload.

Symptoms and Detection Signals#

Visible Symptoms#

  • Slow asset downloads during cache HITs - Downloads slow despite content being cached (disk is not the bottleneck)

  • Scene load time increases - Scenes taking longer than expected even with warm cache

  • High network throughput on UCC pods - Network TX approaching instance NIC limits

Metric Signals#

  • Per-pod transmit bandwidth:

    rate(
      container_network_transmit_bytes_total{
          pod=~"usd-content-cache-.*",
          namespace="ucc"
      }[5m]
    )

    Network transmit bandwidth per UCC pod. Values approaching the EC2 instance NIC limit indicate saturation. Compare against the instance type's network performance specs.

  • Aggregate transmit bandwidth:

    sum(
      rate(
        container_network_transmit_bytes_total{
            pod=~"usd-content-cache-.*"
        }[5m]
      )
    )

    Aggregate network transmit bandwidth across all UCC replicas. Compare against workload requirements to determine if scaling is needed.
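To interpret the per-pod numbers, the bytes/s value returned by the rate() query can be converted to Gbps and compared against the NIC spec. A minimal sketch, with illustrative values rather than measurements:

```shell
# Convert a per-pod TX rate (bytes/s) to Gbps and flag saturation
# at 80% of the instance NIC limit. Example inputs only.
TX_BYTES_PER_SEC=1050000000   # ~1.05 GB/s observed on one pod
NIC_GBPS=10                   # instance network performance spec

awk -v tx="$TX_BYTES_PER_SEC" -v nic="$NIC_GBPS" 'BEGIN {
  gbps = tx * 8 / 1e9                 # bytes/s -> Gbps
  printf "tx = %.2f Gbps (%.0f%% of NIC)\n", gbps, 100 * gbps / nic
  if (gbps > 0.8 * nic) print "WARNING: above 80% of NIC capacity"
}'
```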

Root Cause Analysis#

Known Causes#

Network bandwidth saturation is typically caused by too few UCC replicas for the GPU fleet size, or uneven load distribution across replicas.

Too Few UCC Replicas#

If UCC replica count is too low, network bandwidth load cannot be distributed effectively. Each replica’s NIC reaches capacity, becoming a bottleneck.

Sizing guidance:

  • Minimum 10 Gbps per UCC node

  • Preferred 25+ Gbps for complex scenes

  • Scale replicas to match GPU fleet size and scene complexity

Check current replica count:

# Check UCC StatefulSet replicas
kubectl get statefulset -n ucc

# Check per-pod CPU/memory (kubectl top does not report network
# bandwidth; use the Prometheus queries in Metric Signals for that)
kubectl top pods -n ucc

Uneven Load Distribution#

If the load balancer does not distribute traffic evenly across replicas, some pods become network-bound while others remain underutilized.

Check load distribution:

# Compare network bandwidth across pods; usage should be relatively
# even across all replicas. Per-pod TX query:
# rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*"}[5m])
Other Possible Causes#

  1. EC2 Instance Type Bandwidth Insufficient - Smaller instance types (e.g., m5.2xlarge) have lower network performance than larger types (e.g., m5.8xlarge, m5.16xlarge)

  2. Cross-AZ Traffic - UCC and GPU pods in different availability zones incur latency and bandwidth penalties

  3. Network Congestion - Other workloads on same nodes competing for bandwidth

Troubleshooting Steps#

Diagnostic Steps#

  1. Confirm Network is the Bottleneck (Not Disk or Cache)

    Verify network is saturated during cache HIT scenarios (when disk is not the bottleneck).

    # Check network bandwidth during cache HITs
    # Query Prometheus during warm cache period:
    # rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*"}[5m])
    
    # Compare against disk read I/O during the same period
    # (cache HITs are served from disk, so reads are the relevant signal)
    # rate(container_fs_reads_bytes_total{pod=~"usd-content-cache-.*"}[5m])
    
    # If network TX is high but disk reads are low, network is the bottleneck
    # If both are high, disk may be the bottleneck (see Data Disk Bandwidth runbook)
    
    # Check EC2 instance network performance for your instance type
    # aws ec2 describe-instance-types --instance-types m5.4xlarge \
    #   --query 'InstanceTypes[*].NetworkInfo.NetworkPerformance'
    
    Analysis:
    - High network TX with low disk I/O during cache HITs confirms network bottleneck.
    - If disk I/O is also high, address disk first (see Data Disk Bandwidth runbook).
    Resolution:
    - If network is bottleneck, proceed to scale replicas (step 2).
  2. Scale UCC Replicas to Distribute Network Load

    Increase replica count to distribute network bandwidth across more pods/nodes.

    # Calculate current aggregate bandwidth
    # replicas * per_instance_bandwidth
    # Example: 5 replicas * 10 Gbps = 50 Gbps
    
    # Determine if scaling is needed based on workload
    # For large GPU fleets or complex scenes, scale to 10-20 replicas
    
    # Update Helm values: replicaCount
    helm get values <ucc-release-name> -n ucc -o yaml > values.yaml
    # Edit: replicaCount: 10  # Increased from 5
    
    # Apply updated replica count
    helm upgrade <ucc-release-name> <chart-path> -n ucc -f values.yaml
    
    # Verify new pods are running
    kubectl get pods -n ucc
    
    # Monitor per-pod CPU/memory (network bandwidth distribution
    # requires the per-pod TX query from the Metric Signals section)
    kubectl top pods -n ucc --sort-by=cpu
    
    Analysis:
    - More replicas distribute network load, reducing per-pod bandwidth.
    - Ensure sufficient EC2 nodes available to schedule new UCC pods.
    Resolution:
    - Scale to match workload needs (start with 2x current if saturated).
    - Monitor per-pod bandwidth to verify distribution.
    - Verify load balancer distributes traffic evenly.
  3. Verify Load Distribution Across Replicas

    Check if traffic is evenly distributed across all UCC replicas.

    # Check request count per pod (port-forward each pod in turn;
    # forwarding the Service reaches only one backend pod)
    kubectl port-forward -n ucc pod/<ucc-pod-name> 9145:9145
    curl http://localhost:9145/metrics | grep "nginx_http_requests_total"
    
    # Compare network bandwidth per pod
    # Query: rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*"}[5m])
    
    # Pods should have similar bandwidth usage (within 20-30%)
    # If one pod has 2x bandwidth of others, load is uneven
    
    # Check Service sessionAffinity
    kubectl get svc -n ucc <ucc-service> -o yaml | grep "sessionAffinity"
    
    Analysis:
    - Uneven distribution indicates load balancer or session affinity issues.
    - All replicas have same content; traffic should distribute evenly.
    Resolution:
    - If uneven, review load balancer configuration.
    - Disable session affinity if it’s causing concentration.
    - Verify Service selector matches all UCC pods.
  4. Check Pod Placement in Same Availability Zone

    Verify UCC and GPU pods are co-located in the same AZ to avoid cross-AZ penalties.

    # Check node AZ distribution
    kubectl get nodes -L topology.kubernetes.io/zone
    
    # Check UCC pod placement
    kubectl get pods -n ucc -o wide
    
    # Check GPU/workload pod placement
    kubectl get pods -n <workload-namespace> -o wide
    
    # Verify UCC and GPU pods are in same AZ
    # Cross-AZ traffic has latency and may have lower effective bandwidth
    
    Resolution:
    - For optimal performance, deploy UCC and GPU workloads in same AZ.
    - Configure pod affinity if needed to co-locate UCC with GPU nodes.
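The co-location in step 4 can be expressed as a pod affinity stanza in the UCC pod template. A sketch along these lines, where the `app: gpu-worker` label and the weight are assumptions to adjust for your GPU workload's actual labels:

```yaml
# Sketch: prefer scheduling UCC pods in the same zone as GPU
# workload pods. The gpu-worker label is a placeholder.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: gpu-worker
          topologyKey: topology.kubernetes.io/zone
```

Preferred (soft) affinity is used here so that UCC pods can still schedule when no zone-matched node has capacity; requiredDuringScheduling would block scheduling instead.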

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Per-pod network bandwidth thresholds: Alert when TX bandwidth per pod exceeds 80% of instance NIC capacity

  • Aggregate bandwidth shortfall: Alert when total UCC bandwidth appears insufficient for workload

  • Uneven load distribution: Alert when one pod uses >50% more bandwidth than average
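The alerts above can be sketched as Prometheus alerting rules. The thresholds and the 10 Gbps NIC assumption (1.25e9 bytes/s) are illustrative; substitute your instance type's network performance spec:

```yaml
groups:
  - name: ucc-network
    rules:
      # Per-pod TX above 80% of an assumed 10 Gbps NIC (1.25e9 B/s)
      - alert: UCCPodNetworkSaturation
        expr: >
          rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}[5m])
            > 0.8 * 1.25e9
        for: 10m
        labels:
          severity: warning
      # One pod using >50% more bandwidth than the replica average
      - alert: UCCUnevenLoadDistribution
        expr: >
          max(rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}[5m]))
            > 1.5 * avg(rate(container_network_transmit_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}[5m]))
        for: 15m
        labels:
          severity: warning
```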

Configuration Best Practices#

  • Pre-size replicas for workload: Provision 10-25+ Gbps per UCC node and scale replica count to match GPU count and scene complexity

  • Use consistent EC2 instance types: Deploy UCC on nodes with adequate NIC bandwidth (m5.4xlarge or larger)

  • Deploy in single AZ: Co-locate UCC and GPU pods in same availability zone

  • Test with representative workload: Validate network capacity with peak GPU count before production