Troubleshooting NeMo Microservices Deployment on Kubernetes#

Use this documentation to troubleshoot issues that can arise while you deploy and run the NeMo microservices on Kubernetes.

Network Issues#

Azure Default Egress Load Balancer Limitations#

Problem: On Azure Kubernetes Service (AKS), model downloads may fail when using the default egress load balancer configuration.

Symptoms: You might see connection errors in the logs like:

Exception: RuntimeError: Download Error: RequestError(reqwest::Error { kind: Request, url: "https://xlfiles.ngc.nvidia.com/org/orgs/nvidia/teams/nemo/models/llama-3_1-70b-instruct/versions/0.0.1/files/weights/__0_0.distcp", source: hyper_util::client::legacy::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) })

Root Cause: The default Azure egress load balancer can be overwhelmed by the large number of concurrent outbound connections needed to download large model files, images, and configurations. This can lead to:

  • SNAT port exhaustion

  • Connection timeouts (default 4-minute timeout may be insufficient)

  • Connection refused errors

Solution: Configure an Azure NAT Gateway for egress traffic instead of using the default load balancer.

For detailed instructions on migrating from the default outbound access to a NAT gateway, refer to the Azure NAT Gateway migration tutorial.
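
For a cluster that uses an AKS-managed virtual network, the switch can be sketched with the Azure CLI as shown here. This is a minimal sketch, not the tutorial's exact procedure: the function name, the resource group and cluster names, the outbound IP count, and the idle timeout are placeholder assumptions, and you should check the flags against your installed Azure CLI version before running it.

```shell
# Sketch: move an AKS cluster's egress from the default load balancer
# to a managed NAT gateway. All argument values are placeholders.
aks_use_nat_gateway() {
  local resource_group="$1"
  local cluster_name="$2"
  local outbound_ips="${3:-2}"       # managed outbound public IPs (assumed value)
  local idle_timeout_min="${4:-30}"  # idle timeout in minutes (assumed value)

  az aks update \
    --resource-group "$resource_group" \
    --name "$cluster_name" \
    --outbound-type managedNATGateway \
    --nat-gateway-managed-outbound-ip-count "$outbound_ips" \
    --nat-gateway-idle-timeout "$idle_timeout_min"
}

# Example call with hypothetical names:
# aks_use_nat_gateway my-resource-group my-aks-cluster 2 30
```

Clusters that use a custom virtual network typically need a NAT gateway that you create and attach to the subnet yourself, so follow the migration tutorial for that case.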

Considerations:

  • Ensure each node has enough ephemeral storage for model downloads (some models require 350 GB or more)

  • Consider increasing NAT Gateway idle timeout settings for large file downloads

  • Monitor SNAT port allocation if you experience intermittent failures
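
To check the ephemeral storage consideration above, one option is to list what each node reports as allocatable. The function name below is a hypothetical helper, and the output depends on how your node pools are provisioned.

```shell
# Sketch: print each node's allocatable ephemeral storage so it can be
# compared with the size of the model you plan to download.
node_ephemeral_storage() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,EPHEMERAL_STORAGE:.status.allocatable.ephemeral-storage'
}

# Example call:
# node_ephemeral_storage
```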

General Debugging#

If you need to inspect errors in your cluster, run kubectl events to list recent events in the current namespace, or add -n <namespace> to target a different one. You can further debug individual pods by following the Debug Running Pods guide in the Kubernetes documentation.
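
A first debugging pass might look like the following sketch. The function names are hypothetical helpers, and the namespace and pod names in the example calls are placeholders for your own deployment.

```shell
# Sketch: show only Warning events in a namespace, which usually surface
# scheduling, image pull, and probe failures first.
show_warning_events() {
  local namespace="$1"
  kubectl events --namespace "$namespace" --types=Warning
}

# Sketch: describe one pod and print the tail of its logs.
# Add --previous to the logs command to read a crashed container's last run.
inspect_pod() {
  local namespace="$1"
  local pod="$2"
  kubectl describe pod "$pod" --namespace "$namespace"
  kubectl logs "$pod" --namespace "$namespace" --tail=50
}

# Example calls with hypothetical names:
# show_warning_events nemo
# inspect_pod nemo my-nemo-pod
```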

Authentication Issues#

Cluster Permission Issues#

If kubectl commands fail due to permission issues, you might see the following error:

$ kubectl get pods
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:default:default" cannot list resource "pods" in API group "" in the namespace "default"

This error occurs when the identity that kubectl authenticates as (in this example, the default service account in the default namespace) doesn’t have permission to access the resource.

To resolve this, you can either:

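  1. Grant the necessary permissions to the service account, for example by binding it to a Role or ClusterRole that allows the operation.

  2. Run the kubectl command with the --as flag to impersonate a user or service account that has the required permissions. Your own identity must be allowed to impersonate that account.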
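
The two options can be sketched with standard kubectl commands as follows. The function names and the pod-reader role name are hypothetical, and the namespace and service account in the example calls mirror the error message above. Run these with an identity that is allowed to create RBAC objects and to impersonate the service account.

```shell
# Sketch: ask the API server whether a service account may list pods.
# Uses --as to impersonate the service account; prints "yes" or "no".
can_list_pods() {
  local namespace="$1"
  local service_account="$2"
  kubectl auth can-i list pods \
    --namespace "$namespace" \
    --as "system:serviceaccount:${namespace}:${service_account}"
}

# Sketch: grant read-only access to pods in one namespace.
# "pod-reader" is an illustrative role name, not one the product requires.
grant_pod_reader() {
  local namespace="$1"
  local service_account="$2"
  kubectl create role pod-reader \
    --verb=get,list,watch --resource=pods \
    --namespace "$namespace"
  kubectl create rolebinding "${service_account}-pod-reader" \
    --role=pod-reader \
    --serviceaccount="${namespace}:${service_account}" \
    --namespace "$namespace"
}

# Example calls:
# can_list_pods default default
# grant_pod_reader default default
```

Keeping the role scoped to a single namespace and to read-only verbs limits what the service account can do if its token is ever exposed.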