Troubleshooting Common Issues for NVIDIA Speech NIM Microservices#

This page covers troubleshooting common issues that apply to all NVIDIA Speech NIM microservices. For NIM-specific issues, see ASR, TTS, or NMT.

Container Startup Takes Longer Than 30 Minutes#

Symptom#

The NIM container takes longer than 30 minutes to become ready after docker run.

Cause#

On first startup, the container downloads the model from NGC. If the model includes RMIR files, the container runs riva-deploy to generate TensorRT engines, which adds significant time. After engine generation, the container extracts model archives into /data/models and starts the Riva gRPC server.

Solution#

Mount a local cache directory to avoid repeated downloads on subsequent runs.

export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
chmod 777 $LOCAL_NIM_CACHE

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

After the first run, the model and TensorRT engines load from the local cache.
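
Before a second run, it can be useful to confirm that the first run actually populated the cache. This is a sketch only; the exact file layout under the cache directory varies by NIM and is not specified here.

```shell
# Sketch: check whether the local NIM cache is populated.
# LOCAL_NIM_CACHE matches the variable exported above.
LOCAL_NIM_CACHE="${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
if [ -d "$LOCAL_NIM_CACHE" ] && [ -n "$(ls -A "$LOCAL_NIM_CACHE" 2>/dev/null)" ]; then
  # Report total cache size; a non-trivial size indicates the download completed.
  du -sh "$LOCAL_NIM_CACHE"
else
  echo "cache at $LOCAL_NIM_CACHE is empty; next run will download the model" >&2
fi
```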

NGC Authentication Failure#

Symptom#

The container fails to start with an error related to downloading models from NGC, such as 401 Unauthorized or NGC_API_KEY not set.

Solution#

  1. Verify that NGC_API_KEY is exported in your terminal:

    echo $NGC_API_KEY
    
  2. Ensure the key has “NGC Catalog” access. Generate a new key at NGC API Keys if needed.

  3. Log in to the NGC container registry:

    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
    
  4. Pass the key to the container:

    docker run ... -e NGC_API_KEY ...
    

GPU Out of Memory (OOM)#

Symptom#

The container crashes or fails to load the model with CUDA out-of-memory errors.

Cause#

TTS model profiles vary significantly in GPU memory requirements. For example, the Magpie TTS Multilingual model uses approximately 10.87 GB at batch_size=8 but 31.55 GB at batch_size=32. The Riva Translate 1.6b NMT model requires approximately 9.5 GB of GPU memory and 4.6 GB of CPU memory. In either case, other processes consuming GPU memory can push usage over the limit.

Solution#

  1. Check available GPU memory before deploying:

    nvidia-smi
    
  2. Ensure no other processes are using the target GPU. Free GPU memory or select a different device:

    docker run ... --gpus '"device=1"' ...
    
  3. TTS only – Select a model profile with lower memory requirements by adjusting NIM_TAGS_SELECTOR:

    export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,batch_size=8"
    
  4. TTS only – Use a GPU with at least 16 GB of VRAM. See the TTS support matrix for supported GPUs.

  5. NMT only – Use a GPU with at least 16 GB of VRAM. See the NMT support matrix for supported GPUs.
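
As a quick way to act on steps 1 and 2, the nvidia-smi query interface can report per-GPU free memory. The sketch below prints the index of the GPU with the most headroom; the query field names (index, memory.free) are standard nvidia-smi query fields.

```shell
# Sketch: print the index of the GPU with the most free memory (in MiB).
# csv,noheader,nounits yields lines like "0, 20480" for easy sorting.
nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits \
  | sort -t, -k2 -rn | head -1 | cut -d, -f1
```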

Health Check Returns 503#

Symptom#

Requests to /v1/health/ready return HTTP 503 with Server not ready, or /v1/health/live returns HTTP 503 with Server not live.

Cause#

The health endpoints check whether the underlying Triton Inference Server or Riva gRPC server is ready. During model loading, RMIR deployment, or TensorRT engine compilation, the server is not yet ready and returns 503.

Solution#

  1. Wait for the container to finish model loading. Monitor the container logs:

    docker logs -f $CONTAINER_ID
    

    Look for Riva gRPC Server is READY to confirm the server is ready.

  2. If the 503 persists after logs show readiness, verify the health endpoint:

    curl http://localhost:9000/v1/health/ready
    
  3. Ensure --shm-size=8GB is set in the docker run command.

  4. TTS only – If using SSL/TLS, verify the certificate configuration:

    • Valid server certificate and key.

    • For mTLS: NIM_SSL_CLIENT_CERT_PATH and NIM_SSL_CLIENT_KEY_PATH must be set.

    • Server SAN in the certificate or NIM_SSL_DOMAIN_NAME environment variable.
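
The wait in step 1 can be scripted instead of watched by hand. This sketch polls the readiness endpoint from step 2 until it returns HTTP 200; it assumes the 9000 port mapping used elsewhere on this page.

```shell
# Sketch: block until /v1/health/ready returns success, checking every 10 s.
# curl -f makes non-2xx responses (such as 503) exit non-zero.
until curl -sf http://localhost:9000/v1/health/ready > /dev/null; do
  echo "server not ready yet; retrying in 10s..."
  sleep 10
done
echo "server is ready"
```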

Port 8000 Conflict#

Symptom#

The container fails to start or returns connection errors when NIM_HTTP_API_PORT is set to 8000.

Cause#

Port 8000 is reserved internally for the Triton Inference Server HTTP endpoint. Setting the NIM HTTP API to the same port causes a conflict. The Triton gRPC port defaults to 8001 (NIM_GRPC_TRITON_PORT).

Solution#

Use any port other than 8000 or 8001 for NIM_HTTP_API_PORT. The default is 9000.

docker run ... -e NIM_HTTP_API_PORT=9000 -p 9000:9000 ...
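
Before publishing the port, it can help to confirm nothing else on the host is already listening on it. This is a sketch using ss to inspect listening TCP sockets; adjust PORT to match your NIM_HTTP_API_PORT.

```shell
# Sketch: check whether the chosen host port is free before docker run.
PORT=9000
# ss -ltn lists listening TCP sockets; column 4 is the local address:port.
if ss -ltn | awk '{print $4}' | grep -q ":${PORT}\$"; then
  echo "port ${PORT} is already in use; choose another" >&2
else
  echo "port ${PORT} is free"
fi
```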

Pod Stuck in Pending State#

Symptom#

The NIM pod remains in Pending state and never transitions to Running.

Cause#

The Kubernetes scheduler cannot place the pod on any node. Common reasons include:

  • Node taints that the pod does not tolerate.

  • Insufficient GPU resources on available nodes.

  • Storage volume mount failures (for example, PVC not bound).

Solution#

  1. Inspect the pod events to identify the scheduling failure:

    kubectl describe pod <pod-name>
    

    Check the Events section at the bottom of the output.

  2. If the failure is a taint-related issue, verify that the Helm chart includes the required tolerations. The default values.yaml tolerates nvidia.com/gpu:

    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    

    Add additional tolerations if your cluster uses custom taints.

  3. If the failure is insufficient GPU resources, confirm that GPU nodes are available and not fully allocated:

    kubectl describe nodes | grep -A5 "Allocated resources"
    
  4. If the failure is a PVC mount issue, check that the PVC is bound:

    kubectl get pvc
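
To spot an unbound PVC quickly across all namespaces, the output can be filtered on the STATUS column. This is a sketch; it assumes the default column layout of kubectl get pvc -A, where column 3 is STATUS.

```shell
# Sketch: list only PVCs whose STATUS is anything other than Bound.
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'
```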
    

Pods Fail to Start After Scaling or Upgrading Without StatefulSet#

Symptom#

After scaling replicas or upgrading the Helm release with statefulSet.enabled: false and persistence.enabled: true, new pods remain in Pending or ContainerCreating state.

Cause#

With statefulSet.enabled: false, the Helm chart creates a Deployment instead of a StatefulSet. A Deployment does not support per-replica PVC templates, so all replicas attempt to mount the same PVC. If the PVC uses ReadWriteOnce (the default accessMode), only one pod can mount it at a time. Additional pods block waiting for the volume.

Solution#

Choose one of the following approaches.

  • Use ReadWriteMany storage (recommended). Set the accessMode to ReadWriteMany so multiple pods can mount the same PVC simultaneously:

    persistence:
      enabled: true
      accessMode: ReadWriteMany
    

    This requires a storage class that supports ReadWriteMany, such as an NFS-based provisioner or CephFS.

  • Enable StatefulSet mode. Each replica gets its own PVC through PVC templates:

    statefulSet:
      enabled: true
    persistence:
      enabled: true
    
  • Use direct NFS mount. Mount the model cache from an NFS server directly, bypassing PVC:

    nfs:
      enabled: true
      server: nfs-server.example.com
      path: /exports
    
  • Disable persistence. Each pod downloads the model to an ephemeral emptyDir volume on startup. This is the simplest option but requires every pod to re-download the model:

    persistence:
      enabled: false