Troubleshooting Common Issues for NVIDIA Speech NIM Microservices#
This page covers troubleshooting common issues that apply to all NVIDIA Speech NIM microservices. For NIM-specific issues, see ASR, TTS, or NMT.
Container Startup Takes Longer Than 30 Minutes#
Symptom#
The NIM container takes longer than 30 minutes to become ready after docker run.
Cause#
On first startup, the container downloads the model from NGC. If the model includes RMIR files, the container runs riva-deploy to generate TensorRT engines, which adds significant time. After engine generation, the container extracts model archives into /data/models and starts the Riva gRPC server.
Solution#
Mount a local cache directory to avoid repeated downloads on subsequent runs.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
chmod 777 $LOCAL_NIM_CACHE
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
After the first run, the model and TensorRT engines load from the local cache.
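To confirm that the cache is actually being reused, you can check that the cache directory is populated before a subsequent run. A minimal sketch; the helper name check_nim_cache is illustrative and not part of the NIM tooling:

```shell
# Illustrative helper (not part of NIM): verify the local cache directory
# exists, is writable, and already contains files from a previous run.
check_nim_cache() {
    cache_dir="$1"
    [ -d "$cache_dir" ] || { echo "cache dir missing: $cache_dir"; return 1; }
    [ -w "$cache_dir" ] || { echo "cache dir not writable: $cache_dir"; return 1; }
    if [ -n "$(ls -A "$cache_dir" 2>/dev/null)" ]; then
        # Non-empty: the model and engines should load from here.
        echo "cache populated: $(du -sh "$cache_dir" | cut -f1)"
    else
        echo "cache empty: first run will download the model"
    fi
}
```

For example, check_nim_cache "$LOCAL_NIM_CACHE" before docker run tells you whether to expect a long first-start or a fast cached start.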
NGC Authentication Failure#
Symptom#
The container fails to start with an error related to downloading models from NGC, such as 401 Unauthorized or NGC_API_KEY not set.
Solution#
Verify that NGC_API_KEY is exported in your terminal:
echo $NGC_API_KEY
Ensure the key has “NGC Catalog” access. Generate a new key at NGC API Keys if needed.
Log in to the NGC container registry:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Pass the key to the container:
docker run ... -e NGC_API_KEY ...
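A small preflight check can catch a missing key before the container starts and fails. This sketch is illustrative (require_ngc_key is not part of the NIM tooling); it only verifies that the variable is set, not that the key is valid:

```shell
# Illustrative preflight check (not part of NIM): fail fast if
# NGC_API_KEY is unset or empty before running docker run.
require_ngc_key() {
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "ERROR: NGC_API_KEY is not set; export it before docker run" >&2
        return 1
    fi
    # Report only the length, never the key itself.
    echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)"
}
```

Run require_ngc_key at the top of a launch script so a missing key stops the script instead of producing a 401 inside the container.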
GPU Out of Memory (OOM)#
Symptom#
The container crashes or fails to load the model with CUDA out-of-memory errors.
Cause#
TTS model profiles vary significantly in GPU memory requirements. For example, the Magpie TTS Multilingual model uses approximately 10.87 GB at batch_size=8 but 31.55 GB at batch_size=32. The Riva Translate 1.6b NMT model requires approximately 9.5 GB of GPU memory and 4.6 GB of CPU memory. In either case, other processes consuming GPU memory can push usage over the limit.
Solution#
Check available GPU memory before deploying:
nvidia-smi
Ensure no other processes are using the target GPU. Free GPU memory or select a different device:
docker run ... --gpus '"device=1"' ...
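When a machine has several GPUs, you can script the device selection instead of reading nvidia-smi output by hand. A sketch, assuming the machine-readable query flags of nvidia-smi (--query-gpu and --format=csv are standard nvidia-smi options); the helper name pick_freest_gpu is illustrative:

```shell
# Illustrative helper: given "index, memory.free" CSV lines, as produced by
#   nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
# print the index of the GPU with the most free memory (in MiB).
pick_freest_gpu() {
    sort -t, -k2 -rn | head -n1 | cut -d, -f1
}
```

Usage: gpu=$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits | pick_freest_gpu), then pass --gpus "\"device=$gpu\"" to docker run.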
Select a model profile with lower memory requirements by adjusting NIM_TAGS_SELECTOR:
export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,batch_size=8"
Use a GPU with at least 16 GB of VRAM. See the TTS or NMT support matrix for supported GPUs.
Health Check Returns 503#
Symptom#
Requests to /v1/health/ready return HTTP 503 with Server not ready, or /v1/health/live returns HTTP 503 with Server not live.
Cause#
The health endpoints check whether the underlying Triton Inference Server or Riva gRPC server is ready. During model loading, RMIR deployment, or TensorRT engine compilation, the server is not yet ready and returns 503.
Solution#
Wait for the container to finish model loading. Monitor the container logs:
docker logs -f $CONTAINER_ID
Look for Riva gRPC Server is READY to confirm the server is ready.
If the 503 persists after logs show readiness, verify the health endpoint:
curl http://localhost:9000/v1/health/ready
Ensure --shm-size=8GB is set in the docker run command.
TTS only – If using SSL/TLS, verify the certificate configuration:
A valid server certificate and key.
For mTLS: NIM_SSL_CLIENT_CERT_PATH and NIM_SSL_CLIENT_KEY_PATH must be set.
The server SAN in the certificate, or the NIM_SSL_DOMAIN_NAME environment variable.
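Because startup can take many minutes, a polling loop is often more practical than repeated manual curls. A minimal sketch (wait_for_ready is an illustrative helper, not part of the NIM tooling); it relies on curl's -w '%{http_code}' write-out, which prints 000 when the connection fails:

```shell
# Illustrative wait loop (not part of NIM): poll the readiness endpoint
# until it returns HTTP 200 or the attempt budget is exhausted.
wait_for_ready() {
    url="$1"; attempts="${2:-60}"; delay="${3:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # curl prints "000" if the server is not reachable yet.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
        [ "$code" = "200" ] && { echo "ready"; return 0; }
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}
```

For example, wait_for_ready http://localhost:9000/v1/health/ready 120 15 waits up to 30 minutes, matching the worst-case first-start time described above.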
Port 8000 Conflict#
Symptom#
The container fails to start or returns connection errors when NIM_HTTP_API_PORT is set to 8000.
Cause#
Port 8000 is reserved internally for the Triton Inference Server HTTP endpoint. Setting the NIM HTTP API to the same port causes a conflict. The Triton gRPC port defaults to 8001 (NIM_GRPC_TRITON_PORT).
Solution#
Use any port other than 8000 or 8001 for NIM_HTTP_API_PORT. The default is 9000.
docker run ... -e NIM_HTTP_API_PORT=9000 -p 9000:9000 ...
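Before choosing a host port for -p, you can verify that nothing else is already listening on it. A sketch; port_in_use is an illustrative helper that simply attempts a local TCP connection via Python's standard socket module:

```shell
# Illustrative helper (not part of NIM): exit 0 if something is already
# listening on the given local TCP port, non-zero otherwise.
port_in_use() {
    python3 - "$1" <<'EOF'
import socket, sys
s = socket.socket()
s.settimeout(1)
# connect_ex returns 0 when a listener accepts the connection
rc = s.connect_ex(("127.0.0.1", int(sys.argv[1])))
s.close()
sys.exit(0 if rc == 0 else 1)
EOF
}
```

For example: if port_in_use 9000; then echo "9000 is busy; choose another NIM_HTTP_API_PORT"; fi.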
Pod Stuck in Pending State#
Symptom#
The NIM pod remains in Pending state and never transitions to Running.
Cause#
The Kubernetes scheduler cannot place the pod on any node. Common reasons include:
Node taints that the pod does not tolerate.
Insufficient GPU resources on available nodes.
Storage volume mount failures (for example, PVC not bound).
Solution#
Inspect the pod events to identify the scheduling failure:
kubectl describe pod <pod-name>
Check the Events section at the bottom of the output.
If the failure is taint-related, verify that the Helm chart includes the required tolerations. The default values.yaml tolerates nvidia.com/gpu:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Add additional tolerations if your cluster uses custom taints.
If the failure is insufficient GPU resources, confirm that GPU nodes are available and not fully allocated:
kubectl describe nodes | grep -A5 "Allocated resources"
If the failure is a PVC mount issue, check that the PVC is bound:
kubectl get pvc
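To spot the problem PVC quickly in a namespace with many claims, you can filter the tabular output for anything not yet Bound. A sketch; unbound_pvcs is an illustrative helper that assumes the default kubectl get pvc column layout (NAME first, STATUS second):

```shell
# Illustrative filter (not part of NIM): from `kubectl get pvc` output,
# print every PVC whose STATUS column is anything other than Bound.
unbound_pvcs() {
    # Skip the header row (NR > 1); field 2 is the STATUS column.
    awk 'NR > 1 && $2 != "Bound" { print $1 " is " $2 }'
}
```

Usage: kubectl get pvc | unbound_pvcs. An empty result means all claims are bound and the Pending pod has a different cause.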
Pods Fail to Start After Scaling or Upgrading Without StatefulSet#
Symptom#
After scaling replicas or upgrading the Helm release with statefulSet.enabled: false and persistence.enabled: true, new pods remain in Pending or ContainerCreating state.
Cause#
With statefulSet.enabled: false, the Helm chart creates a Deployment instead of a StatefulSet. A Deployment does not support per-replica PVC templates, so all replicas attempt to mount the same PVC. If the PVC uses ReadWriteOnce (the default accessMode), only one pod can mount it at a time. Additional pods block waiting for the volume.
Solution#
Choose one of the following approaches.
Use ReadWriteMany storage (recommended). Set the accessMode to ReadWriteMany so multiple pods can mount the same PVC simultaneously:
persistence:
  enabled: true
  accessMode: ReadWriteMany
This requires a storage class that supports ReadWriteMany, such as an NFS-based provisioner or CephFS.
Enable StatefulSet mode. Each replica gets its own PVC through PVC templates:
statefulSet:
  enabled: true
persistence:
  enabled: true
Use a direct NFS mount. Mount the model cache from an NFS server directly, bypassing PVCs:
nfs:
  enabled: true
  server: nfs-server.example.com
  path: /exports
Disable persistence. Each pod downloads the model to an ephemeral emptyDir volume on startup. This is the simplest option but requires every pod to re-download the model:
persistence:
  enabled: false