About Configuring Speech NIM Deployment
Each Speech NIM microservice runs as a GPU-accelerated container that packages a Nemotron model with the NVIDIA inference stack (CUDA, TensorRT, Triton) and exposes gRPC and HTTP endpoints. You deploy each service — ASR, TTS, NMT, or Speech-to-Speech — as an independent container with its own GPU allocation, model cache, and network ports.
Choosing a Deployment Method
Speech NIM microservices support two deployment paths. Both produce the same running service; the difference is how you manage containers and infrastructure.
Use Docker for:

- Minimal setup: run `docker run` with GPU flags, an NGC API key, and port mappings.
- Configuring GPU selection, shared memory, model cache mounts, and environment variables for each container.
- Local model caching to avoid repeated NGC downloads on startup.
- Single-GPU or multi-GPU hosts where you manage the containers individually.
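A typical Docker invocation combines the points above. The following is a sketch only: the image name is a placeholder, and the cache path and shared-memory size are illustrative assumptions; substitute the image and values for the Speech NIM you are deploying.

```shell
# Placeholders: <your-ngc-api-key> and <speech-nim-image> must be replaced.
export NGC_API_KEY=<your-ngc-api-key>
# Host directory used to cache downloaded model artifacts across runs.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 9000:9000 \
  -p 50051:50051 \
  <speech-nim-image>
```

The `--gpus '"device=0"'` flag pins the container to a single GPU, the volume mount persists the model cache at `/opt/nim/.cache`, and the two port mappings expose the default HTTP (9000) and gRPC (50051) endpoints.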
Use Helm for:

- A Kubernetes cluster with GPU nodes and the NVIDIA GPU Operator installed.
- Managing secrets, storage, autoscaling, ingress, and health probes declaratively through Helm values.
- Persistent volume claims, NFS mounts, and StatefulSet-based scaling for model caches.
- SSL/TLS and Prometheus metrics integration.
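A Helm deployment is sketched below under stated assumptions: the chart file name, release name, and values keys (`persistence.*`, `resources.*`) are illustrative; check the chart's `values.yaml` for the exact keys your chart defines.

```shell
# Store the NGC API key as a Kubernetes secret so the chart can pull models.
kubectl create secret generic ngc-api \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>

# Install with a persistent model cache and one GPU per pod.
# Dots inside a --set key must be escaped for Helm.
helm install speech-nim <speech-nim-chart>.tgz \
  --set persistence.enabled=true \
  --set persistence.size=50Gi \
  --set resources.limits."nvidia\.com/gpu"=1
```

Setting values on the command line is convenient for a first deployment; for anything managed long term, keep the overrides in a versioned values file and pass it with `-f`.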
- Deploy NVIDIA Speech NIM microservices as Docker containers.
- Deploy NVIDIA Speech NIM microservices using Helm charts.
Key Deployment Considerations
- GPU requirements

  Each NIM container requires at least one NVIDIA GPU. The specific GPU type and count depend on the model. Pass `--gpus '"device=N"'` (Docker) or set `resources.limits.nvidia.com/gpu` (Helm) to assign GPUs. Running `--gpus all` is not supported on multi-GPU hosts.

- Model caching

  On first startup, the container downloads model artifacts from NGC. Mount a host directory to `/opt/nim/.cache` (Docker) or configure a persistent volume (Helm) to cache models locally and avoid repeated downloads. Some models use prebuilt artifacts; others use RMIR format that requires an initial export step. Refer to Model Caching for details.

- Network ports

  Each NIM exposes two ports by default: HTTP on `9000` and gRPC on `50051`. Map these ports to your host or Kubernetes service as needed. Internal Triton ports (`8000`, `8001`, `8002`) do not need to be exposed.

- Security

  NIMs optionally support TLS and mTLS for encrypted communication. Set `NIM_SSL_MODE` to `TLS` or `MTLS` and provide certificate paths. Refer to Configuration for Docker or the Helm security section for Kubernetes.

- Model selection

  Use the `NIM_TAGS_SELECTOR` environment variable to select a specific model and profile (for example, `name=parakeet-1-1b-ctc-en-us,mode=str`). Refer to the support matrix for available models per service.
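Once a service is up, you can confirm the HTTP endpoint on its default port. This is a sketch assuming the conventional NIM readiness route `/v1/health/ready`; verify the exact route against your service's documentation.

```shell
# Poll the HTTP port until the NIM reports ready (model download and load
# on first startup can take several minutes).
until curl -sf http://localhost:9000/v1/health/ready; do
  echo "waiting for NIM to become ready..."
  sleep 10
done
```

For Kubernetes deployments, the Helm chart's health probes perform an equivalent check, so pods are only marked Ready after models have loaded.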