About Configuring Speech NIM Deployment#

Each Speech NIM microservice runs as a GPU-accelerated container that packages a speech model with the NVIDIA inference stack (CUDA, TensorRT, Triton) and exposes gRPC and HTTP endpoints. You deploy each service (ASR, TTS, NMT, or Speech-to-Speech) as an independent container with its own GPU allocation, model cache, and network ports.

Choosing a Deployment Method#

Speech NIM microservices support two deployment paths. Both produce the same running service; the difference is how you manage containers and infrastructure.

Use Docker for:

  • Minimal setup by running docker run with GPU flags, an NGC API key, and port mappings.

  • Configuring GPU selection, shared memory, model cache mounts, and environment variables for each container.

  • Local model caching to avoid repeated NGC downloads on startup.

  • Single-GPU or multi-GPU hosts where you manage the containers individually.
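The Docker path above can be sketched as a single command. The image reference below is a placeholder, not an actual image name; look up the Speech NIM container image and tag on NGC before running it.

```shell
# Sketch only: <speech-nim-image>:<tag> is a placeholder — find the
# actual Speech NIM container image and tag on NGC.
export NGC_API_KEY=<your-ngc-api-key>

docker run -it --rm \
  --gpus '"device=0"' \
  --shm-size=8g \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 9000:9000 \
  -p 50051:50051 \
  <speech-nim-image>:<tag>
```

The flags map directly to the bullets above: `--gpus` selects a specific GPU, `-e NGC_API_KEY` passes credentials for pulling model artifacts, `-v` mounts a host directory as the local model cache, and `-p` publishes the default HTTP and gRPC ports.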

Use Helm for:

  • A Kubernetes cluster with GPU nodes and the NVIDIA GPU Operator installed.

  • Managing secrets, storage, autoscaling, ingress, and health probes declaratively through Helm values.

  • Persistent volume claims, NFS mounts, and StatefulSet-based scaling for model caches.

  • SSL/TLS and Prometheus metrics integration.
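A Helm values override covering the items above might look like the following. This is an illustrative config fragment, not the chart's actual schema; key names vary by chart version, so consult the chart's values.yaml for the authoritative layout.

```yaml
# Illustrative values override — key names are assumptions; check the
# Speech NIM Helm chart's values.yaml for the exact schema.
ngcApiKeySecretName: ngc-api-key   # Kubernetes secret holding the NGC API key
resources:
  limits:
    nvidia.com/gpu: 1              # one GPU per pod
persistence:
  enabled: true                    # persistent volume claim for the model cache
  size: 50Gi
```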

Deploying with Docker#

Deploy NVIDIA Speech NIM microservices as Docker containers. Refer to Deploying NVIDIA Speech NIM Microservices with Docker.

Deploying with Helm#

Deploy NVIDIA Speech NIM microservices using Helm charts. Refer to Deploying NVIDIA Speech NIM Microservices with Helm.

Key Deployment Considerations#

GPU requirements

Each NIM container requires at least one NVIDIA GPU; the specific GPU type and count depend on the model. Pass --gpus '"device=N"' (Docker) or set resources.limits.nvidia.com/gpu (Helm) to assign GPUs. Using --gpus all is not supported on multi-GPU hosts; assign specific devices instead.
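For Docker, GPU assignment looks like the fragment below; the quoting matters, because the inner double quotes keep the device list as a single value when the shell strips the outer quotes. The image reference is a placeholder.

```shell
# Pin the container to GPU 1 only (avoid --gpus all on multi-GPU hosts).
# <speech-nim-image>:<tag> is a placeholder for the actual NGC image.
docker run --gpus '"device=1"' ... <speech-nim-image>:<tag>
```

The Helm equivalent sets resources.limits."nvidia.com/gpu" in the chart values, which Kubernetes uses to schedule the pod onto a GPU node.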

Model caching

On first startup, the container downloads model artifacts from NGC. Mount a host directory to /opt/nim/.cache (Docker) or configure a persistent volume (Helm) to cache models locally and avoid repeated downloads. Some models use prebuilt artifacts; others use RMIR format that requires an initial export step. Refer to Model Caching for details.
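With Docker, the cache mount is a single bind mount over the path named above. This is a command fragment with the unrelated flags elided; the host directory name is an arbitrary choice.

```shell
# Reuse model artifacts downloaded from NGC across container restarts
# by mounting a host directory over the container's cache path.
mkdir -p ~/.cache/nim
docker run ... \
  -v ~/.cache/nim:/opt/nim/.cache \
  ... <speech-nim-image>:<tag>
```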

Network ports

Each NIM exposes two ports by default: HTTP on 9000 and gRPC on 50051. Map these ports to your host or Kubernetes service as needed. Internal Triton ports (8000, 8001, 8002) do not need to be exposed.
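Once the ports are mapped, you can probe the HTTP endpoint from the host. The health route below is an assumption based on common NIM conventions; confirm the exact path in your service's API reference.

```shell
# Check that the HTTP endpoint (mapped to host port 9000) responds.
# The /v1/health/ready path is an assumption — verify it against the
# service's API reference.
curl -s http://localhost:9000/v1/health/ready
```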

Security

NIMs optionally support TLS and mTLS for encrypted communication. Set NIM_SSL_MODE to TLS or MTLS and provide certificate paths. Refer to Configuration for Docker or the Helm security section for Kubernetes.
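For Docker, enabling TLS amounts to setting the mode variable and making the certificates visible inside the container. The fragment below elides unrelated flags; the environment variables that point at the certificate and key files are documented in Configuration, so only NIM_SSL_MODE is shown here.

```shell
# Enable one-way TLS; mount the certificate directory into the container.
# See the Configuration reference for the variables that name the
# certificate and key files.
docker run ... \
  -e NIM_SSL_MODE=TLS \
  -v /path/to/certs:/certs \
  ... <speech-nim-image>:<tag>
```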

Model selection

Use the NIM_TAGS_SELECTOR environment variable to select a specific model and profile (for example, name=parakeet-1-1b-ctc-en-us,mode=str). Refer to the support matrix for available models per service.
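Passed as an environment variable at startup, the selector from the example above looks like this in a Docker invocation (other flags elided; the image reference is a placeholder):

```shell
# Select the ASR model and the streaming (str) profile at startup.
# Available name/mode values per service are listed in the support matrix.
docker run ... \
  -e NIM_TAGS_SELECTOR="name=parakeet-1-1b-ctc-en-us,mode=str" \
  ... <speech-nim-image>:<tag>
```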