About Configuring Speech NIM Deployment#

Each Speech NIM microservice runs as a GPU-accelerated container that packages a speech model with the NVIDIA inference stack (CUDA, TensorRT, Triton) and exposes gRPC and HTTP endpoints. You deploy each service (ASR, TTS, NMT, or Speech-to-Speech) as an independent container with its own GPU allocation, model cache, and network ports.

Choosing a Deployment Method#

Speech NIM microservices support two deployment paths. Both produce the same running service; the difference is how you manage containers and infrastructure.

Use Docker for:

  • Minimal setup by running docker run with GPU flags, an NGC API key, and port mappings.

  • Configuring GPU selection, shared memory, model cache mounts, and environment variables for each container.

  • Local model caching to avoid repeated NGC downloads on startup.

  • Single-GPU or multi-GPU hosts where you manage the containers individually.
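The Docker path above can be sketched as a single command. The image reference below is a placeholder, not an actual image name; look up the Speech NIM container image and tag on NGC before running it.

```shell
# Sketch only: <speech-nim-image>:<tag> is a placeholder — find the
# actual Speech NIM container image and tag on NGC.
export NGC_API_KEY=<your-ngc-api-key>

docker run -it --rm \
  --gpus '"device=0"' \
  --shm-size=8g \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 9000:9000 \
  -p 50051:50051 \
  <speech-nim-image>:<tag>
```

The flags map directly to the bullets above: `--gpus` selects a specific GPU, `-e NGC_API_KEY` passes credentials for pulling model artifacts, `-v` mounts a host directory as the local model cache, and `-p` publishes the default HTTP and gRPC ports.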

Use Helm for:

  • A Kubernetes cluster with GPU nodes and the NVIDIA GPU Operator installed.

  • Managing secrets, storage, autoscaling, ingress, and health probes declaratively through Helm values.

  • Persistent volume claims, NFS mounts, and StatefulSet-based scaling for model caches.

  • SSL/TLS and Prometheus metrics integration.
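A Helm values override covering the items above might look like the following. This is an illustrative config fragment, not the chart's actual schema; key names vary by chart version, so consult the chart's values.yaml for the authoritative layout.

```yaml
# Illustrative values override — key names are assumptions; check the
# Speech NIM Helm chart's values.yaml for the exact schema.
ngcApiKeySecretName: ngc-api-key   # Kubernetes secret holding the NGC API key
resources:
  limits:
    nvidia.com/gpu: 1              # one GPU per pod
persistence:
  enabled: true                    # persistent volume claim for the model cache
  size: 50Gi
```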

Deploying with Docker#

Deploy NVIDIA Speech NIM microservices as Docker containers. Refer to Deploying NVIDIA Speech NIM Microservices with Docker.

Deploying with Helm#

Deploy NVIDIA Speech NIM microservices using Helm charts. Refer to Deploying NVIDIA Speech NIM Microservices with Helm.

Key Deployment Considerations#

GPU requirements

Each NIM container requires at least one NVIDIA GPU; the specific GPU type and count depend on the model. Pass --gpus '"device=N"' (Docker) or set resources.limits.nvidia.com/gpu (Helm) to assign GPUs. Using --gpus all is not supported on multi-GPU hosts; assign specific devices instead.
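For Docker, GPU assignment looks like the fragment below; the quoting matters, because the inner double quotes keep the device list as a single value when the shell strips the outer quotes. The image reference is a placeholder.

```shell
# Pin the container to GPU 1 only (avoid --gpus all on multi-GPU hosts).
# <speech-nim-image>:<tag> is a placeholder for the actual NGC image.
docker run --gpus '"device=1"' ... <speech-nim-image>:<tag>
```

The Helm equivalent sets resources.limits."nvidia.com/gpu" in the chart values, which Kubernetes uses to schedule the pod onto a GPU node.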

Model caching

On first startup, the container downloads model artifacts from NGC. Mount a host directory to /opt/nim/.cache (Docker) or configure a persistent volume (Helm) to cache models locally and avoid repeated downloads. Some models use prebuilt artifacts; others use RMIR format that requires an initial export step. Refer to Model Caching for details.
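With Docker, the cache mount is a single bind mount over the path named above. This is a command fragment with the unrelated flags elided; the host directory name is an arbitrary choice.

```shell
# Reuse model artifacts downloaded from NGC across container restarts
# by mounting a host directory over the container's cache path.
mkdir -p ~/.cache/nim
docker run ... \
  -v ~/.cache/nim:/opt/nim/.cache \
  ... <speech-nim-image>:<tag>
```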

Network ports

Each NIM exposes two ports by default: HTTP on 9000 and gRPC on 50051. Map these ports to your host or Kubernetes service as needed. Internal Triton ports (8000, 8001, 8002) do not need to be exposed.
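Once the ports are mapped, you can probe the HTTP endpoint from the host. The health route below is an assumption based on common NIM conventions; confirm the exact path in your service's API reference.

```shell
# Check that the HTTP endpoint (mapped to host port 9000) responds.
# The /v1/health/ready path is an assumption — verify it against the
# service's API reference.
curl -s http://localhost:9000/v1/health/ready
```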

Security

NIMs optionally support TLS and mTLS for encrypted communication. Set NIM_SSL_MODE to TLS or MTLS and provide certificate paths. Refer to Configuration for Docker or the Helm security section for Kubernetes.
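For Docker, enabling TLS amounts to setting the mode variable and making the certificates visible inside the container. The fragment below elides unrelated flags; the environment variables that point at the certificate and key files are documented in Configuration, so only NIM_SSL_MODE is shown here.

```shell
# Enable one-way TLS; mount the certificate directory into the container.
# See the Configuration reference for the variables that name the
# certificate and key files.
docker run ... \
  -e NIM_SSL_MODE=TLS \
  -v /path/to/certs:/certs \
  ... <speech-nim-image>:<tag>
```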

Model selection

Use the NIM_TAGS_SELECTOR environment variable to select a specific model and profile (for example, name=parakeet-1-1b-ctc-en-us,mode=str). Refer to the support matrix for available models per service.
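Passed as an environment variable at startup, the selector from the example above looks like this in a Docker invocation (other flags elided; the image reference is a placeholder):

```shell
# Select the ASR model and the streaming (str) profile at startup.
# Available name/mode values per service are listed in the support matrix.
docker run ... \
  -e NIM_TAGS_SELECTOR="name=parakeet-1-1b-ctc-en-us,mode=str" \
  ... <speech-nim-image>:<tag>
```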