NVIDIA Speech NIM Microservices Overview

NVIDIA Speech NIM microservices are GPU-accelerated Docker containers that provide speech AI capabilities as building blocks for your applications. Each NIM microservice packages a Nemotron model, the full NVIDIA inference stack (CUDA, TensorRT, Triton), and a unified API into a single container that you deploy, scale, and interact with through standard gRPC and HTTP interfaces.

You do not interact with models directly. Instead, each NIM microservice provides the API layer that your application calls to run inference on the containerized models.

NIM Microservices

Each NIM microservice serves a Nemotron model and exposes it through a dedicated API.

NIM Microservice	Supported Model Family	Input	Output	Default API Protocols
NVIDIA ASR NIM	Nemotron ASR	Audio stream or audio buffer	Text transcripts with optional metadata	gRPC (port 50051), HTTP (port 9000)
NVIDIA TTS NIM	Nemotron TTS	Text	Synthesized speech audio	gRPC (port 50051), HTTP (port 9000)
NVIDIA NMT NIM	Nemotron NMT	Text (source language)	Text (target language)	gRPC (port 50051), HTTP (port 9000)

You can deploy only the NIM microservices your application needs. Each runs as an independent Docker container with GPU acceleration. You select a specific model at deploy time by setting the CONTAINER_ID and NIM_TAGS_SELECTOR environment variables. For supported models and container IDs, refer to the Support Matrix.
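As a rough illustration of deploy-time model selection, the sketch below assembles a `docker run` argument list in Python. It is not the documented launch command: the image template, the example container ID, and the selector value are hypothetical placeholders, and the helper name `build_docker_run` is invented for this example. It also assumes that the container ID names both the image and the running container, and it omits credentials and other settings a real deployment needs. Take the actual values from the Support Matrix.

```python
import shlex


def build_docker_run(image_template, container_id, tags_selector,
                     http_port=9000, grpc_port=50051):
    """Assemble a docker run command for one Speech NIM container.

    image_template is a caller-supplied string containing {container_id};
    the real registry path is not stated on this page.
    """
    image = image_template.format(container_id=container_id)
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                             # expose GPUs to the container
        "--name", container_id,                      # assumption: reuse the ID as the container name
        "-e", f"NIM_TAGS_SELECTOR={tags_selector}",  # selects the model at deploy time
        "-p", f"{http_port}:{http_port}",            # default HTTP port from the table above
        "-p", f"{grpc_port}:{grpc_port}",            # default gRPC port from the table above
        image,
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = build_docker_run(
        "registry.example.com/speech/{container_id}:latest",
        "example-asr-model",
        "name=example-asr-model",
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so you can pass it to `subprocess.run` without shell quoting problems.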

To try each NIM microservice, visit build.nvidia.com.

Building Applications with Speech NIM Microservices

Speech NIM microservices are building blocks. Your application sends requests to the NIM container APIs and receives results. The NIM handles model loading, GPU execution, batching, and streaming internally.

graph LR
    App("Your Application") -->|gRPC / HTTP| ASR("ASR NIM")
    App -->|gRPC / HTTP| TTS("TTS NIM")
    App -->|gRPC / HTTP| NMT("NMT NIM")
    style App fill:#ffffff,stroke:#000000,color:#000000
    style ASR fill:#76b900,stroke:#000000,color:#000000
    style TTS fill:#76b900,stroke:#000000,color:#000000
    style NMT fill:#76b900,stroke:#000000,color:#000000
    linkStyle default stroke:#76b900,stroke-width:2px
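Because the NIM container loads the model internally, an application may want to confirm that a container is accepting requests before it sends work. The sketch below shows one way to do that over HTTP using only the Python standard library. The readiness path `/v1/health/ready`, the host name, and the helper names `ready_url` and `is_ready` are assumptions made for this example; this page does not document them, so confirm the actual endpoint in the API reference for your NIM.

```python
import urllib.request

DEFAULT_HTTP_PORT = 9000  # default HTTP port listed in the table above


def ready_url(host="localhost", port=DEFAULT_HTTP_PORT,
              path="/v1/health/ready"):
    """Build the URL for an assumed readiness endpoint."""
    return f"http://{host}:{port}{path}"


def is_ready(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    url = ready_url()
    print(url, "ready" if is_ready(url) else "not ready")
```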

You can chain multiple NIM microservices together to build end-to-end speech pipelines. For example, a real-time translation application calls the ASR NIM to transcribe audio, passes the transcript to the NMT NIM for translation, and sends the translated text to the TTS NIM for speech synthesis. Your application orchestrates the data flow between the NIM microservices, and you can scale each NIM microservice independently.
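The chaining pattern can be sketched as three callables composed in order. This is a minimal illustration, not NVIDIA client code: the class name `SpeechTranslationPipeline` and the stub functions are hypothetical, and the stubs stand in for real gRPC or HTTP calls to the ASR, NMT, and TTS NIM containers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTranslationPipeline:
    """Chain ASR -> NMT -> TTS; each stage is any callable you supply."""
    transcribe: Callable[[bytes], str]   # audio in, source-language text out
    translate: Callable[[str], str]      # source text in, target text out
    synthesize: Callable[[str], bytes]   # target text in, audio out

    def run(self, audio: bytes) -> bytes:
        transcript = self.transcribe(audio)
        translated = self.translate(transcript)
        return self.synthesize(translated)


# Stub stages for illustration only. In a real application each one
# would wrap a request to the corresponding NIM container.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_nmt(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechTranslationPipeline(fake_asr, fake_nmt, fake_tts)

if __name__ == "__main__":
    print(pipeline.run(b"hello"))
```

Keeping each stage behind a plain callable means you can swap a stub for a real client, or point a stage at a different container, without changing the orchestration code.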

Use Cases

Use Case	NIM Microservices	Description
Call center transcription	ASR NIM	Transcribe customer calls in real time or batch for analytics, compliance, and agent assistance.
Voice-enabled applications	ASR NIM + TTS NIM	Add speech input and spoken responses to virtual assistants, kiosks, or IoT devices.
Multilingual customer support	ASR NIM + NMT NIM	Transcribe speech and translate to an agent’s language for cross-language support workflows.
Real-time translation pipeline	ASR NIM + NMT NIM + TTS NIM	Capture speech, translate the transcript, and synthesize the output in the target language for live interpretation.
Accessibility and captioning	ASR NIM	Generate real-time captions or subtitles for live events, meetings, and media.
Content localization	NMT NIM + TTS NIM	Translate text content and produce localized voice-overs for training materials, videos, or documentation.

Next Steps

How It Works: Learn how the NVIDIA Speech NIM microservices work together to build speech applications.
Release Notes: Review what changed in each release of the NVIDIA Speech NIM microservices.
Support Matrix: Find supported models and hardware requirements.
Get Started: Learn how to deploy Speech NIM microservices with Docker.