# NVIDIA Speech NIM Microservices Overview
NVIDIA Speech NIM microservices are GPU-accelerated Docker containers that provide speech AI capabilities as building blocks for your applications. Each NIM microservice packages a Nemotron model, the full NVIDIA inference stack (CUDA, TensorRT, Triton), and a unified API into a single container that you deploy, scale, and interact with through standard gRPC and HTTP interfaces.
You do not interact with models directly. Instead, each NIM microservice provides the API layer that your application calls to run inference on the containerized models.
## NIM Microservices
Each NIM microservice serves a Nemotron model and exposes it through a dedicated API.
| NIM Microservice | Supported Model Family | Input | Output | Default API Protocols |
|---|---|---|---|---|
| NVIDIA ASR NIM | Nemotron ASR | Audio stream or audio buffer | Text transcripts with optional metadata | gRPC (port 50051), HTTP (port 9000) |
| NVIDIA TTS NIM | Nemotron TTS | Text | Synthesized speech audio | gRPC (port 50051), HTTP (port 9000) |
| NVIDIA NMT NIM | Nemotron NMT | Text (source language) | Text (target language) | gRPC (port 50051), HTTP (port 9000) |
You can deploy only the NIM microservices your application needs. Each one runs as an independent Docker container with GPU acceleration. You select a specific model at deploy time by setting the `CONTAINER_ID` and `NIM_TAGS_SELECTOR` environment variables. For supported models and container IDs, refer to the Support Matrix.
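As a rough sketch of what such a deployment looks like, the command below shows the general shape of a `docker run` invocation with the two environment variables described above. The image name, container ID, and tag selector are placeholders, not real values; substitute the values listed in the Support Matrix for the model you want to serve.

```shell
# Deployment sketch -- all angle-bracketed values are placeholders.
# CONTAINER_ID and NIM_TAGS_SELECTOR select the model, as described above;
# the ports match the default gRPC (50051) and HTTP (9000) interfaces.
docker run -d --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -e CONTAINER_ID=<container-id-from-support-matrix> \
  -e NIM_TAGS_SELECTOR=<tags-selector> \
  -p 50051:50051 \
  -p 9000:9000 \
  <speech-nim-image>:<tag>
```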
To try each NIM microservice, visit build.nvidia.com.
## Building Applications with Speech NIM Microservices
Speech NIM microservices are building blocks. Your application sends requests to the NIM container APIs and receives results. The NIM handles model loading, GPU execution, batching, and streaming internally.
You can chain multiple NIM microservices together for complex pipelines. For example, a real-time translation application calls the ASR NIM to transcribe audio, passes the transcript to the NMT NIM for translation, and sends the translated text to the TTS NIM for speech synthesis. By integrating the NVIDIA Speech NIM microservices, you can orchestrate the data flow and build complex pipelines for end-to-end speech applications while scaling each NIM microservice independently.
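The real-time translation pipeline above can be sketched as a small orchestration function. The three client calls below are placeholder stubs, not the actual NIM client APIs; only the data flow (ASR transcribes, NMT translates, TTS synthesizes) comes from this document.

```python
# Sketch of chaining ASR -> NMT -> TTS for real-time translation.
# Each helper is a stub standing in for a gRPC or HTTP call to the
# corresponding NIM microservice; replace the bodies with real client calls.

def asr_transcribe(audio: bytes) -> str:
    """Placeholder for a call to the ASR NIM: audio in, transcript out."""
    return "hello world"  # a real client returns the recognized text

def nmt_translate(text: str, source: str, target: str) -> str:
    """Placeholder for a call to the NMT NIM: source text in, target text out."""
    return f"[{target}] {text}"  # a real client returns the translation

def tts_synthesize(text: str) -> bytes:
    """Placeholder for a call to the TTS NIM: text in, audio out."""
    return text.encode("utf-8")  # a real client returns audio samples

def translate_speech(audio: bytes, source: str, target: str) -> bytes:
    """Orchestrate the pipeline: transcribe, translate, then synthesize."""
    transcript = asr_transcribe(audio)
    translation = nmt_translate(transcript, source=source, target=target)
    return tts_synthesize(translation)

audio_out = translate_speech(b"\x00\x01", source="en-US", target="es-US")
```

Because each stage is an independent service call, your orchestration layer can scale each NIM microservice separately, for example running more ASR replicas than TTS replicas when transcription is the bottleneck.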
## Use Cases
| Use Case | NIM Microservices | Description |
|---|---|---|
| Call center transcription | ASR NIM | Transcribe customer calls in real time or batch for analytics, compliance, and agent assistance. |
| Voice-enabled applications | ASR NIM + TTS NIM | Add speech input and spoken responses to virtual assistants, kiosks, or IoT devices. |
| Multilingual customer support | ASR NIM + NMT NIM | Transcribe speech and translate to an agent's language for cross-language support workflows. |
| Real-time translation pipeline | ASR NIM + NMT NIM + TTS NIM | Capture speech, translate the transcript, and synthesize the output in the target language for live interpretation. |
| Accessibility and captioning | ASR NIM | Generate real-time captions or subtitles for live events, meetings, and media. |
| Content localization | NMT NIM + TTS NIM | Translate text content and produce localized voice-overs for training materials, videos, or documentation. |
## Next Steps
- How It Works: Learn how the NVIDIA Speech NIM microservices work together to build speech applications.
- Release Notes: Track the release notes for the NVIDIA Speech NIM microservices.
- Support Matrix: Find supported models and hardware requirements.
- Get Started: Learn how to deploy Speech NIM microservices with Docker.