NVIDIA Speech NIM Microservices Overview#

NVIDIA Speech NIM microservices are GPU-accelerated Docker containers that provide speech AI capabilities as building blocks for your applications. Each NIM microservice packages a Nemotron model, the full NVIDIA inference stack (CUDA, TensorRT, Triton), and a unified API into a single container that you deploy, scale, and interact with through standard gRPC and HTTP interfaces.

You do not interact with models directly. Instead, each NIM microservice provides the API layer that your application calls to run inference on the containerized models.

NIM Microservices#

Each NIM microservice serves a Nemotron model and exposes it through a dedicated API.

| NIM Microservice | Supported Model Family | Input | Output | Default API Protocols |
|---|---|---|---|---|
| NVIDIA ASR NIM | Nemotron ASR | Audio stream or audio buffer | Text transcripts with optional metadata | gRPC (port 50051), HTTP (port 9000) |
| NVIDIA TTS NIM | Nemotron TTS | Text | Synthesized speech audio | gRPC (port 50051), HTTP (port 9000) |
| NVIDIA NMT NIM | Nemotron NMT | Text (source language) | Text (target language) | gRPC (port 50051), HTTP (port 9000) |
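Both protocols are served from the same container. As a quick post-deployment sanity check, you can probe the HTTP port. The following is a minimal sketch, assuming a NIM running locally with the default port mapping and the conventional /v1/health/ready route that NIM containers expose; verify the route against your NIM's API reference.

```python
import requests

# Readiness probe against a locally deployed NIM's HTTP port.
# Assumes the default port mapping (9000) and the conventional
# /v1/health/ready route; adjust host and port for your deployment.
response = requests.get("http://localhost:9000/v1/health/ready", timeout=5)
print(response.status_code, response.text)
```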

You can deploy only the NIM microservices your application needs. Each runs as an independent Docker container with GPU acceleration. You select a specific model at deploy time by setting the CONTAINER_ID and NIM_TAGS_SELECTOR environment variables. For supported models and container IDs, refer to the Support Matrix.
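For example, a deployment might look like the sketch below, which shells out to Docker from Python. The container ID, tag selector, and image path are hypothetical placeholders; substitute the values for your model from the Support Matrix.

```python
import subprocess

# A minimal deployment sketch. The container ID and tag selector below are
# hypothetical placeholders; take the real values for your model from the
# Support Matrix. Assumes Docker with the NVIDIA Container Toolkit and an
# NGC_API_KEY variable set in the host environment.
container_id = "nemotron-asr-example"        # hypothetical CONTAINER_ID
tags_selector = "name=nemotron-asr-example"  # hypothetical NIM_TAGS_SELECTOR

subprocess.run(
    [
        "docker", "run", "--rm",
        "--gpus", "all",                             # GPU acceleration
        "-e", "NGC_API_KEY",                         # forwarded from the host
        "-e", f"NIM_TAGS_SELECTOR={tags_selector}",  # selects the model at deploy time
        "-p", "50051:50051",                         # gRPC
        "-p", "9000:9000",                           # HTTP
        f"nvcr.io/nim/nvidia/{container_id}:latest",
    ],
    check=True,
)
```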

To try each NIM microservice, visit build.nvidia.com.

Building Applications with Speech NIM Microservices#

Speech NIM microservices are building blocks. Your application sends requests to the NIM container APIs and receives results. The NIM handles model loading, GPU execution, batching, and streaming internally.

```mermaid
graph LR
    App("Your Application") -->|gRPC / HTTP| ASR("ASR NIM")
    App -->|gRPC / HTTP| TTS("TTS NIM")
    App -->|gRPC / HTTP| NMT("NMT NIM")
    style App fill:#ffffff,stroke:#000000,color:#000000
    style ASR fill:#76b900,stroke:#000000,color:#000000
    style TTS fill:#76b900,stroke:#000000,color:#000000
    style NMT fill:#76b900,stroke:#000000,color:#000000
    linkStyle default stroke:#76b900,stroke-width:2px
```
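As a concrete sketch of this request-response pattern, a minimal transcription call against the ASR NIM might look like the following. It assumes the nvidia-riva-client Python package and an ASR NIM whose gRPC endpoint is Riva-compatible on the default port, an assumption to verify against your NIM's API reference; the audio file and language code are placeholders.

```python
import riva.client

# A minimal sketch, assuming the nvidia-riva-client package
# (`pip install nvidia-riva-client`) and an ASR NIM listening on the
# default gRPC port 50051. The audio file and language code are
# placeholders for your own input.
auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,  # 16-bit PCM WAV payload
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
)

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

# The NIM handles batching and GPU execution internally; the client
# just sends the request and reads back transcripts.
response = asr.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```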

You can chain multiple NIM microservices together into more complex pipelines. For example, a real-time translation application calls the ASR NIM to transcribe audio, passes the transcript to the NMT NIM for translation, and sends the translated text to the TTS NIM for speech synthesis. Because your application orchestrates the data flow between containers, you can build end-to-end speech pipelines while scaling each NIM microservice independently.
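The sketch below illustrates such a chain, again assuming the nvidia-riva-client package and three NIM containers with their gRPC ports mapped to 50051 (ASR), 50052 (NMT), and 50053 (TTS) on the local host. The file names, language codes, NMT model name, and TTS voice name are hypothetical placeholders.

```python
import riva.client

# A minimal sketch of the translation pipeline described above. Port
# mappings, file names, language codes, the NMT model name, and the TTS
# voice name are placeholders; substitute values from your deployment.
asr = riva.client.ASRService(riva.client.Auth(uri="localhost:50051"))
nmt = riva.client.NeuralMachineTranslationClient(riva.client.Auth(uri="localhost:50052"))
tts = riva.client.SpeechSynthesisService(riva.client.Auth(uri="localhost:50053"))

# 1. ASR NIM: audio in, transcript out.
with open("speech_en.wav", "rb") as f:
    audio = f.read()
asr_config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
)
transcript = asr.offline_recognize(audio, asr_config).results[0].alternatives[0].transcript

# 2. NMT NIM: transcript in, translated text out (model name is hypothetical).
translation = nmt.translate([transcript], "nemotron-nmt-example", "en", "es").translations[0].text

# 3. TTS NIM: translated text in, synthesized speech out (voice name is hypothetical).
resp = tts.synthesize(translation, voice_name="Spanish-US-Female-1", language_code="es-US")
with open("speech_es.wav", "wb") as f:
    f.write(resp.audio)
```

Because each stage is an independent container, you can scale the bottleneck stage (for example, ASR under heavy call volume) without touching the others.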

Use Cases#

| Use Case | NIM Microservices | Description |
|---|---|---|
| Call center transcription | ASR NIM | Transcribe customer calls in real time or batch for analytics, compliance, and agent assistance. |
| Voice-enabled applications | ASR NIM + TTS NIM | Add speech input and spoken responses to virtual assistants, kiosks, or IoT devices. |
| Multilingual customer support | ASR NIM + NMT NIM | Transcribe speech and translate to an agent’s language for cross-language support workflows. |
| Real-time translation pipeline | ASR NIM + NMT NIM + TTS NIM | Capture speech, translate the transcript, and synthesize the output in the target language for live interpretation. |
| Accessibility and captioning | ASR NIM | Generate real-time captions or subtitles for live events, meetings, and media. |
| Content localization | NMT NIM + TTS NIM | Translate text content and produce localized voice-overs for training materials, videos, or documentation. |

Next Steps#

  • How It Works: Learn how the NVIDIA Speech NIM microservices work together to build speech applications.

  • Release Notes: Keep up with changes in each release of the NVIDIA Speech NIM microservices.

  • Support Matrix: Find supported models and hardware requirements.

  • Get Started: Learn how to deploy Speech NIM microservices with Docker.