How NVIDIA Speech NIM Microservices Work#

This page describes how the NVIDIA Speech NIM microservices work together to build speech applications.

Microservices and Models#

NVIDIA Speech NIM microservices are released and versioned together as a collection. Each NIM microservice containerizes one or more NVIDIA NeMo models and exposes them through a unified API. You interact with the models through the NIM microservice’s gRPC and HTTP APIs, not directly with the models.

The following table lists each NIM microservice and the model families it serves.

| NIM Microservice | Nemotron Model Family | Models | Function |
|---|---|---|---|
| NVIDIA ASR NIM | Nemotron ASR | Parakeet, Canary, Conformer, Whisper | Converts audio to text. Interact with Nemotron ASR models through the ASR NIM API. |
| NVIDIA TTS NIM | Nemotron TTS | Magpie, FastPitch HifiGAN | Synthesizes speech from text. Interact with Nemotron TTS models through the TTS NIM API. |
| NVIDIA NMT NIM | Nemotron NMT | Megatron-based translation | Translates text between languages. Interact with Nemotron NMT models through the NMT NIM API. |

Each NIM microservice is a standalone Docker container, and you deploy only the microservices your application needs. For example, a transcription service requires only the ASR NIM, while a real-time translation pipeline chains the ASR NIM, NMT NIM, and TTS NIM together.

You select a specific model at deploy time by setting the CONTAINER_ID and NIM_TAGS_SELECTOR environment variables. For supported models and container IDs, refer to the Support Matrix.

Architecture#

Every NIM container packages the full inference stack. The architecture is as follows.

```mermaid
graph TD
    A("Client Application<br/>(gRPC or HTTP/REST API)")
    subgraph NIM["NIM Container (ASR / TTS / NMT)"]
        B("NIM Application<br/>(model selection, config)")
        C("NVIDIA Triton Server<br/>(request scheduling, batching, streaming)")
        D("TensorRT / CUDA Runtime<br/>(GPU-accelerated inference)")
        B --> C --> D
    end
    A --> B
    style A fill:#ffffff,stroke:#000000,color:#000000
    style NIM fill:#000000,stroke:#76b900,color:#76b900
    style B fill:#76b900,stroke:#000000,color:#000000
    style C fill:#76b900,stroke:#000000,color:#000000
    style D fill:#76b900,stroke:#000000,color:#000000
    linkStyle default stroke:#76b900,stroke-width:2px
```

| Component | Description |
|---|---|
| NIM Application | Handles model profile selection, configuration, and API routing. Exposes gRPC (port 50051) and HTTP (port 9000) endpoints. |
| NVIDIA Triton Inference Server | Manages request scheduling, dynamic batching, and streaming connections. |
| TensorRT / CUDA | Executes the neural network on the GPU with optimized kernels. |

Model Profiles#

Each NIM container ships with multiple model profiles optimized for different GPUs and inference modes. On startup, the NIM selects or downloads the best profile for the detected hardware.

Profiles come in two types: pre-built and Riva Model Intermediate Representation (RMIR).

| Profile Type | Description |
|---|---|
| Pre-built | Pre-optimized TensorRT engines for GPUs with Compute Capability >= 8.0. Fastest startup. |
| RMIR | Portable format that generates an optimized engine on the fly for the target GPU with Compute Capability >= 8.0. Longer first startup. |

You can override automatic selection with NIM_TAGS_SELECTOR to choose a specific inference mode (streaming, offline, or both) and other profile options. Use the nim_list_model_profiles utility to list the profiles available for your hardware.
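Assuming NIM_TAGS_SELECTOR takes a comma-separated list of key=value pairs (the exact keys a given NIM accepts are documented with that NIM, so the key names below are illustrative), a small parser sketch shows the shape of such a selector:

```python
def parse_tags_selector(selector: str) -> dict[str, str]:
    """Parse a comma-separated key=value selector string into a dict.

    Assumes the comma-separated key=value format; the key names a
    given NIM accepts are an assumption here -- consult the NIM
    documentation for the supported tags.
    """
    tags = {}
    for pair in selector.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate stray commas
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


print(parse_tags_selector("mode=streaming,name=parakeet"))
# -> {'mode': 'streaming', 'name': 'parakeet'}
```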

Request Flow#

With the NVIDIA Speech NIM microservices running, the request flow is as follows:

  1. Client sends a request through gRPC or HTTP to the NIM container.

  2. NIM routes the request to the appropriate model pipeline on Triton.

  3. Triton batches and schedules the request for GPU execution.

  4. TensorRT executes inference on the GPU.

  5. The response is returned to the client as a single result (offline) or as a stream of partial results (streaming).

For streaming ASR, the client sends audio chunks continuously and receives partial transcripts in real time. For streaming TTS, the client sends text and receives synthesized audio chunks as they are generated.
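The client side of that streaming pattern comes down to slicing audio into fixed-duration chunks before sending them one request at a time. A minimal sketch, with illustrative chunk size and PCM parameters (not values mandated by the API):

```python
from typing import Iterator


def audio_chunks(audio: bytes, chunk_ms: int = 100,
                 sample_rate: int = 16000,
                 sample_width: int = 2) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration chunks -- the payload
    shape a streaming ASR client sends to the NIM, one per request."""
    # Bytes per chunk: samples/sec * bytes/sample * chunk duration.
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    for offset in range(0, len(audio), chunk_bytes):
        yield audio[offset:offset + chunk_bytes]


# One second of silent 16 kHz, 16-bit mono audio -> ten 100 ms chunks.
chunks = list(audio_chunks(b"\x00" * 32000))
print(len(chunks))  # 10
```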

Composing Pipelines#

Each NIM is an independent container that communicates over standard network protocols. To build a multi-stage pipeline (for example, speech-to-speech translation):

  1. The ASR NIM receives audio and outputs a transcript.

  2. The NMT NIM receives the transcript and outputs translated text.

  3. The TTS NIM receives the translated text and outputs synthesized audio.

Your application orchestrates the data flow between containers. Each NIM scales independently.
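The three-stage flow above can be sketched as plain function composition, with stub callables standing in for the real NIM clients (the function and stub names here are hypothetical, not part of any NIM API):

```python
from typing import Callable


def speech_to_speech(audio: bytes,
                     asr: Callable[[bytes], str],
                     nmt: Callable[[str], str],
                     tts: Callable[[str], bytes]) -> bytes:
    """Chain ASR -> NMT -> TTS; each callable stands in for one NIM
    client, so the stages can be swapped or scaled independently."""
    transcript = asr(audio)        # ASR NIM: audio -> transcript
    translated = nmt(transcript)   # NMT NIM: transcript -> translation
    return tts(translated)         # TTS NIM: translation -> audio


# Stub stages standing in for real NIM clients.
fake_asr = lambda audio: "hello"
fake_nmt = lambda text: "hola"
fake_tts = lambda text: text.encode("utf-8")

print(speech_to_speech(b"\x00", fake_asr, fake_nmt, fake_tts))  # b'hola'
```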