Run Your First NVIDIA Speech NIM Microservice for Automatic Speech Recognition#

In this tutorial, you will deploy an NVIDIA Speech NIM microservice for Automatic Speech Recognition (ASR) and use it to transcribe audio to text. Along the way, you will learn the NIM deployment workflow, the difference between streaming and offline transcription, and how to interact with the service through gRPC and HTTP.

What You Learn#

By completing this tutorial, you:

  • Understand how a NIM container downloads, optimizes, and serves a speech model.

  • Know the difference between streaming transcription (partial results in real time) and offline transcription (complete result after processing the full audio).

  • Be able to deploy an ASR NIM, check its readiness, and send transcription requests through both gRPC and HTTP.

  • Recognize the role of CONTAINER_ID and NIM_TAGS_SELECTOR in selecting which model and inference mode to deploy.

What You Need#

Before you begin, make sure you have the following (each item is used by the commands in this tutorial):

  • An NVIDIA GPU and Docker with the NVIDIA Container Toolkit (the docker run command uses --runtime=nvidia).

  • An NGC API key exported in the NGC_API_KEY environment variable, so the container can download the model from NGC.

  • The python-clients scripts and their dependencies, used in Steps 4 and 5.

Key Concepts#

Before starting, read the following concepts so that you understand, at a high level, what the commands in this tutorial do.

**NIM container**

A self-contained Docker image that packages a speech model with the full NVIDIA inference stack (TensorRT, Triton). Models are not installed separately. The container handles model download, optimization, and serving.

**Model profiles**

Each ASR NIM container includes multiple profiles optimized for different GPUs and inference modes. You select a profile at deploy time using environment variables:

  • CONTAINER_ID: Identifies which model container to pull and run.
  • NIM_TAGS_SELECTOR: Selects a specific model profile within that container (model name, inference mode, batch size).

**Inference modes**

ASR NIMs support two inference modes:

  • Streaming: Sends audio in chunks and returns partial transcripts as they arrive (ideal for real-time applications like live captioning).
  • Offline: Processes the entire audio file at once and returns a single complete transcript (ideal for batch processing pre-recorded files).
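The practical difference between the two modes shows up in how a client feeds audio to the service. A minimal illustration in plain Python (no Riva dependencies; `audio_chunks` is an illustrative helper name, not part of the NIM API): a streaming client slices audio into fixed-size chunks and sends them one at a time, while an offline client sends the whole buffer in a single request.

```python
def audio_chunks(data: bytes, chunk_size: int = 4096):
    """Yield fixed-size chunks, as a streaming client would send them."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

# Streaming: the service sees many small pieces and can return partial
# transcripts before the audio has finished arriving.
audio = bytes(10_000)  # stand-in for raw audio bytes
chunks = list(audio_chunks(audio))
print(len(chunks))  # 3 chunks of up to 4096 bytes each

# Offline: the service sees the whole buffer at once and returns one
# complete transcript after processing it.
whole = audio  # a single payload, a single response
```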

Step 1: Deploy the ASR NIM Microservice#

Choose a model and deploy it. This tutorial uses the Parakeet 1.1b CTC English (en-US) model with mode=all, which enables both streaming and offline inference so you can try both.

Tip

To find all available ASR models and modes, refer to Supported Models.

export CONTAINER_ID=parakeet-1-1b-ctc-en-us
export NIM_TAGS_SELECTOR="name=parakeet-1-1b-ctc-en-us,mode=all"

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

This command starts a Docker container that downloads the Parakeet model from NGC (using your API key), optimizes it for your GPU, and starts serving requests on two ports: HTTP on 9000 and gRPC on 50051.

The first run downloads the model and can build a TensorRT engine for your GPU. This can take up to 30 minutes depending on your network and GPU. Subsequent starts are faster because the model is cached.

Tip

To reduce memory usage, you can deploy only one inference mode: use mode=str for streaming only or mode=offline for offline only. The mode=all setting loads both, which uses more GPU memory but lets you try both modes in this tutorial.

Note

The Parakeet 1.1b CTC English (en-US) model can report “Too many open files”. Add --ulimit nofile=2048:2048 to the docker run command as a workaround.

Step 2: Check Service Readiness#

Open a new terminal (leave the NIM container running in the first one) and check that the service is ready to accept requests.

curl -X 'GET' 'http://localhost:9000/v1/health/ready'

Expected response:

{"status":"ready"}

If you receive a connection error, the NIM microservice is still starting up. Wait and retry. The health endpoint tells you when model loading and optimization are complete.
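Rather than retrying the curl command by hand, you can poll the same endpoint from a script. The sketch below is stdlib-only and is my own helper, not part of the NIM API; the retry logic takes the fetch callable as a parameter so connection handling stays separate from the polling loop.

```python
import json
import time
import urllib.request

def wait_until_ready(fetch, timeout_s: float = 1800, interval_s: float = 10) -> bool:
    """Poll a readiness probe until it reports ready or the timeout expires.

    `fetch` is any zero-argument callable returning the response body as a
    string; connection errors are treated as "still starting up".
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if json.loads(fetch()).get("status") == "ready":
                return True
        except Exception:
            pass  # service not up yet; keep retrying
        time.sleep(interval_s)
    return False

def fetch_health() -> str:
    # The same endpoint the curl command above queries.
    with urllib.request.urlopen("http://localhost:9000/v1/health/ready") as r:
        return r.read().decode()

# wait_until_ready(fetch_health)  # blocks until {"status":"ready"}
```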

Step 3: Get Sample Audio#

Copy a sample WAV file from the running container to use as test input.

docker cp $CONTAINER_ID:/opt/riva/wav/en-US_sample.wav .

You can also use your own audio file in mono, 16-bit WAV, OPUS, or FLAC format.

Step 4: List Available Models#

Before sending a transcription request, check which models the NIM microservice is serving.

python3 python-clients/scripts/asr/transcribe_file.py \
  --server 0.0.0.0:50051 \
  --list-models

You should see model names containing streaming and offline, confirming that both inference modes are available (because you deployed with mode=all).

In production, you can select models by name in your API calls. This step shows you how to discover what is available in a running NIM microservice.

Step 5: Run Inference#

Streaming Transcription#

Streaming transcription sends audio to the service in chunks and receives partial transcription results as they arrive. This is the mode you would use for live microphone input or real-time captioning.

gRPC#

python3 python-clients/scripts/asr/transcribe_file.py \
  --server 0.0.0.0:50051 \
  --language-code en-US --automatic-punctuation \
  --input-file en-US_sample.wav

This command outputs text incrementally as the audio is processed. You should see partial transcripts that are refined as more audio arrives. This is the streaming behavior: the service does not wait for the full file before returning results.

WebSocket#

python3 python-clients/scripts/asr/realtime_asr_client.py \
  --server 0.0.0.0:9000 \
  --language-code en-US --automatic-punctuation \
  --input-file en-US_sample.wav

This client uses the HTTP/WebSocket endpoint (port 9000) instead of gRPC (port 50051). The audio file is streamed to the service in chunks, simulating a real-time audio source. Both approaches produce streaming transcription, but WebSocket is easier to integrate from web browsers.

Offline Transcription#

Offline transcription sends the entire audio file in one request and receives a single, complete transcript. This is the mode you would use for batch processing recorded files where real-time results are not needed.

gRPC#

python3 python-clients/scripts/asr/transcribe_file_offline.py \
  --server 0.0.0.0:50051 \
  --language-code en-US --automatic-punctuation \
  --input-file en-US_sample.wav

HTTP#

curl -s http://0.0.0.0:9000/v1/audio/transcriptions -F language=en-US \
  -F file="@en-US_sample.wav"

The full file is sent in one request, and you receive the complete transcript in one response. There are no partial results. The HTTP endpoint follows a standard REST pattern, making it easy to integrate from any language or tool that can make HTTP requests.
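The same request can be made from Python. The sketch below is stdlib-only; `build_multipart` is an illustrative helper of my own that assembles a multipart/form-data body with the same `language` and `file` fields the curl command sends.

```python
import uuid

def build_multipart(fields: dict, file_field: str,
                    filename: str, file_bytes: bytes):
    """Build a multipart/form-data body with text fields plus one file part.

    Returns (body, content_type); POST the body with that Content-Type
    header to http://0.0.0.0:9000/v1/audio/transcriptions.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

# b"RIFF..." below is a stand-in; read your real WAV file's bytes instead.
body, ctype = build_multipart({"language": "en-US"}, "file",
                              "en-US_sample.wav", b"RIFF...")
```

To send it, pass `body` as the data of a `urllib.request.Request` with the returned value as the `Content-Type` header.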


What You Learned#

In this tutorial, you did the following:

  • Deployed a NIM container using CONTAINER_ID and NIM_TAGS_SELECTOR to select a model and inference mode, and saw that the container handles model download, optimization, and serving automatically.

  • Verified readiness using the health endpoint, which tells you when the service is ready to accept requests.

  • Discovered available models using the --list-models flag, which shows what a running NIM is serving.

  • Compared streaming and offline transcription. Streaming returns partial results in real time and is suited for live applications. Offline processes the full audio and returns a complete result, suited for batch workflows.

  • Used both gRPC and HTTP to interact with the service. gRPC is the primary protocol for streaming. HTTP/REST is simpler for offline requests and integration.


Next Steps#

Use the following links to continue your learning journey.