Getting Started#

Prerequisites#

Setup#

  • NVIDIA AI Enterprise License: Riva TTS NIM is available for self-hosting under the NVIDIA AI Enterprise (NVAIE) License.

  • NVIDIA GPU(s): Riva TTS NIM runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Refer to the Supported Models section for more information.

  • CPU: x86_64 architecture only for this release

  • OS: any Linux distribution supported by the NVIDIA Container Toolkit

  • CUDA Drivers: Follow the installation guide. We recommend:

    • Using a network repository as part of a package manager installation, skipping the CUDA toolkit installation as the libraries are available within the NIM container

    • Installing the open kernel modules for a specific version:

      Major Version | EOL        | Data Center & RTX/Quadro GPUs | GeForce GPUs
      --------------|------------|-------------------------------|-------------
      > 550         | TBD        | X                             | X
      550           | Feb 2025   | X                             | X
      545           | Oct 2023   | X                             | X
      535           | June 2026  | X                             |
      525           | Nov 2023   | X                             |
      470           | Sept 2024  | X                             |

  1. Install Docker.

  2. Install the NVIDIA Container Toolkit.

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the CUDA driver version and available GPUs.

   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
   | N/A   36C    P0            112W /  700W |   78489MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   |  No running processes found                                                             |
   +-----------------------------------------------------------------------------------------+

NGC Authentication#

Generate an API key#

To access NGC resources, you need an NGC API key. You can generate a key here: Generate Personal Key.

When creating an NGC API Personal key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More services can be included if this key is to be reused for other purposes.

Note

Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, refer to the NGC User Guide.

Export the API key#

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:

export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

More secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.

Docker Login to NGC#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.

Launching the NIM#

Models are available in two formats:

  • Prebuilt: Prebuilt models use TensorRT engines for optimized inference. ONNX or PyTorch models are used in cases where a TensorRT engine is not available. You can download and use them directly on the corresponding GPU.

  • RMIR: This intermediate model format (Riva Model Intermediate Representation) requires an additional deployment step before you can use it. You can optimize this model with TensorRT and deploy it on any supported GPU. This model format is automatically chosen if a prebuilt model is not available for your GPU.

Riva TTS NIM automatically downloads the prebuilt model on supported GPUs, or generates an optimized model on the fly from the RMIR model on other GPUs.

Refer to the Supported Models section to choose the desired model. Afterward, set CONTAINER_ID and NIM_TAGS_SELECTOR appropriately in the following commands.

Use one of the following commands to deploy the desired TTS model.

Magpie TTS Multilingual:

export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual

docker run -it --rm --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -e NIM_TAGS_SELECTOR \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Magpie TTS Zeroshot:

export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot

docker run -it --rm --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -e NIM_TAGS_SELECTOR \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Magpie TTS Flow:

export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow

docker run -it --rm --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -e NIM_TAGS_SELECTOR \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Note

Access to the Magpie TTS Zeroshot and Magpie TTS Flow models is restricted. Fill out this form to request access.

Note

It could take up to 30 minutes for the Docker container to be ready and start accepting requests, depending on your network speed.

Running Inference#

  1. Open a new terminal and run the following command to check if the service is ready to handle inference requests:

curl -X 'GET' 'http://localhost:9000/v1/health/ready'

If the service is ready, you get a response similar to the following.

{"status":"ready"}
  2. Install the Riva Python client

Riva uses gRPC APIs. You can download the proto files from Riva gRPC Proto files and compile them to a target language using the Protoc compiler. Riva clients are available in C++ and Python at the following locations.

Install Riva Python client

sudo apt-get install python3-pip
pip install -U nvidia-riva-client

Download Riva sample client

cd $HOME
git clone https://github.com/nvidia-riva/python-clients.git
  3. Run Inference

The following sections demonstrate the various TTS models through the gRPC and HTTP APIs, using a sample Python client and curl commands, respectively.

The Magpie TTS Multilingual model supports text to speech in multiple languages.

Ensure that you have deployed the Magpie TTS Multilingual model by referring to the Supported Models section.

The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.

List available models and voices

gRPC API:

python3 python-clients/scripts/tts/talk.py \
   --server 0.0.0.0:50051 \
   --list-voices

HTTP API:

curl -sS http://localhost:9000/v1/audio/list_voices | jq

The output is piped to the jq command to format the JSON string for better readability.

You will see an output listing the voices for the supported languages. The output below is truncated for brevity.

{
   "en-US,es-US,fr-FR,de-DE": {
      "voices": [
            "Magpie-Multilingual.EN-US.Sofia",
            "Magpie-Multilingual.EN-US.Ray",
            ...
            "Magpie-Multilingual.DE-DE.Leo",
            "Magpie-Multilingual.DE-DE.Aria"
      ]
   }
}
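
To consume this listing programmatically rather than through jq, you can query the HTTP endpoint directly; the following sketch assumes the requests package is installed and parses the JSON structure shown above.

import requests

# Query the voice listing from the NIM HTTP endpoint (port 9000 as configured above).
response = requests.get("http://localhost:9000/v1/audio/list_voices", timeout=10)
response.raise_for_status()

# Each key is a comma-separated group of language codes mapped to its available voices.
for languages, details in response.json().items():
    print(f"Languages: {languages}")
    for voice in details["voices"]:
        print(f"  {voice}")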

Synthesize speech with Offline API

With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech will be saved in output.wav.

gRPC API:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --voice Magpie-Multilingual.EN-US.Sofia \
   --output output.wav

HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
   -F language=en-US \
   -F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   -F voice=Magpie-Multilingual.EN-US.Sofia \
   --output output.wav

It is possible to intermix voices and languages to generate speech with different accents. For example, you can synthesize English speech with a French accent with the following command.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --voice Magpie-Multilingual.FR-FR.Pascal \
   --output output.wav

Note

By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.

Synthesize speech with Streaming API

With the Streaming API, the synthesized speech is returned in chunks as they are synthesized. The Streaming API is recommended for real-time applications that require the lowest latency.

gRPC API:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --voice Magpie-Multilingual.EN-US.Sofia \
   --stream \
   --output output.wav

HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
   -F language=en-US \
   -F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   -F voice=Magpie-Multilingual.EN-US.Sofia \
   -F sample_rate_hz=22050 \
   --output output.raw

The streaming HTTP API output is in raw LPCM format without a WAV header. A tool like sox can be used to prepend a WAV header and save the result as a WAV file.

sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
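
If sox is not available, the same conversion can be done with the Python standard library. The following sketch assumes the stream was requested at 22050 Hz, as in the curl command above, and that the output is 16-bit mono LPCM.

import wave

# Read the headerless LPCM stream produced by the streaming HTTP API.
with open("output.raw", "rb") as raw_file:
    pcm_data = raw_file.read()

# Write the same samples with a WAV header.
with wave.open("output.wav", "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(22050)  # must match sample_rate_hz in the request
    wav_file.writeframes(pcm_data)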

The Magpie TTS Zeroshot model supports text to speech in English using an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. This model supports streaming and offline inference.

The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.

Make sure you have deployed the Magpie TTS Zeroshot model by referring to the Supported Models section.

You can create an audio prompt using any voice recording application.

Guidelines for creating an effective audio prompt:

  • The audio format must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.

  • Aim for a duration of five seconds.

  • Trim silence from the beginning and end so that speech fills most of the prompt.

  • Record the prompt in a noise-free environment.

The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech will be saved in output.wav, and the voice will have characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the value passed to the --zero_shot_audio_prompt_file argument.
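
If you record your own prompt, a quick sanity check against the guidelines above can be done with the Python standard library; the path below is a hypothetical placeholder for your own recording.

import wave

prompt_path = "my_prompt.wav"  # hypothetical path to your own recording

with wave.open(prompt_path, "rb") as prompt:
    assert prompt.getnchannels() == 1, "prompt must be mono"
    assert prompt.getsampwidth() == 2, "prompt must be 16-bit PCM"
    assert prompt.getframerate() >= 22050, "sample rate must be at least 22.05 kHz"
    duration = prompt.getnframes() / prompt.getframerate()
    print(f"Prompt duration: {duration:.1f} s (aim for about five seconds)")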

Synthesize speech with Offline API

With the Offline API, the entire synthesized speech is returned to the client at once.

gRPC API:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
   --output output.wav

HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
   -F language=en-US \
   -F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   -F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
   --output output.wav

Note

The '@' prefix is mandatory in the HTTP audio_prompt parameter, as per curl syntax for uploading a file.

Synthesize speech with Streaming API

With the Streaming API, the synthesized speech is returned in chunks as they are synthesized. The Streaming API is recommended for real-time applications that require the lowest latency.

gRPC API:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
   --stream \
   --output output.wav

HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
   -F language=en-US \
   -F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   -F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
   --output output.wav

Note

The '@' prefix is mandatory in the HTTP audio_prompt parameter, as per curl syntax for uploading a file.

The Magpie TTS Flow model supports text to speech in English using an audio prompt and its transcript text. Voice characteristics from the audio prompt are applied to the synthesized output speech. Compared to the Magpie TTS Zeroshot model, this model additionally requires the prompt transcript text as input. This model supports only offline inference.

The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.

Make sure you have deployed the Magpie TTS Flow model by referring to the Supported Models section.

An audio prompt can be created using any voice recording application.

Guidelines for creating an effective audio prompt:

  • The audio format must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.

  • Aim for a duration of five seconds.

  • Trim silence from the beginning and end so that speech fills most of the prompt.

  • Record the prompt in a noise-free environment.

The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech will be saved in output.wav, and the voice will have characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the values passed to the --zero_shot_audio_prompt_file and --zero_shot_transcript arguments.

gRPC API:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
   --zero_shot_transcript "I consent to use my voice to create a synthetic voice." \
   --output output.wav

HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
   -F language=en-US \
   -F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   -F sample_rate_hz=22050 \
   -F audio_prompt="@$HOME/python-clients/data/examples/sample_audio_prompt.wav" \
   -F audio_prompt_transcript="I consent to use my voice to create a synthetic voice." \
   --output output.wav

Note

The Magpie TTS Flow model supports only offline APIs.

The '@' prefix is mandatory in the HTTP audio_prompt parameter, as per curl syntax for uploading a file.

The Fastpitch HifiGAN TTS model supports text to speech only in English (en-US).

Ensure that you have deployed the Fastpitch HifiGAN TTS model by referring to the Supported Models section.

List available models and voices

python3 python-clients/scripts/tts/talk.py \
   --server 0.0.0.0:50051 \
   --list-voices

Synthesize speech with Offline API

With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech will be saved in output.wav.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --output output.wav

Note

By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.

Synthesize speech with Streaming API

With the Streaming API, the synthesized speech is returned in chunks as they are synthesized. The Streaming API is recommended for real-time applications that require the lowest latency.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
   --language-code en-US \
   --text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
   --stream \
   --output output.wav

The sections above demonstrate the Riva TTS NIM features using sample Python clients. To build your own application in Python, you can refer to the provided Python code or try the Riva TTS Jupyter Notebook, which offers an interactive guide.
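
As a starting point, here is a minimal sketch of an offline synthesis request using the nvidia-riva-client package installed earlier. The exact client API can differ between package versions, and the 44100 Hz output sample rate is an assumption; adjust it as needed.

import wave

import riva.client

# Connect to the NIM gRPC endpoint exposed on port 50051 by the docker run command.
auth = riva.client.Auth(uri="localhost:50051")
tts = riva.client.SpeechSynthesisService(auth)

sample_rate_hz = 44100  # assumed output sample rate; adjust as needed
response = tts.synthesize(
    text="Experience the future of speech AI with Riva.",
    voice_name="Magpie-Multilingual.EN-US.Sofia",
    language_code="en-US",
    sample_rate_hz=sample_rate_hz,
)

# response.audio holds raw 16-bit mono LPCM; add a WAV header before saving.
with wave.open("output.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(sample_rate_hz)
    wav_file.writeframes(response.audio)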

Runtime Parameters for the Container#

The docker run commands above use the following flags:

-it
   --interactive + --tty (see Docker docs).

--rm
   Delete the container after it stops (see Docker docs).

--name=<container-name>
   Give a name to the NIM container. Use any preferred value.

--runtime=nvidia
   Ensure NVIDIA drivers are accessible in the container.

--gpus '"device=0"'
   Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs.

--shm-size=8GB
   Allocate host memory for multi-GPU communication.

-e NGC_API_KEY=$NGC_API_KEY
   Provide the container with the token necessary to download the appropriate models and resources from NGC. See NGC Authentication.

-e NIM_HTTP_API_PORT=<port>
   Specify the port to use for the HTTP endpoint. The port can have any value except 8000. Default: 9000.

-e NIM_GRPC_API_PORT=<port>
   Specify the port to use for the gRPC endpoint. Default: 50051.

-p 9000:9000
   Forward the port where the NIM HTTP server is published inside the container so it can be accessed from the host system. The left-hand side of : is the host port (9000 here), while the right-hand side is the container port where the NIM HTTP server is published. The container port can be any value except 8000.

-p 50051:50051
   Forward the port where the NIM gRPC server is published inside the container so it can be accessed from the host system. The left-hand side of : is the host port (50051 here), while the right-hand side is the container port where the NIM gRPC server is published.

-e NIM_TAGS_SELECTOR=<key=value,...>
   Filter tags in the automatic profile selector. This can be a list of key-value pairs, where the key is the profile property name and the value is the desired property value. For example: name=fastpitch-hifigan-en-us

The following environment variables configure SSL/TLS for the NIM endpoints:

NIM_SSL_MODE
   Optional.

NIM_SSL_KEY_PATH
   Required if NIM_SSL_MODE is enabled.

NIM_SSL_CERTS_PATH
   Required if NIM_SSL_MODE is enabled.

NIM_SSL_CA_CERTS_PATH
   Required if NIM_SSL_MODE="MTLS".

Model Caching#

On initial startup, the container downloads the models from NGC. You can skip this download step on future runs by caching the model locally using a cache directory as shown in the following example. For information on which NIMs support Prebuilt and RMIR formats, refer to the Support Matrix.

# Create the cache directory on the host machine
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE
chmod 777 $LOCAL_NIM_CACHE

# Set appropriate value for container ID
export CONTAINER_ID=magpie-tts-multilingual

# Set the appropriate values for NIM_TAGS_SELECTOR.
export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,model_type=prebuilt"

# Run the container with the cache directory mounted in the appropriate location
docker run -it --rm --name=$CONTAINER_ID \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_TAGS_SELECTOR \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
   nvcr.io/nim/nvidia/$CONTAINER_ID:latest

On subsequent runs, the models will be loaded from cache.

RMIR models need to be deployed before you can use them. The following commands deploy the models and export the generated models for later use.

# Create the cache directory on the host machine
export NIM_EXPORT_PATH=~/nim_export
mkdir -p $NIM_EXPORT_PATH
chmod 777 $NIM_EXPORT_PATH

# Set appropriate value for container ID
export CONTAINER_ID=riva-tts

# Set the appropriate values for <model> from the Supported Models table
export NIM_TAGS_SELECTOR="name=fastpitch-hifigan-en-us,model_type=rmir"

# Run the container with the export directory mounted in the appropriate location
docker run -it --rm --name=$CONTAINER_ID \
      --runtime=nvidia \
      --gpus '"device=0"' \
      --shm-size=8GB \
      -e NGC_API_KEY \
      -e NIM_TAGS_SELECTOR \
      -e NIM_HTTP_API_PORT=9000 \
      -e NIM_GRPC_API_PORT=50051 \
      -p 9000:9000 \
      -p 50051:50051 \
      -v $NIM_EXPORT_PATH:/opt/nim/export \
      -e NIM_EXPORT_PATH=/opt/nim/export \
      nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Once the model deployment is complete, the container terminates with the following log.

INFO:inference:Riva model generation completed
INFO:inference:Models exported to /opt/nim/export
INFO:inference:Exiting container

Subsequent runs can be made with the following command, which sets NIM_DISABLE_MODEL_DOWNLOAD=true so that the exported models are loaded instead of being downloaded from NGC.

# Run the container with the cache directory mounted in the appropriate location
docker run -it --rm --name=$CONTAINER_ID \
      --runtime=nvidia \
      --gpus '"device=0"' \
      --shm-size=8GB \
      -e NGC_API_KEY \
      -e NIM_TAGS_SELECTOR \
      -e NIM_DISABLE_MODEL_DOWNLOAD=true \
      -e NIM_HTTP_API_PORT=9000 \
      -e NIM_GRPC_API_PORT=50051 \
      -p 9000:9000 \
      -p 50051:50051 \
      -v $NIM_EXPORT_PATH:/opt/nim/export \
      -e NIM_EXPORT_PATH=/opt/nim/export \
      nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Stopping the Container#

The following commands stop and remove the running Docker container.

docker stop $CONTAINER_ID
docker rm $CONTAINER_ID