Troubleshooting NVIDIA TTS NIM Microservice Issues

This page covers issues specific to the NVIDIA TTS NIM microservices. For issues shared across all NVIDIA Speech NIM microservices, see Common Issues.

gRPC Message Size Limit Exceeded

Symptom

Offline synthesis fails with a gRPC error when synthesizing long text. The error message indicates the response exceeds the maximum message size.

Cause

gRPC limits message size to 4 MB by default. Long input text can produce audio that exceeds this limit when returned as a single response in offline mode. TTS output is capped at 20 seconds of audio per request.

Solution

Use streaming synthesis instead of offline synthesis. Streaming returns audio in chunks and handles arbitrarily long text.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Your long input text here" \
  --voice Magpie-Multilingual.EN-US.Aria \
  --stream \
  --output output.wav

Invalid or Unrecognized Voice Name

Symptom

The synthesis request returns an error indicating the voice name is invalid or not found.

Cause

Voice names are model-specific, and their format varies by model. For example, Magpie TTS Multilingual voices follow the format Magpie-Multilingual.LOCALE.Speaker. Misspelling a voice name or requesting a voice that belongs to a different model triggers this error.

Solution

  1. List the available voices for the deployed model:

    python3 python-clients/scripts/tts/talk.py \
      --server 0.0.0.0:50051 \
      --list-voices
    
  2. Use the exact voice name from the list. Voice name formats vary by model:

    Model                    Format                              Example
    Magpie TTS Multilingual  Magpie-Multilingual.LOCALE.Speaker  Magpie-Multilingual.EN-US.Aria
    Magpie TTS Zeroshot      Magpie-ZeroShot.Speaker             Magpie-ZeroShot.Female-1
    Magpie TTS Flow          English-US-Magpie-Flow.Speaker      English-US-Magpie-Flow.Female-1
    RAD-TTS HiFi-GAN         English-US-RadTTS.Speaker           English-US-RadTTS.Female-1

  3. Verify you are using a voice that matches the deployed model. For example, Magpie-ZeroShot.Female-1 is only available on the Magpie TTS Zeroshot model. See the TTS support matrix for all available voices per model.
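
As a quick client-side sanity check before sending a request, the prefixes from the table above can be matched against a voice name. This is a minimal sketch: the `VOICE_PREFIXES` mapping and both helper functions are hypothetical names introduced for illustration, not part of the NIM or the Python clients. It only checks the prefix, so the output of `--list-voices` remains the authoritative source for which speakers exist.

```python
# Hypothetical helper: prefixes are copied from the voice name table above.
VOICE_PREFIXES = {
    "Magpie TTS Multilingual": "Magpie-Multilingual.",
    "Magpie TTS Zeroshot": "Magpie-ZeroShot.",
    "Magpie TTS Flow": "English-US-Magpie-Flow.",
    "RAD-TTS HiFi-GAN": "English-US-RadTTS.",
}


def model_for_voice(voice_name):
    """Return the model whose prefix matches voice_name, or None if none does."""
    for model, prefix in VOICE_PREFIXES.items():
        # Require at least one character after the prefix (the speaker part).
        if voice_name.startswith(prefix) and len(voice_name) > len(prefix):
            return model
    return None


def voice_matches_model(voice_name, deployed_model):
    """True if the voice name carries the prefix of the deployed model."""
    return model_for_voice(voice_name) == deployed_model
```

For example, `voice_matches_model("Magpie-ZeroShot.Female-1", "Magpie TTS Multilingual")` returns `False`, which flags the mismatch described in step 3 before the request reaches the server.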

Streaming HTTP Output Is Raw Audio (Not WAV)

Symptom

The audio file produced by the streaming HTTP endpoint (/v1/audio/synthesize_online) cannot be played or sounds like static. Audio players report an invalid or unrecognized format.

Cause

The streaming HTTP API returns raw LPCM audio data without a WAV header. Opening the raw file directly in an audio player fails because the player cannot determine the sample rate, bit depth, or channel count.

Solution

Convert the raw output to WAV using sox. Match the sample rate to your model: 22050 Hz for Magpie models, 44100 Hz for RAD-TTS.

curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
  -F language=en-US \
  -F text="Your text here" \
  -F voice=Magpie-Multilingual.EN-US.Aria \
  -F sample_rate_hz=22050 \
  --output output.raw
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav

Match the -r value to the sample_rate_hz used in the request.
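
If sox is not available, the same conversion can be sketched with Python's standard `wave` module. The function name `raw_pcm_to_wav` is hypothetical, and the sketch assumes the raw data is 16-bit signed little-endian mono PCM, mirroring the `-b 16 -e signed -c 1` flags in the sox command above.

```python
import wave


def raw_pcm_to_wav(raw_path, wav_path, sample_rate_hz=22050,
                   channels=1, sample_width_bytes=2):
    """Wrap headerless PCM in a WAV container and return the frame count.

    Assumes 16-bit signed little-endian samples by default; pass the same
    sample_rate_hz that was used in the synthesis request.
    """
    with open(raw_path, "rb") as raw_file:
        pcm = raw_file.read()

    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width_bytes)
        wav_file.setframerate(sample_rate_hz)
        # writeframes() also fills in the header's data length on close.
        wav_file.writeframes(pcm)

    return len(pcm) // (channels * sample_width_bytes)
```

Calling `raw_pcm_to_wav("output.raw", "output.wav", sample_rate_hz=22050)` produces a file that audio players can open, because the header now carries the sample rate, bit depth, and channel count.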

Voice Cloning Audio Prompt Rejected

Symptom

The voice cloning request fails or returns an error indicating the audio prompt is invalid.

Cause

The audio prompt does not meet the requirements. Magpie TTS Zeroshot and Magpie TTS Flow require a reference audio file that is:

  • 3 to 5 seconds in duration (recommended).

  • 16-bit mono WAV format.

  • 22.05 kHz sample rate.

Solution

  1. Verify the audio prompt format:

    sox --info reference.wav
    
  2. Convert the audio to the required format if needed:

    sox input.wav -r 22050 -c 1 -b 16 reference.wav
    
  3. Trim the audio to 3–5 seconds. The following command keeps the first 5 seconds:

    sox reference.wav trimmed.wav trim 0 5
    
  4. For Magpie TTS Flow, provide the --zero_shot_transcript parameter with the exact transcript of the audio prompt.
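
The format requirements listed under Cause can also be checked programmatically before a request is sent. The following is a minimal sketch using Python's standard `wave` module. The function name `check_audio_prompt` and its default thresholds are illustrative assumptions taken from the requirements on this page, not an API of the NIM. Note that `wave.open` raises `wave.Error` for files that are not PCM WAV.

```python
import wave


def check_audio_prompt(path, expected_rate_hz=22050,
                       min_seconds=3.0, max_seconds=5.0):
    """Return a list of problems found in a reference WAV file (empty if none)."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_width_bytes = wav_file.getsampwidth()
        rate_hz = wav_file.getframerate()
        duration_seconds = wav_file.getnframes() / rate_hz

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_width_bytes != 2:
        problems.append(f"expected 16-bit samples, found {sample_width_bytes * 8}-bit")
    if rate_hz != expected_rate_hz:
        problems.append(f"expected {expected_rate_hz} Hz, found {rate_hz} Hz")
    if not min_seconds <= duration_seconds <= max_seconds:
        problems.append(
            f"duration {duration_seconds:.2f} s is outside the recommended "
            f"{min_seconds}-{max_seconds} s range"
        )
    return problems
```

An empty list from `check_audio_prompt("reference.wav")` means the file meets the listed requirements. Otherwise, each entry points to the sox conversion or trim step that applies.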