TTS HTTP REST API Reference#

Overview#

The TTS NIM exposes an HTTP REST API for speech synthesis on the port set by NIM_HTTP_API_PORT (default 9000). Synthesis endpoints accept multipart/form-data requests. Use this API when you need simple curl-based access or language-agnostic client integration, or when you want to avoid a gRPC dependency.

Base URL:

http://<address>:9000

For browser-friendly streaming and interactive applications, use the WebSocket Realtime API instead.


Endpoints#

GET /v1/audio/list_voices#

Returns all voices available on the running TTS NIM. Call this endpoint to discover which voice names to pass to the synthesis endpoints.

Example#

curl -s http://localhost:9000/v1/audio/list_voices | jq .

Response#

The response is a JSON object with a single key: a comma-separated string of all supported locale codes. The value is an object with a voices array listing every available voice name.

{
  "en-US,es-US,fr-FR,de-DE,zh-CN,vi-VN,it-IT,hi-IN,ja-JP": {
    "voices": [
      "Magpie-Multilingual.EN-US.Aria",
      "Magpie-Multilingual.EN-US.Aria.Neutral",
      "Magpie-Multilingual.EN-US.Aria.Calm",
      "Magpie-Multilingual.EN-US.Aria.Angry",
      "Magpie-Multilingual.EN-US.Aria.Happy",
      "Magpie-Multilingual.EN-US.Aria.Sad",
      "Magpie-Multilingual.EN-US.Aria.Fearful",
      "Magpie-Multilingual.EN-US.Jason",
      "..."
    ]
  }
}

Voice names follow the pattern Model.LOCALE.Speaker or Model.LOCALE.Speaker.Emotion. The base name (without an emotion suffix) uses the model’s default emotion style. The set of emotion suffixes varies by speaker and is drawn from: Neutral, Calm, Angry, Happy, Sad, Fearful, Disgust, PleasantSurprised.
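
The naming pattern above can be split mechanically on the dots. The following is a minimal sketch, assuming that no component of a voice name contains a dot itself; VoiceName, parse_voice_name, and voices_by_speaker are hypothetical helper names, not part of the NIM API.

```python
from typing import NamedTuple, Optional


class VoiceName(NamedTuple):
    model: str
    locale: str
    speaker: str
    emotion: Optional[str]  # None means the model's default emotion style


def parse_voice_name(name: str) -> VoiceName:
    """Split 'Model.LOCALE.Speaker' or 'Model.LOCALE.Speaker.Emotion'."""
    parts = name.split(".")
    if len(parts) == 3:
        return VoiceName(parts[0], parts[1], parts[2], None)
    if len(parts) == 4:
        return VoiceName(parts[0], parts[1], parts[2], parts[3])
    raise ValueError(f"unexpected voice name format: {name!r}")


def voices_by_speaker(response: dict) -> dict:
    """Group the emotion variants in a list_voices response by (locale, speaker)."""
    grouped = {}
    for entry in response.values():
        for name in entry.get("voices", []):
            voice = parse_voice_name(name)
            grouped.setdefault((voice.locale, voice.speaker), []).append(voice.emotion)
    return grouped
```

Passing the parsed JSON body of /v1/audio/list_voices to voices_by_speaker gives one entry per speaker, which is convenient for checking whether a given emotion exists before sending a synthesis request.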

Status Codes#

| Code | Description |
| --- | --- |
| 200 OK | Voice list returned successfully. |
| 503 Service Unavailable | The NIM is still initializing. |


POST /v1/audio/synthesize#

Synthesizes speech from text and returns a complete WAV audio file in a single response.

Content-Type: multipart/form-data

Request Parameters#

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to synthesize. Maximum 2,000 characters after normalization. Omitting or sending empty text returns 400. |
| language | string | Yes | | BCP-47 language code (for example, en-US, es-US, fr-FR, de-DE, zh-CN, vi-VN, it-IT, hi-IN, ja-JP). Omitting or sending empty language returns 400. |
| voice | string | No | model default | Voice name as returned by /v1/audio/list_voices (for example, Magpie-Multilingual.EN-US.Aria or Magpie-Multilingual.EN-US.Aria.Happy). Omitting uses the model’s built-in default voice. |
| sample_rate_hz | integer | No | 22050 | Output audio sample rate in Hz. |
| encoding | string | No | LINEAR_PCM | Output audio encoding. Only LINEAR_PCM is currently supported. |
| custom_dictionary | string | No | "" | Custom pronunciation rules string. |
| audio_prompt | file | No | | WAV audio file for zero-shot voice cloning. Requires a zero-shot capable model (Magpie TTS Zeroshot or Magpie TTS Flow). Format: 16-bit mono WAV, 22.05 kHz or higher. |
| audio_prompt_transcript | string | No | | Verbatim transcript of the audio_prompt recording. Required by Magpie TTS Flow; unused by other models. |
| prompt_quality | integer | No | 20 | Voice adaptation quality for zero-shot synthesis. |

Examples#

Standard synthesis:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice="Magpie-Multilingual.EN-US.Aria" \
  --output output.wav

With sample rate and emotion voice:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="I am delighted to help you today." \
  -F voice="Magpie-Multilingual.EN-US.Aria.Happy" \
  -F sample_rate_hz=44100 \
  --output output.wav

Zero-shot voice cloning:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F audio_prompt=@sample_audio_prompt.wav \
  --output output.wav

Note

When passing a file to audio_prompt, the @ prefix in curl is required, for example -F audio_prompt=@prompt.wav. Without it, curl sends the literal string as the field value instead of the file contents.
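
For clients that cannot shell out to curl, the same multipart/form-data request can be assembled by hand. The following is a minimal sketch using only the Python standard library; encode_multipart and build_synthesize_request are hypothetical helper names, and the audio/wav part type for uploaded files is an assumption rather than a documented requirement.

```python
import urllib.request
import uuid
from typing import Optional


def encode_multipart(fields: dict, files: Optional[dict] = None):
    """Return (body, content_type) for a multipart/form-data request.

    fields maps form names to values; files maps form names to
    (filename, bytes) tuples.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(str(value).encode("utf-8") + b"\r\n")
    for name, (filename, data) in (files or {}).items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'.encode()
        )
        # Assumption: the server accepts audio/wav as the part type.
        parts.append(b"Content-Type: audio/wav\r\n\r\n")
        parts.append(data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_synthesize_request(base_url, text, language, voice=None, sample_rate_hz=None):
    """Build (but do not send) a POST request for /v1/audio/synthesize."""
    fields = {"text": text, "language": language}
    if voice is not None:
        fields["voice"] = voice
    if sample_rate_hz is not None:
        fields["sample_rate_hz"] = sample_rate_hz
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        f"{base_url}/v1/audio/synthesize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Sending the request with urllib.request.urlopen(request) and reading the response body should yield the WAV bytes described in the Response section that follows.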

Response#

A WAV file (audio/wav) at the requested sample rate. Default output: 16-bit mono, 22050 Hz.

Content-Type: audio/wav

Status Codes#

| Code | Description |
| --- | --- |
| 200 OK | Synthesis succeeded. Response body is a WAV file. |
| 400 Bad Request | text is empty, language is empty, the specified voice does not exist for the given language, encoding is not LINEAR_PCM, or text exceeds 2,000 normalized characters. Response body: {"detail": "<reason>"}. |
| 503 Service Unavailable | The NIM is still initializing. |


POST /v1/audio/synthesize_online#

Synthesizes speech from text and streams the audio back as raw LPCM chunks as they are generated, using HTTP chunked transfer encoding. Use this endpoint when low time-to-first-audio is important.

Content-Type: multipart/form-data

Request Parameters#

The request parameters are identical to those of POST /v1/audio/synthesize.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to synthesize. Maximum 2,000 normalized characters. |
| language | string | Yes | | BCP-47 language code. |
| voice | string | No | model default | Voice name from /v1/audio/list_voices. |
| sample_rate_hz | integer | No | 22050 | Output sample rate in Hz. |
| encoding | string | No | LINEAR_PCM | Only LINEAR_PCM is supported. |
| custom_dictionary | string | No | "" | Custom pronunciation rules. |
| audio_prompt | file | No | | WAV reference audio for zero-shot cloning. |
| audio_prompt_transcript | string | No | | Transcript of the audio prompt (Magpie TTS Flow only). |
| prompt_quality | integer | No | 20 | Zero-shot adaptation quality. |

Example#

curl -sS http://localhost:9000/v1/audio/synthesize_online \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice="Magpie-Multilingual.EN-US.Aria" \
  -F sample_rate_hz=22050 \
  --output output.raw

Response Format#

The response body is raw 16-bit signed LPCM audio (no WAV header), delivered as a chunked stream (Transfer-Encoding: chunked). No Content-Type header is set on the response. Before playback or further processing, add a WAV header:

# Option 1: Python standard library
python3 -c "
import wave, pathlib
w = wave.open('output.wav', 'wb')
w.setnchannels(1)
w.setsampwidth(2)
w.setframerate(22050)
w.writeframes(pathlib.Path('output.raw').read_bytes())
w.close()
"

# Option 2: sox
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav

Adjust -r / setframerate to match the sample_rate_hz value from the request.
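
When the stream is consumed programmatically, the WAV header can be added while chunks arrive instead of in a separate pass. The following is a minimal sketch, assuming the chunks are raw 16-bit mono LPCM as described above; write_pcm_stream_as_wav is a hypothetical helper name, and the chunk source can be any iterable of bytes.

```python
import wave
from typing import Iterable


def write_pcm_stream_as_wav(chunks: Iterable[bytes], out, sample_rate_hz: int = 22050) -> int:
    """Write raw 16-bit mono LPCM chunks to a WAV file and return the frame count.

    out may be a path or a seekable binary file object. Chunk boundaries do not
    need to align with sample boundaries; the header is patched on close.
    """
    total_bytes = 0
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)                # mono
        wav.setsampwidth(2)                # 16-bit samples
        wav.setframerate(sample_rate_hz)   # must match sample_rate_hz in the request
        for chunk in chunks:
            wav.writeframesraw(chunk)
            total_bytes += len(chunk)
    return total_bytes // 2
```

With urllib, one way to obtain the chunks is iter(lambda: response.read(4096), b"") on the object returned by urllib.request.urlopen, which decodes chunked transfer encoding transparently. The 4096-byte read size is an arbitrary choice.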

Status Codes#

| Code | Description |
| --- | --- |
| 200 OK | Stream started. Response body is raw LPCM audio. |
| 400 Bad Request | text is empty, language is empty, the voice does not exist, or text exceeds 2,000 normalized characters. |
| 503 Service Unavailable | The NIM is still initializing. |


GET /v1/health/ready#

Returns the readiness state of the NIM.

Example#

curl http://localhost:9000/v1/health/ready

Response#

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}

Status Codes#

| Code | Description |
| --- | --- |
| 200 OK | Ready to accept requests. |
| 503 Service Unavailable | Still initializing. |
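
Because the synthesis endpoints return 503 until initialization finishes, a client or startup script may want to block on the readiness endpoint first. The following is a minimal polling sketch; is_ready and wait_until_ready are hypothetical helper names, and the timeout and interval defaults are arbitrary assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://localhost:9000") -> bool:
    """Return True if /v1/health/ready answers 200 with status "ready"."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as resp:
            return resp.status == 200 and json.load(resp).get("status") == "ready"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, 503 (raised as HTTPError), or a non-JSON body.
        return False


def wait_until_ready(check=is_ready, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The check argument is injectable so the loop can be exercised without a running NIM, or pointed at a different host with something like lambda: is_ready("http://<address>:9000").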


GET /v1/health/live#

Returns the liveness state of the NIM process. Used by container orchestrators to determine whether to restart the container.

Example#

curl http://localhost:9000/v1/health/live

Response#

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

GET /v1/models#

Returns the list of available models in OpenAI-compatible format.

Example#

curl -s http://localhost:9000/v1/models | jq .

Response#

{
  "object": "list",
  "data": [
    {
      "id": "unknown",
      "object": "model",
      "created": 0,
      "owned_by": "system"
    }
  ]
}

GET /v1/version#

Returns the NIM release and API version.

Example#

curl -s http://localhost:9000/v1/version | jq .

Response#

{
  "release": "1.8.0",
  "api": "3.1.0"
}

GET /v1/metadata#

Returns metadata about the deployed model, including the selected profile ID and NGC model URLs.

Example#

curl -s http://localhost:9000/v1/metadata | jq .

Response#

{
  "version": "1.8.0",
  "selectedModelProfileId": "<profile-hash>",
  "modelInfo": [
    {
      "modelUrl": "ngc://nim/nvidia/magpie-tts-multilingual:<tag>",
      "shortName": "magpie-tts-multilingual:<tag>"
    }
  ],
  "repository_override": "",
  "assetInfo": [],
  "licenseInfo": {}
}

GET /v1/metrics#

Returns runtime metrics in Prometheus text format.

Example#

curl http://localhost:9000/v1/metrics

Key Metrics#

| Metric | Description |
| --- | --- |
| num_requests_tts_total | Total TTS requests received since startup. |
| num_requests_tts_running | TTS requests currently being processed. |
| num_requests_tts_success_total | Total successful TTS requests. |
| num_characters_tts_total | Total characters synthesized across all requests. |
| request_duration_seconds_tts_total | Cumulative wall-clock time on TTS requests, in seconds. |
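
The counters above can be combined into simple derived figures. The following is a simplified sketch of a Prometheus text parser, not a full implementation of the exposition format: it sums samples across label sets and ignores timestamps. parse_prometheus_text and summarize are hypothetical helper names, and dividing the cumulative duration by the total request count is only a rough mean, because the source does not state which requests the duration counter covers.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each metric name to the sum of its sample values."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            name, *rest = line.split()
        if rest:
            metrics[name] = metrics.get(name, 0.0) + float(rest[0])
    return metrics


def summarize(metrics: dict) -> dict:
    """Derive a success ratio and a rough mean request duration."""
    total = metrics.get("num_requests_tts_total", 0.0)
    success = metrics.get("num_requests_tts_success_total", 0.0)
    duration = metrics.get("request_duration_seconds_tts_total", 0.0)
    return {
        "success_ratio": success / total if total else None,
        "mean_seconds_per_request": duration / total if total else None,
    }
```

Feeding the body of GET /v1/metrics to parse_prometheus_text and then to summarize gives a quick health snapshot. For anything beyond ad hoc checks, a Prometheus server scraping the endpoint is the more usual setup.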


Error Format#

All 4xx errors return {"detail": "<reason>"}:

| Scenario | HTTP | detail value |
| --- | --- | --- |
| text field empty or missing | 400 | Bad Request, empty text input |
| language field empty or missing | 400 | Bad Request, empty language code |
| Unknown voice for the given language | 400 | Model is not available on server: Voice <name> for language <code> not found. Please specify the voice name in your SynthesizeSpeechRequest. |
| Unsupported encoding value | 400 | Bad Request, invalid encoding |
| text exceeds 2,000 normalized characters | 400 | Error: Triton model failed during inference. Error message: ... Input text is larger than the maximum input length: <size> > 2000 |
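
Since the over-length case surfaces as a Triton inference message rather than a dedicated error code, a client may want to detect it by substring and react by splitting the text. The following is a minimal sketch; extract_detail and is_text_too_long are hypothetical helper names, and matching on the message text is an assumption that could break if the wording changes between releases.

```python
import json
from typing import Optional

# Substring taken from the documented detail value for over-length text.
TOO_LONG_MARKER = "larger than the maximum input length"


def extract_detail(body: bytes) -> Optional[str]:
    """Return the "detail" string from an error body, or None if absent."""
    try:
        payload = json.loads(body)
    except ValueError:  # also covers undecodable bytes
        return None
    if isinstance(payload, dict) and isinstance(payload.get("detail"), str):
        return payload["detail"]
    return None


def is_text_too_long(detail: Optional[str]) -> bool:
    """True when the detail message indicates the 2,000-character limit was hit."""
    return detail is not None and TOO_LONG_MARKER in detail
```

With urllib, the error body is available from the HTTPError exception through its read() method.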


Input Text Limits#

The TTS NIM enforces a limit of 2,000 characters per request on normalized text. Normalization expands numbers, abbreviations, and SSML tags before the limit is applied, so annotated input can exceed the cap with fewer raw characters than expected.

For longer content, split the source text on sentence or paragraph boundaries and send one request per chunk:

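# Number the output files so they stay in source order and never collide.
i=0
while IFS= read -r chunk; do
  [ -z "$chunk" ] && continue
  i=$((i + 1))
  # --form-string sends the value literally, so ';', '@', and '<' in the text are safe.
  curl -sS http://localhost:9000/v1/audio/synthesize \
    -F language=en-US \
    --form-string "text=$chunk" \
    -F voice="Magpie-Multilingual.EN-US.Aria" \
    --output "$(printf 'chunk_%04d.wav' "$i")"
done < paragraphs.txt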
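
The loop above assumes the input is already one chunk per line. Producing those chunks from running text can be sketched as follows; split_text is a hypothetical helper name, the sentence-boundary regex is deliberately naive, and the 1,500-character default is an assumed safety margin, because the limit applies to normalized text and the raw length is only a proxy for it.

```python
import re


def split_text(text: str, max_chars: int = 1500) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the budget is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

If a chunk still triggers the over-length error after normalization, lowering max_chars and retrying is the simplest remedy.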

Output Audio Format#

| Property | synthesize | synthesize_online |
| --- | --- | --- |
| Container | WAV (with header) | Raw 16-bit signed LPCM (no header) |
| Content-Type | audio/wav | (not set) |
| Transfer-Encoding | standard | chunked |
| Channels | 1 (mono) | 1 (mono) |
| Bit depth | 16-bit signed | 16-bit signed |
| Default sample rate | 22050 Hz | 22050 Hz |
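
These properties fix the data rate of the audio: one channel at 2 bytes per sample and 22050 samples per second is 44,100 bytes per second. The following sketch turns that arithmetic into helpers for estimating playback duration from a raw byte count; the function names are hypothetical.

```python
BYTES_PER_SAMPLE = 2  # 16-bit signed
CHANNELS = 1          # mono


def pcm_bytes_per_second(sample_rate_hz: int = 22050) -> int:
    """Bytes of raw LPCM produced per second of audio."""
    return sample_rate_hz * BYTES_PER_SAMPLE * CHANNELS


def pcm_duration_seconds(num_bytes: int, sample_rate_hz: int = 22050) -> float:
    """Playback duration of num_bytes of raw LPCM audio."""
    return num_bytes / pcm_bytes_per_second(sample_rate_hz)
```

This applies directly to the synthesize_online body. For a WAV file from synthesize, subtract the header size first, or read the frame count with the wave module.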


Port Configuration#

The HTTP port is configured with the NIM_HTTP_API_PORT environment variable (default: 9000). Avoid port 8000, which is reserved for the internal Triton HTTP endpoint.

docker run ... -e NIM_HTTP_API_PORT=9000 ...

For the complete list of runtime parameters, refer to Runtime Parameters.