TTS HTTP REST API Reference#

Overview#

The TTS NIM exposes an HTTP REST API for speech synthesis on the port set by NIM_HTTP_API_PORT (default 9000). Synthesis endpoints accept multipart/form-data requests. Use this API when you need simple curl-based access or language-agnostic client integration, or when you want to avoid a gRPC dependency.

Base URL:

http://<address>:9000

For browser-friendly streaming and interactive applications, use the WebSocket Realtime API instead.

Endpoints#

`GET /v1/audio/list_voices`#

Returns all voices available on the running TTS NIM. Call this endpoint to discover which voice names to pass to the synthesis endpoints.

Example#

curl -s http://localhost:9000/v1/audio/list_voices | jq .

Response#

The response is a JSON object with a single key: a comma-separated string of all supported locale codes. The value is an object with a voices array listing every available voice name.

{
  "en-US,es-US,fr-FR,de-DE,zh-CN,vi-VN,it-IT,hi-IN,ja-JP": {
    "voices": [
      "Magpie-Multilingual.EN-US.Aria",
      "Magpie-Multilingual.EN-US.Aria.Neutral",
      "Magpie-Multilingual.EN-US.Aria.Calm",
      "Magpie-Multilingual.EN-US.Aria.Angry",
      "Magpie-Multilingual.EN-US.Aria.Happy",
      "Magpie-Multilingual.EN-US.Aria.Sad",
      "Magpie-Multilingual.EN-US.Aria.Fearful",
      "Magpie-Multilingual.EN-US.Jason",
      "..."
    ]
  }
}

Voice names follow the pattern Model.LOCALE.Speaker or Model.LOCALE.Speaker.Emotion. The base name (without emotion suffix) uses the model’s default emotion style. Available emotion suffixes vary by speaker: Neutral, Calm, Angry, Happy, Sad, Fearful, Disgust, PleasantSurprised.
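The naming pattern above can be split mechanically on the client side. The following is a minimal sketch, assuming the response shape shown in the example and assuming that model, locale, and speaker names never contain a dot; the helper names `parse_voice_name` and `voices_by_speaker` are hypothetical and not part of the NIM.

```python
import json


def parse_voice_name(name):
    """Split 'Model.LOCALE.Speaker[.Emotion]' into its parts.

    Assumes none of the parts contains a dot (true for the names shown above).
    """
    parts = name.split(".")
    if len(parts) not in (3, 4) or not all(parts):
        raise ValueError(f"unexpected voice name: {name!r}")
    model, locale, speaker = parts[:3]
    emotion = parts[3] if len(parts) == 4 else None
    return {"model": model, "locale": locale, "speaker": speaker, "emotion": emotion}


def voices_by_speaker(payload):
    """Group emotion suffixes under (locale, speaker) for a list_voices response."""
    grouped = {}
    for entry in payload.values():
        for name in entry.get("voices", []):
            try:
                voice = parse_voice_name(name)
            except ValueError:
                continue  # skip anything that does not match the pattern
            emotions = grouped.setdefault((voice["locale"], voice["speaker"]), [])
            if voice["emotion"]:
                emotions.append(voice["emotion"])
    return grouped


# Abbreviated copy of the example response above.
sample = json.loads("""
{
  "en-US,es-US": {
    "voices": [
      "Magpie-Multilingual.EN-US.Aria",
      "Magpie-Multilingual.EN-US.Aria.Neutral",
      "Magpie-Multilingual.EN-US.Aria.Calm",
      "Magpie-Multilingual.EN-US.Jason"
    ]
  }
}
""")
grouped = voices_by_speaker(sample)
```

A speaker that maps to an empty list was listed only under its base name, so only the model's default emotion style is known to be available for it.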

Status Codes#

Code	Description
`200 OK`	Voice list returned successfully.
`503 Service Unavailable`	The NIM is still initializing.

`POST /v1/audio/synthesize`#

Synthesizes speech from text and returns a complete WAV audio file in a single response.

Content-Type: multipart/form-data

Request Parameters#

Parameter	Type	Required	Default	Description
`text`	string	Yes	—	Text to synthesize. Maximum 2,000 characters after normalization. Omitting or sending empty text returns `400`.
`language`	string	Yes	—	BCP-47 language code (for example, `en-US`, `es-US`, `fr-FR`, `de-DE`, `zh-CN`, `vi-VN`, `it-IT`, `hi-IN`, `ja-JP`). Omitting or sending empty language returns `400`.
`voice`	string	No	model default	Voice name as returned by `/v1/audio/list_voices` (for example, `Magpie-Multilingual.EN-US.Aria` or `Magpie-Multilingual.EN-US.Aria.Happy`). Omitting uses the model’s built-in default voice.
`sample_rate_hz`	integer	No	`22050`	Output audio sample rate in Hz.
`encoding`	string	No	`LINEAR_PCM`	Output audio encoding. Only `LINEAR_PCM` is currently supported.
`custom_dictionary`	string	No	`""`	Custom pronunciation rules string.
`audio_prompt`	file	No	—	WAV audio file for zero-shot voice cloning. Requires a zero-shot capable model (Magpie TTS Zeroshot or Magpie TTS Flow). Format: 16-bit mono WAV, 22.05 kHz or higher.
`audio_prompt_transcript`	string	No	—	Verbatim transcript of the `audio_prompt` recording. Required by Magpie TTS Flow; unused by other models.
`prompt_quality`	integer	No	`20`	Voice adaptation quality for zero-shot synthesis.

Examples#

Standard synthesis:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice="Magpie-Multilingual.EN-US.Aria" \
  --output output.wav

With sample rate and emotion voice:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="I am delighted to help you today." \
  -F voice="Magpie-Multilingual.EN-US.Aria.Happy" \
  -F sample_rate_hz=44100 \
  --output output.wav

Zero-shot voice cloning:

curl -sS http://localhost:9000/v1/audio/synthesize \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F audio_prompt=@sample_audio_prompt.wav \
  --output output.wav

Note

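When passing a file to audio_prompt, curl requires the @ prefix before the file path, for example -F audio_prompt=@prompt.wav. Without the prefix, curl sends the path as a literal string instead of uploading the file contents.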
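For clients that cannot shell out to curl, the same multipart/form-data request can be built with the Python standard library alone. This is a minimal sketch rather than an official client: `encode_multipart` and `synthesize` are hypothetical helper names, the upload filename `prompt.wav` is arbitrary, and the field names are the ones listed in the parameter table above.

```python
import urllib.request
import uuid


def encode_multipart(fields, files=None):
    """Encode form fields and optional files as multipart/form-data.

    fields: dict of name -> value (converted to str)
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    for name, (filename, data) in (files or {}).items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            b"Content-Type: audio/wav",
            b"",
            data,
        ]
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(base_url, text, language, voice=None, sample_rate_hz=None, audio_prompt=None):
    """POST to /v1/audio/synthesize and return the WAV bytes of the response."""
    fields = {"text": text, "language": language}
    if voice:
        fields["voice"] = voice
    if sample_rate_hz:
        fields["sample_rate_hz"] = sample_rate_hz
    files = {"audio_prompt": ("prompt.wav", audio_prompt)} if audio_prompt else None
    body, content_type = encode_multipart(fields, files)
    request = urllib.request.Request(
        base_url + "/v1/audio/synthesize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage against a running NIM (not executed here):
#   wav = synthesize("http://localhost:9000", "Hello.", "en-US",
#                    voice="Magpie-Multilingual.EN-US.Aria")
#   open("output.wav", "wb").write(wav)
```

Building the body by hand keeps the sketch dependency-free; a third-party HTTP library that supports multipart uploads would work equally well.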

Response#

A WAV file (audio/wav) at the requested sample rate. Default output: 16-bit mono, 22050 Hz.

Content-Type: audio/wav

Status Codes#

Code	Description
`200 OK`	Synthesis succeeded. Response body is a WAV file.
`400 Bad Request`	`text` is empty, `language` is empty, the specified `voice` does not exist for the given `language`, `encoding` is not `LINEAR_PCM`, or `text` exceeds 2,000 normalized characters. Response body: `{"detail": "<reason>"}`.
`503 Service Unavailable`	The NIM is still initializing.

`POST /v1/audio/synthesize_online`#

Synthesizes speech from text and streams the audio back as raw LPCM chunks as they are generated, using HTTP chunked transfer encoding. Use this endpoint when low time-to-first-audio is important.

Content-Type: multipart/form-data

Request Parameters#

The parameters are identical to those of POST /v1/audio/synthesize and are summarized here for convenience.

Parameter	Type	Required	Default	Description
`text`	string	Yes	—	Text to synthesize. Maximum 2,000 normalized characters.
`language`	string	Yes	—	BCP-47 language code.
`voice`	string	No	model default	Voice name from `/v1/audio/list_voices`.
`sample_rate_hz`	integer	No	`22050`	Output sample rate in Hz.
`encoding`	string	No	`LINEAR_PCM`	Only `LINEAR_PCM` is supported.
`custom_dictionary`	string	No	`""`	Custom pronunciation rules.
`audio_prompt`	file	No	—	WAV reference audio for zero-shot cloning.
`audio_prompt_transcript`	string	No	—	Transcript of the audio prompt (Magpie TTS Flow only).
`prompt_quality`	integer	No	`20`	Zero-shot adaptation quality.

Example#

curl -sS http://localhost:9000/v1/audio/synthesize_online \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice="Magpie-Multilingual.EN-US.Aria" \
  -F sample_rate_hz=22050 \
  --output output.raw

Response Format#

The response body is raw 16-bit signed LPCM audio (no WAV header), delivered as a chunked stream (Transfer-Encoding: chunked). No Content-Type header is set on the response. Before playback or further processing, add a WAV header:

# Option 1: Python standard library
python3 -c "
import wave, pathlib
w = wave.open('output.wav', 'wb')
w.setnchannels(1)
w.setsampwidth(2)
w.setframerate(22050)
w.writeframes(pathlib.Path('output.raw').read_bytes())
w.close()
"

# Option 2: sox
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav

Adjust -r / setframerate to match the sample_rate_hz value from the request.
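When the stream is consumed incrementally instead of being saved to a file first, the pieces a client reads are plain byte ranges and may end in the middle of a 16-bit sample. The sketch below assumes exactly that (the reference does not state whether chunks are sample-aligned) and re-chunks the stream so each piece holds whole samples. `aligned_chunks` is a hypothetical helper name.

```python
def aligned_chunks(chunks, sample_width=2):
    """Re-chunk a byte stream so every yielded piece contains whole samples.

    chunks: any iterable of bytes objects, for example successive
            response.read(4096) results from the streaming endpoint.
    sample_width: bytes per sample (2 for 16-bit LPCM).
    """
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % sample_width)
        if usable:
            yield data[:usable]
        carry = data[usable:]
    # A leftover partial sample would mean the stream was cut short; it is dropped.


# Three network reads that split samples at odd byte offsets.
pieces = list(aligned_chunks([b"\x01", b"\x02\x03", b"\x04\x05\x06"]))
```

Each yielded piece can be handed directly to an audio device or appended to a `wave` writer without risking a byte-swapped sample at a chunk boundary.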

Status Codes#

Code	Description
`200 OK`	Stream started. Response body is raw LPCM audio.
`400 Bad Request`	`text` is empty, `language` is empty, the `voice` does not exist, or `text` exceeds 2,000 normalized characters.
`503 Service Unavailable`	The NIM is still initializing.

`GET /v1/health/ready`#

Returns the readiness state of the NIM.

Example#

curl http://localhost:9000/v1/health/ready

Response#

{
  "object": "health.response",
  "message": "ready",
  "status": "ready"
}

Status Codes#

Code	Description
`200 OK`	Ready to accept requests.
`503 Service Unavailable`	Still initializing.
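Because the readiness endpoint returns 503 during initialization, clients usually poll it before sending the first synthesis request. The following is a minimal sketch of such a loop; `probe_ready` and `wait_until_ready` are hypothetical helper names, and the timeout and interval defaults are arbitrary assumptions, not recommended values.

```python
import time
import urllib.error
import urllib.request


def probe_ready(base_url, timeout=2.0):
    """Return True if GET /v1/health/ready answers with status 200."""
    try:
        with urllib.request.urlopen(base_url + "/v1/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 503 responses (HTTPError), refused connections, and timeouts.
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses.

    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage against a running NIM (not executed here):
#   ok = wait_until_ready(lambda: probe_ready("http://localhost:9000"))
```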

`GET /v1/health/live`#

Returns the liveness state of the NIM process. Container orchestrators use this endpoint to decide whether to restart the container.

Example#

curl http://localhost:9000/v1/health/live

Response#

{
  "object": "health.response",
  "message": "live",
  "status": "live"
}

`GET /v1/models`#

Returns the list of available models in OpenAI-compatible format.

Example#

curl -s http://localhost:9000/v1/models | jq .

Response#

{
  "object": "list",
  "data": [
    {
      "id": "unknown",
      "object": "model",
      "created": 0,
      "owned_by": "system"
    }
  ]
}

`GET /v1/version`#

Returns the NIM release and API version.

Example#

curl -s http://localhost:9000/v1/version | jq .

Response#

{
  "release": "1.8.0",
  "api": "3.1.0"
}

`GET /v1/metadata`#

Returns metadata about the deployed model, including the selected profile ID and NGC model URLs.

Example#

curl -s http://localhost:9000/v1/metadata | jq .

Response#

{
  "version": "1.8.0",
  "selectedModelProfileId": "<profile-hash>",
  "modelInfo": [
    {
      "modelUrl": "ngc://nim/nvidia/magpie-tts-multilingual:<tag>",
      "shortName": "magpie-tts-multilingual:<tag>"
    }
  ],
  "repository_override": "",
  "assetInfo": [],
  "licenseInfo": {}
}

`GET /v1/metrics`#

Returns runtime metrics in Prometheus text format.

Example#

curl http://localhost:9000/v1/metrics

Key Metrics#

Metric	Description
`num_requests_tts_total`	Total TTS requests received since startup.
`num_requests_tts_running`	TTS requests currently being processed.
`num_requests_tts_success_total`	Total successful TTS requests.
`num_characters_tts_total`	Total characters synthesized across all requests.
`request_duration_seconds_tts_total`	Cumulative wall-clock time spent on TTS requests, in seconds.
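The counters above can be combined into simple health indicators without a full Prometheus stack. The sketch below parses the text format and derives a success rate and a mean request time. It is an illustration under stated assumptions: the sample values are invented, labels are ignored (samples sharing a name are summed), and dividing cumulative duration by total requests assumes the duration counter covers every counted request, which this reference does not confirm. `parse_prometheus` and `summarize` are hypothetical helper names.

```python
def parse_prometheus(text):
    """Parse Prometheus text exposition into {metric_name: float}.

    Comment lines (# HELP, # TYPE) are skipped, labels are ignored, and
    samples that share a metric name are summed.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        if not rest:
            continue
        try:
            value = float(rest[0])  # an optional timestamp may follow; it is ignored
        except ValueError:
            continue
        values[name] = values.get(name, 0.0) + value
    return values


def summarize(metrics):
    """Derive a success rate and mean seconds per request from the TTS counters."""
    total = metrics.get("num_requests_tts_total", 0.0)
    succeeded = metrics.get("num_requests_tts_success_total", 0.0)
    duration = metrics.get("request_duration_seconds_tts_total", 0.0)
    return {
        "success_rate": succeeded / total if total else None,
        "avg_seconds_per_request": duration / total if total else None,
    }


# Illustrative values only, not captured from a real deployment.
SAMPLE = """\
# TYPE num_requests_tts_total counter
num_requests_tts_total 40
num_requests_tts_success_total 38
request_duration_seconds_tts_total 20.0
"""
summary = summarize(parse_prometheus(SAMPLE))
```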

Error Format#

All 4xx errors return {"detail": "<reason>"}:

Scenario	HTTP	`detail` value
`text` field empty or missing	`400`	`Bad Request, empty text input`
`language` field empty or missing	`400`	`Bad Request, empty language code`
Unknown `voice` for the given `language`	`400`	`Model is not available on server: Voice <name> for language <code> not found. Please specify the voice name in your SynthesizeSpeechRequest.`
Unsupported `encoding` value	`400`	`Bad Request, invalid encoding`
`text` exceeds 2,000 normalized characters	`400`	`Error: Triton model failed during inference. Error message: ... Input text is larger than the maximum input length: <size> > 2000`
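Client code often needs to branch on the failure reason rather than display the raw message. The following sketch maps a response to a short category by matching substrings of the `detail` values in the table above. The category labels and the `explain_error` name are hypothetical, and substring matching is an assumption that breaks if the server wording changes.

```python
import json


def explain_error(status, body):
    """Classify an error response using the documented `detail` strings.

    status: HTTP status code; body: response body as str or bytes.
    """
    if status == 503:
        return "not_ready"
    try:
        detail = str(json.loads(body).get("detail", ""))
    except (ValueError, AttributeError):
        detail = ""  # body was not a JSON object
    if "empty text input" in detail:
        return "empty_text"
    if "empty language code" in detail:
        return "empty_language"
    if "invalid encoding" in detail:
        return "invalid_encoding"
    if "Voice" in detail and "not found" in detail:
        return "unknown_voice"
    if "maximum input length" in detail:
        return "text_too_long"
    return "unknown"


# Usage with urllib (not executed here):
#   except urllib.error.HTTPError as err:
#       category = explain_error(err.code, err.read())
```

A caller could, for instance, retry on `not_ready`, split the input on `text_too_long`, and surface the other categories to the user.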

Input Text Limits#

The TTS NIM enforces a limit of 2,000 characters per request, measured on the normalized text. Normalization expands numbers, abbreviations, and SSML tags before the limit is applied, so input that contains them can exceed the limit even when the raw text is shorter than 2,000 characters.

For longer content, split the source text on sentence or paragraph boundaries and send one request per chunk:

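# Number the output files sequentially so the chunks keep their order
# and no two chunks overwrite each other.
n=0
while IFS= read -r chunk; do
  [ -z "$chunk" ] && continue
  n=$((n + 1))
  curl -sS http://localhost:9000/v1/audio/synthesize \
    -F language=en-US \
    -F text="$chunk" \
    -F voice="Magpie-Multilingual.EN-US.Aria" \
    --output "$(printf 'chunk_%03d.wav' "$n")"
done < paragraphs.txt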
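When the source text is not already split into short lines, the splitting step itself can be scripted. The following is a minimal sketch that greedily packs sentences into chunks under a raw-character budget. The `split_text` name is hypothetical, and the default budget of 1,500 is an assumption that leaves headroom for normalization; it does not guarantee that the normalized text stays under 2,000 characters.

```python
import re


def split_text(text, max_chars=1500):
    """Greedily pack sentences into chunks of at most max_chars raw characters.

    Sentences are detected by '.', '!' or '?' followed by whitespace. A single
    sentence longer than max_chars is cut at the budget as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("One. Two. Three.", max_chars=9)
```

If a chunk is still rejected with the maximum-input-length error, lowering `max_chars` and retrying that chunk is a reasonable fallback.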

Output Audio Format#

Property	`synthesize`	`synthesize_online`
Container	WAV (with header)	Raw 16-bit signed LPCM (no header)
Content-Type	`audio/wav`	(not set)
Transfer-Encoding	standard	`chunked`
Channels	1 (mono)	1 (mono)
Bit depth	16-bit signed	16-bit signed
Default sample rate	22050 Hz	22050 Hz
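The properties in the table can be checked on the client side with the standard `wave` module. The sketch below reads the header of a WAV byte string, such as a `synthesize` response body, and also wraps raw LPCM from `synthesize_online` in a WAV container in memory. `describe_wav` and `pcm_to_wav_bytes` are hypothetical helper names, and the test data here is generated silence, not server output.

```python
import io
import wave


def describe_wav(data):
    """Return channel count, bit depth, sample rate, and duration of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        rate = reader.getframerate()
        return {
            "channels": reader.getnchannels(),
            "bits": reader.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": reader.getnframes() / rate,
        }


def pcm_to_wav_bytes(pcm, sample_rate_hz=22050):
    """Wrap raw 16-bit mono LPCM in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate_hz)
        writer.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at the default rate stands in for real audio.
info = describe_wav(pcm_to_wav_bytes(b"\x00\x00" * 22050))
```

Pass the same `sample_rate_hz` that was sent in the request; raw LPCM carries no header, so a mismatch changes playback speed and pitch without raising an error.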

Port Configuration#

The HTTP port is configured with the NIM_HTTP_API_PORT environment variable (default: 9000). Avoid port 8000, which is reserved for the internal Triton HTTP endpoint.

docker run ... -e NIM_HTTP_API_PORT=9000 ...

For the complete list of runtime parameters, refer to Runtime Parameters.
