TTS HTTP REST API Reference#
Overview#
The TTS NIM exposes an HTTP REST API for speech synthesis on the port set by NIM_HTTP_API_PORT (default 9000). Synthesis endpoints accept multipart/form-data requests. Use this API when you need simple curl-based access or language-agnostic client integration, or when you want to avoid a gRPC dependency.
Base URL:
http://<address>:9000
For browser-friendly streaming and interactive applications, use the WebSocket Realtime API instead.
Endpoints#
GET /v1/audio/list_voices#
Returns all voices available on the running TTS NIM. Call this endpoint to discover which voice names to pass to the synthesis endpoints.
Example#
curl -s http://localhost:9000/v1/audio/list_voices | jq .
Response#
The response is a JSON object with a single key: a comma-separated string of all supported locale codes. The value is an object with a voices array listing every available voice name.
{
"en-US,es-US,fr-FR,de-DE,zh-CN,vi-VN,it-IT,hi-IN,ja-JP": {
"voices": [
"Magpie-Multilingual.EN-US.Aria",
"Magpie-Multilingual.EN-US.Aria.Neutral",
"Magpie-Multilingual.EN-US.Aria.Calm",
"Magpie-Multilingual.EN-US.Aria.Angry",
"Magpie-Multilingual.EN-US.Aria.Happy",
"Magpie-Multilingual.EN-US.Aria.Sad",
"Magpie-Multilingual.EN-US.Aria.Fearful",
"Magpie-Multilingual.EN-US.Jason",
"..."
]
}
}
Voice names follow the pattern Model.LOCALE.Speaker or Model.LOCALE.Speaker.Emotion. The base name (without emotion suffix) uses the model’s default emotion style. Available emotion suffixes vary by speaker: Neutral, Calm, Angry, Happy, Sad, Fearful, Disgust, PleasantSurprised.
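The naming pattern above lends itself to simple client-side parsing. The following is a minimal sketch, assuming that the model, locale, speaker, and emotion segments never contain a dot themselves; VoiceName, parse_voice_name, summarize_voices, and emotions_by_speaker are illustrative names and not part of the NIM.

```python
from typing import NamedTuple, Optional


class VoiceName(NamedTuple):
    model: str
    locale: str
    speaker: str
    emotion: Optional[str]  # None means the model's default emotion style


def parse_voice_name(name):
    """Split 'Model.LOCALE.Speaker[.Emotion]' into its segments."""
    parts = name.split(".")
    if len(parts) not in (3, 4) or not all(parts):
        raise ValueError(f"unexpected voice name: {name!r}")
    emotion = parts[3] if len(parts) == 4 else None
    return VoiceName(parts[0], parts[1], parts[2], emotion)


def summarize_voices(payload):
    """Return (locales, voices) from a decoded list_voices response body."""
    locales, voices = [], []
    for locale_key, entry in payload.items():
        # The key is a comma-separated string of locale codes.
        locales.extend(code.strip() for code in locale_key.split(",") if code.strip())
        voices.extend(entry.get("voices", []))
    return locales, voices


def emotions_by_speaker(voices):
    """Group the emotion suffixes offered for each speaker."""
    result = {}
    for name in voices:
        voice = parse_voice_name(name)
        emotions = result.setdefault(voice.speaker, [])
        if voice.emotion is not None:
            emotions.append(voice.emotion)
    return result
```

Because emotion suffixes vary by speaker, a client can use a grouping like this to check that a suffix exists before requesting it.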
Status Codes#
| Code | Description |
|---|---|
| | Voice list returned successfully. |
| | The NIM is still initializing. |
POST /v1/audio/synthesize#
Synthesizes speech from text and returns a complete WAV audio file in a single response.
Content-Type: multipart/form-data
Request Parameters#
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | Text to synthesize. Maximum 2,000 characters after normalization. Omitting or sending empty text returns an error. |
| language | string | Yes | — | BCP-47 language code (for example, en-US). |
| voice | string | No | model default | Voice name as returned by GET /v1/audio/list_voices. |
| sample_rate_hz | integer | No | 22050 | Output audio sample rate in Hz. |
| | string | No | | Output audio encoding. Only … |
| | string | No | | Custom pronunciation rules string. |
| audio_prompt | file | No | — | WAV audio file for zero-shot voice cloning. Requires a zero-shot capable model (Magpie TTS Zeroshot or Magpie TTS Flow). Format: 16-bit mono WAV, 22.05 kHz or higher. |
| | string | No | — | Verbatim transcript of the audio_prompt audio (Magpie TTS Flow only). |
| | integer | No | | Voice adaptation quality for zero-shot synthesis. |
Examples#
Standard synthesis:
curl -sS http://localhost:9000/v1/audio/synthesize \
-F language=en-US \
-F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
-F voice="Magpie-Multilingual.EN-US.Aria" \
--output output.wav
With sample rate and emotion voice:
curl -sS http://localhost:9000/v1/audio/synthesize \
-F language=en-US \
-F text="I am delighted to help you today." \
-F voice="Magpie-Multilingual.EN-US.Aria.Happy" \
-F sample_rate_hz=44100 \
--output output.wav
Zero-shot voice cloning:
curl -sS http://localhost:9000/v1/audio/synthesize \
-F language=en-US \
-F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
-F audio_prompt=@sample_audio_prompt.wav \
--output output.wav
Note
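When passing a file to audio_prompt, the @ prefix is required (for example, -F audio_prompt=@prompt.wav). The prefix tells curl to upload the file's contents; without it, curl sends the literal string as the field value.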
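For clients that cannot shell out to curl, the same multipart request can be assembled by hand. The sketch below uses only the Python standard library and the field names from the curl examples; build_multipart and synthesize are illustrative helpers, not part of the NIM, and the audio/wav part type for uploaded files is an assumption.

```python
import urllib.request
import uuid


def build_multipart(fields, files=None):
    """Encode text fields and optional files as a multipart/form-data body.

    fields: {name: value}; files: {name: (filename, bytes)}.
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
            + str(value).encode("utf-8")
            + b"\r\n"
        )
    for name, (filename, data) in (files or {}).items():
        parts.append(
            (
                f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
                f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'
            ).encode()
            + data
            + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, language="en-US", voice=None, base_url="http://localhost:9000"):
    """POST to /v1/audio/synthesize and return the WAV bytes from the response."""
    fields = {"language": language, "text": text}
    if voice is not None:
        fields["voice"] = voice
    body, content_type = build_multipart(fields)
    request = urllib.request.Request(
        f"{base_url}/v1/audio/synthesize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage (requires a running NIM):
#   wav = synthesize("Hello.", voice="Magpie-Multilingual.EN-US.Aria")
#   open("output.wav", "wb").write(wav)
```

For zero-shot cloning, the prompt file's bytes would go in the files argument of build_multipart under the audio_prompt name, mirroring the @ upload in curl.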
Response#
A WAV file (audio/wav) at the requested sample rate. Default output: 16-bit mono, 22050 Hz.
Content-Type: audio/wav
Status Codes#
| Code | Description |
|---|---|
| | Synthesis succeeded. Response body is a WAV file. |
| | |
| | The NIM is still initializing. |
POST /v1/audio/synthesize_online#
Synthesizes speech from text and streams the audio back as raw LPCM chunks as they are generated, using HTTP chunked transfer encoding. Use this endpoint when low time-to-first-audio is important.
Content-Type: multipart/form-data
Request Parameters#
Identical to POST /v1/audio/synthesize.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | Text to synthesize. Maximum 2,000 normalized characters. |
| language | string | Yes | — | BCP-47 language code. |
| voice | string | No | model default | Voice name from GET /v1/audio/list_voices. |
| sample_rate_hz | integer | No | 22050 | Output sample rate in Hz. |
| | string | No | | Only … |
| | string | No | | Custom pronunciation rules. |
| audio_prompt | file | No | — | WAV reference audio for zero-shot cloning. |
| | string | No | — | Transcript of the audio prompt (Magpie TTS Flow only). |
| | integer | No | | Zero-shot adaptation quality. |
Example#
curl -sS http://localhost:9000/v1/audio/synthesize_online \
-F language=en-US \
-F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
-F voice="Magpie-Multilingual.EN-US.Aria" \
-F sample_rate_hz=22050 \
--output output.raw
Response Format#
The response body is raw 16-bit signed LPCM audio (no WAV header), delivered as a chunked stream (Transfer-Encoding: chunked). No Content-Type header is set on the response. Before playback or further processing, add a WAV header:
# Option 1: Python standard library
python3 -c "
import wave, pathlib
w = wave.open('output.wav', 'wb')
w.setnchannels(1)
w.setsampwidth(2)
w.setframerate(22050)
w.writeframes(pathlib.Path('output.raw').read_bytes())
w.close()
"
# Option 2: sox
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
Adjust -r / setframerate to match the sample_rate_hz value from the request.
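When the stream is consumed programmatically, the WAV header can be added without writing an intermediate .raw file. This is a minimal sketch, assuming the chunks arrive as an iterable of bytes objects from whatever HTTP client reads the chunked response; lpcm_chunks_to_wav is an illustrative name.

```python
import io
import wave


def lpcm_chunks_to_wav(chunks, sample_rate_hz=22050):
    """Wrap raw 16-bit mono LPCM chunks in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate_hz)
        for chunk in chunks:
            # writeframesraw appends data; the header sizes are fixed up on close.
            wav.writeframesraw(chunk)
    return buffer.getvalue()
```

As with the command-line options, sample_rate_hz must match the value sent in the request, because the raw stream carries no rate information of its own.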
Status Codes#
| Code | Description |
|---|---|
| | Stream started. Response body is raw LPCM audio. |
| | |
| | The NIM is still initializing. |
GET /v1/health/ready#
Returns the readiness state of the NIM.
Example#
curl http://localhost:9000/v1/health/ready
Response#
{
"object": "health.response",
"message": "ready",
"status": "ready"
}
Status Codes#
| Code | Description |
|---|---|
| | Ready to accept requests. |
| | Still initializing. |
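A client that starts alongside the container usually needs to wait for readiness before sending synthesis requests. The sketch below polls the endpoint until the JSON body reports "status": "ready", treating connection failures and error responses as "not ready yet"; is_ready, wait_until_ready, and the timeout and interval defaults are illustrative choices, not values defined by the NIM.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:9000"):
    """Return True if GET /v1/health/ready reports status 'ready'."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as response:
            return json.load(response).get("status") == "ready"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, non-2xx response, or a non-JSON body.
        return False


def wait_until_ready(probe=is_ready, timeout_s=300.0, interval_s=2.0, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The probe and sleep parameters are injectable so the loop can be exercised in tests without a running server.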
GET /v1/health/live#
Returns the liveness state of the NIM process. Container orchestrators use this endpoint to decide whether to restart the container.
Example#
curl http://localhost:9000/v1/health/live
Response#
{
"object": "health.response",
"message": "live",
"status": "live"
}
GET /v1/models#
Returns the list of available models in OpenAI-compatible format.
Example#
curl -s http://localhost:9000/v1/models | jq .
Response#
{
"object": "list",
"data": [
{
"id": "unknown",
"object": "model",
"created": 0,
"owned_by": "system"
}
]
}
GET /v1/version#
Returns the NIM release and API version.
Example#
curl -s http://localhost:9000/v1/version | jq .
Response#
{
"release": "1.8.0",
"api": "3.1.0"
}
GET /v1/metadata#
Returns metadata about the deployed model, including the selected profile ID and NGC model URLs.
Example#
curl -s http://localhost:9000/v1/metadata | jq .
Response#
{
"version": "1.8.0",
"selectedModelProfileId": "<profile-hash>",
"modelInfo": [
{
"modelUrl": "ngc://nim/nvidia/magpie-tts-multilingual:<tag>",
"shortName": "magpie-tts-multilingual:<tag>"
}
],
"repository_override": "",
"assetInfo": [],
"licenseInfo": {}
}
GET /v1/metrics#
Returns runtime metrics in Prometheus text format.
Example#
curl http://localhost:9000/v1/metrics
Key Metrics#
| Metric | Description |
|---|---|
| | Total TTS requests received since startup. |
| | TTS requests currently being processed. |
| | Total successful TTS requests. |
| | Total characters synthesized across all requests. |
| | Cumulative wall-clock time on TTS requests, in seconds. |
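Outside a Prometheus server, the text format is simple enough to read directly. The following is a minimal sketch of a parser for sample lines of the form "name value" or "name{labels} value", assuming the endpoint emits no trailing timestamps; parse_prometheus_text is an illustrative name.

```python
def parse_prometheus_text(text):
    """Return {series: value} for each sample line, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        # The value is the last space-separated token; label values may contain spaces.
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # not a "name value" line
        samples[series] = float(value)
    return samples
```

Dividing the cumulative request time by the count of successful requests, for example, gives a rough mean latency per request.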
Error Format#
All 4xx errors return {"detail": "<reason>"}:
| Scenario | HTTP | |
|---|---|---|
| | | |
| | | |
| Unknown | | |
| Unsupported | | |
| | | |
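Because every 4xx response shares the {"detail": "<reason>"} shape, a client can surface the reason with a small helper. This is a sketch only: error_detail is an illustrative name, and the handling of a non-string detail value is a defensive assumption rather than documented behavior.

```python
import json


def error_detail(status, body):
    """Return the 'detail' reason from a 4xx response body, or None."""
    if not 400 <= status < 500:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # body was not valid JSON
    if not isinstance(payload, dict) or "detail" not in payload:
        return None
    detail = payload["detail"]
    # Assumption: fall back to a JSON string if detail is structured.
    return detail if isinstance(detail, str) else json.dumps(detail)
```

With urllib, the status and body are available on the raised urllib.error.HTTPError as its code attribute and read() method.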
Input Text Limits#
The TTS NIM enforces a limit of 2,000 characters per request on normalized text. Normalization expands numbers, abbreviations, and SSML tags before the limit is applied, so input that contains them can exceed the limit with fewer than 2,000 raw characters.
For longer content, split the source text on sentence or paragraph boundaries and send one request per chunk:
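i=0
while IFS= read -r chunk; do
  [ -z "$chunk" ] && continue
  i=$((i + 1))
  # Number the files so the audio can be reassembled in order.
  # --form-string sends the text literally, even if it starts with @ or <.
  curl -sS http://localhost:9000/v1/audio/synthesize \
    -F language=en-US \
    --form-string text="$chunk" \
    -F voice="Magpie-Multilingual.EN-US.Aria" \
    --output "$(printf 'chunk_%03d.wav' "$i")"
done < paragraphs.txt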
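When the source text is not already one paragraph per line, the split can be done in code. The sketch below packs whole sentences into chunks of at most limit raw characters; split_text and the default of 1,500 are illustrative, chosen to leave headroom because the 2,000-character limit applies after normalization. The sentence boundary rule (terminal punctuation followed by whitespace) is a simplification.

```python
import re


def split_text(text, limit=1500):
    """Split text into chunks of at most `limit` characters on sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the text field of one synthesis request, in order.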
Output Audio Format#
| Property | /v1/audio/synthesize | /v1/audio/synthesize_online |
|---|---|---|
| Container | WAV (with header) | Raw 16-bit signed LPCM (no header) |
| Content-Type | audio/wav | (not set) |
| Transfer-Encoding | standard | chunked |
| Channels | 1 (mono) | 1 (mono) |
| Bit depth | 16-bit signed | 16-bit signed |
| Default sample rate | 22050 Hz | 22050 Hz |
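These properties fix the data rate of the audio, which is useful for estimating the duration of a raw stream from its size. A short worked sketch, with illustrative function names:

```python
def lpcm_byte_rate(sample_rate_hz=22050, channels=1, bytes_per_sample=2):
    """Bytes of audio per second: sample rate x channels x bytes per sample."""
    return sample_rate_hz * channels * bytes_per_sample


def lpcm_duration_seconds(num_bytes, sample_rate_hz=22050):
    """Duration of a 16-bit mono LPCM payload of num_bytes bytes."""
    return num_bytes / lpcm_byte_rate(sample_rate_hz)
```

At the default 22050 Hz, 16-bit mono audio occupies 44,100 bytes per second, so an 88,200-byte raw stream holds two seconds of speech.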
Port Configuration#
The HTTP port is configured with the NIM_HTTP_API_PORT environment variable (default: 9000). Avoid port 8000, which is reserved for the internal Triton HTTP endpoint.
docker run ... -e NIM_HTTP_API_PORT=9000 ...
For the complete list of runtime parameters, refer to Runtime Parameters.