riva/proto/riva_tts.proto

service RivaSpeechSynthesis
rpc SynthesizeSpeechResponse Synthesize(SynthesizeSpeechRequest)

Used to request text-to-speech from the service. Submit a request containing the desired text and configuration, and receive audio bytes in the requested format.
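
The response carries the audio as raw bytes, so a client typically has to wrap them in a container before most players can open them. The following is a minimal sketch, assuming the request asked for 16-bit mono linear PCM; the helper name `pcm_to_wav_bytes` and the silent stand-in buffer are illustrative and not part of the Riva API.

```python
import io
import wave


def pcm_to_wav_bytes(audio: bytes, sample_rate_hz: int,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    Assumes `audio` holds interleaved little-endian PCM samples of
    `sample_width` bytes each (an assumption, not stated by the proto).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(audio)
    return buf.getvalue()


# Stand-in for SynthesizeSpeechResponse.audio: 0.1 s of silence at 22050 Hz.
fake_audio = b"\x00\x00" * 2205
wav_bytes = pcm_to_wav_bytes(fake_audio, sample_rate_hz=22050)
```

In a real client, `fake_audio` would be replaced by the `audio` field of the response, and `sample_rate_hz` should match the value sent in the request.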

rpc stream SynthesizeSpeechResponse SynthesizeOnline(SynthesizeSpeechRequest)

Used to request text-to-speech that is returned as a stream as it becomes available. Submit a SynthesizeSpeechRequest with the desired text and configuration, and receive a stream of bytes in the requested format.
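
With the streaming call, audio arrives as a sequence of responses rather than one buffer. The sketch below shows one way a client might consume such a stream, handing each chunk to a callback (for example, a playback queue) while also accumulating the full clip. The generator `fake_stream` is a hypothetical stand-in for the iterator a gRPC client would return, and each yielded item stands in for the `audio` field of one SynthesizeSpeechResponse.

```python
from typing import Callable, Iterable, Iterator, Optional


def consume_stream(chunks: Iterable[bytes],
                   on_chunk: Optional[Callable[[bytes], None]] = None) -> bytes:
    """Process audio chunks as they arrive and return the concatenated audio."""
    buf = bytearray()
    for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. push to an audio device without waiting
        buf.extend(chunk)
    return bytes(buf)


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for the response stream."""
    yield b"\x01\x02"
    yield b"\x03\x04"
    yield b"\x05\x06"


seen_sizes = []
audio = consume_stream(fake_stream(), on_chunk=lambda c: seen_sizes.append(len(c)))
```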

rpc RivaSynthesisConfigResponse GetRivaSynthesisConfig(RivaSynthesisConfigRequest)

Enables clients to request the configuration of the current Synthesize service, or a specific model within the service.

message RivaSynthesisConfigRequest
string model_name

If model_name is specified, only the config for that model is returned; otherwise all configs are returned.

message RivaSynthesisConfigResponse
RivaSynthesisConfigResponse.Config model_config (repeated)
message RivaSynthesisConfigResponse.Config
string model_name
RivaSynthesisConfigResponse.Config.ParametersEntry parameters (repeated)
message RivaSynthesisConfigResponse.Config.ParametersEntry
string key
string value
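
The response is a repeated list of per-model configs, each carrying string key/value parameters. A client will often want these indexed by model name. The sketch below uses a hypothetical `Config` dataclass as a stand-in for RivaSynthesisConfigResponse.Config; the model names and parameter keys shown are invented for illustration and are not taken from a real server.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Config:
    """Hypothetical stand-in for RivaSynthesisConfigResponse.Config."""
    model_name: str
    parameters: Dict[str, str] = field(default_factory=dict)


def index_configs(model_config: Iterable[Config]) -> Dict[str, Dict[str, str]]:
    """Map each model_name to a plain dict of its parameters."""
    return {cfg.model_name: dict(cfg.parameters) for cfg in model_config}


# Illustrative data only; real keys and values depend on the deployment.
configs = [
    Config("model_a", {"language_code": "en-US"}),
    Config("model_b", {"language_code": "de-DE", "sample_rate": "22050"}),
]
by_name = index_configs(configs)
```

Note that all parameter values are strings, so numeric settings need an explicit conversion on the client side.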
message SynthesizeSpeechRequest
string text
string language_code
nvidia.riva.AudioEncoding encoding

audio encoding params

int32 sample_rate_hz
string voice_name

voice params

ZeroShotData zero_shot_data

Zero Shot model params

nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message SynthesizeSpeechResponse
bytes audio
SynthesizeSpeechResponseMetadata meta
nvidia.riva.RequestId id

The ID associated with the request

message SynthesizeSpeechResponseMetadata
string text

Currently experimental API addition that returns the input text after preprocessing has been completed as well as the predicted duration for each token. Note: this message is subject to future breaking changes, and potential removal.

string processed_text
float predicted_durations (repeated)
message ZeroShotData

Required for Zero Shot model

bytes audio_prompt

Small (up to 5 seconds) audio prompt for Zero Shot model.

int32 sample_rate_hz

Sample rate for input audio prompt. Currently defaults to 22050.

nvidia.riva.AudioEncoding encoding

Encoding of audio prompt, defaults to PCM.

int32 quality

The number of times the audio is passed through the decoder. Ranges from 1 to 40. Defaults to 20.
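
The constraints documented for ZeroShotData (a prompt of at most 5 seconds, a quality value from 1 to 40) can be checked on the client before a request is sent. The following is a rough sketch, assuming the prompt is raw 16-bit mono PCM so that its duration can be derived from the byte length; `ZeroShotPrompt` and `validate_prompt` are hypothetical names, not part of the generated API, and whether the server enforces the same limits is not stated here.

```python
from dataclasses import dataclass

MAX_PROMPT_SECONDS = 5.0         # "up to 5 seconds" per the field description
DEFAULT_SAMPLE_RATE_HZ = 22050   # documented default
DEFAULT_QUALITY = 20             # documented default


@dataclass
class ZeroShotPrompt:
    """Hypothetical stand-in mirroring the ZeroShotData fields."""
    audio_prompt: bytes
    sample_rate_hz: int = DEFAULT_SAMPLE_RATE_HZ
    quality: int = DEFAULT_QUALITY


def validate_prompt(prompt: ZeroShotPrompt, bytes_per_sample: int = 2) -> float:
    """Return the prompt duration in seconds, or raise ValueError.

    Duration is computed assuming mono PCM with `bytes_per_sample` bytes
    per sample (an assumption of this sketch).
    """
    if not 1 <= prompt.quality <= 40:
        raise ValueError(f"quality must be in 1..40, got {prompt.quality}")
    duration = len(prompt.audio_prompt) / (bytes_per_sample * prompt.sample_rate_hz)
    if duration > MAX_PROMPT_SECONDS:
        raise ValueError(f"prompt is {duration:.2f} s, limit is {MAX_PROMPT_SECONDS} s")
    return duration


# Three seconds of silence at the default sample rate.
three_seconds = ZeroShotPrompt(audio_prompt=b"\x00\x00" * (22050 * 3))
duration = validate_prompt(three_seconds)
```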