riva/proto/riva_tts.proto

service RivaSpeechSynthesis

rpc SynthesizeSpeechResponse Synthesize(SynthesizeSpeechRequest): Used to request text-to-speech from the service. Submit a request containing the desired text and configuration, and receive audio bytes in the requested format.

rpc stream SynthesizeSpeechResponse SynthesizeOnline(SynthesizeSpeechRequest): Used to request text-to-speech audio that is streamed back as it becomes available. Submit a SynthesizeSpeechRequest with the desired text and configuration, and receive a stream of bytes in the requested format.

rpc RivaSynthesisConfigResponse GetRivaSynthesisConfig(RivaSynthesisConfigRequest): Enables clients to request the configuration of the current Synthesize service, or a specific model within the service.
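
As a rough illustration of consuming the SynthesizeOnline stream, the sketch below joins the `audio` bytes of each streamed response and wraps them in a WAV container. It assumes the requested encoding is LINEAR_PCM with 16-bit mono samples, which this reference does not state. The helper `collect_stream_to_wav` and the `SimpleNamespace` stand-ins for responses are hypothetical and are not part of the Riva API.

```python
import io
import wave
from types import SimpleNamespace


def collect_stream_to_wav(responses, sample_rate_hz: int) -> bytes:
    """Join streamed audio chunks and wrap them in a WAV header.

    Assumption: each chunk holds raw 16-bit mono PCM samples.
    """
    pcm = b"".join(response.audio for response in responses)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # assumed mono
        wav_file.setsampwidth(2)      # assumed 16-bit samples
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# Stand-ins for streamed SynthesizeSpeechResponse messages (hypothetical data).
fake_stream = [
    SimpleNamespace(audio=b"\x00\x00" * 100),
    SimpleNamespace(audio=b"\x01\x00" * 50),
]
wav_bytes = collect_stream_to_wav(fake_stream, sample_rate_hz=22050)
```

With a real client, `fake_stream` would be replaced by the iterator returned from the streaming call, and `sample_rate_hz` should match the value sent in the request.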

message RivaSynthesisConfigRequest

string model_name: If a model is specified, only the config for that model is returned; otherwise, all configs are returned.

message RivaSynthesisConfigResponse

RivaSynthesisConfigResponse.Config model_config (repeated)

message RivaSynthesisConfigResponse.Config

string model_name

RivaSynthesisConfigResponse.Config.ParametersEntry parameters (repeated)

message RivaSynthesisConfigResponse.Config.ParametersEntry

string key

string value
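
The key/value entries above can be flattened into a plain lookup table per model. The sketch below assumes that `parameters` is exposed as a mapping, which is how protobuf map fields typically appear in generated Python code. The function `configs_by_model`, the model names, and the parameter key are hypothetical placeholders.

```python
from types import SimpleNamespace
from typing import Dict


def configs_by_model(response) -> Dict[str, Dict[str, str]]:
    """Index each Config's parameters by its model_name."""
    return {
        config.model_name: dict(config.parameters)
        for config in response.model_config
    }


# Stand-in for a RivaSynthesisConfigResponse (hypothetical values).
fake_response = SimpleNamespace(
    model_config=[
        SimpleNamespace(model_name="model_a", parameters={"language_code": "en-US"}),
        SimpleNamespace(model_name="model_b", parameters={"language_code": "es-US"}),
    ]
)
configs = configs_by_model(fake_response)
```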

message SynthesizeSpeechRequest

string text

string language_code

nvidia.riva.AudioEncoding encoding: audio encoding params

int32 sample_rate_hz

string voice_name: voice params

ZeroShotData zero_shot_data: Zero Shot model params

string custom_dictionary: A string of comma-separated key-value pairs, where each pair consists of a grapheme and its corresponding phoneme separated by two spaces.

nvidia.riva.RequestId id: The ID to be associated with the request. If provided, this will be returned in the corresponding response.
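
To make the `custom_dictionary` format concrete, here is a minimal sketch that builds the string from a Python mapping: each grapheme is joined to its phoneme with two spaces, and the pairs are joined with commas. The helper `build_custom_dictionary` and the placeholder entries are hypothetical; rejecting separators inside keys or values is a defensive assumption, not documented service behavior.

```python
from typing import Mapping

PAIR_SEPARATOR = ","     # between grapheme/phoneme pairs
FIELD_SEPARATOR = "  "   # two spaces between a grapheme and its phoneme


def build_custom_dictionary(entries: Mapping[str, str]) -> str:
    """Serialize grapheme -> phoneme entries into one request string."""
    pairs = []
    for grapheme, phoneme in entries.items():
        for part in (grapheme, phoneme):
            # Assumption: separators inside a field would make parsing ambiguous.
            if PAIR_SEPARATOR in part or FIELD_SEPARATOR in part:
                raise ValueError(f"separator not allowed inside {part!r}")
        pairs.append(f"{grapheme}{FIELD_SEPARATOR}{phoneme}")
    return PAIR_SEPARATOR.join(pairs)


# Placeholder entries; real values would be actual graphemes and phonemes.
dictionary_string = build_custom_dictionary(
    {"graphemeA": "phonemeA", "graphemeB": "phonemeB"}
)
```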

message SynthesizeSpeechResponse

bytes audio

SynthesizeSpeechResponseMetadata meta

nvidia.riva.RequestId id: The ID associated with the request

message SynthesizeSpeechResponseMetadata

string text: Experimental API addition that returns the input text after preprocessing has been completed, as well as the predicted duration for each token. Note: this message is subject to future breaking changes and potential removal.

string processed_text

float predicted_durations (repeated)

message ZeroShotData

Required for Zero Shot model

bytes audio_prompt: Audio prompt for Zero Shot model. Duration should be between 3 and 10 seconds.

int32 sample_rate_hz: Sample rate for input audio prompt.

nvidia.riva.AudioEncoding encoding: Encoding of audio prompt. Supported encodings are LINEAR_PCM and OGGOPUS.

int32 quality: The number of times to pass the audio through the decoder. Valid values range from 1 to 40. Defaults to 20.
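
The ZeroShotData constraints (a prompt of 3 to 10 seconds, quality from 1 to 40 with a default of 20) can be checked on the client before sending a request. The sketch below is one possible pre-flight check; it assumes the prompt is raw 16-bit mono LINEAR_PCM so that duration can be derived from the byte length, which would not hold for OGGOPUS. The function `check_zero_shot_prompt` is hypothetical.

```python
MIN_PROMPT_SECONDS = 3.0
MAX_PROMPT_SECONDS = 10.0
MIN_QUALITY = 1
MAX_QUALITY = 40
DEFAULT_QUALITY = 20
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono LINEAR_PCM


def check_zero_shot_prompt(
    audio_prompt: bytes, sample_rate_hz: int, quality: int = DEFAULT_QUALITY
) -> float:
    """Validate prompt length and quality; return the duration in seconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    if not MIN_QUALITY <= quality <= MAX_QUALITY:
        raise ValueError(f"quality must be in [{MIN_QUALITY}, {MAX_QUALITY}]")
    duration = len(audio_prompt) / (BYTES_PER_SAMPLE * sample_rate_hz)
    if not MIN_PROMPT_SECONDS <= duration <= MAX_PROMPT_SECONDS:
        raise ValueError(f"prompt is {duration:.2f}s; expected 3 to 10 seconds")
    return duration


# Five seconds of silence at 16 kHz (synthetic data for illustration).
five_seconds = b"\x00\x00" * (16000 * 5)
duration = check_zero_shot_prompt(five_seconds, sample_rate_hz=16000)
```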