gRPC & Protocol Buffers#

riva/proto/riva_asr.proto#

service RivaSpeechRecognition#

The RivaSpeechRecognition service provides two mechanisms for converting speech to text.

rpc RecognizeResponse Recognize(RecognizeRequest)

Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned.

rpc stream StreamingRecognizeResponse StreamingRecognize(stream StreamingRecognizeRequest)

StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client.

rpc RivaSpeechRecognitionConfigResponse GetRivaSpeechRecognitionConfig(RivaSpeechRecognitionConfigRequest)

Enables clients to request the configuration of the current ASR service, or a specific model within the service.

message EndpointingConfig#

EndpointingConfig configures the fields that control start-of-utterance and end-of-utterance detection.

int32 start_history#

start_history is the size of the window, in milliseconds, used to detect start of utterance. start_threshold is the percentage threshold used to detect start of utterance. (0.0 to 1.0) If start_threshold of start_history ms of the acoustic model output have non-blank tokens, start of utterance is detected.

optional

float start_threshold#

optional

int32 stop_history#

stop_history is the size of the window, in milliseconds, used to detect end of utterance. stop_threshold is the percentage threshold used to detect end of utterance. (0.0 to 1.0) If stop_threshold of stop_history ms of the acoustic model output have non-blank tokens, end of utterance is detected and decoder will be reset.

optional

float stop_threshold#

optional

int32 stop_history_eou#

stop_history_eou and stop_threshold_eou are used for 2-pass end of utterance. stop_history_eou is the size of the window, in milliseconds, used to trigger 1st pass of end of utterance and generate a partial transcript with stability of 1. (stop_history_eou < stop_history) stop_threshold_eou is the percentage threshold used to trigger 1st pass of end of utterance. (0.0 to 1.0) If stop_threshold_eou of stop_history_eou ms of the acoustic model output have non-blank tokens, 1st pass of end of utterance is triggered.

optional

float stop_threshold_eou#

optional
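The start-of-utterance rule described under start_history can be sketched as a trailing-window check. This is only an illustration of the documented rule, not Riva's implementation: the boolean frame representation, the frame_ms frame duration, and the use of `>=` for the comparison are all assumptions made for the example.

```python
from typing import Sequence


def window_fraction_non_blank(frames: Sequence[bool], history_ms: int, frame_ms: int) -> float:
    """Fraction of non-blank frames in the most recent history_ms of output.

    frames[i] is True when frame i carries a non-blank token (a hypothetical
    representation); frame_ms is an assumed frame duration in milliseconds.
    """
    n = max(1, history_ms // frame_ms)  # number of frames covered by the window
    window = frames[-n:]                # keep only the most recent frames
    if not window:
        return 0.0
    return sum(window) / len(window)


def start_of_utterance(frames: Sequence[bool], start_history: int,
                       start_threshold: float, frame_ms: int = 80) -> bool:
    """Apply the documented start rule to the trailing window."""
    return window_fraction_non_blank(frames, start_history, frame_ms) >= start_threshold
```

With five blank frames followed by five non-blank frames, a 400 ms window at 80 ms per frame sees only non-blank frames and triggers, while an 800 ms window sees half of each and does not trigger at a threshold of 0.6.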

message RecognitionConfig#

Provides information to the recognizer that specifies how to process the request

nvidia.riva.AudioEncoding encoding

The encoding of the audio data sent in the request.

All encodings support only 1 channel (mono) audio.

int32 sample_rate_hertz#

The sample rate in hertz (Hz) of the audio data sent in the RecognizeRequest or StreamingRecognizeRequest messages. The Riva server automatically down-samples or up-samples the audio to match the sample rate of the ASR acoustic model. A sample rate below 8 kHz will not produce meaningful output.

string language_code#

Required. The language of the supplied audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: “en-US”.

int32 max_alternatives#

Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than max_alternatives. If omitted, a maximum of one is returned.

bool profanity_filter#

A custom field that enables profanity filtering for the generated transcripts. If set to ‘true’, the server filters out profanities, replacing all but the initial character in each filtered word with asterisks. For example, “x**”. If set to false or omitted, profanities will not be filtered out. The default is false.
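The masking rule described above (keep the initial character, replace the rest with asterisks) can be illustrated with a short sketch. The server performs this filtering itself; the blocklist and the whitespace tokenization here are hypothetical and only show the shape of the output.

```python
def mask_profanity(word: str) -> str:
    """Keep the first character and replace every remaining character with '*'."""
    if len(word) <= 1:
        return word
    return word[0] + "*" * (len(word) - 1)


def filter_transcript(transcript: str, blocklist: set) -> str:
    """Mask each whitespace-separated word that appears in the (hypothetical) blocklist."""
    return " ".join(
        mask_profanity(w) if w.lower() in blocklist else w
        for w in transcript.split()
    )
```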

SpeechContext speech_contexts(repeated)#

An array of SpeechContext messages that provide context to assist speech recognition. For more information, see the SpeechContext section.

int32 audio_channel_count#

The number of channels in the input audio data. If 0 or omitted, defaults to one channel (mono). Note: Only single channel audio input is supported as of now.

bool enable_word_time_offsets#

If true, the top result includes a list of words and the start and end time offsets (timestamps), and confidence scores for those words. If false, no word-level time offset information is returned. The default is false.

bool enable_automatic_punctuation#

If ‘true’, adds punctuation to recognition result hypotheses. The default ‘false’ value does not add punctuation to result hypotheses.

bool enable_separate_recognition_per_channel#

To have each channel recognized separately, set this field to true explicitly and set audio_channel_count to a value greater than 1. The recognition result will contain a channel_tag field stating which channel the result belongs to. If this field is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio. Note: This field is not yet supported.

string model#

Which model to select for the given request. If empty, Riva will select the right model based on the other RecognitionConfig parameters. The model should correspond to the name passed to riva-build with the --name argument.

bool verbatim_transcripts#

The verbatim_transcripts flag enables or disables inverse text normalization. ‘true’ returns exactly what was said, with no denormalization. ‘false’ applies inverse text normalization and is the default.

SpeakerDiarizationConfig diarization_config#

Config to enable speaker diarization and set additional parameters. For non-streaming requests, the diarization results will be provided only in the top alternative of the FINAL SpeechRecognitionResult.

RecognitionConfig.CustomConfigurationEntry custom_configuration (repeated)

Custom fields for passing request-level configuration options to plugins used in the model pipeline.

EndpointingConfig endpointing_config#

Config for tuning start or end of utterance parameters. If empty, Riva will use default values or custom values if specified in riva-build arguments.

optional

message RecognitionConfig.CustomConfigurationEntry
string key#
string value#
message RecognizeRequest#

RecognizeRequest is used for batch processing of a single audio recording.

RecognitionConfig config#

Provides information to recognizer that specifies how to process the request.

bytes audio#

The raw audio data to be processed. The audio bytes must be encoded as specified in RecognitionConfig.

nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message RecognizeResponse#

The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

SpeechRecognitionResult results(repeated)#

Sequential list of transcription results corresponding to sequential portions of audio. Currently only returns one transcript.

nvidia.riva.RequestId id

The ID associated with the request

message RivaSpeechRecognitionConfigRequest#
string model_name#

If a model is specified, only the configuration for that model is returned; otherwise all configurations are returned.

message RivaSpeechRecognitionConfigResponse#
RivaSpeechRecognitionConfigResponse.Config model_config (repeated)
message RivaSpeechRecognitionConfigResponse.Config
string model_name#
RivaSpeechRecognitionConfigResponse.Config.ParametersEntry parameters (repeated)
message RivaSpeechRecognitionConfigResponse.Config.ParametersEntry
string key
string value
message SpeakerDiarizationConfig#

Config to enable speaker diarization.

bool enable_speaker_diarization#

If ‘true’, enables speaker detection for each recognized word in the top alternative of the recognition result using a speaker_tag provided in the WordInfo.

int32 max_speaker_count#

Maximum number of speakers in the conversation. This gives flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 8.

message SpeechContext#

Provides “hints” to the speech recognizer to favor specific words and phrases in the results.

string phrases(repeated)#

A list of strings containing words and phrases “hints” so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer.

float boost#

Hint Boost. Positive value will increase the probability that a specific phrase will be recognized over other similar sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Though boost can accept a wide range of positive values, most use cases are best served with values between 0 and 20. We recommend using a binary search approach to finding the optimal value for your use case.
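The binary search recommended above can be sketched as follows. Because higher boosts raise the chance of false positives, this sketch looks for the lowest boost that still works. The is_recognized callback is hypothetical (in practice it would run your own evaluation audio through the service), and the search assumes that once a boost is large enough, any larger boost also works.

```python
from typing import Callable


def smallest_effective_boost(
    is_recognized: Callable[[float], bool],
    low: float = 0.0,
    high: float = 20.0,
    tolerance: float = 0.5,
) -> float:
    """Binary-search the lowest boost in [low, high] for which is_recognized is True.

    Raises ValueError if the phrase is not recognized even at the upper bound.
    """
    if not is_recognized(high):
        raise ValueError("phrase not recognized even at the upper bound")
    while high - low > tolerance:
        mid = (low + high) / 2
        if is_recognized(mid):
            high = mid  # mid works, so try smaller boosts
        else:
            low = mid   # mid fails, so a larger boost is needed
    return high
```

The returned value is within tolerance of the true switching point, and the callback is invoked only a handful of times over the 0 to 20 range.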

message SpeechRecognitionAlternative#

Alternative hypotheses (a.k.a. n-best list).

string transcript#

Transcript text representing the words that the user spoke.

float confidence#

The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set.

WordInfo words(repeated)#

A list of word-specific information for each recognized word. Only populated if is_final=true.
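Since the confidence field can roughly be read as a natural-log probability, a client might convert it as in the sketch below. This is an interpretation aid only: the clamp to 1.0 and the choice to return None for the 0.0 sentinel are assumptions of this example, and the underlying computation is documented as subject to change.

```python
import math
from typing import Optional


def confidence_as_probability(confidence: float) -> Optional[float]:
    """Map a log-probability-like confidence to [0, 1], or None if it was not set."""
    if confidence == 0.0:
        return None  # documented sentinel: confidence was not set
    # exp() undoes the natural log; clamp in case the estimate is positive.
    return min(1.0, math.exp(confidence))
```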

message SpeechRecognitionResult#

A speech recognition result corresponding to the latest transcript

SpeechRecognitionAlternative alternatives(repeated)#

May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

int32 channel_tag#

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’.

float audio_processed#

Length of audio processed so far in seconds

message StreamingRecognitionConfig#

Provides information to the recognizer that specifies how to process the request

RecognitionConfig config#

Provides information to the recognizer that specifies how to process the request

bool interim_results#

If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned.

message StreamingRecognitionResult#

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

SpeechRecognitionAlternative alternatives(repeated)#

May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

bool is_final#

If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.

float stability#

An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set.

int32 channel_tag#

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’.

float audio_processed#

Length of audio processed so far in seconds
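A client that receives both interim and final results usually keeps only the final ones when assembling a transcript. The sketch below uses plain dictionaries as stand-ins for StreamingRecognitionResult messages; the dictionary layout mirrors the field names above but is otherwise an assumption of this example.

```python
from typing import Dict, Iterable, List


def final_transcript(results: Iterable[Dict]) -> str:
    """Join the top alternative of every is_final=True result, skipping interim ones."""
    parts: List[str] = []
    for result in results:
        if result["is_final"] and result["alternatives"]:
            # The first alternative is the most probable one.
            parts.append(result["alternatives"][0]["transcript"].strip())
    return " ".join(parts)
```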

message StreamingRecognizeRequest#

A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.

StreamingRecognitionConfig streaming_config#

Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.

bytes audio_content#

The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio data and all subsequent StreamingRecognizeRequest messages must contain audio data. The audio bytes must be encoded as specified in RecognitionConfig.

nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding responses.
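The message ordering required above (one configuration-only message, then audio-only messages) can be sketched as a generator. Plain dictionaries stand in for StreamingRecognizeRequest messages, and the 3200-byte chunk size is an arbitrary value chosen for the example.

```python
from typing import Dict, Iterator


def streaming_requests(config: Dict, audio: bytes, chunk_size: int = 3200) -> Iterator[Dict]:
    """Yield dict stand-ins for StreamingRecognizeRequest in the documented order."""
    # First message: configuration only, no audio.
    yield {"streaming_config": config}
    # Every later message: audio only, in sequential chunks.
    for start in range(0, len(audio), chunk_size):
        yield {"audio_content": audio[start:start + chunk_size]}
```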

message StreamingRecognizeResponse#
StreamingRecognitionResult results(repeated)#

This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives

nvidia.riva.RequestId id

The ID associated with the request

message WordInfo#

Word-specific information for recognized words.

int32 start_time#

Time offset relative to the beginning of the audio in ms and corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.

int32 end_time#

Time offset relative to the beginning of the audio in ms and corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.

string word#

The word corresponding to this set of information.

float confidence#

The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set.

int32 speaker_tag#

Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Value ranges from ‘1’ to max_speaker_count. speaker_tag is set if enable_speaker_diarization = ‘true’ and only in the top alternative.
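When diarization and word time offsets are both enabled, the WordInfo fields are enough to reconstruct speaker turns. The sketch below uses dictionaries as stand-ins for WordInfo messages and converts the millisecond offsets to seconds; the tuple layout of the output is a choice made for this example.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def speaker_turns(words: List[Dict]) -> List[Tuple[int, float, float, str]]:
    """Merge consecutive words with the same speaker_tag into (tag, start_s, end_s, text)."""
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speaker_tag"]):
        chunk = list(group)
        turns.append((
            tag,
            chunk[0]["start_time"] / 1000.0,  # milliseconds to seconds
            chunk[-1]["end_time"] / 1000.0,
            " ".join(w["word"] for w in chunk),
        ))
    return turns
```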

riva/proto/riva_nlp.proto#

service RivaLanguageUnderstanding#
rpc TextClassResponse ClassifyText(TextClassRequest)

ClassifyText takes as input an input/query string and parameters related to the requested model to use to evaluate the text. The service evaluates the text with the requested model, and returns one or more classifications.

rpc TokenClassResponse ClassifyTokens(TokenClassRequest)

ClassifyTokens takes as input either a string or list of tokens and parameters related to which model to use. The service evaluates the text with the requested model, performing additional tokenization if necessary, and returns one or more class labels per token.

rpc TextTransformResponse TransformText(TextTransformRequest)

TransformText takes an input/query string and parameters related to the requested model and returns another string. The behavior of the function is defined entirely by the underlying model and may be used for tasks like translation, adding punctuation, augmenting the input directly, etc.

rpc TokenClassResponse AnalyzeEntities(AnalyzeEntitiesRequest)

AnalyzeEntities accepts an input string and returns all named entities within the text, as well as a category and likelihood.

rpc AnalyzeIntentResponse AnalyzeIntent(AnalyzeIntentRequest)

AnalyzeIntent accepts an input string and returns the most likely intent as well as slots relevant to that intent.

The model requires that a valid “domain” be passed in, and optionally supports including a previous intent classification result to provide context for the model.

rpc TextTransformResponse PunctuateText(TextTransformRequest)

PunctuateText takes text with little or no punctuation and returns the same text with corrected punctuation and capitalization.

rpc NaturalQueryResponse NaturalQuery(NaturalQueryRequest)

NaturalQuery is a search function that enables querying one or more documents or contexts with a query that is written in natural language.

rpc RivaNLPConfigResponse GetRivaNLPConfig(RivaNLPConfigRequest)

Enables clients to request the configuration of the current NLP service, or a specific model within the service.

message AnalyzeEntitiesOptions#

AnalyzeEntitiesOptions is an optional configuration message to be sent as part of an AnalyzeEntitiesRequest with query metadata

string lang#

Deprecated. Optional language field. Assumed to be “en-US” if not specified.

message AnalyzeEntitiesRequest#

AnalyzeEntitiesRequest is the input message for the AnalyzeEntities service

string query#

Deprecated. The string to analyze for intent and slots

Deprecated. Optional configuration for the request, including providing context from previous turns and hardcoding a domain/language

nvidia.riva.RequestId id

Deprecated. The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message AnalyzeIntentContext#

AnalyzeIntentContext is reserved for future use, when context may be sent back in a variety of different formats (including raw neural network hidden states).

Reserved for future use

message AnalyzeIntentOptions#

AnalyzeIntentOptions is an optional configuration message to be sent as part of an AnalyzeIntentRequest with query metadata

string previous_intent#

Deprecated.

Deprecated.

string domain#

Deprecated. Optional domain field. Domain must be supported otherwise an error will be returned. If left blank, a domain detector will be run first and then the query routed to the appropriate intent classifier (if it exists)

string lang#

Deprecated. Optional language field. Assumed to be “en-US” if not specified.

message AnalyzeIntentRequest#

AnalyzeIntentRequest is the input message for the AnalyzeIntent service

string query#

Deprecated. The string to analyze for intent and slots

Deprecated. Optional configuration for the request, including providing context from previous turns and hardcoding a domain/language

nvidia.riva.RequestId id

Deprecated. The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message AnalyzeIntentResponse#

AnalyzeIntentResponse is returned by the AnalyzeIntent service, and includes information related to the query’s intent, (optionally) slot data, and its domain.

Classification intent#

Deprecated. Intent classification result, including the label and score

TokenClassValue slots(repeated)#

Deprecated. List of tokens explicitly marked as filling a slot relevant to the intent, where the tokens may not exactly match the input (based on the recombined values after tokenization)

string domain_str#

Deprecated. Returns the inferred domain for the query if not hardcoded in the request. In the case where the domain was hardcoded in AnalyzeIntentRequest, the returned domain is an exact match to the request. In the case where no domain matches the query, intent and slots will be unset.

DEPRECATED, use Classification domain field.

Classification domain

Deprecated. Returns the inferred domain for the query if not hardcoded in the request. In the case where the domain was hardcoded in AnalyzeIntentRequest, the returned domain is an exact match to the request. In the case where no domain matches the query, intent and slots will be unset.

nvidia.riva.RequestId id

Deprecated. The ID associated with the request

message Classification#

Classification messages return a class name and corresponding score

string class_name#

Deprecated.

float score#

Deprecated.

message ClassificationResult#

ClassificationResults contain zero or more Classification messages. If the number of Classifications is > 1, top_n > 1 must have been specified.

Classification labels(repeated)#

Deprecated.

message NLPModelParams#

NLPModelParams is a metadata message that is included in every request message used by the Core NLP Service and is used to specify model characteristics/requirements

string model_name#

Requested model to use. If specified, this takes preference over language_code.

string language_code#

Specify language of the supplied text as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Defaults to “en-US” if not set.

message NaturalQueryRequest#
string query#

Deprecated. The natural language query

uint32 top_n#

Deprecated. Maximum number of answers to return for the query. Defaults to 1 if not set.

string context#

Deprecated. Context to search with the above query

nvidia.riva.RequestId id

Deprecated. The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message NaturalQueryResponse#
NaturalQueryResult results(repeated)#

Deprecated.

nvidia.riva.RequestId id

Deprecated. The ID associated with the request

message NaturalQueryResult#
string answer#

Deprecated. text which answers the query

float score

Deprecated. Score representing confidence in result

message RivaNLPConfigRequest#
string model_name#

If a model is specified, only the configuration for that model is returned; otherwise all configurations are returned.

message RivaNLPConfigResponse#
RivaNLPConfigResponse.Config model_config (repeated)
message RivaNLPConfigResponse.Config
string model_name
RivaNLPConfigResponse.Config.ParametersEntry parameters (repeated)
message RivaNLPConfigResponse.Config.ParametersEntry
string key
string value
message Span#

Span of a particular result

uint32 start#

Deprecated.

uint32 end#

Deprecated.

message TextClassRequest#

TextClassRequest is the input message to the ClassifyText service.

string text(repeated)#

Deprecated. Each repeated text element is handled independently for handling multiple input strings with a single request

uint32 top_n

Deprecated. Return the top N classification results for each input. 0 or 1 will return the top class, otherwise N. Note: Currently disabled.

Deprecated.

nvidia.riva.RequestId id

Deprecated. The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message TextClassResponse#

TextClassResponse is the return message from the ClassifyText service.

ClassificationResult results(repeated)#

Deprecated.

nvidia.riva.RequestId id

Deprecated. The ID associated with the request

message TextTransformRequest#

TextTransformRequest is a request type intended for services like TransformText which take an arbitrary text input

string text(repeated)#

Each repeated text element is handled independently for handling multiple input strings with a single request

uint32 top_n#
NLPModelParams model#
nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message TextTransformResponse#

TextTransformResponse is returned by the TransformText method. Responses are returned in the same order as they were requested.

string text(repeated)#
nvidia.riva.RequestId id

The ID associated with the request

message TokenClassRequest#

TokenClassRequest is the input message to the ClassifyTokens service.

string text(repeated)#

Deprecated. Each repeated text element is handled independently for handling multiple input strings with a single request

uint32 top_n

Deprecated. Return the top N classification results for each input. 0 or 1 will return the top class, otherwise N. Note: Currently disabled.

Deprecated.

nvidia.riva.RequestId id

Deprecated. The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message TokenClassResponse#

TokenClassResponse returns a single TokenClassSequence per input request

TokenClassSequence results(repeated)#

Deprecated.

nvidia.riva.RequestId id

Deprecated. The ID associated with the request

message TokenClassSequence#

TokenClassSequence is used for returning a sequence of TokenClassValue objects in the original order of input tokens

TokenClassValue results(repeated)#

Deprecated.

message TokenClassValue#

TokenClassValue is used to correlate an input token with its classification results

string token#

Deprecated.

Classification label(repeated)#

Deprecated.

Span span(repeated)#

Deprecated.

riva/proto/riva_nmt.proto#

service RivaTranslation#

RivaTranslation service provides rpcs to translate between languages.

rpc TranslateTextResponse TranslateText(TranslateTextRequest)

Translate text to text, from a source to a target language. Currently the source and target language fields are required, along with the model name. Multiple texts may be passed per request, up to the given batch size for the model, which is set at translation pipeline creation time.

rpc AvailableLanguageResponse ListSupportedLanguagePairs(AvailableLanguageRequest)

Lists the available language pairs and model names to be used for TranslateText.

rpc stream StreamingTranslateSpeechToTextResponse StreamingTranslateSpeechToText(stream StreamingTranslateSpeechToTextRequest)

Streaming speech-to-text translation API.

rpc stream StreamingTranslateSpeechToSpeechResponse StreamingTranslateSpeechToSpeech(stream StreamingTranslateSpeechToSpeechRequest)
message AvailableLanguageRequest#

Returns a map of model names to its source and target language pairs. Can specify a specific model name to retrieve only its language pairs.

string model#

Supported values: “s2s_model”, “s2t_model”, and name of the deployed t2t model. If empty, returns all available models and languages.

message AvailableLanguageResponse#

Language pairs are the sets of source (src) to target (tgt) languages available per model. The languages field maps each model name to its LanguagePair.

AvailableLanguageResponse.LanguagesEntry languages (repeated)
message AvailableLanguageResponse.LanguagePair
string src_lang(repeated)#
string tgt_lang(repeated)#
message AvailableLanguageResponse.LanguagesEntry
string key
AvailableLanguageResponse.LanguagePair value
message StreamingTranslateSpeechToSpeechConfig#

Configuration for Translate S2S. Reuses existing protos from other services.

nvidia.riva.asr.StreamingRecognitionConfig asr_config

From riva_asr.proto

SynthesizeSpeechConfig tts_config#
TranslationConfig translation_config#
message StreamingTranslateSpeechToSpeechRequest#

The streaming speech-to-speech translation request is used to configure the entire pipeline for speech translation. The pipeline can be backed by a cascade of ASR, NMT, and TTS models or by an end-to-end model.

StreamingTranslateSpeechToSpeechConfig config#
bytes audio_content#
nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message StreamingTranslateSpeechToSpeechResponse#
nvidia.riva.tts.SynthesizeSpeechResponse speech

Contains the speech responses. The last response carries an empty buffer to mark the end of the stream.

from riva_tts.proto

nvidia.riva.RequestId id

The ID associated with the request

message StreamingTranslateSpeechToTextConfig#
nvidia.riva.asr.StreamingRecognitionConfig asr_config

existing ASR config

TranslationConfig translation_config#
message StreamingTranslateSpeechToTextRequest#
StreamingTranslateSpeechToTextConfig config#
bytes audio_content#
nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message StreamingTranslateSpeechToTextResponse#
nvidia.riva.asr.StreamingRecognitionResult results (repeated)

from riva_asr.proto

nvidia.riva.RequestId id

The ID associated with the request

message SynthesizeSpeechConfig#
nvidia.riva.AudioEncoding encoding
int32 sample_rate_hz#
string voice_name#
string language_code#
string prosody_rate#
string prosody_pitch#
string prosody_volume#
message TranslateTextRequest#

Request for synchronous translation of each text in texts. Available languages can be queried using the ListSupportedLanguagePairs RPC. The source and target languages must be specified and are currently two-character ISO codes; this will likely change to BCP-47, in line with other Riva services, for GA.

string texts(repeated)#
string model#
string source_language#
string target_language#
nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.
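Because TranslateText accepts multiple texts per request only up to the model's batch size, a client with a long list of texts needs to split it across requests. A minimal sketch of that splitting follows; the batch_size value must come from your own deployment and is a plain parameter here.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of texts, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])
```

Each yielded list would become the texts field of one TranslateTextRequest, and because translations are returned 1:1 with the input texts, concatenating the responses in order preserves the original ordering.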

message TranslateTextResponse#

Translations are returned as text:language pairs. These are 1:1 for the passed in ‘texts’ from the request.

Translation translations(repeated)#
nvidia.riva.RequestId id

The ID associated with the request

message Translation#

Contains a single translation, collected into the TranslateTextResponse. Includes the target language code, since multilingual models allow multiple possibilities.

string text#
string language#
message TranslationConfig#
string source_language_code#

A BCP-47 language tag, for example “en-US”.

string target_language_code#
string model_name#
string dnt_phrases(repeated)#

A list of words or phrases that will not be translated by the pipeline. This list can include special words or phrases, for example, names, acronyms or any phrases desired to be excluded from translation. These words or phrases will be present as-is in the translated output.

riva/proto/riva_tts.proto#

service RivaSpeechSynthesis#
rpc SynthesizeSpeechResponse Synthesize(SynthesizeSpeechRequest)

Used to request text-to-speech from the service. Submit a request containing the desired text and configuration, and receive audio bytes in the requested format.

rpc stream SynthesizeSpeechResponse SynthesizeOnline(SynthesizeSpeechRequest)

Used to request text-to-speech returned via stream as it becomes available. Submit a SynthesizeSpeechRequest with the desired text and configuration, and receive a stream of bytes in the requested format.

rpc RivaSynthesisConfigResponse GetRivaSynthesisConfig(RivaSynthesisConfigRequest)

Enables clients to request the configuration of the current Synthesize service, or a specific model within the service.

message RivaSynthesisConfigRequest#
string model_name#

If a model is specified, only the configuration for that model is returned; otherwise all configurations are returned.

message RivaSynthesisConfigResponse#
RivaSynthesisConfigResponse.Config model_config (repeated)
message RivaSynthesisConfigResponse.Config
string model_name
RivaSynthesisConfigResponse.Config.ParametersEntry parameters (repeated)
message RivaSynthesisConfigResponse.Config.ParametersEntry
string key
string value
message SynthesizeSpeechRequest#
string text#
string language_code#
nvidia.riva.AudioEncoding encoding

audio encoding params

int32 sample_rate_hz#
string voice_name#

voice params

ZeroShotData zero_shot_data#

Zero Shot model params

string custom_dictionary#

A string containing comma-separated key-value pairs of grapheme and corresponding phoneme separated by double spaces.
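The custom_dictionary format described above can be produced from a mapping with a small helper. This is a sketch of the string layout only: the rejection of commas is a precaution added for this example, and the phoneme strings in any real dictionary must use whatever phoneme set your deployed model expects.

```python
from typing import Dict


def build_custom_dictionary(entries: Dict[str, str]) -> str:
    """Serialize grapheme-to-phoneme pairs as 'grapheme<two spaces>phoneme', comma-separated."""
    for grapheme, phoneme in entries.items():
        # A comma inside an entry would be read as a pair separator.
        if "," in grapheme or "," in phoneme:
            raise ValueError(f"entry contains a comma: {grapheme!r}")
    return ",".join(f"{g}  {p}" for g, p in entries.items())
```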

nvidia.riva.RequestId id

The ID to be associated with the request. If provided, this will be returned in the corresponding response.

message SynthesizeSpeechResponse#
bytes audio#
SynthesizeSpeechResponseMetadata meta#
nvidia.riva.RequestId id

The ID associated with the request

message SynthesizeSpeechResponseMetadata#
string text#

Currently experimental API addition that returns the input text after preprocessing has been completed as well as the predicted duration for each token. Note: this message is subject to future breaking changes, and potential removal.

string processed_text#
float predicted_durations(repeated)#
message ZeroShotData#

Required for Zero Shot model

bytes audio_prompt#

Audio prompt for Zero Shot model. Duration should be between 3 and 10 seconds.

int32 sample_rate_hz#

Sample rate for input audio prompt.

nvidia.riva.AudioEncoding encoding

Encoding of audio prompt. Supported encodings are LINEAR_PCM and OGGOPUS.

int32 quality#

The number of times the user wants to pass audio through the decoder. This ranges from 1 to 40. Defaults to 20.
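The documented limits on the audio prompt and on quality can be checked on the client before a request is sent. The sketch below assumes the prompt is headerless 16-bit mono LINEAR_PCM (2 bytes per sample); for OGGOPUS or audio with a container header the duration would have to be computed differently.

```python
def validate_zero_shot(audio_prompt: bytes, sample_rate_hz: int, quality: int = 20) -> float:
    """Return the prompt duration in seconds, raising ValueError on out-of-range inputs.

    Assumes headerless 16-bit mono LINEAR_PCM, i.e. 2 bytes per sample.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    duration = len(audio_prompt) / (2 * sample_rate_hz)
    if not 3.0 <= duration <= 10.0:
        raise ValueError(f"prompt is {duration:.2f}s; expected 3 to 10 seconds")
    if not 1 <= quality <= 40:
        raise ValueError("quality must be between 1 and 40")
    return duration
```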

riva/proto/riva_common.proto#

message RequestId#

Specifies the request ID of the request.

string value#

riva/proto/riva_audio.proto#

enum AudioEncoding

AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.

enumerator ENCODING_UNSPECIFIED = 0#

Not specified.

enumerator LINEAR_PCM = 1#

Uncompressed 16-bit signed little-endian samples (Linear PCM).
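For audio that is held as floating-point samples, the LINEAR_PCM byte layout described above can be produced with the standard library. The scaling by 32767 and the clamping are conventions chosen for this sketch; other scaling conventions exist.

```python
import struct
from typing import Sequence


def floats_to_linear_pcm(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    # Scale to the 16-bit range and clamp anything that falls outside it.
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)
```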

enumerator FLAC = 2#

FLAC (Free Lossless Audio Codec) is the recommended encoding: it is lossless, so recognition is not compromised, and it requires only about half the bandwidth of LINEAR_PCM. FLAC stream encoding supports 16-bit and 24-bit samples; however, not all fields in STREAMINFO are supported.

enumerator MULAW = 3#

8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law.

enumerator OGGOPUS = 4#
enumerator ALAW = 20#

8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law.

riva/proto/health.proto#

service Health
rpc HealthCheckResponse Check(HealthCheckRequest)
rpc stream HealthCheckResponse Watch(HealthCheckRequest)
message HealthCheckRequest
string service
message HealthCheckResponse
HealthCheckResponse.ServingStatus status
enum HealthCheckResponse.ServingStatus
enumerator UNKNOWN = 0
enumerator SERVING = 1
enumerator NOT_SERVING = 2