API Reference#


riva/proto/health.proto#

HealthCheckRequest#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| service | string | | |

HealthCheckResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| status | HealthCheckResponse.ServingStatus | | |

HealthCheckResponse.ServingStatus#

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN | 0 | |
| SERVING | 1 | |
| NOT_SERVING | 2 | |

Health#

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Check | HealthCheckRequest | HealthCheckResponse | |
| Watch | HealthCheckRequest | HealthCheckResponse stream | |
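As a minimal sketch of calling this service from Python, assuming the stubs generated from riva/proto/health.proto are importable as health_pb2 and health_pb2_grpc, and that the endpoint below is replaced with a real Riva server address:

```python
import grpc

# Assumed module names for stubs generated from riva/proto/health.proto.
import health_pb2
import health_pb2_grpc

# Placeholder endpoint; substitute the address of a running Riva server.
with grpc.insecure_channel("localhost:50051") as channel:
    stub = health_pb2_grpc.HealthStub(channel)

    # An empty service name queries the overall server health.
    response = stub.Check(health_pb2.HealthCheckRequest(service=""))
    print(health_pb2.HealthCheckResponse.ServingStatus.Name(response.status))
```

Watch takes the same request but returns a stream of HealthCheckResponse messages, which can be iterated like any gRPC server stream.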


riva/proto/riva_asr.proto#

EndpointingConfig#

EndpointingConfig configures the parameters used to detect the start and end of an utterance.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| start_history | int32 | optional | Size of the window, in milliseconds, used to detect start of utterance. If start_threshold of start_history ms of the acoustic model output have non-blank tokens, start of utterance is detected. |
| start_threshold | float | optional | Percentage threshold (0.0 to 1.0) used to detect start of utterance. |
| stop_history | int32 | optional | Size of the window, in milliseconds, used to detect end of utterance. If stop_threshold of stop_history ms of the acoustic model output have non-blank tokens, end of utterance is detected and the decoder is reset. |
| stop_threshold | float | optional | Percentage threshold (0.0 to 1.0) used to detect end of utterance. |
| stop_history_eou | int32 | optional | Used together with stop_threshold_eou for 2-pass end of utterance. stop_history_eou is the size of the window, in milliseconds, used to trigger the 1st pass of end of utterance and generate a partial transcript with stability of 1. Must be smaller than stop_history. If stop_threshold_eou of stop_history_eou ms of the acoustic model output have non-blank tokens, the 1st pass of end of utterance is triggered. |
| stop_threshold_eou | float | optional | Percentage threshold (0.0 to 1.0) used to trigger the 1st pass of end of utterance. |
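For illustration, a configuration built with the Python protobuf API might look as follows. The module name riva_asr_pb2 is an assumption about how riva/proto/riva_asr.proto was compiled, and the window/threshold values are placeholders rather than recommended settings.

```python
import riva_asr_pb2 as rasr  # assumed generated module name

# Placeholder values: windows are in milliseconds, thresholds in [0.0, 1.0].
endpointing = rasr.EndpointingConfig(
    start_history=300,
    start_threshold=0.2,
    stop_history=800,
    stop_threshold=0.98,
    stop_history_eou=240,      # 1st-pass window; must be smaller than stop_history
    stop_threshold_eou=0.98,
)

# The message is attached to a RecognitionConfig via its endpointing_config field.
config = rasr.RecognitionConfig(language_code="en-US", endpointing_config=endpointing)
```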

PipelineStates#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| vad_probabilities | float | repeated | Neural VAD probabilities. |

RecognitionConfig#

Provides information to the recognizer that specifies how to process the request

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| encoding | nvidia.riva.AudioEncoding | | The encoding of the audio data sent in the request. All encodings support only 1 channel (mono) audio. |
| sample_rate_hertz | int32 | | The sample rate in hertz (Hz) of the audio data sent in the RecognizeRequest or StreamingRecognizeRequest messages. The Riva server automatically down-samples/up-samples the audio to match the ASR acoustic model sample rate. Sample rates below 8 kHz will not produce meaningful output. |
| language_code | string | | Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-US". |
| max_alternatives | int32 | | Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than max_alternatives. If omitted, at most one alternative is returned. |
| profanity_filter | bool | | A custom field that enables profanity filtering for the generated transcripts. If set to true, the server filters out profanities, replacing all but the initial character in each filtered word with asterisks. For example, "x**". If set to false or omitted, profanities are not filtered out. The default is false. |
| speech_contexts | SpeechContext | repeated | Array of SpeechContext. A means to provide context to assist the speech recognition. For more information, see the SpeechContext section. |
| audio_channel_count | int32 | | The number of channels in the input audio data. If 0 or omitted, defaults to one channel (mono). Note: Only single-channel audio input is supported at this time. |
| enable_word_time_offsets | bool | | If true, the top result includes a list of words with start and end time offsets (timestamps) and confidence scores for those words. If false, no word-level time offset information is returned. The default is false. |
| enable_automatic_punctuation | bool | | If true, adds punctuation to recognition result hypotheses. The default value, false, does not add punctuation to result hypotheses. |
| enable_separate_recognition_per_channel | bool | | This must be set to true explicitly, with audio_channel_count > 1, to get each channel recognized separately. The recognition result then contains a channel_tag field stating which channel the result belongs to. If this is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio. Note: This field is not yet supported. |
| model | string | | Which model to select for the given request. If empty, Riva selects the right model based on the other RecognitionConfig parameters. The model should correspond to the name passed to riva-build with the --name argument. |
| verbatim_transcripts | bool | | The verbatim_transcripts flag enables or disables inverse text normalization. true returns exactly what was said, with no denormalization; false applies inverse text normalization and is the default. |
| custom_configuration | RecognitionConfig.CustomConfigurationEntry | repeated | Custom fields for passing request-level configuration options to plugins used in the model pipeline. |
| endpointing_config | EndpointingConfig | optional | Config for tuning start or end of utterance parameters. If empty, Riva uses default values, or custom values if specified in riva-build arguments. |
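A sketch of assembling a typical offline configuration, assuming stubs compiled from riva_asr.proto and riva_audio.proto are importable as riva_asr_pb2 and riva_audio_pb2; the custom_configuration key shown is purely hypothetical.

```python
import riva_asr_pb2 as rasr      # assumed generated module names
import riva_audio_pb2 as raudio

config = rasr.RecognitionConfig(
    encoding=raudio.AudioEncoding.LINEAR_PCM,  # mono 16-bit little-endian PCM
    sample_rate_hertz=16000,                   # resampled by the server if the model differs
    language_code="en-US",                     # required BCP-47 tag
    max_alternatives=1,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

# Request-level plugin options travel in the custom_configuration map.
config.custom_configuration["example_plugin_option"] = "value"  # hypothetical key
```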

RecognitionConfig.CustomConfigurationEntry#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

RecognizeRequest#

RecognizeRequest is used for batch processing of a single audio recording.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| config | RecognitionConfig | | Provides information to the recognizer that specifies how to process the request. |
| audio | bytes | | The raw audio data to be processed. The audio bytes must be encoded as specified in RecognitionConfig. |
| id | nvidia.riva.RequestId | | The ID to be associated with the request. If provided, this will be returned in the corresponding response. |

RecognizeResponse#

The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| results | SpeechRecognitionResult | repeated | Sequential list of transcription results corresponding to sequential portions of audio. Currently only one transcript is returned. |
| id | nvidia.riva.RequestId | | The ID associated with the request. |
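This request/response pair maps to a single blocking Recognize call. The sketch below assumes generated modules named riva_asr_pb2, riva_asr_pb2_grpc, and riva_audio_pb2, a placeholder server address, and a hypothetical file of raw 16 kHz mono 16-bit PCM samples.

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc
import riva_audio_pb2 as raudio

with open("audio.raw", "rb") as f:     # hypothetical file of raw 16 kHz mono 16-bit PCM
    audio_bytes = f.read()

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)
    request = rasr.RecognizeRequest(
        config=rasr.RecognitionConfig(
            encoding=raudio.AudioEncoding.LINEAR_PCM,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        audio=audio_bytes,
    )
    response = stub.Recognize(request)  # blocks until the transcript is ready
    for result in response.results:
        if result.alternatives:
            print(result.alternatives[0].transcript)
```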

RivaSpeechRecognitionConfigRequest#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_name | string | | If a model is specified, only the config for that model is returned; otherwise, all configs are returned. |

RivaSpeechRecognitionConfigResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_config | RivaSpeechRecognitionConfigResponse.Config | repeated | |

RivaSpeechRecognitionConfigResponse.Config#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_name | string | | |
| parameters | RivaSpeechRecognitionConfigResponse.Config.ParametersEntry | repeated | |

RivaSpeechRecognitionConfigResponse.Config.ParametersEntry#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |
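These config messages are returned by the GetRivaSpeechRecognitionConfig method described at the end of this file. A minimal sketch of querying them, under the same module-name and endpoint assumptions as the other examples:

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)

    # An empty model_name asks for the configuration of every deployed ASR model.
    response = stub.GetRivaSpeechRecognitionConfig(
        rasr.RivaSpeechRecognitionConfigRequest(model_name="")
    )
    for model in response.model_config:
        print(model.model_name, dict(model.parameters))
```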

SpeechContext#

Provides “hints” to the speech recognizer to favor specific words and phrases in the results.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| phrases | string | repeated | A list of strings containing word and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. It can also be used to add additional words to the vocabulary of the recognizer. |
| boost | float | | Hint boost. A positive value increases the probability that a specific phrase is recognized over other similar-sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Though boost can accept a wide range of positive values, most use cases are best served with values between 0 and 20. We recommend using a binary search approach to find the optimal value for your use case. |
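As a sketch (riva_asr_pb2 is again an assumed module name, and both the phrases and the boost of 4.0 are illustrative):

```python
import riva_asr_pb2 as rasr  # assumed generated module name

config = rasr.RecognitionConfig(language_code="en-US")

# Bias recognition toward domain-specific terms; the phrases and boost value are
# illustrative, and the description above suggests searching within roughly 0-20.
config.speech_contexts.append(
    rasr.SpeechContext(phrases=["Riva", "TensorRT", "Jetson"], boost=4.0)
)
```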

SpeechRecognitionAlternative#

Alternative hypotheses (a.k.a. n-best list).

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| transcript | string | | Transcript text representing the words that the user spoke. |
| confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
| words | WordInfo | repeated | A list of word-specific information for each recognized word. Only populated if is_final=true. |

SpeechRecognitionResult#

A speech recognition result corresponding to the latest transcript

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| alternatives | SpeechRecognitionAlternative | repeated | May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
| audio_processed | float | | Length of audio processed so far, in seconds. |

StreamingRecognitionConfig#

Provides information to the recognizer that specifies how to process the request

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| config | RecognitionConfig | | Provides information to the recognizer that specifies how to process the request. |
| interim_results | bool | | If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned. |

StreamingRecognitionResult#

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| alternatives | SpeechRecognitionAlternative | repeated | May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| is_final | bool | | If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and the corresponding audio. |
| stability | float | | An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set. |
| channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
| audio_processed | float | | Length of audio processed so far, in seconds. |
| pipeline_states | PipelineStates | optional | Message containing pipeline states. |

StreamingRecognizeRequest#

A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| streaming_config | StreamingRecognitionConfig | | Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message. |
| audio_content | bytes | | The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio data, and all subsequent StreamingRecognizeRequest messages must contain audio data. The audio bytes must be encoded as specified in RecognitionConfig. |
| id | nvidia.riva.RequestId | | The ID to be associated with the request. If provided, this will be returned in the corresponding responses. |
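The first-message-is-config rule above translates into a request generator like the following sketch; the module names, endpoint, and synthetic audio chunks are all assumptions.

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc
import riva_audio_pb2 as raudio

def request_stream(streaming_config, audio_chunks):
    # First message: configuration only, no audio.
    yield rasr.StreamingRecognizeRequest(streaming_config=streaming_config)
    # Subsequent messages: raw audio bytes only.
    for chunk in audio_chunks:
        yield rasr.StreamingRecognizeRequest(audio_content=chunk)

streaming_config = rasr.StreamingRecognitionConfig(
    config=rasr.RecognitionConfig(
        encoding=raudio.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,
)

# Placeholder audio: ten ~100 ms chunks of silence at 16 kHz, 16-bit mono.
audio_chunks = [b"\x00\x00" * 1600 for _ in range(10)]

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)
    responses = stub.StreamingRecognize(request_stream(streaming_config, audio_chunks))
    for response in responses:
        for result in response.results:
            if result.alternatives:
                tag = "final" if result.is_final else "interim"
                print(tag, result.alternatives[0].transcript)
```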

StreamingRecognizeResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| results | StreamingRecognitionResult | repeated | This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives. |
| id | nvidia.riva.RequestId | | The ID associated with the request. |

WordInfo#

Word-specific information for recognized words.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| start_time | int32 | | Time offset, relative to the beginning of the audio in ms, corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets=true, and only in the top hypothesis. |
| end_time | int32 | | Time offset, relative to the beginning of the audio in ms, corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true, and only in the top hypothesis. |
| word | string | | The word corresponding to this set of information. |
| confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
| speaker_tag | int32 | | Output only. Not available in this release. |
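A small helper, sketched under the assumption that response is a RecognizeResponse (or a StreamingRecognizeResponse with is_final results) produced with enable_word_time_offsets=true in the RecognitionConfig:

```python
def print_word_timings(response):
    """Print per-word offsets from a response produced with enable_word_time_offsets=true."""
    for result in response.results:
        if not result.alternatives:
            continue
        # Word-level information is only populated in the top hypothesis.
        for word in result.alternatives[0].words:
            # start_time / end_time are millisecond offsets from the start of the audio.
            print(f"{word.word}: {word.start_time}-{word.end_time} ms "
                  f"(confidence {word.confidence:.2f})")
```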

RivaSpeechRecognition#

The RivaSpeechRecognition service provides two mechanisms for converting speech to text.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Recognize | RecognizeRequest | RecognizeResponse | Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned. |
| StreamingRecognize | StreamingRecognizeRequest stream | StreamingRecognizeResponse stream | StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client. |
| GetRivaSpeechRecognitionConfig | RivaSpeechRecognitionConfigRequest | RivaSpeechRecognitionConfigResponse | Enables clients to request the configuration of the current ASR service, or of a specific model within the service. |


riva/proto/riva_audio.proto#

AudioEncoding#

AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.

| Name | Number | Description |
| ---- | ------ | ----------- |
| ENCODING_UNSPECIFIED | 0 | Not specified. |
| LINEAR_PCM | 1 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
| FLAC | 2 | FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless (recognition is therefore not compromised) and requires only about half the bandwidth of LINEAR_PCM. FLAC stream encoding supports 16-bit and 24-bit samples; however, not all fields in STREAMINFO are supported. |
| MULAW | 3 | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
| OGGOPUS | 4 | |
| ALAW | 20 | 8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law. |
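For LINEAR_PCM, the raw sample bytes of a mono 16-bit WAV file can be used directly as the audio / audio_content payload. A sketch using Python's standard wave module ("speech.wav" is a hypothetical path):

```python
import wave

# "speech.wav" is a hypothetical path to a mono, 16-bit PCM WAV file.
with wave.open("speech.wav", "rb") as wav:
    assert wav.getnchannels() == 1, "all encodings support only mono audio"
    assert wav.getsampwidth() == 2, "LINEAR_PCM expects 16-bit samples"
    sample_rate_hertz = wav.getframerate()          # goes into RecognitionConfig
    audio_bytes = wav.readframes(wav.getnframes())  # little-endian 16-bit PCM payload
```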


riva/proto/riva_common.proto#

RequestId#

Specifies the request ID of the request.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| value | string | | |
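RequestId is a thin wrapper around a string, so any client-generated identifier works; a UUID is a convenient choice. A sketch (riva_asr_pb2 is again an assumed module name):

```python
import uuid

import riva_asr_pb2 as rasr  # assumed generated module name

# Attach a client-generated ID so responses can be correlated with this request.
request = rasr.RecognizeRequest()
request.id.value = str(uuid.uuid4())
```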

Scalar Value Types#

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| ----------- | ----- | --- | ---- | ------ | -- | -- | --- | ---- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |