API Reference#


riva/proto/health.proto#

HealthCheckRequest#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| service | string | | |

HealthCheckResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| status | HealthCheckResponse.ServingStatus | | |

HealthCheckResponse.ServingStatus#

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN | 0 | |
| SERVING | 1 | |
| NOT_SERVING | 2 | |

Health#

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Check | HealthCheckRequest | HealthCheckResponse | |
| Watch | HealthCheckRequest | HealthCheckResponse stream | |
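As a minimal sketch of calling this service from Python, assuming the stubs generated from riva/proto/health.proto are importable as health_pb2 and health_pb2_grpc, and that the endpoint below is replaced with a real Riva server address:

```python
import grpc

# Assumed module names for stubs generated from riva/proto/health.proto.
import health_pb2
import health_pb2_grpc

# Placeholder endpoint; substitute the address of a running Riva server.
with grpc.insecure_channel("localhost:50051") as channel:
    stub = health_pb2_grpc.HealthStub(channel)

    # An empty service name queries the overall server health.
    response = stub.Check(health_pb2.HealthCheckRequest(service=""))
    print(health_pb2.HealthCheckResponse.ServingStatus.Name(response.status))
```

Watch takes the same request but returns a stream of HealthCheckResponse messages, which can be iterated like any gRPC server stream.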


riva/proto/riva_asr.proto#

EndpointingConfig#

EndpointingConfig configures the parameters used to detect the start and end of an utterance.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| start_history | int32 | optional | Size of the window, in milliseconds, used to detect start of utterance. If start_threshold of start_history ms of the acoustic model output have non-blank tokens, start of utterance is detected. |
| start_threshold | float | optional | Percentage threshold (0.0 to 1.0) used to detect start of utterance. |
| stop_history | int32 | optional | Size of the window, in milliseconds, used to detect end of utterance. If stop_threshold of stop_history ms of the acoustic model output have non-blank tokens, end of utterance is detected and the decoder is reset. |
| stop_threshold | float | optional | Percentage threshold (0.0 to 1.0) used to detect end of utterance. |
| stop_history_eou | int32 | optional | Used together with stop_threshold_eou for 2-pass end of utterance. stop_history_eou is the size of the window, in milliseconds, used to trigger the 1st pass of end of utterance and generate a partial transcript with stability of 1. Must be smaller than stop_history. If stop_threshold_eou of stop_history_eou ms of the acoustic model output have non-blank tokens, the 1st pass of end of utterance is triggered. |
| stop_threshold_eou | float | optional | Percentage threshold (0.0 to 1.0) used to trigger the 1st pass of end of utterance. |
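For illustration, a configuration built with the Python protobuf API might look as follows. The module name riva_asr_pb2 is an assumption about how riva/proto/riva_asr.proto was compiled, and the window/threshold values are placeholders rather than recommended settings.

```python
import riva_asr_pb2 as rasr  # assumed generated module name

# Placeholder values: windows are in milliseconds, thresholds in [0.0, 1.0].
endpointing = rasr.EndpointingConfig(
    start_history=300,
    start_threshold=0.2,
    stop_history=800,
    stop_threshold=0.98,
    stop_history_eou=240,      # 1st-pass window; must be smaller than stop_history
    stop_threshold_eou=0.98,
)

# The message is attached to a RecognitionConfig via its endpointing_config field.
config = rasr.RecognitionConfig(language_code="en-US", endpointing_config=endpointing)
```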

PipelineStates#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| vad_probabilities | float | repeated | Neural VAD probabilities. |

RecognitionConfig#

Provides information to the recognizer that specifies how to process the request

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| encoding | nvidia.riva.AudioEncoding | | The encoding of the audio data sent in the request. All encodings support only 1 channel (mono) audio. |
| sample_rate_hertz | int32 | | The sample rate in hertz (Hz) of the audio data sent in the RecognizeRequest or StreamingRecognizeRequest messages. The Riva server automatically down-samples/up-samples the audio to match the ASR acoustic model sample rate. Sample rates below 8 kHz will not produce meaningful output. |
| language_code | string | | Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-US". |
| max_alternatives | int32 | | Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than max_alternatives. If omitted, at most one alternative is returned. |
| profanity_filter | bool | | A custom field that enables profanity filtering for the generated transcripts. If set to true, the server filters out profanities, replacing all but the initial character in each filtered word with asterisks. For example, "x**". If set to false or omitted, profanities are not filtered out. The default is false. |
| speech_contexts | SpeechContext | repeated | Array of SpeechContext. A means to provide context to assist the speech recognition. For more information, see the SpeechContext section. |
| audio_channel_count | int32 | | The number of channels in the input audio data. If 0 or omitted, defaults to one channel (mono). Note: Only single-channel audio input is supported at this time. |
| enable_word_time_offsets | bool | | If true, the top result includes a list of words with start and end time offsets (timestamps) and confidence scores for those words. If false, no word-level time offset information is returned. The default is false. |
| enable_automatic_punctuation | bool | | If true, adds punctuation to recognition result hypotheses. The default value, false, does not add punctuation to result hypotheses. |
| enable_separate_recognition_per_channel | bool | | This must be set to true explicitly, with audio_channel_count > 1, to get each channel recognized separately. The recognition result then contains a channel_tag field stating which channel the result belongs to. If this is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio. Note: This field is not yet supported. |
| model | string | | Which model to select for the given request. If empty, Riva selects the right model based on the other RecognitionConfig parameters. The model should correspond to the name passed to riva-build with the --name argument. |
| verbatim_transcripts | bool | | The verbatim_transcripts flag enables or disables inverse text normalization. true returns exactly what was said, with no denormalization; false applies inverse text normalization and is the default. |
| custom_configuration | RecognitionConfig.CustomConfigurationEntry | repeated | Custom fields for passing request-level configuration options to plugins used in the model pipeline. |
| endpointing_config | EndpointingConfig | optional | Config for tuning start or end of utterance parameters. If empty, Riva uses default values, or custom values if specified in riva-build arguments. |
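A sketch of assembling a typical offline configuration, assuming stubs compiled from riva_asr.proto and riva_audio.proto are importable as riva_asr_pb2 and riva_audio_pb2; the custom_configuration key shown is purely hypothetical.

```python
import riva_asr_pb2 as rasr      # assumed generated module names
import riva_audio_pb2 as raudio

config = rasr.RecognitionConfig(
    encoding=raudio.AudioEncoding.LINEAR_PCM,  # mono 16-bit little-endian PCM
    sample_rate_hertz=16000,                   # resampled by the server if the model differs
    language_code="en-US",                     # required BCP-47 tag
    max_alternatives=1,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

# Request-level plugin options travel in the custom_configuration map.
config.custom_configuration["example_plugin_option"] = "value"  # hypothetical key
```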

RecognitionConfig.CustomConfigurationEntry#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

RecognizeRequest#

RecognizeRequest is used for batch processing of a single audio recording.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| config | RecognitionConfig | | Provides information to the recognizer that specifies how to process the request. |
| audio | bytes | | The raw audio data to be processed. The audio bytes must be encoded as specified in RecognitionConfig. |
| id | nvidia.riva.RequestId | | The ID to be associated with the request. If provided, this will be returned in the corresponding response. |

RecognizeResponse#

The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| results | SpeechRecognitionResult | repeated | Sequential list of transcription results corresponding to sequential portions of audio. Currently only one transcript is returned. |
| id | nvidia.riva.RequestId | | The ID associated with the request. |
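This request/response pair maps to a single blocking Recognize call. The sketch below assumes generated modules named riva_asr_pb2, riva_asr_pb2_grpc, and riva_audio_pb2, a placeholder server address, and a hypothetical file of raw 16 kHz mono 16-bit PCM samples.

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc
import riva_audio_pb2 as raudio

with open("audio.raw", "rb") as f:     # hypothetical file of raw 16 kHz mono 16-bit PCM
    audio_bytes = f.read()

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)
    request = rasr.RecognizeRequest(
        config=rasr.RecognitionConfig(
            encoding=raudio.AudioEncoding.LINEAR_PCM,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        audio=audio_bytes,
    )
    response = stub.Recognize(request)  # blocks until the transcript is ready
    for result in response.results:
        if result.alternatives:
            print(result.alternatives[0].transcript)
```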

RivaSpeechRecognitionConfigRequest#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_name | string | | If a model is specified, only the config for that model is returned; otherwise, all configs are returned. |

RivaSpeechRecognitionConfigResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_config | RivaSpeechRecognitionConfigResponse.Config | repeated | |

RivaSpeechRecognitionConfigResponse.Config#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| model_name | string | | |
| parameters | RivaSpeechRecognitionConfigResponse.Config.ParametersEntry | repeated | |

RivaSpeechRecognitionConfigResponse.Config.ParametersEntry#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |
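These config messages are returned by the GetRivaSpeechRecognitionConfig method described at the end of this file. A minimal sketch of querying them, under the same module-name and endpoint assumptions as the other examples:

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)

    # An empty model_name asks for the configuration of every deployed ASR model.
    response = stub.GetRivaSpeechRecognitionConfig(
        rasr.RivaSpeechRecognitionConfigRequest(model_name="")
    )
    for model in response.model_config:
        print(model.model_name, dict(model.parameters))
```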

SpeechContext#

Provides “hints” to the speech recognizer to favor specific words and phrases in the results.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| phrases | string | repeated | A list of strings containing word and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. It can also be used to add additional words to the vocabulary of the recognizer. |
| boost | float | | Hint boost. A positive value increases the probability that a specific phrase is recognized over other similar-sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Though boost can accept a wide range of positive values, most use cases are best served with values between 0 and 20. We recommend using a binary search approach to find the optimal value for your use case. |
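As a sketch (riva_asr_pb2 is again an assumed module name, and both the phrases and the boost of 4.0 are illustrative):

```python
import riva_asr_pb2 as rasr  # assumed generated module name

config = rasr.RecognitionConfig(language_code="en-US")

# Bias recognition toward domain-specific terms; the phrases and boost value are
# illustrative, and the description above suggests searching within roughly 0-20.
config.speech_contexts.append(
    rasr.SpeechContext(phrases=["Riva", "TensorRT", "Jetson"], boost=4.0)
)
```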

SpeechRecognitionAlternative#

Alternative hypotheses (a.k.a. n-best list).

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| transcript | string | | Transcript text representing the words that the user spoke. |
| confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
| words | WordInfo | repeated | A list of word-specific information for each recognized word. Only populated if is_final=true. |

SpeechRecognitionResult#

A speech recognition result corresponding to the latest transcript

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| alternatives | SpeechRecognitionAlternative | repeated | May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
| audio_processed | float | | Length of audio processed so far, in seconds. |

StreamingRecognitionConfig#

Provides information to the recognizer that specifies how to process the request

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| config | RecognitionConfig | | Provides information to the recognizer that specifies how to process the request. |
| interim_results | bool | | If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned. |

StreamingRecognitionResult#

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| alternatives | SpeechRecognitionAlternative | repeated | May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| is_final | bool | | If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and the corresponding audio. |
| stability | float | | An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set. |
| channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
| audio_processed | float | | Length of audio processed so far, in seconds. |
| pipeline_states | PipelineStates | optional | Message containing pipeline states. |

StreamingRecognizeRequest#

A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| streaming_config | StreamingRecognitionConfig | | Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message. |
| audio_content | bytes | | The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio data, and all subsequent StreamingRecognizeRequest messages must contain audio data. The audio bytes must be encoded as specified in RecognitionConfig. |
| id | nvidia.riva.RequestId | | The ID to be associated with the request. If provided, this will be returned in the corresponding responses. |
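The first-message-is-config rule above translates into a request generator like the following sketch; the module names, endpoint, and synthetic audio chunks are all assumptions.

```python
import grpc

import riva_asr_pb2 as rasr            # assumed generated module names
import riva_asr_pb2_grpc as rasr_grpc
import riva_audio_pb2 as raudio

def request_stream(streaming_config, audio_chunks):
    # First message: configuration only, no audio.
    yield rasr.StreamingRecognizeRequest(streaming_config=streaming_config)
    # Subsequent messages: raw audio bytes only.
    for chunk in audio_chunks:
        yield rasr.StreamingRecognizeRequest(audio_content=chunk)

streaming_config = rasr.StreamingRecognitionConfig(
    config=rasr.RecognitionConfig(
        encoding=raudio.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,
)

# Placeholder audio: ten ~100 ms chunks of silence at 16 kHz, 16-bit mono.
audio_chunks = [b"\x00\x00" * 1600 for _ in range(10)]

with grpc.insecure_channel("localhost:50051") as channel:   # placeholder endpoint
    stub = rasr_grpc.RivaSpeechRecognitionStub(channel)
    responses = stub.StreamingRecognize(request_stream(streaming_config, audio_chunks))
    for response in responses:
        for result in response.results:
            if result.alternatives:
                tag = "final" if result.is_final else "interim"
                print(tag, result.alternatives[0].transcript)
```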

StreamingRecognizeResponse#

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| results | StreamingRecognitionResult | repeated | This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives. |
| id | nvidia.riva.RequestId | | The ID associated with the request. |

WordInfo#

Word-specific information for recognized words.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| start_time | int32 | | Time offset, relative to the beginning of the audio in ms, corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets=true, and only in the top hypothesis. |
| end_time | int32 | | Time offset, relative to the beginning of the audio in ms, corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true, and only in the top hypothesis. |
| word | string | | The word corresponding to this set of information. |
| confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
| speaker_tag | int32 | | Output only. Not available in this release. |
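A small helper, sketched under the assumption that response is a RecognizeResponse (or a StreamingRecognizeResponse with is_final results) produced with enable_word_time_offsets=true in the RecognitionConfig:

```python
def print_word_timings(response):
    """Print per-word offsets from a response produced with enable_word_time_offsets=true."""
    for result in response.results:
        if not result.alternatives:
            continue
        # Word-level information is only populated in the top hypothesis.
        for word in result.alternatives[0].words:
            # start_time / end_time are millisecond offsets from the start of the audio.
            print(f"{word.word}: {word.start_time}-{word.end_time} ms "
                  f"(confidence {word.confidence:.2f})")
```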

RivaSpeechRecognition#

The RivaSpeechRecognition service provides two mechanisms for converting speech to text.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Recognize | RecognizeRequest | RecognizeResponse | Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned. |
| StreamingRecognize | StreamingRecognizeRequest stream | StreamingRecognizeResponse stream | StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client. |
| GetRivaSpeechRecognitionConfig | RivaSpeechRecognitionConfigRequest | RivaSpeechRecognitionConfigResponse | Enables clients to request the configuration of the current ASR service, or of a specific model within the service. |


riva/proto/riva_audio.proto#

AudioEncoding#

AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.

| Name | Number | Description |
| ---- | ------ | ----------- |
| ENCODING_UNSPECIFIED | 0 | Not specified. |
| LINEAR_PCM | 1 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
| FLAC | 2 | FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless (recognition is therefore not compromised) and requires only about half the bandwidth of LINEAR_PCM. FLAC stream encoding supports 16-bit and 24-bit samples; however, not all fields in STREAMINFO are supported. |
| MULAW | 3 | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
| OGGOPUS | 4 | |
| ALAW | 20 | 8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law. |
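For LINEAR_PCM, the raw sample bytes of a mono 16-bit WAV file can be used directly as the audio / audio_content payload. A sketch using Python's standard wave module ("speech.wav" is a hypothetical path):

```python
import wave

# "speech.wav" is a hypothetical path to a mono, 16-bit PCM WAV file.
with wave.open("speech.wav", "rb") as wav:
    assert wav.getnchannels() == 1, "all encodings support only mono audio"
    assert wav.getsampwidth() == 2, "LINEAR_PCM expects 16-bit samples"
    sample_rate_hertz = wav.getframerate()          # goes into RecognitionConfig
    audio_bytes = wav.readframes(wav.getnframes())  # little-endian 16-bit PCM payload
```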


riva/proto/riva_common.proto#

RequestId#

Specifies the request ID of the request.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| value | string | | |
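RequestId is a thin wrapper around a string, so any client-generated identifier works; a UUID is a convenient choice. A sketch (riva_asr_pb2 is again an assumed module name):

```python
import uuid

import riva_asr_pb2 as rasr  # assumed generated module name

# Attach a client-generated ID so responses can be correlated with this request.
request = rasr.RecognizeRequest()
request.id.value = str(uuid.uuid4())
```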

Scalar Value Types#

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| ----------- | ----- | --- | ---- | ------ | -- | -- | --- | ---- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |