API Reference#
riva/proto/health.proto#
HealthCheckRequest#
Field | Type | Label | Description |
---|---|---|---|
service | string | | |
HealthCheckResponse#
Field | Type | Label | Description |
---|---|---|---|
status | | | |
HealthCheckResponse.ServingStatus#
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
SERVING | 1 | |
NOT_SERVING | 2 | |
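As a minimal sketch of how a client might interpret these numbers, the following mirrors the enum in plain Python. The `ServingStatus` class and the `is_healthy` helper are illustrative stand-ins, not the generated protobuf classes; treating unrecognized numbers as unhealthy is an assumption.

```python
from enum import IntEnum


class ServingStatus(IntEnum):
    """Plain-Python mirror of HealthCheckResponse.ServingStatus (stand-in)."""
    UNKNOWN = 0
    SERVING = 1
    NOT_SERVING = 2


def is_healthy(status_number: int) -> bool:
    """Return True only for SERVING; unknown numbers count as unhealthy (assumption)."""
    try:
        return ServingStatus(status_number) is ServingStatus.SERVING
    except ValueError:
        return False


print(ServingStatus(2).name, is_healthy(2))
```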
Health#
Method Name | Request Type | Response Type | Description |
---|---|---|---|
Check | | | |
Watch | | HealthCheckResponse stream | |
riva/proto/riva_asr.proto#
EndpointingConfig#
EndpointingConfig configures the parameters that govern start-of-utterance and end-of-utterance detection.
Field | Type | Label | Description |
---|---|---|---|
start_history | int32 | optional | |
start_threshold | float | optional | |
stop_history | int32 | optional | |
stop_threshold | float | optional | |
stop_history_eou | int32 | optional | |
stop_threshold_eou | float | optional | |
PipelineStates#
Field | Type | Label | Description |
---|---|---|---|
vad_probabilities | float | repeated | Neural VAD probabilities |
RecognitionConfig#
Provides information to the recognizer that specifies how to process the request.
Field | Type | Label | Description |
---|---|---|---|
encoding | | | The encoding of the audio data sent in the request. All encodings support only 1 channel (mono) audio. |
sample_rate_hertz | int32 | | The sample rate in hertz (Hz) of the audio data sent in the |
language_code | string | | Required. The language of the supplied audio as a BCP-47 language tag. Example: “en-US”. |
max_alternatives | int32 | | Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of |
profanity_filter | bool | | A custom field that enables profanity filtering for the generated transcripts. If set to ‘true’, the server filters out profanities, replacing all but the initial character in each filtered word with asterisks. For example, “x**”. If set to |
speech_contexts | SpeechContext | repeated | Array of SpeechContext. Provides context to assist speech recognition. For more information, see the SpeechContext section. |
audio_channel_count | int32 | | The number of channels in the input audio data. If |
enable_word_time_offsets | bool | | If |
enable_automatic_punctuation | bool | | If ‘true’, adds punctuation to recognition result hypotheses. The default ‘false’ value does not add punctuation to result hypotheses. |
enable_separate_recognition_per_channel | bool | | This needs to be set to |
model | string | | Which model to select for the given request. If empty, Riva will select the right model based on the other RecognitionConfig parameters. The model should correspond to the name passed to |
verbatim_transcripts | bool | | Enables or disables inverse text normalization. ‘true’ returns exactly what was said, with no denormalization. ‘false’ (the default) applies inverse text normalization. |
custom_configuration | | repeated | Custom fields for passing request-level configuration options to plugins used in the model pipeline. |
endpointing_config | | optional | Config for tuning start or end of utterance parameters. If empty, Riva will use default values or custom values if specified in riva-build arguments. |
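The masking rule described for `profanity_filter` (keep the initial character, replace the rest with asterisks) can be sketched as follows. This is only an illustration of the documented output shape, not the server's implementation; `mask_profanity`, `filter_transcript`, and the blocked-word set are hypothetical.

```python
from typing import Set


def mask_profanity(word: str) -> str:
    """Keep the first character and replace every other character with '*'."""
    if not word:
        return word
    return word[0] + "*" * (len(word) - 1)


def filter_transcript(transcript: str, blocked: Set[str]) -> str:
    """Mask each whitespace-separated word found in the (hypothetical) blocked set."""
    return " ".join(
        mask_profanity(w) if w.lower() in blocked else w
        for w in transcript.split()
    )


print(filter_transcript("well darn it", {"darn"}))
```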
RecognitionConfig.CustomConfigurationEntry#
Field | Type | Label | Description |
---|---|---|---|
key | string | | |
value | string | | |
RecognizeRequest#
RecognizeRequest is used for batch processing of a single audio recording.
Field | Type | Label | Description |
---|---|---|---|
config | | | Provides information to the recognizer that specifies how to process the request. |
audio | bytes | | The raw audio data to be processed. The audio bytes must be encoded as specified in |
id | | | The ID to be associated with the request. If provided, this will be returned in the corresponding response. |
RecognizeResponse#
The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
Field | Type | Label | Description |
---|---|---|---|
results | SpeechRecognitionResult | repeated | Sequential list of transcription results corresponding to sequential portions of audio. Currently only returns one transcript. |
id | | | The ID associated with the request. |
RivaSpeechRecognitionConfigRequest#
Field | Type | Label | Description |
---|---|---|---|
model_name | string | | If a model is specified, only the config for that model is returned; otherwise, all configs are returned. |
RivaSpeechRecognitionConfigResponse#
Field | Type | Label | Description |
---|---|---|---|
model_config | | repeated | |
RivaSpeechRecognitionConfigResponse.Config#
Field | Type | Label | Description |
---|---|---|---|
model_name | string | | |
parameters | | repeated | |
RivaSpeechRecognitionConfigResponse.Config.ParametersEntry#
Field | Type | Label | Description |
---|---|---|---|
key | string | | |
value | string | | |
SpeechContext#
Provides “hints” to the speech recognizer to favor specific words and phrases in the results.
Field | Type | Label | Description |
---|---|---|---|
phrases | string | repeated | A list of word and phrase “hints” that make the speech recognizer more likely to recognize them. This can be used to improve accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. It can also be used to add words to the vocabulary of the recognizer. |
boost | float | | Hint boost. A positive value increases the probability that a specific phrase will be recognized over other similar-sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Though |
SpeechRecognitionAlternative#
Alternative hypotheses (a.k.a. n-best list).
Field | Type | Label | Description |
---|---|---|---|
transcript | string | | Transcript text representing the words that the user spoke. |
confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
words | | repeated | A list of word-specific information for each recognized word. Only populated if is_final=true. |
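Because `confidence` can be roughly read as a natural-log probability with 0.0 reserved as a "not set" sentinel, a client might convert it as in the sketch below. The helper name `confidence_to_probability` and the clamping to 1.0 are assumptions for illustration; the description above notes that the computation varies with configuration, so treat the result as approximate.

```python
import math
from typing import Optional


def confidence_to_probability(confidence: float) -> Optional[float]:
    """Map a natural-log confidence to a probability, or None if it was not set."""
    if confidence == 0.0:
        # 0.0 is documented as the sentinel for "confidence was not set".
        return None
    # Clamp to 1.0 in case a positive value is ever reported (assumption).
    return min(1.0, math.exp(confidence))


print(confidence_to_probability(math.log(0.5)))
print(confidence_to_probability(0.0))
```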
SpeechRecognitionResult#
A speech recognition result corresponding to the latest transcript.
Field | Type | Label | Description |
---|---|---|---|
alternatives | | repeated | May contain one or more recognition hypotheses (up to the maximum specified in |
channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’. |
audio_processed | float | | Length of audio processed so far in seconds. |
StreamingRecognitionConfig#
Provides information to the recognizer that specifies how to process the request.
Field | Type | Label | Description |
---|---|---|---|
config | | | Provides information to the recognizer that specifies how to process the request. |
interim_results | bool | | If |
StreamingRecognitionResult#
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
Field | Type | Label | Description |
---|---|---|---|
alternatives | | repeated | May contain one or more recognition hypotheses (up to the maximum specified in |
is_final | bool | | If |
stability | float | | An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results ( |
channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’. |
audio_processed | float | | Length of audio processed so far in seconds. |
pipeline_states | | optional | Message for pipeline states. |
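One way a client could use `is_final` and `stability` together is to display final results always and interim results only once they are reasonably stable. The sketch below uses a stand-in `Result` dataclass rather than the generated message class, and the 0.8 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Stand-in for the fields of StreamingRecognitionResult used here."""
    transcript: str
    is_final: bool
    stability: float  # 0.0 (completely unstable) to 1.0 (completely stable)


def displayable(results: List[Result], min_stability: float = 0.8) -> List[str]:
    """Keep final results, plus interim results at or above the stability threshold."""
    return [
        r.transcript
        for r in results
        if r.is_final or r.stability >= min_stability
    ]


stream = [
    Result("hel", False, 0.1),
    Result("hello wor", False, 0.9),
    Result("hello world", True, 0.0),
]
print(displayable(stream))
```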
StreamingRecognizeRequest#
A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.
Field | Type | Label | Description |
---|---|---|---|
streaming_config | StreamingRecognitionConfig | | Provides information to the recognizer that specifies how to process the request. The first |
audio_content | bytes | | The audio data to be recognized. Sequential chunks of audio data are sent in sequential |
id | | | The ID to be associated with the request. If provided, this will be returned in the corresponding responses. |
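The ordering rule above (a first message carrying only the configuration, followed by messages carrying only audio bytes) can be sketched with a generator. `StreamingRequest` is a stand-in dataclass, not the generated `StreamingRecognizeRequest` class, and the config dictionary and chunk size are placeholders.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class StreamingRequest:
    """Stand-in message: exactly one of the two fields is set per request."""
    streaming_config: Optional[dict] = None
    audio_content: Optional[bytes] = None


def build_requests(config: dict, audio: bytes, chunk_size: int) -> Iterator[StreamingRequest]:
    """Yield the config-only message first, then sequential audio-only chunks."""
    yield StreamingRequest(streaming_config=config)
    for start in range(0, len(audio), chunk_size):
        yield StreamingRequest(audio_content=audio[start:start + chunk_size])


requests = list(build_requests({"interim_results": True}, b"\x00" * 10, 4))
print(len(requests), [len(r.audio_content) for r in requests[1:]])
```

Using a generator keeps the ordering constraint in one place and lets audio be produced lazily, for example as it arrives from a microphone.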
StreamingRecognizeResponse#
Field | Type | Label | Description |
---|---|---|---|
results | | repeated | This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives. |
id | | | The ID associated with the request. |
WordInfo#
Word-specific information for recognized words.
Field | Type | Label | Description |
---|---|---|---|
start_time | int32 | | Time offset relative to the beginning of the audio in ms and corresponding to the start of the spoken word. This field is only set if |
end_time | int32 | | Time offset relative to the beginning of the audio in ms and corresponding to the end of the spoken word. This field is only set if |
word | string | | The word corresponding to this set of information. |
confidence | float | | The confidence estimate. A higher number indicates an estimated greater likelihood that the recognized word is correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. Although confidence can currently be roughly interpreted as a natural-log probability, the estimate computation varies with different configurations and is subject to change. The default of 0.0 is a sentinel value indicating confidence was not set. |
speaker_tag | int32 | | Output only. Not available in this release. |
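Since `start_time` and `end_time` are integer millisecond offsets from the beginning of the audio, converting them for display is simple arithmetic. The sketch below uses a stand-in `Word` dataclass; `word_duration_seconds` and `format_offset` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for the WordInfo fields used here."""
    word: str
    start_time: int  # ms from the beginning of the audio
    end_time: int    # ms from the beginning of the audio


def word_duration_seconds(w: Word) -> float:
    """Duration of the spoken word in seconds."""
    return (w.end_time - w.start_time) / 1000.0


def format_offset(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, rem = divmod(ms, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


w = Word("hello", 61_250, 61_550)
print(format_offset(w.start_time), word_duration_seconds(w))
```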
RivaSpeechRecognition#
The RivaSpeechRecognition service provides two mechanisms for converting speech to text.
Method Name | Request Type | Response Type | Description |
---|---|---|---|
Recognize | RecognizeRequest | RecognizeResponse | Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned. |
StreamingRecognize | StreamingRecognizeRequest stream | StreamingRecognizeResponse stream | StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client. |
GetRivaSpeechRecognitionConfig | | | Enables clients to request the configuration of the current ASR service, or a specific model within the service. |
riva/proto/riva_audio.proto#
AudioEncoding#
AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.
Name | Number | Description |
---|---|---|
ENCODING_UNSPECIFIED | 0 | Not specified. |
LINEAR_PCM | 1 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
FLAC | 2 | |
MULAW | 3 | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
OGGOPUS | 4 | |
ALAW | 20 | 8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law. |
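To make the LINEAR_PCM layout concrete, the sketch below packs integer samples as 16-bit signed little-endian bytes with the standard `struct` module. The helper name `to_linear_pcm` and the clipping of out-of-range values are assumptions; real audio would normally come from a file or capture device.

```python
import struct
from typing import Sequence


def to_linear_pcm(samples: Sequence[int]) -> bytes:
    """Pack samples as 16-bit signed little-endian PCM, clipping to the int16 range."""
    clipped = [max(-32768, min(32767, int(s))) for s in samples]
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(clipped)}h", *clipped)


pcm = to_linear_pcm([0, 1, -1, 40000])
print(pcm.hex())
```

Each sample occupies two bytes, so one second of mono audio at 16000 Hz is 32000 bytes.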
riva/proto/riva_common.proto#
RequestId#
Specifies the request ID of the request.
Field | Type | Label | Description |
---|---|---|---|
value | string | | |
Scalar Value Types#
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
---|---|---|---|---|---|---|---|---|
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
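The size notes in the table follow from how variable-length integers are laid out on the wire. The sketch below is a standalone illustration of base-128 varint and ZigZag encoding as specified for protobuf; it is not the protobuf library's API, and the helper names (`encode_varint`, `encode_int32`, `zigzag32`, `encode_sint32`) are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """int32: negative values are sign-extended to 64 bits before varint encoding."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def encode_sint32(value: int) -> bytes:
    """sint32: ZigZag first, then varint."""
    return encode_varint(zigzag32(value))


print(len(encode_int32(-1)), len(encode_sint32(-1)))
print(len(encode_varint(2**28 - 1)), len(encode_varint(2**28)))
```

The second line of output shows why fixed32 wins for large values: a varint needs a fifth byte once the value reaches 2^28, whereas fixed32 always uses four.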