riva/proto/riva_asr.proto
-
service
RivaSpeechRecognition
The RivaSpeechRecognition service provides two mechanisms for converting speech to text.
-
rpc RecognizeResponse Recognize(RecognizeRequest)
Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned.
-
rpc stream StreamingRecognizeResponse StreamingRecognize(stream StreamingRecognizeRequest)
StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client.
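Because StreamingRecognize consumes audio in chunks as it becomes available, a client typically slices a recording before streaming it. A minimal sketch (the 4096-byte chunk size is an arbitrary illustration, not a Riva requirement):

```python
def chunk_audio(data: bytes, chunk_size: int = 4096) -> list:
    """Split raw audio bytes into fixed-size chunks for streaming upload.

    The final chunk may be shorter than chunk_size.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# 10 bytes of fake audio split into 4-byte chunks: 4 + 4 + 2.
chunks = chunk_audio(b"\x00" * 10, chunk_size=4)
```

Each chunk would then be carried in the audio_content field of a subsequent StreamingRecognizeRequest.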
-
-
message
RecognitionConfig
Provides information to the recognizer that specifies how to process the request.
-
nvidia.riva.AudioEncoding encoding
The encoding of the audio data sent in the request.
All encodings support only 1 channel (mono) audio.
-
int32
sample_rate_hertz
Sample rate in Hertz of the audio data sent in all RecognizeAudio messages.
-
string
language_code
Required. The language of the supplied audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". Currently, only en-US is supported.
-
int32
max_alternatives
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than max_alternatives. If omitted, a maximum of one is returned.
-
SpeechContext
speech_contexts
(repeated) Array of SpeechContext. A means of providing context to assist the speech recognition. For more information, see the SpeechContext section.
-
int32
audio_channel_count
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. Valid values for OGG_OPUS are 1-254. The only valid value for MULAW, AMR, AMR_WB, and SPEEX_WITH_HEADER_BYTE is 1. If 0 or omitted, defaults to one channel (mono). Note: only the first channel is recognized by default. To perform independent recognition on each channel, set enable_separate_recognition_per_channel to true.
-
bool
enable_word_time_offsets
If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false.
-
bool
enable_automatic_punctuation
If true, adds punctuation to recognition result hypotheses. The default, false, does not add punctuation to result hypotheses.
-
bool
enable_separate_recognition_per_channel
Set this explicitly to true, with audio_channel_count > 1, to have each channel recognized separately. The recognition result contains a channel_tag field indicating which channel the result belongs to. If this is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio.
-
string
model
Which model to select for the given request. Valid choices: Jasper, Quartznet.
-
bool
verbatim_transcripts
The verbatim_transcripts flag enables or disables inverse text normalization. If true, returns exactly what was said, with no inverse text normalization applied. If false (the default), inverse text normalization is applied.
-
RecognitionConfig.CustomConfigurationEntry
custom_configuration
(repeated) Custom fields for passing request-level configuration options to plugins used in the model pipeline.
-
-
message RecognitionConfig.CustomConfigurationEntry
-
string
key
-
string
value
-
-
message
RecognizeRequest
RecognizeRequest is used for batch processing of a single audio recording.
-
RecognitionConfig
config
Provides information to recognizer that specifies how to process the request.
-
bytes
audio
The raw audio data to be processed. The audio bytes must be encoded as specified in RecognitionConfig.
-
-
message
RecognizeResponse
The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
-
SpeechRecognitionResult
results
(repeated) Sequential list of transcription results corresponding to sequential portions of audio. Currently only returns one transcript.
-
-
message
SpeechContext
Provides “hints” to the speech recognizer to favor specific words and phrases in the results.
-
string
phrases
(repeated) A list of words and phrases that the recognizer should be more likely to recognize. This can improve accuracy for specific terms, for example, commands typically spoken by the user. Note that it is currently not possible to boost phrases or combinations of words separated by spaces; this will be supported in a future version of Riva.
-
float
boost
Hint Boost. A positive value increases the probability that a specific phrase is recognized over other, similar-sounding phrases. The higher the boost, the higher the chance of false-positive recognition as well. Although boost accepts a wide range of positive values, most use cases are best served with values between 0 and 20. We recommend a binary-search approach to finding the optimal value for your use case.
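The binary-search recommendation above can be sketched as follows. The false_positive_rate callback is hypothetical (in practice, you would measure it on a held-out test set), and the sketch assumes the rate grows monotonically with boost:

```python
def max_safe_boost(false_positive_rate, limit=0.05, lo=0.0, hi=20.0, tol=0.25):
    """Find the largest boost whose false-positive rate stays under `limit`.

    Assumes false_positive_rate(boost) is monotonically non-decreasing
    in boost, so binary search over [lo, hi] applies.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if false_positive_rate(mid) <= limit:
            lo = mid  # still acceptable; try a stronger boost
        else:
            hi = mid  # too aggressive; back off
    return lo

# Toy monotone model of false positives, for demonstration only.
best = max_safe_boost(lambda b: b / 100.0, limit=0.05)
```

The search range of 0-20 mirrors the recommended range above; tol controls how precisely the boundary is located.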
-
-
message
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
-
string
transcript
Transcript text representing the words that the user spoke.
-
float
confidence
The non-normalized confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for a non-streaming result, or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided.
-
WordInfo
words
(repeated) A list of word-specific information for each recognized word. Only populated if is_final=true.
-
-
message
SpeechRecognitionResult
A speech recognition result corresponding to the latest transcript.
-
SpeechRecognitionAlternative
alternatives
(repeated) May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
-
int32
channel_tag
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, output values range from 1 to N.
-
float
audio_processed
Length of audio processed so far, in seconds.
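For multi-channel recognition (enable_separate_recognition_per_channel=true), results for all channels arrive in one stream and can be regrouped by channel_tag. A sketch using plain dicts as stand-ins for the protobuf messages:

```python
def transcripts_by_channel(results):
    """Group top-alternative transcripts by their channel_tag (1..N)."""
    by_channel = {}
    for result in results:
        if result["alternatives"]:  # skip results with no hypotheses
            top = result["alternatives"][0]["transcript"]
            by_channel.setdefault(result["channel_tag"], []).append(top)
    return by_channel

results = [
    {"channel_tag": 1, "alternatives": [{"transcript": "hello"}]},
    {"channel_tag": 2, "alternatives": [{"transcript": "good morning"}]},
    {"channel_tag": 1, "alternatives": [{"transcript": "world"}]},
]
```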
-
-
message
StreamingRecognitionConfig
Provides information to the recognizer that specifies how to process the request.
-
RecognitionConfig
config
Provides information to the recognizer that specifies how to process the request.
-
bool
interim_results
If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned.
-
-
message
StreamingRecognitionResult
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
-
SpeechRecognitionAlternative
alternatives
(repeated) May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
-
bool
is_final
If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.
-
float
stability
An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set.
-
int32
channel_tag
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, output values range from 1 to N.
-
float
audio_processed
Length of audio processed so far, in seconds.
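A client consuming streaming results typically commits is_final=true text and treats interim hypotheses as provisional, optionally gating them on the stability estimate. A sketch with dicts standing in for StreamingRecognitionResult (the 0.5 threshold is an arbitrary illustration):

```python
def assemble_transcript(results, min_stability=0.5):
    """Commit final results; display the latest sufficiently stable interim one."""
    committed = []
    pending = ""
    for result in results:
        text = result["alternatives"][0]["transcript"]
        if result["is_final"]:
            committed.append(text)
            pending = ""  # interim text is superseded by the final result
        elif result["stability"] >= min_stability:
            pending = text
    return " ".join(committed + ([pending] if pending else []))

results = [
    {"is_final": False, "stability": 0.1, "alternatives": [{"transcript": "he"}]},
    {"is_final": False, "stability": 0.9, "alternatives": [{"transcript": "hello"}]},
    {"is_final": True, "stability": 0.0, "alternatives": [{"transcript": "hello world"}]},
]
```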
-
-
message
StreamingRecognizeRequest
A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.
-
StreamingRecognitionConfig
streaming_config
Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.
-
bytes
audio_content
The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio data and all subsequent StreamingRecognizeRequest messages must contain audio data. The audio bytes must be encoded as specified in RecognitionConfig.
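The first-message contract (streaming_config first, then audio-only messages) can be sketched as a request generator. Plain dicts stand in for the generated StreamingRecognizeRequest class:

```python
def request_stream(streaming_config, audio_chunks):
    """Yield one config-only message, then one audio-only message per chunk."""
    yield {"streaming_config": streaming_config}  # must be first, and only once
    for chunk in audio_chunks:
        yield {"audio_content": chunk}

messages = list(request_stream(
    {"interim_results": True},
    [b"\x00\x01", b"\x02\x03"],
))
```

In a real client, a generator like this would be passed to the StreamingRecognize call, which consumes the request messages lazily as the stream progresses.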
-
-
message
StreamingRecognizeResponse
-
StreamingRecognitionResult
results
(repeated) This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently, one result is returned, where each result can have multiple alternatives.
-
-
message
WordInfo
Word-specific information for recognized words.
-
int32
start_time
Time offset, in milliseconds, relative to the beginning of the audio and corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.
-
int32
end_time
Time offset, in milliseconds, relative to the beginning of the audio and corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.
-
string
word
The word corresponding to this set of information.
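Since start_time and end_time are reported in milliseconds, a client that wants second-based timestamps (for subtitle generation, say) has to convert. A sketch with dicts standing in for WordInfo:

```python
def word_timeline(words):
    """Convert WordInfo-style millisecond offsets to (word, start_s, end_s)."""
    return [(w["word"], w["start_time"] / 1000.0, w["end_time"] / 1000.0)
            for w in words]

timeline = word_timeline([
    {"word": "hello", "start_time": 120, "end_time": 480},
    {"word": "world", "start_time": 520, "end_time": 900},
])
```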
-