gRPC & Protocol Buffers
src/jarvis_proto/jarvis_asr.proto
-
service
JarvisASR
The JarvisASR service provides two mechanisms for converting speech to text.
-
rpc RecognizeResponse Recognize(RecognizeRequest)
Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned.
-
rpc stream StreamingRecognizeResponse StreamingRecognize(stream StreamingRecognizeRequest)
StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client.
-
-
message
RecognitionConfig
Provides information to the recognizer that specifies how to process the request
-
nvidia.jarvis.AudioEncoding encoding
The encoding of the audio data sent in the request.
All encodings support only 1 channel (mono) audio.
-
int32
sample_rate_hertz
Sample rate in Hertz of the audio data sent in all RecognizeAudio messages.
-
string
language_code
Required. The language of the supplied audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: “en-US”. Currently only en-US is supported.
-
int32
max_alternatives
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than max_alternatives. If omitted, at most one is returned.
-
int32
audio_channel_count
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR_PCM and FLAC are 1-8. Valid values for OGG_OPUS are 1-254. The only valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is 1. If 0 or omitted, defaults to one channel (mono). Note: Only the first channel is recognized by default. To perform independent recognition on each channel, set enable_separate_recognition_per_channel to true.
-
bool
enable_word_time_offsets
If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false.
-
bool
enable_automatic_punctuation
If ‘true’, adds punctuation to recognition result hypotheses. The default ‘false’ value does not add punctuation to result hypotheses.
-
bool
enable_separate_recognition_per_channel
Must be explicitly set to true, together with audio_channel_count > 1, for each channel to be recognized separately. The recognition result will contain a channel_tag field stating which channel the result belongs to. If this is not true, only the first channel is recognized. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio.
-
string
model
Which model to select for the given request. Valid choices: Jasper, Quartznet
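The interaction between audio_channel_count, enable_separate_recognition_per_channel and max_alternatives can be sketched in plain Python. RecognitionConfigSketch and the helper functions below are hypothetical stand-ins written for illustration; they are not the generated protobuf classes and only mirror the defaults described above.

```python
from dataclasses import dataclass


@dataclass
class RecognitionConfigSketch:
    """Hypothetical stand-in for the RecognitionConfig fields used below."""
    audio_channel_count: int = 0   # 0 or omitted means one channel (mono)
    max_alternatives: int = 0      # 0 or omitted means at most one alternative
    enable_separate_recognition_per_channel: bool = False


def channels_recognized(cfg: RecognitionConfigSketch) -> int:
    """Number of channels that would be transcribed for this config."""
    count = cfg.audio_channel_count or 1
    if cfg.enable_separate_recognition_per_channel and count > 1:
        return count
    return 1  # only the first channel is recognized by default


def alternatives_limit(cfg: RecognitionConfigSketch) -> int:
    """Upper bound on alternatives per result; the server may return fewer."""
    return cfg.max_alternatives or 1


def billed_seconds(cfg: RecognitionConfigSketch, audio_seconds: float) -> float:
    """Channels recognized multiplied by the length of the audio."""
    return channels_recognized(cfg) * audio_seconds
```

With four channels declared but per-channel recognition left off, channels_recognized returns 1, which matches the note that only the first channel is recognized by default.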
-
-
message
RecognizeRequest
RecognizeRequest is used for batch processing of a single audio recording.
-
RecognitionConfig
config
Provides information to the recognizer that specifies how to process the request.
-
bytes
audio
The raw audio data to be processed. The audio bytes must be encoded as specified in RecognitionConfig.
-
-
message
RecognizeResponse
The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
-
SpeechRecognitionResult
results
(repeated) Sequential list of transcription results corresponding to sequential portions of audio. Currently only returns one transcript.
-
-
message
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
-
string
transcript
Transcript text representing the words that the user spoke.
-
float
confidence
The non-normalized confidence estimate. A higher number indicates a greater estimated likelihood that the recognized words are correct. This field is set only for a non-streaming result or for a streaming result where is_final=true. This field is not guaranteed to be accurate, and users should not rely on it always being provided.
-
WordInfo
words
(repeated) A list of word-specific information for each recognized word. Only populated if is_final=true
-
-
message
SpeechRecognitionResult
A speech recognition result corresponding to the latest transcript
-
SpeechRecognitionAlternative
alternatives
(repeated) May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
-
int32
channel_tag
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’.
-
float
audio_processed
Length of audio processed so far in seconds
-
-
message
StreamingRecognitionConfig
Provides information to the recognizer that specifies how to process the request
-
RecognitionConfig
config
Provides information to the recognizer that specifies how to process the request
-
bool
interim_results
If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned.
-
-
message
StreamingRecognitionResult
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
-
SpeechRecognitionAlternative
alternatives
(repeated) May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
-
bool
is_final
If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.
-
int32
channel_tag
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from ‘1’ to ‘N’.
-
float
audio_processed
Length of audio processed so far in seconds
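A client consuming streaming results typically keeps finalized text and overwrites tentative text as new interim results arrive. The sketch below assumes each result has already been reduced to a (transcript, is_final) pair taken from its top alternative; fold_results is a hypothetical helper, not part of the API.

```python
from typing import Iterable, List, Tuple


def fold_results(results: Iterable[Tuple[str, bool]]) -> Tuple[List[str], str]:
    """Fold (transcript, is_final) pairs into finalized segments plus pending interim text."""
    finalized: List[str] = []
    pending = ""
    for transcript, is_final in results:
        if is_final:
            # No further hypotheses will arrive for this portion of audio.
            finalized.append(transcript)
            pending = ""
        else:
            # Interim text may still change, so keep only the latest version.
            pending = transcript
    return finalized, pending
```

If interim_results is false or omitted, every pair carries is_final=true and the pending string stays empty.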
-
-
message
StreamingRecognizeRequest
A StreamingRecognizeRequest is used to configure and stream audio content to the Jarvis ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.
-
StreamingRecognitionConfig
streaming_config
Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.
-
bytes
audio_content
The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio data and all subsequent StreamingRecognizeRequest messages must contain audio data. The audio bytes must be encoded as specified in RecognitionConfig.
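The ordering rule for StreamingRecognizeRequest messages (configuration first, audio only afterwards) can be sketched as a generator. StreamingRequestSketch is a hypothetical plain-Python stand-in for the generated message class, and the 3200-byte chunk size is an arbitrary assumption made for the example.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class StreamingRequestSketch:
    """Hypothetical stand-in for StreamingRecognizeRequest; exactly one field is set."""
    streaming_config: Optional[dict] = None
    audio_content: Optional[bytes] = None


def request_stream(config: dict, audio: bytes,
                   chunk_size: int = 3200) -> Iterator[StreamingRequestSketch]:
    """Yield the config-only first message, then audio-only messages in order."""
    # First message: streaming_config only, no audio.
    yield StreamingRequestSketch(streaming_config=config)
    # Every later message: sequential chunks of audio, no config.
    for start in range(0, len(audio), chunk_size):
        yield StreamingRequestSketch(audio_content=audio[start:start + chunk_size])
```

A generator fits here because gRPC client-streaming calls in Python accept an iterator of request messages, so audio can be produced lazily as it becomes available.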
-
-
message
StreamingRecognizeResponse
-
StreamingRecognitionResult
results
(repeated) This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives
-
-
message
WordInfo
Word-specific information for recognized words.
-
int32
start_time
Time offset relative to the beginning of the audio in ms and corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.
-
int32
end_time
Time offset relative to the beginning of the audio in ms and corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis.
-
string
word
The word corresponding to this set of information.
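Because start_time and end_time are integer millisecond offsets, displaying them usually involves a conversion to seconds. A minimal sketch, assuming each WordInfo has been unpacked into a (word, start_time, end_time) tuple; format_word_timings is a hypothetical helper.

```python
from typing import List, Tuple


def format_word_timings(words: List[Tuple[str, int, int]]) -> List[str]:
    """Render (word, start_ms, end_ms) tuples as 'start-end word' lines in seconds."""
    lines = []
    for word, start_ms, end_ms in words:
        lines.append(f"{start_ms / 1000:.3f}-{end_ms / 1000:.3f}s {word}")
    return lines
```

These offsets are populated only when enable_word_time_offsets=true, and only for the top hypothesis.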
-
src/jarvis_proto/jarvis_nlp.proto
-
service
JarvisNLP
Jarvis NLP Services implement task-specific APIs for popular NLP tasks including intent recognition (as well as slot filling), and entity extraction.
-
rpc TokenClassResponse AnalyzeEntities(AnalyzeEntitiesRequest)
AnalyzeEntities accepts an input string and returns all named entities within the text, as well as a category and likelihood.
-
rpc AnalyzeIntentResponse AnalyzeIntent(AnalyzeIntentRequest)
AnalyzeIntent accepts an input string and returns the most likely intent as well as slots relevant to that intent.
The model requires that a valid “domain” be passed in, and optionally supports including a previous intent classification result to provide context for the model.
-
rpc TextTransformResponse PunctuateText(TextTransformRequest)
PunctuateText takes text with no or limited punctuation and returns the same text with corrected punctuation and capitalization.
-
rpc NaturalQueryResponse NaturalQuery(NaturalQueryRequest)
NaturalQuery is a search function that enables querying one or more documents or contexts with a query that is written in natural language.
-
-
message
AnalyzeEntitiesOptions
AnalyzeEntitiesOptions is an optional configuration message to be sent as part of an AnalyzeEntitiesRequest with query metadata
-
string
lang
Optional language field. Assumed to be “en-US” if not specified.
-
string
-
message
AnalyzeEntitiesRequest
AnalyzeEntitiesRequest is the input message for the AnalyzeEntities service
-
string
query
The string to analyze for named entities
-
AnalyzeEntitiesOptions
options
Optional configuration for the request, such as the language of the query
-
string
-
message
AnalyzeIntentContext
AnalyzeIntentContext is reserved for future use, when context may be sent back in a variety of different formats (including raw neural network hidden states).
-
message
AnalyzeIntentOptions
AnalyzeIntentOptions is an optional configuration message to be sent as part of an AnalyzeIntentRequest with query metadata
-
string
previous_intent
-
AnalyzeIntentContext
vectors
-
string
domain
Optional domain field. The domain must be supported; otherwise an error is returned. If left blank, a domain detector is run first and the query is then routed to the appropriate intent classifier (if it exists).
-
string
lang
Optional language field. Assumed to be “en-US” if not specified.
-
string
-
message
AnalyzeIntentRequest
AnalyzeIntentRequest is the input message for the AnalyzeIntent service
-
string
query
The string to analyze for intent and slots
-
AnalyzeIntentOptions
options
Optional configuration for the request, including providing context from previous turns and hardcoding a domain/language
-
string
-
message
AnalyzeIntentResponse
AnalyzeIntentResponse is returned by the AnalyzeIntent service, and includes information related to the query’s intent, (optionally) slot data, and its domain.
-
Classification
intent
Intent classification result, including the label and score
-
TokenClassValue
slots
(repeated) List of tokens explicitly marked as filling a slot relevant to the intent, where the tokens may not exactly match the input (based on the recombined values after tokenization)
-
string
domain_str
Returns the inferred domain for the query if not hardcoded in the request. In the case where the domain was hardcoded in AnalyzeIntentRequest, the returned domain is an exact match to the request. In the case where no domain matches the query, intent and slots will be unset.
DEPRECATED, use Classification domain field.
-
Classification
domain
Returns the inferred domain for the query if not hardcoded in the request. In the case where the domain was hardcoded in AnalyzeIntentRequest, the returned domain is an exact match to the request. In the case where no domain matches the query, intent and slots will be unset.
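Since domain_str is deprecated in favor of the Classification domain field, client code may want to read the new field first and fall back to the old one. The sketch below assumes the Classification has been unpacked into a (class_name, score) tuple and that a response might carry only the deprecated field; both the helper name and that fallback scenario are assumptions for illustration.

```python
from typing import Optional, Tuple


def resolve_domain(domain: Optional[Tuple[str, float]],
                   domain_str: str = "") -> Optional[str]:
    """Prefer the Classification domain; fall back to the deprecated domain_str."""
    if domain is not None and domain[0]:
        return domain[0]
    # Hypothetical fallback for responses that only fill the deprecated field.
    return domain_str or None
```

A return value of None corresponds to the case where no domain matched the query, in which intent and slots are unset as well.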
-
Classification
-
message
NaturalQueryRequest
-
string
query
The natural language query
-
uint32
top_n
Maximum number of answers to return for the query. Defaults to 1 if not set.
-
string
context
Context to search with the above query
-
string
-
message
NaturalQueryResponse
-
NaturalQueryResult
results
(repeated)
-
-
message
NaturalQueryResult
-
string
answer
text which answers the query
-
float
score
Score representing confidence in result
-
src/jarvis_proto/jarvis_nlp_core.proto
-
service
JarvisCoreNLP
The Jarvis Core NLP Service provides generic NLP services for custom model use cases. The intent of this service is to allow users to design models for arbitrary use cases that conform simply with input and output types specified in the service. As an explicit example, the ClassifyText function could be used for sentiment classification, domain recognition, language identification, etc.
-
rpc TextClassResponse ClassifyText(TextClassRequest)
ClassifyText takes as input an input/query string and parameters related to the requested model to use to evaluate the text. The service evaluates the text with the requested model, and returns one or more classifications.
-
rpc TokenClassResponse ClassifyTokens(TokenClassRequest)
ClassifyTokens takes as input either a string or list of tokens and parameters related to which model to use. The service evaluates the text with the requested model, performing additional tokenization if necessary, and returns one or more class labels per token.
-
rpc TextTransformResponse TransformText(TextTransformRequest)
TransformText takes an input/query string and parameters related to the requested model and returns another string. The behavior of the function is defined entirely by the underlying model and may be used for tasks like translation, adding punctuation, augmenting the input directly, etc.
-
-
message
Classification
Classification messages return a class name and corresponding score
-
string
class_name
-
float
score
-
string
-
message
ClassificationResult
ClassificationResults contain zero or more Classification messages. If the number of Classifications is > 1, top_n > 1 must have been specified.
-
Classification
labels
(repeated)
-
Classification
-
message
NLPModelParams
NLPModelParams is a metadata message that is included in every request message used by the Core NLP Service and is used to specify model characteristics/requirements
-
string
model_name
Requested model to use. If unavailable, the request will return an error
-
string
-
message
TextClassRequest
TextClassRequest is the input message to the ClassifyText service.
-
string
text
(repeated) Each repeated text element is handled independently for handling multiple input strings with a single request
-
uint32
top_n
Return the top N classification results for each input. 0 or 1 will return the top class, otherwise N. Note: Currently disabled.
-
NLPModelParams
model
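The top_n semantics described for this request (0 or 1 selects only the top class, larger values select N) can be illustrated on the client side. The field is noted as currently disabled, so this is only a sketch of the described behavior; top_classifications is a hypothetical helper operating on (class_name, score) tuples, and ranking by descending score is an assumption.

```python
from typing import List, Tuple


def top_classifications(scored: List[Tuple[str, float]],
                        top_n: int = 0) -> List[Tuple[str, float]]:
    """Return the highest-scoring entries; top_n of 0 or 1 yields the top class only."""
    limit = max(top_n, 1)  # 0 and 1 both mean "top class only"
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

For example, with scores for three sentiment classes, top_n=2 keeps the two best-scoring labels and drops the third.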
-
string
-
message
TextClassResponse
TextClassResponse is the return message from the ClassifyText service.
-
ClassificationResult
results
(repeated)
-
ClassificationResult
-
message
TextTransformRequest
TextTransformRequest is a request type intended for services like TransformText which take an arbitrary text input
-
string
text
(repeated) Each repeated text element is handled independently for handling multiple input strings with a single request
-
uint32
top_n
-
NLPModelParams
model
-
string
-
message
TextTransformResponse
TextTransformResponse is returned by the TransformText method. Responses are returned in the same order as they were requested.
-
string
text
(repeated)
-
string
-
message
TokenClassRequest
TokenClassRequest is the input message to the ClassifyTokens service.
-
string
text
(repeated) Each repeated text element is handled independently for handling multiple input strings with a single request
-
uint32
top_n
Return the top N classification results for each input. 0 or 1 will return the top class, otherwise N. Note: Currently disabled.
-
NLPModelParams
model
-
string
-
message
TokenClassResponse
TokenClassResponse returns a single TokenClassSequence per input request
-
TokenClassSequence
results
(repeated)
-
TokenClassSequence
-
message
TokenClassSequence
TokenClassSequence is used for returning a sequence of TokenClassValue objects in the original order of input tokens
-
TokenClassValue
results
(repeated)
-
TokenClassValue
-
message
TokenClassValue
TokenClassValue is used to correlate an input token with its classification results
-
string
token
-
Classification
label
(repeated)
-
src/jarvis_proto/jarvis_tts.proto
-
service
JarvisTTS
-
rpc SynthesizeSpeechResponse Synthesize(SynthesizeSpeechRequest)
Used to request text-to-speech from the service. Submit a request containing the desired text and configuration, and receive audio bytes in the requested format.
-
rpc stream SynthesizeSpeechResponse SynthesizeOnline(SynthesizeSpeechRequest)
Used to request text-to-speech returned via a stream as it becomes available. Submit a SynthesizeSpeechRequest with the desired text and configuration, and receive a stream of bytes in the requested format.
-
-
message
SynthesizeSpeechRequest
-
string
text
-
string
language_code
-
nvidia.jarvis.AudioEncoding encoding
audio encoding params
-
int32
sample_rate_hz
-
string
voice_name
voice params
-
string
-
message
SynthesizeSpeechResponse
-
bytes
audio
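When the audio bytes arrive in several SynthesizeSpeechResponse messages, a client may want to assemble them into a playable file. The sketch below uses Python's standard wave module and assumes the request asked for LINEAR_PCM, that the returned bytes are headerless 16-bit samples, and that the audio is mono; none of these assumptions are stated by the service description, and chunks_to_wav is a hypothetical helper.

```python
import io
import wave
from typing import Iterable


def chunks_to_wav(chunks: Iterable[bytes], sample_rate_hz: int) -> bytes:
    """Wrap raw 16-bit mono PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # assumption: mono output
        wav.setsampwidth(2)        # assumption: 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate_hz)
        for chunk in chunks:
            wav.writeframes(chunk)  # append each chunk in arrival order
    return buffer.getvalue()
```

The sample_rate_hz passed here should match the value sent in the SynthesizeSpeechRequest, otherwise playback speed will be wrong.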
-
src/jarvis_proto/audio.proto
-
enum AudioEncoding
AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.
-
enumerator
ENCODING_UNSPECIFIED
= 0 Not specified.
-
enumerator
LINEAR_PCM
= 1 Uncompressed 16-bit signed little-endian samples (Linear PCM).
-
enumerator
FLAC
= 2 FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless, so recognition is not compromised, and it requires only about half the bandwidth of LINEAR_PCM. FLAC stream encoding supports 16-bit and 24-bit samples; however, not all fields in STREAMINFO are supported.
-
enumerator
MULAW
= 3 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law.
-
enumerator
ALAW
= 20 8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law.
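The enumerator values above can be mirrored in client code, for instance to estimate how long a buffer of raw audio is. The IntEnum below copies the names and numbers listed here; the bytes-per-sample table and duration_seconds are hypothetical helpers that assume headerless, mono audio, and FLAC is left out because its compressed size does not map directly to duration.

```python
from enum import IntEnum


class AudioEncoding(IntEnum):
    """Plain-Python mirror of the AudioEncoding enumerators listed above."""
    ENCODING_UNSPECIFIED = 0
    LINEAR_PCM = 1
    FLAC = 2
    MULAW = 3
    ALAW = 20


# Bytes per sample for the fixed-width encodings (16-bit PCM, 8-bit companded).
BYTES_PER_SAMPLE = {
    AudioEncoding.LINEAR_PCM: 2,
    AudioEncoding.MULAW: 1,
    AudioEncoding.ALAW: 1,
}


def duration_seconds(num_bytes: int, encoding: AudioEncoding,
                     sample_rate_hertz: int) -> float:
    """Duration of headerless mono audio with a fixed sample width."""
    width = BYTES_PER_SAMPLE.get(encoding)
    if width is None:
        raise ValueError(f"no fixed sample width for {encoding.name}")
    return num_bytes / (width * sample_rate_hertz)
```

For 16 kHz LINEAR_PCM, one second of mono audio occupies 32000 bytes, which is a convenient sanity check when sizing streaming chunks.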
-
src/jarvis_proto/health.proto
-
service
Health
-
rpc HealthCheckResponse Check(HealthCheckRequest)
-
rpc stream HealthCheckResponse Watch(HealthCheckRequest)
-
-
message
HealthCheckRequest
-
string
service
-
string
-
message
HealthCheckResponse
-
HealthCheckResponse.ServingStatus status
-
-
enum HealthCheckResponse.ServingStatus
-
enumerator
UNKNOWN
= 0
-
enumerator
SERVING
= 1
-
enumerator
NOT_SERVING
= 2