Protocol Documentation

ace_agent.proto

APIStatusResponse

Generic API status response message

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| response_msg | string | | response message |
| status | APIStatus | | API response status code as defined in `APIStatus` |

ASRResult

ASR Result

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| results | StreamingRecognitionResult | | Complete ASR response in the Riva Skills ASR result schema |
| latency_ms | float | | ASR latency in milliseconds |
| start_time | string | | start time in ISO 8601 format, e.g. 2024-03-08T13:33:30.736Z |
| stop_time | string | | stop time in ISO 8601 format |

ChatEngineResponse

Chat Engine Result json

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| result | string | | chat engine result |
| latency_ms | float | | Chat Engine latency in milliseconds |

ChatRequest

Request message for Chat API which will be sent to chat engine

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| bot_name | string | | bot name with version, like {bot_name}_v{bot_version}, e.g. chitchat_bot_v1 |
| query | string | | query |
| query_id | string | | unique id for identifying the query |
| user_id | string | | user id |
| source_language | string | | The language of the supplied query string as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". |
| target_language | string | | The language of the response required from the chat engine as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". |
| is_standalone | bool | | Flag for standalone text requests: when set to true, the response is not sent to TTS; when set to false, the response is sent to TTS |
| user_context | ChatRequest.UserContextEntry | repeated | key-value pairs for user context to be sent to the chat engine |
| metadata | ChatRequest.MetadataEntry | repeated | key-value pairs for metadata to be sent to the chat engine |
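
Since `ChatRequest` carries everything needed for a text-in query, a minimal client sketch for the server-streaming Chat RPC (documented in the service section below) might look as follows. The generated module names (`ace_agent_pb2`, `ace_agent_pb2_grpc`), stub class name, server address, and bot name are assumptions, not taken from this document.

```python
# Hypothetical usage sketch; module names, stub class, address, and bot name are assumptions.
import uuid
import grpc
import ace_agent_pb2 as pb            # assumed module generated from ace_agent.proto
import ace_agent_pb2_grpc as pb_grpc  # assumed gRPC stub module

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))  # assumed address
stream_id = "my-stream-id"  # must match the stream_id used when the pipeline was created

request = pb.ChatRequest(
    stream_id=stream_id,
    bot_name="chitchat_bot_v1",        # example bot name from this document
    query="What can you do?",
    query_id=str(uuid.uuid4()),
    user_id="user-1",
    source_language="en-US",
    target_language="en-US",
    is_standalone=True,                # text-only: the response is not sent to TTS
)

# Chat is a server-streaming RPC; responses arrive until is_final is True.
for response in stub.Chat(request):
    print(response.text)
    if response.is_final:
        break
```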

ChatRequest.MetadataEntry

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

ChatRequest.UserContextEntry

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

ChatResponse

Response message from chat engine for Chat API invocation

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| query | string | | query |
| query_id | string | | unique id for identifying the query |
| user_id | string | | user id |
| session_id | string | | session id, if generated by the chat engine |
| text | string | | chat engine response for the query passed in `ChatRequest` |
| cleaned_text | string | | chat engine response text cleaned up by removing markdown language tags |
| is_final | bool | | flag to indicate whether this is a final or an intermediate response; when true, there will be no more responses for the requested `ChatRequest` |
| json_response | string | | chat engine response in JSON format |

ConversationHistory

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| bot_name | string | | bot name with version, like {bot_name}_v{bot_version}, e.g. chitchat_bot_v1 |
| conversation | ConversationInstance | repeated | |

ConversationInstance

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| role | Role | | |
| content | string | | |

EventRequest

Request message for Event API which will be sent to chat engine

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| bot_name | string | | bot name with version, like {bot_name}_v{bot_version}, e.g. chitchat_bot_v1 |
| event_type | string | | event type |
| event_id | string | | unique event id |
| user_id | string | | user id |
| user_context | EventRequest.UserContextEntry | repeated | key-value pairs for user context to be sent to the chat engine |
| metadata | EventRequest.MetadataEntry | repeated | key-value pairs for metadata to be sent to the chat engine |
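
The Event RPC (see the service section below) takes an `EventRequest` built from these fields and streams back `EventResponse` messages until `is_final` is set. A brief sketch under the same assumptions as the Chat example above; the generated module names, stub, address, and the event_type value are hypothetical.

```python
# Hypothetical sketch; module names, stub, address, and event_type are assumptions.
import uuid
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))

request = pb.EventRequest(
    stream_id="my-stream-id",
    bot_name="chitchat_bot_v1",
    event_type="custom_event",     # assumed event name
    event_id=str(uuid.uuid4()),
    user_id="user-1",
)

# Event is a server-streaming RPC; EventResponse messages arrive until is_final is True.
for response in stub.Event(request):
    if response.text:
        print(response.text)
    if response.is_final:
        break
```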

EventRequest.MetadataEntry

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

EventRequest.UserContextEntry

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| key | string | | |
| value | string | | |

EventResponse

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| event_type | string | | event type |
| event_id | string | | unique event id |
| user_id | string | | user id |
| text | string | | text response |
| cleaned_text | string | | |
| is_final | bool | | |
| json_response | string | | |
| events | string | repeated | |

GetStatusRequest

GetStatusRequest is used to get the Chat controller pipeline status on demand

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |

GetStatusResponse

Chat controller pipeline status response

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| pipeline_state | PipelineStateResponse | | |

PipelineRequest

PipelineRequest is used to create or free the pipeline specified by stream_id

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | A unique id sent by the client to identify the client connection. It is mapped to a unique pipeline on the Chat Controller server. |
| user_id | string | | user id |

PipelineStateResponse

Chat controller pipeline state response

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| state | PipelineState | | |

ReceiveAudioRequest

ReceiveAudioRequest is used to request audio data for the specified stream_id.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |

ReceiveAudioResponse

Receive Audio API Response

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| audio_content | bytes | | synthesized audio data |
| encoding | AudioEncoding | | The encoding of the audio data |
| sample_rate_hertz | int32 | | The sample rate in hertz (Hz) of the audio data |
| audio_channel_count | int32 | | The number of channels in the audio data. Only mono is supported |
| frame_size | int32 | | frame size of the audio data |
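
Because ReceiveAudio (see the service section below) is a server-side streaming RPC, a client iterates over `ReceiveAudioResponse` messages and concatenates `audio_content`. The sketch below writes the stream to a WAV file; the generated module names, the address, and the assumption that audio arrives as 16-bit LINEAR_PCM are not confirmed by this document.

```python
# Hypothetical sketch; module names, address, and 16-bit LINEAR_PCM assumption are not from this document.
import wave
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

def save_received_audio(stub, stream_id, path="tts_output.wav"):
    responses = stub.ReceiveAudio(pb.ReceiveAudioRequest(stream_id=stream_id))
    wav = None
    for resp in responses:
        if wav is None:
            # Open the output file once the first chunk reports the audio parameters.
            wav = wave.open(path, "wb")
            wav.setnchannels(resp.audio_channel_count)  # only mono is supported
            wav.setsampwidth(2)                         # assumes 16-bit samples (LINEAR_PCM)
            wav.setframerate(resp.sample_rate_hertz)
        wav.writeframes(resp.audio_content)
    if wav is not None:
        wav.close()

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))  # assumed address
save_received_audio(stub, "my-stream-id")
```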

ReloadSpeechConfigsRequest

Reload Speech Configs Request

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |

SendAudioRequest

The SendAudioRequest is used to send either a StreamingRecognitionConfig message or audio content. The first SendAudioRequest message must contain a StreamingRecognitionConfig message, followed by the audio content messages.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| streaming_config | StreamingRecognitionConfig | | Provides information to the recognizer that specifies how to process the request. The first `SendAudioRequest` message must contain a `streaming_config` message. |
| audio_content | bytes | | The audio data to be recognized. Sequential chunks of audio data are streamed from the client. |
| source_id | string | | source id of the audio data |
| create_time | string | | audio buffer creation timestamp in ISO 8601 format |
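
The description above fixes the wire protocol: the first `SendAudioRequest` carries `streaming_config`, and every following message carries a chunk of `audio_content`. Below is a request-generator sketch for the client-streaming SendAudio RPC, using the same assumed generated modules as the earlier examples and an assumed 16 kHz mono LINEAR_PCM source.

```python
# Hypothetical sketch; generated module names, address, and audio parameters are assumptions.
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

def send_audio_requests(stream_id, pcm_chunks):
    # First message: the StreamingRecognitionConfig describing the audio that follows.
    yield pb.SendAudioRequest(
        stream_id=stream_id,
        streaming_config=pb.StreamingRecognitionConfig(
            encoding=pb.AudioEncoding.LINEAR_PCM,  # AudioEncoding value defined later in this file
            sample_rate_hertz=16000,               # assumed sample rate
            language_code="en-US",
            audio_channel_count=1,                 # all encodings support only mono audio
        ),
    )
    # Subsequent messages: sequential chunks of raw audio bytes.
    for chunk in pcm_chunks:
        yield pb.SendAudioRequest(stream_id=stream_id, audio_content=chunk)

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))  # assumed address
pcm_chunks = [b"\x00\x00" * 1600] * 10  # placeholder audio chunks
# SendAudio is a client-streaming RPC that returns a single APIStatusResponse.
status = stub.SendAudio(send_audio_requests("my-stream-id", pcm_chunks))
```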

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| transcript | string | | Transcript text representing the words that the user spoke. |
| confidence | float | | The non-normalized confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for a non-streaming result or for a streaming result where `is_final=true`. This field is not guaranteed to be accurate and users should not rely on it to be always provided. |
| words | WordInfo | repeated | A list of word-specific information for each recognized word. Only populated if `is_final=true`. |

SpeechRecognitionControlRequest

SpeechRecognitionControlRequest is used for controlling input to ASR by internally muting ASR. It is also used to disable the DM-TTS flow for the incoming ASR input.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| is_standalone | bool | | Flag indicating whether ASR transcripts should be passed to the DM-TTS flow or only transcripts should be returned |

StreamingRecognitionConfig

Provides information to the ASR recognizer about incoming audio data

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| encoding | AudioEncoding | | The encoding of the audio data sent in the request. All encodings support only 1 channel (mono) audio. |
| sample_rate_hertz | int32 | | The sample rate in hertz (Hz) of the audio data sent in the `SendAudioRequest` message. |
| language_code | string | | The language of the supplied audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". Default is en-US. |
| audio_channel_count | int32 | | The number of channels in the input audio data. |
| model | string | | Which model to select for the given request. |

StreamingRecognitionResult

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| alternatives | SpeechRecognitionAlternative | repeated | May contain one or more recognition hypotheses (up to the maximum specified in `max_alternatives`). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| is_final | bool | | If `false`, this `StreamingRecognitionResult` represents an interim result that may change. If `true`, this is the final time the speech service will return this particular `StreamingRecognitionResult`; the recognizer will not return any further hypotheses for this portion of the transcript and the corresponding audio. |
| stability | float | | An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (`is_final=false`). The default of 0.0 is a sentinel value indicating `stability` was not set. |
| channel_tag | int32 | | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'. |
| audio_processed | float | | Length of audio processed so far, in seconds |

StreamingSpeechResultsRequest

StreamingSpeechResultsRequest is used to request various results from the Chat controller.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| request_id | string | | UUID to identify a concurrent client request |

StreamingSpeechResultsResponse

Chat controller Metadata streaming response

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| message_type | MessageType | | message type as defined in `MessageType` |
| asr_result | ASRResult | | |
| chat_engine_response | ChatEngineResponse | | |
| tts_result | TTSResult | | |
| pipeline_state | PipelineStateResponse | | |
| display_text | string | | |
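
Each `StreamingSpeechResultsResponse` carries `message_type` plus the matching payload field, so a consumer of the server-streaming StreamSpeechResults RPC typically dispatches on `MessageType` (defined later in this file). A sketch under the same assumed generated module names and address as the earlier examples:

```python
# Hypothetical sketch; generated module names and address are assumptions.
import uuid
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))
request = pb.StreamingSpeechResultsRequest(stream_id="my-stream-id",
                                           request_id=str(uuid.uuid4()))

for resp in stub.StreamSpeechResults(request):
    if resp.message_type == pb.MessageType.ASR_RESPONSE:
        result = resp.asr_result.results          # StreamingRecognitionResult
        if result.alternatives:
            print("ASR:", result.alternatives[0].transcript,
                  "(final)" if result.is_final else "(interim)")
    elif resp.message_type == pb.MessageType.CHAT_ENGINE_RESPONSE:
        print("Chat engine:", resp.chat_engine_response.result)
    elif resp.message_type == pb.MessageType.TTS_RESPONSE:
        print("TTS latency (ms):", resp.tts_result.latency_ms)
    elif resp.message_type == pb.MessageType.PIPELINE_STATE_RESPONSE:
        print("Pipeline state:", resp.pipeline_state.state)
    elif resp.message_type == pb.MessageType.DISPLAY_TEXT:
        print("Display text:", resp.display_text)
```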

SynthesizeSpeechRequest

Request message for standalone TTS synthesis of provided text transcript

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| transcript | string | | transcript text to be synthesized |

TTSResult

TTS result metadata

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| latency_ms | float | | TTS latency in milliseconds |
| time_till_eos_ms | int32 | | Time in milliseconds remaining to complete TTS audio rendering. This is applicable when TTS is set to streaming and realtime in the pipeline graph. In non-streaming mode this is expected to be 0. |

UserContext

UserContext data containing user specific information.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| user_id | string | | user id |
| bot_name | string | | bot name with version, like {bot_name}_v{bot_version}, e.g. chitchat_bot_v1 |
| conversation_history | ConversationHistory | repeated | conversation history of the user |
| context_json | string | | JSON-formatted user context data |
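
The SetUserContext RPC (see the service section below) accepts a `UserContext` message directly, so seeding a conversation history amounts to filling these fields. A sketch using the `Role` values defined later in this file; the generated module names, address, and example content are assumptions.

```python
# Hypothetical sketch; module names, address, and example content are assumptions.
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))

history = pb.ConversationHistory(
    bot_name="chitchat_bot_v1",
    conversation=[
        pb.ConversationInstance(role=pb.Role.USER, content="Hello!"),
        pb.ConversationInstance(role=pb.Role.BOT, content="Hi! How can I help you today?"),
    ],
)

context = pb.UserContext(
    stream_id="my-stream-id",
    user_id="user-1",
    bot_name="chitchat_bot_v1",
    conversation_history=[history],              # repeated ConversationHistory
    context_json='{"preferred_name": "Alex"}',   # assumed example context payload
)

status = stub.SetUserContext(context)            # returns an APIStatusResponse
```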

UserContextRequest

UserContextRequest is used to request the user context

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| user_id | string | | user id |

UserParametersRequest

UserParametersRequest is used to set user parameters

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| stream_id | string | | unique id to identify the client connection |
| user_id | string | | user id |
| bot_name | string | | bot name with version, like {bot_name}_v{bot_version}, e.g. chitchat_bot_v1 |

WordInfo

Word-specific information for recognized words.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| start_time | int32 | | Time offset relative to the beginning of the audio in ms and corresponding to the start of the spoken word. This field is only set if `enable_word_time_offsets=true` and only in the top hypothesis. |
| end_time | int32 | | Time offset relative to the beginning of the audio in ms and corresponding to the end of the spoken word. This field is only set if `enable_word_time_offsets=true` and only in the top hypothesis. |
| word | string | | The word corresponding to this set of information. |
| confidence | float | | The non-normalized confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set. |

APIStatus

Generic Chat controller API status

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN_STATUS | 0 | |
| SUCCESS | 1 | |
| PIPELINE_AVAILABLE | 2 | |
| PIPELINE_NOT_AVAILABLE | 3 | |
| BUSY | 4 | |
| ERROR | 5 | |
| INFO | 6 | |
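
Most unary RPCs in this service return an `APIStatusResponse`, so callers typically compare its `status` field against this enum. A small sketch; the generated module names and address are assumptions.

```python
# Hypothetical sketch; generated module names and address are assumptions.
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

stub = pb_grpc.AceAgentGrpcStub(grpc.insecure_channel("localhost:50055"))

resp = stub.CreatePipeline(pb.PipelineRequest(stream_id="my-stream-id", user_id="user-1"))
if resp.status != pb.APIStatus.SUCCESS:
    # APIStatus.Name() converts the numeric code back to its symbolic name.
    raise RuntimeError(
        f"CreatePipeline failed: {pb.APIStatus.Name(resp.status)}: {resp.response_msg}")
```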

AudioEncoding

AudioEncoding specifies the encoding of the audio bytes in the encapsulating message.

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN | 0 | Not specified. |
| LINEAR_PCM | 1 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
| FLAC | 2 | `FLAC` (Free Lossless Audio Codec) is the recommended encoding because it is lossless (recognition is therefore not compromised) and requires only about half the bandwidth of `LINEAR_PCM`. `FLAC` stream encoding supports 16-bit and 24-bit samples; however, not all fields in `STREAMINFO` are supported. |
| MULAW | 3 | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
| ALAW | 5 | 8-bit samples that compand 13-bit audio samples using G.711 PCMA/A-law. |

MessageType

Message type field for Chat controller metadata streaming

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN_RESPONSE | 0 | |
| ASR_RESPONSE | 1 | |
| CHAT_ENGINE_RESPONSE | 2 | |
| TTS_RESPONSE | 3 | |
| PIPELINE_STATE_RESPONSE | 4 | |
| DISPLAY_TEXT | 5 | |

PipelineState

Chat controller Pipeline States

| Name | Number | Description |
| ---- | ------ | ----------- |
| INIT | 0 | |
| IDLE | 1 | |
| WAIT_FOR_TRIGGER | 2 | |
| ASR_ACTIVE | 3 | |
| DM_ACTIVE | 4 | |
| TTS_ACTIVE | 5 | |

Role

Used in storing conversation history for user and bot

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNDEFINED | 0 | |
| USER | 1 | |
| BOT | 2 | |
| SYSTEM | 3 | |

AceAgentGrpc

The AceAgentGrpc service provides APIs to interact with the chat engine and speech components.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| CreatePipeline | PipelineRequest | APIStatusResponse | CreatePipeline API is used to create a new pipeline with the Chat controller. It creates a Chat controller pipeline with a unique stream_id populated by the client in PipelineRequest. |
| FreePipeline | PipelineRequest | APIStatusResponse | FreePipeline API is used to free up a pipeline with the Chat controller created using the CreatePipeline API. The client needs to pass the same stream_id in PipelineRequest as used in CreatePipeline. |
| SendAudio | SendAudioRequest stream | APIStatusResponse | SendAudio API is used to stream audio content to ASR via the Chat controller. This is a client-side streaming API. |
| ReceiveAudio | ReceiveAudioRequest | ReceiveAudioResponse stream | ReceiveAudio API is used to receive synthesized audio from TTS through the Chat controller. This is a server-side streaming API. |
| StreamSpeechResults | StreamingSpeechResultsRequest | StreamingSpeechResultsResponse stream | StreamSpeechResults API is used to receive all the metadata from the Chat controller, such as ASR transcripts, Chat engine responses, and pipeline states. This is a broadcasting API, i.e. it can fan out responses to multiple concurrent client instances using the same stream_id. This is a server-side streaming API. |
| StartRecognition | SpeechRecognitionControlRequest | APIStatusResponse | StartRecognition API is used to start ASR recognition in the Chat controller for the audio content streamed via the SendAudio API. This API also provides a flag to mark the ASR recognition as standalone, i.e. the Chat Engine and TTS will not be invoked for the ASR transcript. |
| StopRecognition | SpeechRecognitionControlRequest | APIStatusResponse | StopRecognition API is used to stop ASR recognition for the audio content streamed via the SendAudio API. |
| SetUserParameters | UserParametersRequest | APIStatusResponse | SetUserParameters API can be used to set runtime user parameters, like user_id, on the Chat controller pipeline. |
| GetStatus | GetStatusRequest | GetStatusResponse | GetStatus API can be used to get the latest state of the Chat controller pipeline. This API is not valid if UMIM is enabled. |
| ReloadSpeechConfigs | ReloadSpeechConfigsRequest | APIStatusResponse | ReloadSpeechConfigs API can be used to reload the ASR word boosting and TTS Arpabet configs in the Chat controller. |
| SynthesizeSpeech | SynthesizeSpeechRequest | APIStatusResponse | SynthesizeSpeech API is used to send a text transcript directly to TTS for standalone TTS audio synthesis. The generated audio will be routed to the path specified in the pipeline graph provided in the Chat controller. For example, if the TTS audio is routed to A2F in the graph, the audio will be sent to the A2F server; if the TTS audio is routed to the gRPC client, it will be available through the server-side streaming ReceiveAudio API. |
| GetUserContext | UserContextRequest | UserContext | GetUserContext API is used to get the current user context from the Chat Engine. The API returns a UserContext message containing the current conversation history and any context attached to the active user_id. This API is not valid if UMIM is enabled. |
| SetUserContext | UserContext | APIStatusResponse | SetUserContext API is used to set the current user context in the Chat Engine. The API accepts a UserContext message containing the conversation history and any context to be attached to the active user_id. This API is not valid if UMIM is enabled. |
| UpdateUserContext | UserContext | APIStatusResponse | UpdateUserContext API is used to update the current user context in the Chat Engine. The API accepts a UserContext message containing any context to be attached to the active user_id. This API is not valid if UMIM is enabled. |
| DeleteUserContext | UserContextRequest | APIStatusResponse | DeleteUserContext API is used to delete the current user context attached to a user_id in the Chat Engine. This API is not valid if UMIM is enabled. |
| Chat | ChatRequest | ChatResponse stream | Chat API is used to send text queries to the Chat Engine via the Chat controller. This API also provides a flag to disable TTS synthesis for the response generated by the Chat Engine, which can be used for a text-in, text-out type of scenario. This API is not valid if UMIM is enabled. |
| Event | EventRequest | EventResponse stream | Event API is used to send events to the Chat Engine via the Chat controller. This API is not valid if UMIM is enabled. |
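
Putting the methods above together, the call sequence implied by their descriptions is: create a pipeline, start recognition, stream audio while consuming speech results, then stop recognition and free the pipeline. A minimal end-to-end sketch follows; the generated module names, stub class, server address, and any ordering details beyond what the descriptions state are assumptions.

```python
# Hypothetical end-to-end sketch; module names, stub class, and address are assumptions.
import uuid
import grpc
import ace_agent_pb2 as pb
import ace_agent_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("localhost:50055")   # assumed Chat controller address
stub = pb_grpc.AceAgentGrpcStub(channel)             # stub name follows the AceAgentGrpc service

stream_id = str(uuid.uuid4())

# 1. Create a Chat controller pipeline bound to this stream_id.
stub.CreatePipeline(pb.PipelineRequest(stream_id=stream_id, user_id="user-1"))

# 2. Start ASR recognition for the audio that will be streamed via SendAudio.
stub.StartRecognition(pb.SpeechRecognitionControlRequest(stream_id=stream_id))

# 3. Stream audio (see the SendAudio sketch earlier) and consume StreamSpeechResults /
#    ReceiveAudio on separate threads or async tasks while the conversation runs.

# 4. Stop recognition and release the pipeline when the session ends.
stub.StopRecognition(pb.SpeechRecognitionControlRequest(stream_id=stream_id))
stub.FreePipeline(pb.PipelineRequest(stream_id=stream_id, user_id="user-1"))
channel.close()
```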

Scalar Value Types

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| ----------- | ----- | --- | ---- | ------ | -- | -- | --- | ---- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |