Speech#

This section defines events and actions related to dialog management using the speech modality. Both the bot and the user can use this modality. We distinguish between BotSpeech and UserSpeech to refer to the respective modality.

Utterance User Action#

The user makes an utterance that is recognized by the interactive system. Examples of this action include the user typing into a text interface to interact with the bot or the user speaking to an interactive avatar.

UtteranceUserActionStarted()#

The user started to produce an utterance. For example, the user could have started talking or typing.

Implementation guidance

  • This event should be sent out as soon as the system is able to detect the start of a user utterance. In an interactive system that supports voice activity detection (VAD) this should be sent out as soon as we detect voice activity.

  • action_started_at : The timestamp should match the time the utterance started as closely as possible. For most systems, voice activity detection will introduce a small delay. However, the timestamp action_started_at should represent the moment in time the user started talking/typing, not the timestamp of when this event was created (for this purpose there is a separate field event_created_at in the payload).

Parameters:

... – Additional parameters/payload inherited from UserActionStarted().
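
For illustration, a serialized UtteranceUserActionStarted event might look like the sketch below. Only action_started_at and event_created_at are taken from this specification; the remaining field names (type, action_uid) and the dict-based encoding are assumptions made for this example.

```python
from datetime import datetime, timezone
from uuid import uuid4

# Hypothetical serialized payload; field names other than action_started_at
# and event_created_at are illustrative assumptions, not part of this spec.
utterance_user_action_started = {
    "type": "UtteranceUserActionStarted",
    "action_uid": str(uuid4()),
    # The moment the user actually started talking/typing (e.g., the VAD onset).
    "action_started_at": "2024-01-01T12:00:00.000Z",
    # The moment this event object was created; may lag action_started_at slightly.
    "event_created_at": datetime.now(timezone.utc).isoformat(),
}
```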

UtteranceUserActionActivityUpdated(activity: float)#

Whenever the interactive system detects a change in the user's utterance activity, it may send out this event. Utterance activity can relate to different events in an interactive system. For a chatbot, the activity can relate to the typing speed of the user, whereas in a voice-enabled system, activity reflects the user's voice activity. Utterance activity can typically be detected much faster than the end of an utterance. This event allows interaction designers to react to brief periods of no activity (e.g., silence for a voice bot) during a user utterance.

Implementation guidance

  • action_updated_at : The timestamp should match the time the user changed the utterance activity (e.g., when they became silent) as closely as possible. For most systems, activity detection will introduce a small delay. However, the timestamp action_updated_at should represent the moment in time the user changed activity, not the timestamp of when this event was created (for this purpose there is a separate field event_created_at in the payload).

Parameters:
  • activity (float) – Float between 0-1. Represents the user’s current utterance activity. An activity of 0 corresponds to no activity by the user. In a chatbot system for example this can mean ‘no typing’ whereas in a voice bot system this would correspond to “silence”. An activity of 1.0 represents the maximum activity during a user utterance that the system can detect. Many systems might only support the boundary values 0 (no activity, e.g., silence) and 1.0 (activity, e.g., user is talking).

  • ... – Additional parameters/payload inherited from UserActionUpdated().
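
A minimal sketch of how a voice-enabled system might map voice activity detection (VAD) output onto activity updates. The emit callback and the VAD frame representation are assumptions; only the activity semantics (0.0 = silence, 1.0 = activity) come from the spec.

```python
from typing import Callable, Dict, Iterable, Tuple

def forward_vad_activity(
    vad_frames: Iterable[Tuple[str, bool]],   # (timestamp, is_speech) pairs from a VAD component (assumed)
    emit: Callable[[Dict], None],             # sends a UMIM event to the event bus (assumed)
) -> None:
    """Emit UtteranceUserActionActivityUpdated only when the activity level changes."""
    last_activity = None
    for timestamp, is_speech in vad_frames:
        activity = 1.0 if is_speech else 0.0
        if activity != last_activity:
            emit({
                "type": "UtteranceUserActionActivityUpdated",
                "activity": activity,
                # The time the user actually changed activity, not the emit time.
                "action_updated_at": timestamp,
            })
            last_activity = activity
```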

UtteranceUserActionIntensityUpdated(intensity: float)#

Provides updated speaking intensity levels if the interactive system supports it.

Parameters:
  • intensity (float) – A value from 0-1 that indicates the intensity of the utterance. A value of 0.5 means an “average” intensity. The intensity of an utterance action can correspond to different metrics depending on the interactive system. For a chatbot system the intensity could relate to the typing rate. In a speech-enabled system intensity could be computed based on the volume and pitch variation of the user’s voice.

  • ... – Additional parameters/payload inherited from UserActionUpdated().
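
As one possible intensity metric, a chatbot backend could derive the value from the user's typing rate. This is only a sketch; the normalization constant is an arbitrary assumption, and the spec only requires a value in [0, 1] with 0.5 meaning "average" intensity.

```python
def typing_intensity(chars_per_second: float, average_rate: float = 4.0) -> float:
    """Map a typing rate to an utterance intensity in [0, 1].

    average_rate is an assumed calibration constant chosen so that a typical
    typing speed maps to roughly 0.5 ("average" intensity).
    """
    intensity = 0.5 * chars_per_second / average_rate
    return max(0.0, min(1.0, intensity))
```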

UtteranceUserActionTranscriptUpdated(interim_transcript: str, stability: float | None)#

Provides updated transcripts during a UtteranceUserAction.

Parameters:
  • interim_transcript (str) – Partial transcript of the user utterance up to this point in time

  • stability (Optional[float]) – Value between 0.0 and 1.0. A stability of 1.0 means a very stable interim transcript that is not likely to be changed (e.g., due to a language model cleanup step). Low stability means that future events will likely change parts of the transcript (e.g., exchanging words based on additional context).

  • ... – Additional parameters/payload inherited from UserActionUpdated().
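
For example, an ASR stream might produce interim transcripts whose earlier words become more stable over time. The concrete transcripts and stability values below are purely illustrative.

```python
# Hypothetical sequence of UtteranceUserActionTranscriptUpdated payloads for
# the utterance "please book a flight to Berlin"; all values are illustrative.
interim_updates = [
    {"interim_transcript": "please look", "stability": 0.3},
    {"interim_transcript": "please book a flight", "stability": 0.7},
    {"interim_transcript": "please book a flight to Berlin", "stability": 0.9},
]
# The low stability of the first update signals that parts of it may still
# change: "look" was later corrected to "book" based on additional context.
```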

StopUtteranceUserAction()#

Indicates that the IM has received the information it needs and that the Action Server should consider the utterance finished as soon as possible. This could, for example, instruct the Action Server to decrease the hold time (the duration of silence in the user's speech until the end of speech is considered reached).

Parameters:

... – Additional parameters/payload inherited from StopUserAction().
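
A sketch of how an Action Server might react to this event, assuming it exposes a configurable silence hold time for end-of-speech detection; the class and attribute names are assumptions.

```python
class SpeechEndpointer:
    """Hypothetical end-of-speech detector with a configurable hold time."""

    def __init__(self, hold_time_s: float = 0.8):
        self.hold_time_s = hold_time_s

    def on_stop_utterance_user_action(self) -> None:
        # The IM already has the information it needs: shorten the silence
        # window so the utterance is considered finished as soon as possible.
        self.hold_time_s = 0.1
```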

UtteranceUserActionFinished(final_transcript: str)#

The user utterance has finished.

Implementation guidance

  • Since this event is sent out when the final transcript has been computed, the event is typically delayed compared to the actual moment in time the user utterance stopped.

  • action_finished_at : The timestamp action_finished_at should represent the moment in time the user finished talking/typing, not the timestamp of when this event was created (for this purpose there is a separate field event_created_at in the payload). Example: If an interactive system can detect both voice activity (VAD) and transcribe speech (ASR), the timestamp should correspond to the detected utterance end time from VAD and not be related to any delays that ASR processing introduces.

Parameters:
  • final_transcript (str) – Final transcript of the user utterance

  • ... – Additional parameters/payload inherited from UserActionFinished().

Implementation Guidance for Timestamps and Timing of Events#

UMIM was designed to abstract away the technical details of the interactive system and to provide a robust representation of the interaction between users and bots. For real-time systems that provide interactive responses, such as speech bots or interactive avatars, it is very important that:

  • Action events are sent out as soon as an event is detected and the required payload information has been collected.

  • Action timestamps (action_started_at, action_updated_at, action_finished_at) should represent as closely as possible the time the user action actually started, was updated, or finished. Ideally this timestamp should not include any system-dependent delays (e.g., from a component that needs time to detect it). Example: An interactive system might process audio to detect speech and transcribe it using an automatic speech recognition (ASR) model. When the user stops talking, the ASR system will introduce a small delay until the final_transcript of the UtteranceUserActionFinished() event is available. It is important that the action_finished_at timestamp corresponds to the time the user stopped talking and not to the time when the ASR processing is completed (see the sketch below). This ensures that system delays don't add up and that interaction designers can relate actions to each other (e.g., a vision-based action with a speech-based action).
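
The sketch below illustrates the second point for a voice pipeline: the finished event carries the end-of-speech time reported by VAD, not the (later) time at which ASR returns the final transcript. All component interfaces shown are assumptions.

```python
from datetime import datetime, timezone
from typing import Callable, Dict

def finish_utterance(
    vad_end_of_speech_at: str,       # wall-clock time the user stopped talking (from VAD)
    asr_final_transcript: str,       # transcript available only after ASR completes
    emit: Callable[[Dict], None],    # assumed callback that puts a UMIM event on the event bus
) -> None:
    """Emit UtteranceUserActionFinished with the VAD end time as action_finished_at."""
    emit({
        "type": "UtteranceUserActionFinished",
        "final_transcript": asr_final_transcript,
        # The time the user actually stopped talking, not when ASR completed.
        "action_finished_at": vad_end_of_speech_at,
        # Creation time of this event; may be noticeably later due to the ASR delay.
        "event_created_at": datetime.now(timezone.utc).isoformat(),
    })
```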

To better illustrate this, have a look at the following lifetime overview of a UtteranceUserAction. While the events of the action might be delayed compared to the actual moments at which a user starts talking, becomes silent, or finishes the utterance, the timestamps in the action events should always match the wall-clock time of those events as closely as possible:

../_images/utterance_user_action_timing.png

In a typical system, the signals received by the interactive system and the models that interpret them might provide the following (see the sketch after this list):

  • Voice intensity levels. Typically at regular intervals -> transformed into UtteranceUserActionIntensityUpdated events

  • Voice activity detection. Translated to UtteranceUserActionActivityUpdated events (irregular intervals; no voice activity does not mean the audio intensity is 0)

  • Transcripts. Typically at regular intervals -> transformed into UtteranceUserActionTranscriptUpdated events
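
The following sketch maps these raw signals onto the corresponding events. The signal structure ("kind", "value", "at", and so on) is an assumption made for this illustration; only the event names and their fields come from the spec.

```python
def signal_to_event(signal: dict) -> dict:
    """Map a raw system signal to the corresponding UMIM utterance event."""
    if signal["kind"] == "voice_intensity":        # typically at regular intervals
        return {
            "type": "UtteranceUserActionIntensityUpdated",
            "intensity": signal["value"],
            "action_updated_at": signal["at"],
        }
    if signal["kind"] == "voice_activity":         # irregular intervals
        return {
            "type": "UtteranceUserActionActivityUpdated",
            "activity": 1.0 if signal["is_speech"] else 0.0,
            "action_updated_at": signal["at"],
        }
    if signal["kind"] == "interim_transcript":     # typically at regular intervals
        return {
            "type": "UtteranceUserActionTranscriptUpdated",
            "interim_transcript": signal["text"],
            "stability": signal.get("stability"),
            "action_updated_at": signal["at"],
        }
    raise ValueError(f"unsupported signal kind: {signal['kind']}")
```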

To put this into perspective, here is an overview of how UtteranceUserAction events relate to system signals in an idealized voice bot setup.

../_images/utterance_user_action_timing_details.png

Action Sequence Diagrams#

In the following, we provide example sequence diagrams showing the flow of events for two typical interactive systems.

First we look at the flow in a chatbot system:

../_images/utterance_user_action_chat.png

Next, you can compare this to the flow of events in a typical interactive avatar system:

../_images/utterance_user_action_avatar.png

Utterance Bot Action#

The bot is producing an utterance (saying something) to the user. Depending on the interactive system, this can mean different things, but this action always represents verbal communication with the user through a speech-like interface (e.g., a chat interface, an actual voice interface, or brain-to-machine communication 😀).

StartUtteranceBotAction(script: str, intensity: float | None)#

The bot should start to produce an utterance. Depending on the interactive system this could be a bot sending a text message or an avatar talking to the user.

Parameters:
  • script (str) – The utterance of the bot, supporting SSML

  • intensity (Optional[float]) – A value from 0-1 that indicates the intensity of the utterance. A value of 0.5 means an “average” intensity. The intensity of an utterance action should change how the utterance is delivered to the user, based on the type of interactive system. For a chatbot system, the intensity could relate to the typing rate in the UI. In a speech-enabled system, intensity could change the volume and pitch variation of generated speech.

  • ... – Additional parameters/payload inherited from StartBotAction().
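
An illustrative StartUtteranceBotAction payload with a short SSML script. As before, the type field and the dict-based encoding are assumptions; only script and intensity come from the spec.

```python
start_utterance_bot_action = {
    "type": "StartUtteranceBotAction",
    # SSML is supported in the script; this sample inserts a short pause.
    "script": '<speak>Hello! <break time="300ms"/> How can I help you today?</speak>',
    # Slightly above-average delivery intensity (0.5 would be "average").
    "intensity": 0.6,
}
```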

UtteranceBotActionStarted()#

The bot started to produce the utterance. This event should align as closely as possible with the moment in time the user is receiving the utterance. For example, in an interactive avatar system, the event is sent out by the Action Server once the text-to-speech (TTS) stream is sent to the user.

Parameters:

... – Additional parameters/payload inherited from BotActionStarted().

ChangeUtteranceBotAction(intensity: float)#

Adjusts the intensity of the utterance while the action is already running.

Parameters:
  • intensity (float) – A value from 0-1 that indicates the intensity of the utterance. A value of 0.5 means an “average” intensity. The intensity of an utterance action should change how the utterance is delivered to the user, based on the type of interactive system. For a chatbot system, the intensity could relate to the typing rate in the UI. In a speech-enabled system, intensity could change the volume and pitch variation of generated speech.

  • ... – Additional parameters/payload inherited from ChangeBotAction().
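
For example, an interaction manager could lower the delivery intensity of an utterance that is already running; the action_uid field used to identify the running action is an assumption.

```python
change_utterance_bot_action = {
    "type": "ChangeUtteranceBotAction",
    "action_uid": "<uid of the running UtteranceBotAction>",  # assumed identifier field
    "intensity": 0.2,  # deliver the rest of the utterance more calmly
}
```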

UtteranceBotActionScriptUpdated(interim_script: str)#

Provides script updates during a UtteranceBotAction. These events correspond to the time at which a certain part of the utterance is delivered to the user. In an interactive system that supports voice output, these events should align with when the user hears the partial script.

Parameters:
  • interim_script (str) – Partial script of the bot utterance up to this point in time

  • ... – Additional parameters/payload inherited from BotActionUpdated().
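
A sketch of how an Action Server with voice output might emit script updates in sync with audio playback; the chunked TTS output and the playback and emit callbacks are assumptions.

```python
def play_and_report(script_chunks, play_audio, emit) -> None:
    """Emit UtteranceBotActionScriptUpdated as each chunk reaches the user.

    script_chunks: iterable of (text, audio) pairs produced by TTS (assumed).
    play_audio:    blocking call that plays one audio chunk to the user (assumed).
    emit:          callback that sends a UMIM event (assumed).
    """
    delivered = ""
    for text, audio in script_chunks:
        play_audio(audio)  # the user hears this part of the utterance now
        delivered += text
        emit({
            "type": "UtteranceBotActionScriptUpdated",
            "interim_script": delivered,
        })
```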

StopUtteranceBotAction()#

Stops the bot utterance. The action is considered stopped only once the UtteranceBotActionFinished event has been received. For interactive systems that do not support this event, the action will continue to run normally until finished. The interaction manager is expected to handle arbitrary delays between the time the utterance is stopped and the time it actually finishes.

Parameters:

... – Additional parameters/payload inherited from StopBotAction().
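
A sketch of the expected interaction-manager behavior: the action only counts as stopped once UtteranceBotActionFinished arrives, and that confirmation may be arbitrarily delayed. The send_event and wait_for_event helpers are assumptions.

```python
async def stop_bot_utterance(action_uid: str, send_event, wait_for_event) -> dict:
    """Request a stop and treat the action as stopped only once Finished arrives."""
    await send_event({
        "type": "StopUtteranceBotAction",
        "action_uid": action_uid,  # assumed field identifying the running action
    })
    # Systems that do not support StopUtteranceBotAction simply finish the
    # action normally; the delay until Finished arrives can be arbitrary.
    return await wait_for_event("UtteranceBotActionFinished", action_uid=action_uid)
```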

UtteranceBotActionFinished(final_script: str)#

The bot utterance finished, either because the utterance has been delivered to the user or the action was stopped.

Parameters:
  • final_script (str) – Final script of the bot utterance

  • ... – Additional parameters/payload inherited from BotActionFinished().