Architectural Pattern

The architectural pattern introduced in this section proposes a solution for building interactive systems. The main architectural challenge tackled is how to integrate the “decisional” component, i.e. the high-level logic that decides how the system should react to the various events related to the user interaction. This decisional component is critical to how “smart” the system is.

Note

An architectural pattern is a way of solving a recurring architectural problem. MVC (Model-View-Controller), for instance, solves the problem of separating the UI from the model.

Conceptual Framework

The concepts relevant to the proposed architectural pattern are defined below.

User

A user is a human who wants to interact with an interactive system in order to achieve something, e.g. order food, book an appointment, get some information, etc.

Interactive System

An interactive system is a computer system that can interact with a human to assist them in achieving their goals. Interactive systems can also be characterized by a purpose/goal, e.g. advertising or selling a product, checking in a new hotel guest in the lobby, etc.

Event

An event is something that happens related to the user, the bot, the system, or the environment in which both the user and the system exist, e.g. a state change of the environment, an action performed by the user, or an action performed by the system.

Action

An action is something performed by the user or the interactive system, proactively or in response to other actions and events. Each action is associated with a single modality; multiple actions can map to the same modality (e.g. speak, shout, whisper, and hum all map to BotVoice), but an action cannot occupy multiple modalities. Every action is represented in UMIM by a sequence of events that captures its lifetime and execution: events are generated to start and stop an action and to provide updates on the action execution. See section Base Action Specification for more information.
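
As an illustration of an action's lifetime, the sketch below models the event sequence for a hypothetical BotUtterance action. The class and field names are assumptions made for illustration and do not reproduce the exact UMIM event schema:

    from dataclasses import dataclass, field
    from datetime import datetime
    import uuid

    @dataclass
    class ActionEvent:
        # Hypothetical base event; UMIM defines its own base action specification.
        action_uid: str
        timestamp: datetime = field(default_factory=datetime.utcnow)

    @dataclass
    class StartBotUtterance(ActionEvent):    # the decision to perform the action
        text: str = ""

    @dataclass
    class BotUtteranceStarted(ActionEvent):  # execution has started
        pass

    @dataclass
    class BotUtteranceFinished(ActionEvent): # execution has completed
        status: str = "success"

    # The lifetime of one action is a sequence of events sharing the same action_uid.
    uid = str(uuid.uuid4())
    lifetime = [
        StartBotUtterance(action_uid=uid, text="Hello!"),
        BotUtteranceStarted(action_uid=uid),
        BotUtteranceFinished(action_uid=uid),
    ]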

Context

The context contains all the information available to the interactive system that is relevant to the interaction. Typically, the context will contain various types of information such as configuration data, session data, environment data (e.g. objects, variants), user information, etc.
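
A minimal illustration of what such a context might look like; the structure and keys below are assumptions for illustration, since the exact content is system-specific:

    # Hypothetical context snapshot for an interactive system.
    context = {
        "config": {"language": "en-US", "voice": "bot_voice_1"},
        "session": {"session_id": "abc-123", "started_at": "2024-01-01T10:00:00Z"},
        "environment": {"objects": ["kiosk_screen", "printer"]},
        "user": {"name": "Alice", "authenticated": True},
    }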

Sensor

Sensors capture the raw input from the interaction with the user, e.g. microphone, camera, touch screen, temperature, motion, etc.

Session

A session (or interaction session) is completely characterized by a sequence of events that represent actions performed by the user, actions performed by the system and other changes related to the state of the system.

Modality

In multimodal interactions, information between the user and the interactive system can be exchanged on multiple modalities. A modality is an independent “interaction channel” between the bot and the user. Modalities do not refer to technical channels but to “interaction channels”. Examples of modalities are UserVoice and BotVoice. Every action affects a single modality, which in turn can have an effect on multiple system channels (see below). An example: the BotUtterance action (letting the bot verbally communicate with the user) is tied to the BotVoice modality. In an interactive system where the bot is represented by a 3D avatar, this modality is mapped to multiple system channels: audio out (synthesized speech), lip movement (lip synchronization to speech), and text on the user interface (subtitles of the utterance).

System channel

System channels are technical channels that allow sending information from the interactive system to the user and vice versa. Examples of system channels include audio output, microphone input, camera input, or the user interface. It is the responsibility of the interactive system to map different modalities to system channels. Depending on the interactive system, a single UMIM modality can cover multiple system channels. The BotMovement action, for example, affects the BotLowerBody modality but can be mapped to the system channels lower body animation (walking animation) and audio output (footsteps).
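
The sketch below illustrates how an interactive system might map modalities to its own system channels, using the BotVoice and BotLowerBody examples above. The channel names and the mapping itself are assumptions for illustration; the pattern only requires that the interactive system owns this mapping:

    # Hypothetical mapping from a modality to the system channels it drives.
    MODALITY_TO_CHANNELS = {
        "BotVoice": ["audio_out", "lip_sync", "subtitles_ui"],
        "BotLowerBody": ["lower_body_animation", "audio_out"],  # walking animation + footsteps
    }

    def channels_for(modality: str) -> list[str]:
        # Return the system channels affected by an action on the given modality.
        return MODALITY_TO_CHANNELS.get(modality, [])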

The Interaction Manager (IM) Architectural Pattern

The Interaction Manager pattern aims to separate the decision logic from the rest of the interactive system and defines three architectural constraints:

  1. Interaction Manager as a distinct event-driven component;

  2. Events, Context and Actions as the core interface for the Interaction Manager;

  3. Sensors Server and Actions Server as separate components.

The sections below describe each of these architectural constraints in more detail.

1. Distinct Event-driven Component

The architecture of the Interactive System should have a distinct component called the Interaction Manager responsible for deciding what actions the system should perform in response to user actions or other events, by taking into account the current context. The IM should interact with the rest of the system only through an event-driven mechanism. There should be no shared state between the IM and the rest of the Interactive System.

[Figure system_overview.png: Interaction Manager (IM) as a distinct event-driven component in the architecture.]
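
As a minimal sketch of this constraint, the toy loop below consumes events from an input queue and emits action events to an output queue, without sharing any state with the rest of the system. The queue-based interface and the event dictionaries are assumptions for illustration:

    import queue

    def interaction_manager_loop(incoming: queue.Queue, outgoing: queue.Queue) -> None:
        # Toy IM loop: consume events, decide, emit action events. No shared state.
        while True:
            event = incoming.get()  # e.g. {"type": "UserSaid", "text": "hi"}
            if event.get("type") == "UserSaid":
                outgoing.put({"type": "StartAction", "action": "BotUtterance", "text": "Hello!"})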

2. Events, Actions and Context

The interaction between the IM and the rest of the system should be done through three main types of events (a minimal sketch follows the list):

  1. General Events: represent anything that “happens” and is relevant to the interaction, e.g. the user says something (UserSaid), a user gesture (UserGesture), a selection using a UI element (UserSelection), etc.

  2. Action Events: relate to what the interactive system should do, e.g. say something, play a sound, show something on a display, change the avatar, call a third-party API, etc. These events should mark the various relevant points in the lifecycle of an action, e.g. the decision to do something (StartAction), when the action is started (ActionStarted) or finished (ActionFinished), etc.

  3. Context Events: represent changes to any data contained in the interactive system (ContextUpdate) that is relevant to the interaction, e.g. user name, user rights, selected product, device information, etc.
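
The sketch below models the three event categories as hypothetical dataclasses; the names and fields are illustrative rather than the actual UMIM event definitions:

    from dataclasses import dataclass

    @dataclass
    class UserSaid:          # 1. General event: something relevant happened
        text: str

    @dataclass
    class StartAction:       # 2. Action event: the IM decided the system should do something
        action_name: str
        parameters: dict

    @dataclass
    class ContextUpdate:     # 3. Context event: interaction-relevant data changed
        data: dict

    events = [
        UserSaid(text="I'd like to book a table."),
        StartAction(action_name="BotUtterance", parameters={"text": "For how many people?"}),
        ContextUpdate(data={"selected_product": "table_reservation"}),
    ]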

3. Sensors Server and Actions Server

The Interaction Manager should not execute any of the actions, nor should it process any sensor input directly. A Sensors Server (SS) component is responsible for processing the raw input data (audio, visual, text, custom events) and producing the events that are the input for the Interaction Manager. An Actions Server (AS) is responsible for the execution of the actions. The AS can also generate additional events for the IM, e.g. to signal the start or the successful/failed execution of an action.

[Figure sensor_and_action_server.png: Sensors Server (SS) and Actions Server (AS) as separate components.]
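
To make the division of responsibilities concrete, the sketch below shows a toy Actions Server executing an action and reporting its lifecycle back to the IM as events. The function, the emit callback, and the event dictionaries are assumptions for illustration:

    def actions_server_execute(start_event: dict, emit) -> None:
        # Toy Actions Server: execute an action and report lifecycle events to the IM.
        action = start_event["action"]
        emit({"type": "ActionStarted", "action": action})
        try:
            # ... perform the action on the relevant system channels ...
            emit({"type": "ActionFinished", "action": action, "status": "success"})
        except Exception as err:
            emit({"type": "ActionFinished", "action": action, "status": "failed", "error": str(err)})

    # Example usage: the emit callback could push events onto the IM's input queue.
    actions_server_execute({"type": "StartAction", "action": "BotUtterance"}, emit=print)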

A Few Notes on the Architectural Pattern

This section comments on some additional aspects relevant to the IM pattern and the overall architecture of a system implementing this pattern.

Interaction Sessions

An interaction session is fully described by the stream of events. The stream of events for an interaction is immutable, i.e. once events are added, they cannot be removed.

Synchronizing Action Execution

An important part of the IM pattern is that the execution of actions generates events that can be used to trigger additional actions.

For example, the IM can decide that the system should say “Hello!” and, only when the Say action has finished, make a specific gesture, e.g. point to a screen and ask something. In this case, the ActionFinished(Say) event will be used by the IM to send StartAction(MakeGesture).

As another example, the IM can decide to start a waving animation when the Say(hello) action has started, and stop the animation when Say(hello) has finished. In this case, the ActionStarted(Say) and ActionFinished(Say) can be used as the triggers for StartAction(MakeGesture(Wave)) and StopAction(MakeGesture(Wave)).
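
A minimal sketch of the second example: a decision rule inside the IM that starts a waving gesture when the Say action starts and stops it when the Say action finishes. The event and action names follow the examples above, but the dictionary-based representation is an assumption for illustration:

    def on_event(event: dict, emit) -> None:
        # Toy synchronization rule inside the IM.
        if event["type"] == "ActionStarted" and event.get("action") == "Say":
            emit({"type": "StartAction", "action": "MakeGesture", "gesture": "Wave"})
        elif event["type"] == "ActionFinished" and event.get("action") == "Say":
            emit({"type": "StopAction", "action": "MakeGesture", "gesture": "Wave"})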

Having Multiple Interaction Managers

The IM pattern does not impose that there should be only one interaction manager. An architecture can have multiple IMs and there are two main scenarios:

  1. A primary IM with internal (secondary) IMs;

  2. Multiple peer IMs, each dealing with different types of events.

As a concrete example for scenario 1, in an interactive avatar experience, a primary IM can manage the high-level flow of the interaction (e.g., the various stages like greeting, gathering data, providing data, getting confirmation, etc.) and hand over to more specific IMs when needed (e.g., for a complex authentication flow, or for an IRQA scenario).

As a concrete example for scenario 2, one IM can deal with the conversational logic (i.e. what the bot should say), while a second IM can deal with animating the avatar based on what it says.
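
For the peer-IM scenario, one simple realization is to route each event to the IMs interested in it, based on the event type. The routing table and component names below are assumptions for illustration, not part of the pattern itself:

    # Hypothetical routing: one IM handles conversational logic, another handles animation.
    ROUTES = {
        "UserSaid": ["conversation_im"],
        "ActionStarted": ["animation_im"],
        "ActionFinished": ["conversation_im", "animation_im"],
    }

    def route(event: dict) -> list[str]:
        # Return the names of the IMs that should receive this event.
        return ROUTES.get(event["type"], [])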

Statefulness

An interaction manager can be implemented either as a stateful component or as a stateless one (a sketch of both options follows the list below).

  • In a stateful approach, an explicit “session management” capability needs to be supported by the IM.

  • In a stateless approach, the full history of the interaction would have to be provided with every new event.
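
The two options mainly differ in the IM's interface, as sketched below; the class, function, and decide placeholder are assumptions for illustration:

    # Stateful IM: keeps per-session event history internally (needs session management).
    class StatefulIM:
        def __init__(self):
            self.sessions: dict[str, list[dict]] = {}

        def process(self, session_id: str, event: dict) -> list[dict]:
            history = self.sessions.setdefault(session_id, [])
            history.append(event)
            return decide(history)

    # Stateless IM: the caller supplies the full interaction history with every new event.
    def stateless_im(history: list[dict], event: dict) -> list[dict]:
        return decide(history + [event])

    def decide(history: list[dict]) -> list[dict]:
        # Placeholder decision logic returning the actions to start.
        return []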

Multiple Events at Once

In practice, while the IM is busy processing an event (i.e. deciding on the next action), multiple events could be generated by other parts of the interactive system. Whether the events are processed one by one or all at once is up to the interactive system architecture (the latter has additional challenges, so a one-by-one approach is recommended).
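
A minimal sketch of the recommended one-by-one approach, assuming a simple in-memory queue: events that arrive while the IM is busy are appended and handled on a later iteration:

    from collections import deque

    def process_pending(events: deque, handle) -> None:
        # Process events strictly one by one, in arrival order.
        while events:
            event = events.popleft()
            handle(event)  # deciding on the next action may enqueue further events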

Handling NLU

There are three main patterns for integrating the NLU in an interactive system:

  1. As a separate component. In practice, in most systems the NLU is separated from the dialog management component. By following the same pattern, a separate NLU component in the architecture would process the UserSaid events and generate UserIntent events, which would then be processed by the IM (see the sketch after this list).

  2. As an “Interpret” action. The IM would explicitly trigger an Interpret action which would interpret the message and return a structured representation and a confidence level. Based on that, the IM would decide on the next action.

  3. Internal to the IM. The IM itself could handle the NLU as well. For example, the IM can call an ML model to interpret the message and, based on that, decide what to do next. This approach has the disadvantage that the processing of events will be blocked during the interpretation of the message. This might have undesirable consequences in a scenario that involves multiple modalities.
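
A minimal sketch of the first pattern: a separate NLU component that consumes UserSaid events and produces UserIntent events for the IM. The classify helper and the intent names are placeholders, not part of the pattern:

    def nlu_component(event: dict, emit) -> None:
        # Toy NLU component: turn a UserSaid event into a UserIntent event for the IM.
        if event["type"] != "UserSaid":
            return
        intent, confidence = classify(event["text"])
        emit({"type": "UserIntent", "intent": intent, "confidence": confidence})

    def classify(text: str) -> tuple[str, float]:
        # Placeholder intent classifier (e.g. an ML model in a real system).
        if "book" in text.lower():
            return "book_appointment", 0.92
        return "unknown", 0.30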