Architecture
Tokkio is composed of multiple pipelines, each of which consists of several microservices. It is an event-driven system coordinated through a message bus: the streaming pipeline drives the lifecycle of a Tokkio transaction by firing events when a WebRTC stream connects or disconnects. Once a stream is connected, all pipelines run in parallel to serve incoming requests from the user.
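To make the event-driven pattern concrete, the sketch below shows a microservice consuming stream-lifecycle events from the bus. It assumes a Redis-backed bus and an illustrative topic and payload schema; the actual bus technology and event names are internal details of the Tokkio deployment.

```python
import json
import redis  # assumed Redis-backed message bus; pip install redis

bus = redis.Redis(host="localhost", port=6379)
last_id = "$"  # consume only events published after startup

while True:
    # Block until a new event arrives on the (hypothetical) lifecycle topic.
    for _stream, entries in bus.xread({"stream_lifecycle": last_id}, block=0):
        for entry_id, fields in entries:
            last_id = entry_id
            event = json.loads(fields[b"payload"])
            if event["type"] == "stream_connected":
                print(f"start pipelines for stream {event['stream_id']}")
            elif event["type"] == "stream_disconnected":
                print(f"tear down pipelines for stream {event['stream_id']}")
```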
The diagram below gives an overview of the interconnections between the microservices within the Tokkio system. Each component is modeled as a microservice, which makes it straightforward to change or customize individual components as needed.
Tokkio can be divided into 6 pipelines: Streaming, Vision, Speech, Interaction, Fulfillment, and Animation/Rendering.
Streaming Pipeline
The streaming pipeline provides seamless video and audio communication between a user's webcam and the cloud, and from the cloud back to the Tokkio UI. The process begins in a web client (Google Chrome is the supported browser) where the Tokkio UI is rendered. The user must grant camera and microphone permissions to initiate the pipeline. Once permissions are granted, video and audio are transmitted bidirectionally between the user's device and the Video Storage Toolkit (VST) using the WebRTC protocol. Media flows only after an initial signaling procedure completes via a REST API at the ingress gateway. To ensure reliable transmission, especially in environments with stringent security measures that block direct connections, a reverse proxy or TURN server is employed.
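The signaling step can be pictured as a simple SDP offer/answer exchange over REST. The sketch below is illustrative only: the endpoint path, payload shape, and host name are assumptions, not VST's actual API.

```python
import requests

INGRESS = "https://tokkio.example.com"  # hypothetical ingress gateway

def signal(local_sdp_offer: str) -> str:
    """Exchange an SDP offer for an answer before WebRTC media starts."""
    resp = requests.post(
        f"{INGRESS}/vst/v1/webrtc/offer",  # hypothetical route
        json={"type": "offer", "sdp": local_sdp_offer},
        timeout=10,
    )
    resp.raise_for_status()
    # The returned answer lets the browser and VST negotiate media and,
    # where direct connectivity is blocked, fall back to a TURN relay.
    return resp.json()["sdp"]
```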
Vision Pipeline
The vision pipeline processes the video stream in real time, continuously analyzing user presence and attention levels. It publishes vision alerts that notify the relevant systems of detected activity and engagement, enabling timely and accurate responses to user interactions.
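A vision alert is simply another event on the message bus. A hypothetical payload (the field names are illustrative, not Tokkio's actual schema) might look like this:

```python
# Hypothetical vision-alert payload; field names are illustrative only.
vision_alert = {
    "type": "vision_alert",
    "stream_id": "a1b2c3",        # ties the alert to a WebRTC stream
    "alert": "user_present",      # e.g. user_present / user_absent
    "attention": 0.87,            # assumed engagement score in [0, 1]
    "timestamp": "2025-01-01T12:00:00Z",
}
```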
Speech Pipeline
The speech pipeline is built around the Chat Controller Action Server, a real-time audio inference microservice. It uses NVIDIA's Riva technology to perform Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS). The microservice integrates with the rest of the system by emitting events that comply with the UMIM (Unified Multimodal Interaction Management) standard, ensuring compatibility and efficient communication across components.
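The same Riva services can be exercised directly with NVIDIA's Python client (nvidia-riva-client). The sketch below is a minimal offline ASR example, assuming a Riva server reachable at localhost:50051 and a 16 kHz mono WAV file; inside Tokkio these calls are made by the Chat Controller, not by application code.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")  # assumed Riva endpoint
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,  # must match the recording
    language_code="en-US",
    max_alternatives=1,
)

with open("utterance.wav", "rb") as f:  # hypothetical 16 kHz mono recording
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```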
Interaction Pipeline
The interaction pipeline orchestrates the avatar's responses by interpreting and reacting to user events detected through vision and speech analysis, ensuring that the avatar engages in meaningful and contextually appropriate interactions.
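The pattern can be sketched as a small event handler that maps incoming user events to bot actions. The event and field names below follow UMIM naming conventions, but the dispatch logic and the generate_reply stub are purely illustrative.

```python
def generate_reply(transcript: str) -> str:
    # Stand-in for the real response generation (e.g. an LLM/RAG call).
    return f"You said: {transcript}"

def handle_event(event: dict) -> list[dict]:
    if event["type"] == "UtteranceUserActionFinished":
        # The user finished speaking: answer with a bot utterance.
        return [{
            "type": "StartUtteranceBotAction",
            "script": generate_reply(event["final_transcript"]),
        }]
    if event["type"] == "vision_alert" and event.get("alert") == "user_present":
        # The user just appeared on camera: greet them proactively.
        return [{"type": "StartUtteranceBotAction", "script": "Hi, how can I help?"}]
    return []  # ignore events this handler does not cover
```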
Fulfillment Pipeline
The fulfillment pipeline facilitates interactions with third-party applications or APIs via a REST interface, as detailed in the Tokkio Plugin Server documentation. It integrates with the Tokkio UI through the UI server, enabling direct communication and interaction. The pipeline is highly flexible and can be customized or replaced entirely to suit specific use cases.
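As an illustration of the idea, a fulfillment plugin can be as small as one REST endpoint that fronts an external API. The route and schema below are hypothetical; consult the Plugin Server documentation for the actual contract.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class WeatherQuery(BaseModel):
    city: str

@app.post("/weather")  # hypothetical plugin route
def weather(query: WeatherQuery) -> dict:
    # A real plugin would call a third-party weather API here.
    return {"answer": f"It is sunny in {query.city} today."}
```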
Rendering Pipeline
The rendering pipeline is responsible for generating and animating the avatar based on inputs from the interaction management system. Tokkio offers multiple out-of-the-box rendering pipelines that can be switched according to specific needs (a sketch of the hand-off to the renderer follows the list):
- 3D avatar rendering
  - The default NVIDIA Omniverse-based rendering engine.
  - The animation graph microservice animates the avatar based on inputs from the interaction manager and Audio2Face microservices.
- A2F-2D avatar rendering
  - Visit Tokkio LLM-RAG - A2F-2D for more details.
- Unreal Engine renderer
  - Visit Tokkio LLM-RAG - Unreal Engine for more details.
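Whichever renderer is selected, the hand-off follows the same event-driven pattern: the interaction pipeline emits bot actions, and the rendering pipeline consumes them to drive the avatar's voice and body. The action names below follow UMIM conventions, and the publish helper is a stand-in; both are illustrative.

```python
import json

def publish(topic: str, event: dict) -> None:
    # Stand-in for a real message-bus publish (e.g. a Redis XADD).
    print(f"{topic}: {json.dumps(event)}")

# The utterance is synthesized by TTS and lip-synced via Audio2Face;
# the gesture is played back by the animation graph microservice.
publish("bot_actions", {"type": "StartUtteranceBotAction", "script": "Hello!"})
publish("bot_actions", {"type": "StartGestureBotAction", "gesture": "wave"})
```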