# LLM Bot

This is an example chatbot that showcases how an LLM can be hooked into the ACE Agent speech pipeline. It uses a local deployment of the Llama3 8B model with NVIDIA NIM.

The LLM bot showcases the following ACE Agent features:

- Integrating any LLM model with ACE Agent
- Deploying a local LLM model using NVIDIA NIM
- Handling conversation history in actions
- Low latency using ASR 2-pass End of Utterance (EOU)
- Always-on barge-in support
- Deployment using the event architecture

> **Note:** The bot sends an early-trigger LLM API call for the user query and may need to re-trigger it if the user pauses for more than 240 ms between words. On average, expect about 2 extra LLM calls per user query, which adds compute cost when deploying at scale.

## Docker-based bot deployment

1. Set the `NGC_CLI_API_KEY` environment variable with your NGC Personal API key before launching the bot.

   ```bash
   export NGC_CLI_API_KEY=...
   ```

2. Deploy the Llama3 8B model locally. The model deployment requires an A100 or H100 GPU. You can skip to step 3 to use the hosted LLM model instead.

   1. Create a directory to cache the models and export the path to the cache as an environment variable:

      ```bash
      mkdir -p ~/.cache/model-cache
      export MODEL_DIRECTORY=~/.cache/model-cache
      ```

   2. Deploy the NeMo LLM inference microservice.

      ```bash
      USERID=$(id -u) docker compose -f ./samples/llm_bot/docker-compose-nim-ms.yaml up -d
      ```

   3. Update the LLM `BASE_URL` in `actions.py` if you are using a different system for model deployment (see the sketch below).

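      For reference, here is a minimal sketch of what the client setup in `actions.py` might look like; the endpoint address and model ID are illustrative assumptions, so check the sample's `actions.py` for the actual values.

      ```python
      # Hedged sketch: point an OpenAI-compatible client at a local NIM endpoint.
      # The endpoint address and model ID are placeholders, not the sample's
      # actual values.
      from openai import AsyncOpenAI

      BASE_URL = "http://<nim-host>:<port>/v1"  # your NIM deployment's endpoint

      client = AsyncOpenAI(base_url=BASE_URL, api_key="unused-for-local-nim")

      async def query_llm(messages: list[dict]) -> str:
          """Send the accumulated conversation turns to the LLM, return its reply."""
          response = await client.chat.completions.create(
              model="meta/llama3-8b-instruct",  # assumed model ID for Llama3 8B
              messages=messages,
          )
          return response.choices[0].message.content
      ```
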
3. Optionally, you can use the hosted NIM model from [build.nvidia.com](https://build.nvidia.com) instead (see the sketch below).

   - Update `actions.py` to use `BASE_URL` as `https://integrate.api.nvidia.com/v1`.
   - Set `NVIDIA_API_KEY` to a Personal NGC API key that has access to the AI Foundation Models and Endpoints service, or to a key generated from the NVIDIA API Catalog. To get your NVIDIA API key:
     1. Go to the NVIDIA API Catalog.
     2. Select any model.
     3. Click **Get API Key**.

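   Under those settings, the hosted configuration in `actions.py` might reduce to the following sketch (the exact variable names in the sample may differ):

   ```python
   # Hedged sketch: use the NVIDIA API Catalog's OpenAI-compatible endpoint.
   import os
   from openai import AsyncOpenAI

   client = AsyncOpenAI(
       base_url="https://integrate.api.nvidia.com/v1",
       api_key=os.getenv("NVIDIA_API_KEY"),  # exported in your shell beforehand
   )
   ```
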
4. Optionally, you can use OpenAI models (a usage sketch follows this step).

   - Update the `AsyncOpenAI` client in `actions.py` to not use `base_url`:

     ```python
     client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
     ```

   - Set the `OPENAI_API_KEY` environment variable.

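   Whichever backend you pick, the bot's actions pass the accumulated conversation history to the client on each turn. Below is a minimal sketch of that pattern, assuming an OpenAI-compatible client; the `history` list, `MODEL` name, and `answer` helper are hypothetical, and the real `actions.py` may structure this differently.

   ```python
   # Hedged sketch: stream a reply for the latest user query while keeping
   # conversation history. `history`, `MODEL`, and `answer` are illustrative.
   import os
   from openai import AsyncOpenAI

   client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
   MODEL = "gpt-4o-mini"  # hypothetical; use the model your deployment targets

   history: list[dict] = []  # accumulated {"role": ..., "content": ...} turns

   async def answer(user_query: str) -> str:
       history.append({"role": "user", "content": user_query})
       stream = await client.chat.completions.create(
           model=MODEL, messages=history, stream=True
       )
       reply = ""
       async for chunk in stream:
           delta = chunk.choices[0].delta.content
           if delta:  # delta can be None on role/final chunks
               reply += delta
       history.append({"role": "assistant", "content": reply})
       return reply
   ```
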
5. Prepare the environment for the Docker compose commands.

   ```bash
   export BOT_PATH=./samples/llm_bot/
   source deploy/docker/docker_init.sh
   ```

6. Deploy the Riva ASR (Automatic Speech Recognition) and TTS (Text to Speech) models.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
   ```

7. Deploy the ACE Agent microservices. This brings up the Chat Engine and Chat Controller containers.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up --build speech-event-bot -d
   ```

8. Interact with the bot at `http://<workstation IP>:7006/`. To access the microphone in the browser, either convert the `http` endpoint to `https` by adding SSL validation, or update `chrome://flags/` or `edge://flags/` to allow `http://<workstation IP>:7006` as a secure endpoint.

   Sample question: "What is the best GPU for gaming?"