LLM Bot#

This is an example chatbot that showcases how an LLM can be hooked into the ACE Agent Speech pipeline. It uses a local deployment of the Llama3 8b model with NVIDIA NIM.

The LLM bot showcases the following ACE Agent features:

  • Integrating any LLM model with ACE Agent

  • Deploying local LLM model using NVIDIA NIM

  • Handling conversation history in actions (see the sketch after this list)

  • Low latency using ASR 2-pass End of Utterance (EOU)

  • Always-on Barge-In support

  • Support for deployment using the Event Architecture

    Note

    An early-trigger user query is sent for the LLM API call and might need to be retriggered if the user pauses for more than 240 ms between words. On average, this can result in 2 extra LLM calls per user query, which requires extra compute/cost when deploying at scale.
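
A rough sketch of how conversation history can be handled in an action is shown below. The function name, model identifier, and history format are illustrative assumptions rather than the sample's actual actions.py code; the client follows the same AsyncOpenAI pattern used in the deployment steps later in this guide.

  # Illustrative sketch only -- names, endpoint, and model are assumptions.
  import os

  from openai import AsyncOpenAI

  # OpenAI-compatible endpoint of the LLM deployment (local NIM or hosted).
  client = AsyncOpenAI(
      base_url=os.getenv("BASE_URL", "https://integrate.api.nvidia.com/v1"),
      api_key=os.getenv("NVIDIA_API_KEY"),
  )

  # Conversation history kept as OpenAI-style chat messages across turns.
  history = [{"role": "system", "content": "You are a helpful voice assistant."}]

  async def generate_response(user_query: str) -> str:
      history.append({"role": "user", "content": user_query})
      completion = await client.chat.completions.create(
          model="meta/llama3-8b-instruct",  # assumed model name for Llama3 8b
          messages=history,
      )
      reply = completion.choices[0].message.content
      history.append({"role": "assistant", "content": reply})
      return reply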

Docker-based bot deployment#

  1. Set the NGC_CLI_API_KEY environment variable with your NGC Personal API key before launching the bot.

    export NGC_CLI_API_KEY=...
    
  2. Deploy the Llama3 8b model locally. The model deployment requires an A100 or H100 GPU. You can skip to step 3 to use the hosted LLM model instead.

    1. Create a directory to cache the models and export the path to the cache as an environment variable:

      mkdir -p ~/.cache/model-cache
      export MODEL_DIRECTORY=~/.cache/model-cache
      
    2. Deploy the NeMo LLM inference microservice.

      USERID=$(id -u) docker compose -f ./samples/llm_bot/docker-compose-nim-ms.yaml up -d
      
    3. Update the LLM BASE_URL in actions.py if you are using a different system for model deployment.
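
      For example, a minimal sketch of the client setup in actions.py (the host and port are placeholders for your deployment; local NIM deployments typically do not validate the API key, so any non-empty string works):

      BASE_URL = "http://<llm-host>:<llm-port>/v1"  # OpenAI-compatible NIM endpoint
      client = AsyncOpenAI(base_url=BASE_URL, api_key="unused")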

  3. Optionally, you can use the hosted NIM model from build.nvidia.com.

    1. Update actions.py to set BASE_URL to https://integrate.api.nvidia.com/v1 (see the sketch after these steps).

    2. Set NVIDIA_API_KEY to a Personal NGC API key that has access to the AI Foundation Models and Endpoints service, or to a key generated using the NVIDIA API Catalog. To get your NVIDIA API key:

      1. Go to the NVIDIA API Catalog.

      2. Select any model.

      3. Click Get API Key.
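
    Putting these two sub-steps together, the hosted-NIM client in actions.py might look like this sketch (assuming the key is exported as the NVIDIA_API_KEY environment variable):

      import os

      from openai import AsyncOpenAI

      client = AsyncOpenAI(
          base_url="https://integrate.api.nvidia.com/v1",
          api_key=os.getenv("NVIDIA_API_KEY"),
      )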

  4. Optionally, you can use the OpenAI models.

    1. Update the AsyncOpenAI client in actions.py so that it does not use base_url.

      # Assumes `import os` and `from openai import AsyncOpenAI` at the top of actions.py.
      client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
      
    2. Set the OPENAI_API_KEY environment variable, for example:
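
      export OPENAI_API_KEY=...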

  5. Prepare the environment for the Docker compose commands.

    export BOT_PATH=./samples/llm_bot/
    source deploy/docker/docker_init.sh
    
  6. Deploy the Riva ASR (Automatic Speech Recognition) and TTS (Text to Speech) models.

    docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
    
  7. Deploy the ACE Agent microservices: the Chat Engine and Chat Controller containers.

    docker compose -f deploy/docker/docker-compose.yml up --build speech-event-bot -d
    
  8. Interact with the bot using the URL http://<workstation IP>:7006/.

    To access the microphone in the browser, you need to either convert the HTTP endpoint to HTTPS by adding SSL validation, or update chrome://flags/ or edge://flags/ to allow http://<workstation IP>:7006 as a secure endpoint.

    Sample question: What is the best GPU for gaming?