Customizing a Bot#

Selecting an LLM Model#

ACE Agent allows you to select LLM models to control the flow and generate responses. You can use OpenAI models, NVIDIA-hosted NIM models, or locally deployed NIM models in the bot configs. Check the LLM Model Configuration section for the full list of supported models.

In the bot config file, update the engine of the main model type to your model provider and set the model of your choice.

models:
  - type: main
    engine: openai
    model: gpt-4-turbo

The following models have been tested with the Colang 2.0-beta version.

OpenAI models:

  • gpt-3.5-turbo-instruct

  • gpt-3.5-turbo

  • gpt-4-turbo

  • gpt-4o

  • gpt-4o-mini

NIM models:

  • meta/llama3-8b-instruct

  • meta/llama3-70b-instruct

  • meta/llama-3.1-8b-instruct

  • meta/llama-3.1-70b-instruct

Using an On-Premise LLM Model Deployed via NIM#

In this tutorial, we showcase how to use an LLM deployed with NVIDIA NIM for Large Language Models (with vLLM or TensorRT-LLM optimizations) and update the sample Stock bot to use the locally deployed model.

  1. Deploy the meta/llama3-8b-instruct model locally on an A100 or H100 GPU by following the instructions from Llama3 8B Instruct NIM.

  2. Update the stock_bot_config.yaml file present in ./samples/stock_bot in the Quick Start resource directory to utilize the locally deployed meta/llama3-8b-instruct LLM model.

    models:
      - type: main
        engine: nim
        model: meta/llama3-8b-instruct
        parameters:
          stop: ["\n"]
          max_tokens: 100
          base_url: "http://0.0.0.0:8000/v1"  # points to the locally deployed NIM endpoint
    
  3. Try the sample Stock bot by deploying it using the steps from the Quick Start Guide. You can explore other LLM NIMs for deployment at build.nvidia.com. A quick verification sketch for the local NIM endpoint follows the note below.

    Note

    If you are deploying LLM models with TensorRT-LLM optimizations using Triton, you might see port conflicts between the NIM deployment and the ACE Agent Triton deployment or the Riva Skills server. Update the NIM deployment commands to not use host networking (the --network host option) and expose the OpenAI port and any other required ports explicitly.
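
    Before updating the bot config, you can optionally sanity-check that the locally deployed NIM endpoint responds. The following is a minimal sketch using the OpenAI-compatible API, assuming the openai Python package (v1.x) is installed and the NIM server is listening at http://0.0.0.0:8000/v1.

    from openai import OpenAI

    # Local NIM deployments expose an OpenAI-compatible endpoint; the API key is
    # typically not validated for local deployments.
    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[{"role": "user", "content": "Give a one-line summary of what a stock index is."}],
        max_tokens=64,
    )
    print(completion.choices[0].message.content)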

Creating a New Custom Action#

ACE Agent bots based on Colang use actions to accomplish tasks such as intent generation, bot message generation, calling the Plugin server, and so on. ACE Agent allows you to create your own custom actions and override some of the actions defined by ACE Agent.

  1. Create a new file in the bot directory called actions.py. This is a special file that is initialized during bot startup. Any actions defined here are registered in Colang, and can be used in Colang flows.

  2. Create a simple action that will check if the question has any words from a block list. Update actions.py with the following.

    from nemoguardrails.actions.actions import action
    from typing import Dict, Any
    
    BLOCK_LIST = ["stupid", "moron"]
    
    @action(name="isBlockWordPresentAction")
    async def check_block_list(context: Dict[str, Any] = {}):
        # The last user utterance is stored in the Colang context.
        question = context.get("last_user_transcript") or ""
        return any(word in question.lower() for word in BLOCK_LIST)
    
  3. Update the user queried about stocks flow in the sample Stock bot to call the newly created isBlockWordPresentAction custom action and make a decision based on the action’s response.

    flow stock faq
        global $last_user_transcript
        user queried about stocks
        $should_block = await isBlockWordPresentAction()
        if $should_block
            bot say "Please do not use blocked words"
        else
            $retrieval_results = await RetrieveRelevantChunksAction()
            $response = ..."{$last_user_transcript}. You can take context from following section: {$retrieval_results}. Enclose the response in quotes."
            bot say "{$response}"
    

Asking a question like "Is it stupid to invest in an IPO?" will result in the fallback response "Please do not use blocked words", whereas questions without blocked words will be accepted.

Similarly, it is possible to create custom actions which can accept arguments from Colang flows.
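
For example, the following is a minimal sketch of an action that accepts a keyword argument from a flow; the action name, the ticker list, and the calling syntax shown in the comment are illustrative and not part of the sample Stock bot.

from typing import List

from nemoguardrails.actions.actions import action

SUPPORTED_TICKERS: List[str] = ["NVDA", "AAPL", "MSFT"]

@action(name="IsTickerSupportedAction")
async def is_ticker_supported(ticker: str = ""):
    # Called from a Colang flow with a keyword argument, for example:
    #   $supported = await IsTickerSupportedAction(ticker=$ticker)
    return ticker.upper() in SUPPORTED_TICKERS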

Using a Custom NLP Model#

In this section, we will focus on how to deploy a custom NLP model.

Deploying a Custom NLP Model with the NLP Server#

The NLP server allows you to easily deploy custom NLP models, such as Hugging Face models, and integrate them with the dialog pipeline. It mainly relies on the @model_api and @pytriton decorators.

Custom Model Integration using the @model_api Decorator#

Let’s go through an example of integrating custom model inference for the /nlp/model/generate_embedding endpoint, which returns embeddings for the given queries.

  1. Get familiar with the required request and response schemas for the /nlp/model/generate_embedding endpoint on the NLP server Swagger page of any existing deployment.

  2. Create a Python function that takes input in the same format as the request schema, performs model inference, and returns output in the same format as the response schema.

    For this example, let’s simulate model inference by returning random embeddings for the given queries. Your Python function should look similar to the following:

    import numpy as np
    
    async def random_embeddings(input_request):
        """
        Generate random embeddings for the given queries.
        """
        return {"queries": input_request.queries,
                "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist()}
    

    The NLP server exposes the @model_api decorator to map inference functions to particular API endpoints. During API requests, the server uses the combination of model_name and model_version as a unique identifier to select the inference function to execute.

  3. Add the @model_api decorator to the random embedding generation inference function.

    import numpy as np
    from nlp_server.decorators import model_api
    
    @model_api(endpoint="/nlp/model/generate_embedding", model_name="random_embedding", model_version="1")
    async def random_embeddings(input_request):
        """
        Generate random embeddings for the given queries.
        """
        return {"queries": input_request.queries,
                "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist()}
    
  4. Start the NLP server with the random embedding inference client integrated.

    aceagent nlp-server deploy --custom_model_dir random_embedding.py
    

    Optionally, you can specify the client module in the model_config.yaml file.

    model_servers:
      - name: custom
        nlp_models:
          - random_embedding.py # Absolute or relative path from model_config.yaml
    

    To start the NLP server with the client, run:

    aceagent nlp-server deploy --config model_config.yaml
    
  5. Verify the changes by querying with model_name set to "random_embedding" on the NLP server Swagger page for the /nlp/model/generate_embedding endpoint, or use the following curl command (a Python equivalent follows the command):

    curl -X 'POST' \
        'http://0.0.0.0:9003/nlp/model/generate_embedding' \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
        "queries": [
            "Random query"
        ],
        "model_name": "random_embedding"
    }'
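
    The same check can be scripted from Python, assuming the requests package is installed and the NLP server is reachable on localhost:9003; the response mirrors the dictionary returned by the inference function.

    import requests

    response = requests.post(
        "http://0.0.0.0:9003/nlp/model/generate_embedding",
        json={"queries": ["Random query"], "model_name": "random_embedding"},
    )
    response.raise_for_status()
    # Each embedding returned by the random_embedding client has 768 dimensions.
    print(len(response.json()["embeddings"][0]))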
    

Custom Model Integration using the @pytriton Decorator#

We can integrate any custom model client using the @model_api decorator by following the steps from the previous section. However, if the custom model client loads a model inside the @model_api function or its Python module and the NLP server runs with multiple workers, the model is loaded separately on every worker. Loading the model multiple times results in higher memory usage and degraded performance.

For production use cases, where the aim is higher throughput, we recommend offloading GPU-heavy processing to a model server such as the NVIDIA Triton Inference Server and keeping the inference client a lightweight, async implementation. Model servers such as Triton are designed to manage GPU and CPU resources efficiently and can batch parallel requests. The @pytriton decorator allows you to offload heavy compute to Triton and avoid loading the model multiple times.

To utilize the @pytriton decorator to deploy the facebook/bart-large-mnli model from Hugging Face, perform the following steps.

  1. Create an example inference client for the facebook/bart-large-mnli model from Hugging Face.

    from transformers import pipeline
    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
    input_queries = ["I love cricket", "Lets watch movie today"]
    classification_result = CLASSIFIER(input_queries, LABELS)
    result_labels = [res["labels"][0] for res in classification_result]
    print(result_labels)
    
  2. Arrange your inference client in the following format. We will be using PyTriton for hosting the model on the NLP server. You might want to get familiar with PyTriton by following the Quick Start steps from the PyTriton GitHub repository.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    
    triton = Triton()
    
    # Model Initialization / Loading Code
    ...
    
    # Creating Inference function
    @batch
    def infer_fn(**inputs: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        ...
        return outputs
    
    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="<custom_model_name>",
        infer_func=infer_fn,
        inputs=[
            ...
        ],
        outputs=[
            ...
        ],
        config=ModelConfig(...)
    )
    
    # Serving model
    triton.serve()
    
  3. Convert the inference client into a PyTriton-compatible format.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    
    triton = Triton()
    
    # Model Initialization / Loading Code
    from transformers import pipeline
    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
    # Creating Inference function
    @batch
    def infer_fn(queries: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
        classification_result = CLASSIFIER(input_queries, LABELS)
        result_labels = [res["labels"][0] for res in classification_result]
        return {"labels": np.char.encode(result_labels, "utf-8")}
    
    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="facebook-bart-large-mnli",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="queries", dtype=bytes, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="labels", dtype=bytes, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=4)
    )
    
    # Serving model
    triton.serve()
    
  4. Test your code by running the following commands:

    1. Install PyTriton locally.

      pip install -U "nvidia-pytriton<0.4.0"
      
    2. Save the code in a Python file and run:

      python custom_model.py
      

      You should see the NVIDIA Triton Inference Server hosted at localhost:8000. You can interact with the model using the following code:

      import numpy as np
      from pytriton.client import ModelClient
      
      with ModelClient("localhost:8000", "facebook-bart-large-mnli") as client:
          result_dict = client.infer_batch(np.char.encode([["Lets watch movie today"]], "utf-8"))
      
      print(result_dict)
      
  5. Integrate the above code into the NLP server for easier, automated deployment via the @pytriton decorator. The @pytriton-decorated function is executed only once during startup, even with multiple workers, so the model is not loaded multiple times in GPU memory.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from nlp_server.decorators import pytriton
    
    # @pytriton decorator
    @pytriton()
    def custom_pytriton_model(triton: Triton):
        # Model Initialization / Loading Code
        from transformers import pipeline
        CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
        LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
        # Creating Inference function
        @batch
        def infer_fn(queries: np.ndarray):
            # Inference for batch of inputs using already loaded model and returning outputs
            input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
            classification_result = CLASSIFIER(input_queries, LABELS)
            result_labels = [res["labels"][0] for res in classification_result]
            return {"labels": np.char.encode(result_labels, "utf-8")}
    
        # Connecting inference callable with Triton Inference Server
        triton.bind(
            model_name="facebook-bart-large-mnli",
            infer_func=infer_fn,
            inputs=[
                Tensor(name="queries", dtype=bytes, shape=(-1,)),
            ],
            outputs=[
                Tensor(name="labels", dtype=bytes, shape=(-1,)),
            ],
            config=ModelConfig(max_batch_size=4)
        )
    
  6. Create the @model_api-decorated function to override the NLP server API endpoint of your choice. For this model, we will override the /nlp/model/text_classification endpoint.

    import numpy as np
    from pytriton.client import ModelClient
    from nlp_server.decorators import model_api
    
    @model_api(endpoint="/nlp/model/text_classification", model_name="facebook-bart-large-mnli", model_type="triton")
    def bart_mnli_model_api(input_request):
        # The NLP server embeds model metadata in the function as a model_info attribute;
        # you can access the Triton server URL via bart_mnli_model_api.model_info.url.
        with ModelClient(
            url=f"grpc://{bart_mnli_model_api.model_info.url}",
            model_name=bart_mnli_model_api.model_info.model_name,
            model_version=bart_mnli_model_api.model_info.model_version,
        ) as client:
            result_dict = client.infer_batch(np.array([[np.char.encode(input_request.query, "utf-8")]]))
        return {"class_name": result_dict["labels"][0].decode("utf-8"), "score": 1}
    
  7. Save both the @pytriton and @model_api clients in a single Python file and start the NLP server.

    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --custom_model_dir <custom_model.py>
    

    Optionally, specify the client modules in the model_config.yaml file.

    model_servers:
      - name: custom
        nlp_models:
          - <custom_model.py> # Absolute or relative path from model_config.yaml
    

    To start the NLP server with the client, run:

    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --config model_config.yaml
    
  8. Verify the changes by querying with model_name set to "facebook-bart-large-mnli" on the NLP server Swagger page for the /nlp/model/text_classification endpoint, or use the following curl command:

    curl -X 'POST' \
      'http://localhost:9003/nlp/model/text_classification' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "query": "I love cricket",
      "model_name": "facebook-bart-large-mnli"
    }'
    

Customizing ASR Recognition with Word Boosting#

Word boosting allows you to bias the ASR engine to recognize particular words of interest or domain-specific words at request time by giving them a higher score when decoding the output of the acoustic model.

For more information about word boosting, refer to the Riva Word Boosting documentation.

  1. For our sample bot, let’s add a speech_config.yaml file.

    samples/
    └── stock_bot
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
  2. Create a word boosting file asr_words_to_boost.txt containing the words to boost along with a boosting score.

    samples/
    └── stock_bot
        ├── asr_words_to_boost.txt
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
    {
      "comment": "speech_context can have multiple entries. Each entry has a single boost value and multiple phrases.",
      "speech_context": [
        {
          "boost": 40,
          "phrases": [
            "Nvidia"
          ]
        }
      ]
    }
    
  3. Add the path of this word boosting file in the ASR component of the speech_config.yaml file.

    riva_asr:
      RivaASR:
        word_boost_file_path: "/workspace/config/asr_words_to_boost.txt"
    
  4. Deploy the bot with the above ASR customization.

Customizing ASR Recognition for Long Pause Handling#

Pauses in human speech can serve several important functions, such as allowing the speaker to gather their thoughts, choose words carefully, and give the audience a chance to process what has been said. In conversational speech, research suggests that pauses between words typically fall into three ranges: short (around 0.20 seconds), medium (around 0.60 seconds), and long (over 1 second).

ASR models need to wait for some silence to detect the end of utterance (EOU) in user audio. Riva ASR uses 800 ms of silence as the default EOU threshold, which means the user transcript might get split into multiple queries if more than 800 ms of silence occurs between words. Setting a higher EOU threshold has the negative impact of increased bot response latency. Depending on the bot use case, you can update the EOU value in speech_config.yaml to adjust how much silence is tolerated for handling longer pauses.

riva_asr:
  RivaASR:
    server: "localhost:50051"
    # final transcript silence threshold
    endpointing_stop_history: 800  # End of User Utterance (EOU)

As humans, we start thinking while the other person is still speaking. To get a human-level user experience, we need to start processing transcripts as soon as they are received and reevaluate the response if more words are added to the transcript. The LLM and RAG sample bots utilize this technique, also referred to as two-pass End of Utterance (EOU). We start processing the user transcript after the first-pass silence (240 ms), stop LLM and TTS processing if user audio is detected after the first pass but before the second-pass EOU (800 ms), and retrigger the pipeline with the new transcript.

riva_asr:
  RivaASR:
    server: "localhost:50051"
    # final transcript silence threshold
    endpointing_stop_history: 800  # second-pass End of User Utterance (EOU)
    endpointing_stop_history_eou: 240  # first-pass End of User Utterance (EOU)

Customizing TTS Pronunciation using IPA#

Providing an IPA pronunciation file enables you to tune the TTS model with custom pronunciations for domain-specific words or words that are not pronounced as expected. The model uses this pronunciation for the specified word while synthesizing audio.

To use IPA mapping in TTS, we need to create a dictionary file containing each word and its IPA pronunciation.

  1. Create an IPA dictionary file ipa.dict in the bot config directory.

    samples/
    └── stock_bot
        ├── ipa.dict
        ├── asr_words_to_boost.txt
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
  2. Add an IPA pronunciation entry for each custom word; the word must be in UPPERCASE letters. For example:

    GPU<SPACE><SPACE>'dʒi'pi'ju
    
  3. Add the path of this IPA file in the TTS component of the speech_config.yaml file.

    riva_tts:
      RivaTTS:
        ipa_dict: "/workspace/config/ipa.dict"
    
  4. Deploy the bot with the above TTS customization.

Using 3rd Party Text-to-Speech (TTS) Solutions#

The ACE Agent pipeline uses NVIDIA Riva TTS as the default option.

For speech bots, you might want to customize the voice of the speech response. You can train your own TTS model, clone a TTS voice, or use any 3rd party provider. In this example, we showcase how to integrate the ElevenLabs text-to-speech APIs.

  1. Add the required dependencies in the NLP Server dockerfile present in the Quick Start resource at deploy/docker/dockerfiles/nlp_server.Dockerfile.

    ##############################
    # Install custom dependencies
    ##############################
    
    RUN pip install elevenlabs==1.4.1
    
  2. Register with ElevenLabs - Generative AI Text to Speech & Voice Cloning and get an API key. Add the ElevenLabs API key to the .env file present in the Quick Start directory with the variable name ELEVENLABS_API_KEY, so it can be passed to the NLP server container. The key is then used to create the ElevenLabs client and generate audio, as in the following snippet.

    import os

    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )

    # input_request is the TTS request object handled by the full client in the next step.
    audio_stream = client.generate(
        text=input_request.text,
        voice="Brian",
        model=input_request.model_name,
        stream=True,
        optimize_streaming_latency=3,
        output_format="pcm_44100"
    )
    
  3. Override the /speech/text_to_speech API endpoint exposed by the NLP server using the @model_api decorator. Refer to the following full code of the NLP server custom client.

    import io
    import os
    import functools
    from elevenlabs.client import ElevenLabs
    from elevenlabs import Voice
    from dataclasses import dataclass
    from fastapi import HTTPException
    from fastapi.responses import StreamingResponse
    
    from nlp_server.decorators import model_api
    
    
    def do_chunking(audio_stream, min_chunk_size=4410):
        buffer = b""
        for chunk in audio_stream:
            buffer += chunk
            if len(buffer) >= min_chunk_size:
                yield buffer
                buffer=b""
        if len(buffer) != 0:
            yield buffer
    
    @dataclass
    class TTSRequest:
        text: str
        voice_name: str
        model_name: str = ""
        model_version: str = ""
        language_code: str = "en-US"
        sample_rate_hz: int = 44100
    
    @model_api(
        endpoint="/speech/text_to_speech",
        model_name=["eleven_monolingual_v1", "eleven_multilingual_v1", "eleven_multilingual_v2", "eleven_turbo_v2", "eleven_turbo_v2_5"],
    )
    async def eleven_tts(input_request: TTSRequest):
        client = ElevenLabs(
            api_key=os.getenv("ELEVENLABS_API_KEY"),
        )
    
        audio_stream = client.generate(
            text=input_request.text,
            voice=input_request.voice_name,
            model=input_request.model_name,
            stream=True,
            optimize_streaming_latency=3,
            output_format="pcm_44100"
        )
    
        return StreamingResponse(do_chunking(audio_stream))
    
  4. To try out the custom TTS, you can use any of the sample bots present in the ./bots directory of the Quick Start resource, or a custom bot created as part of the tutorials. Save the code in a Python file named elevenlabs_tts.py inside the bot directory.

    samples/
    └── stock_bot
        ├── bot_config.yaml
        ├── main.co
        └── elevenlabs_tts.py
    
  5. If you don’t have an existing model_config.yaml, create one.

    samples/
    └── stock_bot
        ├── bot_config.yaml
        ├── main.co
        ├── elevenlabs_tts.py
        └── model_config.yaml
    

    Add a custom TTS client and a Riva ASR model.

    model_servers:
      - name: riva
        speech_models:
          - nvidia/ace/rmir_asr_parakeet_1-1b_en_us_str_vad:2.17.0
        url: localhost:8001
      - name: custom
        nlp_models:
          - elevenlabs_tts.py
    
  6. To integrate the 3rd party TTS with the Chat Controller, we need to add some TTS parameters in speech_config.yaml. If you haven’t already created speech_config.yaml, create it in the sample bot directory.

    samples/
    └── stock_bot
        ├── bot_config.yaml
        ├── main.co
        ├── elevenlabs_tts.py
        ├── model_config.yaml
        └── speech_config.yaml
    
  7. Add the following parameters inside the speech_config.yaml file.

    riva_tts:
      RivaTTS:
        tts_mode: "http"
        voice_name: "Brian"
        server: "http://0.0.0.0:9003/speech/text_to_speech"
        language: "en-US"
        ipa_dict: ""
        sample_rate: 44100
        model_name: "eleven_monolingual_v1"
    
  8. Deploy the bot with the above TTS customization. A sketch for spot-checking the custom TTS endpoint follows the deployment commands.

    Set the OpenAI API key if it is not already set.

    export OPENAI_API_KEY=...
    
    export BOT_PATH="samples/stock_bot"
    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
    docker compose -f deploy/docker/docker-compose.yml up speech-event-bot --build
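
    Once the containers are up, you can spot-check the custom TTS path directly, independent of the full speech pipeline. This is a hedged sketch: it assumes the NLP server is reachable on localhost:9003 and that the /speech/text_to_speech endpoint accepts a JSON body matching the TTSRequest dataclass shown earlier (verify the exact request schema on the NLP server Swagger page).

    import wave

    import requests

    response = requests.post(
        "http://localhost:9003/speech/text_to_speech",
        json={
            "text": "Hello from the custom TTS integration.",
            "voice_name": "Brian",
            "model_name": "eleven_monolingual_v1",
            "sample_rate_hz": 44100,
        },
        stream=True,
    )
    response.raise_for_status()

    # ElevenLabs pcm_44100 output is 16-bit mono PCM at 44.1 kHz, so it can be
    # wrapped in a WAV container for playback.
    with wave.open("tts_output.wav", "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(44100)
        for chunk in response.iter_content(chunk_size=4096):
            wav_file.writeframes(chunk)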