Customizing a Bot

Selecting an LLM Model

ACE Agent allows you to select the LLM models used to control the flow and generate responses. We will modify our Stock bot to use OpenAI. In the bot config file, update the engine of the main model type to openai and add the model of your choice. For reference, also see the Stock Market Bot samples, which use OpenAI and NVIDIA AI endpoints as the engine.

models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
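
The openai engine reads its credentials from the environment. Before deploying the bot, make sure an OpenAI API key is available to the containers, for example by exporting it in your shell or adding it to the .env file in the Quick Start directory (the standard OPENAI_API_KEY variable name is assumed here):

export OPENAI_API_KEY=...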

Using an On-Premise LLM Model Deployed via NIM

In this tutorial, we will showcase how you can utilize an LLM model deployed using NVIDIA NIM for Large Language Models with vLLM or TensorRT-LLM optimizations, and how to update the sample Stock bot to use the locally deployed model.

  1. The stock sample bot uses the mixtral-8x7b-instruct-v0.1 model deployed via the NVIDIA API Catalog as the main model. You can deploy the same model locally by following instructions from Mixtral 8x7b Instruct NIM using four A100 or H100 GPUs. If you are constrained by GPU availability, you can use a single 80 GB A100 or H100 GPU to deploy a smaller model such as Mistral 7b Instruct using NIM.

  2. Update the stock_bot_config.yaml file present in ./samples/stock_bot in the Quick Start resource directory to utilize the locally deployed Mistral 7b Instruct LLM model.

    models:
      - type: main
        engine: nvidia-ai-endpoints
        model: mistral-7b-v2-instruct
        parameters:
          stop: ["\n"]
          max_tokens: 100
          base_url: "http://0.0.0.0:9999/v1"  # Use this to use the NIM model
    
  3. Try the sample Stock bot by deploying it using the steps from the Quick Start Guide. Because we are using the smaller Mistral 7b Instruct model, you might observe lower accuracy compared to the stock sample bots shared as part of the release. You can either deploy the Mixtral 8x7b Instruct NIM or tune the bot to improve performance.

    Note

    If you are deploying LLM models with TensorRT-LLM optimizations using Triton, you might see port conflicts between the NIM deployment and the ACE Agent deployed using Triton or the Riva Skills server. Update the NIM deployment commands so that they do not use host networking (the --network host option) and instead expose the OpenAI port and any other required ports explicitly, as sketched below.
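
    As a rough sketch (the image name, tag, and cache mount below are placeholders, and the container is assumed to expose its OpenAI-compatible API on port 8000 internally), the adjusted deployment command and a quick endpoint check might look like:

      # Publish the container's OpenAI-compatible port on 9999 instead of using host networking.
      docker run -d --rm --gpus all \
          -e NGC_API_KEY \
          -v ~/.cache/nim:/opt/nim/.cache \
          -p 9999:8000 \
          nvcr.io/nim/<org>/<model>:<tag>

      # Sanity-check the endpoint referenced by base_url in stock_bot_config.yaml.
      curl http://0.0.0.0:9999/v1/models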

Custom Actions

ACE Agent bots based on Colang use actions to accomplish tasks such as intent generation, bot message generation, calling the Plugin Server, and so on. ACE Agent allows you to create your own custom actions and to override some of the system actions it defines. In this tutorial, let’s look at how we can achieve this with the Stock bot we created in the Colang tutorial.

Creating a New Custom Action

  1. Create a new file in the bot directory called actions.py. This is a special file that is initialized during bot startup. Any actions defined here are registered in Colang, and can be used in Colang flows.

  2. Create a simple action that will check if the question has any words from a block list. Update actions.py with the following.

    from nemoguardrails.actions.actions import action
    from typing import Dict, Any
    
    BLOCK_LIST = ["stupid", "moron"]
    
    @action(name="block_word_present")
    async def check_block_list(context: Dict[str, Any] = {}):
        question = context.get("last_user_message", "")  # default to an empty string if no user message is present
        if any(word in question for word in BLOCK_LIST):
            return True
        return False
    
  3. Update the flow related to ask stock question to call the newly created block_word_present custom action and make a decision based on the action’s response.

    define flow
      user ask stock question
      $should_block = execute block_word_present()
      if $should_block
        bot "Please do not use blocked words"
      else
        bot responds with answer
    

Asking a question like “Is it stupid to invest in an IPO” will result in the response “Please do not use blocked words”, whereas questions without blocked words will be answered as usual.
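
As a quick standalone sanity check outside the bot runtime, you can call the action directly. This snippet only exercises the Python function defined above and is not part of the Colang flow; it assumes you run it from the bot directory (so actions.py is importable) with nemoguardrails installed:

import asyncio

from actions import check_block_list

# "stupid" is in the block list, so this prints True.
print(asyncio.run(check_block_list({"last_user_message": "Is it stupid to invest in an IPO"})))

# No blocked words, so this prints False.
print(asyncio.run(check_block_list({"last_user_message": "What is the price of NVIDIA stock?"})))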

Similarly, it is possible to create custom actions which can accept arguments from Colang flows.
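For instance, a hypothetical variant of the action above that takes the block list as an argument (the action name and flow snippet here are illustrative, not part of the sample bot) could look like:

from typing import Any, Dict, List

from nemoguardrails.actions.actions import action


@action(name="block_word_present_in")
async def check_block_list_args(words: List[str], context: Dict[str, Any] = {}):
    """Return True if the last user message contains any of the given words."""
    question = context.get("last_user_message", "")
    return any(word in question for word in words)

In a Colang flow, the argument is passed when executing the action, for example $should_block = execute block_word_present_in(words=["stupid", "moron"]).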

Overriding a System Action

ACE Agent has overridden three system actions in Colang - generate_user_intent (responsible for intent generation), generate_next_steps (responsible for predicting the next steps when no flow is defined for the generated intent), and generate_bot_message (responsible for formulating the bot message). These overridden actions serve most common use cases, but you can replace them with your own implementations of these system actions.

To override generate_user_intent, create your own generate_user_intent method in actions.py using the template mentioned below.

from nemoguardrails.actions.actions import ActionResult, action
from nemoguardrails.utils import new_event_dict
from typing import List, Dict, Any, Optional

@action(is_system_action=True)
async def generate_user_intent(
    events: List[dict],
    query: Optional[str] = "",
    model_name: Optional[str] = "",
    endpoint: Optional[str] = "",
    confidence_threshold: Optional[float] = 0.5,
    request_timeout: Optional[int] = 5,
    context: Optional[dict] = None,
    skip_llm: bool = False,
) -> Any:

    """Insert your custom logic for intent generation and context updates (if any)"""
    question = query if query else context.get("last_user_message", "")
    if "politics" in question:
    user_intent = "ask off topic"
    else:
    user_intent = "asks stock price"
    context_slots = {}  # Mapping of variable names to variable values that you wish to create

    return ActionResult(
        events=[new_event_dict("UserIntent", intent=user_intent)],
        context_updates=context_slots,
    )

Add a config in your bot_config.yaml to skip registering ACE Agent’s implementation of generate_user_intent.

bot: enola

configs:
  register_ace_agent_intent_generation: false

Now, any question that contains the word politics will be classified as ask off topic, and all other queries will be classified as asks stock price.

In a similar way, you can:

  • Override generate_next_steps by defining it in actions.py and setting the register_ace_agent_next_step_generation config as false in configs.

  • Override generate_bot_message by defining it in actions.py and setting the register_ace_agent_bot_message_generation config as false in configs.
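
For example, a configs section that disables both registrations (using the flag names listed above) looks like this:

configs:
  register_ace_agent_next_step_generation: false
  register_ace_agent_bot_message_generation: false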

For more information on the purpose of these actions, refer to the Colang Documentation.

Using a Custom NLP Model

In this section, we will focus on how to deploy a custom NLP model. For integrations with a full dialog pipeline, refer to the Building a Bot section.

Deploying a Custom NLP Model with the NLP Server

The NLP server allows you to easily deploy custom NLP models, such as Hugging Face models, and integrate them with the dialog pipeline. It mainly relies on the @model_api and @pytriton decorators.

Custom Model Integration using the @model_api Decorator

Let’s go through an example of how to integrate custom model inference for the /nlp/model/generate_embedding endpoint, which returns embeddings for given queries.

  1. Get familiar with the required request and response schemas for the /nlp/model/generate_embedding endpoint on the NLP server Swagger page of any existing deployment.

  2. Create a Python function that takes input in the same format as the request schema, performs model inference, and returns output in the same format as the response schema.

    For this example, let’s simulate model inference by returning random embeddings for the given queries. Your Python function should look similar to the following:

    import numpy as np
    
    async def random_embeddings(input_request):
        """
        Generate one random 768-dimensional embedding per input query
        """
        return {"queries": input_request.queries,
                "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist()}
    

    The NLP server exposes the @model_api decorator for mapping inference functions to particular API endpoints internally. The server uses a combination of model_name and model_version as unique identifiers during API requests to execute required inference functions.

  3. Add the @model_api decorator to the random embedding generation inference function.

    import numpy as np
    from nlp_server.decorators import model_api
    
    @model_api(endpoint="/nlp/model/generate_embedding", model_name="random_embedding", model_version="1")
    async def random_embeddings(input_request):
        """
        Generate one random 768-dimensional embedding per input query
        """
        return {"queries": input_request.queries,
                "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist()}
    
  4. Start the NLP server with the random embedding inference client integrated.

    aceagent nlp-server deploy --custom_model_dir random_embedding.py
    

    Optionally, you can specify the client module in the model_config.yaml file.

    model_servers:
      - name: custom
        nlp_models:
          - random_embedding.py # Absolute or relative path from model_config.yaml
    

    To start the NLP server with the client, run:

    aceagent nlp-server deploy --config model_config.yaml
    
  5. Verify the changes by querying with model_name as "random_embedding" on the NLP server Swagger for the /nlp/model/generate_embedding endpoint or use the following CURL command:

    curl -X 'POST' \
        'http://0.0.0.0:9003/nlp/model/generate_embedding' \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
        "queries": [
            "Random query"
        ],
        "model_name": "random_embedding"
    }'
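
    Based on the return value of the inference function, the response should resemble the following (the embedding values differ on every call because they are random, and the 768-dimensional vector is truncated here):

    {
      "queries": [
        "Random query"
      ],
      "embeddings": [
        [0.1832, 0.7415, 0.0267, ...]
      ]
    }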
    

Custom Model Integration using the @pytriton Decorator

We can integrate any custom model client using the @model_api decorator by following the steps from the previous section. However, if the custom model client loads a model inside the @model_api function or its Python module, and the NLP server is running with multiple workers, the model will be loaded separately on every worker. Loading the model multiple times results in higher memory usage and degraded performance.

For production use cases where the aim is higher throughput, we recommend offloading GPU-heavy processing to model servers such as the NVIDIA Triton Inference Server and keeping the inference client a lightweight implementation with async support. Model servers such as Triton are designed to manage GPU and CPU resources well and allow batching of parallel requests. The @pytriton decorator lets you offload heavy compute to Triton and avoid loading the model multiple times.

To utilize the @pytriton decorator to deploy the facebook/bart-large-mnli model from Hugging Face, perform the following steps.

  1. Create an example inference client for the facebook/bart-large-mnli model from Hugging Face.

    from transformers import pipeline
    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
    input_queries = ["I love cricket", "Lets watch movie today"]
    classification_result = CLASSIFIER(input_queries, LABELS)
    result_labels = [res["labels"][0] for res in classification_result]
    print(result_labels)
    
  2. Arrange your inference client in the following format. We will be using PyTriton for hosting the model on the NLP server. You might want to get familiar with PyTriton by following the Quick Start steps from the PyTriton GitHub repository.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    
    triton = Triton()
    
    # Model Initialization / Loading Code
    ...
    
    # Creating Inference function
    @batch
    def infer_fn(**inputs: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        ...
        return outputs
    
    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="<custom_model_name>",
        infer_func=infer_fn,
        inputs=[
            ...
        ],
        outputs=[
            ...
        ],
        config=ModelConfig(...)
    )
    
    # Serving model
    triton.serve()
    
  3. Convert the inference client to a PyTriton-compatible format.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    
    triton = Triton()
    
    # Model Initialization / Loading Code
    from transformers import pipeline
    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
    # Creating Inference function
    @batch
    def infer_fn(queries: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
        classification_result = CLASSIFIER(input_queries, LABELS)
        result_labels = [res["labels"][0] for res in classification_result]
        return {"labels": np.char.encode(result_labels, "utf-8")}
    
    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="facebook-bart-large-mnli",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="queries", dtype=bytes, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="labels", dtype=bytes, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=4)
    )
    
    # Serving model
    triton.serve()
    
  4. Test your code by running the following commands:

    1. Install PyTriton locally.

      pip install -U "nvidia-pytriton<0.4.0"
      
    2. Save the code in a Python file and run:

      python custom_model.py
      

      You should see the NVIDIA Triton Inference Server hosted at localhost:8000. You can interact with the model using the following code:

      import numpy as np
      from pytriton.client import ModelClient
      
      with ModelClient("localhost:8000", "facebook-bart-large-mnli") as client:
          result_dict = client.infer_batch(np.char.encode([["Lets watch movie today"]], "utf-8"))
      
      print(result_dict)
      
  5. Integrate the above code into the NLP server for easier and automated deployment via the @pytriton decorator. The @pytriton decorated function is executed only once during startup even for multiple workers, so we don’t load the model multiple times in GPU memory.

    import numpy as np
    from pytriton.triton import Triton
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from nlp_server.decorators import pytriton
    
    # @pytriton decorator
    @pytriton()
    def custom_pytriton_model(triton: Triton):
        # Model Initialization / Loading Code
        from transformers import pipeline
        CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
        LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]
    
        # Creating Inference function
        @batch
        def infer_fn(queries: np.ndarray):
            # Inference for batch of inputs using already loaded model and returning outputs
            input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
            classification_result = CLASSIFIER(input_queries, LABELS)
            result_labels = [res["labels"][0] for res in classification_result]
            return {"labels": np.char.encode(result_labels, "utf-8")}
    
        # Connecting inference callable with Triton Inference Server
        triton.bind(
            model_name="facebook-bart-large-mnli",
            infer_func=infer_fn,
            inputs=[
                Tensor(name="queries", dtype=bytes, shape=(-1,)),
            ],
            outputs=[
                Tensor(name="labels", dtype=bytes, shape=(-1,)),
            ],
            config=ModelConfig(max_batch_size=4)
        )
    
  6. Create the @model_api decorated function to override the NLP server API endpoint of your choice. Using this model, we will override the /nlp/model/text_classification endpoint.

    import numpy as np
    from pytriton.client import ModelClient
    from nlp_server.decorators import model_api
    
    @model_api(endpoint="/nlp/model/text_classification", model_name="facebook-bart-large-mnli", model_type="triton")
    def bart_mnli_model_api(input_request):
        # The NLP server embeds model metadata in the function's model_info attribute;
        # for example, the Triton server URL is available as bart_mnli_model_api.model_info.url.
        with ModelClient(
            url=f"grpc://{bart_mnli_model_api.model_info.url}",
            model_name=bart_mnli_model_api.model_info.model_name,
            model_version=bart_mnli_model_api.model_info.model_version,
        ) as client:
            result_dict = client.infer_batch(np.array([[np.char.encode(input_request.query, "utf-8")]]))
        return {"class_name": result_dict["labels"][0].decode("utf-8"), "score": 1}
    
  7. Save both the @pytriton and @model_api clients in a single Python file and start the NLP server.

    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --custom_model_dir <custom_model.py>
    

    Optionally, specify the client modules in the model_config.yaml file.

    model_servers:
      - name: custom
        nlp_models:
          - <custom_model.py> # Absolute or relative path from model_config.yaml
    

    To start the NLP server with the client, run:

    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --config model_config.yaml
    
  8. Verify the changes by querying with model_name as "facebook-bart-large-mnli" on the NLP server Swagger for the /nlp/model/text_classification endpoint or use the following CURL command:

    curl -X 'POST' \
      'http://localhost:9003/nlp/model/text_classification' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "query": "I love cricket",
      "model_name": "facebook-bart-large-mnli"
    }'
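
    Given the return statement of bart_mnli_model_api above, the response should resemble the following (the predicted label depends on the model, but for this query it is typically sport):

    {
      "class_name": "sport",
      "score": 1
    }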
    

Customizing ASR Recognition with Word Boosting

Word boosting allows you to bias the ASR to recognize particular words of interest or domain-specific words at request time by giving them a higher score when decoding the output of the acoustic model.

For more information about word boosting, refer to the Riva Word Boosting documentation.

  1. For our sample bot, let’s add a speech_config.yaml file.

    my_bots/
    └── stock_bot
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
  2. Create a word boosting file asr_words_to_boost.txt containing the words to boost along with a boosting score.

    my_bots/
    └── stock_bot
        ├── asr_words_to_boost.txt
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
    {
      "comment": "speech_context can have multiple entries. Each entry has single boost value and multiple phrases.",
      "speech_context": [
        {
          "boost": 40,
          "phrases": [
            "Nvidia"
          ]
        }
      ]
    }
    
  3. Add the path of this word boosting file in the ASR component of the speech_config.yaml file.

    riva_asr:
      RivaASR:
        word_boost_file_path: "/workspace/config/asr_words_to_boost.txt"
    
  4. Deploy the bot with the above ASR customization.

Customizing TTS Pronunciation using IPA

Providing an IPA pronunciation file enables you to tune the TTS model with custom pronunciations for domain-specific words or words that are not pronounced as expected. The model uses these pronunciations for the specified words while synthesizing audio.

To use IPA mapping in TTS, we need to create a dictionary file containing each word and its IPA pronunciation.

  1. Create an IPA dictionary file ipa.dict in the bot config.

    my_bots/
    └── stock_bot
        ├── ipa.dict
        ├── asr_words_to_boost.txt
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
  2. Add the IPA pronunciations for any custom words to this file; each word must be in UPPERCASE letters. For example:

    GPU<SPACE><SPACE>'dʒi'pi'ju
    
  3. Add the path of this IPA file in the TTS component of the speech_config.yaml file.

    riva_tts:
      RivaTTS:
        ipa_dict: "/workspace/config/ipa.dict"
    
  4. Deploy the bot with the above TTS customization.

Using 3rd Party Text-to-Speech (TTS) Solutions

For speech bots, you might want to customize the voice of the speech response. You can train your own TTS model, clone a TTS voice, or use any 3rd party provider. In this example, we will showcase how you can integrate the ElevenLabs text-to-speech APIs. By default, we use NVIDIA Riva TTS models.

  1. Add the required dependencies in the NLP Server dockerfile present in the Quick Start resource at deploy/docker/dockerfiles/nlp_server.Dockerfile.

    ##############################
    # Install custom dependencies
    ##############################
    
    RUN apt-get -y install ffmpeg
    RUN pip install pydub elevenlabs==0.2.24
    
  2. Register with ElevenLabs - Generative AI Text to Speech & Voice Cloning and get an API key. Add the ElevenLabs API key to the .env file present in the Quick Start directory with the variable name ELEVENLABS_API_KEY, so that it can be passed to the NLP server container. The ElevenLabs Python SDK can then be used to generate audio:

    from elevenlabs import generate
    audio_stream = generate(
            text=input_request.text,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice=input_request.voice_name,
            model=input_request.model_name,
            stream=True,
            stream_chunk_size=None,
        )
    
  3. Convert the ElevenLabs MP3 audio into the raw PCM audio format expected by the Chat Controller using pydub. Ideally, we should convert chunk by chunk to support streaming, but for this simple example we convert the full audio stream and return it as a single raw PCM output.

    import io
    import functools
    from pydub import AudioSegment
    
    audio = AudioSegment.from_mp3(io.BytesIO(functools.reduce(lambda a, b: a + b, audio_stream)))
    raw_pcm_audio = audio.raw_data
    
  4. Override the /speech/text_to_speech API endpoint exposed by the NLP server using the @model_api decorator. Refer to the full code of the NLP server custom client below.

    import io
    import os
    import functools
    from pydub import AudioSegment
    from elevenlabs import generate
    from dataclasses import dataclass
    from fastapi import HTTPException
    from fastapi.responses import StreamingResponse
    
    from nlp_server.decorators import model_api
    
    @dataclass
    class TTSRequest:
        text: str
        voice_name: str
        model_name: str = ""
        model_version: str = ""
        language_code: str = "en-US"
        sample_rate_hz: int = 44100
    
    def get_raw_pcm(audio_stream, sample_rate_hz):
        ## TODO Implement chunk wise pcm conversion for streaming TTS
        try:
            audio = AudioSegment.from_mp3(io.BytesIO(functools.reduce(lambda a, b: a + b, audio_stream)))
        except Exception:
            raise ValueError("Unable to convert MP3 data to raw PCM")
        if audio.frame_rate != sample_rate_hz:
            raise HTTPException(
                status_code=422,
                detail=f"Expected sample rate hz {sample_rate_hz}, supported sample rate is {audio.frame_rate}",
            )
        yield audio.raw_data
    
    @model_api(
        endpoint="/speech/text_to_speech",
        model_name=["eleven_monolingual_v1", "eleven_multilingual_v1", "eleven_multilingual_v2"],
    )
    async def eleven_tts(input_request: TTSRequest):
    
        if not os.getenv("ELEVENLABS_API_KEY", None):
            raise HTTPException(status_code=500, detail="Elevenlabs API key not provided during NLP Server Startup")
    
        audio_stream = generate(
            text=input_request.text,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice=input_request.voice_name,
            model=input_request.model_name,
            stream=True,
            stream_chunk_size=None,
        )
        return StreamingResponse(get_raw_pcm(audio_stream, input_request.sample_rate_hz))
    
  5. To try out the custom TTS, you can use any of the sample bots present in the ./bots directory in the Quick Start resource or the custom bots created as part of the tutorials. Save the code in a Python file named elevenlabs_tts.py inside the bot directory.

    my_bots/
    └── stock_bot
        ├── bot_config.yaml
        ├── flows.co
        └── elevenlabs_tts.py
    
  6. If you don’t have an existing model_config.yaml, create one in the bot directory.

    my_bots/
    └── stock_bot
        ├── bot_config.yaml
        ├── flows.co
        ├── elevenlabs_tts.py
        └── model_config.yaml
    

    Add a custom TTS client and a Riva ASR model.

    model_servers:
      - name: riva
        speech_models:
          - nvidia/ucs-ms/rmir_asr_parakeet_1-1b_en_us_str_vad:2.15.0
        url: localhost:8001
      - name: custom
        nlp_models:
          - elevenlabs_tts.py
    
  7. To integrate the 3rd party TTS with the Chat Controller, we need to add some TTS parameters in speech_config.yaml. If you haven’t already created speech_config.yaml, create it in the sample bot directory.

    my_bots/
    └── stock_bot
        ├── bot_config.yaml
        ├── flows.co
        └── speech_config.yaml
    
  8. Add the following parameters inside the speech_config.yaml file.

    riva_tts:
      RivaTTS:
        tts_mode: "http"
        voice_name: "Bella"
        server: "http://0.0.0.0:9003/speech/text_to_speech"
        language: "en-US"
        ipa_dict: ""
        sample_rate: 44100
        model_name: "eleven_monolingual_v1"
    
  9. Deploy the bot with the above TTS customization.

    Set the NVIDIA API key if it is not already set.

    export NVIDIA_API_KEY=...
    
    export BOT_PATH="my_bots/stock_bot"
    source deploy/docker/docker_init.sh
    docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
    docker compose -f deploy/docker/docker-compose.yml up speech-bot
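
Once the NLP server is up, you can optionally exercise the custom TTS endpoint directly. The request fields below follow the TTSRequest dataclass from the custom client; the exact payload accepted by your deployment may differ, so treat this as a sketch that writes the raw PCM response to a file:

curl -X 'POST' \
  'http://localhost:9003/speech/text_to_speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Hello, welcome to the Stock bot",
  "voice_name": "Bella",
  "model_name": "eleven_monolingual_v1",
  "sample_rate_hz": 44100
}' --output sample_audio.raw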