Customizing a Bot#
Selecting an LLM Model#
ACE Agent allows you to select the LLM models used to control the flow and generate responses. We will modify our Stock bot to use `openai`. In the bot config file, you can update the engine of the `main` model type to `openai` and add the model of your choice. Additionally, refer to the Stock Market Bot samples, which use OpenAI and NVIDIA AI endpoints as the engine, respectively.
```yaml
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
```
Using an On-Premise LLM Model Deployed via NIM#
In this tutorial, we will showcase how you can utilize an LLM model deployed using NVIDIA NIM for Large Language Models, with vLLM or TensorRT-LLM optimizations, and update the sample Stock bot to use the locally deployed LLM model.
The stock sample bot uses the `mixtral-8x7b-instruct-v0.1` model deployed via the NVIDIA API Catalog as the main model. You can deploy the same model locally by following the instructions from Mixtral 8x7b Instruct NIM using four A100 or H100 GPUs. If you are constrained by GPUs, you can use a single 80 GB A100 or H100 to deploy smaller models such as Mistral 7b Instruct using NIM.
Update the `stock_bot_config.yaml` file present in `./samples/stock_bot` in the Quick Start resource directory to utilize the locally deployed Mistral 7b Instruct LLM model.
```yaml
models:
  - type: main
    engine: nvidia-ai-endpoints
    model: mistral-7b-v2-instruct
    parameters:
      stop: ["\n"]
      max_tokens: 100
      base_url: "http://0.0.0.0:9999/v1" # Use this to use NIM model
```
Try the sample Stock bot by deploying it using the steps from the Quick Start Guide. As we are using a smaller Mistral 7b instruct model, you might observe lower accuracy compared to the stock sample bots shared as part of the release. You can either deploy Mixtral 8x7b Instruct NIM or try tuning the bot to improve the performance.
Note
If you are deploying LLM models with TensorRT-LLM optimizations using Triton, you might see port conflicts during NIM deployment with ACE Agent deployed using Triton or with the Riva Skills server. Update the NIM deployment commands to not use the `--host` network and to expose the OpenAI port and other ports as needed.
Custom Actions#
ACE Agent bots that are based on Colang use actions to accomplish tasks such as intent generation, bot message generation, calling the Plugin Server, and so on. ACE Agent allows you to create your own custom actions and to override some of the actions defined by ACE Agent. In this tutorial, let’s look at how we can achieve this with the Stock bot we created in the Colang tutorial.
Creating a New Custom Action#
Create a new file in the bot directory called `actions.py`. This is a special file that is initialized during bot startup. Any actions defined here are registered in Colang and can be used in Colang flows.
Create a simple action that will check if the question has any words from a block list. Update `actions.py` with the following:
```python
from nemoguardrails.actions.actions import action
from typing import Dict, Any

BLOCK_LIST = ["stupid", "moron"]


@action(name="block_word_present")
async def check_block_list(context: Dict[str, Any] = {}):
    # Default to an empty string so the check does not fail if no user message is present
    question = context.get("last_user_message", "")
    if any(word in question for word in BLOCK_LIST):
        return True
    return False
```
Update the flow related to `ask stock question` to call the newly created `block_word_present` custom action and make a decision based on the action’s response:
```colang
define flow
  user ask stock question
  $should_block = execute block_word_present()

  if $should_block
    bot "Please do not use blocked words"
  else
    bot responds with answer
```
Asking a question like "Is it stupid to invest in an IPO?" will result in the fallback response "Please do not use blocked words", whereas questions without blocked words will be accepted.
Similarly, it is possible to create custom actions that accept arguments from Colang flows, as sketched below.
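For example, here is a minimal sketch of an action that accepts an argument; the action name `is_short_question` and the `max_words` parameter are hypothetical and used only for illustration:
```python
from typing import Any, Dict

from nemoguardrails.actions.actions import action


@action(name="is_short_question")
async def is_short_question(max_words: int = 5, context: Dict[str, Any] = {}):
    """Return True if the last user message has at most `max_words` words (illustrative only)."""
    question = context.get("last_user_message", "")
    return len(question.split()) <= max_words
```
In a Colang flow, such an action could then be called with an argument, for example `$is_short = execute is_short_question(max_words=3)`.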
Overriding a System Action#
ACE Agent has overridden three system actions in Colang: `generate_user_intent` (responsible for intent generation), `generate_next_steps` (responsible for predicting the next steps when a flow is not defined for the generated intent), and `generate_bot_message` (responsible for formulating the bot message). These overridden actions can serve most common use cases, but it is possible to replace them with your own implementations of these system actions.
To override `generate_user_intent`, create your own `generate_user_intent` method in `actions.py` using the template mentioned below.
```python
from nemoguardrails.actions.actions import ActionResult, action
from nemoguardrails.utils import new_event_dict
from typing import List, Dict, Any, Optional


@action(is_system_action=True)
async def generate_user_intent(
    events: List[dict],
    query: Optional[str] = "",
    model_name: Optional[str] = "",
    endpoint: Optional[str] = "",
    confidence_threshold: Optional[float] = 0.5,
    request_timeout: Optional[int] = 5,
    context: Optional[dict] = None,
    skip_llm: bool = False,
) -> Any:
    """Insert your custom logic for intent generation and context updates (if any)"""
    question = query if query else context.get("last_user_message", "")
    if "politics" in question:
        user_intent = "ask off topic"
    else:
        user_intent = "asks stock price"

    context_slots = {}  # Mapping of variable names to variable values that you wish to create
    return ActionResult(
        events=[new_event_dict("UserIntent", intent=user_intent)],
        context_updates=context_slots,
    )
```
Add a config in your `bot_config.yaml` to skip registering ACE Agent’s implementation of `generate_user_intent`.
```yaml
bot: enola
configs:
  register_ace_agent_intent_generation: false
```
Now, any question that contains the word politics will be classified as `ask off topic`, and all other queries will be classified as `asks stock price`.
In a similar way, as sketched below, you can:
- Override `generate_next_steps` by defining it in `actions.py` and setting the `register_ace_agent_next_step_generation` config to `false` in `configs`.
- Override `generate_bot_message` by defining it in `actions.py` and setting the `register_ace_agent_bot_message_generation` config to `false` in `configs`.
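A minimal, illustrative skeleton for these overrides in `actions.py` is shown below; the parameter lists are assumptions for illustration only (mirror the `generate_user_intent` template above and the actual signatures expected in your deployment):
```python
from typing import Any, List, Optional

from nemoguardrails.actions.actions import action


@action(is_system_action=True)
async def generate_next_steps(events: List[dict], context: Optional[dict] = None) -> Any:
    # Custom next-step prediction logic goes here (placeholder body).
    ...


@action(is_system_action=True)
async def generate_bot_message(events: List[dict], context: Optional[dict] = None) -> Any:
    # Custom bot message formulation logic goes here (placeholder body).
    ...
```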
For more information on the purpose of these actions, refer to the Colang Documentation.
Using a Custom NLP Model#
In this section, we will focus on how to deploy a custom NLP model. For integrations with a full dialog pipeline, refer to the Building a Bot section.
Deploying a Custom NLP Model with the NLP Server#
The NLP server allows you to easily deploy custom NLP models, such as Hugging Face models, and integrate them with the dialog pipeline. It mainly relies on the `@model_api` and `@pytriton` decorators.
Custom Model Integration using the `@model_api` Decorator#
Let’s go through an example of how we can integrate custom model inference for the `/nlp/model/generate_embedding` endpoint, which returns embeddings for the given queries.
Get familiar with the required request and response schemas for the `/nlp/model/generate_embedding` endpoint on the NLP server Swagger for any existing deployment.
Create a Python function that takes input in the same format as the request schema, runs model inference, and returns output in the same format as the response schema.
For this example, let’s simulate model inference by returning random embeddings for the given queries. Your Python function should look similar to the following:
```python
import numpy as np


async def random_embeddings(input_request):
    """Generate Random Embedding"""
    return {
        "queries": input_request.queries,
        # One random 768-dimensional embedding per input query
        "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist(),
    }
```
The NLP server exposes the `@model_api` decorator for internally mapping inference functions to particular API endpoints. The server uses the combination of `model_name` and `model_version` as a unique identifier during API requests to execute the required inference function.
Add the `@model_api` decorator to the random embedding generation inference function.
```python
import numpy as np

from nlp_server.decorators import model_api


@model_api(endpoint="/nlp/model/generate_embedding", model_name="random_embedding", model_version="1")
async def random_embeddings(input_request):
    """Generate Random Embedding"""
    return {
        "queries": input_request.queries,
        # One random 768-dimensional embedding per input query
        "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist(),
    }
```
Start the NLP server with the random embedding inference client integrated.
```bash
aceagent nlp-server deploy --custom_model_dir random_embedding.py
```
Optionally, you can specify the client module in the `model_config.yaml` file.
```yaml
model_servers:
  - name: custom
    nlp_models:
      - random_embedding.py # Absolute or relative path from model_config.yaml
```
To start the NLP server with the client, run:
```bash
aceagent nlp-server deploy --config model_config.yaml
```
Verify the changes by querying with `model_name` as `"random_embedding"` on the NLP server Swagger for the `/nlp/model/generate_embedding` endpoint, or use the following curl command:
```bash
curl -X 'POST' \
  'http://0.0.0.0:9003/nlp/model/generate_embedding' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "queries": [
    "Random query"
  ],
  "model_name": "random_embedding"
}'
```
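Alternatively, here is a rough Python equivalent of the curl command above using the `requests` library (an assumption for illustration; it expects the response body to mirror the dictionary returned by `random_embeddings`):
```python
import requests

# Query the custom embedding endpoint exposed by the NLP server.
response = requests.post(
    "http://0.0.0.0:9003/nlp/model/generate_embedding",
    json={"queries": ["Random query"], "model_name": "random_embedding"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["embeddings"])
```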
Custom Model Integration using the `@pytriton` Decorator#
We can integrate any custom model client using the `@model_api` decorator by following the steps from the previous section. However, if the custom model client loads a model as part of the `@model_api` function or Python module, and the NLP server is running with multiple workers, then the model will be loaded separately on every worker. Loading the model multiple times results in higher memory usage and degraded performance.
For production use cases, where the aim is to get higher throughput, we recommend offloading GPU and other heavy processing code to a model server such as the NVIDIA Triton Inference Server, and keeping the inference client a lightweight implementation with async support. Model servers such as the NVIDIA Triton Inference Server are designed to handle GPU and CPU resources better and allow batching of parallel requests. The `@pytriton` decorator allows you to offload heavy compute to Triton and avoid loading the models multiple times.
To utilize the `@pytriton` decorator to deploy the `facebook/bart-large-mnli` model from Hugging Face, perform the following steps.
Create an example inference client for the `facebook/bart-large-mnli` model from Hugging Face.
```python
from transformers import pipeline

CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

input_queries = ["I love cricket", "Lets watch movie today"]

classification_result = CLASSIFIER(input_queries, LABELS)
result_labels = [res["labels"][0] for res in classification_result]
print(result_labels)
```
Arrange your inference client in the following format. We will be using PyTriton for hosting the model on the NLP server. You might want to get familiar with PyTriton by following the Quick Start steps from the PyTriton GitHub repository.
```python
from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
...

# Creating Inference function
@batch
def infer_fn(**inputs: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    ...
    return outputs

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="<custom_model_name>",
    infer_func=infer_fn,
    inputs=[
        ...
    ],
    outputs=[
        ...
    ],
    config=ModelConfig(...)
)

# Serving model
triton.serve()
```
Convert our inference client into the PyTriton-compatible format.
```python
import numpy as np

from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
from transformers import pipeline

CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

# Creating Inference function
@batch
def infer_fn(queries: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    input_queries = np.char.decode(queries.astype("bytes"), "utf-8").tolist()
    classification_result = CLASSIFIER(input_queries, LABELS)
    result_labels = [res["labels"][0] for res in classification_result]
    return {"labels": np.char.encode(result_labels, "utf-8")}

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="facebook-bart-large-mnli",
    infer_func=infer_fn,
    inputs=[
        Tensor(name="queries", dtype=bytes, shape=(-1,)),
    ],
    outputs=[
        Tensor(name="labels", dtype=bytes, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=4)
)

# Serving model
triton.serve()
```
Test your code by running the following commands:
Install PyTriton locally.
```bash
pip install -U "nvidia-pytriton<0.4.0"
```
Save the code in a Python file and run:
```bash
python custom_model.py
```
You should see the NVIDIA Triton Inference Server hosted at `localhost:8000`. You can interact with the model using the following code:
```python
import numpy as np
from pytriton.client import ModelClient

with ModelClient("localhost:8000", "facebook-bart-large-mnli") as client:
    result_dict = client.infer_batch(np.char.encode([["Lets watch movie today"]], "utf-8"))
    print(result_dict)
```
Integrate the above code into the NLP server for easier and automated deployment via the `@pytriton` decorator. The `@pytriton` decorated function is executed only once during startup, even for multiple workers, so we don’t load the model multiple times in GPU memory.
```python
import numpy as np

from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

from nlp_server.decorators import pytriton


# @pytriton decorator
@pytriton()
def custom_pytriton_model(triton: Triton):
    # Model Initialization / Loading Code
    from transformers import pipeline

    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

    # Creating Inference function
    @batch
    def infer_fn(queries: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        input_queries = np.char.decode(queries.astype("bytes"), "utf-8").tolist()
        classification_result = CLASSIFIER(input_queries, LABELS)
        result_labels = [res["labels"][0] for res in classification_result]
        return {"labels": np.char.encode(result_labels, "utf-8")}

    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="facebook-bart-large-mnli",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="queries", dtype=bytes, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="labels", dtype=bytes, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=4)
    )
```
Create the `@model_api` decorated function to override the NLP server API endpoint of your choice. Using this model, we will override the `/nlp/model/text_classification` endpoint.
```python
import numpy as np
from pytriton.client import ModelClient

from nlp_server.decorators import model_api


@model_api(endpoint="/nlp/model/text_classification", model_name="facebook-bart-large-mnli", model_type="triton")
def bart_mnli_model_api(input_request):
    # The NLP server embeds model metadata in the function meta as the model_info attribute;
    # you can access the Triton server URL using bart_mnli_model_api.model_info.url
    with ModelClient(
        url=f"grpc://{bart_mnli_model_api.model_info.url}",
        model_name=bart_mnli_model_api.model_info.model_name,
        model_version=bart_mnli_model_api.model_info.model_version,
    ) as client:
        result_dict = client.infer_batch(np.array([[np.char.encode(input_request.query, "utf-8")]]))

    return {"class_name": result_dict["labels"][0].decode("utf-8"), "score": 1}
```
Save both the `@pytriton` and `@model_api` clients in a single Python file and start the NLP server.
```bash
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --custom_model_dir <custom_model.py>
```
Optionally, specify the client modules in the `model_config.yaml` file.
```yaml
model_servers:
  - name: custom
    nlp_models:
      - <custom_model.py> # Absolute or relative path from model_config.yaml
```
To start the NLP server with the client, run:
```bash
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --config model_config.yaml
```
Verify the changes by querying with `model_name` as `"facebook-bart-large-mnli"` on the NLP server Swagger for the `/nlp/model/text_classification` endpoint, or use the following curl command:
```bash
curl -X 'POST' \
  'http://localhost:9003/nlp/model/text_classification' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "I love cricket",
  "model_name": "facebook-bart-large-mnli"
}'
```
Customizing ASR Recognition with Word Boosting#
Word boosting allows you to bias the ASR engine to recognize particular words of interest or domain-specific words at request time by giving them a higher score when decoding the output of the acoustic model.
For more information about word boosting, refer to the Riva Word Boosting documentation.
For our sample bot, let’s add a `speech_config.yaml` file.
```
my_bots/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
```
Create a word boosting file `asr_words_to_boost.txt` containing the words to boost along with the boosting score.
```
my_bots/
└── stock_bot
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
```
```
{
    "comment": "speech_context can have multiple entries. Each entry has single boost value and multiple phrases.",
    "speech_context": [
        {
            "boost": 40,
            "phrases": [
                "Nvidia"
            ]
        }
    ]
}
```
Add the path of this word boosting file in the ASR component of the `speech_config.yaml` file.
```yaml
riva_asr:
  RivaASR:
    word_boost_file_path: "/workspace/config/asr_words_to_boost.txt"
```
Deploy the bot with the above ASR customization.
Customizing TTS Pronunciation using IPA#
Providing an IPA pronunciation file enables you to tune the TTS model with custom pronunciations for domain-specific words or words that are not pronounced as expected. The model uses these pronunciations for the specified words when synthesizing audio.
To use IPA mapping in TTS, we need to create a dictionary file containing the words and their IPA pronunciations.
Create an IPA dictionary file `ipa.dict` in the bot config directory.
```
my_bots/
└── stock_bot
    ├── ipa.dict
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
```
Add the following example IPA pronunciation. Any custom words can be added here (the word must be in UPPERCASE letters). For example:
```
GPU<SPACE><SPACE>'dʒi'pi'ju
```
Add the path of this IPA file in the TTS component of the `speech_config.yaml` file.
```yaml
riva_tts:
  RivaTTS:
    ipa_dict: "/workspace/config/ipa.dict"
```
Deploy the bot with the above TTS customization.
Using 3rd Party Text-to-Speech (TTS) Solutions#
For speech bots, you might want to customize the voice used for the speech response. You can train your own TTS model, clone a TTS voice, or use any 3rd party provider. By default, we use NVIDIA Riva TTS models. In this example, we will showcase how you can integrate the ElevenLabs text-to-speech APIs.
Add the required dependencies in the NLP server Dockerfile present in the Quick Start resource at `deploy/docker/dockerfiles/nlp_server.Dockerfile`.
```dockerfile
##############################
# Install custom dependencies
##############################
RUN apt-get -y install ffmpeg
RUN pip install pydub elevenlabs==0.2.24
```
Register with ElevenLabs - Generative AI Text to Speech & Voice Cloning and get an API key. Add the ElevenLabs API key in the `.env` file present in the Quick Start directory with the variable name `ELEVENLABS_API_KEY`, so it can be passed to the NLP server container.
```python
from elevenlabs import generate

audio_stream = generate(
    text=input_request.text,
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice=input_request.voice_name,
    model=input_request.model_name,
    stream=True,
    stream_chunk_size=None,
)
```
Convert the MP3 audio provided by ElevenLabs into the raw PCM audio format expected by the Chat Controller using `pydub`. Ideally, we should convert chunk by chunk to support streaming, but for a simple example we will convert the full audio stream and return it as a single raw PCM output.
```python
import io
import functools

from pydub import AudioSegment

audio = AudioSegment.from_mp3(io.BytesIO(functools.reduce(lambda a, b: a + b, audio_stream)))
raw_pcm_audio = audio.raw_data
```
Override the API endpoint using the `@model_api` decorator; the NLP server exposes the `/speech/text_to_speech` API endpoint for TTS. The full code of the NLP server custom client is shown below.
```python
import io
import os
import functools

from pydub import AudioSegment
from elevenlabs import generate
from dataclasses import dataclass
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

from nlp_server.decorators import model_api


@dataclass
class TTSRequest:
    text: str
    voice_name: str
    model_name: str = ""
    model_version: str = ""
    language_code: str = "en-US"
    sample_rate_hz: int = 44100


def get_raw_pcm(audio_stream, sample_rate_hz):
    ## TODO Implement chunk wise pcm conversion for streaming TTS
    try:
        audio = AudioSegment.from_mp3(io.BytesIO(functools.reduce(lambda a, b: a + b, audio_stream)))
    except:
        raise ValueError("Unable to convert MP3 data to raw PCM")

    if audio.frame_rate != sample_rate_hz:
        raise HTTPException(
            status_code=422,
            detail=f"Expected sample rate hz {sample_rate_hz}, supported sample rate is {audio.frame_rate}",
        )

    yield audio.raw_data


@model_api(
    endpoint="/speech/text_to_speech",
    model_name=["eleven_monolingual_v1", "eleven_multilingual_v1", "eleven_multilingual_v2"],
)
async def eleven_tts(input_request: TTSRequest):
    if not os.getenv("ELEVENLABS_API_KEY", None):
        raise HTTPException(status_code=500, detail="Elevenlabs API key not provided during NLP Server Startup")

    audio_stream = generate(
        text=input_request.text,
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice=input_request.voice_name,
        model=input_request.model_name,
        stream=True,
        stream_chunk_size=None,
    )

    return StreamingResponse(get_raw_pcm(audio_stream, input_request.sample_rate_hz))
```
To try out the custom TTS, you can use any of the sample bots present in the `./bots` directory in the Quick Start resource, or the custom bots created as part of the tutorials. Save the code in a Python file with the name `elevenlabs_tts.py` inside the `./bots` directory.
```
my_bots/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    └── elevenlabs_tts.py
```
If you don’t have an existing `model_config.yaml`, then create one.
```
my_bots/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    ├── elevenlabs_tts.py
    └── model_config.yaml
```
Add a custom TTS client and a Riva ASR model.
```yaml
model_servers:
  - name: riva
    speech_models:
      - nvidia/ucs-ms/rmir_asr_parakeet_1-1b_en_us_str_vad:2.15.0
    url: localhost:8001
  - name: custom
    nlp_models:
      - elevenlabs_tts.py
```
For integrating the 3rd party TTS with the Chat Controller, we need to add some TTS parameters in `speech_config.yaml`. If you haven’t already created `speech_config.yaml`, then create it in the sample bot directory.
```
my_bots/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
```
Add the following parameters inside the `speech_config.yaml` file.
```yaml
riva_tts:
  RivaTTS:
    tts_mode: "http"
    voice_name: "Bella"
    server: "http://0.0.0.0:9003/speech/text_to_speech"
    language: "en-US"
    ipa_dict: ""
    sample_rate: 44100
    model_name: "eleven_monolingual_v1"
```
Deploy the bot with the above TTS customization.
Set the NVIDIA API key if it is not already set.
```bash
export NVIDIA_API_KEY=...
export BOT_PATH="my_bots/stock_bot"
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
docker compose -f deploy/docker/docker-compose.yml up speech-bot
```