Customizing a Bot#
Selecting an LLM Model#
ACE Agent allows you to select LLM models to control the flow and generate responses. You can use models from OpenAI, NVIDIA-hosted NIM endpoints, or locally deployed NIM models in the bot configs. Check the LLM Model Configuration section for the full list of supported models.
In the bot config file, you can update the engine of the main model type to your model provider and add the model of your choice.
models:
  - type: main
    engine: openai
    model: gpt-4-turbo
The following models have been tested with the Colang 2.0-beta version.
OpenAI models:
gpt-3.5-turbo-instruct
gpt-3.5-turbo
gpt-4-turbo
gpt-4o
gpt-4o-mini
NIM models:
meta/llama3-8b-instruct
meta/llama3-70b-instruct
meta/llama-3.1-8b-instruct
meta/llama-3.1-70b-instruct
Using an On-Premise LLM Model Deployed via NIM#
In this tutorial, we showcase how to use an LLM deployed with NVIDIA NIM for Large Language Models (with vLLM or TensorRT-LLM optimizations) and how to update the sample Stock bot to use the locally deployed model.
You can deploy the meta/llama3-8b-instruct model locally by following the instructions from Llama3 8B Instruct NIM using an A100 or H100 GPU.

Update the stock_bot_config.yaml file present in ./samples/stock_bot in the Quick Start resource directory to use the locally deployed meta/llama3-8b-instruct LLM model.

models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
    parameters:
      stop: ["\n"]
      max_tokens: 100
      base_url: "http://0.0.0.0:8000/v1" # Use this to use NIM model
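Before deploying the bot, you can optionally confirm that the local NIM endpoint is serving. Below is a minimal sketch using the openai Python client; NIM for LLMs exposes an OpenAI-compatible API on port 8000 by default, and the prompt and placeholder API key here are illustrative, not part of the sample bot.

from openai import OpenAI

# The base_url matches the locally deployed NIM from the config above.
# A local NIM deployment typically does not validate the API key, so a placeholder string is used.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Reply with a short greeting."}],
    max_tokens=32,
)
print(completion.choices[0].message.content)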
Try the sample Stock bot by deploying it using the steps from the Quick Start Guide. You can explore other LLM NIMs for deployment at build.nvidia.com.
Note
If you are deploying LLM models with TensorRT-LLM optimizations using Triton, you might see port conflicts between the NIM deployment and the ACE Agent deployment using Triton or the Riva Skills server. Update the NIM deployment commands so they do not use the host network, and instead expose the OpenAI port and any other required ports explicitly.
Creating a New Custom Action#
ACE Agent bots based on Colang use actions to accomplish tasks such as intent generation, bot message generation, calling the Plugin server, and so on. ACE Agent allows you to create your own custom actions and to override some of the actions defined by ACE Agent.
Create a new file in the bot directory called actions.py. This is a special file that is initialized during bot startup. Any actions defined here are registered in Colang and can be used in Colang flows.

Create a simple action that will check if the question has any words from a block list. Update actions.py with the following.

from nemoguardrails.actions.actions import action
from typing import Dict, Any

BLOCK_LIST = ["stupid", "moron"]

@action(name="isBlockWordPresentAction")
async def check_block_list(context: Dict[str, Any] = {}):
    question = context.get("last_user_transcript")
    if any(word in question for word in BLOCK_LIST):
        return True
    return False
Update the flow related to user queried about stocks in the sample Stock bot to call the newly created isBlockWordPresentAction custom action and make a decision based on the action's response.

flow stock faq
  global $last_user_transcript
  user queried about stocks
  $should_block = await isBlockWordPresentAction()
  if $should_block
    bot say "Please do not use blocked words"
  else
    $retrieval_results = await RetrieveRelevantChunksAction()
    $response = ..."{$last_user_transcript}. You can take context from following section: {$retrieval_results}. Enclose the response in quotes."
    bot say "{$response}"
Asking a question like "Is it stupid to invest in an IPO?" will result in the fallback response "Please do not use blocked words", whereas questions without blocked words will be accepted.
Similarly, it is possible to create custom actions which can accept arguments from Colang flows.
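For example, here is a minimal sketch of an action that accepts arguments; the action name, parameters, and word list below are illustrative and not part of the sample bot.

from typing import List, Optional

from nemoguardrails.actions.actions import action


@action(name="ContainsAnyWordAction")
async def contains_any_word(text: str = "", words: Optional[List[str]] = None):
    # Return True if any of the supplied words appears in the given text.
    return any(word in text for word in (words or []))

A Colang flow could then pass keyword arguments when awaiting the action, for example $found = await ContainsAnyWordAction(text=$last_user_transcript, words=["ipo", "dividend"]).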
Using a Custom NLP Model#
In this section, we will focus on how to deploy a custom NLP model.
Deploying a Custom NLP Model with the NLP Server#
The NLP server allows you to easily deploy custom NLP models, such as Hugging Face models, and integrate them with the dialog pipeline. It mainly relies on the @model_api and @pytriton decorators.
Custom Model Integration using the @model_api Decorator#
Let’s go through an example of integrating custom model inference for the /nlp/model/generate_embedding endpoint, which returns embeddings for the given queries.
Get familiar with the required request and response schema for the /nlp/model/generate_embedding endpoint on the NLP server Swagger for any existing deployment.

Create a Python function which takes input in the same format as the request schema, runs model inference, and returns output in the same format as the response schema.

For this example, let's simulate model inference by returning random embeddings for the given queries. Your Python function should look similar to the following:

import numpy as np

async def random_embeddings(input_request):
    """Generate random embeddings, one 768-dimensional vector per query."""
    return {
        "queries": input_request.queries,
        "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist(),
    }
The NLP server exposes the @model_api decorator for internally mapping inference functions to particular API endpoints. The server uses the combination of model_name and model_version as a unique identifier during API requests to execute the required inference function.

Add the @model_api decorator to the random embedding generation inference function.

import numpy as np

from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/generate_embedding", model_name="random_embedding", model_version="1")
async def random_embeddings(input_request):
    """Generate random embeddings, one 768-dimensional vector per query."""
    return {
        "queries": input_request.queries,
        "embeddings": np.random.uniform(low=0, high=1, size=(len(input_request.queries), 768)).tolist(),
    }
Start the NLP server with the random embedding inference client integrated.
aceagent nlp-server deploy --custom_model_dir random_embedding.py
Optionally, you can specify the client module in the model_config.yaml file.

model_servers:
  - name: custom
    nlp_models:
      - random_embedding.py # Absolute or relative path from model_config.yaml
To start the NLP server with the client, run:
aceagent nlp-server deploy --config model_config.yaml
Verify the changes by querying with model_name as "random_embedding" on the NLP server Swagger for the /nlp/model/generate_embedding endpoint, or use the following curl command:

curl -X 'POST' \
  'http://0.0.0.0:9003/nlp/model/generate_embedding' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "queries": [
    "Random query"
  ],
  "model_name": "random_embedding"
}'
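If you prefer to script this check, here is a small sketch using the requests library; it assumes the default NLP server port 9003 and the response format returned by the inference function above.

import requests

# Same payload as the curl command above.
payload = {"queries": ["Random query"], "model_name": "random_embedding"}
response = requests.post("http://0.0.0.0:9003/nlp/model/generate_embedding", json=payload)
response.raise_for_status()

result = response.json()
# The inference function returns the original queries plus one embedding per query.
print(len(result["embeddings"]), "embedding(s) returned")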
Custom Model Integration using the @pytriton Decorator#
We can integrate any custom model client using the @model_api decorator by following the steps from the previous section. However, if the custom model client loads a model as part of the @model_api function or Python module, and the NLP server is running with multiple workers, the model will be loaded separately on every worker. Loading the model multiple times results in higher memory usage and degraded performance.
For production use cases, where the aim is higher throughput, we recommend offloading GPU and heavy processing code to a model server such as the NVIDIA Triton Inference Server, and keeping the inference client a lightweight, async implementation. Model servers such as the NVIDIA Triton Inference Server are designed to manage GPU and CPU resources efficiently and allow us to batch parallel requests. The @pytriton decorator allows you to offload heavy compute to Triton and avoid loading the model multiple times.
To utilize the @pytriton decorator to deploy the facebook/bart-large-mnli model from Hugging Face, perform the following steps.
Create an example inference client. This is needed for the facebook/bart-large-mnli model from Hugging Face.

from transformers import pipeline

CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

input_queries = ["I love cricket", "Lets watch movie today"]

classification_result = CLASSIFIER(input_queries, LABELS)
result_labels = [res["labels"][0] for res in classification_result]
print(result_labels)
Arrange your inference client in the following format. We will be using PyTriton for hosting the model on the NLP server. You might want to get familiar with PyTriton by following the Quick Start steps from the PyTriton GitHub repository.

from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
...

# Creating Inference function
@batch
def infer_fn(**inputs: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    ...
    return outputs

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="<custom_model_name>",
    infer_func=infer_fn,
    inputs=[
        ...
    ],
    outputs=[
        ...
    ],
    config=ModelConfig(...)
)

# Serving model
triton.serve()
Convert our inference client to the PyTriton-compatible format.

import numpy as np
from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
from transformers import pipeline

CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

# Creating Inference function
@batch
def infer_fn(queries: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    input_queries = np.char.decode(queries.astype("bytes"), "utf-8").tolist()
    classification_result = CLASSIFIER(input_queries, LABELS)
    result_labels = [res["labels"][0] for res in classification_result]
    return {"labels": np.char.encode(result_labels, "utf-8")}

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="facebook-bart-large-mnli",
    infer_func=infer_fn,
    inputs=[
        Tensor(name="queries", dtype=bytes, shape=(-1,)),
    ],
    outputs=[
        Tensor(name="labels", dtype=bytes, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=4)
)

# Serving model
triton.serve()
Test your code by running the following commands:
Install PyTriton locally.
pip install -U "nvidia-pytriton<0.4.0"
Save the code in a Python file and run:
python custom_model.py
You should see the NVIDIA Triton Inference Server hosted at localhost:8000. You can interact with the model using the following code:

import numpy as np
from pytriton.client import ModelClient

with ModelClient("localhost:8000", "facebook-bart-large-mnli") as client:
    result_dict = client.infer_batch(np.char.encode([["Lets watch movie today"]], "utf-8"))
    print(result_dict)
Integrate the above code into the NLP server for easier and automated deployment via the @pytriton decorator. The @pytriton decorated function is executed only once during startup, even for multiple workers, so we don’t load the model multiple times in GPU memory.

import numpy as np
from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

from nlp_server.decorators import pytriton

# @pytriton decorator
@pytriton()
def custom_pytriton_model(triton: Triton):
    # Model Initialization / Loading Code
    from transformers import pipeline

    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel", "cooking", "dancing", "sport", "music", "entertainment", "festival", "movie", "literature"]

    # Creating Inference function
    @batch
    def infer_fn(queries: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        input_queries = np.char.decode(queries.astype("bytes"), "utf-8").tolist()
        classification_result = CLASSIFIER(input_queries, LABELS)
        result_labels = [res["labels"][0] for res in classification_result]
        return {"labels": np.char.encode(result_labels, "utf-8")}

    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="facebook-bart-large-mnli",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="queries", dtype=bytes, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="labels", dtype=bytes, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=4)
    )
Create the @model_api decorated function to override the NLP server API endpoint of your choice. Using this model, we will override the /nlp/model/text_classification endpoint.

import numpy as np
from pytriton.client import ModelClient

from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/text_classification", model_name="facebook-bart-large-mnli", model_type="triton")
def bart_mnli_model_api(input_request):
    # The NLP server embeds model metadata in the function meta as the model_info attribute;
    # you can access the Triton server URL using bart_mnli_model_api.model_info.url
    with ModelClient(
        url=f"grpc://{bart_mnli_model_api.model_info.url}",
        model_name=bart_mnli_model_api.model_info.model_name,
        model_version=bart_mnli_model_api.model_info.model_version,
    ) as client:
        result_dict = client.infer_batch(np.array([[np.char.encode(input_request.query, "utf-8")]]))

    return {"class_name": result_dict["labels"][0].decode("utf-8"), "score": 1}
Save both the @pytriton and @model_api clients in a single Python file and start the NLP server.

source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --custom_model_dir <custom_model.py>
Optionally, specify the client modules in the model_config.yaml file.

model_servers:
  - name: custom
    nlp_models:
      - <custom_model.py> # Absolute or relative path from model_config.yaml
To start the NLP server with the client, run:
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --config model_config.yaml
Verify the changes by querying with model_name as "facebook-bart-large-mnli" on the NLP server Swagger for the /nlp/model/text_classification endpoint, or use the following curl command:

curl -X 'POST' \
  'http://localhost:9003/nlp/model/text_classification' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "I love cricket",
  "model_name": "facebook-bart-large-mnli"
}'
Customizing ASR Recognition with Word Boosting#
Word boosting allows you to bias the ASR to recognize particular words of interest or domain-specific words at request time by giving them a higher score when decoding the output of the acoustic model.
For more information about word boosting, refer to the Riva Word Boosting documentation.
For our sample bot, let’s add a speech_config.yaml file.

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
Create a word boosting file asr_words_to_boost.txt containing the words to boost along with a boosting score.

samples/
└── stock_bot
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml

{
    "comment": "speech_context can have multiple entries. Each entry has single boost value and multiple phrases.",
    "speech_context": [
        {
            "boost": 40,
            "phrases": [
                "Nvidia"
            ]
        }
    ]
}
Add the path of this word boosting file in the ASR component of the speech_config.yaml file.

riva_asr:
  RivaASR:
    word_boost_file_path: "/workspace/config/asr_words_to_boost.txt"
Deploy the bot with the above ASR customization.
Customizing ASR Recognition for Long Pause Handling#
Pauses in human speech serve several important functions, such as allowing the speaker to gather their thoughts, choose words carefully, and give the audience a chance to process what has been said. In conversational speech, research suggests that pauses between words generally fall into three categories: short (0.20 seconds), medium (0.60 seconds), and long (over 1 second).
ASR recognition models need to wait for some silence to detect the end of utterance (EOU) in user audio. Riva ASR uses 800 ms of silence as the default EOU threshold, which means the user transcript might get split into multiple queries if more than 800 ms of silence occurs between words. Setting a higher EOU threshold has the negative impact of increased bot response latency. Depending on the bot use case, you can update the EOU value in speech_config.yaml to adjust the amount of silence allowed and handle longer pauses.

riva_asr:
  RivaASR:
    server: "localhost:50051"

    # final transcript silence threshold
    endpointing_stop_history: 800 # End of User Utterance (EOU)
As humans, we start thinking while the other person is still speaking. To reach a human-level user experience, we need to start processing transcripts as soon as they are received and reevaluate our response if more words are added to the transcript. The LLM and RAG sample bots utilize this technique, also referred to as two-pass End of Utterance (EOU). We start processing the user transcript after the first-pass silence (240 ms), stop LLM and TTS processing if user audio is detected after the first pass but before the second-pass EOU (800 ms), and retrigger the pipeline with the new transcript.

riva_asr:
  RivaASR:
    server: "localhost:50051"

    # final transcript silence threshold
    endpointing_stop_history: 800 # Second pass End of User Utterance (EOU)
    endpointing_stop_history_eou: 240 # First pass End of User Utterance (EOU)
Customizing TTS Pronunciation using IPA#
Providing an IPA pronunciation file enables you to tune the TTS model with custom pronunciations for domain-specific words or words that are not pronounced as expected. The model uses this pronunciation for the specified word while synthesizing audio.
To use IPA mapping in TTS, we need to create a dictionary file containing the word and its IPA Pronunciation.
Create an IPA dictionary file ipa.dict in the bot config directory.

samples/
└── stock_bot
    ├── ipa.dict
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml
Add the following example IPA pronunciation. Any custom words can be added here (the word must be in UPPERCASE letters). For example:
GPU<SPACE><SPACE>'dʒi'pi'ju
Add the path of this IPA file in the TTS component of the speech_config.yaml file.

riva_tts:
  RivaTTS:
    ipa_dict: "/workspace/config/ipa.dict"
Deploy the bot with the above TTS customization.
Using 3rd Party Text-to-Speech (TTS) Solutions#
The ACE Agent pipeline uses Riva TTS as the default option. For speech bots, you might want to customize the voice used for speech responses. You can train your own TTS model, clone a TTS voice, or use any 3rd party provider. In this example, we showcase how you can integrate the ElevenLabs text-to-speech APIs.
Add the required dependencies in the NLP server dockerfile present in the Quick Start resource at deploy/docker/dockerfiles/nlp_server.Dockerfile.

##############################
# Install custom dependencies
##############################
RUN pip install elevenlabs==1.4.1
Register with ElevenLabs - Generative AI Text to Speech & Voice Cloning and get an API key. Add the ElevenLabs API key in the .env file present in the Quick Start directory with the variable name ELEVENLABS_API_KEY, so it can be passed to the NLP server container.

import os

from elevenlabs.client import ElevenLabs
from elevenlabs import Voice, VoiceSettings

client = ElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)
audio_stream = client.generate(
    text=input_request.text,
    voice="Brian",
    model=input_request.model_name,
    stream=True,
    optimize_streaming_latency=3,
    output_format="pcm_44100"
)
Override the /speech/text_to_speech API endpoint exposed by the NLP server using the @model_api decorator. Refer to the full code of the NLP server custom client.

import io
import os
import functools

from elevenlabs.client import ElevenLabs
from elevenlabs import Voice
from dataclasses import dataclass
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

from nlp_server.decorators import model_api


def do_chunking(audio_stream, min_chunk_size=4410):
    buffer = b""
    for chunk in audio_stream:
        buffer += chunk
        if len(buffer) >= min_chunk_size:
            yield buffer
            buffer = b""
    if len(buffer) != 0:
        yield buffer


@dataclass
class TTSRequest:
    text: str
    voice_name: str
    model_name: str = ""
    model_version: str = ""
    language_code: str = "en-US"
    sample_rate_hz: int = 44100


@model_api(
    endpoint="/speech/text_to_speech",
    model_name=[
        "eleven_monolingual_v1",
        "eleven_multilingual_v1",
        "eleven_multilingual_v2",
        "eleven_turbo_v2",
        "eleven_turbo_v2_5",
    ],
)
async def eleven_tts(input_request: TTSRequest):
    client = ElevenLabs(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )
    audio_stream = client.generate(
        text=input_request.text,
        voice=input_request.voice_name,
        model=input_request.model_name,
        stream=True,
        optimize_streaming_latency=3,
        output_format="pcm_44100"
    )
    return StreamingResponse(do_chunking(audio_stream))
For trying out the custom TTS, you can use any of the sample bots present in the ./bots directory in the Quick Start resource or custom bots created as part of the tutorials. Save the code in a Python file with the name elevenlabs_tts.py inside the ./bots directory.

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── main.co
    └── elevenlabs_tts.py
If you don’t have an existing model_config.yaml, create one.

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── main.co
    ├── elevenlabs_tts.py
    └── model_config.yaml
Add a custom TTS client and a Riva ASR model.
model_servers:
  - name: riva
    speech_models:
      - nvidia/ace/rmir_asr_parakeet_1-1b_en_us_str_vad:2.17.0
    url: localhost:8001
  - name: custom
    nlp_models:
      - elevenlabs_tts.py
For integrating the 3rd party TTS with the Chat Controller, we need to add some parameters for TTS in speech_config.yaml. If you haven’t already created speech_config.yaml, then let’s create it in the sample bot directory.

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── main.co
    ├── elevenlabs_tts.py
    ├── model_config.yaml
    └── speech_config.yaml
Add the following parameters inside the speech_config.yaml file.

riva_tts:
  RivaTTS:
    tts_mode: "http"
    voice_name: "Brian"
    server: "http://0.0.0.0:9003/speech/text_to_speech"
    language: "en-US"
    ipa_dict: ""
    sample_rate: 44100
    model_name: "eleven_monolingual_v1"
Deploy the bot with the above TTS customization.
Set the OpenAI API key if it is not already set.
export OPENAI_API_KEY=...
export BOT_PATH="samples/stock_bot"
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
docker compose -f deploy/docker/docker-compose.yml up speech-event-bot --build