Speech Configurations

To enable speech I/O for the bot, a Chat Controller is used. The Chat Controller is a gRPC server that uses a graph of components to enable speech I/O. These components expose parameters that you can set as needed. The parameters can be modified in the file samples/{bot-name}/speech_config.yaml located under the respective bot directory.

BotRuntimeGrpc

The BotRuntimeGrpc component runs the gRPC server and exposes the gRPC API. Clients can connect to this server to access the services of the Chat Controller. You can set the following parameters of the gRPC server component.

  • port_number - Port number on which the gRPC server listens for client requests. Default is 50055.

  • virtual_assistant_num_instances - Number of simultaneous client instances supported by the gRPC server. Each instance is mapped to a unique client using the client’s stream_id.

  • virtual_assistant_pipeline_idle_threshold_secs - Idle pipeline detection threshold time in seconds. If the pipeline is idle for this duration, it will be released/freed automatically. Default is 600 seconds.

  • virtual_assistant_pipeline_idle_handler_wakeup_rate_secs - Idle pipeline handler thread wake-up rate in seconds. The handler thread wakes up after the specified number of seconds to check for idle pipelines; if a pipeline has been idle for more than virtual_assistant_pipeline_idle_threshold_secs, that pipeline is freed. Default is 10 seconds. The sketch after the configuration example below illustrates how these two parameters interact.

    grpc_server:
      nvidia::rrt::BotRuntimeGrpc:
        port_number: 50055
        virtual_assistant_num_instances: 10
        virtual_assistant_pipeline_idle_threshold_secs: 600
        virtual_assistant_pipeline_idle_handler_wakeup_rate_secs: 10
    
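
The two idle-handling parameters work together: a background thread wakes up every virtual_assistant_pipeline_idle_handler_wakeup_rate_secs and releases any pipeline that has been idle longer than virtual_assistant_pipeline_idle_threshold_secs. The Python sketch below only illustrates this behavior; the function and attribute names are hypothetical and not part of the Chat Controller API.

    import time

    IDLE_THRESHOLD_SECS = 600   # virtual_assistant_pipeline_idle_threshold_secs
    WAKEUP_RATE_SECS = 10       # virtual_assistant_pipeline_idle_handler_wakeup_rate_secs

    def idle_handler(pipelines):
        """Hypothetical idle-pipeline reaper illustrating the two parameters."""
        while True:
            time.sleep(WAKEUP_RATE_SECS)              # wake up periodically
            now = time.monotonic()
            for pipeline in list(pipelines):
                if now - pipeline.last_activity > IDLE_THRESHOLD_SECS:
                    pipeline.release()                # free the idle pipeline
                    pipelines.remove(pipeline)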

SpeechPipelineManager

The SpeechPipelineManager component is responsible for maintaining the conversation flow. It coordinates the flow between ASR, the Chat Engine, and TTS: it accepts speech input and passes it to ASR, sends the resulting ASR transcript to the Chat Engine, and forwards the Chat Engine response to TTS.

The following parameters are supported by this component for customization.

  • asr_idle_timeout_ms - Idle timeout in milliseconds for ASR. If ASR is active and no speech input is received for this amount of time, then the ASR component is instructed to close the connection with the Riva Skills server.

  • tts_eos_delay_ms - When TTS end of stream (EOS) is received, its reporting is delayed by tts_eos_delay_ms milliseconds. Default value is 0. This parameter can be used to avoid feeding played TTS audio back as input.

  • enable_barge_in - Flag to enable/disable barge-in support. Default is false, which means disabled. This parameter is valid only when use_umim=false.

  • barge_in_words_list - Comma-separated list of words to be used for barge-in. Whenever any word in the list is detected, the ongoing TTS playback is stopped (see the sketch after the configuration example below). This parameter is valid only when use_umim=false.

  • always_on - Flag to keep ASR always active. If true, ASR is always active; if false, ASR becomes active only after TTS playback is complete.

  • use_umim - Flag to enable UMIM mode. If true, the UMIM bus is used to send and receive events. Default is false.

    speech_pipeline_manager:
      SpeechPipelineManager:
        asr_idle_timeout_ms: 200000
        tts_eos_delay_ms: 2000
        always_on: true
        enable_barge_in: true
        barge_in_words_list: ["please stop", "cancel", "abandon", "break"]
        use_umim: false
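
As an illustration of the barge-in parameters above, the following Python sketch shows how an interim ASR transcript might be checked against barge_in_words_list to stop ongoing TTS playback. The function and object names are hypothetical, not part of the Chat Controller API.

    BARGE_IN_WORDS = ["please stop", "cancel", "abandon", "break"]   # barge_in_words_list

    def on_interim_transcript(transcript, tts_player):
        """Hypothetical barge-in check run on each interim ASR result."""
        text = transcript.lower()
        if any(phrase in text for phrase in BARGE_IN_WORDS):
            tts_player.stop()   # stop the ongoing TTS playback on barge-in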

Riva ASR

The Riva ASR component accepts audio data and produces transcripts. It calls the Riva Skills server to get the ASR transcripts. The following parameters are supported by this component.

  • language - The language code, only en-US is supported.

  • server - The server address with the port where the Riva ASR server is hosted.

  • ratelimit_log_period_ms - Time period in milliseconds to set the rate-limited debug logs frequency. Default is 0, which disables the debug logs. The ASR debug logs also print the ASR buffer energy.

  • enable_profanity_filter - Flag to enable/disable profanity filter.

  • word_boost_file_path - JSON file path specifying words to boost and boost strength value. For example, a JSON file is shown below.

{
    "comment": "speech_context can have multiple entries. Each entry has single boost value and multiple phrases.",
    "speech_context": [
        {
            "boost": 50,
            "phrases": [
                "wraps"
            ]
        },
        {
            "boost": 40,
            "phrases": [
                "Tops",
                "sides"
            ]
        }
    ]
}

Word Boosting

Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time by giving them a higher score when decoding the output of the acoustic model. For our sample bot, this is useful when certain items on the menu have a unique pronunciation, or when the pronunciation of a domain-specific word closely matches that of a generic common word.

Note

The recommended boosting score values are between 20 and 100. A higher score increases the likelihood that the boosted words appear in the transcript if the words occurred in the audio. However, it can also increase the likelihood that the boosted words appear in the transcription even though they did not occur in the audio.

For more information, refer to the Riva ASR documentation.
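
If you prefer to generate the word boost file programmatically (for example, from a list of menu items), a minimal Python sketch is shown below. The output path matches the word_boost_file_path used in the configuration example that follows; the phrases are taken from the JSON example above.

    import json

    # Illustrative boost entries for the sample bot's menu-specific phrases.
    speech_context = [
        {"boost": 50, "phrases": ["wraps"]},
        {"boost": 40, "phrases": ["Tops", "sides"]},
    ]

    with open("/workspace/config/asr_words_to_boost_conformer.txt", "w") as f:
        json.dump({"speech_context": speech_context}, f, indent=4)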

    riva_asr:
      RivaASR:
        server: "localhost:50051"
        language: "en-US"
        word_boost_file_path: "/workspace/config/asr_words_to_boost_conformer.txt"
        enable_profanity_filter: false

Chat Engine

The Chat Engine component interacts with the Chat Engine REST API. It passes requests to the Chat Engine server and gets the response from it. You can set the following parameters of the Chat Engine component:

  • server - The server address with the port where the Chat Engine is hosted.

  • ratelimit_log_period_ms - Time period in milliseconds to set the rate-limited debug logs frequency. Default is 0, which disables the debug logs.

  • enable - Flag to enable/disable this component. Default value is true, which means it’s enabled.

  • http_timeout_ms - Timeout value in milliseconds to be used when accessing the REST API of the Chat Engine. Default is 20000.

  • use_streaming - Flag to receive the Chat Engine response in streaming mode. If the Chat Engine doesn’t provide a response in streaming mode then set this to false. Default is true.

  • sentence_breaker_pattern - This parameter is used when use_streaming=true. The Chat Engine sends the response token by token, whereas TTS needs a complete sentence; this parameter defines the characters at which the streamed response is split into sentences. Default pattern is ".!,?".

  • min_chars_in_sentence - This parameter is used when use_streaming=true. Because the Chat Engine streams the response token by token, this parameter specifies the minimum number of characters that must accumulate before a sentence is sent to TTS. Default is 20. A sketch of this sentence-forming logic follows the configuration example below.

    dialog_manager:
      DialogManager:
        server: "http://localhost:9000"
        enable: true
    
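
The following Python sketch illustrates how sentence_breaker_pattern and min_chars_in_sentence could be combined to turn a streamed token response into sentences for TTS. It is only an illustration of the behavior described above, not the Chat Controller implementation.

    SENTENCE_BREAKERS = set(".!,?")   # sentence_breaker_pattern
    MIN_CHARS = 20                    # min_chars_in_sentence

    def sentences_from_tokens(token_stream):
        """Yield sentences accumulated from a streaming Chat Engine response."""
        buffer = ""
        for token in token_stream:
            buffer += token
            if buffer and buffer[-1] in SENTENCE_BREAKERS and len(buffer) >= MIN_CHARS:
                yield buffer.strip()   # breaker found and enough characters: emit a sentence
                buffer = ""
        if buffer.strip():
            yield buffer.strip()       # flush whatever remains at end of stream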

Riva TTS

The Riva TTS component interacts with the Riva Skills TTS server. It takes text input and gives audio data as output. It supports the following parameters:

  • voice_name - The voice name supported by the TTS server.

  • sample_rate - The desired sample rate of synthesized audio, default is 16000 Hz.

  • server - The server address with the port where the TTS server is hosted.

  • language - The language code, only en-US is supported.

  • ipa_dict - Dictionary file mapping custom words to their IPA representations, used for custom pronunciations.

  • ratelimit_log_period_ms - Time period in milliseconds to set the rate limited debug logs frequency. Default is set to 0, which disables the debug logs.

  • chunk_duration_ms - The audio output chunk size in milliseconds.

  • send_audio_in_realtime - If true, the audio data will be sent in real time, else it will be sent in bursts. The default is false.

  • audio_start_threshold_ms - The duration for which audio data is sent in a burst; the rest of the data is sent in real time. Must be a multiple of chunk_duration_ms. Default value is 400. This is needed when barge-in is enabled (see the sketch after the configuration example below).

  • tts_mode - Mode to be used to connect to the TTS server; supported modes are grpc and http. The default is grpc. The http mode is used when a TTS service other than Riva is used. You can deploy a 3rd party TTS service as part of the NLP server. For more information, refer to the Using 3rd Party Text to Speech (TTS) Solutions section.

  • model_name - TTS inference model name. The default is empty.

    riva_tts:
      RivaTTS:
        server: "localhost:32001"
        voice_name: "English-US.Female-1"
        language: "en-US"
        ipa_dict: ""
        sample_rate: 16000
        ratelimit_log_period_ms: 0
        chunk_duration_ms: 100
        audio_start_threshold_ms: 400
        send_audio_in_realtime: true
        tts_mode: "grpc"
        model_name: ""
    
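
The sketch below illustrates how chunk_duration_ms, audio_start_threshold_ms, and send_audio_in_realtime interact when pacing synthesized audio: an initial burst covering audio_start_threshold_ms is sent immediately, and the remaining chunks are paced in real time. The code is illustrative only; the names are not part of the Chat Controller API.

    import time

    CHUNK_DURATION_MS = 100     # chunk_duration_ms
    START_THRESHOLD_MS = 400    # audio_start_threshold_ms (a multiple of the chunk duration)

    def send_tts_audio(chunks, sink, send_in_realtime=True):
        """Hypothetical pacing loop: burst the first chunks, then pace the rest."""
        burst_chunks = START_THRESHOLD_MS // CHUNK_DURATION_MS
        for i, chunk in enumerate(chunks):
            sink.write(chunk)
            if send_in_realtime and i >= burst_chunks:
                time.sleep(CHUNK_DURATION_MS / 1000.0)   # pace remaining chunks in real time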

The following example shows the parameters that can be used when a 3rd party TTS service is used.

riva_tts:
  RivaTTS:
    tts_mode: "http"
    voice_name: "Bella"
    server: "http://<ip>:9003/speech/text_to_speech"
    language: "en-US"
    sample_rate: 44100
    model_name: "eleven_monolingual_v1"

IPA Dictionary

IPA tuning enables you to tune the TTS model with custom pronunciations for domain-specific words, or for words that are not pronounced as expected. The model uses this pronunciation for the specified word when synthesizing audio.

To use IPA mapping in TTS, you need to create a dictionary file containing each word and its IPA representation.

Each line should contain a single entry corresponding to a word and its IPA representation in the format <WORD IN UPPER CASE><SPACE><SPACE><IPA Representation>.

3D  'θɹiˌdi
GPU  'dʒi'pi'ju
AI  ˌeɪ'aɪ
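
A minimal Python sketch for writing such a dictionary file in the expected format (upper-case word, two spaces, IPA string) is shown below. The mappings are taken from the example above; the output file name is a placeholder.

    ipa_mappings = {
        "3D": "'θɹiˌdi",
        "GPU": "'dʒi'pi'ju",
        "AI": "ˌeɪ'aɪ",
    }

    with open("ipa_dict.txt", "w", encoding="utf-8") as f:
        for word, ipa in ipa_mappings.items():
            f.write(f"{word.upper()}  {ipa}\n")   # <WORD IN UPPER CASE><SPACE><SPACE><IPA Representation>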

Riva Logger

The Riva Logger component logs audio and text data to files. It dumps the input audio to ASR, the synthesized output from TTS, and the Chat Engine responses into files. The following parameters are supported by this component:

  • data_dump_path - The path where logged data is to be dumped. Under this path, a <date-time> directory is created. The default is /tmp/, for example, /tmp/2022-01-31_13-40-40/.

  • test_mode - If true, a new file is created for each turn; otherwise, files are overwritten. The default is false.

  • ratelimit_log_period_ms - Time period in milliseconds to set the rate limited debug logs frequency. The default is set to 0, which disables the debug logs.

  • enable_logging - If true, enables logging of data. If false, no data will be dumped. The default is true.

    riva_logger:
      RivaLogger:
        data_dump_path: "/workspace/log"
        enable_logging: true
        ratelimit_log_period_ms: 0
    

NvMsgBrokerC2DReceiver

The NvMsgBrokerC2DReceiver component is used to receive messages from message brokers like Redis. The following parameters are supported:

  • conn_str - The connection information containing the IP address and port of the Redis server separated with a semicolon. The default is 127.0.0.1;6379.

  • topics - The list of topics that the component should subscribe to. Specified as a semicolon separated list of strings. For example, topic1;topic2;topic3.

    nvcloudmsg_receiver:
      NvMsgBrokerC2DReceiver:
        topics: emdat_alert_events
        conn_str: 127.0.0.1;6379
    
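
For local testing, you can publish a message to one of the subscribed topics with redis-py. The exact Redis primitive the broker adapter uses (stream versus pub/sub) and the expected payload field name are not specified here, so the sketch below is an illustration under the assumption that each topic maps to a Redis stream.

    import json
    import redis

    # Connection details correspond to conn_str "127.0.0.1;6379".
    r = redis.Redis(host="127.0.0.1", port=6379)

    # Hypothetical test event; the real payload schema depends on your bot.
    event = {"event_type": "alert", "message": "temperature threshold exceeded"}

    # Assumes each topic maps to a Redis stream and a "metadata" payload field;
    # adjust if your broker adapter uses pub/sub or a different field name.
    r.xadd("emdat_alert_events", {"metadata": json.dumps(event)})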

NvMsgBrokerD2CTransmitter

The NvMsgBrokerD2CTransmitter component transmits messages to message brokers like Redis. The following parameters are supported:

  • conn_str - The connection information containing the IP address and port of the Redis server separated with a semicolon. The default is 127.0.0.1;6379.

  • payload_key - The key name to be used for the payload. This key will be used for the JSON object. The default is metadata.

    d2c_transmitter:
      NvMsgBrokerD2CTransmitter:
        conn_str: 127.0.0.1;6379
    

Audio2FaceGrpc

The Audio2FaceGrpc component sends audio data to the Audio2Face microservice. The following parameters are supported:

  • server - The server address with the port where the Audio2Face server is hosted.

  • avatar_model - The name of the avatar model or identifier.

  • rpc_timeout_ms - The timeout value for RPC communication with the gRPC server in milliseconds. The default is 10000.

  • ratelimit_log_period_ms - The time period in milliseconds to set the rate limited debug logs frequency. The default is set to 0, which disables the debug logs.

  • renderer_latency_ms - The render latency in milliseconds. This is useful for sending end of stream, if the renderer has considerable latency. The default is 0.

  • simulate_eos - If true, EOS is simulated by waiting for the audio duration plus the renderer latency. Default is true.

    a2f_grpc:
      Audio2FaceGrpc:
        server: "localhost:50000"
        rpc_timeout_ms: 300000
        renderer_latency_ms: 0
        simulate_eos: true
    
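
The sketch below illustrates the simulate_eos timing described above: EOS is reported only after the full audio duration plus renderer_latency_ms has elapsed. The names are illustrative and not part of the Audio2FaceGrpc API.

    import time

    RENDERER_LATENCY_MS = 0   # renderer_latency_ms

    def simulate_eos(audio_duration_ms, notify_eos):
        """Wait for the audio duration plus renderer latency, then report EOS."""
        time.sleep((audio_duration_ms + RENDERER_LATENCY_MS) / 1000.0)
        notify_eos()   # signal end of stream to downstream consumers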

HttpClient

The HttpClient component is used to send animation commands to REST endpoints. The following parameters are supported:

  • server - The server address with the port where the HTTP server is hosted.

  • enable - The flag to enable/disable this component. The default value is true, which means it’s enabled.

    http_client:
      HttpClient:
        server: "http://localhost:8020"
        enable: true
    

BotControllerUMIM

The BotControllerUMIM component uses the UMIM bus to send and receive UMIM events. The following parameters are supported:

  • enable - The flag to enable/disable this component. The default value is false, which means it’s disabled.

    entity_umim:
      BotControllerUMIM:
        enable: true