Troubleshooting

Installing ACE Agent Wheel in Python 3.11

Q: While trying to install the aceagent wheel on Python 3.11, the installation crashes with an error about building a wheel file for annoy. How do I fix this?

A: If the installation fails while building a wheel file for annoy, install the Python 3.11 development headers and rebuild the virtual environment:

sudo apt-get install python3.11-dev
python3.11 -m venv ace
source ace/bin/activate
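
With the environment recreated, install the wheel inside it. A minimal sketch, assuming you have already downloaded the ACE Agent wheel; replace the placeholder with the actual wheel filename.

pip install <path-to-aceagent-wheel>.whl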

Installing ACE Agent Wheel in Python 3.12

Q: How can I create a virtual environment with Python 3.12 to install the aceagent wheel?

A: You can create a virtual environment with Python 3.12 using the following commands.

sudo apt-get install python3.12-distutils python3.12-dev python3.12-venv

python3.12 -m ensurepip --upgrade
python3.12 -m venv ace
source ace/bin/activate

Building Helm Chart using UCS tools

Q: Why am I getting the following error when building an application using UCS tools?

AppBuilder - ERROR - Failed to find a suitable version for microservice dependency 'ucf.svc.riva.speech-skills' that matches ...

A: Generate an NGC_PERSONAL_KEY (or an NVIDIA API Key) with the org set to nv-ucf and run:

export NGC_PERSONAL_KEY=nvapi-...

ucf_app_builder_cli registry repo set-api-key -a "${NGC_PERSONAL_KEY}"

Try building the application again using UCS tools.

Web UI stuck with “connecting to server…” message

Q: Why is the web UI stuck with the message connecting to server…?

A: When the web UI runs for the first time, it precomputes cached values, which can take up to 5 minutes. Once this is complete, the web UI should load within seconds. If you see this message, wait a few minutes and try again. If it persists, ensure that the UI server is running and, if you are accessing the UI from a remote browser, that port 7007 is not blocked by a firewall.
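
As a quick check, confirm that the UI server port is reachable from the machine running the browser. A minimal sketch, assuming the default UI port 7007; adjust the host and port to your deployment.

# Expect an HTTP status line once the UI server is up
curl -sI http://<ui-server-ip>:7007 | head -n 1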

Bot responding with the message “I have encountered some technical issue!”

Q: Why is the bot responding with the I have encountered some technical issue! message?

A: Most sample bots and tutorials use this message as a fallback when a Colang error or an undefined Colang flow is observed. Check the logs directory created in the current working directory. Common issues include:

  • There was a syntax error in the Colang files.

  • There was an error in the Action or Plugin calls.

  • For the LLM sample bot, if the local LLM is not up or the API key is not correct, the ExternalLLMAction fails and the Colang flow returns this fallback message.

If the external API calls are failing and you want to use a different error message, check whether the Action call was successful using an if-else statement, as shown below.

$price = await InvokeFulfillmentAction(request_type="get", endpoint="/stock/get_stock_price", company_name=$company_name)
if not $price
  bot say "Could not find the stock price!"
else
  bot say "Stock price of {$company_name} is {$price}"

If you want to change the error message for all failures, edit main.co and update the message in the following flow activations.

activate notification of undefined flow start "I have encountered some technical issue!"
activate notification of colang errors "I have encountered some technical issue!"
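
For example, to use a friendlier fallback, only the quoted string in the two activations changes (the message text below is just an illustration):

activate notification of undefined flow start "Sorry, something went wrong on my end. Please try again."
activate notification of colang errors "Sorry, something went wrong on my end. Please try again."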

Why am I not getting any bot responses

Q: Why are there no bot responses in either the text or speech modality?

A: If no bot response is received, check for one of the following cases.

  • The Chat Engine container is not up. Check the chat-engine container logs for the exit reason. Most likely this indicates an issue in the bot configurations.

If you haven’t provided the correct bot configurations path via BOT_PATH or the UCS app params, you might observe the following error:

No bot config files found in the provided directory! Please provide a yaml file with name "bot: <bot_name>"

  • The Chat Engine hasn’t received any user queries. Check the chat-controller logs to confirm whether ASR requests were received, check the Riva Skills logs for any errors, and make sure the Riva Speech server is up and the ASR model is deployed.

  • If you are able to get a text response but not a speech response, check that the chat-controller container is up and that no errors appear in the chat-controller logs. Also check that the configured TTS provider is functional and no errors are observed.

  • Check the Redis UMIM bus for the particular stream_id to confirm whether the user utterance and bot utterance events were received; see the sketch after this list.

  • If you don’t observe any errors in the logs and no bot utterance is received in Redis, check whether the Colang bot configurations are expected to return a response for the given query, and review the NeMo Guardrails logs in the log directory in the current working directory.
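
A minimal sketch for inspecting the UMIM bus, assuming Redis runs in a container; the container name and stream naming are assumptions, so list the streams first to find the key that matches your stream_id.

# List Redis streams and look for the one matching your stream_id
docker exec -it <redis-container> redis-cli --scan --pattern '*<stream_id>*'

# Dump the events on that stream and look for the user and bot utterance UMIM events
docker exec -it <redis-container> redis-cli XRANGE <stream-name> - +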

Bot responding with the message “Error in getting response from Chat Engine”

Q: Why am I getting the bot response Error in getting response from Chat Engine?

A: This error is observed in the Chat Engine Server Architecture and Plugin Server Architecture when the chat-controller microservice is not able to call the Chat Engine and Plugin Server endpoints or the requests time out. Check that you have correctly configured the dialog_manager server URL in the speech_config.yaml file and confirm that the Chat Engine and Plugin Server endpoints are up and accessible.
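
As a quick connectivity check from the host running the chat-controller, confirm that the configured endpoints accept connections. A minimal sketch, assuming nc (netcat) is available; substitute the hosts and ports from your speech_config.yaml.

# Confirm the Chat Engine and Plugin Server ports accept connections
nc -zv <chat-engine-host> <chat-engine-port>
nc -zv <plugin-server-host> <plugin-server-port>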

Unable to deploy local NIM LLM model

Q: Why is my local LLM deployment container nemollm-inference-microservice exiting?

A: Check the logs of the NIM container. The container can fail during startup if other processes are already running on the target GPU device. Running Riva Speech models on the same GPU might work if they are deployed after the NIM container is up, but running other models on the same GPU device is not recommended, as it can lead to OOM errors.

If you haven’t set NGC_CLI_API_KEY correctly, you might observe the Error: NGC_API_KEY is not set error message in the log files.
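
A minimal triage sketch, assuming a Docker deployment and the container name from the question above:

# Check whether other processes already occupy memory on the target GPU
nvidia-smi

# Review the NIM container logs for the exact startup error
docker logs nemollm-inference-microservice 2>&1 | tail -n 50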

RAG server is not responding correctly

Q: Why am I getting the No response generated from LLM, make sure your query is relavent to the ingested document. bot response from the RAG bot?

A: Make sure you uploaded the documents relevant to your use case to the RAG server, either using the Playground at http://<your-ip>:3001/kb or the document ingestion API at http://<your-ip>:8081/docs.

If you observe errors while uploading documents, review the logs of the rag-application-text-chatbot-langchain container; a quick container check is sketched after this list. Commonly observed errors include:

  • You are using hosted LLM and embedding models and have not set the correct NVIDIA_API_KEY.

    ERROR:example:Failed to ingest document due to exception [401] Unauthorized
    invalid response from UAM
    Please check or regenerate your API key.
    ERROR:RAG.src.chain_server.server:Error from POST /documents endpoint. Ingestion of file: /tmp/gradio/cf9b8b2a9072611545f0b2dc20e454edb82650bf2bfc93e8567d803dcc0e49b7/2022 Delta Dental FAQs.pdf failed with error: Failed to upload document. Please upload an unstructured text document.
    INFO:     172.21.0.6:40894 - "POST /documents HTTP/1.1" 500 Internal Server Error
    
  • You are using local embedding and NIM LLM microservices and the services are not ready or have exited.

    requests.exceptions.ConnectionError: HTTPConnectionPool(host='nemollm-embedding', port=8000): Max retries exceeded with url: /v1/models (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x742c0280ccd0>: Failed to resolve 'nemollm-embedding' ([Errno -3] Temporary failure in name resolution)"))
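
To narrow this down, confirm that the involved containers are running and inspect the ingestion logs. A minimal sketch using the container names mentioned in this section (the filters match by substring; adjust the names to your deployment):

# Confirm the RAG chain server, embedding service, and Milvus containers are up
docker ps --filter "name=rag-application-text-chatbot-langchain" --filter "name=nemollm-embedding" --filter "name=milvus"

# Inspect recent ingestion errors
docker logs rag-application-text-chatbot-langchain 2>&1 | tail -n 50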
    

Bot responding with the message “Sorry I could not connect to the RAG endpoint”

Q: Why am I getting the Sorry I could not connect to the RAG endpoint bot response from the RAG bot?

A: This error is observed when the RAG server is not up, the URL is not correctly configured, or the RAG server returns an empty response. Check the chat-engine-event-speech container logs to see if you observe the following error:

  File "/usr/local/lib/python3.10/dist-packages/chat_engine/policies/actions/colang2_actions.py", line 149, in perform_fulfillment_call
    logger.warning("Could not connect to fulfillment endpoint=%s. Error %e", url, e)
Message: 'Could not connect to fulfillment endpoint=%s. Error %e'
Arguments: ('http://localhost:9002/rag/chat', ClientPayloadError("Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"))

Review the plugin-server container logs and the plugin_config.yaml file to ensure the correct RAG_SERVER_URL is configured. If the RAG server is not accessible or is not up, you will observe the following error:

aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8081 ssl:default [Multiple exceptions: [Errno 111] Connect call failed ('::1', 8081, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 8081)]
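
To rule out a networking or configuration issue, you can test the configured RAG server URL from inside the plugin-server container. A minimal sketch, assuming curl is available in the container and using the 8081 port and /docs path referenced earlier in this guide; substitute the container name and your RAG_SERVER_URL.

# From inside the plugin-server container, check that the RAG server answers
docker exec -it <plugin-server-container> curl -s -o /dev/null -w "%{http_code}\n" http://<rag-server-host>:8081/docs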

The exact errors depend on whether you are using NIM hosted LLM and embedding models or local LLM and embedding models. Commonly observed errors in the rag-application-text-chatbot-langchain container logs include:

  • Milvus is not up.

    WARNING:pymilvus.decorators:[query] retry:75, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host>
    ERROR:pymilvus.decorators:RPC error: [query], <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:
            status = StatusCode.UNAVAILABLE
            details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host"
            debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host", grpc_status:14, created_time:"2024-10-24T08:18:14.14407801+00:00"}"
    >>, message=Retry run out of 75 retry times, message=failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host)>, <Time:{'RPC start': '2024-10-24 08:14:43.305372', 'RPC error': '2024-10-24 08:18:14.144369'}>
    ERROR:RAG.src.chain_server.utils:Error occurred while retrieving documents: <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:
    
  • Local NIM LLM is not up. Check the logs for nemollm-inference-microservice. Refer to the Unable to deploy local NIM LLM model section for more information.

  • Local NIM embedding model is not up. Check the logs for nemo-retriever-embedding-microservice. If you haven’t set NGC_CLI_API_KEY correctly, you might observe the Error: NGC_API_KEY is not set error message.

Unable to pull Docker images

Q: Why am I not able to pull images in Docker deployment? I’m receiving the following error:

Unable to find image 'nvcr.io/nvidia/riva/riva-speech:2.17.0-servicemaker' locally
model-utils-speech | docker: Error response from daemon: Head "https://nvcr.io/v2/nvidia/riva/riva-speech/manifests/2.17.0-servicemaker": unauthorized:

A: Log into the nvcr.io Docker registry to pull the ACE Agent and Riva Skills containers.

export NGC_CLI_API_KEY=<your-api-key>
echo ${NGC_CLI_API_KEY} | docker login nvcr.io --username '$oauthtoken' --password-stdin
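
After logging in, retry pulling the image that failed, for example the Riva image from the error above:

docker pull nvcr.io/nvidia/riva/riva-speech:2.17.0-servicemaker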

Kubernetes pods are failing

Q: Why are my Kubernetes pods failing with a Failed to pull image error?

A: Ensure imagePullSecrets is correctly set in the UCS app and that the Kubernetes secret is correctly configured. Delete and recreate the ngc-docker-reg-secret Docker registry secret:

export NGC_CLI_API_KEY=...

kubectl delete secret ngc-docker-reg-secret
kubectl create secret docker-registry ngc-docker-reg-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password="${NGC_CLI_API_KEY}"
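
Then confirm the secret exists and check the events of a failing pod to verify that the image pull succeeds (the pod name below is a placeholder):

# Confirm the secret was recreated
kubectl get secret ngc-docker-reg-secret

# Check the events of a failing pod for the image pull status
kubectl describe pod <failing-pod-name> | tail -n 20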

Why is the speech models deployment failing

Q: When deploying speech models using the docker compose -f deploy/docker/docker-compose.yml up model-utils-speech command, why is it failing with the following log?

model-utils-speech  | Waiting for Riva server to load all models...retrying in 10 seconds

A: This error usually indicates that the Riva Speech server container has exited with an error. Check the Riva Speech server container logs. The most commonly observed issues are listed below, followed by a quick triage sketch:

  • Out of Memory - The GPU doesn’t have enough free VRAM for the Riva server deployment.

  • Ports clash - Some service is already utilizing one of the ports 8000, 8001, 8002, or 50051 exposed by the Riva Speech server.

  • TensorRT conversion failed - An error occurred during the TensorRT conversion of a particular model, so that model fails to load. Check the model-utils-speech container logs to review the TensorRT conversion output for the root cause.
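
A quick triage sketch for these three causes, assuming a Docker deployment; the exact Riva container name varies, so look it up with docker ps first.

# Out of Memory: check free GPU memory before deployment
nvidia-smi

# Ports clash: check whether anything is already bound to the Riva ports
sudo lsof -i :8000 -i :8001 -i :8002 -i :50051

# TensorRT conversion or other failures: review the Riva Speech server logs
docker ps -a | grep -i riva
docker logs <riva-speech-container> 2>&1 | tail -n 100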