Troubleshooting¶
Limitations On Memory Usage¶
Before diving in, it helps to understand the two types of GPU memory: global memory and shared memory (SM). cudaMalloc always allocates global memory, which resides on the GPU. The contents of global memory are visible to all threads running in a kernel; any thread can read and write any location in global memory. Global memory is limited by the total memory available on the GPU; for instance, an A100 GPU (40 GB) offers 40 GB of device memory.
While cuOpt memory usage generally depends on problem size, it is also very sensitive to the constraints specified. As an approximate estimate, a 10,000-location CVRPTW test case with challenging constraints can execute on a single A100 GPU (40 GB) without running out of memory.
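To get a feel for how problem size alone drives memory, the sketch below estimates the footprint of a single dense cost matrix. It assumes 4-byte floating-point entries and a full N x N matrix; both are assumptions made for illustration rather than documented cuOpt internals, and the function name `dense_matrix_bytes` is hypothetical. The estimate covers only the matrix itself and says nothing about the memory the solver needs for constraints.

```python
def dense_matrix_bytes(num_locations: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense num_locations x num_locations matrix.

    Assumption: 4 bytes per entry (single-precision float). The real
    storage format used by cuOpt is not stated in this guide.
    """
    return num_locations * num_locations * bytes_per_value


if __name__ == "__main__":
    size = dense_matrix_bytes(10_000)
    # 10,000 x 10,000 entries at 4 bytes each
    print(f"{size / 1e9:.2f} GB")  # prints "0.40 GB"
```

Because the matrix grows with the square of the number of locations, doubling the location count quadruples this part of the footprint.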
Docker Sanity Check¶
If Docker is installed and configured correctly, the following will run nvidia-smi in a container and display information about the GPU. If this command fails, check the items below.
$ docker run --rm --gpus all nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
Docker Not Installed or the Wrong Version¶
This page shows the supported Linux distributions and container runtime versions.
Check the Docker version.
$ docker --version
If Docker is not installed, or the Docker version is not supported, follow this guide to install a supported version. We recommend at least Docker 19.03 because it includes the --gpus flag.
Docker Does Not Start¶
The current user does not have permission to run containers.
Docker may return an error like this:
docker: Got permission denied while trying to connect to the Docker daemon socket…
This means that the current user is not part of the Docker group. To enable the current user to run containers:
$ sudo usermod -aG docker $USER
Log out and then log in again for the changes to take effect.
Docker service is not running.
Check the status of the Docker service.
$ systemctl status docker
If the status shows that Docker is not running, try restarting it.
$ sudo systemctl restart docker
If Docker still does not start, check the status report for error messages.
NVIDIA container toolkit is not installed.
If the NVIDIA container toolkit is not installed, Docker may return an error similar to this:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Follow this guide to set up the NVIDIA Container Toolkit.
NVIDIA GPU driver is not installed.
If the NVIDIA driver is not installed, Docker may return an error similar to this:
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
There are various ways to install the NVIDIA GPU driver. One simple method is to install the CUDA Toolkit. For other methods, search the NVIDIA documentation.
cuOpt Service Not Starting¶
Check the logs for the container (see cuOpt service monitoring below).
Is port 5000 already in use?
If port 5000 is unavailable, the logs will contain an error like this:
ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use
Try to locate the process that is using port 5000 and stop it if possible. A tool like netstat run as the root user can help identify ports mapped to processes, and docker ps will show running containers.
Alternatively, use port mapping to launch cuOpt on a different port such as 5001 (note the omission of the --network=host flag):
$ docker run -d --rm --gpus all -p 5001:5000 <CUOPT_IMAGE>
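Before launching, it can be useful to confirm whether anything is already listening on a port. The following is a minimal sketch using only the Python standard library; the helper name `port_in_use` is hypothetical, and it only detects listeners reachable on the given host address (127.0.0.1 by default), so it complements rather than replaces netstat and docker ps.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (5000, 5001):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"port {candidate}: {state}")
```

If 5000 is reported as in use and 5001 as free, the port-mapping command above is a reasonable fallback.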
cuOpt Service Not Responding¶
cuOpt microservice health check on the cuOpt host.
Perform a health check locally on the host running cuOpt:
$ curl -s -o /dev/null -w '%{http_code}\n' localhost:5000/cuopt/health
200
If this command returns 200, cuOpt is running and listening on the specified port.
If this command returns something other than 200, check the following:
Check that a cuOpt container is running with docker ps.
Examine the cuOpt container log for errors.
Did you include the --network=host or a -p port-mapping flag to docker when you launched cuOpt? If you used port mapping, did you perform the health check using the correct port?
Restart cuOpt and see if that corrects the problem.
cuOpt microservice health check from a remote host.
If you are trying to reach cuOpt from a remote host, run the health check from the remote host and specify the IP address of the cuOpt host, for example:
$ curl -s -o /dev/null -w '%{http_code}\n' 34.23.145.121:5000/cuopt/health
200
If this command does not return 200, but a health check locally on the cuOpt host does return 200, the problem is a network configuration or firewall issue. The host is not reachable, or the cuOpt port is not open to incoming traffic.
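The same health check can be scripted, which helps when comparing results from the cuOpt host and a remote host. This is a sketch using Python's standard `urllib`; the function name `health_status` is hypothetical, while the `/cuopt/health` path and port 5000 come from the curl commands above. It returns the HTTP status code, or `None` when the host cannot be reached at all, which is the case that points to a network or firewall issue.

```python
import urllib.error
import urllib.request
from typing import Optional


def health_status(host: str, port: int = 5000, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of the cuOpt health endpoint, or None if unreachable."""
    url = f"http://{host}:{port}/cuopt/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (>= 400).
        return err.code
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return None


if __name__ == "__main__":
    status = health_status("localhost")
    if status == 200:
        print("cuOpt is running and listening")
    elif status is None:
        print("host or port not reachable")
    else:
        print(f"unexpected status: {status}")
```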
cuOpt Service Monitoring¶
Checking the cuOpt container log.
Look for the cuOpt container id, for example:
$ docker ps --format 'table {{.ID}} {{.Image}}' | grep cuopt
22a1726778ee nvcr.io/nvidia/cuopt/cuopt:22.12
Print the logs using the container id:
$ docker logs 22a1726778ee
INFO: Started server process [1045]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:36870 - "GET /cuopt/health HTTP/1.1" 200 OK
INFO: 12.22.141.131:16016 - "GET /cuopt/health HTTP/1.1" 200 OK
cuOpt microservice health check.
Perform a health check on the host running cuOpt:
$ curl -s -o /dev/null -w '%{http_code}\n' localhost:5000/cuopt/health
200
How to Check if My Configuration Is Valid for a cuOpt Variant (Constraints, Inputs) When Using the cuOpt Service?¶
The cuOpt service will validate the data passed for the following general data categories before running the solver:
Cost matrix
Waypoint graph
Fleet data
Includes time windows, breaks, capacities, and so on.
Task data
Includes time windows, service times, demands and so on.
Solver configuration
If the data passed is not valid, the service will return an error without running the solver. The HTTP status code of the response will be >= 400, and the response will contain a detailed error message. See the Server API documentation for more detail on the HTTP response.
To run the validation check without running the solver (even if the data is valid), see the -ov flag on the cuOpt CLI, or the only_validate argument in the cuOpt client, or the Server API documentation for more details if using the API directly.
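When calling the service directly, the rule described above (a status code of 400 or higher means the data was rejected and the body carries the explanation) can be handled in one place. The sketch below is illustrative only: `CuOptRequestError` and `raise_for_rejection` are hypothetical names, and the response body is treated as opaque text because its exact structure is defined in the Server API documentation, not here.

```python
class CuOptRequestError(Exception):
    """Hypothetical exception carrying the status code and the service's message."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(f"request rejected ({status_code}): {message}")
        self.status_code = status_code
        self.message = message


def raise_for_rejection(status_code: int, body: str) -> str:
    """Return the body unchanged on success; raise if the service rejected the request."""
    if status_code >= 400:
        # Validation failed, so the solver did not run.
        raise CuOptRequestError(status_code, body)
    return body


if __name__ == "__main__":
    try:
        raise_for_rejection(400, "All values in cost_matrix must be >= 0")
    except CuOptRequestError as err:
        print(err)
```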
Common Misconfigurations Reported by the cuOpt Managed Service, and How to Fix Them (Missing Data, Incompatible Constraints)¶
If the authentication is failing for the managed service:
Make sure the sak value is set correctly in the client. The sak value is an NVIDIA Identity Federation API Key covered in the Quickstart Guide and the client documentation.
Make sure you are using an unexpired API key. Generate a new key if necessary.
If you are failing to connect to the endpoint:
You may have a local cache that is pointing to the wrong function; delete the version_cache.json file or run cuopt_cli -g to unblock.
Alternatively, it is possible that the NVIDIA cloud service infrastructure is down.
If the client stops polling and returns without a result:
If using the cuopt_cli:
- The CLI will print a message showing how to restart polling with the CLI.
- The CLI option -p may be used to set the polling timeout (default is 120 seconds).
If using the client library:
- The client will raise a TimeoutError exception containing a JSON object with request id and asset id values. These values may be passed to the client’s repoll method to restart polling.
- The time to poll may be set with the request_excess_timeout argument in the client (default is 120 seconds). Setting it to None will cause the client to poll forever.
In either case, if the time taken is excessive for a simple problem and a result cannot be retrieved by repolling, engage cuOpt support.
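The timeout-then-repoll flow for the client library can be sketched generically. The code below does not use the real cuOpt client API: `submit` and `repoll` are hypothetical callables standing in for the client's solve call and its repoll method, and the sketch assumes the request details are available as the first argument of the TimeoutError. Consult the client documentation for the actual signatures.

```python
from typing import Any, Callable


def solve_with_repoll(
    submit: Callable[[], Any],
    repoll: Callable[[Any], Any],
    max_repolls: int = 3,
) -> Any:
    """Run submit(); on timeout, retry via repoll() up to max_repolls times."""
    try:
        return submit()
    except TimeoutError as exc:
        # Assumption: the request id / asset id object is the first exception argument.
        request_info = exc.args[0] if exc.args else None

    for _ in range(max_repolls):
        try:
            return repoll(request_info)
        except TimeoutError:
            continue  # still not ready; poll again

    raise TimeoutError(f"no result after {max_repolls} repoll attempts: {request_info}")
```

Bounding the number of repoll attempts keeps a stuck request from blocking forever, at which point the guidance above about engaging cuOpt support applies.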
If you are getting HTTP errors 500 or 409 or validation errors you do not understand:
Capture any output from the program and send it to us via a bug or incident report. The dataset used would be helpful to add, but ensure it does not contain any proprietary details.
"Environment variable 'CUOPT_CLIENT_SAK' not found"
An API Key is not set in the environment variable.
For Linux:
export CUOPT_CLIENT_SAK="NVIDIA_IDENTITY_FEDERATION_API_KEY"
Alternatively, set it in the CLI or pass it in the function call.
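A script can check for the key up front and fail with a clearer message than the one above. This is a small sketch; `resolve_client_sak` is a hypothetical helper, and its `explicit_key` parameter simply stands in for a key passed directly in code. Only the environment variable name `CUOPT_CLIENT_SAK` comes from the error message.

```python
import os
from typing import Optional


def resolve_client_sak(explicit_key: Optional[str] = None) -> str:
    """Return the API key, preferring an explicitly passed value over the environment."""
    key = explicit_key or os.environ.get("CUOPT_CLIENT_SAK")
    if not key:
        raise RuntimeError(
            "No API key found: export CUOPT_CLIENT_SAK or pass the key explicitly."
        )
    return key
```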
Authentication Error: Invalid Client SAK
Invalid Client SAK; please ensure that the API Key specified is correct. A new key may always be generated from the NGC console.
Important
Bugs should be formally submitted using the NGC cuOpt Service landing page.
404 Client Error: Not Found for URL: *********
The client is trying to reach the wrong endpoint.
This could be a version cache mismatch; the cache needs to be cleared.
rm -rf version_cache.json
or cuopt_cli -g if using the CLI
cuOpt Error: Bad Request - 400: All values in cost_matrix must be >= 0
The cost/time matrix contains a negative value, which is not accepted.
cuOpt Error: Bad Request - 400: All rows in the cost matrix must be of the same length
The cost/time matrix is not a square matrix.
cuOpt Error: Bad Request - 400: task_locations represent index locations and must be greater than or equal to 0
One or more of the task locations provided is negative or otherwise invalid.
cuOpt Error: Unprocessable Entity - 422: [{'loc': ……
The JSON data is malformed; a field name or the structure of the JSON data may not match what is expected.
The details will be in the error message.
cuOpt Error: Bad Request - 400: task_time_windows must be greater than or equal to 0
The task time window contains a negative value.
cuOpt Error: Bad Request - 400: service_times must be greater than or equal to 0
Service time should be a non-negative value.
cuOpt Error: Bad Request - 400: Fleet locations represent index locations and must be greater than or equal to 0
Vehicle location should be non-negative.
cuOpt Error: Internal Server Error - 500: cuOpt unhandled exception, please include this message in any error report: Return location must be between min 0 and max num_locations!
Vehicle location should lie within [0, max_num_locations], determined by cost matrix/waypoint graph size.
cuOpt Error: Bad Request - 400: All capacity dimensions values must be 0 or greater
Vehicle capacity should be non-negative.
cuOpt Error: Bad Request - 400: vehicle_time_windows: All vehicle time window values must be greater than or equal to 0
Vehicle time windows should be non-negative.
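Several of the errors above can be caught locally before sending a request. The sketch below mirrors a few of those rules in plain Python; it is not the service's own validator, the function name `find_input_problems` and its messages are hypothetical, and it assumes locations are zero-based indices into a square cost matrix, so the largest valid index is one less than the number of locations.

```python
from typing import List, Sequence


def find_input_problems(
    cost_matrix: Sequence[Sequence[float]],
    task_locations: Sequence[int],
    service_times: Sequence[float] = (),
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    n = len(cost_matrix)

    # Every row must have as many entries as there are rows.
    if any(len(row) != n for row in cost_matrix):
        problems.append(f"cost matrix must be square with {n} entries per row")

    if any(value < 0 for row in cost_matrix for value in row):
        problems.append("all values in the cost matrix must be >= 0")

    # Assumption: locations are zero-based indices, so valid values are 0 .. n - 1.
    if any(loc < 0 or loc >= n for loc in task_locations):
        problems.append(f"task locations must be between 0 and {n - 1}")

    if any(t < 0 for t in service_times):
        problems.append("service times must be >= 0")

    return problems


if __name__ == "__main__":
    for problem in find_input_problems([[0, -1], [1]], [2], service_times=[-5]):
        print(problem)
```

The same pattern extends to the remaining checks listed above, such as capacities and vehicle time windows, each of which is a non-negativity test.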