Troubleshooting

Limitations On Memory Usage

Before diving in, it is important to understand the two types of GPU memory: global memory and shared memory. cudaMalloc always allocates global memory, which resides on the GPU. The contents of global memory are visible to all threads running in every kernel; that is, any thread can read and write any location in global memory. Global memory is limited by the total memory available on the GPU; for instance, an A100 GPU (40 GB) offers 40 GB of device memory.

While cuOpt memory usage generally depends on problem size, it is also very sensitive to the constraints specified. As a rough estimate, a 10,000-location CVRPTW test case with challenging constraints can run on a single A100 GPU (40 GB) without running out of memory.

Docker Sanity Check

  • If Docker is installed and configured correctly, the following will run nvidia-smi in a container and display information about the GPU. If this command fails, check the items below.

    $ docker run --rm --gpus all nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
    

Docker Not Installed or the Wrong Version

  • This page shows the supported Linux distributions and container runtime versions.

  • Check the Docker version.

    $ docker --version
    
  • If Docker is not installed, or the Docker version is not supported, follow this guide to install a supported version. We recommend at least Docker 19.03 because it includes the --gpus flag.
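
The version recommendation above can be checked in a script. The following is a minimal sketch, not part of Docker or cuOpt: the function name docker_version_ok is hypothetical, and it assumes a plain MAJOR.MINOR[.PATCH] version string such as "19.03.13".

```shell
# Succeed (exit status 0) if the given Docker version is at least 19.03.
# Assumption: the argument looks like MAJOR.MINOR[.PATCH], e.g. "19.03.13".
docker_version_ok() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}            # text after the first dot
    minor=${rest%%.*}       # text before the next dot
    minor=${minor#0}        # drop one leading zero ("03" -> "3")
    minor=${minor:-0}       # "0" became empty above; restore it
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -ge 3 ]; }
}
```

One possible way to use it with the installed client:

    $ docker_version_ok "$(docker version --format '{{.Client.Version}}')" && echo "Docker supports --gpus"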

Docker Does Not Start

  • The current user does not have permission to run containers.

    Docker may return an error like this:

    docker: Got permission denied while trying to connect to the Docker daemon socket…
    

    This means that the current user is not part of the Docker group. To enable the current user to run containers:

    $ sudo usermod -aG docker $USER
    

    Log out and then log in again for the changes to take effect.

  • Docker service is not running.

    Check the status of the Docker service.

    $ systemctl status docker
    

    If the status shows that Docker is not running, try restarting it.

    $ sudo systemctl restart docker
    

    If Docker still does not start, check the status report for error messages.

  • NVIDIA container toolkit is not installed.

    If the NVIDIA Container Toolkit is not installed, Docker may return an error like this:

    docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
    

    Follow this guide to set up the NVIDIA Container Toolkit.

  • NVIDIA GPU driver is not installed.

    If the NVIDIA driver is not installed, Docker may return an error similar to this:

    docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    
    nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
    

    There are various ways to install the NVIDIA GPU driver. One simple method is to install the CUDA Toolkit. For other methods, search the NVIDIA documentation.

cuOpt Service Not Starting

  • Check the logs for the container (see cuOpt service monitoring below).

    • Is port 5000 already in use?

    • If port 5000 is unavailable, the logs will contain an error like this:

      ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use
      
  • Try to locate the process that is using port 5000 and stop it if possible. A tool like netstat run as the root user can help identify ports mapped to processes, and docker ps will show running containers.

  • Alternatively, use port mapping to launch cuOpt on a different port such as 5001 (note the omission of the --network=host flag):

    $ docker run -d --rm --gpus all -p 5001:5000 <CUOPT_IMAGE>
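
To confirm the port conflict described above before choosing a workaround, the listening sockets can be inspected. The sketch below is illustrative only: port_is_listening is a hypothetical helper, and it assumes the usual `ss -ltn` column layout, where the local address and port are in the fourth field.

```shell
# Read `ss -ltn` style output on stdin and succeed if any listener's
# local address ends in ":<port>". The first line is treated as a header.
# Assumption: the local address:port is the fourth whitespace-separated field.
port_is_listening() {
    awk -v port="$1" '
        NR > 1 && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

Possible usage, followed by a check for containers that publish the same port:

    $ ss -ltn | port_is_listening 5000 && echo "port 5000 is already in use"
    $ docker ps --filter publish=5000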
    

cuOpt Service Not Responding

  • cuOpt microservice health check on the cuOpt host.

    Perform a health check locally on the host running cuOpt:

    $ curl -s -o /dev/null -w '%{http_code}\n' localhost:5000/cuopt/health
    
    200
    

    If this command returns 200, cuOpt is running and listening on the specified port.

    If this command returns something other than 200, check the following:

    • Check that a cuOpt container is running with docker ps.

    • Examine the cuOpt container log for errors.

    • Did you include the --network=host flag or a -p port-mapping flag when you launched cuOpt with docker? If you used port mapping, did you perform the health check using the correct port?

    • Restart cuOpt and see if that corrects the problem.

  • cuOpt microservice health check from a remote host.

    If you are trying to reach cuOpt from a remote host, run the health check from the remote host and specify the IP address of the cuOpt host, for example:

    $ curl -s -o /dev/null -w '%{http_code}\n' 34.23.145.121:5000/cuopt/health
    
    200
    

    If this command does not return 200, but a health check locally on the cuOpt host does return 200, the problem is a network configuration or firewall issue. The host is not reachable, or the cuOpt port is not open to incoming traffic.
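
After a restart, the service may not answer the health check immediately, so it can help to retry for a short while before concluding that something is wrong. The following is a minimal sketch that wraps the same curl health check in a retry loop; the function name wait_for_health and the default attempt count and delay are illustrative assumptions, not cuOpt settings.

```shell
# Poll a health URL until it returns HTTP 200 or the attempts run out.
#   $1 = URL, $2 = max attempts (default 10), $3 = delay in seconds (default 3)
wait_for_health() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Print only the HTTP status code; the response body is discarded.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        if [ "$code" = "200" ]; then
            return 0
        fi
        echo "attempt $i/$attempts: got HTTP ${code:-no response}" >&2
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Possible usage on the cuOpt host:

    $ wait_for_health localhost:5000/cuopt/health && echo "cuOpt is up"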

cuOpt Service Monitoring

  • Checking the cuOpt container log.

    Look for the cuOpt container id, for example:

    $ docker ps --format 'table {{.ID}} {{.Image}}' | grep cuopt
    
    22a1726778ee nvcr.io/nvidia/cuopt/cuopt:22.12
    

    Print the logs using the container id:

    $ docker logs 22a1726778ee
    
    INFO: Started server process [1045]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
    INFO: 127.0.0.1:36870 - "GET /cuopt/health HTTP/1.1" 200 OK
    INFO: 12.22.141.131:16016 - "GET /cuopt/health HTTP/1.1" 200 OK
    
  • cuOpt microservice health check.

    Perform a health check on the host running cuOpt:

    $ curl -s -o /dev/null -w '%{http_code}\n' localhost:5000/cuopt/health
    
    200
    

How to Check if My Configuration Is Valid for a cuOpt Variant (Constraints, Inputs) When Using the cuOpt Service?

  • The cuOpt service will validate the data passed in a set or update operation for the following general data categories:

    • Cost matrix

    • Waypoint graph

    • Fleet data

      • Includes time windows, breaks, capacities, and so on.

    • Task data

      • Includes time windows, service times, demands, and so on.

    • Solver configuration

    If the data passed is not valid, the HTTP status code of the response will be >= 400, and the response will contain a detailed error message. If the data passed is valid, the HTTP status code of the response will be 200. See the Server API documentation for more detail on the HTTP response.

    Other configuration errors may be caught by the solver when get_routes is called.

Common Misconfigurations Reported by the cuOpt Service, and How to Fix Them (Missing Data, Incompatible Constraints)

  • The error messages listed below are up to date for the cuOpt server.

  • If the authentication is failing:

    • Make sure Client ID and secret are set properly.

    • Check if the secret is expired, and if expired, generate a new one.

    • Delete the cache for a token if nothing else works.

  • If you are failing to connect to the endpoint:

    • You may have a local cache that is pointing to the wrong function; delete the cache to unblock.

    • Alternatively, it is possible that the NVIDIA cloud service infrastructure is down.

  • If a request starts polling for a result and stops after a while without a result:

    • Wait for some time and try to hit the endpoint that was provided at the end to retrieve the result.

    • If the time taken is excessive for a simple problem, engage cuOpt support.

  • If you are getting HTTP errors 500 or 409, these are from cuOpt.

    • Capture any output from the program and send it to us via a bug or incident report. The dataset used would be helpful to add, but ensure it does not contain any proprietary details.

  • If you are getting any other HTTP errors, they are from NVIDIA cloud service infrastructure.

    • Please submit a bug report.

"Environment variable 'CUOPT_CLIENT_SAK' not found"
  • The API key is not set in the CUOPT_CLIENT_SAK environment variable.

  • For Linux:

    • export CUOPT_CLIENT_SAK="NVIDIA_IDENTITY_FEDERATION_API_KEY"

  • Alternatively, set it in the CLI or pass it in the function call.
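
A script that calls the service can fail early when the variable is missing, instead of surfacing the error later. This is a small sketch under that assumption; require_sak is a hypothetical helper name and is not part of the cuOpt client.

```shell
# Fail with a clear message if CUOPT_CLIENT_SAK is unset or empty.
require_sak() {
    if [ -z "${CUOPT_CLIENT_SAK:-}" ]; then
        echo "CUOPT_CLIENT_SAK is not set; export it before calling cuOpt" >&2
        return 1
    fi
}
```

Possible usage at the top of a script:

    require_sak || exit 1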

Authentication Error: Invalid Client SAK
  • Invalid Client SAK; please ensure that the API Key specified is correct. A new key may always be generated from the NGC console.

Important

Bugs should be formally submitted using the NGC cuOpt Service landing page.

cuOpt Error: Conflict - 409: Infeasible Solve
  • The problem provided is infeasible. Relax the constraints, or enable prize collection so that the solver can serve the higher-prize tasks and drop the others.

404 Client Error: Not Found for URL: *********
  • Trying to hit the wrong endpoint.

    • This could be a version cache mismatch; the cache needs to be cleared.

    • rm -rf version_cache.json

cuOpt Error: Bad Request - 400: All values in cost_matrix must be >= 0
  • The cost/time matrix contains a negative value, which is not accepted.

cuOpt Error: Bad Request - 400: All rows in the cost matrix must be of the same length
  • The cost/time matrix is not a square matrix.
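
Both cost matrix errors above can be caught before sending a request. The sketch below is only an illustration: check_cost_matrix is a hypothetical helper, and it assumes the matrix has been exported as plain text with one row per line and values separated by spaces, which is not the JSON format the service itself accepts.

```shell
# Read a matrix from stdin (one row per line, space-separated values).
# Succeed if it is square and has no negative entries; otherwise print
# the first problem found and fail.
check_cost_matrix() {
    awk '
        NR == 1 { n = NF }                       # width of the first row
        NF != n {
            printf "row %d has %d values, expected %d\n", NR, NF, n
            bad = 1; exit
        }
        {
            for (i = 1; i <= NF; i++)
                if ($i + 0 < 0) {
                    printf "negative value at row %d, column %d\n", NR, i
                    bad = 1; exit
                }
        }
        END {
            if (!bad && NR != n) {
                printf "%d rows but %d columns\n", NR, n
                bad = 1
            }
            exit bad + 0
        }
    '
}
```

Possible usage, assuming a hypothetical file cost_matrix.txt:

    $ check_cost_matrix < cost_matrix.txt && echo "matrix looks valid"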

cuOpt Error: Bad Request - 400: task_locations represent index locations and must be greater than or equal to 0
  • The task locations provided contain a negative or otherwise invalid value.

cuOpt Error: Unprocessable Entity - 422: [{'loc': ……
  • The JSON data is malformed; a field name or the structure of the JSON data may not match what the service expects.

  • The details are in the error message.

cuOpt Error: Bad Request - 400: task_time_windows must be greater than or equal to 0
  • The task time window contains a negative value.

cuOpt Error: Bad Request - 400: service_times must be greater than or equal to 0
  • Service time should be a non-negative value.

cuOpt Error: Bad Request - 400: Fleet locations represent index locations and must be greater than or equal to 0
  • Vehicle location should be non-negative.

cuOpt Error: Internal Server Error - 500: cuOpt unhandled exception, please include this message in any error report:  Return location must be between min 0 and max num_locations!
  • Vehicle location should lie within [0, max_num_locations], determined by cost matrix/waypoint graph size.

cuOpt Error: Bad Request - 400: All capacity dimensions values must be 0 or greater
  • Vehicle capacity should be non-negative.

cuOpt Error: Bad Request - 400: vehicle_time_windows: All vehicle time window values must be greater than or equal to 0
  • Vehicle time windows should be non-negative.
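
The status codes that appear in the entries above can be summarized in a small helper for scripts that call the service. This is a rough sketch based only on the codes listed in this section; describe_status is a hypothetical name, and the wording of each hint is an assumption rather than output produced by cuOpt.

```shell
# Map an HTTP status code from the cuOpt service to a short hint,
# following the error entries listed in this section.
describe_status() {
    case $1 in
        200)     echo "ok" ;;
        400|422) echo "invalid input: read the error message in the response" ;;
        404)     echo "wrong endpoint: try clearing the version cache" ;;
        409)     echo "infeasible: relax constraints or enable prize collection" ;;
        500)     echo "internal error: include the message in a bug report" ;;
        *)       echo "not listed here: submit a bug report" ;;
    esac
}
```

Possible usage with the status code printed by curl:

    $ describe_status "$(curl -s -o /dev/null -w '%{http_code}' localhost:5000/cuopt/health)"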