The VRP and PDP problems are derived from the TSP, which is one of the most studied problems in operations research and more generally, in computer science. TSP asks the following question: “Given a list of destinations and a matrix of distances between each pair of destinations, what is the shortest possible route that visits each destination exactly one time and returns to the original location?” For example, the TSP has several applications in planning and logistics, where a good solution can save significant travel time and fuel costs in the transportation and delivery of goods. VRP and PDP are essentially extensions of TSP with additional complexity. VRP generalizes the TSP to solve for the optimal set of routes for a fleet of vehicles in order to deliver to a given set of customers. PDP adds the possibility of two different types of services, namely pickup or delivery, whereas in VRP all customers require the same service be performed at a customer location.

In mathematical terms, the TSP, VRP, and PDP belong to the class of problems called ‘NP-hard’, meaning that the required time to find an optimal solution increases *at least* exponentially with the size of the problem (for example, the number of deliveries to make). The number of possible states in the search space for VRP is of the order of n!, where n is the number of nodes (locations the vehicle must reach) in the network. Given the large search space, brute force approaches are practically intractable for large problem sizes (more than a few dozen locations) even on a modern supercomputer.

For instance, a ten-node problem has about 3628800 (3.6*10^6) possible states, but if we double the problem size, the solution space becomes 2432902008176640000 (2.4 * 10^18), which means that the solution space grew by a factor of 670442572800; it is about 6.7 trillion times larger, i.e. there is a massive need for more compute.

Given the time and computational resources required for brute-force enumeration, obtaining the exact optimal solution is not realistic at all. However, there are well-studied heuristics that yield near-optimal solutions for very large networks within a reasonable time, and NVIDIA cuOpt focuses on using these heuristics.

cuOpt first generates a feasible initial solution (initial phase) and then iteratively improves the solution quality (improvement phase). The termination criteria are reached when either the solution quality is improving slower than a threshold tolerance (which is adaptively set internally by cuOpt heuristics) or if an execution time-limit has elapsed.

GPUs, with their ability to harness thousands of parallel cores, are an ideal computing platform for accelerating massive parallelizable problems where thousands or millions of separate tasks are to be computed in parallel. This enables orders-of-magnitude speedups when running heuristics for this class of problems, thereby reducing operational costs and improving solution accuracy.

cuOpt provides a managed service API that is intuitive and easy to adopt. It exposes a composable solver for all the implemented variants of the VRP problem and available heuristics.

A Python reference client and CLI is provided for cuOpt that enables users to easily submit cuOpt problems represented as JSON objects. Users are free to build their own clients using the reference client and documentation as guides.

Before diving into this, it is key to understand the two types of memories—global memory and shared memory. cudaMalloc always allocates global memory that resides on the GPU. The contents of global memory are visible to all the threads running in each kernel that is any thread can read and write to any location of the global memory. In contrast, shared memory is memory shared between the threads within a block and is not visible to all threads. For instance, the shared memory of the A100 GPUs capacity per SM is 164 KB, with 108 SMs on the chip. Each SM has an isolated shared memory and can only communicate by copying to global memory. Global memory is limited by the total memory available to the GPU, for instance, an A100 GPU (40 GB) offers 40 GB of device memory. Shared memory is like a local cache shared among the threads of a block, magnitudes faster to access than global memory and limited in capacity.

While in general cuOpt memory usage is problem size dependent, it is equally important to note that cuOpt memory usage is very sensitive to the constraints specified. The global memory usage is determined by the size of the input distance and cost matrices while the longest route in terms of number of nodes on the route determines the peak shared memory usage. As an approximate estimate, a 10,000 locations CVRPTW test case with challenging constraints can execute on a single A100 GPU (40 GB) without any out-of-memory issue. However, for even larger problem sizes.