TorchSWE case study
TorchSWE is a shallow-water solver created by Dr. Pi-Yueh Chuang and Prof. Lorena Barba that solves the vertically averaged Navier-Stokes equations using MPI and CuPy. It can simulate free-surface water flow in rivers, channels, and coastal areas, as well as model flood inundation. Given a topography, TorchSWE can predict flood-prone areas and the height of water inundation, making it a valuable tool for risk mapping.
High-resolution numerical simulations—such as those on real topographies requiring hundreds of millions of data points—demand distributed computation across multiple GPUs. Although scalability is achievable with MPI4Py and CuPy, this approach requires manually partitioning the problem and managing inter-GPU data communication, which are complex and error-prone tasks.
cuPyNumeric enables a distributed implementation of TorchSWE using only NumPy operations, without the complexities of MPI+CuPy. After porting TorchSWE to cuPyNumeric by removing all domain decomposition logic, it scaled effortlessly across multiple GPUs and nodes without further code modifications. This scalability enabled high-fidelity simulations exceeding 1.2 billion data points using 32 GPUs, allowing researchers to tackle critical scientific problems in flood inundation modeling without needing specialized distributed computing expertise. Overall, the cuPyNumeric implementation reduced the lines of code by over 20%, and simplified development and maintenance by eliminating complex logic for managing distribution and communication.
Deep dive into the TorchSWE code implementation
Original code details
TorchSWE uses stencil operations to model shallow-water equations on a 2D grid, where each point is updated based on neighboring values, simulating water flow dynamics. The stencil computations are structured to update each grid cell iteratively, based on data from surrounding cells, mimicking fluid behavior over time. Below is an example that mimics the basic structure of the stencil logic from the TorchSWE repository:
[1]:
import numpy as np

# Example dimensions for the grid
nx, ny = 128, 128
grid = np.ones((nx, ny))  # Initialize the grid with ones

# Stencil operation: average each interior cell with its four neighbors
for i in range(1, nx - 1):
    for j in range(1, ny - 1):
        grid[i, j] = (grid[i + 1, j] + grid[i - 1, j] +
                      grid[i, j + 1] + grid[i, j - 1]) / 4
This code iteratively updates each interior cell grid[i, j] from its adjacent cells, a basic averaging stencil that can be extended with different boundary conditions and the full flow dynamics of the shallow-water model. For full context, refer to TorchSWE on GitHub.
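To illustrate how boundary conditions might be layered on top of such a stencil, here is a minimal sketch, not taken from TorchSWE, that applies zero-gradient boundaries by copying the nearest interior row or column into the boundary cells:
[ ]:
import numpy as np

nx, ny = 128, 128
grid = np.ones((nx, ny))

# Zero-gradient (reflective) boundaries: copy the nearest interior
# row/column into the boundary cells after each stencil update
grid[0, :] = grid[1, :]      # top edge
grid[-1, :] = grid[-2, :]    # bottom edge
grid[:, 0] = grid[:, 1]      # left edge
grid[:, -1] = grid[:, -2]    # right edge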
Parallelizing stencil operations for multi-GPU systems is challenging. When arrays are partitioned across multiple GPUs, updates to cells near partition boundaries require the updated values to be exchanged between GPUs to keep the boundary (halo) regions consistent. This communication and synchronization overhead makes stencil codes complex and difficult to implement efficiently on multi-GPU architectures.
Below, we outline TorchSWE’s MPI4Py logic in more detail to highlight the complexity involved. The following snippet mirrors TorchSWE’s MPI structure, applied to the simple stencil operation from above:
[2]:
from mpi4py import MPI
import cupy as cp

num_timesteps = 10

def set_device(comm: MPI.Comm):
    # Device selection for each rank on multi-GPU nodes (TorchSWE-specific)
    n_gpus = cp.cuda.runtime.getDeviceCount()
    local_rank = comm.Get_rank() % n_gpus
    cp.cuda.runtime.setDevice(local_rank)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
set_device(comm)

# Determine grid size and decompose the domain along the first axis
gnx, gny = 126, 126                    # global grid dimensions
local_nx, local_ny = gnx // size, gny  # local grid dimensions per rank
local_grid = cp.ones((local_nx + 2, local_ny + 2))  # with halo boundaries

# Set up MPI data types describing the interior block of the local array
send_type = MPI.DOUBLE.Create_subarray(
    (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1))
recv_type = MPI.DOUBLE.Create_subarray(
    (local_nx + 2, local_ny + 2), (local_nx, local_ny), (1, 1))
send_type.Commit()
recv_type.Commit()

# Stencil computation loop
for timestep in range(num_timesteps):
    # Halo exchange with non-blocking sends/receives
    reqs = []
    if rank > 0:
        reqs.append(comm.Isend(local_grid[1, :], dest=rank - 1))
        reqs.append(comm.Irecv(local_grid[0, :], source=rank - 1))
    if rank < size - 1:
        reqs.append(comm.Isend(local_grid[local_nx, :], dest=rank + 1))
        reqs.append(comm.Irecv(local_grid[local_nx + 1, :], source=rank + 1))
    # Ensure all sends/receives are complete before computing
    MPI.Request.Waitall(reqs)
    # Perform the stencil operation on the local subdomain
    for i in range(1, local_nx + 1):
        for j in range(1, local_ny + 1):
            local_grid[i, j] = 0.25 * (local_grid[i - 1, j] + local_grid[i + 1, j] +
                                       local_grid[i, j - 1] + local_grid[i, j + 1])

# Clean up MPI data types
send_type.Free()
recv_type.Free()
MPI.Finalize()
This example follows TorchSWE’s approach to domain decomposition and parallelization. It starts with MPI initialization and logic to assign a GPU to each rank, then divides the global grid into subdomains: each rank owns a local subgrid with added halo rows to hold neighboring data. Once the domain is decomposed, the user must communicate data correctly at subdomain boundaries, accounting for datatype differences between CuPy and MPI4Py. For good performance, the appropriate kind of point-to-point communication, such as non-blocking send/receive, must be chosen, since an incorrect implementation can deadlock. Users must also handle the varying number of neighboring ranks on domain boundaries and exchange data for the mesh, topography, and solution variables. The non-blocking Isend and Irecv calls perform the boundary exchanges, giving each rank the data it needs for its stencil computation. After a Waitall synchronization step, each rank updates its own subdomain. Finally, the custom MPI datatypes are freed and MPI.Finalize() shuts down the MPI environment.
The actual TorchSWE code has additional complexities arising from its use of multiple solution arrays, GPU memory management, one-sided communication, and more. For the complete implementation, refer to the TorchSWE repository.
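To give a flavor of that extra complexity, here is a minimal, hypothetical sketch of a one-sided (RMA) halo exchange with MPI4Py. It is not TorchSWE’s actual code and uses NumPy host arrays for simplicity: each rank exposes its local array as an MPI window and writes its boundary rows directly into its neighbors’ halo rows.
[ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_nx, local_ny = 63, 126                        # hypothetical per-rank tile
local_grid = np.ones((local_nx + 2, local_ny + 2))  # interior plus halo rows
row = local_ny + 2                                  # elements per row

# Expose the local array as an RMA window; displacements are counted in elements
win = MPI.Win.Create(local_grid, disp_unit=local_grid.itemsize, comm=comm)

win.Fence()  # open an access epoch
if rank > 0:
    # Write our first interior row into rank-1's far halo row (index local_nx + 1)
    win.Put(local_grid[1, :], rank - 1,
            target=((local_nx + 1) * row, row, MPI.DOUBLE))
if rank < size - 1:
    # Write our last interior row into rank+1's near halo row (index 0)
    win.Put(local_grid[local_nx, :], rank + 1, target=(0, row, MPI.DOUBLE))
win.Fence()  # close the epoch; all Puts are now visible remotely

win.Free()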
Explicit distributed logic, like that in TorchSWE, is difficult to debug and maintain throughout the lifespan of simulation codes. Most applications, including TorchSWE, require specialized validation tests to ensure correct outputs. This results in significant programming effort and further complicates development.
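As a rough illustration of what such a validation test can look like, here is a hedged sketch, not from the TorchSWE test suite, that gathers each rank’s interior block on rank 0 and compares the assembled result against a single-process reference; the values used here are stand-ins.
[ ]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

gnx, gny = 126, 126
local_nx = gnx // size

# Stand-in for the distributed stencil result computed on this rank
local_result = np.full((local_nx, gny), 1.0)

# Gather interior blocks on rank 0 and compare with a serial reference
blocks = comm.gather(local_result, root=0)
if rank == 0:
    assembled = np.vstack(blocks)
    reference = np.full((gnx, gny), 1.0)  # stand-in for a single-process solve
    assert np.allclose(assembled, reference), "distributed result diverged"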
cuPyNumeric implementation
In the cuPyNumeric version of TorchSWE, stencil operations are implemented using cuPyNumeric’s distributed array handling, which simplifies the code and removes the need for manual partitioning and boundary synchronization. The code reads like ordinary NumPy slicing yet scales across multiple GPUs. For example, the stencil computation in this version uses simple array slices, as shown below, instead of the nested loops and interleaved MPI logic of the original implementation.
[ ]:
import cupynumeric as np

# Example dimensions
nx, ny = 128, 128

# Initialize the grid
grid = np.ones((nx, ny))

# Stencil operation using slicing
grid[1:-1, 1:-1] = (
    grid[2:, 1:-1] +   # below
    grid[:-2, 1:-1] +  # above
    grid[1:-1, 2:] +   # right
    grid[1:-1, :-2]    # left
) / 4
This operation is automatically managed across nodes and GPUs without needing MPI-specific code. More details can be found in the cuPyNumeric port of TorchSWE.
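For comparison with the MPI version above, a time-stepping loop in cuPyNumeric is just the sliced update repeated, with no halo exchange or rank logic. The sketch below reuses the same illustrative grid size and timestep count:
[ ]:
import cupynumeric as np

num_timesteps = 10
nx, ny = 128, 128
grid = np.ones((nx, ny))

for _ in range(num_timesteps):
    # Same averaging stencil; cuPyNumeric partitions the array and moves
    # boundary data between GPUs automatically
    grid[1:-1, 1:-1] = (
        grid[2:, 1:-1] + grid[:-2, 1:-1] +
        grid[1:-1, 2:] + grid[1:-1, :-2]
    ) / 4
The same script can then be run unchanged on one or many GPUs through the Legate launcher, for example legate --gpus 8 stencil.py (the script name and exact launcher options here are illustrative and may vary by release).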
The cuPyNumeric version of TorchSWE eliminates 600 lines of code related to domain decomposition, communication, synchronization, and validation that would otherwise be needed when using MPI4Py with CuPy. Writing and maintaining those 600 lines demands substantial distributed-computing knowledge from domain scientists. With cuPyNumeric, the simplified NumPy code scales efficiently to 1024 GPUs, making high-fidelity flood modeling accessible without specialized expertise in distributed systems.
Conclusion
cuPyNumeric significantly simplifies the development and maintenance of distributed simulations, such as TorchSWE, by abstracting complex parallelization, synchronization, and communication logic. This eliminates the need for specialized HPC knowledge and reduces the risk of errors, allowing domain scientists to focus on their research. With cuPyNumeric, large-scale simulations can scale efficiently across large HPC systems, enhancing productivity, reducing programming effort, and lowering development costs.