Installing NVSHMEM Language Bindings for Python#
Note
NVSHMEM4Py requires a working NVSHMEM installation. Refer to the NVSHMEM installation guide for instructions on how to install NVSHMEM.
Installing NVSHMEM4Py from Released Objects#
NVSHMEM4Py is available on both PyPI and Conda-Forge. The easiest way to install NVSHMEM4Py is through these officially supported release channels.
To install NVSHMEM4Py from PyPI, enter the following commands:
virtualenv nvshmem4py-env # Create a virtual environment
source nvshmem4py-env/bin/activate # Activate the virtual environment
pip install nvshmem4py # Install NVSHMEM4Py
To install NVSHMEM4Py from Conda-Forge, enter the following commands:
conda install -c nvidia nvshmem4py # Install NVSHMEM4Py
Installing NVSHMEM4Py from Source#
To install NVSHMEM4Py from source, first follow the instructions to download the NVSHMEM source code from the NVSHMEM download page.
After you download the source code, you must build NVSHMEM4Py. This involves building .so files from the Cython source code in the nvshmem4py/nvshmem/bindings/ directory and packaging the NVSHMEM Python source along with those objects.
As with the NVSHMEM installation, NVSHMEM4Py is built using the cmake build system. By default, NVSHMEM4Py is built when NVSHMEM is built. To build NVSHMEM4Py separately, we provide the following cmake targets:
# Run from the root of the NVSHMEM source directory
cmake -S . -B build
make build_nvshmem4py_wheels # This will build NVSHMEM4Py objects for all supported CUDA versions and all supported Python versions
make build_nvshmem4py_wheel_3.9 # This will build NVSHMEM4Py objects for all supported CUDA versions and Python version CPython 3.9
make build_nvshmem4py_wheel_cu12_3.9 # This will build NVSHMEM4Py objects for CUDA 12.x and Python version CPython 3.9
After you run the above commands, the Python objects are placed in <path to build directory>/dist/. There are three types of files:
nvshmem4py_cu12-<version>-cp312-cp312-linux_<CPU Arch>.whl: A wheel for CUDA 12.x and CPython 3.12
nvshmem4py_cu12-<version>-cp312-cp312-none-manylinux-<manylinux tag>_<CPU Arch>.whl: A Manylinux-compliant wheel for CUDA 12.x and CPython 3.12
nvshmem4py_cu12-<version>.tar.gz: A Python source distribution tarball for CUDA 12.x. This object is not specific to a Python version and can be used with any Python version 3.9 or later.
You can use pip to install these built objects. Wheels are platform-specific, Python-version-specific, and self-contained, so they can be installed without any compilation. When pip install is run on a source tarball, the Cython code is compiled to .so files at install time. Compiling the NVSHMEM4Py source code requires the CUDA headers to be accessible to the compiler: the path to the CUDA include directory must be included in the compiler’s search path (e.g. -I/usr/local/cuda/include, export CPPFLAGS="-I/usr/local/cuda/include", export CPATH="/usr/local/cuda/include:$CPATH", or similar).
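As a concrete sketch, the built objects might be installed as follows (the filenames and CUDA include path below are examples; the actual names depend on your build, CUDA version, Python version, and CPU architecture):
# Install a prebuilt wheel; no compilation is needed
pip install <path to build directory>/dist/nvshmem4py_cu12-<version>-cp312-cp312-linux_x86_64.whl
# Or install from the source tarball; the Cython code is compiled at install time,
# so the CUDA headers must be visible to the compiler
export CPPFLAGS="-I/usr/local/cuda/include"
pip install <path to build directory>/dist/nvshmem4py_cu12-<version>.tar.gz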
NVSHMEM4Py’s dependencies are:
CUDA 11.x or CUDA 12.x (a version compatible with your NVSHMEM installation)
NVSHMEM 3.3 or later
A Python 3.9 or later interpreter
Several Python packages:
Cython >= 0.29.24
nvidia-nvshmem-cu12
numpy >= 1.26.0
cuda-python >= 12.8.0
cuda.core >= 0.2.0
Additionally, there are some optional Python dependencies:
MPI and MPI4Py >= 4.0.3 (for MPI support)
CuPy >= 13.4.1 (for interoperability with CuPy arrays)
Torch >= 2.6.0 (for interoperability with PyTorch tensors)
Note
NVSHMEM4Py is tested with OpenMPI >=4.0.5. Other MPI implementations may work, but are not officially supported.
Using NVSHMEM4Py in Your Applications#
Launching NVSHMEM4Py Programs#
NVSHMEM4Py supports the same launch methods as NVSHMEM. Refer to NVSHMEM Launch Methods for more information.
For more information on initializing NVSHMEM4Py, refer to the NVSHMEM4Py API documentation.
Using NVSHMEM in Your Python Program with NVSHMEM4Py#
Add the following import to your program:
import nvshmem.core as nvshmem
Choose a method of launching and initializing NVSHMEM and NVSHMEM4Py.
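For example, a minimal sketch of MPI-based initialization, assuming mpi4py is installed and the program is started with an MPI launcher (the same calls appear in the “Hello World” example below):
import nvshmem.core
from cuda.core.experimental import Device, system
from mpi4py import MPI

# Select a GPU based on the local rank (assumes one process per GPU)
local_rank = MPI.COMM_WORLD.Get_rank() % system.num_devices
dev = Device(local_rank)
dev.set_current()

# Initialize NVSHMEM4Py over the MPI communicator
nvshmem.core.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")
print(f"Hello from PE {nvshmem.core.my_pe()}")

# Clean up
nvshmem.core.finalize()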
Running Performance Tests#
NVSHMEM4Py ships with performance tests for host collective operations. They are similar to the performance tests for the NVSHMEM C and C++ bindings in the perftest/host/coll directory.
To run the performance tests, you must have a working NVSHMEM installation. Please refer to the NVSHMEM installation guide for instructions on how to install NVSHMEM.
If you built the NVSHMEM library with NVSHMEM_MPI_SUPPORT=1, set the environment variables CUDA_HOME, NVSHMEM_HOME, and MPI_HOME to build the NVSHMEM performance tests:
CUDA_HOME=<path to supported CUDA installation>
NVSHMEM_HOME=<path to directory where NVSHMEM is installed>
MPI_HOME=<path to MPI installation>
If you built NVSHMEM with MPI and OpenSHMEM support (NVSHMEM_MPI_SUPPORT=1 and NVSHMEM_SHMEM_SUPPORT=1), you can build perftest/ without SHMEM interoperability by setting the environment variable NVSHMEM_SHMEM_SUPPORT to 0. By default, performance tests are installed under perftest/perftest_install. To install to a different path, set NVSHMEM_PERFTEST_INSTALL to point to the correct path.
The configuration options NVSHMEM_MPI_SUPPORT and NVSHMEM_OPENSHMEM_SUPPORT must be set to the same values as when NVSHMEM and the NVSHMEM perftests were built.
Update LD_LIBRARY_PATH to point to $CUDA_HOME/lib64, $MPI_HOME/lib, and $NVSHMEM_HOME/lib.
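For example, the setup above might look like the following sketch (all paths are placeholders; substitute the locations of your local installations):
export CUDA_HOME=/usr/local/cuda   # path to a supported CUDA installation
export NVSHMEM_HOME=/opt/nvshmem   # path to the directory where NVSHMEM is installed
export MPI_HOME=/opt/openmpi       # path to the MPI installation
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$MPI_HOME/lib:$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH"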
Assuming Hydra is installed under HYDRA_HOME, run performance tests as NVSHMEM jobs, hybrid MPI+NVSHMEM jobs, or hybrid OpenSHMEM+NVSHMEM jobs with the following commands (using the reduction_on_stream.py performance test as an example).
NVSHMEM job using Hydra (PMI-1)
# Make sure that NVSHMEM host library is in your LD_LIBRARY_PATH, or set LD_PRELOAD to point at it.
$HYDRA_HOME/bin/nvshmrun -n <up to number of P2P or InfiniBand NIC accessible GPUs> python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
NVSHMEM job using Slurm
srun -n <up to number of P2P or InfiniBand NIC accessible GPUs> python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Hybrid MPI/NVSHMEM job
$MPI_HOME/bin/mpirun -n <up to number of GPUs accessible by P2P or InfiniBand NIC> -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Hybrid OpenSHMEM/NVSHMEM job
$MPI_HOME/bin/oshrun -n <GPUs> -x USE_SHMEM_IN_TEST=1 python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Where <GPUs> is the number of GPUs to use, up to the number of GPUs accessible by P2P or InfiniBand.
“Hello World” Example#
Save the following code as nvshmem_hello_world.py:
import cupy
import nvshmem.core
from cuda.core.experimental import Device, system
from mpi4py import MPI

local_rank_per_node = MPI.COMM_WORLD.Get_rank() % system.num_devices
dev = Device(local_rank_per_node)
dev.set_current()
stream = dev.create_stream()

nvshmem.core.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")

arr_src = nvshmem.core.array((2, 2), dtype="float32")
arr_dst = nvshmem.core.array((2, 2), dtype="float32")
arr_dst[:] = 0
arr_src[:] = local_rank_per_node + 1

# Perform a sum reduction from arr_src to arr_dst across all PEs in TEAM_WORLD (an AllReduce)
nvshmem.core.reduce(nvshmem.core.Teams.TEAM_WORLD, arr_dst, arr_src, "sum", stream=stream)
stream.sync()

# Print dst, src after the collective
print(f"Dest after collective from PE {nvshmem.core.my_pe()}:", arr_dst)
print(f"Src after collective from PE {nvshmem.core.my_pe()}:", arr_src)

# Free buffers, finalize NVSHMEM
nvshmem.core.free_array(arr_src)
nvshmem.core.free_array(arr_dst)
nvshmem.core.finalize()
Run the nvshmem_hello_world.py sample with one of the following commands:
When running on one host with two GPUs (connected by PCIe, NVLink, or InfiniBand):
$HYDRA_HOME/bin/nvshmrun -n 2 -ppn 2 python3 nvshmem_hello_world.py
When running on two hosts with one GPU per host, connected by InfiniBand:
$HYDRA_HOME/bin/nvshmrun -n 2 -ppn 1 --hosts hostname1,hostname2 python3 nvshmem_hello_world.py
For more examples, please refer to the NVSHMEM4Py examples.