Installing NVSHMEM Language Bindings for Python#
Note
NVSHMEM4Py requires a working NVSHMEM installation. Refer to the NVSHMEM installation guide for instructions on how to install NVSHMEM.
Installing NVSHMEM4Py from Released Objects#
NVSHMEM4Py is available on both PyPI and Conda-Forge. The easiest way to install NVSHMEM4Py is through these officially supported release channels.
To install NVSHMEM4Py from PyPI, enter the following commands:
virtualenv nvshmem4py-env # Create a virtual environment
source nvshmem4py-env/bin/activate # Activate the virtual environment
pip install nvshmem4py # Install NVSHMEM4Py
To install NVSHMEM4Py from Conda-Forge, enter the following commands:
conda install -c nvidia nvshmem4py # Install NVSHMEM4Py
Installing NVSHMEM4Py from Source#
To install NVSHMEM4Py from source, first follow the instructions to download the NVSHMEM source code from the NVSHMEM download page.
After you download the source code, you must build NVSHMEM4Py. This involves building .so files from the Cython source code in the nvshmem4py/nvshmem/bindings/ directory and packaging the NVSHMEM Python source along with those objects.
As with the NVSHMEM installation, NVSHMEM4Py is built using the cmake build system. By default, NVSHMEM4Py is built when NVSHMEM is built. To build NVSHMEM4Py separately, we provide the following cmake targets:
# Run from the root of the NVSHMEM source directory
cmake -S . -B build
make build_nvshmem4py_wheels # This will build NVSHMEM4Py objects for all supported CUDA versions and all supported Python versions
make build_nvshmem4py_wheel_3.9 # This will build NVSHMEM4Py objects for all supported CUDA versions and Python version CPython 3.9
make build_nvshmem4py_wheel_cu12_3.9 # This will build NVSHMEM4Py objects for CUDA 12.x and Python version CPython 3.9
After you run the above commands, the Python objects are placed in <path to build directory>/dist/. There are three types of files:
nvshmem4py_cu12-<version>-cp312-cp312-linux_<CPU Arch>.whl: A wheel for CUDA 12.x and CPython 3.12
nvshmem4py_cu12-<version>-cp312-cp312-none-manylinux-<manylinux tag>_<CPU Arch>.whl: A Manylinux-compliant wheel for CUDA 12.x and CPython 3.12
nvshmem4py_cu12-<version>.tar.gz: A Python source distribution tarball for CUDA 12.x. This object is not specific to a Python version and can be used with any Python version 3.9 or later.
You can use pip to install these built objects. Wheels are platform-specific, Python-version-specific, and self-contained, so they can be installed without any compilation. When pip install is run on a source tarball, the Cython code is compiled to .so files at install time. Compiling the NVSHMEM4Py source code requires the CUDA headers to be accessible to the compiler: the path to the CUDA include directory must be included in the compiler’s search path (e.g. -I/usr/local/cuda/include, export CPPFLAGS="-I/usr/local/cuda/include", export CPATH="/usr/local/cuda/include:$CPATH", or similar).
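As a concrete sketch, the built objects might be installed as follows (the filenames and CUDA include path below are examples; the actual names depend on your build, CUDA version, Python version, and CPU architecture):
# Install a prebuilt wheel; no compilation is needed
pip install <path to build directory>/dist/nvshmem4py_cu12-<version>-cp312-cp312-linux_x86_64.whl
# Or install from the source tarball; the Cython code is compiled at install time,
# so the CUDA headers must be visible to the compiler
export CPPFLAGS="-I/usr/local/cuda/include"
pip install <path to build directory>/dist/nvshmem4py_cu12-<version>.tar.gz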
NVSHMEM4Py’s dependencies are:
CUDA 11.x or CUDA 12.x (a version compatible with your NVSHMEM installation)
NVSHMEM 3.3 or later
A Python 3.9 or later interpreter
Several Python packages:
Cython >= 0.29.24
nvidia-nvshmem-cu12
numpy >= 1.26.0
cuda-python >= 12.8.0
cuda.core >= 0.2.0
Additionally, there are some optional Python dependencies:
MPI and MPI4Py >= 4.0.3 (for MPI support)
CuPy >= 13.4.1 (for interoperability with CuPy arrays)
Torch >= 2.6.0 (for interoperability with PyTorch tensors)
Note
NVSHMEM4Py is tested with OpenMPI >=4.0.5. Other MPI implementations may work, but are not officially supported.
Using NVSHMEM4Py in Your Applications#
Launching NVSHMEM4Py Programs#
NVSHMEM4Py supports the same launch methods as NVSHMEM. Refer to NVSHMEM Launch Methods for more information.
For more information on initializing NVSHMEM4Py, refer to the NVSHMEM4Py API documentation.
Using NVSHMEM in Your Python Program with NVSHMEM4Py#
Add the following import to your program:
import nvshmem.core as nvshmem
Choose a method of launching and initializing NVSHMEM and NVSHMEM4Py.
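For example, a minimal sketch of MPI-based initialization, assuming mpi4py is installed and the program is started with an MPI launcher (the same calls appear in the “Hello World” example below):
import nvshmem.core
from cuda.core.experimental import Device, system
from mpi4py import MPI

# Select a GPU based on the local rank (assumes one process per GPU)
local_rank = MPI.COMM_WORLD.Get_rank() % system.num_devices
dev = Device(local_rank)
dev.set_current()

# Initialize NVSHMEM4Py over the MPI communicator
nvshmem.core.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")
print(f"Hello from PE {nvshmem.core.my_pe()}")

# Clean up
nvshmem.core.finalize()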
Running Performance Tests#
NVSHMEM4Py ships with performance tests for host collective operations. They are similar to the performance tests for the NVSHMEM C and C++ bindings in the perftest/host/coll directory.
To run the performance tests, you must have a working NVSHMEM installation. Please refer to the NVSHMEM installation guide for instructions on how to install NVSHMEM.
If you built the NVSHMEM library with NVSHMEM_MPI_SUPPORT=1, set the environment variables CUDA_HOME, NVSHMEM_HOME, and MPI_HOME to build the NVSHMEM performance tests:
CUDA_HOME=<path to supported CUDA installation>
NVSHMEM_HOME=<path to directory where NVSHMEM is installed>
MPI_HOME=<path to MPI installation>
If you built NVSHMEM with MPI and OpenSHMEM support (NVSHMEM_MPI_SUPPORT=1 and NVSHMEM_SHMEM_SUPPORT=1), you can build perftest/ without SHMEM interoperability by setting the environment variable NVSHMEM_SHMEM_SUPPORT to 0. By default, performance tests are installed under perftest/perftest_install. To install to a different path, set NVSHMEM_PERFTEST_INSTALL to point to the correct path.
The configuration options NVSHMEM_MPI_SUPPORT and NVSHMEM_OPENSHMEM_SUPPORT must be set to the same values as when NVSHMEM and the NVSHMEM perftests were built.
Update LD_LIBRARY_PATH to point to $CUDA_HOME/lib64, $MPI_HOME/lib, and $NVSHMEM_HOME/lib.
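For example, the setup above might look like the following sketch (all paths are placeholders; substitute the locations of your local installations):
export CUDA_HOME=/usr/local/cuda   # path to a supported CUDA installation
export NVSHMEM_HOME=/opt/nvshmem   # path to the directory where NVSHMEM is installed
export MPI_HOME=/opt/openmpi       # path to the MPI installation
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$MPI_HOME/lib:$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH"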
Assuming Hydra is installed under HYDRA_HOME, run performance tests as NVSHMEM jobs, hybrid MPI+NVSHMEM jobs, or hybrid OpenSHMEM+NVSHMEM jobs with the following commands (using the reduction_on_stream.py performance test as an example).
NVSHMEM job using Hydra (PMI-1)
# Make sure that NVSHMEM host library is in your LD_LIBRARY_PATH, or set LD_PRELOAD to point at it.
$HYDRA_HOME/bin/nvshmrun -n <up to number of P2P or InfiniBand NIC accessible GPUs> python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
NVSHMEM job using Slurm
srun -n <up to number of P2P or InfiniBand NIC accessible GPUs> python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Hybrid MPI/NVSHMEM job
$MPI_HOME/bin/mpirun -n <up to number of GPUs accessible by P2P or InfiniBand NIC> -x NVSHMEMTEST_USE_MPI_LAUNCHER=1 python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Hybrid OpenSHMEM/NVSHMEM job
$MPI_HOME/bin/oshrun -n <GPUs> -x USE_SHMEM_IN_TEST=1 python3 NVSHMEM4PY_INSTALL/perftest/reduction_on_stream.py
Where <GPUs> is the number of GPUs to use, up to the number of GPUs accessible by P2P or InfiniBand.
“Hello World” Example#
Save the following code as nvshmem_hello_world.py:
import cupy
import nvshmem.core
from cuda.core.experimental import Device, system
from mpi4py import MPI

local_rank_per_node = MPI.COMM_WORLD.Get_rank() % system.num_devices
dev = Device(local_rank_per_node)
dev.set_current()
stream = dev.create_stream()

nvshmem.core.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")

arr_src = nvshmem.core.array((2, 2), dtype="float32")
arr_dst = nvshmem.core.array((2, 2), dtype="float32")
arr_dst[:] = 0
arr_src[:] = local_rank_per_node + 1

# Perform a sum reduction from arr_src to arr_dst across all PEs in TEAM_WORLD (an AllReduce)
nvshmem.core.reduce(nvshmem.core.Teams.TEAM_WORLD, arr_dst, arr_src, "sum", stream=stream)
stream.sync()

# Print dst, src after the collective
print(f"Dest after collective from PE {nvshmem.core.my_pe()}:", arr_dst)
print(f"Src after collective from PE {nvshmem.core.my_pe()}:", arr_src)

# Free buffers, finalize NVSHMEM
nvshmem.core.free_array(arr_src)
nvshmem.core.free_array(arr_dst)
nvshmem.core.finalize()
Run the nvshmem_hello_world.py sample with one of the following commands:
When running on one host with two GPUs (connected by PCIe, NVLink, or InfiniBand):
$HYDRA_HOME/bin/nvshmrun -n 2 -ppn 2 python3 nvshmem_hello_world.py
When running on two hosts with one GPU per host, connected by InfiniBand:
$HYDRA_HOME/bin/nvshmrun -n 2 -ppn 1 --hosts hostname1,hostname2 python3 nvshmem_hello_world.py
For more examples, please refer to the NVSHMEM4Py examples.