Troubleshooting And FAQs
General FAQs
Q: What does the following runtime warning imply?
WARN: IB HCA and GPU are not connected to a PCIe switch so InfiniBand
performance can be limited depending on the CPU generation
A: This warning is related to the HCA-to-GPU mapping of the platform. For more information, refer to the NVSHMEM_HCA_PE_MAPPING variable in Environment Variables.
Q: What does the following runtime error indicate?
NULL value could not find mpi library in environment.
A: This occurs if libmpi.so or libmpi_ibm.so is not present in the environment. Use the NVSHMEM_MPI_LIB_NAME variable, described in Environment Variables, to specify the name of the installed MPI library.
Q: What does the following runtime error indicate?
src/comm/transports/ibrc/ibrc.cpp:: NULL value mem registration failed.
A: This occurs if GPUDirect RDMA is not enabled on the platform, thereby preventing registration of cudaMalloc memory with the InfiniBand driver. This usually indicates that the nv_peer_mem kernel module is absent. When nv_peer_mem is installed, output from lsmod is similar to the following:
~$ lsmod | grep nv_peer_mem
nv_peer_mem            20480  0
ib_core               241664  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
nvidia              17596416  226 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
nv_peer_mem is available here: https://github.com/Mellanox/nv_peer_memory
Q: What does the following runtime error indicate?
src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed
A: This occurs when the ibverbs library is present on the system but cannot detect any InfiniBand devices. Make sure that the InfiniBand devices are available and in a working state.
Q: My application uses the CMake build system. Adding NVSHMEM to the build system breaks for a CMake version below 3.11. Why?
A: NVSHMEM requires device linking support, which was added in CMake 3.11.
Q: Why does a CMake build of my NVSHMEM application fail with version 3.12 but not with an earlier version?
A: A new CMake policy adds -pthread to the nvcc device linking step, which causes the linking failure. Before 3.12, the default policy did not add -pthread. For 3.12 and newer, add cmake_policy(SET CMP0074 OLD) to CMakeLists.txt.
Q: What CMake settings are needed to build CUDA or NVSHMEM applications?
A: Add the following to the CMake file, substituting the target GPU architecture in place of compute_70 and sm_70.
string(APPEND CMAKE_CUDA_FLAGS "-gencode arch=compute_70,code=sm_70")
Q: Why does my NVSHMEM Hydra job become non-responsive on Summit?
A: Summit requires the additional option --launcher ssh to be passed to nvshmrun on the command line.
Q: Can two PEs share the same GPU with NVSHMEM?
A: NVSHMEM assumes a 1:1 mapping of PEs to GPUs. NVSHMEM jobs launched with more PEs than available GPUs are not supported.
Q: What is the right way to use CUDA_VISIBLE_DEVICES with NVSHMEM?
A: When using CUDA_VISIBLE_DEVICES with NVSHMEM, all PEs should be passed the same value of CUDA_VISIBLE_DEVICES. Note that this may change in a future NVSHMEM version.
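The 1:1 mapping of PEs to GPUs and the shared CUDA_VISIBLE_DEVICES value can be illustrated with a minimal sketch in which each PE selects its GPU from its node-local PE index. It assumes an NVSHMEM release that provides the teams API (NVSHMEMX_TEAM_NODE); the device-count check and the error message are illustrative additions, not part of NVSHMEM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main() {
    nvshmem_init();

    // Index of this PE among the PEs running on the same node.
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);

    // Assumption: every PE on the node was given the same
    // CUDA_VISIBLE_DEVICES, so all PEs enumerate the same device list and
    // the node-local index picks a distinct GPU for each PE.
    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);
    if (mype_node >= ndevices) {
        // More PEs than GPUs on this node is not a supported configuration.
        fprintf(stderr, "PE %d: no GPU available for node-local PE %d\n",
                nvshmem_my_pe(), mype_node);
        nvshmem_global_exit(1);
    }
    cudaSetDevice(mype_node);

    printf("PE %d of %d uses GPU %d\n", nvshmem_my_pe(), nvshmem_n_pes(), mype_node);

    nvshmem_finalize();
    return 0;
}
```

If PEs were given different CUDA_VISIBLE_DEVICES values, the same device index could refer to different physical GPUs on different PEs, which is why this sketch relies on a common value.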
Prerequisite FAQs
Q: Does NVSHMEM require CUDA?
A: Yes. CUDA must be installed to use NVSHMEM. Refer to the installation guide for version requirements. NVSHMEM is a communication library intended for efficient data movement and synchronization between two or more GPUs. It is currently not intended for data movement that does not involve GPUs.
Q: Does NVSHMEM require MPI?
A: No. NVSHMEM applications without MPI dependencies can use NVSHMEM and be launched with the Hydra launcher. Hydra can be installed using the installation script included with NVSHMEM. This script installs the Hydra launcher binaries with the names nvshmrun and nvshmrun.hydra. An externally installed copy of the Hydra launcher can also be used, which typically installs the Hydra launcher binaries with the names mpiexec, mpirun, and mpiexec.hydra.
Q: My NVSHMEM job runs on NVIDIA Volta GPUs but hangs on NVIDIA Kepler GPUs. Why does this happen?
A: NVSHMEM synchronizing APIs inside CUDA kernels are only supported on NVIDIA Volta and newer GPUs.
Running NVSHMEM Programs FAQs
Q: I get a missing symbol error when I try to launch an NVSHMEM application using PMIx with the OpenMPI implementation of mpirun. How can I fix this?
A: There is a known incompatibility between the internal PMIx implementation in OpenMPI and the one used in NVSHMEM. Running an NVSHMEM application with OpenMPI using the internal PMIx will most likely result in the following error:
pmix_mca_base_component_repository_open: unable to open mca_gds_ds21:
perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
This can be worked around by compiling OpenMPI with an external implementation of PMIx using the --with-pmix={PATH_TO_PMIX} configure option.
Interoperability With Other Programming Models FAQs
Q: Can NVSHMEM be used in MPI applications?
A: Yes. NVSHMEM provides an initialization API that takes an MPI communicator as an attribute. Each MPI rank in the communicator becomes an OpenSHMEM PE. Currently, NVSHMEM has been tested with OpenMPI 4.0.0. In principle, other OpenMPI derivatives such as SpectrumMPI (available on Summit and Sierra) are also expected to work.
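As an illustration of the MPI-based initialization described above, the following minimal sketch bootstraps NVSHMEM from MPI_COMM_WORLD. It assumes the nvshmemx_init_attr interface with the NVSHMEMX_INIT_WITH_MPI_COMM flag; depending on the NVSHMEM release, the attribute structure may need additional initialization, so check the API reference for the installed version.

```cuda
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hand the MPI communicator to NVSHMEM through the init attributes.
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    // One GPU per PE, chosen by the node-local PE index.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    printf("MPI rank %d is NVSHMEM PE %d of %d\n",
           rank, nvshmem_my_pe(), nvshmem_n_pes());

    // Tear down NVSHMEM before MPI.
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```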
Q: Can NVSHMEM be used in OpenSHMEM applications?
A: Yes. NVSHMEM provides an initialization API that supports running NVSHMEM on top of an OpenMPI/OSHMEM job. Each OSHMEM PE maps 1:1 to an NVSHMEM PE. NVSHMEM has been tested with OpenMPI 4.0.0/OSHMEM. OpenMPI 3+/OSHMEM depends on UCX (NVSHMEM has been tested with UCX 1.4.0). The OpenMPI 4.0.0 installation must be configured with the --with-ucx flag to enable OpenSHMEM + NVSHMEM interoperability.
GPU-GPU Interconnection FAQs
Q: Can I use NVSHMEM to transfer data across GPUs on different sockets?
A: Yes, if there is an InfiniBand NIC accessible to GPUs on both sockets. Otherwise, NVSHMEM requires that all GPUs are P2P accessible.
Q: Can I use NVSHMEM to transfer data between P2P-accessible GPUs that are connected by PCIe?
A: Yes, NVSHMEM supports PCIe. However, when PCIe is used for P2P communication, the NVSHMEM atomic memory operations API requires either InfiniBand support or NVSHMEM’s UCX transport, which uses sockets for atomics when InfiniBand is absent.
Q: Can I use NVSHMEM to transfer data between GPUs on different hosts connected by InfiniBand?
A: Yes. NVSHMEM supports InfiniBand. Strided-RMA (shmem_iput/iget) operations are not supported over InfiniBand.
Q: Can I run NVSHMEM on a host without InfiniBand NICs?
A: Yes. Support on P2P platforms remains unchanged.
Q: Can I run NVSHMEM on a host with InfiniBand NICs where some NICs are disabled or configured in a non-InfiniBand mode?
A: Yes. See the Useful Environment Variables section for how to explicitly specify NIC affinity to PEs.
NVSHMEM API Usage FAQs
Q: What’s the difference between, say, nvshmemx_putmem_on_stream and nvshmemx_putmem_nbi_on_stream? It seems both are asynchronous to the host thread and ordered with respect to a given stream.
A: The function nvshmemx_putmem_nbi_on_stream is implemented in a more deferred way by not issuing the transfer immediately but making it wait on an event at the end of the stream. If there is another transfer in process at the same time (on another stream), bandwidth could be shared. If the application can avoid this, nvshmemx_putmem_nbi_on_stream gives the flexibility to express this intent to NVSHMEM. But NVSHMEM currently does not track activity on all CUDA streams. The current implementation records an event on the user provided stream, makes an NVSHMEM internal stream wait on the event, and then issues a put on the internal stream. If all nbi puts land on the same internal stream, they are serialized so that the bandwidth is used exclusively.
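To make the two variants concrete, here is a minimal sketch that issues one blocking and one non-blocking put on the same stream and then completes both with nvshmemx_quiet_on_stream. The buffer names, the transfer size, and the ring-style choice of peer are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main() {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int mype = nvshmem_my_pe();
    int peer = (mype + 1) % nvshmem_n_pes();

    // Symmetric buffers, each split into two halves (size is arbitrary).
    const size_t half = 1 << 20;
    char *src = (char *)nvshmem_malloc(2 * half);
    char *dst = (char *)nvshmem_malloc(2 * half);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(src, 1, 2 * half, stream);

    // Blocking variant: ordered on the stream; later work on this stream
    // may reuse the first half of src.
    nvshmemx_putmem_on_stream(dst, src, half, peer, stream);

    // Non-blocking variant: only initiates the transfer in stream order.
    nvshmemx_putmem_nbi_on_stream(dst + half, src + half, half, peer, stream);

    // Completes the outstanding puts issued by this PE, in stream order.
    nvshmemx_quiet_on_stream(stream);

    cudaStreamSynchronize(stream);
    nvshmem_barrier_all();
    printf("PE %d: both halves sent to PE %d\n", mype, peer);

    nvshmem_free(src);
    nvshmem_free(dst);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```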
Q: Can I issue multiple nvshmemx_barrier_all_on_stream on multiple streams concurrently and then cudaStreamSynchronize on each stream?
A: Multiple concurrent nvshmemx_barrier_all_on_stream / nvshmem_barrier_all calls are not valid. Only one barrier (or any other collective) among the same set of PEs can be in-flight at any given time. To use concurrent barriers among partially overlapping teams, syncneighborhood_kernel can be used as a template to implement a custom barrier. See multi-gpu-programming-models for an example of a custom barrier.
Q: Suppose there are in-flight nvshmemx_putmem_on_stream operations. Does nvshmem_barrier_all() ensure completion of the pending NVSHMEM operations on streams?
A: The nvshmem_barrier_all() operation does not ensure completion of the pending NVSHMEM operations on streams. The cudaStreamSynchronize function should be called before calling nvshmem_barrier_all.
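The ordering described in this answer can be sketched as follows: the stream is drained with cudaStreamSynchronize before the host calls nvshmem_barrier_all. The single-integer payload, the buffer names, and the ring-style peer are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main() {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    int *src = (int *)nvshmem_malloc(sizeof(int));
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(src, &mype, sizeof(int), cudaMemcpyHostToDevice, stream);

    // Stream-ordered put of this PE's id to the next PE in the ring.
    nvshmemx_putmem_on_stream(dst, src, sizeof(int), peer, stream);

    // First drain the stream, so the put has actually been executed ...
    cudaStreamSynchronize(stream);
    // ... then synchronize all PEs from the host.
    nvshmem_barrier_all();

    int received = -1;
    cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d (sender should be PE %d)\n",
           mype, received, (mype + npes - 1) % npes);

    nvshmem_free(src);
    nvshmem_free(dst);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```

Without the cudaStreamSynchronize call, the host could reach the barrier while the put is still queued on the stream.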
Q: Why is nvshmem_quiet necessary in the syncneighborhood_kernel?
A: It is required by nvshmem_barrier semantics. As stated in multi-gpu-programming-models, “nvshmem_barrier ensures that all previously issued stores and remote memory updates, including AMO and RMA operations, done by any of the PEs in the team on the default context are complete before returning.”
Q: If a kernel uses nvshmem_put_block instead of nvshmem_p, is nvshmem_quiet still required?
A: Yes. OpenSHMEM put semantics do not guarantee delivery of the data to the destination array on the remote PE when the put returns, so nvshmem_quiet is still required. For more information, see multi-gpu-programming-models.
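A minimal sketch of why the quiet matters: a block-cooperative put is followed by nvshmem_quiet before a flag is raised on the peer, so the flag cannot become visible before the data. The kernel name ring_put, the message size N, and the flag protocol are assumptions made for this example; nvshmemx_int_put_block is used as the typed form of the block-level put.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

#define N 256  // elements per message (arbitrary for this sketch)

__global__ void ring_put(int *dst, const int *src, int *flag, int peer) {
    // Every thread of the block takes part in the cooperative put.
    nvshmemx_int_put_block(dst, src, N, peer);
    __syncthreads();

    if (threadIdx.x == 0) {
        // The put may return before the data has reached the peer;
        // quiet completes it remotely before the flag is raised.
        nvshmem_quiet();
        nvshmem_int_p(flag, 1, peer);

        // Wait until the neighbor on the other side has raised our flag.
        nvshmem_int_wait_until(flag, NVSHMEM_CMP_EQ, 1);
    }
}

int main() {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    int *src = (int *)nvshmem_malloc(N * sizeof(int));
    int *dst = (int *)nvshmem_malloc(N * sizeof(int));
    int *flag = (int *)nvshmem_malloc(sizeof(int));

    int host[N];
    for (int i = 0; i < N; ++i) host[i] = mype;
    cudaMemcpy(src, host, sizeof(host), cudaMemcpyHostToDevice);
    cudaMemset(flag, 0, sizeof(int));
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // every flag is zeroed before any PE signals

    // The kernel waits on a remote PE, so it is launched collectively.
    void *args[] = {&dst, &src, &flag, &peer};
    nvshmemx_collective_launch((const void *)ring_put, dim3(1), dim3(N), args, 0, 0);
    cudaDeviceSynchronize();

    cudaMemcpy(host, dst, sizeof(host), cudaMemcpyDeviceToHost);
    printf("PE %d: dst[0] = %d (sender should be PE %d)\n",
           mype, host[0], (mype + npes - 1) % npes);

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_free(flag);
    nvshmem_finalize();
    return 0;
}
```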
Q: I use the host-side blocking API, nvshmemx_putmem_on_stream, on the same CUDA stream for transfers that I want delivered at the target in order. Is nvshmem_quiet required even though there is no non-blocking call and the transfers are issued in separate kernels?
A: In the current implementation, nvshmemx_putmem_on_stream includes a quiet. However, the OpenSHMEM specification only requires it to release the local buffer, not necessarily to deliver the data at the target.
Q: Is it sufficient to use a nvshmem_fence (instead of a nvshmem_quiet) in the above case if the target is the same PE?
A: In the current implementation, all messages to the same PE are delivered in the order they are received by the HCA, which follows the stream order. So, even nvshmem_fence is not required. These are not the semantics provided by the OpenSHMEM specification, however. Calling nvshmemx_putmem_on_stream on the same CUDA stream only ensures that the local buffers for the transfers will be released in the same order.
Q: When nvshmem_quiet is used inside a device kernel, is the quiet operation scoped within the stream the kernel is running on? In other words, does it ensure completion of all operations or only those issued to the same stream?
A: It ensures completion of all operations that are GPU-initiated. A nvshmem_quiet call on the device does not quiet in-flight operations from the host.
Debugging FAQs
NOTE: Ensure you follow the CUDA Best Practices Guide to ease debugging CUDA programs. For example, read Error Handling.
Q: Are there any hints for diagnosing a hang?
A: Check whether the application makes blocking CUDA calls on stream 0, such as cudaDeviceSynchronize or cudaMemcpy, especially in the iterative phase of the application. Stream 0 blocking calls in the initialization and finalization phases are usually safe. Check whether the priority of the user stream used for NVSHMEM _on_stream calls is explicitly set with cudaStreamCreateWithPriority. Check whether the determinism of the hang changes between single-node runs (all pairs of GPUs connected by NVLink or PCI-E only), single-node runs with GPUs on different sockets connected by InfiniBand loopback, and multi-node runs (GPUs connected by InfiniBand).
Q: How do I dump debugging information?
A: Refer to the runtime environment variables NVSHMEM_INFO, NVSHMEM_DEBUG, and NVSHMEM_DEBUG_FILE in Environment Variables.
Q: Why is the receive buffer not updated with remote data even after synchronization with a flag?
A: To synchronize with a flag, the application must use the nvshmem_wait_until or nvshmem_test API. A plain while loop or if condition that checks the flag value is not sufficient. NVSHMEM needs to perform a consistency operation to ensure that the data is visible to the GPU after synchronization on the flag value.
Miscellaneous FAQs
Q: Does pointer arithmetic work with shmem pointers? For example,
int* target = (int *) shmem_malloc(2 * sizeof(int));
shmem_int_p(target + 1, mype, peer);
A: Yes.
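As an illustration, the following minimal sketch uses the NVSHMEM spellings of the calls in the question (nvshmem_malloc and nvshmem_int_p) and writes to the second element of a symmetric array through pointer arithmetic. The kernel name and the ring-style peer are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void put_second_element(int *target, int mype, int peer) {
    // target + 1 is still a symmetric address: the same offset applies to
    // the peer's copy of the allocation.
    nvshmem_int_p(target + 1, mype, peer);
}

int main() {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    int mype = nvshmem_my_pe();
    int peer = (mype + 1) % nvshmem_n_pes();

    int *target = (int *)nvshmem_malloc(2 * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(target, 0, 2 * sizeof(int), stream);
    // All PEs finish zeroing before any PE writes remotely.
    nvshmemx_barrier_all_on_stream(stream);

    put_second_element<<<1, 1, 0, stream>>>(target, mype, peer);
    // Completes the puts and synchronizes the PEs, in stream order.
    nvshmemx_barrier_all_on_stream(stream);

    int host[2] = {-1, -1};
    cudaMemcpyAsync(host, target, sizeof(host), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("PE %d: target[0] = %d, target[1] = %d\n", mype, host[0], host[1]);

    nvshmem_free(target);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```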
Q: Can I avoid cudaDeviceSynchronize + MPI_Barrier to synchronize across multiple GPUs?
A: Yes, nvshmemx_barrier_all_on_stream with cudaStreamSynchronize can be called from the host thread. This gives better performance if multiple barrier synchronization events can happen before synchronizing with the host thread. Calling nvshmem_barrier_all from inside the CUDA kernel can be used for collective synchronization if the same CUDA kernel has other work to do after a barrier synchronization event. To synchronize some pairs of PEs rather than all of them, use pair-wise nvshmem_atomic_set calls by the initiator and nvshmem_wait_until or nvshmem_test calls by the target.
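The host-side pattern from this answer can be sketched as follows: several barriers are enqueued on a stream between kernels, and the host synchronizes only once at the end. The kernel local_step, the iteration count, and the counter buffer are placeholders invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Placeholder for the per-iteration work of a real application.
__global__ void local_step(int *counter) { *counter += 1; }

int main() {
    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Purely local data, so ordinary device memory is sufficient here.
    int *counter = nullptr;
    cudaMalloc(&counter, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(counter, 0, sizeof(int), stream);

    const int iterations = 10;
    for (int i = 0; i < iterations; ++i) {
        local_step<<<1, 1, 0, stream>>>(counter);
        // Enqueued in stream order; the host thread does not block here.
        nvshmemx_barrier_all_on_stream(stream);
    }

    // A single host synchronization covers all enqueued barriers.
    cudaStreamSynchronize(stream);

    int steps = 0;
    cudaMemcpy(&steps, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d completed %d synchronized steps\n", nvshmem_my_pe(), steps);

    cudaFree(counter);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```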
Q: How should I allocate memory for NVSHMEM?
A: See Memory Management for information on allocating symmetric memory.
Note that NVSHMEM requires the local and remote pointer to both be symmetric for communication with a remote peer connected by InfiniBand. If the remote peer is P2P accessible (PCI-E or NVLink), the local pointer can be obtained using cudaMalloc and is not required to be from the symmetric heap.
Q: Is there any example of a mini-application written using NVSHMEM?
A: Yes. The multi-GPU programming models GitHub repository contains an example Jacobi mini-application written using NVSHMEM.
Q: I am observing degraded performance when running NVSHMEM with NCCL support. How do I fix the performance?
A: NVSHMEM and NCCL both launch CPU proxy threads for communicating over IB. These threads can interfere with each other, leading to context switches and hence poor performance. For better performance, these threads should have dedicated hardware threads (or cores). The NCCL proxy thread is active only when NVSHMEM is using NCCL for collective communication. Try binding options such as --bind-to numa, --bind-to core, or --bind-to none to improve performance.