Queue Pair (QP) Specific APIs¶
NVSHMEM provides a set of queue pair (QP) specific communication APIs that enable fine-grained control over communication resources for advanced use cases. Queue pairs are communication endpoints that can be created and managed independently, allowing applications to separate different communication streams and potentially achieve better performance through reduced contention and improved load balancing.
Overview¶
Queue pairs in NVSHMEM provide an abstraction for managing separate communication channels. Each queue pair represents an independent communication resource that can be used for RMA operations, signaling operations, and memory ordering operations. By using multiple queue pairs, applications can:
- Reduce contention: Different communication patterns can use separate queue pairs to avoid contention on shared resources.
- Improve load balancing: Multiple queue pairs can be assigned to different threads or workloads to balance communication resources.
- Enable independent ordering: Operations on different queue pairs can proceed independently, allowing for more flexible ordering semantics.
- Support specialized communication patterns: Applications with distinct communication phases can use different queue pairs for each phase.
QP Handle¶
Queue pairs are identified by a handle of type nvshmemx_qp_handle_t.
This handle must be obtained through the nvshmemx_qp_create function
and is used as a parameter to all QP-specific communication operations.
The handle is an opaque type that should not be manipulated directly by
user code.
Special QP Handle Values¶
In addition to handles created by nvshmemx_qp_create, the following
special values from the nvshmemx_qp_handle_index_t enumeration can
be used with communication operations (RMA and signaling):
NVSHMEMX_QP_DEFAULT: Use the default queue pairs. When this value is passed to a QP-specific communication API, it has the same effect as calling the corresponding non-QP-specific API. This allows applications to switch between custom and default queue pairs without code changes.NVSHMEMX_QP_ANY: Use any available queue pair, either a default queue pair or one of the custom created queue pairs. The implementation will select an appropriate queue pair for the operation. This can be useful for load balancing when the application does not require operations to use specific queue pairs.
These special values provide flexibility in queue pair selection, allowing applications to write code that can work with or without custom queue pairs, or to allow the implementation to optimize queue pair selection.
Special PE Values for Synchronization¶
For synchronization operations (fence and quiet), in addition to
specifying a particular PE number, the following special value from the
nvshmem_pe_index_t enumeration can be used:
NVSHMEMX_PE_ALL(aliasNVSHMEMX_PE_ANY): Synchronize operations to all PEs, not just a single target PE. When this value is used with fence operations, ordering is ensured for operations to all PEs. When used with quiet operations, completion is ensured for operations to all PEs.
Special QP Values for Synchronization¶
For synchronization operations (fence and quiet), in addition to the special values available for communication operations, the following additional behavior is available:
NVSHMEMX_QP_DEFAULT: For synchronization operations, this synchronizes all default queue pairs, ensuring ordering or completion for all operations on default communication resources.NVSHMEMX_QP_ALL: For synchronization operations, this synchronizes all queue pairs (both default and custom created), ensuring ordering or completion for operations across all communication resources. Note thatNVSHMEMX_QP_ALLis an alias forNVSHMEMX_QP_ANY.
These special values enable powerful synchronization patterns. For
example, using NVSHMEMX_QP_ALL with NVSHMEMX_PE_ALL in a quiet
operation ensures that all operations on all queue pairs to all PEs are
complete and visible, providing a global completion guarantee across all
communication resources.
API Categories¶
The QP-specific APIs are organized into the following categories:
- QP Creation: APIs for creating and managing queue pairs (NVSHMEMX_QP_CREATE).
- QP RMA Operations: Remote memory access operations that target specific queue pairs, including put/get operations in various forms (scalar, vector, strided, non-blocking) and thread group variants (QP Remote Memory Access).
- QP Signal Operations: Signaling operations that use specific queue pairs for point-to-point synchronization, including put-with-signal variants (QP Signaling Operations).
- QP Memory Ordering: Memory ordering operations (fence and quiet) that operate on specific queue pairs to ensure ordering and completion of operations (QP Memory Ordering).
Device-Side Only¶
With the exception of nvshmemx_qp_create, which is a host-side
initialization function, all QP-specific communication APIs are
device-side operations intended to be called from CUDA kernels. These
APIs provide the same functionality as their non-QP counterparts but
allow targeting specific queue pair resources.
Ordering Semantics¶
Operations issued on different queue pairs are independent and do not
have any inherent ordering relationship. To ensure ordering between
operations on different queue pairs, explicit synchronization (such as
nvshmemx_qp_quiet) must be used. Operations on the same queue pair
follow the same ordering semantics as their non-QP counterparts, as
described in Section Memory Ordering.
Thread Safety¶
Queue pair handles can be safely shared and used by multiple threads
within a CUDA kernel. However, the thread safety guarantees for
QP-specific operations are the same as for non-QP operations. Thread
group variants (e.g., _warp and _block suffixes) must be called
collectively by all threads in the corresponding thread group.
Transport Support and Compatibility¶
The QP-specific APIs are designed to be portable across different system configurations and transport implementations. Applications can safely call QP-specific APIs in all environments, including configurations where custom queue pairs are not supported or not needed.
In systems where application PEs reside within a single NVLink domain and communication occurs exclusively over NVLink, custom queue pairs are not necessary for performance. Similarly, not all remote transport plugin implementations support the creation of additional queue pairs beyond the default resources. Currently, custom queue pair creation is supported by the IBRC, IBDEVX, and IBGDA (with RC queue pairs) transport plugins.
When nvshmemx_qp_create is called in an environment where custom
queue pairs are not supported—either because the transport does not
support them or because all PEs are within a single NVLink domain—the
function will succeed but return NVSHMEMX_QP_DEFAULT as the handle
value for all newly created queue pairs. This graceful fallback behavior
ensures that applications remain portable without requiring conditional
compilation or runtime checks. Operations performed using these handles
will execute correctly using the default communication resources,
maintaining functional correctness while allowing applications to be
written with QP-specific APIs throughout.
This design allows developers to write code that takes advantage of custom queue pairs when available for improved performance and load balancing, while automatically falling back to default behavior on systems where custom queue pairs are not supported or not beneficial.
NVSHMEMX_QP_CREATE¶
-
int
nvshmemx_qp_create(int num_qps, nvshmemx_qp_handle_t **out_qp_array)¶
- num_qps [IN]
- The number of queue pairs to create.
- out_qp_array [OUT]
- Pointer to an array of queue pair handles that will be allocated and populated by this function.
Description
The nvshmemx_qp_create function creates num_qps queue pair
handles and returns them in a newly allocated array. The function
allocates memory for the array of handles and stores the pointer in
out_qp_array.
Returns
Returns 0 on success. Returns a non-zero value on failure.
Notes
Collective Operation: This is a collective operation that must be
called by all PEs in the world team. All PEs must call this function
with the same num_qps value.
PE-Independent Handles: Each queue pair handle returned in the array can be used to communicate with any PE. The handles are not specific to a particular target PE. For example, a single queue pair handle can be used for RMA operations targeting PE 0, PE 1, PE 2, etc.
Per-PE Synchronization: Although a single queue pair handle can target multiple PEs, synchronization operations (fence and quiet) on that queue pair can be performed on a per-PE basis. This allows fine-grained control over ordering and completion guarantees for operations to specific PEs on specific queue pairs.
Memory Management: The library allocates the array pointed to by
out_qp_array. Users must not allocate memory for this array before
calling the function. Users are responsible for freeing this memory
after NVSHMEM finalization. Do not free the array before finalization as
the handles must remain valid.
This function must be called on the host after NVSHMEM initialization and before any device-side QP-specific operations are performed.
The number of queue pairs that can be created may be limited by system resources. Applications should handle creation failures gracefully.
Queue pairs created by this function are independent of the default communication context and do not interfere with operations performed without QP handles. Operations using custom queue pair handles and operations using the default context (non-QP APIs) operate on separate resources.
QP Remote Memory Access¶
The following routines provide RMA operations on specific queue pairs. These operations have the same semantics as their non-QP counterparts but target a specific queue pair resource for improved control over communication channels.
NVSHMEMX_QP_PUT¶
-
__device__ void
nvshmemx_qp_TYPENAME_put(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_block(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Symmetric address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Device address of the data object containing the data to be copied.
The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific put routines perform the same operation as their non-QP
counterparts but use a specific queue pair resource. The routines return
after the data has been copied out of the source array on the local
PE. The delivery of data words into the data object on the destination
PE may occur in any order. Furthermore, two successive put routines on
the same queue pair may deliver data out of order unless a call to
nvshmemx_qp_fence is introduced between the two calls.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_put_warpnvshmemx_qp_TYPENAME_put_block
Returns
None.
NVSHMEMX_QP_P¶
-
__device__ void
nvshmemx_qp_TYPENAME_p(TYPE *dest, TYPE value, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Symmetric address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - value [IN]
- The value to be transferred to
dest. The type ofvalueshould match that implied in the SYNOPSIS section. - pe [IN]
- The number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific scalar put routines provide a very low latency put capability for single elements of most basic types using a specific queue pair resource. These routines perform the same operation as their non-QP counterparts but use the specified queue pair.
As with nvshmemx_qp_put, these routines start the remote transfer
and may return before the data is delivered to the remote PE. Use
nvshmemx_qp_quiet to force completion of all remote Put transfers
on the specified queue pairs.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Returns
None.
NVSHMEMX_QP_GET¶
-
__device__ void
nvshmemx_qp_TYPENAME_get(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_get_block(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_get_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Device address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Symmetric address of the source data object. The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific get routines perform the same operation as their non-QP
counterparts but use a specific queue pair resource. The routines return
after the data has been copied into the dest array on the local PE.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_get_warpnvshmemx_qp_TYPENAME_get_block
Returns
None.
NVSHMEMX_QP_G¶
-
__device__ TYPE
nvshmemx_qp_TYPENAME_g(const TYPE *source, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- source [IN]
- Symmetric address of the source data object. The type of
sourceshould match that implied in the SYNOPSIS section. - pe [IN]
- The number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific scalar get routines provide a very low latency get capability for single elements of most basic types using a specific queue pair resource. These routines perform the same operation as their non-QP counterparts but use the specified queue pair.
The routine returns the value retrieved from the remote PE.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Returns
Returns the value retrieved from the remote PE. The return type matches the TYPE specified in the function name.
NVSHMEMX_QP_PUT_NBI¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_nbi_block(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_nbi_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Symmetric address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Device address of the data object containing the data to be copied.
The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific nonblocking put routines perform the same operation as their non-QP counterparts but use a specific queue pair resource. The routines initiate the transfer and return immediately without waiting for the operation to complete.
The completion of the transfer can be ensured by calling
nvshmemx_qp_quiet on the appropriate queue pairs.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_put_nbi_warpnvshmemx_qp_TYPENAME_put_nbi_block
Returns
None.
NVSHMEMX_QP_GET_NBI¶
-
__device__ void
nvshmemx_qp_TYPENAME_get_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_get_nbi_block(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_get_nbi_warp(TYPE *dest, const TYPE *source, size_t nelems, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Device address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Symmetric address of the source data object. The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific nonblocking get routines perform the same operation as their non-QP counterparts but use a specific queue pair resource. The routines initiate the transfer and return immediately without waiting for the operation to complete.
The completion of the transfer can be ensured by calling
nvshmemx_qp_quiet on the appropriate queue pairs.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_get_nbi_warpnvshmemx_qp_TYPENAME_get_nbi_block
Returns
None.
QP Signaling Operations¶
The following routines provide signaling operations on specific queue pairs. These operations allow for point-to-point synchronization using specific queue pair resources.
NVSHMEMX_QP_SIGNAL_OP¶
-
__device__ void
nvshmemx_qp_signal_op(uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
- sig_addr [OUT]
- Symmetric address of the signal data object on the remote PE.
- signal [IN]
- The value to be used for the signal update.
- sig_op [IN]
- The signal operator to be applied. Valid operators are
NVSHMEM_SIGNAL_SETandNVSHMEM_SIGNAL_ADD. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The nvshmemx_qp_signal_op routine performs a signal operation on the
signal data object at sig_addr on PE pe using the specified
queue pair. The operation performed depends on the sig_op parameter:
NVSHMEM_SIGNAL_SET: The signal data object is set to the value ofsignal.NVSHMEM_SIGNAL_ADD: The value ofsignalis added to the signal data object.
The signal operation is performed atomically with respect to other signal operations, signal-fetch operations, and point-to-point synchronization routines that access the same signal data object.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Returns
None.
NVSHMEMX_QP_PUT_SIGNAL¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal_block(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal_warp(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Symmetric address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Device address of the data object containing the data to be copied.
The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - sig_addr [OUT]
- Symmetric address of the signal data object on the remote PE.
- signal [IN]
- The value to be used for the signal update.
- sig_op [IN]
- The signal operator to be applied. Valid operators are
NVSHMEM_SIGNAL_SETandNVSHMEM_SIGNAL_ADD. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific put-with-signal routines perform the same operation as
their non-QP counterparts but use a specific queue pair resource. The
routines copy the data from source to dest on PE pe, and
after the data transfer completes, perform a signal operation on the
signal data object at sig_addr.
The signal operation is guaranteed to occur after the data transfer completes. The signal operation is performed atomically with respect to other signal operations, signal-fetch operations, and point-to-point synchronization routines that access the same signal data object.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_put_signal_warpnvshmemx_qp_TYPENAME_put_signal_block
Returns
None.
NVSHMEMX_QP_PUT_SIGNAL_NBI¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal_nbi(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal_nbi_block(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
-
__device__ void
nvshmemx_qp_TYPENAME_put_signal_nbi_warp(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, nvshmemx_qp_handle_t qp_index)¶
where TYPE is one of the standard RMA types and has a corresponding TYPENAME specified by Table stdrmatypes.
- dest [OUT]
- Symmetric address of the destination data object. The type of
destshould match that implied in the SYNOPSIS section. - source [IN]
- Device address of the data object containing the data to be copied.
The type of
sourceshould match that implied in the SYNOPSIS section. - nelems [IN]
- Number of elements in the
destandsourcearrays. - sig_addr [OUT]
- Symmetric address of the signal data object on the remote PE.
- signal [IN]
- The value to be used for the signal update.
- sig_op [IN]
- The signal operator to be applied. Valid operators are
NVSHMEM_SIGNAL_SETandNVSHMEM_SIGNAL_ADD. - pe [IN]
- PE number of the remote PE.
- qp_index [IN]
- Queue pair handle identifying the communication resource to use. This
can be a handle obtained from
nvshmemx_qp_create, or one of the special valuesNVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ANY.
Description
The QP-specific nonblocking put-with-signal routines perform the same
operation as their non-QP counterparts but use a specific queue pair
resource. The routines initiate the data transfer from source to
dest on PE pe, followed by a signal operation on the signal data
object at sig_addr. The routine returns immediately without waiting
for the operations to complete.
The signal operation is guaranteed to occur after the data transfer completes. The signal operation is performed atomically with respect to other signal operations, signal-fetch operations, and point-to-point synchronization routines that access the same signal data object.
The completion of both the data transfer and signal operation can be
ensured by calling nvshmemx_qp_quiet on the appropriate queue pairs.
Operations on different queue pairs are independent and do not have any inherent ordering relationship.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_TYPENAME_put_signal_nbi_warpnvshmemx_qp_TYPENAME_put_signal_nbi_block
Returns
None.
QP Memory Ordering¶
The following routines provide memory ordering operations on specific queue pairs. These operations ensure ordering and completion of operations issued on the specified queue pairs.
NVSHMEMX_QP_FENCE¶
-
__device__ void
nvshmemx_qp_fence(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
-
__device__ void
nvshmemx_qp_fence_block(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
-
__device__ void
nvshmemx_qp_fence_warp(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
- pe [IN]
- PE number of the target PE for which ordering is to be ensured. This
can be a specific PE number, or the special value
NVSHMEMX_PE_ALL(aliasNVSHMEMX_PE_ANY) to ensure ordering for all PEs. - qp_handle [IN]
- Pointer to an array of queue pair handles. This can also be a pointer
to one of the special values
NVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ALLto fence all default queue pairs or all queue pairs (including custom created ones), respectively. - num_qps [IN]
- Number of queue pairs in the
qp_handlearray. When using special QP values (NVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ALL), this should be set to 1.
Description
The nvshmemx_qp_fence routine ensures ordering of delivery of
operations on symmetric data objects issued on the specified queue pairs
to the target PE(s). All operations on symmetric data objects issued to
PE pe on the specified queue pairs prior to the call to
nvshmemx_qp_fence are guaranteed to be delivered before any
subsequent operations on symmetric data objects to the same PE(s) on the
same queue pairs.
nvshmemx_qp_fence guarantees order of delivery, not completion. It
does not guarantee order of delivery of nonblocking Get or values
fetched by nonblocking AMO routines.
The routine operates on all queue pairs specified in the qp_handle
array. Ordering is ensured for operations on these queue pairs to the
specified PE(s).
Special PE Values:
When pe is set to NVSHMEMX_PE_ALL (or its alias
NVSHMEMX_PE_ANY), ordering is ensured for operations to all PEs, not
just a single target PE.
Special QP Values:
NVSHMEMX_QP_DEFAULT: Fence all default queue pairs. Pass a pointer to this value asqp_handleand setnum_qpsto 1.NVSHMEMX_QP_ALL: Fence all queue pairs (both default and custom created). Pass a pointer to this value asqp_handleand setnum_qpsto 1. This ensures ordering across all communication resources.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_fence_warp: Collective fence operation for a warp.nvshmemx_qp_fence_block: Collective fence operation for a thread block.
Returns
None.
Notes
Thread group variants provide potential performance benefits through cooperative synchronization.
NVSHMEMX_QP_QUIET¶
-
__device__ void
nvshmemx_qp_quiet(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
-
__device__ void
nvshmemx_qp_quiet_block(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
-
__device__ void
nvshmemx_qp_quiet_warp(int pe, nvshmemx_qp_handle_t *qp_handle, int num_qps)¶
- pe [IN]
- PE number of the target PE for which completion is to be ensured.
This can be a specific PE number, or the special value
NVSHMEMX_PE_ALL(aliasNVSHMEMX_PE_ANY) to ensure completion for operations to all PEs. - qp_handle [IN]
- Pointer to an array of queue pair handles. This can also be a pointer
to one of the special values
NVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ALLto quiet all default queue pairs or all queue pairs (including custom created ones), respectively. - num_qps [IN]
- Number of queue pairs in the
qp_handlearray. When using special QP values (NVSHMEMX_QP_DEFAULTorNVSHMEMX_QP_ALL), this should be set to 1.
Description
The nvshmemx_qp_quiet routine ensures completion of all operations
on symmetric data objects issued to PE(s) pe on the specified queue
pairs. All operations on symmetric data objects targeting the specified
PE(s) on the given queue pairs are guaranteed to be complete and visible
when nvshmemx_qp_quiet returns.
The routine operates on all queue pairs specified in the qp_handle
array. Completion is ensured for operations on these queue pairs to the
specified PE(s).
On systems with only NVLink, all operations on symmetric data objects
are guaranteed to be complete and visible to the target PE(s) when
nvshmemx_qp_quiet returns. On systems with both NVLink and
InfiniBand, visibility is only guaranteed at the destination PE(s).
Special PE Values:
When pe is set to NVSHMEMX_PE_ALL (or its alias
NVSHMEMX_PE_ANY), completion is ensured for operations to all PEs,
not just a single target PE.
Special QP Values:
NVSHMEMX_QP_DEFAULT: Quiet all default queue pairs. Pass a pointer to this value asqp_handleand setnum_qpsto 1.NVSHMEMX_QP_ALL: Quiet all queue pairs (both default and custom created). Pass a pointer to this value asqp_handleand setnum_qpsto 1. This ensures completion of operations across all communication resources.
Thread Group Variants:
Thread group variants with _warp and _block suffixes are also
available and must be called collectively by all active threads in the
warp or block, respectively:
nvshmemx_qp_quiet_warp: Collective quiet operation for a warp.nvshmemx_qp_quiet_block: Collective quiet operation for a thread block.
Returns
None.
Notes
Thread group variants provide potential performance benefits through cooperative synchronization but require collective invocation by all threads in the group.