Examples
========

Source code for the examples described in this section is available in the
``examples`` folder of the NVSHMEM package.

Attribute-Based Initialization Example
--------------------------------------

The following code shows an MPI version of the simple shift program that was
explained in The NVSHMEM Programming Model. It shows the use of the NVSHMEM
attribute-based initialization API, where an existing MPI communicator can be
used to set up NVSHMEM.

.. literalinclude:: examples/init-attrib.cu
   :language: c

Collective Launch Example
-------------------------

The following code shows an example implementation of a single ring-based
reduction in which multiple iterations of the code, including computation,
communication, and synchronization, are expressed as a single kernel. The
example also demonstrates the use of NVSHMEM collective launch, which is
required when the NVSHMEM synchronization API is used from inside the CUDA
kernel. The example has no MPI dependency; NVSHMEM can be used both to port
existing MPI applications and to develop new applications.

.. literalinclude:: examples/collective-launch.cu
   :language: c

On-Stream Example
-----------------

The following example shows how ``nvshmemx_*_on_stream`` functions can be used
to enqueue a SHMEM operation onto a CUDA stream for execution in stream order.
Specifically, the example shows the following:

* How a collective SHMEM reduction operation can be made to wait on a
  preceding kernel in the stream.
* How a kernel can be made to wait for the communication result of a previous
  collective SHMEM reduction operation.

The example shows one use case for relieving the CPU of control over GPU
compute and communication.

.. literalinclude:: examples/on-stream.cu
   :language: c

Threadgroup Example
-------------------

The example in this section shows how ``nvshmemx_collect32_block`` can be used
to leverage the threads in a block to accelerate a SHMEM collect operation
when all threads in the block depend on the result of a preceding
communication operation. In this instance, partial vector sums are computed on
different PEs, and a SHMEM collect operation is used to obtain the complete
sum across PEs.

.. literalinclude:: examples/thread-group.cu
   :language: c

Put on Block Example
--------------------

In the example below, every thread in block 0 calls ``nvshmemx_float_put_block``.
Alternatively, every thread could call ``nvshmem_float_p``, but
``nvshmem_float_p`` has the disadvantage that, when the destination GPU is
connected via InfiniBand, one RMA message is issued for every element, which
can be detrimental to performance. Using ``nvshmem_float_put`` instead has the
disadvantage that, when the destination GPU is P2P-connected, a single thread
copies the entire data to the destination GPU, whereas
``nvshmemx_float_put_block`` can leverage all the threads in the block to copy
the data in parallel to the destination GPU.

.. literalinclude:: examples/put-block.cu
   :language: c
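For reference, a minimal sketch of the kernel-side comparison described above
is shown below. This is not the packaged ``put-block.cu`` example; the kernel
names, the element count, and the choice of peer are assumptions made for
illustration.

.. code-block:: cuda

   #include <nvshmem.h>
   #include <nvshmemx.h>

   /* Per-element puts: each thread issues its own put, so an
    * InfiniBand-connected peer receives one RMA message per element. */
   __global__ void send_elementwise(float *dest, const float *source,
                                    int nelems, int peer) {
       for (int i = threadIdx.x; i < nelems; i += blockDim.x) {
           nvshmem_float_p(dest + i, source[i], peer);
       }
   }

   /* Block-scoped put: all threads of block 0 call the API collectively,
    * and NVSHMEM distributes the copy across the participating threads. */
   __global__ void send_block(float *dest, const float *source,
                              int nelems, int peer) {
       if (blockIdx.x == 0) {
           nvshmemx_float_put_block(dest, source, nelems, peer);
       }
   }

As with the other ``nvshmemx_*_block`` APIs, ``nvshmemx_float_put_block`` is
expected to be called collectively by all threads in the block, with the same
arguments on every thread.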
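A possible host-side driver for the sketch above is outlined next. It is again
an illustrative assumption rather than the shipped example, and it omits
device selection and error checking for brevity. Because the kernel performs
no device-side synchronization, a regular kernel launch is sufficient here;
collective launch would be needed only if it did.

.. code-block:: cuda

   int main(void) {
       nvshmem_init();
       int mype = nvshmem_my_pe();
       int npes = nvshmem_n_pes();
       int peer = (mype + 1) % npes;    /* ring neighbor, chosen for illustration */

       const int nelems = 1024;         /* illustrative transfer size */
       /* Symmetric destination buffer; the source can be ordinary local memory. */
       float *dest = (float *) nvshmem_malloc(nelems * sizeof(float));
       float *source;
       cudaMalloc(&source, nelems * sizeof(float)); /* left uninitialized in this sketch */

       send_block<<<1, 256>>>(dest, source, nelems, peer);
       cudaDeviceSynchronize();
       nvshmem_barrier_all();           /* ensure the put has completed on all PEs */

       cudaFree(source);
       nvshmem_free(dest);
       nvshmem_finalize();
       return 0;
   }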