****************************************************
Communication abstraction library API and data types
****************************************************

The communication abstraction library is a helper module for the cuSolverMp library that sets up communication between different GPUs. The cuSolverMp API accepts a ``cal_comm_t`` communicator object and requires it to be created prior to any cuSolverMp call. At this time, the communicator object ``cal_comm_t`` can be initialized only from an MPI communicator, and some of the communication routines use the underlying MPI library, so MPI must remain initialized for the duration of the library usage. At this time, the library's communications support only the case where each participating process uses a single GPU and each participating GPU is used by exactly one process.

----

=======================================
Communication abstraction library setup
=======================================

Communications are currently based on the system (shared memory), CUDA runtime (interprocess memory copies and CUDA events), NCCL (for collective calls), and MPI (as a fallback when no other option is available) libraries. This module initializes the underlying structures required for communication, and setup returns an error if there are any issues. You can use MPI diagnostics (refer to your MPI distribution documentation to set the verbosity level for MPI), NCCL diagnostics (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug), and :ref:`this module's diagnostics <module-usage-label>` to understand the possible failure reason.

The cuSolverMp library tries to use the most efficient way to communicate between GPUs; however, user-provided environment variables and settings (e.g., MCA MPI parameters or NCCL environment variables) can affect the functionality and performance of the communication module.

----

.. _module-usage-label:

==========================
Using communication module
==========================

There are a few environment variables that can change the communication module behavior (in addition to the underlying MPI and NCCL environment variables).

.. csv-table::
   :header: "Variable", "Description"
   :widths: auto

   "CAL_LOG_LEVEL", "Verbosity level of the communication module, where `0` means no output and `6` means the maximum amount of detail. Default: `0`."
   "CAL_ALLOW_SET_PEER_ACCESS", "If `1`, allows enabling peer-to-peer access between pairs of GPUs on the same node. Default: `0`."

----

============================================
Communication abstraction library data types
============================================

.. _calError_t-label:

------------------
:code:`calError_t`
------------------

Return values from communication abstraction library APIs. The values are described in the table below:

.. csv-table::
   :header: "Value", "Description"
   :widths: auto

   "CAL_OK", "Success."
   "CAL_ERROR", "Generic error."
   "CAL_ERROR_INVALID_PARAMETER", "Invalid parameter passed to the interface function."
   "CAL_ERROR_INTERNAL", "Internal error."
   "CAL_ERROR_CUDA", "Error in CUDA runtime or driver API."
   "CAL_ERROR_MPI", "Error in MPI call."
   "CAL_ERROR_IPC", "Error in system IPC communication call."
   "CAL_ERROR_NOT_SUPPORTED", "Requested configuration or parameters are not supported."

----

.. _cal_comm_t-label:

------------------
:code:`cal_comm_t`
------------------

| The `cal_comm_t` handle stores the device endpoint and resources related to communication.
| It must be created and destroyed using the :ref:`cal_comm_create_distr() <cal_comm_create_distr-label>` and :ref:`cal_comm_destroy() <cal_comm_destroy-label>` functions, respectively.
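The snippet below is a minimal sketch of the communicator lifecycle using the API described in the next section. It assumes one MPI rank per GPU; the header name ``cal.h`` and the round-robin device selection are illustrative assumptions, not requirements of the library.

.. code-block:: cpp

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cal.h>  // header name is an assumption; use your distribution's CAL header

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // One GPU per process: make the chosen device current before creating
        // the communicator, since local_device must match the active context.
        int device_count = 0;
        cudaGetDeviceCount(&device_count);
        int local_device = rank % device_count;  // illustrative round-robin mapping
        cudaSetDevice(local_device);

        // Collective call: every rank of the MPI communicator must participate.
        MPI_Comm mpi_comm = MPI_COMM_WORLD;
        cal_comm_t comm;
        calError_t status = cal_comm_create_distr(&mpi_comm, local_device, &comm);
        if (status != CAL_OK) {
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        // ... cuSolverMp calls that take `comm` go here ...

        cal_comm_destroy(comm);  // release communicator resources before MPI_Finalize
        MPI_Finalize();
        return 0;
    }

With this one-rank-per-GPU layout, the program would typically be launched with as many MPI ranks as there are GPUs, e.g. ``mpirun -np <ngpus> ./app``.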
----

=====================================
Communication abstraction library API
=====================================

.. _cal_comm_create_distr-label:

-----------------------------
:code:`cal_comm_create_distr`
-----------------------------

.. code-block:: cpp

    calError_t cal_comm_create_distr(
        void*       mpi_comm,
        int         local_device,
        cal_comm_t* new_comm)

| Single communicator initialization; uses MPI to create communication channels between GPUs. Only one GPU may be used per MPI rank, and different MPI ranks can't use the same GPU. This API is a collective call with respect to the host: all participating processes synchronize in this API.
| The device with ID `local_device` from the CUDA Runtime enumeration will be assigned to the new communicator and used for all subsequent operations on this communicator. The total number of processes in the communicator will be equal to the number of ranks in `mpi_comm`.

.. csv-table::
   :header: "Parameter", "Description"
   :widths: auto

   "mpi_comm", "Pointer to the MPI communicator that will be used for communicator setup."
   "local_device", "Local device ID that will be assigned to the new communicator. Should be the same as the device of the active context."
   "new_comm", "Pointer where the new communicator handle will be stored."

See :ref:`calError_t <calError_t-label>` for the description of the return value.

----

.. _cal_comm_destroy-label:

------------------------
:code:`cal_comm_destroy`
------------------------

.. code-block:: cpp

    calError_t cal_comm_destroy(
        cal_comm_t comm)

| Releases resources associated with the provided communicator handle.

.. csv-table::
   :header: "Parameter", "Description"
   :widths: auto

   "comm", "Communicator handle to release."

See :ref:`calError_t <calError_t-label>` for the description of the return value.

----

.. _cal_stream_sync-label:

-----------------------
:code:`cal_stream_sync`
-----------------------

.. code-block:: cpp

    calError_t cal_stream_sync(
        cal_comm_t   comm,
        cudaStream_t stream)

| Blocks the calling thread until all outstanding device operations in `stream` have finished, including outstanding communication operations submitted to this stream. Use this function in place of `cudaStreamSynchronize` to progress possible outstanding communication operations.

.. csv-table::
   :header: "Parameter", "Description"
   :widths: auto

   "comm", "Communicator handle."
   "stream", "CUDA stream to synchronize."

See :ref:`calError_t <calError_t-label>` for the description of the return value.

----

.. _cal_get_comm_size-label:

-------------------------
:code:`cal_get_comm_size`
-------------------------

.. code-block:: cpp

    calError_t cal_get_comm_size(
        cal_comm_t comm,
        int*       size)

| Retrieves the number of processing elements in the provided communicator.

.. csv-table::
   :header: "Parameter", "Description"
   :widths: auto

   "comm", "Communicator handle."
   "size", "Number of processing elements."

See :ref:`calError_t <calError_t-label>` for the description of the return value.

----

.. _cal_get_rank-label:

--------------------
:code:`cal_get_rank`
--------------------

.. code-block:: cpp

    calError_t cal_get_rank(
        cal_comm_t comm,
        int*       rank)

| Retrieves the processing element rank assigned to the caller in the communicator (0-based).

.. csv-table::
   :header: "Parameter", "Description"
   :widths: auto

   "comm", "Communicator handle."
   "rank", "Rank ID of the calling process."

See :ref:`calError_t <calError_t-label>` for the description of the return value.
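As a usage illustration, the sketch below continues the lifecycle example above: `comm` is an already-created communicator, and the work enqueued on the stream is left as a placeholder. It queries the communicator layout and then drains a stream that may carry outstanding communication.

.. code-block:: cpp

    // Continuing the earlier sketch: `comm` is an initialized cal_comm_t.
    int rank = 0;
    int size = 0;
    cal_get_rank(comm, &rank);       // 0-based rank of this process
    cal_get_comm_size(comm, &size);  // total number of participating processes

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue cuSolverMp work that uses `comm` on `stream` ...

    // Use cal_stream_sync instead of cudaStreamSynchronize so that any
    // outstanding communication submitted to `stream` is progressed.
    cal_stream_sync(comm, stream);

    cudaStreamDestroy(stream);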