NVIDIA DOCA DPA All-to-all Application Guide

This guide describes the all-to-all collective operation example, accelerated using the DPA in the NVIDIA® BlueField®-3 DPU.

This reference application shows how the message passing interface (MPI) all-to-all collective can be accelerated on the Data Path Accelerator (DPA). In an MPI collective, all processes in the same job call the collective routine.

Given a communicator of n ranks, the application performs a collective operation in which every process sends the same amount of data to, and receives the same amount of data from, every process in the communicator (hence all-to-all).

This document describes how to run the all-to-all example using the DOCA DPA API.

All-to-all is an MPI method. MPI is a standardized and portable message-passing interface designed to function on parallel computing architectures. An MPI program is one in which several processes run in parallel.

[Figure: all-to-all system design diagram]

Each process in the diagram divides its local sendbuf into n blocks (4 in this example), each containing sendcount elements (4 in this example). Process i sends the k-th block of its local sendbuf to process k which places the data in the i-th block of its local recvbuf.
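
The block exchange described above can be modeled in a single process, without MPI or the DPA. The following C sketch is only an illustration: the function name simulate_all_to_all and the flat layout (one row of n * sendcount integers per rank) are assumptions made for this example and are not taken from the application's sources.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Single-process model of the all-to-all data movement (no MPI, no DPA).
 * sendbufs and recvbufs each hold n rows of n * sendcount ints, stored
 * contiguously: row r is the buffer owned by rank r.
 */
void simulate_all_to_all(const int *sendbufs, int *recvbufs, size_t n, size_t sendcount)
{
    assert(sendbufs != NULL && recvbufs != NULL);
    size_t row = n * sendcount; /* number of ints in one rank's buffer */

    for (size_t i = 0; i < n; i++) {        /* sending rank */
        for (size_t k = 0; k < n; k++) {    /* receiving rank */
            /* k-th block of rank i's sendbuf */
            const int *src = sendbufs + i * row + k * sendcount;
            /* i-th block of rank k's recvbuf */
            int *dst = recvbufs + k * row + i * sendcount;
            for (size_t e = 0; e < sendcount; e++)
                dst[e] = src[e];
        }
    }
}
```

For example, with n = 2 and sendcount = 2, the sendbufs {0, 1, 2, 3} (rank 0) and {10, 11, 12, 13} (rank 1) produce the recvbufs {0, 1, 10, 11} (rank 0) and {2, 3, 12, 13} (rank 1).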

Implementing the all-to-all method using DOCA DPA offloads the copying of the elements from the sendbuf to the recvbufs to the DPA, and leaves the CPU free to perform other computations.

The following diagram describes the differences between the host-based all-to-all and DPA all-to-all.

[Figure: host-based all-to-all compared with DPA all-to-all]

  • In DPA all-to-all, DPA threads perform the all-to-all and the CPU is free to do other computations

  • In host-based all-to-all, the CPU must still perform the all-to-all at some point and is not completely free for other computations

This application leverages the following DOCA library:

  • DOCA DPA

Refer to its programming guide for more information.

  • NVIDIA BlueField-3 DPU is required

  • The application can be run on the DPU or on the host

  • Open MPI version 4.1.5rc2 or greater (included in DOCA's installation)

Installation

Please refer to the NVIDIA DOCA Installation Guide for Linux for details on how to install BlueField-related software.

Prerequisites

MPI is used for the compilation and running of this application. Make sure that MPI is installed on your setup (openmpi is provided as part of the installation of doca-tools).

Warning

The installation also requires updating the LD_LIBRARY_PATH and PATH environment variables to include MPI. For example, if openmpi is installed under /usr/mpi/gcc/openmpi-4.1.7a1, update the environment variables as follows:

export PATH=/usr/mpi/gcc/openmpi-4.1.7a1/bin:${PATH}
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.7a1/lib:${LD_LIBRARY_PATH}


Application Execution

The DPA all-to-all application is provided in both source and binary forms. The binary is located under /opt/mellanox/doca/applications/dpa_all_to_all/bin/doca_dpa_all_to_all.

  1. Application usage instructions:

    Usage: doca_dpa_all_to_all [DOCA Flags] [Program Flags]

    DOCA Flags:
      -h, --help                        Print a help synopsis
      -v, --version                     Print program version information
      -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      -j, --json <path>                 Parse all command flags from an input json file

    Program Flags:
      -m, --msgsize <Message size>      The message size - the size of the sendbuf and recvbuf (in bytes). Must be in multiplies of integer size. Default is size of one integer times the number of processes.
      -d, --devices <IB device names>   IB devices names that supports DPA, separated by comma without spaces (max of two devices). If not provided then a random IB device will be chosen.

    Note

    To print this usage information to the command line, run the application with the -h (or --help) option:

    /opt/mellanox/doca/applications/dpa_all_to_all/bin/doca_dpa_all_to_all -h

    Note

    For additional information, please refer to section "Command Line Flags".

  2. CLI example for running the application on the host:

    Warning

    This is an MPI program, so use mpirun to run the application (with the -np flag to specify the number of processes to run).

    • The following runs the DPA all-to-all application with 8 processes using the default message size (the number of processes, which is 8, times the size of 1 integer) with a random InfiniBand device:

      mpirun -np 8 /opt/mellanox/doca/applications/dpa_all_to_all/bin/doca_dpa_all_to_all

    • The following runs the DPA all-to-all application with 8 processes, with 128 bytes as the message size, and with mlx5_0 and mlx5_1 as the InfiniBand devices:

      mpirun -np 8 /opt/mellanox/doca/applications/dpa_all_to_all/bin/doca_dpa_all_to_all -m 128 -d "mlx5_0,mlx5_1"

      Warning

      The application supports running with a maximum of 16 processes. If you try to run with more processes, an error is printed and the application exits.

  3. The application also supports a JSON-based deployment mode, in which all command-line arguments are provided through a JSON file:

    doca_dpa_all_to_all --json [json_file]

    For example:

    cd /opt/mellanox/doca/applications/dpa_all_to_all/bin
    ./doca_dpa_all_to_all --json ./dpa_all_to_all_params.json

    Warning

    Before execution, ensure that the JSON file contains the correct configuration parameters, especially the InfiniBand device identifiers.

Command Line Flags

| Flag Type | Short Flag | Long Flag/JSON Key | Description | JSON Content |
|---|---|---|---|---|
| General flags | h | help | Prints a help synopsis | N/A |
| General flags | v | version | Prints program version information | N/A |
| General flags | l | log-level | Sets the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (requires compilation with TRACE log level support) | "log-level": 60 |
| General flags | N/A | sdk-log-level | Sets the SDK log level for the program: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 | "sdk-log-level": 40 |
| General flags | j | json | Parses all command flags from an input JSON file | N/A |
| Program flags | m | msgsize | The message size, that is, the size of the sendbuf and recvbuf (in bytes). Must be a multiple of the integer size. The default is the size of 1 integer times the number of processes. Warning: in the JSON file, the value -1 is a placeholder to use the default size, which is only known at run time (because it depends on the number of processes). | "msgsize": -1 |
| Program flags | d | devices | InfiniBand device names that support DPA, separated by a comma without spaces (maximum of two devices). If NOT_SET, a random InfiniBand device is chosen. | "devices": "NOT_SET" |

Note

Refer to DOCA Arg Parser for more information regarding the supported flags and execution modes.
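
The msgsize rules stated in this guide (a multiple of the integer size, at least the number of processes times the integer size, and -1 selecting the run-time default) can be sketched as a small check. This is a hedged illustration rather than the application's own code: resolve_msgsize and MSGSIZE_USE_DEFAULT are hypothetical names introduced for this example, and returning 0 to signal an invalid size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder value shown in the JSON example; the macro name is hypothetical. */
#define MSGSIZE_USE_DEFAULT (-1)

/*
 * Hypothetical helper: returns the effective message size in bytes,
 * or 0 if the requested size violates the documented constraints.
 */
size_t resolve_msgsize(long requested, int num_ranks)
{
    assert(num_ranks > 0);
    size_t min_size = (size_t)num_ranks * sizeof(int);

    if (requested == MSGSIZE_USE_DEFAULT)
        return min_size;                      /* default: one int per process */
    if (requested <= 0)
        return 0;                             /* any other non-positive value is invalid */
    if ((size_t)requested % sizeof(int) != 0)
        return 0;                             /* not a multiple of the integer size */
    if ((size_t)requested < min_size)
        return 0;                             /* less than one int per process */
    return (size_t)requested;
}
```

With 8 processes, resolve_msgsize(-1, 8) yields 8 * sizeof(int) bytes, a request of 128 bytes is accepted as is, and a request of 129 bytes is rejected because it is not a multiple of the integer size.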


Troubleshooting

Refer to the NVIDIA DOCA Troubleshooting Guide for any issue encountered with the installation or execution of the DOCA applications.

In addition to providing the application in binary form, the installation also includes all of the application sources and compilation instructions so as to allow modifying the sources and recompiling the application. For more information about the applications, as well as development and compilation tips, refer to the DOCA Applications main guide.

The sources of the application can be found under the /opt/mellanox/doca/applications/dpa_all_to_all/src directory.

Recompiling All Applications

The applications are all defined under a single meson project, meaning that the default compilation will recompile all the DOCA applications.

To build all the applications together, run:

cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build

Note

doca_dpa_all_to_all is created under /tmp/build/dpa_all_to_all/src/.


Recompiling DPA All-to-all Application Only

To directly build only the all-to-all application:

cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
ninja -C /tmp/build

Note

doca_dpa_all_to_all is created under /tmp/build/dpa_all_to_all/src/.

Alternatively, one can set the desired flags in the meson_options.txt file instead of providing them in the compilation command line:

  1. Edit the following flags in /opt/mellanox/doca/applications/meson_options.txt:

    • Set enable_all_applications to false

    • Set enable_dpa_all_to_all to true

  2. Run the following compilation commands:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build
    ninja -C /tmp/build

    Note

    doca_dpa_all_to_all is created under /tmp/build/dpa_all_to_all/src/.

Troubleshooting

Please refer to the NVIDIA DOCA Troubleshooting Guide for any issue encountered with the compilation of the application.

  1. Initialize MPI.

    MPI_Init(&argc, &argv);

  2. Parse application arguments.

    1. Initialize arg parser resources and register DOCA general parameters.

      doca_argp_init();

    2. Register the application's parameters.

      register_all_to_all_params();

    3. Parse the arguments.

      doca_argp_start();

      1. The msgsize parameter is the size of the sendbuf and recvbuf (in bytes). It must be a multiple of the integer size and at least the number of processes times the integer size.

      2. The devices_param parameter holds the names of the InfiniBand devices to use (they must support DPA). It can include up to two device names.

    4. Only the first process (rank 0) parses the parameters; it then broadcasts them to the rest of the processes.

  3. Check and prepare the needed resources for the all_to_all call:

    1. Check the number of processes (maximum is 16).

    2. Check the msgsize. It must be a multiple of the integer size and at least the number of processes times the integer size.

    3. Allocate the sendbuf and recvbuf according to msgsize.

  4. Prepare the resources required to perform the all-to-all method using DOCA DPA:

    1. Initialize DOCA DPA context:

      1. Open DOCA DPA device (DOCA device that supports DPA).

        open_dpa_device();

      2. Create DOCA DPA context using the opened device.

        doca_dpa_create();

    2. Create the required events for the all-to-all: one completion event for the kernel launch (update location DPA, wait location CPU) and one kernel event per process (update location remote, wait location DPA).

      create_dpa_a2a_events() {
          /* Completion event: updated by the DPA, waited on by the CPU */
          doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_ACCESS_CPU, DOCA_DPA_EVENT_WAIT_DEFAULT, &comp_event, 0);
          /* Kernel events: updated by a remote process, waited on by the DPA */
          for (i = 0; i < resources->num_ranks; i++)
              doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_REMOTE, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_WAIT_DEFAULT, &(kernel_events[i]), 0);
      }

    3. Create DOCA DPA worker (for the endpoints).

      doca_dpa_worker_create();

    4. Prepare DOCA DPA endpoints:

      1. Create one DOCA DPA endpoint per process/rank.

        for (i = 0; i < resources->num_ranks; i++)
            doca_dpa_ep_create();

      2. Connect the local process' endpoints to the other processes' endpoints.

        connect_dpa_a2a_endpoints();

      3. Export the endpoints to DOCA DPA device endpoints (so they can be used by the DPA) and copy them to DPA heap memory.

        for (int i = 0; i < resources->num_ranks; i++) {
            result = doca_dpa_ep_dev_export();
            doca_dpa_mem_alloc();
            doca_dpa_h2d_memcpy();
        }

    5. Prepare the memory required to perform the all-to-all method using DOCA DPA. This includes creating memory handlers for the sendbuf and recvbuf, getting the other processes' recvbufs handlers, and copying these memory handlers and their remote keys and the events' handlers to the DPA heap memory.

      prepare_dpa_a2a_memory();

  5. Launch the alltoall_kernel using DOCA DPA kernel launch with all the required parameters:

    1. Every MPI rank launches a kernel with up to MAX_NUM_THREADS threads. This example defines MAX_NUM_THREADS as 16.

    2. Launch alltoall_kernel using kernel_launch.

      doca_dpa_kernel_launch();

    3. Using the doca_dpa_dev_put_signal_nb() function, every process should copy the relevant sendbuf to the correct recvbuf (according to the process' rank) for every process (including the current process itself).

      for (i = thread_rank; i < num_ranks; i += num_threads)
          doca_dpa_dev_put_signal_nb();

    4. Wait until the alltoall_kernel has finished.

      doca_dpa_event_wait_until();

      Warning

      Add an MPI barrier after waiting for the event to make sure that all of the processes have finished executing the alltoall_kernel.

      MPI_Barrier();

      After the alltoall_kernel is finished, the recvbufs of all the processes contain the expected output of the all-to-all method.

  6. Destroy the a2a_resources:

    1. Free all the DOCA DPA memories.

      doca_dpa_mem_free();

    2. Unregister all the DOCA DPA host memories.

      doca_dpa_mem_unregister();

    3. Destroy all the DOCA DPA endpoints.

      doca_dpa_ep_destroy();

    4. Destroy the DOCA DPA worker.

      doca_dpa_worker_destroy();

    5. Destroy all the DOCA DPA events.

      doca_dpa_event_destroy();

    6. Destroy the DOCA DPA context.

      doca_dpa_destroy();

    7. Close the DOCA device.

      doca_dev_close();
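
The strided loop used inside the kernel in step 5 (for (i = thread_rank; i < num_ranks; i += num_threads)) spreads the destination ranks across the kernel's threads. The host-side C sketch below makes no DOCA calls and only records which thread would handle each rank; assign_ranks_to_threads and the owner array are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Models the strided loop
 *   for (i = thread_rank; i < num_ranks; i += num_threads)
 * for every thread of a kernel. owner[i] receives the index of the thread
 * that would issue the put for destination rank i.
 * Returns the total number of ranks assigned.
 */
size_t assign_ranks_to_threads(int *owner, size_t num_ranks, size_t num_threads)
{
    assert(owner != NULL && num_threads > 0);
    size_t assigned = 0;

    for (size_t thread_rank = 0; thread_rank < num_threads; thread_rank++) {
        for (size_t i = thread_rank; i < num_ranks; i += num_threads) {
            owner[i] = (int)thread_rank; /* rank i is handled by this thread */
            assigned++;
        }
    }
    return assigned;
}
```

For 8 ranks and 3 threads, thread 0 handles ranks 0, 3, and 6, thread 1 handles ranks 1, 4, and 7, and thread 2 handles ranks 2 and 5, so every rank is covered exactly once.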

  • /opt/mellanox/doca/applications/dpa_all_to_all/src

  • /opt/mellanox/doca/applications/dpa_all_to_all/bin/dpa_all_to_all_params.json

© Copyright 2023, NVIDIA. Last updated on Feb 9, 2024.