Running VMA

NVIDIA Messaging Accelerator (VMA) Documentation Rev 9.8.40

This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.

Before running a user application, you must set the LD_PRELOAD environment variable to the libvma.so library. For further information, please refer to the VMA User Manual.

Example:

$ LD_PRELOAD=libvma.so sockperf server -i 11.4.3.3

Warning

If LD_PRELOAD is set to libvma.so without a path (as in the example above), libvma.so is loaded from a known library path of your distribution's OS. Otherwise, it is loaded from the specified path.
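When the benchmark is launched from a script, the same preload can be applied by passing a modified environment to the child process. The following is a minimal Python sketch under that assumption; the helper name `vma_env` is hypothetical and not part of VMA. It only builds the environment, so it runs on a machine without VMA installed.

```python
import os


def vma_env(library="libvma.so", base_env=None):
    """Return a copy of the environment with LD_PRELOAD set to `library`.

    `library` may be a bare name (resolved by the dynamic loader from its
    known library paths) or a full path to the shared object.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = library
    return env


# Usage (assumes sockperf and libvma.so are installed on this machine):
#   import subprocess
#   subprocess.run(["sockperf", "server", "-i", "11.4.3.3"], env=vma_env())
```

Copying the environment rather than modifying `os.environ` keeps the preload limited to the launched application.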

When the application starts, VMA prints a header message before the application's own output.

VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------

The output will always show:

  • The VMA version

  • The application’s name (in the above example: Cmd Line: sockperf server -i 11.4.3.3)

The appearance of the VMA header indicates that the VMA library is loaded with your application.
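The header check can also be automated. The sketch below assumes the application's console output has already been captured as text (which stream VMA logs to is not stated here), and it matches only the two header fields shown above. The function name `parse_vma_header` is hypothetical.

```python
import re


def parse_vma_header(output):
    """Return (version, cmd_line) if a VMA header is present, else None."""
    version = re.search(r"VMA INFO: VMA_VERSION: (\S+)", output)
    if version is None:
        # No header found: the library was probably not preloaded.
        return None
    cmd = re.search(r"VMA INFO: Cmd Line: (.*)", output)
    return version.group(1), (cmd.group(1).strip() if cmd else None)


# Placeholder header copied from the example above.
SAMPLE_HEADER = """\
VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
"""

print(parse_vma_header(SAMPLE_HEADER))
```

A return value of `None` means no header was found in the captured text.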

To run an application with VMA as a non-root user:

  1. Check whether the linker (ld) can find the libvma library.

    ld -lvma --verbose

  2. Set the set-user-ID (setuid) bit on the libvma files and on sysctl to enforce user ownership.

    sudo chmod u+s /usr/lib64/libvma*
    sudo chmod u+s /sbin/sysctl

  3. Grant the CAP_NET_RAW and CAP_NET_ADMIN capabilities to the application.

    sudo setcap cap_net_raw,cap_net_admin+ep /usr/bin/sockperf

  4. Launch the application as a non-root user.

    LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
    LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t10

Prerequisites

  • Install sockperf – a tool for network performance measurement
    This can be done by either

  • Two machines: one serves as the server and the other as the client

    • Management interfaces configured with IP addresses so that the machines can ping each other

    • Physical installation of an NVIDIA® NIC in your machines

  • Your system must recognize the NVIDIA® NIC. To verify that it does, run:

    lspci | grep Mellanox

    Output example:

    $ lspci | grep Mellanox
    82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
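The NIC check above can be scripted when several machines need to be verified. This is a small sketch that filters already-captured lspci text the same way `grep Mellanox` does; the function name `mellanox_devices` is hypothetical, and the sample text is the output example from this section.

```python
def mellanox_devices(lspci_output):
    """Return (pci_address, description) for each line mentioning Mellanox."""
    devices = []
    for line in lspci_output.splitlines():
        if "Mellanox" in line:
            # The PCI address is the first whitespace-delimited field.
            address, _, description = line.strip().partition(" ")
            devices.append((address, description.strip()))
    return devices


SAMPLE_LSPCI = """\
82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
"""

for address, description in mellanox_devices(SAMPLE_LSPCI):
    print(address, "->", description)
```

An empty result means no line of the lspci text mentions Mellanox.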

Kernel Performance

Kernel Performance Server Side

On the first machine run:

$ sockperf server -i 11.4.3.3

Server-side example output:

sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)


Kernel Performance Client Side

On the second machine run:

$ sockperf ping-pong -t 4 -i 11.4.3.3

Client-side example output:

sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat= 6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation = 20.484
sockperf: ---> percentile 99.999 = 17.732
sockperf: ---> percentile 99.990 = 9.364
sockperf: ---> percentile 99.900 = 8.491
sockperf: ---> percentile 99.000 = 7.963
sockperf: ---> percentile 90.000 = 6.975
sockperf: ---> percentile 75.000 = 6.831
sockperf: ---> percentile 50.000 = 6.307
sockperf: ---> percentile 25.000 = 6.212
sockperf: ---> <MIN> observation = 5.887
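To collect results from repeated runs, the latency figures can be pulled out of the client output with a short script. The sketch below assumes the output format shown in this example (sockperf 3.7); other versions may print different lines. The function name `parse_sockperf_latency` is hypothetical.

```python
import re


def parse_sockperf_latency(output):
    """Return (average_latency_usec, {percentile: latency_usec})."""
    avg = re.search(r"avg-lat=\s*([\d.]+)", output)
    percentiles = {
        float(pct): float(value)
        for pct, value in re.findall(
            r"percentile\s+([\d.]+)\s*=\s*([\d.]+)", output
        )
    }
    return (float(avg.group(1)) if avg else None), percentiles


# A few lines taken from the kernel client output above.
SAMPLE_OUTPUT = """\
sockperf: ====> avg-lat= 6.488 (std-dev=0.396)
sockperf: Summary: Latency is 6.488 usec
sockperf: ---> percentile 99.000 = 7.963
sockperf: ---> percentile 50.000 = 6.307
"""

average, by_percentile = parse_sockperf_latency(SAMPLE_OUTPUT)
print(average, by_percentile)
```

The patterns do not depend on line breaks, so the function also works on output that has been joined into a single line.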

VMA Latency

Check the VMA performance by running sockperf with the VMA_SPEC=latency environment variable set.

VMA Performance Server Side

On the first machine run:

$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3

Server-side example output:

VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)


VMA Performance Client Side

On the second machine run:

$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3

Client-side example output:

VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf ping-pong -t 4 -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat= 1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation = 4.176
sockperf: ---> percentile 99.999 = 1.639
sockperf: ---> percentile 99.990 = 1.552
sockperf: ---> percentile 99.900 = 1.497
sockperf: ---> percentile 99.000 = 1.305
sockperf: ---> percentile 90.000 = 1.179
sockperf: ---> percentile 75.000 = 1.054
sockperf: ---> percentile 50.000 = 1.031
sockperf: ---> percentile 25.000 = 1.015
sockperf: ---> <MIN> observation = 0.954

Comparing Results

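In this test, VMA's average latency is about 6.1 times lower than the kernel's: the kernel average is roughly 614% of the VMA average, which corresponds to a latency reduction of about 84%.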

Average latency:

  • Using Kernel 6.488 usec

  • Using VMA 1.056 usec

Percentile latencies:

Percentile   Kernel (usec)   VMA (usec)
Max          20.484          4.176
99.999       17.732          1.639
99.990       9.364           1.552
99.900       8.491           1.497
99.000       7.963           1.305
90.000       6.975           1.179
75.000       6.831           1.054
50.000       6.307           1.031
25.000       6.212           1.015
Min          5.887           0.954
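The comparison can be reproduced from the figures above. The sketch below computes, for a few of the measured values, how many times lower the VMA latency is and the relative reduction. The helper name `compare` and the dictionary keys are illustrative choices, not sockperf or VMA names.

```python
# Latencies in usec, copied from the results in this section.
KERNEL = {"avg": 6.488, "p50": 6.307, "p99": 7.963, "max": 20.484}
VMA = {"avg": 1.056, "p50": 1.031, "p99": 1.305, "max": 4.176}


def compare(kernel_usec, vma_usec):
    """Return (ratio, reduction_percent) of kernel latency versus VMA latency."""
    ratio = kernel_usec / vma_usec
    reduction_percent = (1 - vma_usec / kernel_usec) * 100
    return ratio, reduction_percent


for name in KERNEL:
    ratio, reduction = compare(KERNEL[name], VMA[name])
    print(f"{name:>4}: {ratio:.2f}x lower latency ({reduction:.1f}% reduction)")
```

For the average latency this gives a ratio of about 6.14 and a reduction of about 83.7%.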

To tune your system for the best performance, see the Basic Performance Tuning section.

Libvma-debug.so

libvma.so is limited to the DEBUG log level. To run VMA with logging more detailed than the DEBUG level, use the libvma-debug.so library, which comes with the OFED installation.

Before running your application, set the LD_PRELOAD environment variable to libvma-debug.so (instead of libvma.so).

Example:

$ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3

Warning

libvma-debug.so is located in the same library path as libvma.so under your distribution’s OS.

For example, in RHEL7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.

Warning

NOTE: If you need to compile VMA with a log level higher than DEBUG, run “configure” with the following parameter:

./configure --enable-opt-log=none

See section Building VMA from Sources.


© Copyright 2023, NVIDIA. Last updated on Nov 3, 2023.