This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.

Before running a user application, you must add the libvma.so library to the LD_PRELOAD environment variable. For further information, please refer to the VMA User Manual.

Example: 

$ LD_PRELOAD=libvma.so sockperf server -i 11.4.3.3

If LD_PRELOAD is set to libvma.so without a path (as in the example), libvma.so is read from a known library path of your distribution's OS. Otherwise, it is read from the specified path.

As a result, a VMA header message should be printed before your application's own output.

 VMA INFO: VMA_VERSION: 9.7.0-1 Release built on Oct 31 2022 14:45:59
 VMA INFO: Cmd Line: sockperf
 VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.0.1.1:
 VMA INFO: ---------------------------------------------------------------------------

The output will always show:

  • The VMA version
  • The application’s name (in the above example: Cmd Line: sockperf)

The appearance of the VMA header indicates that the VMA library is loaded with your application.
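If you capture the application's output in a script, the presence of the header can be checked programmatically. The following is a minimal Python sketch, assuming the banner lines have the same layout as in the example above. The function name `parse_vma_header` and the regular expressions are illustrative and not part of VMA.

```python
import re

# Patterns modeled on the banner lines shown in the example above (assumed layout).
VERSION_RE = re.compile(r"VMA INFO:\s*VMA_VERSION:\s*(\S+)")
CMDLINE_RE = re.compile(r"VMA INFO:\s*Cmd Line:\s*(.+)")


def parse_vma_header(output):
    """Return (version, cmd_line) if a VMA banner is found, otherwise None."""
    version_match = VERSION_RE.search(output)
    if version_match is None:
        return None  # no banner: the library was most likely not preloaded
    cmd_match = CMDLINE_RE.search(output)
    cmd_line = cmd_match.group(1).strip() if cmd_match else None
    return version_match.group(1), cmd_line


sample = """\
 VMA INFO: VMA_VERSION: 9.7.0-1 Release built on Oct 31 2022 14:45:59
 VMA INFO: Cmd Line: sockperf
 VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.0.1.1:
"""
print(parse_vma_header(sample))
```

A return value of None suggests that the banner was not printed, which usually means that libvma.so was not loaded with the application.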

Running VMA with Non-Root Permissions

  1. Check that the linker (ld) can find the libvma library. 

    ld -lvma --verbose
  2. Set the setuid bit so that the files run with their owner's privileges. 

    sudo chmod u+s /usr/lib64/libvma*
    sudo chmod u+s /sbin/sysctl
  3. Grant the CAP_NET_RAW and CAP_NET_ADMIN capabilities to the application. 

    sudo setcap cap_net_raw,cap_net_admin+ep /usr/bin/sockperf
  4. Launch the application as a non-root user. 

    LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
    LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t10
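Before launching as a non-root user, it can help to confirm that step 2 took effect. The sketch below is a hedged Python example that checks the setuid bit on the files named in the steps above. The helper names `mode_has_setuid` and `setuid_status` are hypothetical, and the sketch does not verify the capabilities granted in step 3.

```python
import glob
import os
import stat


def mode_has_setuid(mode):
    """Return True if the setuid bit is set in a file mode."""
    return bool(mode & stat.S_ISUID)


def setuid_status(paths):
    """Map each path to True/False (setuid set or not), or None if it cannot be read."""
    status = {}
    for path in paths:
        try:
            status[path] = mode_has_setuid(os.stat(path).st_mode)
        except OSError:
            status[path] = None  # file missing or not accessible
    return status


if __name__ == "__main__":
    # Paths taken from step 2 above; adjust them to your distribution.
    candidates = glob.glob("/usr/lib64/libvma*") + ["/sbin/sysctl"]
    for path, has_bit in setuid_status(candidates).items():
        print(f"{path}: setuid={has_bit}")
```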

Benchmarking Example

Prerequisites

  • Install sockperf, a tool for network performance measurement
    This can be done by either
  • Two machines, one serving as the server and the other as the client
    • Management interfaces configured with IP addresses so that the machines can ping each other
    • Physical installation of an NVIDIA® NIC in your machines
  • Your system must recognize the NVIDIA® NIC. To verify this, run: 

    lspci | grep Mellanox

    Output example: 

    $ lspci | grep Mellanox
    82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
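For scripted setups, the same check can be automated by parsing the lspci output. The Python sketch below assumes that each device line follows the layout of the example output above (PCI address, device class, description). The function name `find_mellanox_devices` is illustrative.

```python
import re

# One lspci line: "<pci address> <device class>: <description>" (layout assumed
# from the example output above).
LSPCI_RE = re.compile(r"^(?P<addr>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$")


def find_mellanox_devices(lspci_output):
    """Return a list of (address, device class, description) for Mellanox devices."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_RE.match(line.strip())
        if match and "Mellanox" in match.group("desc"):
            devices.append((match.group("addr"), match.group("cls"), match.group("desc")))
    return devices


sample = """\
82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
"""
for address, device_class, description in find_mellanox_devices(sample):
    print(address, "|", device_class, "|", description)
```

An empty list would indicate that no matching device was reported, in which case the NIC installation should be checked before benchmarking.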

Kernel Performance

Kernel Performance Server Side

On the first machine run: 

$ sockperf server -i 11.4.3.3

Server side example output: 

sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)

Kernel Performance Client Side

On the second machine run: 

$ sockperf ping-pong -t 4 -i 11.4.3.3

Client-side example output: 

sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat=  6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation =   20.484
sockperf: ---> percentile 99.999 =   17.732
sockperf: ---> percentile 99.990 =    9.364
sockperf: ---> percentile 99.900 =    8.491
sockperf: ---> percentile 99.000 =    7.963
sockperf: ---> percentile 90.000 =    6.975
sockperf: ---> percentile 75.000 =    6.831
sockperf: ---> percentile 50.000 =    6.307
sockperf: ---> percentile 25.000 =    6.212
sockperf: ---> <MIN> observation =    5.887
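To compare runs without copying numbers by hand, the latency figures can be extracted from the client output. The following Python sketch assumes the line formats shown in the example output above. The function name `parse_sockperf_latency` and the shape of the returned dictionary are illustrative choices.

```python
import re

# Patterns modeled on the sockperf client output shown above (assumed format).
AVG_RE = re.compile(r"avg-lat=\s*([\d.]+)\s*\(std-dev=([\d.]+)\)")
PERCENTILE_RE = re.compile(r"percentile\s+([\d.]+)\s*=\s*([\d.]+)")
EXTREME_RE = re.compile(r"<(MAX|MIN)> observation\s*=\s*([\d.]+)")


def parse_sockperf_latency(output):
    """Extract average, std-dev, min, max and percentile latencies (in usec)."""
    result = {"avg": None, "std_dev": None, "max": None, "min": None, "percentiles": {}}
    for line in output.splitlines():
        avg = AVG_RE.search(line)
        if avg:
            result["avg"] = float(avg.group(1))
            result["std_dev"] = float(avg.group(2))
            continue
        extreme = EXTREME_RE.search(line)
        if extreme:
            result[extreme.group(1).lower()] = float(extreme.group(2))
            continue
        percentile = PERCENTILE_RE.search(line)
        if percentile:
            # Keep the percentile label as a string, for example "99.999".
            result["percentiles"][percentile.group(1)] = float(percentile.group(2))
    return result


sample = """\
sockperf: ====> avg-lat=  6.488 (std-dev=0.396)
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation =   20.484
sockperf: ---> percentile 99.999 =   17.732
sockperf: ---> percentile 50.000 =    6.307
sockperf: ---> <MIN> observation =    5.887
"""
print(parse_sockperf_latency(sample))
```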

VMA Latency

Check the VMA performance by running sockperf and using the "VMA_SPEC=latency" environment variable.

VMA Performance Server Side

On the first machine run: 

$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3

Server-side example output: 

VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.7.0-1 Release built on Oct 31 2022 14:45:59
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.0.1.1:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec                       Latency        [VMA_SPEC]
VMA INFO: Log Level                      INFO           [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX       16384          [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE                      256            [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching             4              [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE                      256            [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching             4              [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops                  -1             [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll  256            [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams                0              [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec)             -1             [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force           Enabled        [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio           1              [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS                 1              [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec)       100            [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation       Disabled       [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count               128            [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation         Disabled       [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full               Disabled       [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay                    1              [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd      Enabled        [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity       0              [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode                    Single         [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type              2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)
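Each configuration line in the VMA banner lists a description, the active value, and the name of the corresponding environment variable in square brackets. As a hedged sketch, the Python snippet below collects these lines into a dictionary so that the server and client settings can be compared. It assumes the column layout shown in the example output above, and the function name `parse_vma_settings` is hypothetical.

```python
import re

# "VMA INFO: <description>   <value>   [<ENV_VAR>]": the description and the value
# are separated by at least two spaces in the example output above.
SETTING_RE = re.compile(
    r"^VMA INFO:\s+(?P<name>.+?)\s{2,}(?P<value>.+?)\s+\[(?P<var>VMA_[A-Z0-9_]+)\]$"
)


def parse_vma_settings(output):
    """Map each environment variable name to a (description, value) tuple."""
    settings = {}
    for line in output.splitlines():
        match = SETTING_RE.match(line.strip())
        if match:
            settings[match.group("var")] = (match.group("name"), match.group("value"))
    return settings


sample = """\
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec                       Latency        [VMA_SPEC]
VMA INFO: Rx Poll Loops                  -1             [VMA_RX_POLL]
VMA INFO: Mem Allocate type              2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
"""
for var, (name, value) in parse_vma_settings(sample).items():
    print(f"{var} = {value}  ({name})")
```

Separator lines and the version banner carry no bracketed variable name, so they are skipped.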

VMA Performance Client Side

On the second machine run: 

$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3

Client-side example output: 

VMA INFO: --------------------------------------------------------------------------- 
VMA INFO: VMA_VERSION: 9.7.0-1 Release built on Oct 31 2022 14:45:59
VMA INFO: Cmd Line: sockperf
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.0.1.1:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec                       Latency        [VMA_SPEC]
VMA INFO: Log Level                      INFO           [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX       16384          [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE                      256            [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching             4              [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE                      256            [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching             4              [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops                  -1             [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll  256            [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams                0              [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec)             -1             [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force           Enabled        [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio           1              [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS                 1              [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec)       100            [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation       Disabled       [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count               128            [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation         Disabled       [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full               Disabled       [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay                    1              [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd      Enabled        [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity       0              [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode                    Single         [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type              2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
 
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat=  1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation =    4.176
sockperf: ---> percentile 99.999 =    1.639
sockperf: ---> percentile 99.990 =    1.552
sockperf: ---> percentile 99.900 =    1.497
sockperf: ---> percentile 99.000 =    1.305
sockperf: ---> percentile 90.000 =    1.179
sockperf: ---> percentile 75.000 =    1.054
sockperf: ---> percentile 50.000 =    1.031
sockperf: ---> percentile 25.000 =    1.015
sockperf: ---> <MIN> observation =    0.954

Comparing Results

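In this example, VMA's average latency is about 6.1 times lower than that of the kernel network stack (1.056 usec versus 6.488 usec). Put differently, the kernel's average latency is roughly 614% of VMA's, which corresponds to a latency reduction of about 84%.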

Average latency:

  • Using Kernel: 6.488 usec
  • Using VMA: 1.056 usec
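The comparison of the two averages comes down to simple arithmetic. The short Python sketch below uses only the two values listed above and derives the kernel-to-VMA latency ratio and the relative latency reduction. The variable names are illustrative.

```python
# Average latencies from the two example runs above, in microseconds.
kernel_avg_usec = 6.488
vma_avg_usec = 1.056

# How many times lower the VMA latency is compared to the kernel latency.
ratio = kernel_avg_usec / vma_avg_usec

# Relative reduction in latency when moving from the kernel stack to VMA.
reduction_pct = (1 - vma_avg_usec / kernel_avg_usec) * 100

print(f"Kernel/VMA latency ratio: {ratio:.2f}x")
print(f"Latency reduction: {reduction_pct:.1f}%")
```

Expressed as a percentage, the ratio is about 614%, while the latency reduction is about 84%. The two figures describe the same measurement from different angles.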

Percentile latencies:

Percentile   Kernel (usec)   VMA (usec)
Max          20.484          4.176
99.999       17.732          1.639
99.990       9.364           1.552
99.900       8.491           1.497
99.000       7.963           1.305
90.000       6.975           1.179
75.000       6.831           1.054
50.000       6.307           1.031
25.000       6.212           1.015
Min          5.887           0.954
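The per-percentile values can be compared in the same way as the averages. The Python sketch below is a minimal illustration that uses only the figures from the two example runs. The variable names are not part of any tool.

```python
# (percentile label, kernel latency in usec, VMA latency in usec), taken from
# the example runs above.
rows = [
    ("Max", 20.484, 4.176),
    ("99.999", 17.732, 1.639),
    ("99.990", 9.364, 1.552),
    ("99.900", 8.491, 1.497),
    ("99.000", 7.963, 1.305),
    ("90.000", 6.975, 1.179),
    ("75.000", 6.831, 1.054),
    ("50.000", 6.307, 1.031),
    ("25.000", 6.212, 1.015),
    ("Min", 5.887, 0.954),
]

# Kernel-to-VMA latency ratio for each percentile.
ratios = {label: kernel / vma for label, kernel, vma in rows}

for label, kernel, vma in rows:
    print(f"{label:>7}  kernel={kernel:7.3f}  vma={vma:6.3f}  ratio={ratios[label]:5.2f}x")
```

In this data set the largest ratio appears at the 99.999 percentile, where the kernel stack's tail latency grows much faster than VMA's.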

To tune your system for best performance, see section Basic Performance Tuning.

libvma-debug.so

libvma.so supports log levels up to DEBUG. To run VMA with more detailed logging than the DEBUG level, use the libvma-debug.so library that comes with the OFED installation.

Before running your application, add the libvma-debug.so library to the LD_PRELOAD environment variable (instead of libvma.so).

Example: 

$ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3

libvma-debug.so is located in the same library path as libvma.so under your distribution’s OS.

For example, in RHEL7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.
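When the application is started from a script, switching between the two libraries only requires changing the value of LD_PRELOAD. The following Python sketch builds the environment for a child process. The helper name `vma_env` and its parameters are hypothetical, and the sockperf command in the comment is the one used in the example above.

```python
import os


def vma_env(library="libvma.so", spec=None, base=None):
    """Return a copy of the environment with LD_PRELOAD (and optionally VMA_SPEC) set."""
    env = dict(os.environ if base is None else base)
    # Note: this overwrites any LD_PRELOAD value already present in the environment.
    env["LD_PRELOAD"] = library
    if spec is not None:
        env["VMA_SPEC"] = spec
    return env


# Example usage (not executed here, since it requires sockperf and VMA):
#   import subprocess
#   subprocess.run(["sockperf", "server", "-i", "11.4.3.3"],
#                  env=vma_env("libvma-debug.so"))
print(vma_env("libvma-debug.so", base={})["LD_PRELOAD"])
```

Passing a bare file name relies on the library search path, while passing a full path such as /usr/lib64/libvma-debug.so selects a specific file.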

NOTE: If you need to compile VMA with a log level higher than DEBUG, run "configure" with the following parameter:

./configure --enable-opt-log=none

See section Building VMA from Sources.