This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.
Before running a user application, you must add the libvma.so library to the LD_PRELOAD environment variable. For further information, please refer to the VMA User Manual.
Example:
If LD_PRELOAD is set to libvma.so without a path (as in the example), libvma.so is loaded from a known library path of your distribution's OS; otherwise, it is loaded from the specified path.
As a result, a VMA header message should precede your application's output.
VMA INFO: ---------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.5.2-1 Release built on Apr 11 2022 18:07:16
VMA INFO: Cmd Line: sockperf sr -i 11.4.3.29
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.6-0.7.0.0:
VMA INFO: --------------------------------------------------------------------
The output will always show:
- The VMA version
- The application’s name (in the above example: Cmd Line: sockperf sr)
The appearance of the VMA header indicates that the VMA library is loaded with your application.
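The check described above can be scripted. The following is a minimal sketch, assuming you have captured the application's startup output as text; the function name `vma_header_present` is hypothetical and not part of VMA. It only looks for the `VMA INFO: VMA_VERSION:` line shown in the example header.

```shell
# Hypothetical helper (not part of VMA): succeeds if the text on stdin
# contains the VMA header line "VMA INFO: VMA_VERSION: ...".
vma_header_present() {
    grep -q 'VMA INFO: VMA_VERSION:'
}

# Example with a single captured line; in practice, pipe in the saved
# startup output of your application instead.
if printf '%s\n' 'VMA INFO: VMA_VERSION: 9.5.2-1 Release built on Apr 11 2022 18:07:16' | vma_header_present; then
    echo "VMA header found"
else
    echo "VMA header not found"
fi
```

If the header is not found, the application is most likely running on the kernel network stack without VMA.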
Running VMA using non-root Permission
Check that the linker (ld) can find the libvma library.
ld -lvma --verbose
Set the setuid bit to enforce user ownership.
sudo chmod u+s /usr/lib64/libvma*
sudo chmod u+s /sbin/sysctl
Grant CAP_NET_RAW privileges to the application.
Launch the application as a non-root user.
LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t10
Benchmarking Example
Prerequisites
- Install sockperf, a tool for network performance measurement. This can be done by either:
  - Downloading and building from source: https://github.com/Mellanox/sockperf
  - Using yum: yum install sockperf
- Two machines, one serves as the server and the second as a client
- Management interfaces configured with IP addresses so that the machines can ping each other
- Physical installation of an NVIDIA® NIC in your machines
Your system must recognize the NVIDIA® NIC. To verify this, run:
lspci | grep Mellanox
Output example:
$ lspci | grep Mellanox
82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Kernel Performance
Kernel Performance Server Side
On the first machine run:
$ sockperf server -i 11.4.3.3
Server-side example output:
sockperf: [SERVER] listen on:sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)
Kernel Performance Client Side
On the second machine run:
$ sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat= 6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation = 20.484
sockperf: ---> percentile 99.999 = 17.732
sockperf: ---> percentile 99.990 = 9.364
sockperf: ---> percentile 99.900 = 8.491
sockperf: ---> percentile 99.000 = 7.963
sockperf: ---> percentile 90.000 = 6.975
sockperf: ---> percentile 75.000 = 6.831
sockperf: ---> percentile 50.000 = 6.307
sockperf: ---> percentile 25.000 = 6.212
sockperf: ---> <MIN> observation = 5.887
VMA Latency
Check the VMA performance by running sockperf and using the "VMA_SPEC=latency" environment variable.
VMA Performance Server Side
On the first machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3
Server-side example output:
VMA INFO: ---------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.5.2-1 Release built on Apr 11 2022 18:07:16
VMA INFO: Cmd Line: sockperf sr -i 11.4.3.29
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.6-0.7.0.0:
VMA INFO: ---------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)
VMA Performance Client Side
On the second machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
VMA INFO: ---------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.5.2-1 Release built on Apr 11 2022 18:07:16
VMA INFO: Cmd Line: sockperf sr -i 11.4.3.29
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.6-0.7.0.0:
VMA INFO: ---------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat= 1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation = 4.176
sockperf: ---> percentile 99.999 = 1.639
sockperf: ---> percentile 99.990 = 1.552
sockperf: ---> percentile 99.900 = 1.497
sockperf: ---> percentile 99.000 = 1.305
sockperf: ---> percentile 90.000 = 1.179
sockperf: ---> percentile 75.000 = 1.054
sockperf: ---> percentile 50.000 = 1.031
sockperf: ---> percentile 25.000 = 1.015
sockperf: ---> <MIN> observation = 0.954
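When comparing several runs, it can help to pull the average latency out of the client output automatically. The following is a minimal sketch, assuming the client output has been saved as text and contains the `Summary: Latency is ... usec` line shown in the examples; the function name `sockperf_avg_latency` is hypothetical and not part of sockperf.

```shell
# Hypothetical helper (not part of sockperf): print the average latency
# in usec from sockperf client output on stdin, taken from the
# "sockperf: Summary: Latency is N usec" line.
sockperf_avg_latency() {
    sed -n 's/^sockperf: Summary: Latency is \([0-9.]*\) usec.*$/\1/p'
}

# Example using two lines from the VMA client output above.
printf '%s\n' \
    'sockperf: ====> avg-lat= 1.056 (std-dev=0.074)' \
    'sockperf: Summary: Latency is 1.056 usec' | sockperf_avg_latency
# prints 1.056
```

Lines that do not match the summary format are ignored, so the full client output can be piped in unchanged.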
Comparing Results
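In this example, the kernel's average latency is about 6.14 times that of VMA (6.488 usec versus 1.056 usec); equivalently, VMA reduces the average latency by about 84% compared to the kernel.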
Average latency:
- Using kernel: 6.488 usec
- Using VMA: 1.056 usec
Percentile latencies:
Percentile | Kernel (usec) | VMA (usec) |
---|---|---|
Max | 20.484 | 4.176 |
99.999 | 17.732 | 1.639 |
99.990 | 9.364 | 1.552 |
99.900 | 8.491 | 1.497 |
99.000 | 7.963 | 1.305 |
90.000 | 6.975 | 1.179 |
75.000 | 6.831 | 1.054 |
50.000 | 6.307 | 1.031 |
25.000 | 6.212 | 1.015 |
Min | 5.887 | 0.954 |
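The kernel-to-VMA ratio for each row can be computed directly from the figures in the table. The following is a minimal sketch, assuming rows in the same pipe-separated layout as the table (label, kernel latency, VMA latency); the function name `latency_ratios` is hypothetical, and the `avg` row simply reuses the average latencies reported above.

```shell
# Hypothetical helper: read "label | kernel | vma |" rows on stdin and
# print "label ratio", where ratio = kernel latency / VMA latency.
# Header and separator rows are skipped because their columns are not
# positive numbers.
latency_ratios() {
    awk -F'|' 'NF >= 3 && $2 + 0 > 0 && $3 + 0 > 0 {
        gsub(/ /, "", $1)                  # trim spaces around the label
        printf "%s %.2f\n", $1, $2 / $3
    }'
}

# Example: the average latencies and the 50th percentile row.
printf '%s\n' \
    'avg | 6.488 | 1.056 |' \
    '50.000 | 6.307 | 1.031 |' | latency_ratios
```

A ratio of about 6 means that the kernel round trip takes roughly six times as long as the VMA round trip for that row.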
To tune your system for best performance, see the Basic Performance Tuning section.
Libvma-debug.so
libvma.so is limited to the DEBUG log level. To run VMA with logging more detailed than the DEBUG level, use the libvma-debug.so library, which comes with the OFED installation.
Before running your application, add the libvma-debug.so library to the LD_PRELOAD environment variable (instead of libvma.so).
Example:
$ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3
libvma-debug.so is located in the same library path as libvma.so under your distribution's OS.
For example, on RHEL 7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.
NOTE: If you need to compile VMA with a log level higher than DEBUG, run "configure" with the following parameter:
./configure --enable-opt-log=none
See section Building VMA from Sources.