Running VMA
This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.
Before running a user application, set the LD_PRELOAD environment variable to libvma.so so that the library is preloaded. For further information, please refer to the VMA User Manual.
Example:
$ LD_PRELOAD=libvma.so sockperf server -i 11.4.3.3
If LD_PRELOAD is set to libvma.so without a path (as in the example), libvma.so is loaded from a known library path of your distribution; otherwise, it is loaded from the specified path.
As a result, a VMA header message should precede the output of your application.
VMA INFO: VMA_VERSION: 9.7.2-1 Release built on Nov 12 2022 14:45:59
VMA INFO: Cmd Line: sockperf
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.1.2.0:
VMA INFO: ---------------------------------------------------------------------------
The output will always show:
The VMA version
The application’s command line (in the above example: Cmd Line: sockperf)
The appearance of the VMA header indicates that the VMA library is loaded with your application.
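The preload and the header check can also be scripted. The sketch below is a minimal Python illustration and is not part of VMA: the helper names preload_env and parse_vma_header are hypothetical, and it assumes the process output has already been captured as text, with each header record on one line in the form shown in the example above.

```python
import os
import re

def preload_env(library="libvma.so", base_env=None):
    """Return a copy of base_env with `library` placed first in LD_PRELOAD.

    A bare file name (no slash) is resolved by the dynamic loader through its
    normal library search path; a value containing a slash is used as a path.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already preloaded, after the new one.
    env["LD_PRELOAD"] = f"{library} {existing}".strip()
    return env

def parse_vma_header(output):
    """Extract the VMA version and command line from captured output.

    Returns None when no VMA header is found, which suggests that the
    library was not loaded with the application.
    """
    version = re.search(r"VMA INFO: VMA_VERSION:\s*(\S+)", output)
    if version is None:
        return None
    cmd = re.search(r"VMA INFO: Cmd Line:\s*(.*)", output)
    return {
        "version": version.group(1),
        "cmd_line": cmd.group(1).strip() if cmd else None,
    }
```

A launcher could pass preload_env() as the env argument of subprocess.run and feed the captured output to parse_vma_header; a None result suggests that libvma.so was not picked up.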
Check whether the linker (ld) can find the libvma library.
ld -lvma --verbose
Set the setuid bit to enforce user ownership.
sudo chmod u+s /usr/lib64/libvma*
sudo chmod u+s /sbin/sysctl
Grant the CAP_NET_RAW and CAP_NET_ADMIN capabilities to the application.
sudo setcap cap_net_raw,cap_net_admin+ep /usr/bin/sockperf
Launch the application as a non-root user.
LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t 10
Prerequisites
Install sockperf, a tool for network performance measurement
This can be done by either:
Downloading and building from source: https://github.com/Mellanox/sockperf
Using yum: yum install sockperf
Two machines: one serves as the server and the other as the client
Management interfaces configured with IP addresses, such that the machines can ping each other
Physical installation of an NVIDIA® NIC in your machines
Your system must recognize the NVIDIA® NIC. To verify that it does, run:
lspci | grep Mellanox
Output example:
$ lspci | grep Mellanox
82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
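The same check can be automated when preparing several machines. The following is a minimal Python sketch under the assumption that the lspci output has been captured as text with one device per line; the function name mellanox_devices is hypothetical and used only for illustration.

```python
def mellanox_devices(lspci_output):
    """Return (pci_address, description) pairs for lines mentioning Mellanox."""
    devices = []
    for line in lspci_output.splitlines():
        if "Mellanox" not in line:
            continue
        # Each lspci line starts with the PCI address, followed by a space
        # and the device class and name.
        address, _, description = line.strip().partition(" ")
        devices.append((address, description))
    return devices
```

An empty list means that no matching device was reported, in which case the NIC installation should be checked before running the benchmark.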
Kernel Performance
Kernel Performance Server Side
On the first machine run:
$ sockperf server -i 11.4.3.3
Server-side example output:
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)
Kernel Performance Client Side
On the second machine run:
$ sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat= 6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation = 20.484
sockperf: ---> percentile 99.999 = 17.732
sockperf: ---> percentile 99.990 = 9.364
sockperf: ---> percentile 99.900 = 8.491
sockperf: ---> percentile 99.000 = 7.963
sockperf: ---> percentile 90.000 = 6.975
sockperf: ---> percentile 75.000 = 6.831
sockperf: ---> percentile 50.000 = 6.307
sockperf: ---> percentile 25.000 = 6.212
sockperf: ---> <MIN> observation = 5.887
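To compare runs without copying numbers by hand, the latency figures can be extracted from the client output. The sketch below is a minimal Python illustration that assumes the sockperf 3.7 output format shown in this section, with each record on one line; the function name parse_sockperf_latency is hypothetical, and other sockperf versions may print these lines differently.

```python
import re

def parse_sockperf_latency(output):
    """Collect average, std-dev, percentile, MAX and MIN values (usec)."""
    result = {"percentiles": {}}
    avg = re.search(r"avg-lat=\s*([\d.]+)\s*\(std-dev=([\d.]+)\)", output)
    if avg:
        result["avg"] = float(avg.group(1))
        result["std_dev"] = float(avg.group(2))
    # Lines of the form "percentile 99.999 = 17.732"
    for name, value in re.findall(r"percentile\s+([\d.]+)\s*=\s*([\d.]+)", output):
        result["percentiles"][name] = float(value)
    # Lines of the form "<MAX> observation = 20.484"
    for label, value in re.findall(r"<(MAX|MIN)> observation\s*=\s*([\d.]+)", output):
        result[label.lower()] = float(value)
    return result
```

The returned dictionary keeps percentile names as strings (for example "99.999") so that kernel and VMA runs can be matched key by key.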
VMA Latency
Measure the VMA performance by running sockperf with the environment variable VMA_SPEC=latency set.
VMA Performance Server Side
On the first machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3
Server-side example output:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.7.2-1 Release built on Nov 12 2022 14:45:59
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.1.2.0:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)
VMA Performance Client Side
On the second machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.7.2-1 Release built on Nov 12 2022 14:45:59
VMA INFO: Cmd Line: sockperf
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.1.2.0:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat= 1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation = 4.176
sockperf: ---> percentile 99.999 = 1.639
sockperf: ---> percentile 99.990 = 1.552
sockperf: ---> percentile 99.900 = 1.497
sockperf: ---> percentile 99.000 = 1.305
sockperf: ---> percentile 90.000 = 1.179
sockperf: ---> percentile 75.000 = 1.054
sockperf: ---> percentile 50.000 = 1.031
sockperf: ---> percentile 25.000 = 1.015
sockperf: ---> <MIN> observation = 0.954
Comparing Results
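In this test, the kernel's average latency is about 6.14 times that of VMA (6.488 usec / 1.056 usec); in other words, VMA reduces the average latency by about 83.7% compared to the kernel.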
Average latency:
Using Kernel 6.488 usec
Using VMA 1.056 usec
Percentile latencies:
Percentile | Kernel (usec) | VMA (usec) |
Max | 20.484 | 4.176 |
99.999 | 17.732 | 1.639 |
99.990 | 9.364 | 1.552 |
99.900 | 8.491 | 1.497 |
99.000 | 7.963 | 1.305 |
90.000 | 6.975 | 1.179 |
75.000 | 6.831 | 1.054 |
50.000 | 6.307 | 1.031 |
25.000 | 6.212 | 1.015 |
Min | 5.887 | 0.954 |
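The comparison can be reproduced from the figures in the table. The following minimal Python sketch computes, for a few of the rows, how many times larger the kernel latency is and the percentage by which VMA lowers it; the function name compare and the choice of rows are illustrative only.

```python
# Latency values in usec, copied from the results above.
kernel = {"avg": 6.488, "max": 20.484, "99.999": 17.732, "50.000": 6.307, "min": 5.887}
vma = {"avg": 1.056, "max": 4.176, "99.999": 1.639, "50.000": 1.031, "min": 0.954}

def compare(kernel_usec, vma_usec):
    """Return (kernel/VMA latency ratio, percent latency reduction)."""
    ratio = kernel_usec / vma_usec
    reduction = (1.0 - vma_usec / kernel_usec) * 100.0
    return ratio, reduction

for key in kernel:
    ratio, reduction = compare(kernel[key], vma[key])
    print(f"{key:>7}: kernel/VMA = {ratio:.2f}, latency reduction = {reduction:.1f}%")
```

For the average latency this gives a ratio of about 6.14 and a reduction of about 83.7%.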
To tune your system for the best performance, see section Basic Performance Tuning.
Libvma-debug.so
libvma.so is limited to the DEBUG log level. If you need to run VMA with logging more detailed than the DEBUG level, use the libvma-debug.so library, which comes with the OFED installation.
Before running your application, set the LD_PRELOAD environment variable to libvma-debug.so (instead of libvma.so).
Example:
$ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3
libvma-debug.so is located in the same library path as libvma.so under your distribution’s OS.
For example, in RHEL7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.
NOTE: If you need to compile VMA with a log level higher than DEBUG, run “configure” with the following parameter:
./configure --enable-opt-log=none
See section Building VMA from Sources.