Running VMA
This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.
Before running a user application, set the LD_PRELOAD environment variable to libvma.so. For further information, please refer to the VMA User Manual.
Example:
$ LD_PRELOAD=libvma.so sockperf server -i 11.4.3.3
If LD_PRELOAD is set to libvma.so without a path (as in the example), libvma.so is read from a known library path of your distribution. Otherwise, it is read from the specified path.
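To see which file a bare libvma.so name could resolve to, one option is to look it up in the dynamic linker cache. The following is a minimal sketch, not part of VMA: the helper name vma_lib_from_cache is hypothetical, and it assumes the library is registered in the cache printed by ldconfig -p. Other search locations, such as LD_LIBRARY_PATH, are not considered.

```shell
# Hypothetical helper (not part of VMA): reads "ldconfig -p" style text on
# stdin and prints the path that the cache maps libvma.so to, if any.
vma_lib_from_cache() {
    awk '$1 == "libvma.so" { print $NF; exit }'
}

# Typical use on a live system:
#   ldconfig -p | vma_lib_from_cache
# To load one specific file instead, give LD_PRELOAD a full path
# (install location assumed here):
#   LD_PRELOAD=/usr/lib64/libvma.so sockperf server -i 11.4.3.3

# Self-contained demonstration using a sample cache line:
printf '\tlibvma.so (libc6,x86-64) => /usr/lib64/libvma.so\n' | vma_lib_from_cache
```

If the helper prints nothing, the library is not in the cache, and passing an explicit path in LD_PRELOAD is the safer choice.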
When the application starts, a VMA header message should precede its output.
VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
The output will always show:
The VMA version
The application’s name (in the above example: Cmd Line: sockperf server -i 11.4.3.3)
The appearance of the VMA header indicates that the VMA library is loaded with your application.
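The header check can also be scripted. The sketch below assumes the application's output was captured to a file or pipe. The helper name vma_header_present and the log file name are hypothetical. It only looks for the version line shown in the header above.

```shell
# Hypothetical helper: succeeds if the text on stdin contains the VMA header
# line that reports the library version.
vma_header_present() {
    grep -q 'VMA INFO: VMA_VERSION:'
}

# Example with a saved log (file name is an assumption):
#   vma_header_present < app.log && echo "VMA loaded"

# Self-contained demonstration with sample text:
if printf 'VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss\n' | vma_header_present; then
    echo "VMA loaded"
else
    echo "VMA header not found"
fi
```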
Check that the linker (ld) can find the libvma library.
ld -lvma --verbose
Set the setuid (UID) bit on the library and on sysctl to enforce user ownership.
sudo chmod u+s /usr/lib64/libvma*
sudo chmod u+s /sbin/sysctl
Grant the CAP_NET_RAW and CAP_NET_ADMIN capabilities to the application.
sudo setcap cap_net_raw,cap_net_admin+ep /usr/bin/sockperf
Launch the application as a non-root user.
LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t 10
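Before launching as a non-root user, it can help to confirm that the capabilities were actually applied. The sketch below assumes the getcap utility from libcap is installed. The helper name has_vma_caps is hypothetical. Because the exact getcap output format differs between libcap versions, the helper only checks that both capability names appear.

```shell
# Hypothetical helper: reads "getcap <file>" output on stdin and succeeds only
# if both cap_net_raw and cap_net_admin are listed.
has_vma_caps() {
    caps=$(cat)
    case $caps in *cap_net_raw*) ;; *) return 1 ;; esac
    case $caps in *cap_net_admin*) ;; *) return 1 ;; esac
    return 0
}

# Typical use on a live system:
#   getcap /usr/bin/sockperf | has_vma_caps && echo "capabilities present"

# Self-contained demonstration with a sample line (format is an assumption):
if printf '/usr/bin/sockperf cap_net_admin,cap_net_raw=ep\n' | has_vma_caps; then
    echo "capabilities present"
else
    echo "capabilities missing"
fi
```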
Prerequisites
Install sockperf – a tool for network performance measurement
This can be done by either:
Downloading and building it from source: https://github.com/Mellanox/sockperf
Using yum: yum install sockperf
Two machines, one serving as the server and the other as the client
Management interfaces configured with IP addresses, so that the machines can ping each other
Physical installation of an NVIDIA® NIC in your machines
Your system must recognize the NVIDIA® NIC. To verify that it does, run:
lspci | grep Mellanox
Output example:
$ lspci | grep Mellanox
82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Kernel Performance
Kernel Performance Server Side
On the first machine run:
$ sockperf server -i 11.4.3.3
Server-side example output:
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)
Kernel Performance Client Side
On the second machine run:
$ sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat=6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation = 20.484
sockperf: ---> percentile 99.999 = 17.732
sockperf: ---> percentile 99.990 = 9.364
sockperf: ---> percentile 99.900 = 8.491
sockperf: ---> percentile 99.000 = 7.963
sockperf: ---> percentile 90.000 = 6.975
sockperf: ---> percentile 75.000 = 6.831
sockperf: ---> percentile 50.000 = 6.307
sockperf: ---> percentile 25.000 = 6.212
sockperf: ---> <MIN> observation = 5.887
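When several runs are compared, it can be convenient to pull the headline number out of the client output automatically. The sketch below is one possible approach: the helper name sockperf_latency is hypothetical, and it assumes the output contains a single-line summary of the form "Summary: Latency is <value> usec", as in the example above.

```shell
# Hypothetical helper: prints the average latency (usec) found in sockperf
# client output read on stdin, based on its "Summary: Latency is" line.
sockperf_latency() {
    awk '/Summary: Latency is/ { print $(NF-1); exit }'
}

# Typical use on a live system (2>&1 in case the log goes to stderr):
#   sockperf ping-pong -t 4 -i 11.4.3.3 2>&1 | sockperf_latency

# Self-contained demonstration with the summary line from the example:
printf 'sockperf: Summary: Latency is 6.488 usec\n' | sockperf_latency
```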
VMA Latency
Check the VMA performance by running sockperf and using the "VMA_SPEC=latency" environment variable.
VMA Performance Server Side
On the first machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3
Server-side example output:
VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)
VMA Performance Client Side
On the second machine run:
$ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3
Client-side example output:
VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf ping-pong -t 4 -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec Latency [VMA_SPEC]
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX 16384 [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE 256 [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching 4 [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE 256 [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching 4 [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops -1 [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll 256 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams 0 [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec) -1 [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force Enabled [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio 1 [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS 1 [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec) 100 [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation Disabled [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count 128 [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation Disabled [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full Disabled [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay 1 [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd Enabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity 0 [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode Single [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type 2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 11.4.3.3 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat=1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation = 4.176
sockperf: ---> percentile 99.999 = 1.639
sockperf: ---> percentile 99.990 = 1.552
sockperf: ---> percentile 99.900 = 1.497
sockperf: ---> percentile 99.000 = 1.305
sockperf: ---> percentile 90.000 = 1.179
sockperf: ---> percentile 75.000 = 1.054
sockperf: ---> percentile 50.000 = 1.031
sockperf: ---> percentile 25.000 = 1.015
sockperf: ---> <MIN> observation = 0.954
Comparing Results
In this example, VMA delivers roughly 6.1 times lower average latency than the kernel network stack.
Average latency:
Using Kernel 6.488 usec
Using VMA 1.056 usec
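The relationship between the two averages can be worked out directly. The sketch below uses awk for the arithmetic. The helper name compare_latency is hypothetical, and it takes the kernel and VMA average latencies (in usec) as its two arguments.

```shell
# Hypothetical helper: given the kernel and VMA average latencies (usec),
# print how many times lower the VMA latency is and the relative reduction.
compare_latency() {
    awk -v k="$1" -v v="$2" 'BEGIN {
        printf "%.2fx lower latency, %.1f%% reduction\n", k / v, (1 - v / k) * 100
    }'
}

# Values taken from the two example runs above:
compare_latency 6.488 1.056
```

For the example values this prints "6.14x lower latency, 83.7% reduction". That is, the kernel's average latency is about 614% of VMA's.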
Percentile latencies:
| Percentile | Kernel (usec) | VMA (usec) |
|---|---|---|
| Max | 20.484 | 4.176 |
| 99.999 | 17.732 | 1.639 |
| 99.990 | 9.364 | 1.552 |
| 99.900 | 8.491 | 1.497 |
| 99.000 | 7.963 | 1.305 |
| 90.000 | 6.975 | 1.179 |
| 75.000 | 6.831 | 1.054 |
| 50.000 | 6.307 | 1.031 |
| 25.000 | 6.212 | 1.015 |
| Min | 5.887 | 0.954 |
To tune your system for best performance, see section Basic Performance Tuning.
Libvma-debug.so
libvma.so is limited to the DEBUG log level. To run VMA with more detailed logging than the DEBUG level, use the libvma-debug.so library that comes with the OFED installation.
Before running your application, set the LD_PRELOAD environment variable to libvma-debug.so (instead of libvma.so).
Example:
$ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3
libvma-debug.so is located in the same library path as libvma.so on your distribution.
For example, on RHEL 7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.
NOTE: If you need to compile VMA with a log level higher than DEBUG, run “configure” with the following parameter:
./configure --enable-opt-log=none
See section Building VMA from Sources.