Running VMA
This section shows how to run a simple network benchmarking test and compare the kernel network stack results to VMA.
Before running a user application, you must add the libvma.so library to the LD_PRELOAD environment variable. For further information, refer to the VMA User Manual.
Example:
            
            $ LD_PRELOAD=libvma.so sockperf server -i 11.4.3.3
    
If LD_PRELOAD is set to libvma.so without a path (as in the example), libvma.so is loaded from a known library path of your distribution's OS; otherwise, it is loaded from the specified path.
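The lookup rule can be illustrated with a small shell sketch. It assumes the usual Linux dynamic loader behavior, in which a preload value containing a slash is used as a path and a bare name is searched in the loader's library directories. The helper name `describe_preload` and the example path are hypothetical and not part of VMA.

```shell
# describe_preload is a hypothetical helper, not part of VMA.
# It reports how a given LD_PRELOAD value would be resolved.
describe_preload() {
    case "$1" in
        */*) echo "path: $1" ;;      # contains a slash: loaded from that exact path
        *)   echo "search: $1" ;;    # bare name: looked up in the library search path
    esac
}

describe_preload libvma.so
describe_preload /usr/lib64/libvma.so
```

Using a full path is useful when several builds of the library are installed and a specific one must be loaded.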
As a result, a VMA header message is printed before your application's own output.
            
            VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
    
The output will always show:
- The VMA version 
- The application's name (in the above example: Cmd Line: sockperf server -i 11.4.3.3)
The appearance of the VMA header indicates that the VMA library is loaded with your application.
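The header check can be automated. The sketch below assumes the banner line begins with `VMA INFO: VMA_VERSION:`, as in the example output; `vma_header_present` is a hypothetical helper name, not a VMA tool.

```shell
# vma_header_present is a hypothetical helper: it reads program output on
# stdin and succeeds if the VMA version banner line is present.
vma_header_present() {
    grep -q '^VMA INFO: VMA_VERSION:'
}

# Example with captured text; in practice, pipe the application's output in.
if printf 'VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss\n' | vma_header_present; then
    echo "VMA loaded"
else
    echo "VMA header not found"
fi
```

When checking a real run, capture both stdout and stderr before piping the text into the function, since this section does not state which stream carries the banner.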
To run VMA as a non-root user:
- Check that the linker can find the libvma library:
      ld -lvma --verbose
- Set the setuid bit to enforce user ownership:
      sudo chmod u+s /usr/lib64/libvma*
      sudo chmod u+s /sbin/sysctl
- Grant CAP_NET_RAW and CAP_NET_ADMIN privileges to the application:
      sudo setcap cap_net_raw,cap_net_admin+ep /usr/bin/sockperf
- Launch the application as a non-root user:
      LD_PRELOAD=libvma.so sockperf sr --tcp -i 10.0.0.4 -p 12345
      LD_PRELOAD=libvma.so sockperf pp --tcp -i 10.0.0.4 -p 12345 -t 10
Prerequisites
- Install sockperf, a tool for network performance measurement. This can be done by either:
  - Downloading and building from source: https://github.com/Mellanox/sockperf
  - Using yum:
        yum install sockperf
- Two machines, one serving as the server and the other as the client
  - Management interfaces configured with IP addresses such that the machines can ping each other
- Physical installation of an NVIDIA® NIC in your machines
- Your system must recognize the NVIDIA® NIC. To verify this, run:
      lspci | grep Mellanox
  Output example:
      $ lspci | grep Mellanox
      82:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
      82:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
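The prerequisites can be checked in one pass before starting the benchmark. This is a minimal sketch, assuming `sockperf` and `lspci` are expected on the PATH; `have_cmd` is a hypothetical helper name.

```shell
# have_cmd is a hypothetical helper: succeeds if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report each prerequisite without aborting, so all gaps are visible at once.
for tool in sockperf lspci; do
    if have_cmd "$tool"; then
        echo "ok: $tool found"
    else
        echo "missing: $tool"
    fi
done

# Look for the adapter only if lspci is available.
if have_cmd lspci; then
    if lspci | grep -q Mellanox; then
        echo "ok: NIC detected"
    else
        echo "missing: no Mellanox device listed by lspci"
    fi
fi
```

Run it on both machines. Connectivity between the management interfaces still needs a separate ping test.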
Kernel Performance
Kernel Performance Server Side
On the first machine run:
            
            $ sockperf server -i 11.4.3.3
    
Server-side example output:
            
            sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124545] using recvfrom() to block on socket(s)
    
    
    
        
Kernel Performance Client Side
On the second machine run:
            
            $ sockperf ping-pong -t 4 -i 11.4.3.3
    
Client-side example output:
            
            sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
 
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=307425; ReceivedMessages=307424
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=272899; ReceivedMessages=272899
sockperf: ====> avg-lat=  6.488 (std-dev=0.396)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6.488 usec
sockperf: Total 272899 observations; each percentile contains 2728.99 observations
sockperf: ---> <MAX> observation =   20.484
sockperf: ---> percentile 99.999 =   17.732
sockperf: ---> percentile 99.990 =    9.364
sockperf: ---> percentile 99.900 =    8.491
sockperf: ---> percentile 99.000 =    7.963
sockperf: ---> percentile 90.000 =    6.975
sockperf: ---> percentile 75.000 =    6.831
sockperf: ---> percentile 50.000 =    6.307
sockperf: ---> percentile 25.000 =    6.212
sockperf: ---> <MIN> observation =    5.887
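
The average latency can be pulled out of a saved run for later comparison. The sketch below assumes the summary line has the form `Summary: Latency is N usec`, as in the output above; `summary_latency` is a hypothetical helper name.

```shell
# summary_latency is a hypothetical helper: it reads sockperf output on stdin
# and prints the value from the "Summary: Latency is N usec" line.
summary_latency() {
    # The latency value is the second-to-last field on the summary line.
    awk '/Summary: Latency is/ { print $(NF-1) }'
}

printf 'sockperf: Summary: Latency is 6.488 usec\n' | summary_latency
```

To use it on a live run, pipe the client's output into the function, redirecting stderr to stdout as well if the summary does not appear.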
    
VMA Latency
Check the VMA performance by running sockperf with the VMA_SPEC=latency environment variable set.
VMA Performance Server Side
On the first machine run:
            
            $ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf server -i 11.4.3.3
    
Server-side example output:
            
            VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec                       Latency        [VMA_SPEC]
VMA INFO: Log Level                      INFO           [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX       16384          [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE                      256            [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching             4              [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE                      256            [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching             4              [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops                  -1             [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll  256            [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams                0              [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec)             -1             [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force           Enabled        [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio           1              [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS                 1              [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec)       100            [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation       Disabled       [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count               128            [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation         Disabled       [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full               Disabled       [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay                    1              [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd      Enabled        [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity       0              [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode                    Single         [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type              2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf: [SERVER] listen on:
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 124588] using recvfrom() to block on socket(s)
    
    
    
        
VMA Performance Client Side
On the second machine run:
            
            $ LD_PRELOAD=libvma.so VMA_SPEC=latency sockperf ping-pong -t 4 -i 11.4.3.3
    
Client-side example output:
            
            VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf ping-pong -t 4 -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA Spec                       Latency        [VMA_SPEC]
VMA INFO: Log Level                      INFO           [VMA_TRACELEVEL]
VMA INFO: Ring On Device Memory TX       16384          [VMA_RING_DEV_MEM_TX]
VMA INFO: Tx QP WRE                      256            [VMA_TX_WRE]
VMA INFO: Tx QP WRE Batching             4              [VMA_TX_WRE_BATCHING]
VMA INFO: Rx QP WRE                      256            [VMA_RX_WRE]
VMA INFO: Rx QP WRE Batching             4              [VMA_RX_WRE_BATCHING]
VMA INFO: Rx Poll Loops                  -1             [VMA_RX_POLL]
VMA INFO: Rx Prefetch Bytes Before Poll  256            [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
VMA INFO: GRO max streams                0              [VMA_GRO_STREAMS_MAX]
VMA INFO: Select Poll (usec)             -1             [VMA_SELECT_POLL]
VMA INFO: Select Poll OS Force           Enabled        [VMA_SELECT_POLL_OS_FORCE]
VMA INFO: Select Poll OS Ratio           1              [VMA_SELECT_POLL_OS_RATIO]
VMA INFO: Select Skip OS                 1              [VMA_SELECT_SKIP_OS]
VMA INFO: CQ Drain Interval (msec)       100            [VMA_PROGRESS_ENGINE_INTERVAL]
VMA INFO: CQ Interrupts Moderation       Disabled       [VMA_CQ_MODERATION_ENABLE]
VMA INFO: CQ AIM Max Count               128            [VMA_CQ_AIM_MAX_COUNT]
VMA INFO: CQ Adaptive Moderation         Disabled       [VMA_CQ_AIM_INTERVAL_MSEC]
VMA INFO: CQ Keeps QP Full               Disabled       [VMA_CQ_KEEP_QP_FULL]
VMA INFO: TCP nodelay                    1              [VMA_TCP_NODELAY]
VMA INFO: Avoid sys-calls on tcp fd      Enabled        [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
VMA INFO: Internal Thread Affinity       0              [VMA_INTERNAL_THREAD_AFFINITY]
VMA INFO: Thread mode                    Single         [VMA_THREAD_MODE]
VMA INFO: Mem Allocate type              2 (Huge Pages) [VMA_MEM_ALLOC_TYPE]
VMA INFO: ---------------------------------------------------------------------------
sockperf: == version #3.7-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
 
[ 0] IP = 11.4.3.3        PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=4.000 sec; Warm up time=400 msec; SentMessages=1855851; ReceivedMessages=1855850
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=3.550 sec; SentMessages=1656957; ReceivedMessages=1656957
sockperf: ====> avg-lat=  1.056 (std-dev=0.074)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 1.056 usec
sockperf: Total 1656957 observations; each percentile contains 16569.57 observations
sockperf: ---> <MAX> observation =    4.176
sockperf: ---> percentile 99.999 =    1.639
sockperf: ---> percentile 99.990 =    1.552
sockperf: ---> percentile 99.900 =    1.497
sockperf: ---> percentile 99.000 =    1.305
sockperf: ---> percentile 90.000 =    1.179
sockperf: ---> percentile 75.000 =    1.054
sockperf: ---> percentile 50.000 =    1.031
sockperf: ---> percentile 25.000 =    1.015
sockperf: ---> <MIN> observation =    0.954
    
Comparing Results
VMA shows about 6.1 times lower average latency than the kernel network stack, a reduction of roughly 84%.
Average latency:
- Using kernel: 6.488 usec
- Using VMA: 1.056 usec
Percentile latencies:
| Percentile | Kernel (usec) | VMA (usec) |
| MAX | 20.484 | 4.176 |
| 99.999 | 17.732 | 1.639 |
| 99.990 | 9.364 | 1.552 |
| 99.900 | 8.491 | 1.497 |
| 99.000 | 7.963 | 1.305 |
| 90.000 | 6.975 | 1.179 |
| 75.000 | 6.831 | 1.054 |
| 50.000 | 6.307 | 1.031 |
| 25.000 | 6.212 | 1.015 |
| MIN | 5.887 | 0.954 |
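The comparison of the two average latencies reduces to simple arithmetic, sketched here with awk. `compare_latency` is a hypothetical helper that takes the baseline (kernel) and improved (VMA) averages in usec.

```shell
# compare_latency is a hypothetical helper: given baseline and improved
# average latencies (usec), print the speedup factor and percent reduction.
compare_latency() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        printf "%.2fx lower, %.1f%% reduction\n", base / new, (1 - new / base) * 100
    }'
}

compare_latency 6.488 1.056
```

With the averages reported above, this prints `6.14x lower, 83.7% reduction`.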
To tune your system for best performance, see section Basic Performance Tuning.
libvma-debug.so
libvma.so is limited to the DEBUG log level. To run VMA with logging more detailed than DEBUG, use the libvma-debug.so library that comes with the OFED installation.
Before running your application, set LD_PRELOAD to libvma-debug.so instead of libvma.so.
Example:
            
            $ LD_PRELOAD=libvma-debug.so sockperf server -i 11.4.3.3
    
libvma-debug.so is located in the same library path as libvma.so under your distribution's OS.
For example, on RHEL 7.x x86_64, libvma-debug.so is located at /usr/lib64/libvma-debug.so.
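To confirm where the debug library lives on a given system, a small search over candidate directories can help. This is a sketch only: `find_lib` is a hypothetical helper, and the directory list passed to it is an assumption that should be adjusted for your distribution.

```shell
# find_lib is a hypothetical helper: print the first full path at which the
# named library exists, searching the given directories in order.
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# The directory list is an assumption; adjust it for your distribution.
find_lib libvma-debug.so /usr/lib64 /usr/lib || echo "libvma-debug.so not found"
```

The printed path can be assigned to LD_PRELOAD directly when the library is not in a default search location.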
NOTE: If you need to compile VMA with a log level higher than DEBUG, run configure with the following parameter:
            
            ./configure --enable-opt-log=none
    
See section Building VMA from Sources.