VMA Configuration
You can control the behavior of VMA by configuring:
The libvma.conf file
VMA configuration parameters, which are Linux OS environment variables
VMA extra API
The installation process creates a default configuration file, /etc/libvma.conf, in which you can define and change the following settings:
The target applications or processes to which the configured control settings apply. By default, VMA control settings apply to all applications.
The transport protocol to be used for the created sockets.
The IP addresses and ports that you want to offload.
By default, the configuration file allows VMA to offload everything except for the DNS server-side protocol (UDP, port 53) which will be handled by the OS.
In the libvma.conf file:
You can define different VMA control statements for different processes in a single configuration file. Control statements are always applied to the preceding target process statement in the configuration file.
Comments start with # and cause the rest of the line to be ignored.
Leading whitespace is skipped.
Any line that is empty is skipped.
It is recommended to add comments when making configuration changes.
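The line-handling rules listed above can be modeled in a few lines of Python. This is an illustrative sketch only, not VMA's actual parser: the function name `effective_statements` is hypothetical, and trimming trailing whitespace is an assumption that the list above does not state.

```python
def effective_statements(conf_text):
    """Return the statements left after applying the libvma.conf line rules.

    Hypothetical helper for illustration; it is not part of VMA.
    """
    statements = []
    for raw_line in conf_text.splitlines():
        # Everything from '#' to the end of the line is a comment.
        without_comment = raw_line.split("#", 1)[0]
        # Leading whitespace is skipped (trailing whitespace too: an assumption).
        line = without_comment.strip()
        # Empty lines are skipped.
        if line:
            statements.append(line)
    return statements
```

A line that holds only a comment therefore disappears entirely, while a statement followed by a comment keeps just the statement.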
The following sections describe configuration options in libvma.conf. For a sample libvma.conf file, see Example of VMA Configuration.
Configuring Target Application or Process
The target process statement specifies the process to which all control statements that appear between this statement and the next target process statement apply.
Each statement specifies a matching rule; all of its sub-expressions must evaluate to true (logical AND) for the rule to apply.
If not provided (default), the statement matches all programs.
The format of the target process statement is:
application-id <program-name|*> <user-defined-id|*>
Option |
Description |
<program-name|*> |
Define the program name (not including the path) to which the control statements appearing below this statement apply. Wildcards with the same semantics as "ls" are supported (* and ?). For example:
|
<user-defined-id|*> |
Specify the user-defined process ID to which the control statements appearing below this statement apply. Warning: You must also set the VMA_APPLICATION_ID environment variable to the same value as user-defined-id.
|
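The matching described in the table above can be sketched as follows. This is a minimal model under stated assumptions, not VMA's implementation: the function name `matches_target` is hypothetical, Python's `fnmatchcase` stands in for the "ls"-style wildcards, and the user-defined ID is assumed to match either exactly or through a literal `*`.

```python
from fnmatch import fnmatchcase


def matches_target(rule_program, rule_id, program_name, application_id):
    """Return True when both sub-expressions of an application-id statement match.

    Hypothetical helper for illustration; it is not part of VMA.
    """
    # Shell-style wildcards (* and ?) are applied to the program name without its path.
    program_matches = fnmatchcase(program_name, rule_program)
    # Assumption: the ID matches on equality, or always when the rule uses '*'.
    id_matches = rule_id == "*" or rule_id == application_id
    # Logical AND: every sub-expression must be true for the statement to apply.
    return program_matches and id_matches
```

For instance, `matches_target("tcp-*", "B1", "tcp-lat", "B1")` is true in this model, while the same rule does not match a process whose VMA_APPLICATION_ID is `B2`.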
Configuring Socket Transport Control
Use socket control statements to specify when libvma will offload AF_INET/SOCK_STREAM or AF_INET/SOCK_DATAGRAM sockets (currently SOCK_RAW is not supported).
Each control statement specifies a matching rule; all of its sub-expressions must evaluate to true (logical AND) for the rule to apply. Statements are evaluated in order of definition, and the first matching statement is applied ("first-match").
Socket control statements use the following format:
use <transport> <role> <address|*>:<port range|*>
Where:
Option |
Description |
transport |
Define the mode of transport:
The default is vma. |
role |
Specify one of the following roles:
|
address |
You can specify the local address the server is bound to or the remote server address the client connects to. The syntax for address matching is: <IPv4 address>[/<prefix_length>]|*
|
port range |
Define the port range as:
Port range: 0-65535 |
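The first-match evaluation of socket control statements can be modeled with the sketch below. It is an illustration under stated assumptions rather than VMA's rule engine: `parse_rule` and `select_transport` are hypothetical names, only the two-field `<address|*>:<port range|*>` form is handled, the `start[-end]` port syntax is an assumption, and falling back to `vma` when nothing matches is taken from the stated default transport.

```python
import ipaddress


def parse_rule(line):
    """Parse 'use <transport> <role> <address|*>:<port range|*>' (hypothetical helper)."""
    keyword, transport, role, target = line.split()
    if keyword != "use":
        raise ValueError("not a socket control statement: " + line)
    address, _, ports = target.rpartition(":")
    # '*' matches any address; otherwise accept <IPv4 address>[/<prefix_length>].
    network = None if address == "*" else ipaddress.ip_network(address, strict=False)
    if ports == "*":
        port_range = (0, 65535)
    else:
        # Assumption: a range is written start[-end].
        start, _, end = ports.partition("-")
        port_range = (int(start), int(end or start))
    return transport, role, network, port_range


def select_transport(rules, role, address, port, default="vma"):
    """Return the transport of the first rule whose sub-expressions all match."""
    ip = ipaddress.ip_address(address)
    for transport, rule_role, network, (low, high) in rules:
        if rule_role != role:
            continue
        if network is not None and ip not in network:
            continue
        if not low <= port <= high:
            continue
        return transport  # first match wins; later rules are not consulted
    return default
```

Because evaluation stops at the first match, rule order matters: a broad `use vma` rule placed before a narrower `use os` rule would hide the narrower one.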
Example of VMA Configuration
To set the following:
Apply the rules to program tcp-lat with ID B1
Use VMA for TCP clients connecting to machines that belong to subnet 192.168.1.*
Use the OS when a TCP server listens on port 5001 on any address
In libvma.conf, configure:
application-id tcp-lat B1
use vma tcp_client 192.168.1.0/24:*:*:*
use os tcp_server *:5001
use os udp_connect *:53
You must also set the VMA parameter:
VMA_APPLICATION_ID=B1
VMA configuration parameters are Linux OS environment variables.
It is recommended that you set these parameters prior to loading the application with VMA. You can set the parameters in a system file, which can be run manually or automatically.
All the parameters have defaults that can be modified.
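As a rough illustration of setting parameters before the application is loaded, the sketch below builds an environment for a child process. The helper name `vma_environment` is hypothetical, and preloading the library through `LD_PRELOAD=libvma.so` is an assumption about how the application is started with VMA; adapt both to your installation.

```python
import os


def vma_environment(overrides, base_env=None, preload="libvma.so"):
    """Return a copy of the environment with VMA parameters applied.

    Hypothetical helper for illustration; it is not part of VMA.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in overrides.items():
        if not name.startswith("VMA_"):
            raise ValueError("not a VMA parameter: " + name)
        # Environment variable values are always strings.
        env[name] = str(value)
    # Assumption: the application is loaded with VMA through LD_PRELOAD.
    env["LD_PRELOAD"] = preload
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` so that the parameters are in place before the target process starts; parameters that are not overridden keep their defaults.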
On default startup, the VMA library prints the VMA version information, as well as the configuration parameters being used and their values to stderr.
VMA always logs the values of the following parameters, even when they are equal to the default value:
VMA_TRACELEVEL
VMA_LOG_FILE
For all other parameters, VMA logs the parameter values only when they are not equal to the default value.
The VMA version information, parameters, and values are subject to change.
For example:
VMA INFO: VMA_VERSION: X.Y.Z-R Release built on MM DD YYYY HH:mm:ss
VMA INFO: Cmd Line: sockperf server -i 11.4.3.3
VMA INFO: OFED Version: MLNX_OFED_LINUX-X.X-X.X.X.X:
VMA INFO: ---------------------------------------------------------------------------
Pid: 2378 Tid: 2378 VMA INFO: Log Level DEBUG [VMA_TRACELEVEL]
Pid: 2378 Tid: 2378 VMA INFO: Log Details 2 [VMA_LOG_DETAILS]
Pid: 2378 Tid: 2378 VMA DETAILS: Log Colors Enabled [VMA_LOG_COLORS]
Pid: 2378 Tid: 2378 VMA DETAILS: Log File [VMA_LOG_FILE]
Pid: 2378 Tid: 2378 VMA DETAILS: Stats File [VMA_STATS_FILE]
Pid: 2378 Tid: 2378 VMA DETAILS: Stats shared memory directory /tmp/ [VMA_STATS_SHMEM_DIR]
Pid: 2378 Tid: 2378 VMA DETAILS: VMAD output directory /tmp/vma [VMA_VMAD_NOTIFY_DIR]
Pid: 2378 Tid: 2378 VMA DETAILS: Stats FD Num (max) 100 [VMA_STATS_FD_NUM]
Pid: 2378 Tid: 2378 VMA DETAILS: Conf File /etc/libvma.conf [VMA_CONFIG_FILE]
Pid: 2378 Tid: 2378 VMA DETAILS: Application ID VMA_DEFAULT_APPLICATION_ID [VMA_APPLICATION_ID]
Pid: 2378 Tid: 2378 VMA DETAILS: Polling CPU idle usage Disabled [VMA_CPU_USAGE_STATS]
Pid: 2378 Tid: 2378 VMA DETAILS: SigIntr Ctrl-C Handle Disabled [VMA_HANDLE_SIGINTR]
Pid: 2378 Tid: 2378 VMA DETAILS: SegFault Backtrace Disabled [VMA_HANDLE_SIGSEGV]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring allocation logic TX 0 (Ring per interface) [VMA_RING_ALLOCATION_LOGIC_TX]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring allocation logic RX 0 (Ring per interface) [VMA_RING_ALLOCATION_LOGIC_RX]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring migration ratio TX 100 [VMA_RING_MIGRATION_RATIO_TX]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring migration ratio RX 100 [VMA_RING_MIGRATION_RATIO_RX]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring limit per interface 0 (no limit) [VMA_RING_LIMIT_PER_INTERFACE]
Pid: 2378 Tid: 2378 VMA DETAILS: Ring On Device Memory TX 0 [VMA_RING_DEV_MEM_TX]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP max syn rate 0 (no limit) [VMA_TCP_MAX_SYN_RATE]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx Mem Segs TCP 1000000 [VMA_TX_SEGS_TCP]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx Mem Bufs 200000 [VMA_TX_BUFS]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx QP WRE 2048 [VMA_TX_WRE]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx QP WRE Batching 64 [VMA_TX_WRE_BATCHING]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx Max QP INLINE 204 [VMA_TX_MAX_INLINE]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx MC Loopback Enabled [VMA_TX_MC_LOOPBACK]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx non-blocked eagains Disabled [VMA_TX_NONBLOCKED_EAGAINS]
Pid: 2378 Tid: 2378 VMA DETAILS: Tx Prefetch Bytes 256 [VMA_TX_PREFETCH_BYTES]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Mem Bufs 200000 [VMA_RX_BUFS]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx QP WRE 16000 [VMA_RX_WRE]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx QP WRE Batching 64 [VMA_RX_WRE_BATCHING]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Byte Min Limit 65536 [VMA_RX_BYTES_MIN]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Poll Loops 100000 [VMA_RX_POLL]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Poll Init Loops 0 [VMA_RX_POLL_INIT]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx UDP Poll OS Ratio 100 [VMA_RX_UDP_POLL_OS_RATIO]
Pid: 2378 Tid: 2378 VMA DETAILS: HW TS Conversion 3 [VMA_HW_TS_CONVERSION]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Poll Yield Disabled [VMA_RX_POLL_YIELD]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Prefetch Bytes 256 [VMA_RX_PREFETCH_BYTES]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx Prefetch Bytes Before Poll 0 [VMA_RX_PREFETCH_BYTES_BEFORE_POLL]
Pid: 2378 Tid: 2378 VMA DETAILS: Rx CQ Drain Rate Disabled [VMA_RX_CQ_DRAIN_RATE_NSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: GRO max streams 32 [VMA_GRO_STREAMS_MAX]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP 3T rules Disabled [VMA_TCP_3T_RULES]
Pid: 2378 Tid: 2378 VMA DETAILS: UDP 3T rules Enabled [VMA_UDP_3T_RULES]
Pid: 2378 Tid: 2378 VMA DETAILS: ETH MC L2 only rules Disabled [VMA_ETH_MC_L2_ONLY_RULES]
Pid: 2378 Tid: 2378 VMA DETAILS: Force Flowtag for MC Disabled [VMA_MC_FORCE_FLOWTAG]
Pid: 2378 Tid: 2378 VMA DETAILS: Select Poll (usec) 100000 [VMA_SELECT_POLL]
Pid: 2378 Tid: 2378 VMA DETAILS: Select Poll OS Force Disabled [VMA_SELECT_POLL_OS_FORCE]
Pid: 2378 Tid: 2378 VMA DETAILS: Select Poll OS Ratio 10 [VMA_SELECT_POLL_OS_RATIO]
Pid: 2378 Tid: 2378 VMA DETAILS: Select Skip OS 4 [VMA_SELECT_SKIP_OS]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Drain Interval (msec) 10 [VMA_PROGRESS_ENGINE_INTERVAL]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Drain WCE (max) 10000 [VMA_PROGRESS_ENGINE_WCE_MAX]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Interrupts Moderation Enabled [VMA_CQ_MODERATION_ENABLE]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Moderation Count 48 [VMA_CQ_MODERATION_COUNT]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Moderation Period (usec) 50 [VMA_CQ_MODERATION_PERIOD_USEC]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ AIM Max Count 560 [VMA_CQ_AIM_MAX_COUNT]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ AIM Max Period (usec) 250 [VMA_CQ_AIM_MAX_PERIOD_USEC]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ AIM Interval (msec) 250 [VMA_CQ_AIM_INTERVAL_MSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ AIM Interrupts Rate (per sec) 5000 [VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Poll Batch (max) 16 [VMA_CQ_POLL_BATCH_MAX]
Pid: 2378 Tid: 2378 VMA DETAILS: CQ Keeps QP Full Enabled [VMA_CQ_KEEP_QP_FULL]
Pid: 2378 Tid: 2378 VMA DETAILS: QP Compensation Level 256 [VMA_QP_COMPENSATION_LEVEL]
Pid: 2378 Tid: 2378 VMA DETAILS: Offloaded Sockets Enabled [VMA_OFFLOADED_SOCKETS]
Pid: 2378 Tid: 2378 VMA DETAILS: Timer Resolution (msec) 10 [VMA_TIMER_RESOLUTION_MSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP Timer Resolution (msec) 100 [VMA_TCP_TIMER_RESOLUTION_MSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP control thread 0 (Disabled) [VMA_TCP_CTL_THREAD]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP timestamp option 0 [VMA_TCP_TIMESTAMP_OPTION]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP nodelay 0 [VMA_TCP_NODELAY]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP quickack 0 [VMA_TCP_QUICKACK]
Pid: 2378 Tid: 2378 VMA DETAILS: Exception handling mode -1 (just log debug message) [VMA_EXCEPTION_HANDLING]
Pid: 2378 Tid: 2378 VMA DETAILS: Avoid sys-calls on tcp fd Disabled [VMA_AVOID_SYS_CALLS_ON_TCP_FD]
Pid: 2378 Tid: 2378 VMA DETAILS: Allow privileged sock opt Enabled [VMA_ALLOW_PRIVILEGED_SOCK_OPT]
Pid: 2378 Tid: 2378 VMA DETAILS: Delay after join (msec) 0 [VMA_WAIT_AFTER_JOIN_MSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: Internal Thread Affinity -1 [VMA_INTERNAL_THREAD_AFFINITY]
Pid: 2378 Tid: 2378 VMA DETAILS: Internal Thread Cpuset [VMA_INTERNAL_THREAD_CPUSET]
Pid: 2378 Tid: 2378 VMA DETAILS: Internal Thread Arm CQ Disabled [VMA_INTERNAL_THREAD_ARM_CQ]
Pid: 2378 Tid: 2378 VMA DETAILS: Internal Thread TCP Handling 0 (deferred) [VMA_INTERNAL_THREAD_TCP_TIMER_HANDLING]
Pid: 2378 Tid: 2378 VMA DETAILS: Thread mode Multi spin lock [VMA_THREAD_MODE]
Pid: 2378 Tid: 2378 VMA DETAILS: Buffer batching mode 1 (Batch and reclaim buffers) [VMA_BUFFER_BATCHING_MODE]
Pid: 2378 Tid: 2378 VMA DETAILS: Mem Allocate type 1 (Contig Pages) [VMA_MEM_ALLOC_TYPE]
Pid: 2378 Tid: 2378 VMA DETAILS: Num of UC ARPs 3 [VMA_NEIGH_UC_ARP_QUATA]
Pid: 2378 Tid: 2378 VMA DETAILS: UC ARP delay (msec) 10000 [VMA_NEIGH_UC_ARP_DELAY_MSEC]
Pid: 2378 Tid: 2378 VMA DETAILS: Num of neigh restart retries 1 [VMA_NEIGH_NUM_ERR_RETRIES]
Pid: 2378 Tid: 2378 VMA DETAILS: IPOIB support Enabled [VMA_IPOIB]
Pid: 2378 Tid: 2378 VMA DETAILS: SocketXtreme mode Disabled [VMA_SOCKETXTREME]
Pid: 2378 Tid: 2378 VMA DETAILS: BF (Blue Flame) Enabled [VMA_BF]
Pid: 2378 Tid: 2378 VMA DETAILS: fork() support Enabled [VMA_FORK]
Pid: 2378 Tid: 2378 VMA DETAILS: close on dup2() Enabled [VMA_CLOSE_ON_DUP2]
Pid: 2378 Tid: 2378 VMA DETAILS: MTU 0 (follow actual MTU) [VMA_MTU]
Pid: 2378 Tid: 2378 VMA DETAILS: MSS 0 (follow VMA_MTU) [VMA_MSS]
Pid: 2378 Tid: 2378 VMA DETAILS: TCP CC Algorithm 0 (LWIP) [VMA_TCP_CC_ALGO]
Pid: 2378 Tid: 2378 VMA DETAILS: Polling Rx on Tx TCP Disabled [VMA_RX_POLL_ON_TX_TCP]
Pid: 2378 Tid: 2378 VMA DETAILS: Trig dummy send getsockname() Disabled [VMA_TRIGGER_DUMMY_SEND_GETSOCKNAME]
VMA INFO: -----------------------------------------------------------------
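When comparing runs, it can help to extract the reported parameters from a captured startup log. The sketch below is one possible approach, not a VMA tool: the function name `reported_parameters` is hypothetical, and it assumes that each entry sits on a single line that ends with the parameter name in square brackets. Because the log format is subject to change, treat the pattern as an assumption to verify against your own output.

```python
import re

# Matches e.g. "VMA DETAILS: Rx Poll Loops 100000 [VMA_RX_POLL]",
# with or without a leading "Pid: ... Tid: ..." prefix.
ENTRY = re.compile(r"VMA (?:INFO|DETAILS): (?P<text>.*?)\s*\[(?P<name>VMA_\w+)\]\s*$")


def reported_parameters(log_text):
    """Map each bracketed parameter name to the text logged in front of it.

    Hypothetical helper for illustration; it is not part of VMA.
    """
    parameters = {}
    for line in log_text.splitlines():
        match = ENTRY.search(line)
        if match:
            parameters[match.group("name")] = match.group("text").strip()
    return parameters
```

The description and the value are kept together in one string, because the flattened log does not mark where the description ends and the value begins.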
Configuration Parameters Values
The following table lists the VMA configuration parameters and their possible values.
VMA Configuration Parameter |
Description and Examples |
VMA_TRACELEVEL |
PANIC = 0 – Panic level logging. This trace level causes fatal behavior and halts the application, typically caused by memory allocation problems. PANIC level is rarely used. |
ERROR = 1 – Runtime errors in VMA. Typically, this trace level helps you identify internal logic errors, such as errors from underlying OS or InfiniBand verb calls, and internal double mapping/unmapping of objects. |
|
WARN = WARNING = 2 – Runtime warning that does not disrupt the application workflow. A warning may indicate problems in the setup or in its overall configuration. For example, address resolution failures (due to an incorrect routing setup configuration), corrupted IP packets in the receive path, or unsupported functions requested by the user application. |
|
INFO = INFORMATION = 3 – General information passed to the user of the application. This trace level includes configuration logging or general information to assist you with better use of the VMA library. |
|
DETAILS – More detailed general information passed to the user of the application. This trace level includes printing of all VMA environment variables at startup. |
|
DEBUG = 4 – High-level insight to the operations performed in VMA. In this logging level all socket API calls are logged, and internal high-level control channels log their activity. |
|
FINE = FUNC = 5 – Low-level runtime logging of activity. This logging level includes basic Tx and Rx logging in the fast path. Note that using this setting lowers application performance. We recommend that you use this level with the VMA_LOG_FILE parameter. |
|
FINER = FUNC_ALL = 6 – Very low-level runtime logging of activity. This logging level drastically lowers application performance. We recommend that you use this level with the VMA_LOG_FILE parameter. |
|
VMA_LOG_DETAILS |
Provides additional logging details on each log line. 0 = Basic log line; 1 = With ThreadId; 2 = With ProcessId and ThreadId; 3 = With Time, ProcessId, and ThreadId (Time is the number of milliseconds since the start of the process). Default: 0. For VMA_TRACELEVEL >= 4, this value defaults to 2. |
VMA_LOG_FILE |
Redirects all VMA logging to a specific user-defined file. This is very useful when raising the VMA_TRACELEVEL. VMA replaces a single '%d' appearing in the log file name with the PID of the process loaded with VMA. This can help when running multiple instances of VMA, each with its own log file name. Example: VMA_LOG_FILE=/tmp/vma_log.txt |
VMA_CONFIG_FILE |
Sets the full path to the VMA configuration file. Example: VMA_CONFIG_FILE=/tmp/libvma.conf Default: /etc/libvma.conf |
VMA_LOG_COLORS |
Uses a color scheme when logging; red for errors and warnings, and dim for very low level debugs. VMA_LOG_COLORS is automatically disabled when logging is done directly to a non-terminal device (for example, when VMA_LOG_FILE is configured). Default: 1 (Enabled) |
VMA_CPU_USAGE_STATS |
Calculates the VMA CPU usage during polling hardware loops. This information is available through VMA stats utility. Default: 0 (Disabled) |
VMA_APPLICATION_ID |
Specifies a group of rules from libvma.conf for VMA to apply. Example: VMA_APPLICATION_ID=iperf_server Default: VMA_DEFAULT_APPLICATION_ID (match only the '*' group rule) |
VMA_HANDLE_SIGINTR |
When enabled, the VMA handler is called when an interrupt signal is sent to the process. VMA also calls the application's handler, if it exists. Range: 0 to 1 Default: 0 (Disabled) |
VMA_HANDLE_SIGSEGV |
When enabled, a print backtrace is performed, if a segmentation fault occurs. Range: 0 to 1 Default: 0 (Disabled) |
VMA_STATS_FD_NUM |
Maximum number of sockets monitored by the VMA statistics mechanism. Range: 0 to 1024 Default: 100 |
VMA_STATS_FILE |
Redirects socket statistics to a specific user-defined file. VMA dumps each socket's statistics into a file when closing the socket. Example: VMA_STATS_FILE=/tmp/stats |
VMA_STATS_SHMEM_DIR |
Sets the directory path for VMA to create the shared memory files for vma_stats. If this value is set to an empty string (""), no shared memory files are created. Default: /tmp/ |
VMA_VMAD_NOTIFY_DIR |
Sets the directory path for VMA to write files used by vmad. Default: /tmp/vma Note: When this parameter is used, vmad must be run with --notify-dir pointing to the same directory. |
VMA_TCP_MAX_SYN_RATE |
Limits the number of TCP SYN packets that VMA handles per second for each listen socket. Example: when this value is set to 10, the maximum number of TCP connections accepted by VMA per second for each listen socket is 10. Set this value to 0 for VMA to handle an unlimited number of TCP SYN packets per second for each listen socket. Range: 0 to 100000 Default: 0 (no limit) |
VMA_TX_SEGS_TCP |
Number of TCP LWIP segments allocated for each VMA process. Default: 1000000 |
VMA_TX_BUFS |
Number of global Tx data buffer elements allocated. Default: 200000 |
VMA_TX_WRE |
Number of Work Request Elements allocated in all transmit QPs. The number of QPs can change according to the number of network offloaded interfaces. Default: 3000 The size of the Tx buffers is determined by the VMA_MTU parameter value (see below). If this value is raised, peaks in the packet rate can be better sustained; however, this increases memory usage. A smaller number of data buffers gives a smaller memory footprint, but may not sustain peaks in the data rate. |
VMA_TX_WRE_BATCHING |
Controls the number of aggregated Work Request Elements before receiving a completion signal (CQ entry) from the hardware. Previously this number was hard coded as 64. Making it configurable allows better control of the jitter encountered in the Tx completion handling. Valid value range: 1-64 Default: 64 |
VMA_TX_MAX_INLINE |
Maximum send inline data size set for the QP. Data copied into the INLINE space is at least 32 bytes of headers, and the rest can be user datagram payload. VMA_TX_MAX_INLINE=0 disables inlining on the Tx path. In older releases this parameter was called VMA_MAX_INLINE. Default: 220 |
VMA_TX_MC_LOOPBACK |
Sets the initial value used internally by the VMA to control multicast loopback packet behavior during transmission. An application that calls setsockopt() with IP_MULTICAST_LOOP overwrites the initial value set by this parameter. Range: 0 - Disabled, 1 - Enabled Default: 1 |
VMA_TX_NONBLOCKED_EAGAINS |
When disabled (default), VMA returns 'OK' on all send operations performed on a non-blocked UDP socket, which is the OS default behavior; a datagram that cannot be sent is silently dropped inside VMA or the network stack. When enabled (set to 1), VMA returns the error EAGAIN if it was unable to accomplish the send operation and the datagram was dropped. In both cases, a dropped Tx statistical counter is incremented. Default: 0 (Disabled) |
VMA_TX_PREFETCH_BYTES |
Accelerates an offloaded send operation by optimizing the cache. Different values give an optimized send rate on different machines. We recommend that you adjust this parameter to your specific hardware. Range: 0 to MTU size Disable with a value of 0 Default: 256 bytes |
VMA_RX_BUFS |
The number of Rx data buffer elements allocated for the processes. These data buffers are used by all QPs on all HCAs, as determined by the VMA_QP_LOGIC. Default: 200000 |
VMA_RX_WRE |
The number of Work Request Elements allocated in all receive QPs. Default: 16000 |
VMA_RX_WRE_BATCHING |
Number of Work Request Elements and RX buffers to batch before recycling. Batching decreases the latency mean, but might increase latency STD. Valid value range: 1-1024 Default: 64 |
VMA_RX_BYTES_MIN |
The minimum value in bytes used per socket by VMA when applications call setsockopt(SO_RCVBUF). If the application tries to set a smaller value than configured in VMA_RX_BYTES_MIN, VMA forces this minimum limit value on the socket. VMA offloaded sockets receive the maximum amount of ready bytes. If the application does not drain sockets and the byte limit is reached, newly received datagrams are dropped. The application's socket usage of current, max, and dropped bytes and packet counters can be monitored using vma_stats. Default: 65536 |
VMA_RX_POLL |
The number of times to unsuccessfully poll an Rx for VMA packets before going to sleep. Range: -1, 0 … 100,000,000 Default: 100,000 This value can be reduced to lower the load on the CPU. However, the price paid for this is that the Rx latency is expected to increase. Recommended values:
Once the VMA has gone to sleep, if it is in blocked mode, it waits for an interrupt; if it is in non-blocked mode, it returns -1. This Rx polling is performed when the application is working with direct blocked calls to read(), recv(), recvfrom(), and recvmsg(). When the Rx path has successful poll hits, the latency improves dramatically. However, this causes increased CPU utilization. For more information, see Debugging, Troubleshooting, and Monitoring. |
VMA_RX_POLL_INIT |
VMA maps all UDP sockets as potentially offload-capable. Offload starts working, and CQ polling starts in VMA, only after ADD_MEMBERSHIP is set. This parameter controls the polling count during this transition phase where the socket is a UDP unicast socket and no multicast addresses were added to it. Once the first ADD_MEMBERSHIP is called, the VMA_RX_POLL (above) takes effect. Value range is similar to the VMA_RX_POLL (above). Default: 0 |
VMA_RX_UDP_POLL_OS_RATIO |
Defines the ratio between VMA CQ poll and OS FD poll. This results in a single poll of the non-offloaded sockets every VMA_RX_UDP_POLL_OS_RATIO offloaded socket (CQ) polls, regardless of whether the CQ poll was a hit or a miss and whether the socket is blocking or non-blocking. When disabled, only offloaded sockets are polled. This parameter replaces the two old parameters:
Disable with 0 Default: 10 |
VMA_HW_TS_CONVERSION |
Defines timestamp conversion method. The value of VMA_HW_TS_CONVERSION is determined by all devices, that is, if the hardware of one device does not support the conversion, then it will be disabled for the other devices. Currently only UDP RX flow is supported. Options = [0,1,2,3,4]:
Default value: 3 (Sync to system time) |
VMA_RX_POLL_YIELD |
When an application is running with multiple threads on a limited number of cores, there is a need for each thread polling inside VMA (read, readv, recv, and recvfrom) to yield the CPU to another polling thread so as not to starve them from processing incoming packets. Default: 0 (Disabled) |
VMA_RX_PREFETCH_BYTES |
The size of the receive buffer to prefetch into the cache while processing ingress packets. A single cache line is 64 bytes; the value should be at least 32 bytes to cover the IPoIB+IP+UDP headers and a small part of the user payload. Increasing this size can help improve performance for larger user payloads. Range: 32 bytes to MTU size Default: 256 bytes |
VMA_RX_CQ_DRAIN_RATE_NSEC |
Socket's receive path CQ drain logic rate control. When disabled (default), the socket's receive path attempts to return a ready packet from the socket's receive ready packet queue. If the ready receive packet queue is empty, the socket checks the CQ for ready completions for processing. When enabled, the socket checks the CQ for ready completions even if its receive ready packet queue is not empty. This CQ polling rate is controlled in nanosecond resolution to prevent CPU consumption due to excessive CQ polling. This enables improved 'real-time' monitoring of the socket ready packet queue. Recommended value is 100-5000 (nsec) Default: 0 (Disabled) |
VMA_RX_POLL_ON_TX_TCP |
Enables TCP RX polling during TCP TX operation for faster TCP ACK reception. Default: 0 (Disabled) |
VMA_GRO_STREAMS_MAX |
Controls the number of TCP streams to perform GRO (generic receive offload) simultaneously. Disable GRO with a value of 0. Default: 32 |
VMA_TCP_3T_RULES |
Uses only 3 tuple rules for TCP, instead of using 5 tuple rules. This can improve performance for a server with a listen socket which accepts many connections from the same source IP. Enable with a value of 1. Default: 0 (Disabled) |
VMA_UDP_3T_RULES |
This parameter is relevant when the application uses connected UDP sockets. 3 tuple rules are used in the hardware flow steering rule when the parameter is enabled, and 5 tuple rules are used when it is disabled. Enabling this option can reduce hardware flow steering resources. However, when it is disabled, the application might see benefits in latency and cycles per packet. Default: 1 (Enabled) |
VMA_ETH_MC_L2_ONLY_RULES |
Uses only L2 rules for Ethernet Multicast. All loopback traffic will be handled by VMA instead of OS. Enable with a value of 1. Default: 0 (Disabled) |
VMA_SELECT_POLL |
The duration in microseconds (usec) in which to poll the hardware on the Rx path before blocking for an interrupt (when waiting and also when calling select(), poll(), or epoll_wait()). Range: -1, 0 … 100,000,000 Default: 100,000 When the selected path has successful poll hits, the latency improves dramatically. However, this comes at the expense of CPU utilization. For more information, see Debugging, Troubleshooting, and Monitoring. |
VMA_SELECT_POLL_OS_RATIO |
This enables polling the OS file descriptors while the user thread calls select(), poll(), or epoll_wait(), and VMA is busy in the offloaded socket polling loop. This results in a single poll of the non-offloaded sockets every VMA_SELECT_POLL_OS_RATIO offloaded socket (CQ) polls. When disabled, only offloaded sockets are polled. (See VMA_SELECT_POLL for more information.) Disable with 0 Default: 10 |
VMA_SELECT_SKIP_OS |
In select(), poll(), or epoll_wait(), forces VMA to check the non-offloaded sockets even though an offloaded socket has a ready packet that was found while polling. Range: 0 … 10,000 Default: 4 |
VMA_CQ_POLL_BATCH_MAX |
The maximum size of the array while polling the CQs in the VMA. Default: 16 |
VMA_PROGRESS_ENGINE_INTERVAL |
Internal VMA thread safety mechanism that checks that the CQ is drained at least once every N milliseconds. This mechanism allows VMA to progress the TCP stack even when the application does not access its socket (so it does not provide a context to VMA). If the CQ was already drained by the application receive socket API calls, this thread goes back to sleep without any processing. Disable with 0 Default: 10 milliseconds |
VMA_PROGRESS_ENGINE_WCE_MAX |
Each time the VMA internal thread starts its CQ draining, it stops when it reaches this maximum value. The application is not limited by this value in the number of CQ elements that it can process by calling any of the receive path socket APIs. Default: 2048 |
VMA_CQ_MODERATION_ENABLE |
Enable CQ interrupt moderation. Default: 1 (Enabled) |
VMA_CQ_MODERATION_COUNT |
Number of packets to hold before generating interrupt. Default: 48 |
VMA_CQ_MODERATION_PERIOD_USEC |
Period in microseconds for holding the packet before generating interrupt. Default: 50 |
VMA_CQ_AIM_MAX_COUNT |
Maximum count value to use in the adaptive interrupt moderation algorithm. Default: 560 |
VMA_CQ_AIM_MAX_PERIOD_USEC |
Maximum period value to use in the adaptive interrupt moderation algorithm. Default: 250 |
VMA_CQ_AIM_INTERVAL_MSEC |
Frequency of interrupt moderation adaptation. Interval in milliseconds between adaptation attempts. Use value of 0 to disable adaptive interrupt moderation. Default: 250 |
VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC |
Desired interrupts rate per second for each ring (CQ). The count and period parameters for CQ moderation will change automatically to achieve the desired interrupt rate for the current traffic rate. Default: 5000 |
VMA_CQ_KEEP_QP_FULL |
If disabled, the CQ does not try to compensate for each poll on the receive path. It uses a "debt" to remember how many WRE are missing from each QP, so that it can fill it when buffers become available. If enabled (default), the CQ tries to compensate the QP for each polled receive completion. If there is a shortage of buffers, it reposts a recently completed buffer. This causes a packet drop, and is monitored in vma_stats. Default: 1 (Enabled) |
VMA_QP_COMPENSATION_LEVEL |
The number of spare receive buffers that the CQ holds for refilling the QP while full receive buffers are being processed inside VMA. Default: 256 buffers |
VMA_OFFLOADED_SOCKETS |
Creates all sockets as offloaded/not-offloaded by default.
Default: 1 (Enabled) |
VMA_TIMER_RESOLUTION_MSEC |
Controls VMA internal thread wakeup timer resolution (in milliseconds). Default: 10 (milliseconds) |
VMA_TCP_TIMER_RESOLUTION_MSEC |
Controls VMA internal TCP timer resolution (fast timer) (in milliseconds). Minimum value is the internal thread wakeup timer resolution (VMA_TIMER_RESOLUTION_MSEC). Default: 100 (milliseconds) |
VMA_TCP_CTL_THREAD |
Does all TCP control flows in the internal thread. This feature should be disabled if using blocking poll/select (epoll is OK).
Default: 0 (disabled) |
VMA_TCP_TIMESTAMP_OPTION |
LWIP does not currently support the RTTM and PAWS mechanisms. See RFC 1323 for details.
Default: 0 (disabled) |
VMA_TCP_NODELAY |
If set, disables the Nagle algorithm for each TCP socket during initialization, so that TCP segments are always sent as soon as possible, even if there is only a small amount of data. For more information on the TCP_NODELAY flag, refer to the TCP manual page. Valid values are:
|
VMA_TCP_QUICKACK |
If set, disables delayed acknowledgment, so that TCP responds after every packet. For more information on the TCP_QUICKACK flag, refer to the TCP manual page. Valid values are:
|
VMA_EXCEPTION_HANDLING |
Handles missing support or error cases in Socket API or functionality by VMA. It quickly identifies VMA unsupported Socket API or features.
Default: -1 |
VMA_AVOID_SYS_CALLS_ON_TCP_FD |
For TCP fd, avoid system calls for the supported options of: ioctl, fcntl, getsockopt, setsockopt. Non-supported options will go to OS. To activate, use VMA_AVOID_SYS_CALLS_ON_TCP_FD=1. Default: 0 (disabled) |
VMA_THREAD_MODE |
By default VMA is ready for multi-threaded applications, meaning it is thread-safe. If the user application is single threaded, use this configuration parameter to help eliminate VMA locks and improve performance. Values:
Default: 1 (Multi with spin lock) |
VMA_BUFFER_BATCHING_MODE |
Enables batching of returning Rx buffers and pulling Tx buffers per socket.
Default: 1 |
VMA_MEM_ALLOC_TYPE |
This replaces the VMA_HUGETBL parameter logic. VMA will try to allocate data buffers as configured:
OFED will also try to allocate QP & CQ memory accordingly:
To override OFED use: (MLX_QP_ALLOC_TYPE, MLX_CQ_ALLOC_TYPE). Default: 1 (Contiguous pages) |
VMA_FORK |
Controls VMA fork support. Setting this flag on will cause VMA to call ibv_fork_init() function. ibv_fork_init() initializes libibverbs's data structures to handle fork() function calls correctly and avoid data corruption. If ibv_fork_init() is not called or returns a non-zero status, then libibverbs data structures are not fork()-safe and the effect of an application calling fork() is undefined. ibv_fork_init() works on Linux kernels 2.6.17 and later, which support the MADV_DONTFORK flag for madvise(). You should use an OFED stack version that supports fork() with huge pages (MLNX_OFED 1.5.3 to 3.2 and 4.0-2.0.0.0 and later). VMA allocates huge pages (VMA_HUGETBL) by default. For limitations of using fork() with VMA, please refer to the Release Notes. Default: 1 (Enabled) |
VMA_MTU |
Size of each Rx and Tx data buffer (Maximum Transmission Unit). This value sets the fragmentation size of the packets sent by the VMA library.
Default: 0 (following interface actual MTU) |
VMA_MSS |
Defines the max TCP payload size that can be sent without IP fragmentation. Value of 0 will set VMA's TCP MSS to be aligned with VMA_MTU configuration (leaving 40 bytes of room for IP + TCP headers; "TCP MSS = VMA_MTU - 40"). Other VMA_MSS values will force VMA's TCP MSS to that specific value. Default: 0 (following VMA_MTU) |
VMA_CLOSE_ON_DUP2 |
When this parameter is enabled, VMA handles the duplicated file descriptor (oldfd), as if it is closed (clear internal data structures) and only then forwards the call to the OS. This is, in effect, a very rudimentary dup2 support. It supports only the case where dup2 is used to close file descriptors. Default: 1 (Enabled) |
VMA_INTERNAL_THREAD_AFFINITY |
Controls which CPU core(s) the VMA internal thread is serviced on. The CPU set should be provided either as a hexadecimal value that represents a bitmask or as a comma-delimited list of values (ranges are allowed). Both the bitmask and the comma-delimited list methods are identical to what is supported by the taskset command. See the man page on taskset for additional information. The -1 value disables the Internal Thread Affinity setting by VMA.
Bitmask examples: 0x00000001 – Run on processor 0
Comma-delimited examples: 0,4,8 – Run on processors 0, 4, and 8
Default: -1 |
VMA_INTERNAL_THREAD_CPUSET |
Selects a CPUSET for VMA internal thread (For further information, see man page of cpuset). The value is either the path to the CPUSET (for example: /dev/cpuset/my_set), or an empty string to run it on the same CPUSET the process runs on. |
VMA_INTERNAL_THREAD_ARM_CQ |
Wakes up the internal thread for each packet that the CQ receives. Polls and processes the packet and brings it to the socket layer. This can minimize latency for a busy application that is not available to receive the packet when it arrives. However, this might decrease performance for high pps rate applications. Default: 0 (Disabled) |
VMA_INTERNAL_THREAD_TCP_TIMER_HANDLING |
Selects the internal thread policy when handling TCP timers. Use a value of 0 for deferred handling: the internal thread does not handle TCP timers upon timer expiration (once every 100 ms), in order to let application threads handle them first. Use a value of 1 for immediate handling: the internal thread tries to lock and handle TCP timers upon timer expiration (once every 100 ms). Application threads may be blocked until the internal thread finishes handling TCP timers. Default: 0 (deferred handling) |
VMA_WAIT_AFTER_JOIN_MSEC |
This parameter sets the delay before the first packet is sent after receiving the multicast JOINED event from the SM. This helps to overcome the loss of the first few packets of an outgoing stream due to the SM's lengthy handling of MFT configuration on the switch chips. Default: 0 (milliseconds) |
VMA_NEIGH_UC_ARP_QUATA |
VMA sends a UC ARP if the neigh state is NUD_STALE. If the neigh state is still NUD_STALE, VMA retries sending the UC ARP up to VMA_NEIGH_UC_ARP_QUATA times and then sends a BC ARP. Default: 3 |
VMA_NEIGH_UC_ARP_DELAY_MSEC |
This parameter indicates the number of milliseconds to wait between UC ARPs. Default: 10000 |
VMA_NEIGH_NUM_ERR_RETRIES |
Indicates the number of retries to restart the NEIGH state machine if NEIGH receives an ERROR event. Default: 1 |
VMA_BF |
Enables/disables BlueFlame usage of the card. Default: 1 (Enabled) |
VMA_SOCKETXTREME |
When this parameter is enabled, VMA operates in SocketXtreme mode. SocketXtreme mode brings down latency, eliminating copy operations and increasing throughput, thus allowing applications to further utilize true kernel bypass architecture. An application should use a socket extension API named SocketXtreme. Default: 0 (Disabled) |
VMA_TRIGGER_DUMMY_SEND_GETSOCKNAME |
This parameter triggers a dummy packet send from getsockname() to warm up the caches. For more information, see section "Dummy Send" to Improve Low Message Rate Latency. Default: 0 (Disabled) |
VMA_SPEC |
Warning
VMA_SPEC sets all the required configuration parameters of VMA. Usually, no additional configuration is required. VMA predefined specification profile for latency:
|
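The relationship between VMA_MTU and VMA_MSS described above ("TCP MSS = VMA_MTU - 40") can be illustrated with a short sketch. The helper name effective_tcp_mss and its parameters are hypothetical and are not part of VMA. The sketch only restates the documented rule, assuming that a VMA_MTU of 0 means the interface's actual MTU is used.

```c
/* Bytes reserved for the IP + TCP headers, per the VMA_MSS description. */
#define IP_TCP_HEADER_BYTES 40

/*
 * Hypothetical helper (not a VMA API): returns the TCP MSS implied by
 * the documented rule.
 *   interface_mtu - actual MTU of the network interface
 *   vma_mtu       - VMA_MTU value (0 = follow the interface MTU)
 *   vma_mss       - VMA_MSS value (0 = follow VMA_MTU)
 */
int effective_tcp_mss(int interface_mtu, int vma_mtu, int vma_mss)
{
    /* VMA_MTU=0 falls back to the interface MTU. */
    int mtu = (vma_mtu == 0) ? interface_mtu : vma_mtu;

    /* A non-zero VMA_MSS forces that specific value. */
    if (vma_mss != 0)
        return vma_mss;

    /* Otherwise leave room for the IP + TCP headers. */
    return mtu - IP_TCP_HEADER_BYTES;
}
```

For example, with both parameters left at their defaults on an interface whose MTU is 1500, the rule gives an MSS of 1460 bytes.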
Beta Level Features Configuration Parameters
The following table lists configuration parameters and their possible values for new VMA Beta level features. The parameters below are disabled by default.
These VMA features are still experimental and subject to changes. They can help improve performance of multithread applications.
We recommend altering these parameters in a controlled environment until reaching the best performance tuning.
VMA Configuration Parameter |
Description and Examples |
VMA_RING_ALLOCATION_LOGIC_TX VMA_RING_ALLOCATION_LOGIC_RX |
Ring allocation logic is used to separate the traffic into different rings. By default, all sockets use the same ring for both RX and TX over the same interface. For different interfaces, different rings are used, even when specifying the logic to be per socket or thread. Using different rings is useful when tuning for a multi-threaded application and aiming for HW resource separation. Important
This feature might decrease performance for applications whose main processing loop is based on select() and/or poll().
The logic options are:
Default: 0 |
VMA_RING_MIGRATION_RATIO_TX VMA_RING_MIGRATION_RATIO_RX |
Ring migration ratio is used with the "ring per thread" logic in order to decide when it is beneficial to replace the socket's ring with the ring allocated for the current thread. Every VMA_RING_MIGRATION_RATIO iterations (of accessing the ring), the current thread ID is checked to see whether the ring matches the current thread. If not, ring migration is considered. If the ring continues to be accessed from the same thread for a certain number of iterations, the socket is migrated to this thread's ring. Use a value of -1 in order to disable migration. Default: 100 |
VMA_RING_LIMIT_PER_INTERFACE |
Limits the number of rings that can be allocated per interface. For example, in ring allocation per socket logic, if the number of sockets using the same interface is larger than the limit, several sockets will share the same ring. Warning
VMA_RX_BUFS might need to be adjusted in order to have enough buffers for all rings in the system. Each ring consumes VMA_RX_WRE buffers.
Use a value of 0 for an unlimited number of rings. Default: 0 (no limit) |
VMA_RING_DEV_MEM_TX |
VMA can use on-device memory to store the egress packet if it does not fit into the BF inline buffer. This improves application egress latency by reducing PCI transactions. VMA_RING_DEV_MEM_TX lets the user set the amount of on-device memory allocated for each TX ring. The total size of the on-device memory is limited to 256k for a single-port HCA and to 128k for a dual-port HCA. Default: 0 |
VMA_TCP_CC_ALGO |
TCP congestion control algorithm. The default algorithm that comes with LWIP is a variation of Reno/New-Reno. The new Cubic algorithm was adapted from the FreeBSD implementation. Use a value of 0 for the LWIP algorithm. Use a value of 1 for the Cubic algorithm. Use a value of 2 in order to disable the congestion algorithm. Default: 0 (LWIP) |
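The interaction between VMA_RING_LIMIT_PER_INTERFACE and the Rx buffer budget described above can be estimated with a small sketch. Both helper names are hypothetical and are not part of VMA. The sketch assumes per-socket ring allocation logic on a single interface and takes the VMA_RX_WRE value as an input rather than assuming its default. The result is only a rough lower bound to compare against VMA_RX_BUFS.

```c
/*
 * Hypothetical helper (not a VMA API): number of rings used on one
 * interface under per-socket ring allocation.
 *   num_sockets - sockets using the interface
 *   ring_limit  - VMA_RING_LIMIT_PER_INTERFACE (0 = no limit)
 */
int rings_per_interface(int num_sockets, int ring_limit)
{
    /* With no limit, or fewer sockets than the limit, each socket
     * gets its own ring. */
    if (ring_limit == 0 || num_sockets <= ring_limit)
        return num_sockets;

    /* Otherwise several sockets share the same ring. */
    return ring_limit;
}

/*
 * Hypothetical helper: Rx buffers consumed by the rings, given that
 * each ring consumes VMA_RX_WRE buffers.
 */
long rx_buffers_for_rings(int num_rings, long rx_wre)
{
    return (long)num_rings * rx_wre;
}
```

If rx_buffers_for_rings(rings_per_interface(sockets, limit), rx_wre) exceeds the configured VMA_RX_BUFS, either VMA_RX_BUFS may need to be raised or the ring limit lowered.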
VMA can be loaded using Dynamically Loaded (DL) libraries. These libraries are not automatically loaded at program link time or start-up as with LD_PRELOAD. Instead, there is an API for opening a library, looking up symbols, handling errors, and closing the library.
The example below demonstrates how to load the socket() function. Similarly, users should load all other network-related functions as declared in sock-redirect.h:
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>
#include <unistd.h>   /* close() */
#include <arpa/inet.h>
#include <sys/socket.h>

/* Function pointer type matching the socket() prototype. */
typedef int (*socket_fptr_t)(int __domain, int __type, int __protocol);

int main(int argc, const char **argv)
{
    void *lib_handle;
    socket_fptr_t vma_socket;
    int fd;

    /* Load the VMA library at run time instead of using LD_PRELOAD. */
    lib_handle = dlopen("libvma.so", RTLD_LAZY);
    if (!lib_handle) {
        printf("FAILED to load libvma.so\n");
        exit(1);
    }

    /* Look up the socket() symbol exported by VMA. */
    vma_socket = (socket_fptr_t)dlsym(lib_handle, "socket");
    if (vma_socket == NULL) {
        printf("FAILED to load socket()\n");
        exit(1);
    }

    /* Create a UDP socket through the VMA implementation. */
    fd = vma_socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (fd < 0) {
        printf("FAILED open socket()\n");
        exit(1);
    }

    printf("socket creation succeeded fd = %d\n", fd);
    close(fd);
    dlclose(lib_handle);
    return 0;
}
For more information, refer to the dlopen man page.
For a complete example that includes all the necessary functions, see sockperf’s vma-redirect.h and vma-redirect.cpp files.
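Because VMA configuration parameters are Linux environment variables, a program that loads VMA with dlopen() could, in principle, set them from its own code before loading the library. The sketch below shows one way to do this with the standard setenv() call. The struct vma_param type and the set_vma_params helper are hypothetical and are not part of VMA. The sketch also assumes that VMA reads its parameters no earlier than library load time, which should be verified for your VMA version.

```c
#include <stdlib.h>

/* Hypothetical name/value pair for one VMA configuration parameter. */
struct vma_param {
    const char *name;   /* for example "VMA_TCP_CC_ALGO" */
    const char *value;  /* for example "1" */
};

/*
 * Hypothetical helper (not a VMA API): exports each parameter into the
 * process environment, overwriting any existing value.
 * Returns 0 on success, or -1 as soon as one setenv() call fails.
 */
int set_vma_params(const struct vma_param *params, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (setenv(params[i].name, params[i].value, 1) != 0)
            return -1;
    }
    return 0;
}
```

A caller would invoke set_vma_params() with, for instance, VMA_TCP_CC_ALGO set to 1 and VMA_INTERNAL_THREAD_AFFINITY set to 0x00000001, and only then call dlopen("libvma.so", RTLD_LAZY) as in the example above.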