NVIDIA Messaging Accelerator (VMA) Documentation Rev 9.8.40

Appendix: Sockperf – UDP/TCP Latency and Throughput Benchmarking Tool

This appendix presents sockperf, VMA's sample application for testing latency and throughput over the socket API.

Sockperf can be used natively, or with VMA acceleration.

Sockperf is an open source utility. For more general information, see https://github.com/Mellanox/sockperf.

Sockperf's advantage over other network benchmarking utilities is its focus on testing the performance of high-performance systems (as well as testing the performance of regular networking systems). In addition, sockperf covers most of the socket API calls and options.

Specifically, in addition to the standard throughput tests, sockperf:

  • Measures the latency of each discrete packet at sub-nanosecond resolution (using the TSC register, which counts CPU ticks with very low overhead).

  • Measures latency for ping-pong mode and for latency under load mode. This means that you can measure the latency of single packets even under a load of millions of PPS (without waiting for a packet's reply before sending the subsequent packet on time).

  • Enables spike analysis by providing, in each run, a histogram with various percentiles of the packets' latencies (for example: median, min, max, 99th percentile, and more) in addition to the average and standard deviation.

  • Can provide full logs containing all packets' tx/rx times, without affecting the benchmark itself. The logs can be further analyzed with external tools, such as MS Excel or matplotlib.

  • Supports many optional settings for good coverage of the socket API, while still keeping a very low overhead in the fast path to allow the cleanest results.

Sockperf operates by sending packets from the client (also known as the publisher) to the server (also known as the consumer), which then sends all or some of the packets back to the client. The measured round-trip time (RTT) is the time a packet takes to travel from one machine to the other and back over a specific network path, measured for packets of varying sizes.

  • The latency for a given one-way path between the two machines is the RTT divided by two.

  • The average RTT is calculated by summing the round-trip times of all the packets that complete the round trip and then dividing the total by the number of packets.
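As a worked illustration of these two calculations (the RTT values below are made up for the example):

```shell
# Hypothetical per-packet round-trip times, in microseconds
rtts="21.4 20.8 23.0 20.8"

# Average RTT = sum of RTTs / packet count; one-way latency = average RTT / 2
echo "$rtts" | awk '{
    for (i = 1; i <= NF; i++) sum += $i
    avg = sum / NF
    printf "average RTT = %.2f usec, one-way latency = %.2f usec\n", avg, avg / 2
}'
# prints: average RTT = 21.50 usec, one-way latency = 10.75 usec
```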

Sockperf can test the improvement of UDP/TCP traffic latency when running applications with and without VMA.

Sockperf can work as a server (consumer) or execute under-load, ping-pong, playback and throughput tests as a client (publisher).

In addition, sockperf provides more detailed statistical information and analysis, as described in the following section.

Sockperf is installed on the VMA server at /usr/bin/sockperf. Examples of running sockperf are provided in the sections below.

Warning

If you want to use multicast, you must first configure the routing table to map multicast addresses to the Ethernet interface, on both client and server. (See Configuring the Routing Table for Multicast Tests).

Advanced Statistics and Analysis

In each run, sockperf presents additional advanced statistics and analysis information:

  • In addition to the average latency and standard deviation, sockperf presents a histogram with various percentiles, including:

    • 50th percentile – the latency value for which 50 percent of the observations are smaller. The 50th percentile is also known as the median, and differs from the statistical average.

    • 99th percentile – the latency value for which 99 percent of the observations are smaller (and 1 percent are higher).

These percentiles, and the other percentiles that the histogram provides, are very useful for analyzing spikes in the network traffic.

  • Sockperf can provide a full log of all packets' tx and rx times by dumping all the data it uses for calculating percentiles and building the histogram to a comma-separated file. This file can be further analyzed using external tools such as Microsoft Excel or matplotlib.

All these additional calculations and reports are executed after the fast path is completed, so using these options does not affect the benchmark itself. During the fast-path runtime, sockperf records the txTime and rxTime of packets using the TSC CPU register, which has a negligible effect on the benchmark, as opposed to using the system clock, which can affect benchmarking results.
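As a sketch of the kind of post-processing such a dump enables, the following computes nearest-rank percentiles from a plain one-value-per-line file (the file name, values, and format are illustrative; the actual sockperf dump is comma separated and richer):

```shell
# Ten hypothetical per-packet latencies in usec, one per line
printf '%s\n' 10 12 11 250 10 11 13 10 12 11 > latencies.txt

# Nearest-rank percentiles: sort numerically, then pick the ranked entries
sort -n latencies.txt | awk '
    { v[NR] = $1 }
    END {
        p50 = v[int(NR * 0.50 + 0.5)]                      # median
        r99 = NR * 0.99
        p99 = v[(r99 == int(r99)) ? r99 : int(r99) + 1]    # ceil for 99th
        printf "median = %s usec, p99 = %s usec\n", p50, p99
    }'
# prints: median = 11 usec, p99 = 250 usec
```

Note how the single 250-usec spike is invisible in the median but dominates the 99th percentile, which is exactly the spike analysis the histogram is meant to support.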

If you want to use multicast, you must first configure the routing table to map multicast addresses to the Ethernet interface, on both client and server.

Example


# route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0

where eth0 is the Ethernet interface.
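The netmask 240.0.0.0 is a /4 prefix, so this single route covers the entire IPv4 multicast (class D) range. A quick arithmetic check of the last address the route matches:

```shell
# 224.0.0.0 with a /4 prefix spans 2^28 addresses; compute the last one
awk 'BEGIN {
    base = 224 * 2^24          # 224.0.0.0 as a 32-bit integer
    last = base + 2^28 - 1     # highest address still matched by the route
    printf "%d.%d.%d.%d\n", int(last / 2^24) % 256, int(last / 2^16) % 256,
           int(last / 2^8) % 256, last % 256
}'
# prints: 239.255.255.255
```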

You can also set the interface on runtime in sockperf:

  • Use "--mc-rx-if <ip>" to set the address of the interface on which to receive multicast packets (can be different from the routing table)

  • Use "--mc-tx-if <ip>" to set the address of the interface on which to transmit multicast packets (can be different from the routing table)

To measure latency statistics after the test completes, sockperf calculates the round-trip times (divided by two) between the client and the server for all messages, and then provides the average statistics and histogram.

UDP Ping-pong

To run UDP ping-pong:

  1. Run the server by using:


    # sockperf sr -i <server-ip>

  2. Run the client by using:


    # sockperf pp -i <server-ip> -m 64

    Where -m/--msg-size is the message size in bytes (minimum and default: 14).

Warning

For more sockperf Ping-pong options run:


# sockperf pp -h


TCP Ping-pong

To run TCP ping-pong:

  1. Run the server by using:


    # sockperf sr -i <server-ip> --tcp

  2. Run the client by using:


    # sockperf pp -i <server-ip> --tcp -m 64

TCP Ping-pong using VMA

To run TCP ping-pong using VMA:

  1. Run the server by using:


    # VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf sr -i <server-ip> --tcp

  2. Run the client by using:


    # VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf pp -i <server-ip> --tcp -m 64

    Where VMA_SPEC=latency is a predefined specification profile for latency.

To determine the maximum bandwidth and highest message rate for a single-process, single-threaded network application, sockperf attempts to send the maximum amount of data in a specific period of time.

UDP MC Throughput

To run UDP MC throughput:

  1. On both the client and the server, configure the routing table to map the multicast addresses to the interface by using:


    # route add -net 224.0.0.0 netmask 240.0.0.0 dev <interface>

  2. Run the server by using:


    # sockperf sr -i <server-100g-ip>

  3. Run the client by using:


    # sockperf tp -i <server-100g-ip> -m 1472

    Where -m/--msg-size is the message size in bytes (minimum and default: 14).

  4. The following output is obtained:


    sockperf: Total of 936977 messages sent in 1.100 sec
    sockperf: Summary: Message Rate is 851796 [msg/sec]
    sockperf: Summary: BandWidth is 1195.759 MBps (9566.068 Mbps)
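The reported bandwidth follows directly from the message rate and the message size (here MBps evidently denotes MiB per second); a quick cross-check of the summary above:

```shell
# BandWidth = message rate x message size; Mbps = MBps x 8
awk 'BEGIN {
    mbps = 851796 * 1472 / (1024 * 1024)    # bytes/sec -> MBps
    printf "%.3f MBps (%.3f Mbps)\n", mbps, mbps * 8
}'
# prints: 1195.759 MBps (9566.068 Mbps)
```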

Warning

For more sockperf throughput options run:


# sockperf tp -h


UDP MC Throughput using VMA

To run UDP MC throughput:

  1. After configuring the routing table as described in Configuring the Routing Table for Multicast Tests, run the server by using:


    # LD_PRELOAD=libvma.so sockperf sr -i <server-ip>

  2. Run the client by using:


    # LD_PRELOAD=libvma.so sockperf tp -i <server-ip> -m 1472

  3. The following output is obtained:


    sockperf: Total of 4651163 messages sent in 1.100 sec
    sockperf: Summary: Message Rate is 4228326 [msg/sec]
    sockperf: Summary: BandWidth is 5935.760 MBps (47486.083 Mbps)

UDP MC Throughput Summary

Test | 100 Gb Ethernet | 100 Gb Ethernet + VMA
---- | --------------- | ---------------------
Message Rate | 851796 [msg/sec] | 4228326 [msg/sec]
Bandwidth | 1195.759 MBps (9566.068 Mbps) | 5935.760 MBps (47486.083 Mbps)
VMA Improvement | | 4740.001 MBps (396.4%)
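The VMA improvement figure is the difference between the two bandwidth values, expressed both absolutely and as a percentage of the non-VMA baseline:

```shell
# VMA improvement = (VMA bandwidth - baseline bandwidth), also as a percentage
awk 'BEGIN {
    base = 1195.759; vma = 5935.760
    printf "%.3f MBps (%.1f%%)\n", vma - base, (vma - base) / base * 100
}'
# prints: 4740.001 MBps (396.4%)
```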


You can use additional sockperf subcommands:

Usage: sockperf <subcommand> [options] [args]

  • To display help for a specific subcommand, use:


    sockperf <subcommand> --help

  • To display the program version number, use:


    sockperf --version

Option | Description | For help, use
------ | ----------- | -------------
help (h, ?) | Display a list of supported commands. | N/A
under-load (ul) | Run sockperf client for latency under load test. | # sockperf ul -h
ping-pong (pp) | Run sockperf client for latency test in ping-pong mode. | # sockperf pp -h
playback (pb) | Run sockperf client for latency test using playback of predefined traffic, based on timeline and message size. | # sockperf pb -h
throughput (tp) | Run sockperf client for one-way throughput test. | # sockperf tp -h
server (sr) | Run sockperf as a server. | # sockperf sr -h

For additional information, see https://github.com/Mellanox/sockperf.

Additional Options

The following tables describe additional sockperf options, and their possible values.

Client Options

Short Command | Full Command | Description
------------- | ------------ | -----------
-h, -? | --help, --usage | Show the help message and exit.
N/A | --tcp | Use TCP protocol (default UDP).
-i | --ip | Listen on/send to IP <ip>.
-p | --port | Listen on/connect to port <port> (default 11111).
-f | --file | Read multiple ip+port combinations from file <file> (uses IO muxer '-F').
-F | --iomux-type | Type of multiple file descriptors handle [s|select|p|poll|e|epoll|r|recvfrom|x|socketxtreme] (default epoll).
N/A | --timeout | Set select/poll/epoll timeout to <msec>, or -1 for infinite (default 10 msec).
-a | --activity | Measure activity by printing a '.' for the last <N> messages processed.
-A | --Activity | Measure activity by printing the duration for the last <N> messages processed.
N/A | --tcp-avoid-nodelay | Stop/start delivering TCP messages immediately (enable/disable Nagle's algorithm). The default is Nagle disabled, except in throughput tests, where the default is Nagle enabled.
N/A | --tcp-skip-blocking-send | Enables non-blocking send operation (default OFF).
N/A | --tos | Allows setting TOS.
N/A | --mc-rx-if | IP address of the interface on which to receive multicast packets (can be different from the routing table).
N/A | --mc-tx-if | IP address of the interface on which to transmit multicast packets (can be different from the routing table).
N/A | --mc-loopback-enable | Enable MC loopback (default disabled).
N/A | --mc-ttl | Limit the lifetime of the message (default 2).
N/A | --mc-source-filter | Set the address <ip, hostname> of the multicast source from which receiving is allowed.
N/A | --uc-reuseaddr | Enables unicast reuse address (default disabled).
N/A | --lls | Turn on LLS via socket option (value = usec to poll).
N/A | --buffer-size | Set total socket receive/send buffer <size> in bytes (system default by default).
N/A | --nonblocked | Open non-blocked sockets.
N/A | --recv_looping_num | Set sockperf to loop over recvfrom() until EAGAIN or <N> good received packets, -1 for infinite; must be used with --nonblocked (default 1).
N/A | --dontwarmup | Do not send warm-up packets on start.
N/A | --pre-warmup-wait | Time to wait before sending warm-up packets (seconds).
N/A | --vmazcopyread | If possible, use VMA's zero-copy reads API (see the VMA readme).
N/A | --daemonize | Run as daemon.
N/A | --no-rdtsc | Do not use the TSC register when measuring time; use the monotonic clock instead.
N/A | --load-vma | Load VMA dynamically even when LD_PRELOAD was not used.
N/A | --rate-limit | Use rate limit (packet pacing). When used with VMA, it must be run with VMA_RING_ALLOCATION_LOGIC_TX mode.
N/A | --set-sock-accl | Set socket acceleration before running VMA (available on some NVIDIA® systems).
-d | --debug | Print extra debug information.


Server Options

Short Command | Full Command | Description
------------- | ------------ | -----------
N/A | --threads-num | Run <N> threads on server side (requires the '-f' option).
N/A | --cpu-affinity | Set threads affinity to the given core IDs, in list format (see: cat /proc/cpuinfo).
N/A | --vmarxfiltercb | If possible, use VMA's receive-path packet filter callback API (see the VMA readme).
N/A | --force-unicast-reply | Force the server to reply via unicast.
N/A | --dont-reply | Set the server not to reply to client messages.
-m | --msg-size | Set the maximum message size that the server can receive to <size> bytes (default 65507).
-g | --gap-detection | Enable gap detection.

Sending Bursts

Use the "-b (--burst=<size>)" option to control the number of messages sent by the client in every burst.

SocketXtreme

sockperf v3.2 and above supports VMA's socketXtreme polling mode.

To support socketXtreme, sockperf must be configured with the --enable-vma-api parameter and compiled against the compatible vma_extra.h file.

A new iomux type, -x / --socketxtreme, then becomes available:

Short Command | Full Command | Description
------------- | ------------ | -----------
-F | --iomux-type | Type of multiple file descriptors handle [s|select|p|poll|e|epoll|r|recvfrom|x|socketxtreme] (default epoll).

Warning

SocketXtreme must also be enabled for VMA. For further information, refer to Installing VMA with SocketXtreme.

In order to use socketXtreme, VMA must also be compiled with the --enable-socketxtreme parameter.

socketXtreme requires the client side to bind to a specific IP address. Hence, when running a UDP client with socketXtreme, the --client_ip option is mandatory:


--client_ip    Force the client side to bind to a specific IP address (default = 0).


Use "-d (--debug)" to print extra debug information without affecting the results of the test. The debug information is printed only before or after the fast path.

  1. If the following error is received:


    sockperf error: sockperf: No messages were received from the server. Is the server down?

    Perform troubleshooting as follows:

    • Make sure that exactly one server is running

    • Check the connection between the client and server

    • Check the routing table entries for the multicast/unicast group

    • Extend test duration (use the "--time" command line switch)

    • If you used extreme values for the --mps and/or --reply-every switches, try other values or the default values

  2. If the following error is received, it means that sockperf is being compiled against a VMA version without socketXtreme support:


    In file included from src/Client.cpp:32:0:
    src/IoHandlers.h: In member function 'int IoSocketxtreme::waitArrival()':
    src/IoHandlers.h:421:71: error: 'VMA_SOCKETXTREME_PACKET' was not declared in this scope
         if (m_rings_vma_comps_map_itr->second->vma_comp_list[i].events & VMA_SOCKETXTREME_PACKET){
                                                                          ^
    src/IoHandlers.h:422:18: error: 'struct vma_api_t' has no member named 'socketxtreme_free_vma_packets'
         g_vma_api->socketxtreme_free_vma_packets(&m_rings_vma_comps_map_itr->second->vma_comp_list[i].packet, 1);

    There are two ways to solve this:

  • Configure sockperf with the --disable-vma-api parameter;
    or

  • Use VMA 8.5.1 or above.

© Copyright 2023, NVIDIA. Last updated on Nov 3, 2023.