NVIDIA WinOF-2 Documentation v23.7

Fabric Performance Utilities

The performance utilities described in this chapter are intended to be used as a performance micro-benchmark. They support both InfiniBand and RoCE.

Warning

For further information on the following tools, refer to each tool's help text by running the tool with the --help command-line parameter.

Warning

The performance utilities described in the table below will be deprecated as of the next release.

Utility

Description

nd_write_bw

This test measures the performance of RDMA-Write requests on Microsoft Windows operating systems. nd_write_bw targets maximum RDMA-Write throughput and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_write_bw runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.

nd_write_lat

This test measures the performance of RDMA-Write requests on Microsoft Windows operating systems. nd_write_lat targets minimum RDMA-Write latency and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_write_lat runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.

nd_read_bw

This test measures the performance of RDMA-Read requests on Microsoft Windows operating systems. nd_read_bw targets maximum RDMA-Read throughput and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_read_bw runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.

nd_read_lat

This test measures the performance of RDMA-Read requests on Microsoft Windows operating systems. nd_read_lat targets minimum RDMA-Read latency and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_read_lat runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.

nd_send_bw

This test measures the performance of Send requests on Microsoft Windows operating systems. nd_send_bw targets maximum Send throughput and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_send_bw runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.

nd_send_lat

This test measures the performance of Send requests on Microsoft Windows operating systems. nd_send_lat targets minimum Send latency and runs over Microsoft's NetworkDirect standard. The test is highly customizable: the user may set the message size and either the number of iterations or the test duration. nd_send_lat runs with all power-of-2 message sizes from 1B to 4MB, with message inlining and CQ moderation.
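The message-size range quoted in the descriptions above (1B to 4MB, powers of 2) can be enumerated with a few lines of Python, for example to drive a scripted sweep. This is only an illustrative sketch: the function name is hypothetical and not part of any tool, and it assumes that 4MB means 4 * 1024 * 1024 bytes.

```python
def power_of_two_sizes(min_bytes=1, max_bytes=4 * 1024 * 1024):
    """Return message sizes from min_bytes up to max_bytes, doubling each step.

    Assumption: "4MB" is read as 4 * 1024 * 1024 bytes.
    """
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


sizes = power_of_two_sizes()
print(len(sizes), sizes[0], sizes[-1])  # 23 1 4194304
```

Under that assumption the sweep covers 23 message sizes.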

MlxNdPerf is a new tool that replaces the older Network Direct applications shipped with earlier drivers (e.g., nd_write_bw, nd_read_bw, nd_send_bw, nd_*_lat). The tool is used to determine the maximum performance achievable with various parameters, and the RDMA Read\Write\Send performance currently available between two endpoints.

The tool accepts the following command-line parameters:

Client or Server Role

The role of Client or Server determines if this side is an RDMA requestor or responder (Client → Requestor, Server → Responder).

Usage

MlxNdPerf.exe -Server\-Client


RDMA Operation

Determines the RDMA operation to be performed. Only one operation can be selected at a time.

Usage

MlxNdPerf -Read\-Write\-Send


Source/Destination IP

Determines the source IP (the local IP) and the destination IP (the remote IP).

Usage

MlxNdPerf -SrcIP\ -DestIP


Number of Threads

Determines the number of threads to be executed. Each thread uses a single QP.

Usage

MlxNdPerf -NumOfThreads


Port Number

Determines the port number used.

Usage

MlxNdPerf -PortNumber


Number of Scatter Gather Entries

Determines the number of scatter gather entries per posted Send\Write\Read.

Usage

MlxNdPerf -SgeNumber


Buffer Size

Determines the number of bytes to be transmitted by a single posted Send\Write\Read.

Usage

MlxNdPerf -BufferSize


Queue Depth

Determines the number of entries in the QP and the CQ.

Usage

MlxNdPerf -QueueDepth


Number of Iterations

Determines the number of Send\Write\Read iterations to post. This parameter is ignored in Duration mode.

Usage

MlxNdPerf -Iterations


Duration Mode

Determines how long the test runs, in seconds.

Usage

MlxNdPerf -Duration
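The interaction between the two stop conditions (the iteration count is ignored in Duration mode) can be summarized in a small Python sketch. The function name and return format are hypothetical, and what the tool does when neither parameter is given is not stated here, so that case is reported as "default" without guessing a value.

```python
def stop_condition(iterations=None, duration_seconds=None):
    """Return which stop condition would apply, as a (kind, value) tuple.

    Mirrors the rule in the text: when a duration is given, the
    iteration count is ignored.
    """
    if duration_seconds is not None:
        return ("duration", duration_seconds)
    if iterations is not None:
        return ("iterations", iterations)
    # Neither given: the tool's built-in default applies (value not documented here).
    return ("default", None)


print(stop_condition(iterations=1000, duration_seconds=30))  # ('duration', 30)
print(stop_condition(iterations=1000))                       # ('iterations', 1000)
```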


Event Notification Mode

Uses event notification mode for the CQ instead of polling the CQ.

Usage

MlxNdPerf -UseEvents


Resiliency

Registers for the adapter's status-change callbacks and listens for any adapter status changes. In this mode, the application does not exit until the test completes successfully.

Note: This mode is not available on the Server side when in Send mode.

Usage

MlxNdPerf -Resilient
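A wrapper script could reject the one combination the note above rules out before launching the tool. The following Python sketch is illustrative only; the function name is hypothetical, and the role and operation strings simply mirror the parameter names without the leading dash.

```python
def resilient_allowed(role, operation):
    """Return False for the combination the note excludes: Server side in Send mode."""
    return not (role == "Server" and operation == "Send")


for role in ("Server", "Client"):
    for operation in ("Read", "Write", "Send"):
        status = "ok" if resilient_allowed(role, operation) else "not available"
        print(f"-{role} -{operation} -Resilient: {status}")
```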


Latency

Latency can be measured using the "-Latency" parameter. This parameter should be added to one of the operations: Write, Read or Send.

Usage

MlxNdPerf -Read\-Write\-Send -Latency

The following are a few limitations related to the latency tests:

  • Parameters '-Latency' and '-BufferSize' must be specified on both sides (Client and Server).

  • Parameters '-Resilient', '-NumOfThreads', '-UseEvents' and '-QueueDepth' are not supported with latency tests.
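The two limitations above lend themselves to a pre-flight check in a wrapper script. The following is a minimal Python sketch, not part of MlxNdPerf: the helper names are hypothetical, it assumes both argument lists are meant for a latency test, and it matches flags case-insensitively, which is an assumption about the tool.

```python
# Flags listed above as unsupported with latency tests.
LATENCY_UNSUPPORTED = ("-Resilient", "-NumOfThreads", "-UseEvents", "-QueueDepth")
# Flags that must be given on both sides for latency tests.
REQUIRED_ON_BOTH_SIDES = ("-Latency", "-BufferSize")


def has_flag(args, flag):
    """Case-insensitive flag lookup (assumption: the tool ignores flag case)."""
    return flag.lower() in (a.lower() for a in args)


def check_latency_args(server_args, client_args):
    """Return a list of problems; an empty list means neither rule is violated."""
    problems = []
    for side, args in (("server", server_args), ("client", client_args)):
        for flag in REQUIRED_ON_BOTH_SIDES:
            if not has_flag(args, flag):
                problems.append(f"{side}: {flag} is missing")
        for flag in LATENCY_UNSUPPORTED:
            if has_flag(args, flag):
                problems.append(f"{side}: {flag} is not supported with -Latency")
    return problems


server = "-Server -Write -Latency -BufferSize 8".split()
client = "-Client -Write -Latency -BufferSize 8 -UseEvents".split()
print(check_latency_args(server, client))
# ['client: -UseEvents is not supported with -Latency']
```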

Verbose

Enables extra information prints.

Usage

MlxNdPerf -Verbose

Example 1: Measure the bandwidth of the IB Read operation with traffic from 2 threads, running for 30 seconds

  • Server side: MlxNdPerf.exe -Server -Read -SrcIp 11.137.58.1 -DestIp 11.137.58.1 -NumOfThreads 2 -Duration 30

  • Client side: MlxNdPerf.exe -Client -Read -SrcIp 11.137.57.1 -DestIp 11.137.58.1 -NumOfThreads 2 -Duration 30
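When the same test is launched repeatedly from a script, the two command lines above can be assembled programmatically. The Python sketch below is one possible way to do that; the function name is hypothetical, it only builds the argument list (it does not run the tool), and it uses only the parameters that appear in this example.

```python
def build_mlxndperf_args(role, operation, src_ip, dest_ip, threads=None, duration=None):
    """Build an MlxNdPerf argument list from the parameters used in Example 1."""
    if role not in ("Server", "Client"):
        raise ValueError("role must be 'Server' or 'Client'")
    if operation not in ("Read", "Write", "Send"):
        raise ValueError("operation must be 'Read', 'Write' or 'Send'")
    args = ["MlxNdPerf.exe", f"-{role}", f"-{operation}",
            "-SrcIp", src_ip, "-DestIp", dest_ip]
    if threads is not None:
        args += ["-NumOfThreads", str(threads)]
    if duration is not None:
        args += ["-Duration", str(duration)]
    return args


server_cmd = build_mlxndperf_args("Server", "Read", "11.137.58.1", "11.137.58.1",
                                  threads=2, duration=30)
client_cmd = build_mlxndperf_args("Client", "Read", "11.137.57.1", "11.137.58.1",
                                  threads=2, duration=30)
print(" ".join(server_cmd))
print(" ".join(client_cmd))
```

The returned list can be passed as-is to subprocess.run on a machine where the tool is installed.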

Example 2: Measure the latency of the IB Write operation with IPv6 addresses (if present)

  • Server side: MlxNdPerf.exe -Server -Write -SrcIp fe80::ee0d:9aff:fe42:e8e8 -DestIp fe80::ee0d:9aff:fe42:e8e8 -Latency -BufferSize 8

  • Client side: MlxNdPerf.exe -Client -Write -SrcIp fe80::ee0d:9aff:fe42:e8e4 -DestIp fe80::ee0d:9aff:fe42:e8e8 -Latency -BufferSize 8
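Before launching a test with IPv6 addresses such as those above, a script may want to confirm that both addresses parse and belong to the same IP family. The sketch below uses Python's standard ipaddress module; the helper name is hypothetical, and the check is a convenience for scripts rather than something the tool is documented to require.

```python
import ipaddress


def same_ip_family(src_ip, dest_ip):
    """Return True if both strings parse as IP addresses of the same version."""
    src = ipaddress.ip_address(src_ip)    # raises ValueError on malformed input
    dest = ipaddress.ip_address(dest_ip)
    return src.version == dest.version


src = ipaddress.ip_address("fe80::ee0d:9aff:fe42:e8e4")
print(src.version, src.is_link_local)  # 6 True
print(same_ip_family("fe80::ee0d:9aff:fe42:e8e4", "fe80::ee0d:9aff:fe42:e8e8"))  # True
print(same_ip_family("11.137.57.1", "fe80::ee0d:9aff:fe42:e8e8"))  # False
```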

The purpose of the nd_rping test is to check interoperability between Linux and Windows via an RDMA ping. The Windows nd_rping was ported from Linux's RDMACM example: rping.c

  • Windows

    • To use the built-in nd_rping.exe tool, go to: C:\Program Files\Mellanox\MLNX_WinOF2\Performance Tools

    • To build nd_rping.exe from scratch, use the SDK example: choose the machine's OS in the configuration manager of the solution, and build nd_rping.exe.

  • Linux

    • Installing the MLNX_OFED on a Linux server will also provide the "rping" application.
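For the Linux side, the Python sketch below assembles an rping command line. It relies on the standard rping options from librdmacm (-s server, -c client, -a address, -p port, -C message count, -v verbose). The function name is hypothetical, the address is a placeholder, and the command-line options of the Windows nd_rping.exe are not covered here.

```python
def build_rping_args(server, address, port=None, count=None, verbose=False):
    """Build an argument list for the Linux rping utility.

    server=True selects server mode (-s); otherwise client mode (-c).
    """
    args = ["rping", "-s" if server else "-c", "-a", address]
    if port is not None:
        args += ["-p", str(port)]
    if count is not None:
        args += ["-C", str(count)]
    if verbose:
        args.append("-v")
    return args


# Placeholder address; substitute the RDMA interface address of the server.
print(" ".join(build_rping_args(True, "11.137.58.1", verbose=True)))
print(" ".join(build_rping_args(False, "11.137.58.1", count=10, verbose=True)))
```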

© Copyright 2023, NVIDIA. Last updated on Nov 1, 2023.