General Performance Optimization and Tuning

To achieve the best performance on Windows, you may need to modify some Windows registry entries.

IND2QueuePairsPool

The interface is an extension to the Network Direct SPI version 2. It reduces the creation time of the IND2QueuePair and IND2CompletionQueue interfaces, and therefore improves the client-server connection establishment time.

The interface exposes a pool of pre-allocated IND2QueuePair and IND2CompletionQueue interfaces associated with it. Pre-allocation is done using a background thread when a pre-configured threshold is reached.

The API for this interface is documented in the SDK header file ndspi_ext_mlx.h.

Using IND2QueuePairsPool:

  1. Create a pool by calling IND2Adapter::QueryInterface with IID_IND2QueuePairsPool.

  2. Set pool configuration using the SetQueuePairParams and SetCompletionQueueParams methods.

  3. Set background creation thresholds using the SetLimits method.

  4. Fill the pool using the Fill method.

  5. Create IND2QueuePair objects and their associated IND2CompletionQueue objects using the CreateObjects method.

  • Statistics about the utilization of the resource pool are available, so that the programmer can select optimal thresholds.
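The pre-allocation behavior described above can be illustrated with a small conceptual model. The following Python sketch is not the Network Direct API: the class name PreallocatedPool, its methods, and the watermark parameters are hypothetical and only mimic the idea of a pool that a background thread refills once it drops below a threshold. The real methods and their signatures are those documented in ndspi_ext_mlx.h.

```python
import threading
from collections import deque


class PreallocatedPool:
    """Conceptual model of a pool refilled in the background (hypothetical API)."""

    def __init__(self, factory, low_watermark, high_watermark):
        self._factory = factory        # creates one (expensive) object
        self._low = low_watermark      # a refill starts below this level
        self._high = high_watermark    # a refill stops at this level
        self._items = deque()
        self._lock = threading.Lock()
        self._refill_thread = None
        self.hits = 0                  # requests served from the pool
        self.misses = 0                # requests that had to create on demand

    def size(self):
        with self._lock:
            return len(self._items)

    def fill(self):
        """Create objects until the pool holds `high_watermark` of them."""
        while self.size() < self._high:
            item = self._factory()     # expensive work happens outside the lock
            with self._lock:
                if len(self._items) >= self._high:
                    return             # another filler got there first
                self._items.append(item)

    def acquire(self):
        """Take an object from the pool; create one only if the pool is empty."""
        with self._lock:
            if self._items:
                item = self._items.popleft()
                self.hits += 1
            else:
                item = None
                self.misses += 1
            refill_running = (self._refill_thread is not None
                              and self._refill_thread.is_alive())
            if len(self._items) < self._low and not refill_running:
                self._refill_thread = threading.Thread(target=self.fill)
                self._refill_thread.start()
        return item if item is not None else self._factory()

    def wait_for_refill(self):
        """Block until a background refill, if any, has finished."""
        thread = self._refill_thread
        if thread is not None:
            thread.join()
```

In this model, the hits and misses counters play the role of the utilization statistics: a high miss count suggests that the thresholds are too low for the connection rate.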

Nd2AdapterControlSetCqInterruptModeration

The method is an extension to the second version of the Network Direct SPI. It controls the number of events received per completion, which reduces the number of interrupts and thereby improves performance.

The method allows the user to control the number of completions that trigger an event, and the amount of time that may pass after a completion before an event is sent when the completion count limit has not yet been reached.
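The interaction between the two limits can be sketched with a small simulation. This is a conceptual Python model only, not driver code: the function name moderated_events is hypothetical, and the assumption that the timer starts at the first pending completion is an interpretation of the description above.

```python
def moderated_events(completion_times, moderation_count, moderation_interval):
    """Return the times at which an event fires for the given completions.

    Assumed model: an event fires as soon as `moderation_count` completions
    are pending, or `moderation_interval` time units after the first pending
    completion, whichever happens first.
    """
    events = []
    pending = 0
    deadline = None
    for t in sorted(completion_times):
        # The timer expired before this completion arrived.
        if pending and t >= deadline:
            events.append(deadline)
            pending = 0
        # The first completion of a new batch starts the timer.
        if pending == 0:
            deadline = t + moderation_interval
        pending += 1
        # The count limit was reached before the timer expired.
        if pending >= moderation_count:
            events.append(t)
            pending = 0
    # Completions still pending at the end are reported when the timer expires.
    if pending:
        events.append(deadline)
    return events
```

For example, moderated_events([0, 1, 2, 3, 10], 4, 5) reports two events instead of five: one when the fourth completion arrives, and one when the timer expires for the last, isolated completion.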

The API for this interface is documented in the ndspi_ext_mlx.h SDK header file.

Using Nd2AdapterControlSetCqInterruptModeration

The usage of the Nd2AdapterControlSetCqInterruptModeration is similar to the usage of the function NDK_FN_CONTROL_CQ_INTERRUPT_MODERATION in MSDN NDK SPI. For more information, see: https://msdn.microsoft.com/en-us/library/windows/hardware/jj552973(v=vs.85).aspx

The ModerationInterval will always be rounded down to its limit, thus the ModerationCount will never solely control the interrupt moderation on the CQ.

The registry entries that may be added/changed by this “General Tuning” procedure are:

Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:

  • Disable the TCP selective acknowledgment (SACK) option for better CPU utilization:

    SackOpts, type REG_DWORD, value set to 0.

Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:

  • Enable fast datagram sending for UDP traffic:

    FastSendDatagramThreshold, type REG_DWORD, value set to 64K.

Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ndis\Parameters:

  • Set RSS parameters:

    RssBaseCpu, type REG_DWORD, value set to 1.
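As an illustration, the three entries above could be written with the standard Windows reg add command. The Python sketch below only builds the command lines and does not modify the registry. It assumes that "64K" means 65536, and the helper name reg_add_command is hypothetical.

```python
TCPIP = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
AFD = r"HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters"
NDIS = r"HKLM\SYSTEM\CurrentControlSet\Services\Ndis\Parameters"

# (key, value name, DWORD data) for each entry listed above.
GENERAL_TUNING = [
    (TCPIP, "SackOpts", 0),
    (AFD, "FastSendDatagramThreshold", 64 * 1024),  # assumes "64K" means 65536
    (NDIS, "RssBaseCpu", 1),
]


def reg_add_command(key, name, data):
    """Build a `reg add` command line that writes one REG_DWORD value."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d {data} /f'


if __name__ == "__main__":
    for entry in GENERAL_TUNING:
        print(reg_add_command(*entry))
```

The printed commands would need to be run from an elevated command prompt. Exporting the keys beforehand makes it possible to restore the previous values.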

Enabling Receive Side Scaling (RSS) is performed by means of the following command:

netsh int tcp set global rss=enabled

The IPoIB Network Adapter tuning can be performed either during installation, by modifying Windows registry entries as explained in Registry Tuning above, or manually after installation.

To improve the network adapter performance, activate the performance tuning tool as follows:

  1. Start the "Device Manager" (open a command line window and enter: devmgmt.msc).

  2. Open "Network Adapters".

  3. Select the Mellanox IPoIB adapter, right-click it, and select Properties.

  4. Select the “Performance” tab.

  5. Choose one of the tuning scenarios:
    • Single port traffic - Improves performance for running single port traffic each time.
    • Dual port traffic - Improves performance for running traffic on both ports simultaneously.
    • Forwarding traffic - Improves performance for running scenarios that involve both ports (for example: via IXIA).
    • Multicast traffic - Improves performance when the main traffic runs on multicast.

  6. Click the “Run Tuning” button.
    Clicking the “Run Tuning” button changes several registry entries (described below), and checks for system services that may decrease network performance. It also generates a log including the applied changes.
    Users can view this log to restore the previous values. The log path is:

    %HOMEDRIVE%\Windows\System32\LogFiles\PerformanceTunning.log

    This tuning needs to be performed only once after the installation is completed, and on one adapter only (as long as these entries are not changed directly in the registry, or by some other installation or script).

Warning

A reboot may be required for the changes to take effect.

The Ethernet Network Adapter general tuning can be performed during installation, by modifying Windows registry entries as explained in Registry Tuning above. Tuning for specific scenarios can be set manually after installation.

To improve the network adapter performance, activate the performance tuning tool as follows:

  1. Start the "Device Manager" (open a command line window and enter: devmgmt.msc).

  2. Open "Network Adapters".

  3. Select the Mellanox Ethernet adapter, right-click it, and select Properties.

  4. Select the "Performance" tab.

  5. Choose one of the tuning scenarios:
    • Single port traffic - Improves performance for running single port traffic each time.
    • Single stream traffic - Optimizes tuning for applications with single connection.
    • Dual port traffic - Improves performance for running traffic on both ports simultaneously.
    • Forwarding traffic - Improves performance for running scenarios that involve both ports (for example: via IXIA).
    • Multicast traffic - Improves performance when the main traffic runs on multicast.

  6. Click the “Run Tuning” button.

    Clicking the "Run Tuning" button activates the general tuning as explained above, and changes several driver registry entries for the current adapter and for its sibling device if the sibling is also an Ethernet device. It also generates a log including the applied changes.
    Users can view this log to restore the previous values. The log path is:

    %HOMEDRIVE%\Windows\System32\LogFiles\PerformanceTunning.log

    This tuning needs to be performed only once after the installation is completed, and on one adapter only (as long as these entries are not changed directly in the registry, or by some other installation or script).

    Warning

    Please note that a reboot may be required for the changes to take effect.

You can also activate performance tuning with the perf_tuning.exe tool. The tool has 4 options: the 3 scenarios described above, and an additional manual tuning option that lets you set the RSS base processor and the number of RSS processors for each Ethernet adapter. Specify the adapters to tune by their names as they appear in “Network Connections”.

Synopsis

perf_tuning.exe -s -c1 <first connection name> [-c2 <second connection name>]
perf_tuning.exe -d -c1 <first connection name> -c2 <second connection name>
perf_tuning.exe -f -c1 <first connection name> -c2 <second connection name>
perf_tuning.exe -m -c1 <first connection name> -b <base RSS processor number> -n <number of RSS processors>
perf_tuning.exe -st -c1 <first connection name> [-c2 <second connection name>]

Performance Tuning Tool Application Options

Flag

Description

-s

Single port traffic scenario.
This option can be followed by one or two connection names. The tuning is performed on the first connection, and the default settings are restored on the second connection.
This option automatically sets:

  • SendCompletionMethod = 0

  • RecvCompletionMethod = 2

  • *ReceiveBuffers = 1024

  • In operating systems that support NDIS 6.3:
    RssProfile = 4

Additionally, this option chooses the best processors to assign to:

  • DefaultRecvRingProcessor

  • TxInterruptProcessor

  • TxForwardingProcessor

  • In operating systems that support NDIS 6.2:
    RssBaseProcNumber
    MaxRssProcessors

  • In operating systems that support NDIS 6.3:
    NumRSSQueues
    RssMaxProcNumber

-d

Dual port traffic scenario.
This option must be followed by two connection names. In this case, the tuning of the two connections is interdependent.
This option automatically sets:

  • SendCompletionMethod = 0

  • RecvCompletionMethod = 2

  • *ReceiveBuffers = 1024

  • In operating systems that support NDIS 6.3:
    RssProfile = 4

Additionally, this option chooses the best processors to assign to:

  • DefaultRecvRingProcessor

  • TxForwardingProcessor

  • In operating systems that support NDIS 6.2:
    RssBaseProcNumber
    MaxRssProcessors

  • In operating systems that support NDIS 6.3:
    NumRSSQueues
    RssMaxProcNumber

-f

Forwarding traffic scenario.
This option must be followed by two connection names. In this case, the tuning of the two connections is interdependent.
This option automatically sets:

  • SendCompletionMethod = 1

  • RecvCompletionMethod = 0

  • *ReceiveBuffers = 4096

  • UseRSSForRawIP = 0

  • UseRSSForUDP = 0

Additionally, this option chooses the best processors to assign to:

  • DefaultRecvRingProcessor

  • TxInterruptProcessor

  • TxForwardingProcessor

  • In operating systems that support NDIS 6.2:
    RssBaseProcNumber
    MaxRssProcessors

  • In operating systems that support NDIS 6.3:
    NumRSSQueues
    RssMaxProcNumber

-m

Manual configuration
This option must be followed by one connection name.
This option assigns the provided base and number of CPUs to:

  • *RssBaseProcNumber

  • *MaxRssProcessors

Additionally, this option assigns the following with processors inside the range:

  • DefaultRecvRingProcessor

  • TxInterruptProcessor

-r

Restore default settings.
This option can be followed by one or two connection names.
This option automatically sets the driver registry values back to their default values:

  • SendCompletionMethod = 0 (IPoIB) or 1 (Ethernet)

  • RecvCompletionMethod = 2

  • *ReceiveBuffers = 1024

  • UseRSSForRawIP = 1

  • DefaultRecvRingProcessor = -1

  • TxInterruptProcessor = -1

  • TxForwardingProcessor = -1

  • UseRSSForUDP = 1

  • In operating systems that support NDIS 6.2:
    MaxRssProcessors = 8

  • In operating systems that support NDIS 6.3:
    NumRSSQueues = 8

-c1

Specifies first connection name. See examples.

-c2

Specifies second connection name. See examples.

-b

Specifies base RSS processor number. See examples.
Used for manual option (-m) only.

-n

Specifies number of RSS processors. See examples.
Used for manual option (-m) only.

-st

Single stream traffic scenario.
This option must be followed by one or two connection names for an Ethernet adapter. The tuning is performed on the first connection, and the default settings are restored on the second connection.
This option automatically sets:

  • SendCompletionMethod = 0

  • RecvCompletionMethod = 2

  • *ReceiveBuffers = 1024

  • In operating systems that support NDIS 6.3:
    RssProfile = 4

Additionally, this option chooses the best processors to assign to:

  • DefaultRecvRingProcessor

  • TxInterruptProcessor

  • TxForwardingProcessor

  • In operating systems that support NDIS 6.2:
    RssBaseProcNumber
    MaxRssProcessors

  • In operating systems that support NDIS 6.3:
    NumRSSQueues
    RssMaxProcNumber

Examples

For example, if the adapter is represented by "Local Area Connection 6" and "Local Area Connection 7":

For single port stream tuning, type:
perf_tuning.exe -s -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"
or, to set one adapter only:
perf_tuning.exe -s -c1 "Local Area Connection 6"

For single stream tuning, type:
perf_tuning.exe -st -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"
or, to set one adapter only:
perf_tuning.exe -st -c1 "Local Area Connection 6"

For dual port streams tuning, type:
perf_tuning.exe -d -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"

For forwarding streams tuning, type:
perf_tuning.exe -f -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"

For manual tuning of the first adapter to use RSS on CPUs 0-3, type:
perf_tuning.exe -m -c1 "Local Area Connection 6" -b 0 -n 4

To restore defaults, type:
perf_tuning.exe -r -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"
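When the tool is driven from automation, it can help to check the flag combinations before invoking it. The following Python sketch is one possible wrapper, under the assumption that perf_tuning.exe is on the PATH. The helper name perf_tuning_args is hypothetical, and the connection-count rules are transcribed from the options table above.

```python
import subprocess

# Scenario flags and how many connection names each accepts (min, max),
# transcribed from the options table above.
SCENARIOS = {
    "-s": (1, 2),
    "-st": (1, 2),
    "-d": (2, 2),
    "-f": (2, 2),
    "-m": (1, 1),
    "-r": (1, 2),
}


def perf_tuning_args(flag, connections, base_cpu=None, num_cpus=None):
    """Return the argument list for one perf_tuning.exe invocation."""
    if flag not in SCENARIOS:
        raise ValueError(f"unknown scenario flag: {flag}")
    low, high = SCENARIOS[flag]
    if not low <= len(connections) <= high:
        raise ValueError(f"{flag} takes {low} to {high} connection name(s)")
    args = ["perf_tuning.exe", flag, "-c1", connections[0]]
    if len(connections) == 2:
        args += ["-c2", connections[1]]
    if flag == "-m":
        if base_cpu is None or num_cpus is None:
            raise ValueError("-m requires a base RSS processor and a count")
        args += ["-b", str(base_cpu), "-n", str(num_cpus)]
    elif base_cpu is not None or num_cpus is not None:
        raise ValueError("-b and -n are used with -m only")
    return args


if __name__ == "__main__":
    args = perf_tuning_args("-m", ["Local Area Connection 6"],
                            base_cpu=0, num_cpus=4)
    # list2cmdline shows how the list is quoted on a Windows command line.
    print(subprocess.list2cmdline(args))
    # To run it on a Windows host: subprocess.run(args, check=True)
```

Passing the arguments as a list avoids manual quoting of connection names that contain spaces.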

To achieve the best performance on an SR-IOV VF, run one of the following PowerShell commands on the host:

Set-VMNetworkAdapter -Name "Network Adapter" -VMName vm1 -IovQueuePairsRequested 4
OR, for 40GbE:
Set-VMNetworkAdapter -Name "Network Adapter" -VMName vm1 -IovQueuePairsRequested 8
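If several virtual machines need the same setting, the command line can be generated per VM. The Python sketch below only builds the PowerShell command text. The helper name iov_queue_pairs_command and the VM names are hypothetical, and reading the text above as "8 queue pairs for 40GbE, 4 otherwise" is an assumption.

```python
def iov_queue_pairs_command(vm_name, forty_gbe=False,
                            adapter_name="Network Adapter"):
    """Build the Set-VMNetworkAdapter command line for one VM network adapter."""
    # Assumption: 8 queue pairs for 40GbE links, 4 otherwise.
    queue_pairs = 8 if forty_gbe else 4
    return (
        f'Set-VMNetworkAdapter -Name "{adapter_name}" -VMName "{vm_name}" '
        f"-IovQueuePairsRequested {queue_pairs}"
    )


if __name__ == "__main__":
    for vm in ["vm1", "vm2"]:  # hypothetical VM names
        print(iov_queue_pairs_command(vm, forty_gbe=True))
```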

To improve the performance of live migration over SMB Direct, set the following registry key to 0 and reboot the machine:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters\RequireSecuritySignature

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.