NVIDIA WinOF-2 Documentation v24.1.50000

Performance Tuning

This section describes how to modify Windows registry parameters in order to improve performance.

Warning

Modifying the registry incorrectly might lead to serious problems, including loss of data and system hangs, and you may need to reinstall Windows. It is therefore recommended to back up the registry on your system before implementing the recommendations in this section. If the modifications you apply lead to serious problems, you will then be able to restore the original registry state. For more details about backing up and restoring the registry, please visit www.microsoft.com.

To achieve the best performance on Windows, you may need to modify some Windows registry entries.

Registry Tuning

The following registry entries may be added or changed by this "General Tuning" procedure:

  • Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:

    • Disable TCP selective acks option for better CPU utilization:

      Registry Key    Type         Value
      SackOpts        REG_DWORD    0

  • Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:

    • Enable fast datagram sending for UDP traffic:

      Registry Key                 Type         Value
      FastSendDatagramThreshold    REG_DWORD    64K

  • Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ndis\Parameters:

    • Set RSS parameters:

      Registry Key    Type         Value
      RssBaseCpu      REG_DWORD    1
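The three entries above can also be collected into a single .reg file rather than edited by hand. The following is a minimal sketch, not an NVIDIA-provided tool: the `TUNING` dictionary and the `render_reg_file` helper are hypothetical names, and the value "64K" is assumed to mean 65536 (0x00010000). Back up the registry before importing any generated file.

```python
# Sketch: render the "General Tuning" entries as .reg file text.
# Assumption: "64K" is read as 65536 (0x00010000); confirm before importing.
TUNING = {
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters": {
        "SackOpts": 0,
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters": {
        "FastSendDatagramThreshold": 64 * 1024,
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ndis\Parameters": {
        "RssBaseCpu": 1,
    },
}


def render_reg_file(entries):
    """Return .reg text that sets every listed value as a REG_DWORD."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, values in entries.items():
        lines.append(f"[{key_path}]")
        for name, number in values.items():
            # .reg files encode a REG_DWORD as eight hexadecimal digits.
            lines.append(f'"{name}"=dword:{number:08x}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_reg_file(TUNING))
```

Saving the printed text to a file with a .reg extension and reviewing it before import keeps the change auditable and easy to compare against a registry backup.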

Enable RSS

Enabling Receive Side Scaling (RSS) is performed by running the following command:

netsh int tcp set global rss=enabled


Improving Live Migration

To improve the performance of live migration over SMB Direct, set the following registry key to 0 and reboot the machine:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters\RequireSecuritySignature
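As a sketch of how this setting could be scripted, the snippet below splits the full path into its key and value name and builds an argument list for the standard Windows `reg add` command. The helper name `build_reg_add` is hypothetical, and the REG_DWORD type is an assumption, since the text above gives only the data (0).

```python
# Sketch: split the full registry path into key and value name, then
# build the matching "reg add" argument list.
# Assumption: the value is a REG_DWORD; the text only states its data (0).
FULL_PATH = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services"
    r"\LanmanServer\Parameters\RequireSecuritySignature"
)


def build_reg_add(full_path, data, value_type="REG_DWORD"):
    """Return the argv list for a 'reg add' call that sets one value."""
    # Everything before the last backslash is the key; the rest is the value name.
    key_path, _, value_name = full_path.rpartition("\\")
    return ["reg", "add", key_path, "/v", value_name,
            "/t", value_type, "/d", str(data), "/f"]


if __name__ == "__main__":
    print(" ".join(build_reg_add(FULL_PATH, 0)))
```

On a Windows host, the list could be passed to `subprocess.run(..., check=True)` from an elevated prompt. The `/f` switch overwrites an existing value without prompting, and the reboot described above is still required.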


Ethernet Performance Tuning

The user can configure the Ethernet adapter by setting some registry keys. The registry keys may affect Ethernet performance.

To improve performance, activate the performance tuning tool as follows:

  1. Start the "Device Manager" (open a command line window and enter: devmgmt.msc).

  2. Open "Network Adapters".

  3. Right click the relevant Ethernet adapter and select Properties.

  4. Select the "Advanced" tab.

  5. Modify performance parameters (properties) as desired.

Performance Known Issues

  • On Intel I/OAT supported systems, it is highly recommended to install and enable the latest I/OAT driver (download from www.intel.com).

  • With I/OAT enabled, sending 256-byte messages or larger will activate I/OAT. This will cause a significant latency increase due to I/OAT algorithms. On the other hand, throughput will increase significantly when using I/OAT.

Ethernet Bandwidth Improvements

To improve Ethernet Bandwidth:

  1. Check that you are running on the closest NUMA node.

    1. In PowerShell, run: Get-NetAdapterRss -Name "adapter name"

      [Figure: example Get-NetAdapterRss output]

    2. Validate that the IndirectionTable CPUs are located on the closest NUMA node.
      As illustrated in the figure above, the CPUs are 0:0 to 0:7 (CPUs 0-7) and their distance from the NUMA node is 0 (shown as 0:0/0 to 0:7/0), unlike CPUs 14-27, which are shown with a distance of 32767.

    3. If the CPUs are not close to the NUMA node, change the "RSS Base Processor Number" and "RSS Max Processor Number" settings under the Advanced tab to point to the closest CPUs.

      Warning

      For high performance, it is recommended to work with at least 8 processors.

  2. Check the Ethernet bandwidth by running ntttcp.exe:

    • Server side: ntttcp -r -m 32,*,server_ip

    • Client side: ntttcp -s -m 32,*,server_ip
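The NUMA check in step 1 can also be scripted once the Get-NetAdapterRss output has been copied into a string. The sketch below assumes the entries use the group:CPU/distance notation described above (for example 0:0/0). The sample text and the helper names `parse_entries` and `far_cpus` are illustrative, not captured from a real adapter.

```python
import re

# Matches entries written as group:cpu/distance, for example "0:7/0".
ENTRY = re.compile(r"(\d+):(\d+)/(\d+)")


def parse_entries(text):
    """Return (group, cpu, distance) tuples for every entry found in text."""
    return [tuple(int(part) for part in match) for match in ENTRY.findall(text)]


def far_cpus(text, max_distance=0):
    """Return the entries whose NUMA distance exceeds max_distance."""
    return [entry for entry in parse_entries(text) if entry[2] > max_distance]


if __name__ == "__main__":
    # Hypothetical sample; paste the real indirection table here instead.
    sample = "0:0/0 0:1/0 0:2/0 0:14/32767 0:15/32767"
    for group, cpu, distance in far_cpus(sample):
        print(f"CPU {group}:{cpu} is remote (distance {distance})")
```

An empty result suggests that every CPU in the table is local to the adapter. Any listed entry is a candidate for exclusion through the "RSS Base Processor Number" and "RSS Max Processor Number" settings.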

IPoIB Performance Tuning

The user can configure the IPoIB adapter by setting some registry keys. The registry keys may affect IPoIB performance.

To improve performance, activate the performance tuning tool as follows:

  1. Start the "Device Manager" (open a command line window and enter: devmgmt.msc).

  2. Open "Network Adapters".

  3. Right click the relevant IPoIB adapter and select Properties.

  4. Select the "Advanced" tab.

  5. Modify performance parameters (properties) as desired.

The following is a list of key parameters for performance tuning.

Parameter

Description

Additional Options

Jumbo Packet

The maximum available size of the transfer unit, also known as the Maximum Transmission Unit (MTU). The MTU of a network can have a substantial impact on performance. A 4K MTU size improves performance for short messages, since it allows the OS to coalesce many small messages into a large one.

The valid MTU range for an Ethernet driver is 614 to 9614.

Note: All devices on the same physical network, or on the same logical network, must have the same MTU. This is applicable to the SoC MTU when using BlueField devices as well.

-

Receive Buffers

The number of receive buffers (default 512).

-

Send Buffers

The number of send buffers (default 2048).

-

Performance Options

Configures parameters that can improve adapter performance.

Interrupt Moderation

Moderates or delays interrupt generation, thereby optimizing network throughput and CPU utilization (default: Enabled).

  • When interrupt moderation is enabled, the system accumulates interrupts and sends a single interrupt rather than a series of interrupts. An interrupt is generated after receiving 5 packets or 10 ms after the first packet is received. This improves performance and reduces CPU load; however, it increases latency.

  • When interrupt moderation is disabled, the system generates an interrupt each time a packet is received or sent. In this mode, CPU utilization increases at higher data rates, as the system handles a larger number of interrupts. However, latency decreases, as each packet is handled faster.
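The trade-off between the two modes can be illustrated with a small simulation of the rule described above: one interrupt after 5 packets, or 10 ms after the first pending packet. This is a simplified model written for illustration, not the adapter's actual algorithm, and the function name `moderated_interrupts` is hypothetical.

```python
def moderated_interrupts(arrivals_ms, max_packets=5, max_delay_ms=10.0):
    """Return the times (ms) at which interrupts fire for sorted arrival times."""
    interrupts = []
    first = None   # arrival time of the oldest packet still waiting
    pending = 0    # packets received since the last interrupt
    for t in arrivals_ms:
        # Timer expiry: the oldest pending packet has waited max_delay_ms.
        if pending and t - first >= max_delay_ms:
            interrupts.append(first + max_delay_ms)
            pending = 0
        if pending == 0:
            first = t
        pending += 1
        # Packet-count threshold reached: fire immediately.
        if pending == max_packets:
            interrupts.append(t)
            pending = 0
    if pending:
        # Remaining packets are flushed when their timer expires.
        interrupts.append(first + max_delay_ms)
    return interrupts


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3, 4, 5]
    print("without moderation:", len(arrivals), "interrupts")
    print("with moderation:", moderated_interrupts(arrivals))
```

In this model, six packets arriving 1 ms apart produce two interrupts instead of six, but the sixth packet waits a full 10 ms before it is signalled. That delay is the latency cost noted above.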

Receive Side Scaling (RSS Mode)

Improves incoming packet processing performance. RSS enables the adapter port to utilize the multiple CPUs in a multi-core system for receiving incoming packets and steering them to the designated destination. RSS can significantly improve the number of transactions, the number of connections per second, and the network throughput.

This parameter can be set to one of the following values:

  • Enabled (default): Sets RSS mode.

  • Disabled: The hardware is configured once to use the Toeplitz hash function, and the indirection table is never changed.

Note: I/OAT is not used while in RSS mode.
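To make the roles of the Toeplitz hash and the indirection table more concrete, the sketch below hashes a flow's addresses and ports and uses the result to pick a CPU. The key, the table contents, and the helper names (`toeplitz_hash`, `flow_bytes`, `pick_cpu`) are assumptions chosen for illustration; a real adapter uses the key and table programmed by the operating system.

```python
import ipaddress

# Hypothetical 40-byte hash key and 128-entry indirection table (CPUs 0-7).
KEY = bytes(range(1, 41))
TABLE = [i % 8 for i in range(128)]


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    key_bits = len(key) * 8
    data_bits = len(data) * 8
    if key_bits < data_bits + 31:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(data_bits):
        # Test input bit i, most significant bit of each byte first.
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in the 32-bit window of the key that starts at bit i.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def flow_bytes(src_ip, dst_ip, src_port, dst_port):
    """Pack an IPv4 flow as source IP, destination IP, source port, destination port."""
    return (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + src_port.to_bytes(2, "big")
            + dst_port.to_bytes(2, "big"))


def pick_cpu(hash_value, table):
    """Index the indirection table with the hash (low bits for power-of-two sizes)."""
    return table[hash_value % len(table)]


if __name__ == "__main__":
    h = toeplitz_hash(KEY, flow_bytes("10.0.0.1", "10.0.0.2", 1234, 80))
    print(f"hash=0x{h:08x} -> CPU {pick_cpu(h, TABLE)}")
```

Because the hash depends only on the flow's addresses and ports, every packet of a connection lands on the same CPU, while different connections spread across the table's entries.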

Receive Completion Method

Sets the completion methods of the received packets, and can affect network throughput and CPU utilization.

  • Polling Method

    Increases CPU utilization, as the system polls the receive rings for incoming packets. However, it may increase network performance, as incoming packets are handled faster.

  • Adaptive (Default Settings)

    Dynamically combines the interrupt and polling methods, depending on traffic type and network usage. Choosing a different setting may improve network and/or system performance in certain configurations.

Rx Interrupt Moderation Type

Sets the rate at which the controller moderates or delays the generation of interrupts, making it possible to optimize network throughput and CPU utilization. The default setting (Adaptive) adjusts the interrupt rates dynamically, depending on the traffic type and network usage. Choosing a different setting may improve network and system performance in certain configurations.

Send Completion Method

Sets the completion method of the send packets, which may affect network throughput and CPU utilization.

Offload Options

Allows you to specify which TCP/IP offload settings are handled by the adapter rather than the operating system.

Enabling offloading services increases transmission performance, as the offload tasks are performed by the adapter hardware rather than the operating system, thus freeing CPU resources to work on other tasks.

IPv4 Checksums Offload

Enables the adapter to compute IPv4 checksum upon transmit and/or receive instead of the CPU (default Enabled).

TCP/UDP Checksum Offload for IPv4 packets

Enables the adapter to compute TCP/UDP checksum over IPv4 packets upon transmit and/or receive instead of the CPU (default Enabled).

TCP/UDP Checksum Offload for IPv6 packets

Enables the adapter to compute TCP/UDP checksum over IPv6 packets upon transmit and/or receive instead of the CPU (default Enabled).

Large Send Offload (LSO)

Allows the TCP/UDP stack to build a TCP/UDP message up to 64KB long and send it in one call down the stack. The adapter then re-segments the message into multiple TCP/UDP packets for transmission on the wire, with each packet sized according to the MTU. This option offloads a large amount of kernel processing time from the host CPU to the adapter.
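The amount of segmentation work that LSO moves to the adapter can be estimated with simple arithmetic. The sketch below assumes 40 bytes of headers per packet (IPv4 plus TCP without options) and treats the MTU as the IP packet size. Both are simplifying assumptions, and the function name `segment_sizes` is hypothetical.

```python
def segment_sizes(payload_len, mtu, header_len=40):
    """Return the payload size of each on-wire packet for one large send."""
    # Assumption: header_len covers IPv4 (20 bytes) + TCP (20 bytes), no options.
    mss = mtu - header_len
    if mss <= 0:
        raise ValueError("MTU must be larger than the header length")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    for mtu in (1500, 9000):
        sizes = segment_sizes(64 * 1024, mtu)
        print(f"MTU {mtu}: {len(sizes)} packets, last carries {sizes[-1]} bytes")
```

Under these assumptions, a 64KB send becomes 45 packets at an MTU of 1500 and 8 packets at an MTU of 9000. Without LSO, the host stack would build each of those packets itself.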

© Copyright 2023, NVIDIA. Last updated on Feb 7, 2024.