Performance Related Troubleshooting


Issue: Low performance

Cause: The OS power profile might not be configured for maximum performance.

Solution:

  1. Go to "Power Options" in the "Control Panel" and make sure that "Maximum Performance" is set as the power scheme.

  2. Reboot the machine.
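The power scheme can also be checked from a script rather than through the Control Panel. The following is a minimal sketch, assuming that "Maximum Performance" corresponds to the built-in Windows "High performance" plan and that the output of `powercfg /getactivescheme` has its usual "GUID followed by the plan name in parentheses" shape. The function names are hypothetical and only serve the illustration.

```python
import re
import subprocess

# GUID of the built-in Windows "High performance" power plan.
# Assumption: this is the plan the text calls "Maximum Performance".
HIGH_PERFORMANCE_GUID = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"


def parse_active_scheme(powercfg_output):
    """Extract (guid, name) from the output of `powercfg /getactivescheme`."""
    match = re.search(r"([0-9a-fA-F-]{36})\s+\((.+?)\)", powercfg_output)
    if match is None:
        raise ValueError("unrecognized powercfg output")
    return match.group(1).lower(), match.group(2)


def is_high_performance(powercfg_output):
    """Return True if the active plan is the built-in High performance plan."""
    guid, _name = parse_active_scheme(powercfg_output)
    return guid == HIGH_PERFORMANCE_GUID


def query_active_scheme():
    """Run powercfg and return its raw output (Windows only)."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the machine being debugged, `is_high_performance(query_active_scheme())` reports whether the plan is already active. Systems that use a vendor-specific or custom power plan will have a different GUID, so a negative result should be confirmed in "Power Options" before anything is changed.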

Issue: Low SMBDirect performance

Cause: The NetworkDirect registry key is enabled by default on the NIC, but ECN and/or PFC is not enabled on the switch.

Solution: Either enable ECN/PFC on the switch or set NetworkDirect to zero.

  1. Go to “Device Manager”, locate the Mellanox adapter that you are debugging, right-click it, choose “Properties”, and go to the “Information” tab. Check that the PCI generation and link speed are reported as expected:

    • PCI Gen 1: should appear as "PCI-E 2.5 GT/s"

    • PCI Gen 2: should appear as "PCI-E 5.0 GT/s"

    • PCI Gen 3: should appear as "PCI-E 8.0 GT/s"

    • Link Speed: 10.0 Gbps / 40.0 Gbps / 56.0 Gbps / 100 Gbps

  2. To determine whether the Mellanox NIC and the PCI bus can achieve their maximum speed, run nd_send_bw in loopback. On the same machine:

    1. Run "start /b /affinity 0x1 nd_send_bw -S <IP_host>" where <IP_host> is the local IP.

    2. Run "start /b /affinity 0x2 nd_send_bw -C <IP_host>".

    3. Repeat for port 2 with the appropriate IP.

    On PCI Gen3, the expected result is around 5700 MB/s.
    On PCI Gen2, the expected result is around 3300 MB/s.
    Any lower number points to a bad configuration or to installation in the wrong PCI slot. Malfunctioning QoS settings and Flow Control can also be the cause.

  3. To determine the maximum speed between two machines using the most basic test:

    1. Run "nd_send_bw -S <IP_host1>" on machine 1 where <IP_host1> is the local IP.

    2. Run "nd_send_bw -C <IP_host1>" on machine 2.

    3. Results are reported in Gb/s, where one gigabit is 2^30 bits, and reflect the actual data that was transferred, excluding headers.

    4. If these results are not as expected, the problem is most probably with one or more of the following:

  • Old firmware version.

  • Misconfigured flow control: Global pause or PFC is configured incorrectly on the hosts, routers, or switches.

  • CPU/power options are not set to "Maximum Performance".
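The numbers quoted in the steps above can be put in context with a little arithmetic. The sketch below is illustrative only and rests on several assumptions that the text does not state: an x8 PCIe slot, MB/s meaning 10^6 bytes per second, link speeds given in decimal Gbps, and a 5% tolerance for "around". All function names are hypothetical. It computes the theoretical PCIe ceiling per generation, converts between decimal Gbps and the 2^30-based Gb/s that the results use, and compares a measured loopback figure with the expected one.

```python
# Expected loopback throughput quoted in the loopback step, in MB/s.
EXPECTED_LOOPBACK_MB_S = {"gen2": 3300, "gen3": 5700}

# Per-lane signalling rate in GT/s and line-encoding efficiency
# (8b/10b for Gen 1 and Gen 2, 128b/130b for Gen 3).
PCIE_GENERATIONS = {
    "gen1": (2.5, 8 / 10),
    "gen2": (5.0, 8 / 10),
    "gen3": (8.0, 128 / 130),
}


def pcie_ceiling_mb_s(gen, lanes=8):
    """Theoretical one-direction PCIe bandwidth in MB/s (10^6 bytes/s).

    Packet and protocol overhead are ignored, so real transfers land lower.
    lanes=8 is an assumption; check the actual slot width.
    """
    rate_gt_s, efficiency = PCIE_GENERATIONS[gen]
    bits_per_second = rate_gt_s * 1e9 * efficiency * lanes
    return bits_per_second / 8 / 1e6


def decimal_gbps_to_binary_gbit(value):
    """Convert decimal Gbps (10^9 bits/s) to Gb/s counted in 2^30 bits."""
    return value * 1e9 / 2**30


def binary_gbit_to_decimal_gbps(value):
    """Convert Gb/s counted in 2^30 bits to decimal Gbps (10^9 bits/s)."""
    return value * 2**30 / 1e9


def loopback_meets_expectation(gen, measured_mb_s, tolerance=0.05):
    """True if a loopback result is within `tolerance` of the expected figure.

    The 5% default is a hypothetical reading of "around", not a vendor value.
    """
    return measured_mb_s >= EXPECTED_LOOPBACK_MB_S[gen] * (1 - tolerance)


if __name__ == "__main__":
    for gen in ("gen2", "gen3"):
        print(gen, "x8 ceiling:", round(pcie_ceiling_mb_s(gen)), "MB/s")
    for link in (10, 40, 56, 100):
        print(link, "Gbps link =", round(decimal_gbps_to_binary_gbit(link), 2),
              "Gb/s in 2^30-bit units")
```

Under these assumptions a 40 Gbps link corresponds to roughly 37.25 Gb/s in 2^30-bit units, so a two-machine result slightly below the nominal link speed is not by itself a sign of misconfiguration.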

© Copyright 2023, NVIDIA. Last updated on Nov 1, 2023.