Optimizing IO Performance#

Networking#

We recommend that you download the latest driver and firmware for your network adapter. Before making any changes, contact your network adapter’s vendor to confirm whether the tuning options in this guide apply to your hardware.

NUMA Node#

Always ensure that you use local CPU and memory that are in the same NUMA domain as your network adapter.

To check your network adapter’s NUMA domain and its local CPU list, run the following commands:

cat /sys/class/net/<ethernet interface>/device/numa_node
cat /sys/class/net/<ethernet interface>/device/local_cpulist
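
To keep a network-bound application on the adapter’s NUMA node, you can pin it with numactl. This is a minimal sketch; replace 1 with the node reported by the numa_node query above, and <application> is a placeholder for your workload:

numactl --cpunodebind=1 --membind=1 <application>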

IRQ Balance#

The operating system typically distributes interrupts among all CPU cores in a multi-processor system, but this can delay interrupt processing.

To disable the irqbalance service on Linux, run the following command:

sudo systemctl disable irqbalance
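
Disabling the service only prevents it from starting at the next boot. To stop a running instance immediately, also run:

sudo systemctl stop irqbalance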

Configuring Interrupt Handling#

A channel in a network adapter is an IRQ plus the set of queues that can trigger that IRQ. Typically, you do not want more interrupt queues than the number of cores in the system, so limit the number of channels to the cores in the adapter’s NUMA domain.

To set the number of channels:

Before you begin, stop the irqbalance service.

  1. Check the current settings with the following command:

    ethtool -l <adapter>
    

    The output shows the maximum supported and currently configured counts for each queue type (see the sample output after this list).

  2. Set the number of channels, for example:

    sudo ethtool -L <adapter> combined 16 tx 0 rx 0
    
  3. Set separate receive (rx) and transmit (tx) queues, or combined channels that serve both directions, depending on your workload.

  4. For adapter-specific channel recommendations, contact your vendor.
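
For reference, the output of ethtool -l typically has the following shape (the values here are illustrative and vary by adapter):

Channel parameters for <adapter>:
Pre-set maximums:
RX: 0
TX: 0
Other: 1
Combined: 63
Current hardware settings:
RX: 0
TX: 0
Other: 1
Combined: 16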

For NVIDIA Mellanox network adapters, to set the appropriate interrupt handling masks, invoke the following script:

sudo set_irq_affinity.sh <adapter>

This script is included in the NVIDIA MLNX_OFED (MOFED) installation.
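
If the script is not available, you can set the affinity of an individual IRQ manually through procfs. This is a sketch; look up the adapter’s IRQ numbers in /proc/interrupts and choose CPUs from the local_cpulist query shown earlier:

echo <cpu_list> | sudo tee /proc/irq/<irq>/smp_affinity_list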

TX/RX Queue Size#

The NIC’s queue size dictates how many descriptors are allocated in its ring buffers for DMA transfers. To help prevent packet drops, we recommend that you set the size to the maximum allowed value. You can also set it to a value that works best for your use case.

To query the current setting of the queue size:

ethtool -g enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX: 8192
RX Mini: n/a
RX Jumbo: n/a
TX: 8192
Current hardware settings:
RX: 512
RX Mini: n/a
RX Jumbo: n/a
TX: 1024

To set the queue size of a NIC:

sudo ethtool -G <adapter> rx <value> tx <value>
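
For example, to raise both queues on enp1s0 to the pre-set maximums reported above:

sudo ethtool -G enp1s0 rx 8192 tx 8192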

Large Receive Offload#

Depending on your use case, you can optimize for maximum throughput or best latency, but rarely both. The Large Receive Offload (LRO) setting optimizes for maximum network throughput, but enabling it might negatively affect network latency. Contact your network adapter vendor for more information about whether LRO is supported and for best practices on its usage.

To enable or disable LRO:

sudo ethtool -K <adapter> lro <on|off>
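
To check the current LRO state:

ethtool -k <adapter> | grep large-receive-offload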

MTU#

When you bring up the network interface, we recommend that you set the network adapter’s MTU to jumbo frames (9000):

sudo ifconfig <adapter> <IP_address> netmask <network_mask> mtu 9000 up

To check the current setting, run a command such as the following:

ifconfig <adapter> | grep mtu
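
Because ifconfig is deprecated on many modern distributions, you can do the same with iproute2:

sudo ip link set dev <adapter> mtu 9000
ip link show <adapter> | grep mtu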

MAX_ACC_OUT_READ#

This setting is NVIDIA Mellanox-specific, and here are the recommended values for the following NICs:

  • ConnectX-6: 44

  • ConnectX-7: 0 (the device tunes this value automatically)

To check the current settings:

sudo mlxconfig -d <dev> query | grep MAX_ACC_OUT_READ

To apply the recommended value:

  1. Run the following commands:

    sudo mlxconfig -d <dev> set ADVANCED_PCI_SETTINGS=1
    sudo mlxconfig -d <dev> set MAX_ACC_OUT_READ=<value>
    
  2. For this setting to take effect, reboot the system.

PCIe Max Read Request#

Unlike the previous setting, the PCIe Maximum Read Request Size (MRRS) is a generic PCIe setting and can be applied to other network adapters; the example below uses an NVIDIA Mellanox NIC.

Note

Ensure that you set the MRRS to the value recommended by your vendor.

Here is an example that sets the MRRS of an NVIDIA Mellanox NIC to 4096 bytes. MRRS is encoded in bits 14:12 of the PCIe Device Control register (offset 8 into the PCI Express capability), so the command writes 101b (0x5000) under the mask 0x7000:

sudo setpci -v -d <dev> cap_exp+8.w=5000:7000

This setting does not persist after the system reboots.
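
To verify the active MRRS, you can read it back with lspci; the DevCtl line reports the value as MaxReadReq:

sudo lspci -d <dev> -vv | grep MaxReadReq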

Relaxed Ordering#

Setting the PCIe ordering to relaxed for the network adapter sometimes results in better performance. There are different ways to enable relaxed ordering on the network adapter. Contact your vendor for more information.

Here is a sample command to check relaxed ordering on NVIDIA Mellanox NICs. For this command to work, set ADVANCED_PCI_SETTINGS to 1 (refer to MAX_ACC_OUT_READ for more information).

sudo mlxconfig -d <dev> query | grep PCI_WR_ORDERING
PCI_WR_ORDERING per_mkey(0)

A value of 0 means that the application or driver determines whether to set relaxed ordering (RO) for its memory regions. PCI_WR_ORDERING=1 forces RO for every PCIe inbound write regardless of the application, except for completion queue entries (CQEs).

  1. To enable relaxed ordering:

    sudo mlxconfig -d <dev> set PCI_WR_ORDERING=1
    
  2. Reboot the system.

10b PCIe Tags#

Ideally, the PCIe endpoint should use 10-bit (10b) PCIe tags so that it can issue a large number of outstanding read requests and hide high read latencies when the system is busy. Contact your endpoint’s vendor for more information.

Here is an example for ConnectX-7. The command reads the PCIe Device Control 2 register (cap_exp+28), where bit 12 is the 10-Bit Tag Requester Enable bit:

setpci -s <bus> -v cap_exp+28.w
1000
  • If bit 12 is 1, then 10b tags are enabled.

  • If not, set bit 12.

Unload the InfiniBand (IB) drivers first. In the following example, the register initially reads 0x0040 (bit 12 is clear); the write 1040:1040 sets bit 12 while leaving the other bits unchanged.

Example:

systemctl stop openibd
setpci -s <bus> -v cap_exp+28.w
0040
setpci -s <bus> -v cap_exp+28.w=1040:1040
systemctl start openibd
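
With a recent pciutils, lspci can also decode the flag directly; 10BitTagReq+ in the DevCtl2 line indicates that the feature is enabled:

sudo lspci -s <bus> -vv | grep 10BitTagReq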

Storage/Filesystem#

This section provides information about performance tunings that are related to storage and the filesystem.

Drop Page Cache#

When files are read from storage into memory on a Linux system, they are cached in unused memory areas called the page cache. If you want to benchmark the storage subsystem, drop the page cache first so that you measure true storage performance rather than cached reads.

To drop the page cache, run the following command (writing 3 also drops reclaimable dentries and inodes; write 1 to drop only the page cache):

echo 3 | sudo tee /proc/sys/vm/drop_caches

To see how much memory was released by dropping the page cache, compare the Cached: line in the output of the following command before and after invoking the previous command:

cat /proc/meminfo | grep Cached
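
As a minimal sketch, the following shell snippet reports the page cache size before and after dropping it, so you do not have to diff the output manually:

before=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
after=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "Page cache: ${before} kB before, ${after} kB after"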