Optimizing IO Performance#
Networking#
We recommend that you download the latest driver and firmware for your network adapter. Before making any changes, contact your network adapter’s vendor for information about whether the tuning options in this guide are applicable.
NUMA Node#
Always ensure that you use local CPU and memory that are in the same NUMA domain as your network adapter.
To check your network adapter’s NUMA domain, run the following commands:
cat /sys/class/net/<ethernet interface>/device/numa_node
cat /sys/class/net/<ethernet interface>/device/local_cpulist
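For example, you can use numactl to run a network-bound application on the CPUs and memory local to the adapter. The following sketch assumes a hypothetical interface named eth0 and that the numactl package is installed:
# eth0 is a hypothetical interface name; substitute your adapter
node=$(cat /sys/class/net/eth0/device/numa_node)
numactl --cpunodebind=$node --membind=$node <application>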
IRQ Balance#
The operating system typically distributes interrupts among all CPU cores in a multi-processor system, which can delay interrupt processing.
To disable the irqbalance service on Linux, run the following command:
sudo systemctl disable irqbalance
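Disabling the service prevents it from starting at the next boot. To also stop a currently running instance and confirm its state, you can run:
sudo systemctl stop irqbalance
systemctl status irqbalance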
Configuring Interrupt Handling#
A channel in a network adapter is an IRQ and the set of queues that can trigger that IRQ. Typically, you do not want more interrupt queues than the number of cores in the system, so limit the number of channels to the cores in the adapter’s NUMA domain.
To set the number of channels:
Before you begin, stop the irqbalance service.
Check the current settings with the following command:
ethtool -l <adapter>
This command reports the pre-set maximums and the current settings for each queue type.
Set the number of channels, for example:
sudo ethtool -L <adapter> combined 16 tx 0 rx 0
Depending on the adapter, you can set the number of receive queues (rx), transmit queues (tx), or combined queues that both receive and transmit (combined).
Contact your vendor for information.
For NVIDIA Mellanox network adapters, to set the appropriate interrupt handling masks, invoke the following script:
sudo set_irq_affinity.sh <adapter>
This script comes with a MOFED installation.
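For adapters without a vendor-provided affinity script, one common approach is to write a core list from the adapter’s NUMA domain into the IRQ’s smp_affinity_list file. The interface name eth0, IRQ number 120, and core list 0-15 below are hypothetical examples:
# List the IRQs that belong to the adapter (eth0 is a hypothetical name)
grep eth0 /proc/interrupts
# Pin one of those IRQs (hypothetical number 120) to cores 0-15
echo 0-15 | sudo tee /proc/irq/120/smp_affinity_list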
TX/RX Queue Size#
The NIC’s queue size dictates how many ring buffer entries are allocated for DMA transfers. To help prevent packet drops, we recommend that you set the size to the maximum allowed value, or to a value that works best for your use case.
To query the current setting of the queue size:
ethtool -g enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX: 8192
RX Mini: n/a
RX Jumbo: n/a
TX: 8192
Current hardware settings:
RX: 512
RX Mini: n/a
RX Jumbo: n/a
TX: 1024
To set the queue size of a NIC:
sudo ethtool -G <adapter> rx <value> tx <value>
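For example, to raise both ring buffers to the maximums reported above, assuming the adapter is named enp1s0:
sudo ethtool -G enp1s0 rx 8192 tx 8192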
Large Receive Offload#
Depending on your use case, you can optimize for maximum throughput or best latency, but rarely both. Enabling Large Receive Offload (LRO) optimizes for maximum network throughput, but it might negatively affect network latency. Contact your network adapter vendor for more information about whether LRO is supported and the best practices for usage.
To enable/disable LRO:
sudo ethtool -K <adapter> lro <on|off>
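To verify the current LRO state, you can query the adapter’s offload settings:
ethtool -k <adapter> | grep large-receive-offload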
MTU#
When you bring up the network interface, we recommend that you set the network adapter’s MTU to jumbo frame (9000):
sudo ifconfig <adapter> <IP_address> netmask <network_mask> mtu 9000 up
To check the current setting, run the following command:
ifconfig <adapter> | grep mtu
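If you prefer the iproute2 tools, the same change can be made with ip; like the ifconfig command above, this does not persist across reboots unless it is added to your network configuration:
sudo ip link set dev <adapter> mtu 9000
ip link show <adapter> | grep mtu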
MAX_ACC_OUT_READ#
This setting is NVIDIA Mellanox specific. The recommended values for the following NICs are:
ConnectX-6: 44
ConnectX-7: 0 (the device tunes this configuration automatically)
To check the current settings:
sudo mlxconfig -d <dev> query | grep MAX_ACC_OUT_READ
To set this setting to the recommended value:
Run the following commands:
sudo mlxconfig -d <dev> set ADVANCED_PCI_SETTINGS=1
sudo mlxconfig -d <dev> set MAX_ACC_OUT_READ=<value>
For this setting to take effect, reboot the system.
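For example, the full sequence for a ConnectX-6 NIC might look like the following; the PCI address 0000:01:00.0 is a hypothetical placeholder for your device:
# 0000:01:00.0 is a hypothetical device address; substitute your NIC
sudo mlxconfig -d 0000:01:00.0 set ADVANCED_PCI_SETTINGS=1
sudo mlxconfig -d 0000:01:00.0 set MAX_ACC_OUT_READ=44
sudo reboot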
PCIe Max Read Request#
The following example is NVIDIA Mellanox specific, but the PCIe Max Read Request Size (MRRS) setting also applies to other network adapters.
Note
Ensure that you set the MRRS to the value recommended by your vendor.
Here is an example that shows you how to set the MRRS of an NVIDIA Mellanox NIC to 4096:
sudo setpci -v -d <dev> cap_exp+8.w=5000:7000
This setting does not persist after the system reboots.
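To confirm the resulting value, you can inspect the device’s PCIe capability registers with lspci; the device address below is a hypothetical example:
sudo lspci -s 0000:01:00.0 -vv | grep MaxReadReq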
Relaxed Ordering#
Setting the PCIe ordering to relaxed for the network adapter sometimes results in better performance. There are different ways to enable relaxed ordering on the network adapter. Contact your vendor for more information.
Here is a sample command to check relaxed ordering on NVIDIA Mellanox NICs. For this command to work, set ADVANCED_PCI_SETTINGS to 1 (refer to MAX_ACC_OUT_READ for more information).
sudo mlxconfig -d <dev> query | grep PCI_WR_ORDERING
PCI_WR_ORDERING per_mkey(0)
A value of 0 means that the application or driver determines whether to set relaxed ordering (RO) for its memory regions. PCI_WR_ORDERING=1 forces RO for every PCIe inbound write regardless of the application, except for completion queue entries (CQEs).
To enable relaxed ordering:
sudo mlxconfig -d <dev> set PCI_WR_ORDERING=1
Reboot the system.
Storage/Filesystem#
This section provides information about performance tunings that are related to storage and the filesystem. The performance metrics for storage are:
Throughput: For large requests, throughput is reported using units such as MB/s or GB/s. For small requests, the unit is IOPS (Input/Output Operations Per Second).
Latency: Time to complete a request.
Preconditioning an SSD#
To obtain stable performance results, apply preconditioning workloads to new SSDs before benchmarking. Typically, these workloads are the following (a sample fio job file appears after the list):
Two full sequential write cycles over the SSD.
A full random write cycle over the SSD. This might take multiple hours.
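The following fio job file is a minimal sketch of such a preconditioning run; the device name /dev/nvme0n1 is a hypothetical example, and the job overwrites all data on that device:
# precondition.fio (hypothetical file name); all data on /dev/nvme0n1 is destroyed
[global]
ioengine=io_uring
direct=1
filename=/dev/nvme0n1
[seq-fill]
# Two full sequential write passes over the device
rw=write
bs=128k
loops=2
[rand-fill]
# One full random write pass, started after seq-fill completes
stonewall
rw=randwrite
bs=4k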
Drop Page Cache#
When files are read from storage on a Linux system, they are cached in unused memory areas called the page cache. If you want to benchmark the storage subsystem, for example, you might need to drop the page cache before every test to see true storage performance.
To drop the page cache, run the following command:
# Write out dirty pages to disk(s)
sudo sync
# Drop page cache entries
echo 3 | sudo tee /proc/sys/vm/drop_caches
To see how much memory was released by dropping the page cache, compare the Cached: line in the output of the following command before and after invoking the previous commands:
cat /proc/meminfo | grep Cached
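For example, a minimal before-and-after check might look like this:
grep ^Cached: /proc/meminfo      # note the cached size
sudo sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
grep ^Cached: /proc/meminfo      # the cached size should now be much smaller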
Sample fio Job Files for NVMe SSDs#
This section provides sample fio job files for benchmarking performance of NVMe SSDs using the Flexible IO Tester (fio) benchmark.
The sample files are configured to use the io_uring engine and allow pinning the requesting cores to the local socket using the cpus_allowed parameter. The job performs random reads of 4K blocks on two NVMe SSDs concurrently.
The number of IO requests in flight can be tuned by adjusting the iodepth and numjobs parameters. Typically, for small requests (for example, 4K) higher values for iodepth and numjobs are used to keep the device busy. In the following example, the fio job uses four threads per SSD and 256 requests per thread:
[global]
ioengine=io_uring
direct=1
bs=4k
iodepth=256
numjobs=4
thread=1
group_reporting=1
runtime=60
time_based=1
ramp_time=5
rw=randread
norandommap=1
randrepeat=0
cpus_allowed_policy=split
[nvme0n1]
# nvme0 is attached to socket 0
filename=/dev/nvme0n1
cpus_allowed=0-3
numa_mem_policy=bind:0
[nvme1n1]
# nvme1 is attached to socket 1
filename=/dev/nvme1n1
cpus_allowed=80-83
numa_mem_policy=bind:1
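To run the job, save it to a file (the name nvme-randread.fio below is a hypothetical example) and invoke fio on it:
fio nvme-randread.fio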
Tuning Performance for NVMe SSDs#
To ensure that SSDs hit peak performance on Grace, consider the following factors:
Minimize the overhead of interrupt handling by coalescing interrupts for NVMe devices or by using polling.
Minimize the overhead of IOMMU translations by using lazy invalidations.
To keep the device busy, ensure that there are a sufficient number of requests in flight.
Minimize cross-socket traffic by pinning the cores performing IO operations to the local socket(s). Refer to the cpus_allowed and numa_mem_policy fields in the sample fio job file for more information.
Interrupt Coalescing for NVMe SSDs#
For NVMe SSDs, interrupt coalescing reduces the interrupt overhead by generating an interrupt after a batch of requests have been completed. As per the NVMe specification:
Aggregation Time (TIME): Specifies the recommended maximum time in 100 microsecond increments that a controller might delay an interrupt due to interrupt coalescing. A value of 0h means there is no delay. The controller might apply this time to each interrupt vector or across all interrupt vectors. The reset value of this setting is 0h.
Aggregation Threshold (THR): Specifies the recommended minimum number of completion queue entries to aggregate per interrupt vector before signaling an interrupt to the host. This is a 0-based value, and the reset value of this setting is 0h.
To enable interrupt coalescing with a batch of eight completion queue entries per interrupt and an aggregation time of 100 us, run the following commands:
# For NVMe drive nvme0n1
sudo nvme set-feature /dev/nvme0n1 -f 0x8 -V 0x00000107
# Check the setting
sudo nvme get-feature /dev/nvme0n1 -f 0x8 -H
get-feature:0x08 (Interrupt Coalescing), Current value:0x00000107
Aggregation Time (TIME): 100 usec
Aggregation Threshold (THR): 8
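To return to the default behavior with no coalescing, you can write the reset value of 0h back to the feature:
sudo nvme set-feature /dev/nvme0n1 -f 0x8 -V 0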
Lazy IOMMU Invalidations#
Lazy IOMMU invalidation is a kernel feature that amortizes the overhead of IOMMU cache flushes over multiple IO requests. To check whether lazy IOMMU invalidation is enabled, run the following command:
sudo dmesg | grep -i iommu | grep -i 'passthrough\|strict\|lazy'
To enable lazy IOMMU invalidations, boot the kernel with the iommu.strict=0 kernel parameter.
Note that the lazy invalidations feature has been enabled by default on x86 servers for several years.
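On a GRUB-based system, a typical way to add the parameter is sketched below; the exact file locations and the command that regenerates the GRUB configuration vary by distribution:
# Append iommu.strict=0 to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
sudo update-grub    # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot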
Enabling the Polling-Based Completion on NVMe SSDs#
To enable polling-based completion:
Configure the Linux nvme kernel module to use polling for a subset of IO queues:
# Edit /etc/modprobe.d/nvme.conf - each NVMe device will have 16 queues with polling enabled (no interrupts)
options nvme poll_queues=16
Reboot the system.
Add the following lines to the global section of the fio job file:
hipri=1
fixedbufs=1
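You can confirm that the module parameter took effect after the reboot:
cat /sys/module/nvme/parameters/poll_queues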