NVIDIA MLNX_OFED Documentation Rev 4.9- LTS

Optimized Memory Access

Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers from contiguous pages.
To activate this, set the environment variables below to PREFER_CONTIG or CONTIG.



The following are all the possible values that can be assigned to the buffer:

Possible Value

Use current pages (ANON small ones).

Force huge pages.

Force contiguous pages.

Try contiguous pages; fall back to ANON small pages. (Default)

Try huge pages; fall back to ANON small pages.

Try huge pages; fall back to contiguous pages and, if that fails, to ANON small pages.

Note that the values are not case sensitive.


The application calls the ibv_exp_reg_mr API with the IBV_EXP_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the addr field of struct ibv_mr holds the address of the allocated memory block. This block is freed implicitly when ibv_dereg_mr() is called.
The following are environment variables that can be used to control error cases/contiguity:




Configures the allocator type.

  • ALL (Default) - uses all possible allocators and selects the most efficient one

  • ANON - enables the usage of anonymous pages and disables the allocator

  • CONTIG - forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails


Sets the maximum contiguous block size order.

  • Values: 12-23

  • Default: 23


Sets the minimum contiguous block size order.

  • Values: 12-23

  • Default: 12
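To make the 12-23 range concrete, the following minimal sketch converts a block size order to bytes. It assumes the order is the base-2 logarithm of the block size in bytes, so 12 would correspond to 4 KB and 23 to 8 MB. The helper names are hypothetical and are not part of any MLNX_OFED API.

```c
#include <assert.h>
#include <stddef.h>

/* Range of block size orders documented above. */
enum { CONTIG_ORDER_MIN = 12, CONTIG_ORDER_MAX = 23 };

/* Returns 1 if `order` lies in the documented 12-23 range, 0 otherwise. */
static int contig_order_is_valid(int order)
{
    return order >= CONTIG_ORDER_MIN && order <= CONTIG_ORDER_MAX;
}

/* Converts an order to a block size in bytes.
 * Assumption: order == log2(block size in bytes). */
static size_t contig_block_bytes(int order)
{
    assert(contig_order_is_valid(order));
    return (size_t)1 << order;
}
```

Under that assumption, the default minimum and maximum orders span block sizes from a single 4 KB page up to 8 MB.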

Memory Region Re-registration allows the user to change attributes of the memory region. The user may change the PD, access flags or the address and length of the memory region. Memory region re-registration supports contiguous pages allocation. In effect, the operation deregisters the memory region and then registers it again. Where possible, resources are reused instead of being deallocated and reallocated.


Please note that the verb is implemented as an experimental verb.



int ibv_exp_rereg_mr(struct ibv_mr *mr, int flags, struct ibv_pd *pd, void *addr, size_t length, uint64_t access, struct ibv_exp_rereg_mr_attr *attr);


  • mr - The memory region to modify.

  • flags - A bit-mask used to indicate which of the following properties of the memory region are being modified. Flags should be a combination of:
    IBV_EXP_REREG_MR_CHANGE_TRANSLATION /* Change translation (location and length) */
    IBV_EXP_REREG_MR_CHANGE_PD /* Change protection domain */
    IBV_EXP_REREG_MR_CHANGE_ACCESS /* Change access flags */

  • pd - If IBV_EXP_REREG_MR_CHANGE_PD is set in flags, this field specifies the new protection domain to associate with the memory region. Otherwise, this parameter is ignored.

  • addr - If IBV_EXP_REREG_MR_CHANGE_TRANSLATION is set in flags, this field specifies the start of the virtual address to use in the new translation. Otherwise, this parameter is ignored.

  • length - If IBV_EXP_REREG_MR_CHANGE_TRANSLATION is set in flags, this field specifies the length of the virtual address range to use in the new translation. Otherwise, this parameter is ignored.

  • access - If IBV_EXP_REREG_MR_CHANGE_ACCESS is set in flags, this field specifies the new memory access rights. Otherwise, this parameter is ignored. Could be one of the following:
    IBV_ACCESS_ALLOCATE_MR /* Let the library allocate the memory for the user, tries to get contiguous pages */

  • attr - Future extensions

ibv_exp_rereg_mr returns 0 on success, or the value of an errno on failure (which indicates the error reason). In case of an error, the MR is in an undefined state, and the user needs to call ibv_dereg_mr in order to release it.
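As an illustration of how the flags bit-mask selects which of the pd, addr, length and access arguments are consulted, here is a minimal, self-contained sketch. The REREG_CHANGE_* constants are local stand-ins for the IBV_EXP_REREG_MR_CHANGE_* flags with assumed bit values, and the struct and function names are hypothetical. The sketch does not call the verbs library.

```c
#include <assert.h>

/* Stand-ins for the IBV_EXP_REREG_MR_CHANGE_* flags.
 * The bit values are assumptions made for this sketch. */
enum rereg_change {
    REREG_CHANGE_TRANSLATION = 1 << 0,
    REREG_CHANGE_PD          = 1 << 1,
    REREG_CHANGE_ACCESS      = 1 << 2,
};

/* Records which re-registration arguments are consulted for a flags mask. */
struct rereg_inputs {
    int uses_pd;          /* the pd argument */
    int uses_translation; /* the addr and length arguments */
    int uses_access;      /* the access argument */
};

static struct rereg_inputs rereg_inputs_for(int flags)
{
    const int known = REREG_CHANGE_TRANSLATION | REREG_CHANGE_PD |
                      REREG_CHANGE_ACCESS;
    assert((flags & ~known) == 0); /* reject bits this sketch does not model */

    struct rereg_inputs in = {
        .uses_pd          = (flags & REREG_CHANGE_PD) != 0,
        .uses_translation = (flags & REREG_CHANGE_TRANSLATION) != 0,
        .uses_access      = (flags & REREG_CHANGE_ACCESS) != 0,
    };
    return in;
}
```

For example, a mask that combines the PD and access changes leaves addr and length ignored, which matches the parameter descriptions above.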

Please note that if the MR (Memory Region) is created as a Shared MR and a translation is requested, the MR is no longer a shared MR after the call. Moreover, re-registration of MRs that use Mellanox PeerDirect™ technology is not supported.

Memory Window allows the application to have more flexible control over remote access to its memory. It is available only on physical functions / native machines. The two types of Memory Windows supported are type 1 and type 2B.
Memory Windows are intended for situations where the application wants to:

  • grant and revoke remote access rights to a registered region in a dynamic fashion with less of a performance penalty

  • grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered region

For further information, please refer to the InfiniBand specification document.


Memory Windows API cannot co-work with peer memory clients (Mellanox PeerDirect™).

Query Capabilities

Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_exp_query_device.
For example:


struct ibv_exp_device_attr device_attr = {.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1};
ibv_exp_query_device(context, &device_attr);
if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW ||
    device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MW_TYPE_2B) {
    /* Memory window is supported */
}

Allocating Memory Window

Allocating a memory window is done by calling the ibv_alloc_mw verb.


type_mw = IBV_MW_TYPE_2; /* or IBV_MW_TYPE_1 */
mw = ibv_alloc_mw(pd, type_mw);

Binding Memory Windows

After being allocated, a memory window should be bound to a registered memory region. The memory region should have been registered using the IBV_EXP_ACCESS_MW_BIND access flag.

  • Binding Memory Window type1 is done via the ibv_exp_bind_mw verb.


    struct ibv_exp_mw_bind mw_bind = { .comp_mask = IBV_EXP_BIND_MW_RESERVED - 1 };
    ret = ibv_exp_bind_mw(qp, mw, &mw_bind);

  • Binding memory window type2B is done via the ibv_exp_post_send verb and a specific Work Request (WR) with opcode = IBV_EXP_WR_BIND_MW. Prior to binding, make sure to update the existing rkey.
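The rkey update required before a type 2B bind can be sketched as follows. This minimal example assumes the convention used by the ibv_inc_rkey helper in upstream libibverbs, where the low 8 bits of the rkey form a consumer-owned tag that is advanced by one and the upper 24 bits stay fixed. Whether the experimental bind path expects exactly this scheme is an assumption, and next_rkey is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Returns `rkey` with its low 8-bit tag advanced by one, wrapping from
 * 0xff back to 0x00. The upper 24 bits are left unchanged. */
static uint32_t next_rkey(uint32_t rkey)
{
    const uint32_t tag_mask = 0x000000ff;
    uint32_t new_tag = (rkey + 1) & tag_mask;
    uint32_t updated = (rkey & ~tag_mask) | new_tag;

    /* The index portion of the key must not change. */
    assert((updated & ~tag_mask) == (rkey & ~tag_mask));
    return updated;
}
```

Because only the tag byte changes, a remote peer holding the previous rkey can no longer use it after the window is rebound with the new key.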



Invalidating Memory Window

Before rebinding Memory Window type 2, it must be invalidated using the ibv_exp_post_send verb and a specific WR with opcode = IBV_EXP_WR_LOCAL_INV.

Deallocating Memory Window

Deallocating memory window is done using the ibv_dealloc_mw verb.



User-mode Memory Registration (UMR) is a fast registration mode which uses the send queue. UMR support enables the usage of RDMA operations and scatters the data at the remote side through the definition of appropriate memory keys on the remote side.
UMR enables the user to:

  • Create indirect memory keys from previously registered memory regions, including creation of KLMs from previous KLMs. There are no data alignment or length restrictions associated with the memory regions used to define the new KLMs.

  • Create memory regions, which support the definition of regular non-contiguous memory regions.

On-Demand-Paging (ODP) is a technique to alleviate many of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space or track the validity of the mappings. Rather, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations which are no longer valid due to either non-present pages or mapping changes. ODP does not support contiguous pages.
ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.

  • Explicit ODP
    In Explicit ODP, applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin-down the pages. ODP Memory Region (MR) does not need to have valid mappings at registration time.

  • Implicit ODP
    In Implicit ODP, applications are provided with a special memory key that represents their complete address space. Thus, IO accesses referencing this key (subject to the access rights associated with the key) do not require registering any virtual address range.

For further information about ODP, refer to Understand On Demand Paging (ODP) Community post.

Query Capabilities

On-Demand Paging is available if both the hardware and the kernel support it. To verify whether ODP is supported, run ibv_exp_query_device:


struct ibv_exp_device_attr dattr;
dattr.comp_mask = IBV_EXP_DEVICE_ATTR_ODP | IBV_EXP_DEVICE_ATTR_EXP_CAP_FLAGS;
ret = ibv_exp_query_device(context, &dattr);
if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP) {
    /* On-Demand Paging is supported. */
}

Each transport has a capability field in the dattr.odp_caps structure that indicates which operations are supported by the ODP MR:


struct ibv_exp_odp_caps {
    uint64_t general_odp_caps;
    struct {
        uint32_t rc_odp_caps;
        uint32_t uc_odp_caps;
        uint32_t ud_odp_caps;
        uint32_t dc_odp_caps;
        uint32_t xrc_odp_caps;
        uint32_t raw_eth_odp_caps;
    } per_transport_caps;
};

To check which operations are supported for a given transport, the capabilities field needs to be masked with one of the following masks:


enum ibv_odp_transport_cap_bits {
    IBV_EXP_ODP_SUPPORT_SEND     = 1 << 0,
    IBV_EXP_ODP_SUPPORT_RECV     = 1 << 1,
    IBV_EXP_ODP_SUPPORT_WRITE    = 1 << 2,
    IBV_EXP_ODP_SUPPORT_READ     = 1 << 3,
    IBV_EXP_ODP_SUPPORT_ATOMIC   = 1 << 4,
    IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,
};

For example, to check whether RC supports Send operation:


if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND) {
    /* RC supports send operations with ODP MR */
}
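When an application needs several operations at once (for example both Send and Receive), the same masking can be wrapped in a small helper. The sketch below is self-contained: the ODP_SUPPORT_* constants are local copies of the mask bits listed above so that it builds without the verbs headers, and odp_supports is a hypothetical helper name rather than part of the verbs API.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the per-transport ODP mask bits listed above. */
enum odp_cap_bits {
    ODP_SUPPORT_SEND     = 1 << 0,
    ODP_SUPPORT_RECV     = 1 << 1,
    ODP_SUPPORT_WRITE    = 1 << 2,
    ODP_SUPPORT_READ     = 1 << 3,
    ODP_SUPPORT_ATOMIC   = 1 << 4,
    ODP_SUPPORT_SRQ_RECV = 1 << 5,
};

/* Returns 1 only if every bit in `required` is set in `caps`. */
static int odp_supports(uint32_t caps, uint32_t required)
{
    assert(required != 0); /* an empty requirement is almost certainly a bug */
    return (caps & required) == required;
}
```

A caller would pass a per-transport field such as rc_odp_caps as caps, and an OR of the operations it intends to use as required.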

For further information, please refer to the ibv_exp_query_device manual page.

Registering ODP Explicit MR

ODP Explicit MR is registered after allocating the necessary resources (e.g., PD, buffer):


struct ibv_exp_reg_mr_in in;
struct ibv_mr *mr;
in.pd = pd;
in.addr = buf;
in.length = size;
in.exp_access = IBV_EXP_ACCESS_ON_DEMAND | … ;
in.comp_mask = 0;
mr = ibv_exp_reg_mr(&in);

Please be aware that the exp_access differs from one operation to the other, but the IBV_EXP_ACCESS_ON_DEMAND is set for all ODP MRs.
For further information, please refer to the ibv_exp_reg_mr manual page.

Registering ODP Implicit MR

Registering an Implicit ODP MR provides you with an implicit lkey that represents the complete address space.
To register an Implicit ODP MR, in addition to the IBV_EXP_ACCESS_ON_DEMAND access flag, use in->addr = 0 and in->length = IBV_EXP_IMPLICIT_MR_SIZE.
For further information, refer to the ibv_exp_reg_mr manual page.

De-registering ODP MR

An ODP MR is deregistered the same way a regular MR is deregistered:

ibv_dereg_mr(mr);

Pre-fetching Verb

The driver can pre-fetch a given range of pages and map them for access from the HCA. The prefetch verb is applicable to ODP MRs only. It is done on a best-effort basis and may silently ignore errors.


struct ibv_exp_prefetch_attr prefetch_attr;
prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;
prefetch_attr.addr = addr;
prefetch_attr.length = length;
prefetch_attr.comp_mask = 0;
ibv_exp_prefetch_mr(mr, &prefetch_attr);

For further information, please refer to the ibv_exp_prefetch_mr manual page.

ODP Statistics

To aid in debugging and performance measurements and tuning, ODP support includes an extensive set of statistics. The statistics are divided into 2 sets: standard statistics and debug statistics. Both sets are maintained on a per-device basis and report the total number of events since the device was registered.
The standard statistics are reported as sysfs entries with the following format:


/sys/class/infiniband_verbs/uverbs[0/1]/
    invalidations_faults_contentions
    num_invalidation_pages
    num_invalidations
    num_page_fault_pages
    num_page_faults
    num_prefetchs_handled
    num_prefetch_pages

Counter Name - Description

  • invalidations_faults_contentions - Number of times that page fault events were dropped or prefetch operations were restarted due to OS page invalidations.

  • num_invalidation_pages - Total number of pages invalidated during all invalidation events.

  • num_invalidations - Number of invalidation events.

  • num_page_fault_pages - Total number of pages faulted in by page fault events.

  • num_page_faults - Number of page fault events.

  • num_prefetchs_handled - Number of prefetch verb calls that were completed successfully.

  • num_prefetch_pages - Total number of pages that were prefetched by the prefetch verb.
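For tooling that samples these counters, a minimal sketch of reading one sysfs entry is shown below. It assumes each entry contains a single unsigned decimal number, which the text above does not state explicitly. read_counter is a hypothetical helper, not part of MLNX_OFED.

```c
#include <assert.h>
#include <stdio.h>

/* Reads one unsigned decimal counter from the file at `path`.
 * Returns 0 and stores the value in *value on success, -1 otherwise.
 * Assumption: the file holds a single decimal number. */
static int read_counter(const char *path, unsigned long long *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int matched = fscanf(f, "%llu", value);
    fclose(f);
    return matched == 1 ? 0 : -1;
}
```

A caller could pass a path such as /sys/class/infiniband_verbs/uverbs0/num_page_faults and compare two samples taken some time apart, since the counters report totals since the device was registered.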

The debug statistics are reported by debugfs entries with the following format:


/sys/kernel/debug/mlx5/<pci-dev-id>/odp_stats/
    num_failed_resolutions
    num_mrs_not_found
    num_odp_mr_pages
    num_odp_mrs

Counter Name - Description

  • num_failed_resolutions - Number of failed page faults that could not be resolved due to non-existing mappings in the OS.

  • num_mrs_not_found - Number of faults that specified a non-existing ODP MR.

  • num_odp_mr_pages - Total size in pages of current ODP MRs.

  • num_odp_mrs - Number of current ODP MRs.

When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list. It therefore improves performance for short receive messages.
On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. Therefore, apart from querying the Inline-Receive capability and activating Inline-Receive, the feature is transparent to the user application.


When Inline-Receive is active, the user application must provide valid virtual addresses for the receive buffers to allow the driver to move the inline-received message to these buffers. The validity of these addresses is not checked, so the result of providing invalid virtual addresses is unpredictable. Because Physical Address Memory Regions use physical addresses, they cannot be used with Inline-Receive.

Connect-IB® supports Inline-Receive on both the requestor and the responder sides. Since data is copied at the poll CQ verb, Inline-Receive on the requestor side is possible only if the user chooses IB(V)_SIGNAL_ALL_WR.

Querying Inline-Receive Capability

A user application can use the ibv_exp_query_device function to get the maximum possible Inline-Receive size. To get the size, the application needs to set the IBV_EXP_DEVICE_ATTR_INLINE_RECV_SZ bit in the ibv_exp_device_attr comp_mask.

Activating Inline-Receive

To activate Inline-Receive, set the required message size in the max_inl_recv field of the ibv_exp_qp_init_attr struct when calling the ibv_exp_create_qp function. On return, the same field holds the actual Inline-Receive size applied.


Setting the message size may affect the WQE/CQE size.


Coherent Accelerator Processor Interface (CAPI) is currently at beta level and subject to changes.

The HCA can leverage the CAPI technology to implement address translation similar to the PCI-SIG PASID extension. Using CAPI, the HCA can use the host page tables rather than the Mellanox specific address translation mechanisms. The feature is similar in nature and user facing value to the on-demand paging feature, which is host independent. However, CAPI can be more efficient in some scenarios due to page tables sharing. The sharing removes the need to maintain NIC specific translation tables and allows the HCA to access pages that the software touched on the host CPU, without triggering another page fault upon HCA access.
CAPI is supported on Power 9 servers, with an IBM branded ConnectX-5 device and is turned off by default.

To enable CAPI:

  1. Set mlxconfig's ADVANCED_PCI_SETTINGS and IBM_CAPI_EN to true. For example:


    mlxconfig -d /dev/mst/mt4121_pciconf0 set ADVANCED_PCI_SETTINGS=true
    mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_CAPI_EN=true

  2. Power cycle the machine.

To disable CAPI:

  1. Set mlxconfig's IBM_CAPI_EN to false. For example:


    mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_CAPI_EN=false 

  2. Power cycle the machine.

To use CAPI, set the On-Demand-Paging flag when registering a memory region. This means that if CAPI is enabled in the system, it is used instead of the regular On-Demand-Paging support.

This feature updates RDMA Write Operations so that when an RDMA Write operation is issued, the payload indicates which atomic operation to perform, instead of being written to the Memory Region (MR).

To verify that this feature is enabled on your device:


ibv_devinfo -v
...
Tunneled atomic: SUPPORT

  • To register an MR with Tunneled Atomic, set mask IBV_EXP_ACCESS_TUNNELED_ATOMIC when calling ibv_exp_reg_mr from remote side. Normally, mask IBV_EXP_ACCESS_REMOTE_WRITE should be set as well to allow the client to access the memory remotely.

  • Client should configure the remote_addr and rkey in Send Work Request (WR) according to the Memory Region created from the server side. The Send buffer should encapsulate tunneled atomic commands. Afterwards, calling ibv_post_send with IBV_WR_RDMA_WRITE will trigger the atomic operation in the server side.

To enable Tunneled Atomic:

  1. Run:


    mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_TUNNELED_ATOMIC_EN=1 

  2. Power-cycle the machine.

To disable Tunneled Atomic:

  1. Run:


    mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_TUNNELED_ATOMIC_EN=0 

  2. Power-cycle the machine.


This feature is supported on ConnectX-5 adapter cards family only.

AS Notify is a low-latency hardware-based thread wakeup mechanism. Instead of actively polling a Completion Queue (CQ), the user application can arm the CQ and issue a "wait" instruction to put the user thread to sleep. Sleep mode benefits are the following:

  • Saving power

  • Freeing up hardware resources for other threads on the same core in the case of simultaneous multi-threading

The user application is woken up by an AS_notify interrupt once a completion event takes place. Note that when the AS_notify interrupt cannot be triggered, the firmware falls back to the traditional MSI interrupt.
AS_notify is supported on IBM Power 9 servers, and is turned off by default.


To enable AS_notify feature in firmware:

  1. Set the IBM_AS_NOTIFY_EN flag to "true":


    mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_AS_NOTIFY_EN=true

  2. Power-cycle the machine.

To be able to use the AS_notify interrupt, create a completion queue with AS_notify enabled (with mask IBV_EXP_CQ_AS_NOTIFY).

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.