Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers on contiguous pages.
To activate this, set the environment variables below to PREFER_CONTIG or CONTIG:
- For QP: MLX_QP_ALLOC_TYPE
- For CQ: MLX_CQ_ALLOC_TYPE
The following are all the possible values that can be assigned to these variables:
Possible Value | Description |
---|---|
ANON | Use the current pages (ANON small ones). |
HUGE | Force huge pages. |
CONTIG | Force contiguous pages. |
PREFER_CONTIG | Try contiguous pages; fall back to ANON small pages. (Default) |
PREFER_HUGE | Try huge pages; fall back to ANON small pages. |
ALL | Try huge pages; fall back to contiguous pages and, if that fails, to ANON small pages. |
Note that values are NOT case sensitive.
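The case-insensitive matching of the values above can be illustrated with a small parser. This is only a sketch: the enum, the function names, and the choice to treat an unset or unknown value as invalid are assumptions made for illustration and are not taken from the driver.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical enum mirroring the table of allocation types. */
enum demo_alloc_type {
    DEMO_ALLOC_ANON,
    DEMO_ALLOC_HUGE,
    DEMO_ALLOC_CONTIG,
    DEMO_ALLOC_PREFER_CONTIG,
    DEMO_ALLOC_PREFER_HUGE,
    DEMO_ALLOC_ALL,
    DEMO_ALLOC_INVALID
};

/* Compare two strings ignoring ASCII case; returns 1 when they are equal. */
int demo_equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Map a value such as "prefer_contig" to the enum.
 * NULL (variable not set) and unknown strings yield DEMO_ALLOC_INVALID. */
enum demo_alloc_type demo_parse_alloc_type(const char *value)
{
    static const struct {
        const char *name;
        enum demo_alloc_type type;
    } table[] = {
        { "ANON",          DEMO_ALLOC_ANON },
        { "HUGE",          DEMO_ALLOC_HUGE },
        { "CONTIG",        DEMO_ALLOC_CONTIG },
        { "PREFER_CONTIG", DEMO_ALLOC_PREFER_CONTIG },
        { "PREFER_HUGE",   DEMO_ALLOC_PREFER_HUGE },
        { "ALL",           DEMO_ALLOC_ALL },
    };
    size_t i;

    if (value == NULL)
        return DEMO_ALLOC_INVALID;
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (demo_equal_nocase(value, table[i].name))
            return table[i].type;
    }
    return DEMO_ALLOC_INVALID;
}
```

An application could call demo_parse_alloc_type(getenv("MLX_QP_ALLOC_TYPE")) to check its own environment before creating a QP.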
Usage:
The application calls the ibv_exp_reg_mr API with the IBV_EXP_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the address field of the struct ibv_mr holds the address of the allocated memory block. This block is freed implicitly when ibv_dereg_mr() is called.
The following are environment variables that can be used to control error cases/contiguity:
Parameters | Description |
---|---|
| Configures the allocator type. |
| Sets the maximum contiguous block size order. |
| Sets the minimum contiguous block size order. |
Memory Region Re-registration
Memory Region Re-registration allows the user to change attributes of the memory region. The user may change the PD, the access flags, or the address and length of the memory region. Memory Region Re-registration supports contiguous pages allocation. Conceptually, it performs a de-register memory region followed by a register memory region. Where possible, resources are reused instead of de-allocated and reallocated.
Please note that the verb is implemented as an experimental verb.
Example:
int ibv_exp_rereg_mr(struct ibv_mr *mr, int flags, struct ibv_pd *pd, void *addr, size_t length, uint64_t access, struct ibv_exp_rereg_mr_attr *attr);
@mr: | The memory region to modify. |
@flags: | A bit-mask used to indicate which of the following properties of the memory region are being modified. Flags should be a combination of: IBV_EXP_REREG_MR_CHANGE_TRANSLATION (change translation: location and length), IBV_EXP_REREG_MR_CHANGE_PD (change protection domain), IBV_EXP_REREG_MR_CHANGE_ACCESS (change access flags). |
@pd: | If IBV_EXP_REREG_MR_CHANGE_PD is set in flags, this field specifies the new protection domain to associate with the memory region, otherwise, this parameter is ignored. |
@addr: | If IBV_EXP_REREG_MR_CHANGE_TRANSLATION is set in flags, this field specifies the start of the virtual address to use in the new translation, otherwise, this parameter is ignored. |
@length: | If IBV_EXP_REREG_MR_CHANGE_TRANSLATION is set in flags, this field specifies the length of the virtual address to use in the new translation, otherwise, this parameter is ignored. |
@access: | If IBV_EXP_REREG_MR_CHANGE_ACCESS is set in flags, this field specifies the new memory access rights, otherwise, this parameter is ignored. Could be one of the following: |
@attr: | Future extensions |
ibv_exp_rereg_mr returns 0 on success, or the value of an errno on failure (which indicates the error reason). In case of an error, the MR is in an undefined state, and the user needs to call ibv_dereg_mr in order to release it.
Please note that if the MR (Memory Region) is created as a Shared MR and a translation is requested, the MR is no longer a shared MR after the call. Moreover, re-registration of MRs that use Mellanox PeerDirect™ technology is not supported.
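The relationship between the flags bit-mask and the parameters it activates can be sketched as a small pre-check. This is a hypothetical helper, not part of the verbs library: the DEMO_ bit values are placeholders (real code should use the IBV_EXP_REREG_MR_* constants from the verbs header), and the rules that an empty mask, a NULL pd, or a zero length are rejected are assumptions made for illustration.

```c
#include <stddef.h>

/* Placeholder bit values for illustration only. */
enum demo_rereg_flags {
    DEMO_REREG_CHANGE_TRANSLATION = 1 << 0,
    DEMO_REREG_CHANGE_PD          = 1 << 1,
    DEMO_REREG_CHANGE_ACCESS      = 1 << 2
};

/* Returns 1 when the arguments look consistent with the flags, 0 otherwise.
 * Parameters whose flag is not set are ignored, as described above. */
int demo_rereg_args_ok(int flags, const void *pd, size_t length)
{
    const int known = DEMO_REREG_CHANGE_TRANSLATION |
                      DEMO_REREG_CHANGE_PD |
                      DEMO_REREG_CHANGE_ACCESS;

    /* Assumption: at least one known bit and no unknown bits. */
    if (flags == 0 || (flags & ~known))
        return 0;
    /* pd is only consulted when CHANGE_PD is requested. */
    if ((flags & DEMO_REREG_CHANGE_PD) && pd == NULL)
        return 0;
    /* length is only consulted when CHANGE_TRANSLATION is requested. */
    if ((flags & DEMO_REREG_CHANGE_TRANSLATION) && length == 0)
        return 0;
    return 1;
}
```

A check of this kind is useful before the call because a failed re-registration leaves the MR in an undefined state.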
Memory Window
Memory Window allows the application to have more flexible control over remote access to its memory. It is available only on physical functions / native machines. The two types of Memory Windows supported are: type 1 and type 2B.
Memory Windows are intended for situations where the application wants to:
- grant and revoke remote access rights to a registered region in a dynamic fashion with less of a performance penalty
- grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered region
For further information, please refer to the InfiniBand specification document.
The Memory Windows API cannot be used together with peer memory clients (Mellanox PeerDirect™).
Query Capabilities
Memory Windows are available only if the hardware supports them. To verify whether Memory Windows are available, run ibv_exp_query_device.
For example:
struct ibv_exp_device_attr device_attr = { .comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1 };
ibv_exp_query_device(context, &device_attr);
if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW ||
    device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MW_TYPE_2B) {
    /* Memory window is supported */
}
Allocating Memory Window
Allocating a memory window is done by calling the ibv_alloc_mw verb.
type_mw = IBV_MW_TYPE_2; /* or IBV_MW_TYPE_1 */
mw = ibv_alloc_mw(pd, type_mw);
Binding Memory Windows
After it is allocated, a memory window should be bound to a registered memory region. The memory region should have been registered using the IBV_EXP_ACCESS_MW_BIND access flag.
Binding Memory Window type 1 is done via the ibv_exp_bind_mw verb.
struct ibv_exp_mw_bind mw_bind = { .comp_mask = IBV_EXP_BIND_MW_RESERVED - 1 };
ret = ibv_exp_bind_mw(qp, mw, &mw_bind);
Binding Memory Window type 2B is done via the ibv_exp_post_send verb and a specific Work Request (WR) with opcode = IBV_EXP_WR_BIND_MW.
Prior to binding, make sure to update the existing rkey:
ibv_inc_rkey(mw->rkey)
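To make the rkey update concrete, the sketch below shows what incrementing an rkey amounts to, assuming the behavior of ibv_inc_rkey in upstream libibverbs: the low 8 bits of the key act as a tag that is advanced, while the upper 24 bits stay unchanged. The function name demo_inc_rkey is hypothetical and stands in for the library call so that the example builds without the verbs headers.

```c
#include <stdint.h>

/* Advance the 8-bit tag held in the low byte of an rkey, leaving the upper
 * 24 bits untouched. The tag wraps around from 0xff to 0x00. */
uint32_t demo_inc_rkey(uint32_t rkey)
{
    const uint32_t tag_mask = 0x000000ffu;
    uint32_t new_tag = (rkey + 1u) & tag_mask;

    return (rkey & ~tag_mask) | new_tag;
}
```

Under this assumption, the function returns the new key rather than modifying its argument, so the caller has to use the returned value in the bind Work Request.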
Invalidating Memory Window
Before rebinding Memory Window type 2, it must be invalidated using the ibv_exp_post_send verb and a specific WR with opcode = IBV_EXP_WR_LOCAL_INV.
Deallocating Memory Window
Deallocating memory window is done using the ibv_dealloc_mw verb.
ibv_dealloc_mw(mw);
User-Mode Memory Registration (UMR)
User-mode Memory Registration (UMR) is a fast registration mode which uses the send queue. UMR support enables the use of RDMA operations that scatter the data at the remote side through the definition of appropriate memory keys on the remote side.
UMR enables the user to:
- Create indirect memory keys from previously registered memory regions, including creation of KLMs from previous KLMs. There are no data alignment or length restrictions associated with the memory regions used to define the new KLMs.
- Create memory regions, which support the definition of regular non-contiguous memory regions.
On-Demand-Paging (ODP)
On-Demand-Paging (ODP) is a technique to alleviate many of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space and track the validity of the mappings. Rather, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations which are no longer valid due to either non-present pages or mapping changes. ODP does not support contiguous pages.
ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.
- Explicit ODP
In Explicit ODP, applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin-down the pages. ODP Memory Region (MR) does not need to have valid mappings at registration time.
- Implicit ODP
In Implicit ODP, applications are provided with a special memory key that represents their complete address space. Thus, IO accesses referencing this key (subject to the access rights associated with the key) do not require registering any virtual address range.
For further information about ODP, refer to the Understand On Demand Paging (ODP) Community post.
Query Capabilities
On-Demand Paging is available if both the hardware and the kernel support it. To verify whether ODP is supported, run ibv_exp_query_device:
struct ibv_exp_device_attr dattr;
dattr.comp_mask = IBV_EXP_DEVICE_ATTR_ODP | IBV_EXP_DEVICE_ATTR_EXP_CAP_FLAGS;
ret = ibv_exp_query_device(context, &dattr);
if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP) {
    /* On-Demand Paging is supported. */
}
Each transport has a capability field in the dattr.odp_caps structure that indicates which operations are supported by the ODP MR:
struct ibv_exp_odp_caps {
    uint64_t general_odp_caps;
    struct {
        uint32_t rc_odp_caps;
        uint32_t uc_odp_caps;
        uint32_t ud_odp_caps;
        uint32_t dc_odp_caps;
        uint32_t xrc_odp_caps;
        uint32_t raw_eth_odp_caps;
    } per_transport_caps;
};
To check which operations are supported for a given transport, the capabilities field needs to be masked with one of the following masks:
enum ibv_odp_transport_cap_bits {
    IBV_EXP_ODP_SUPPORT_SEND     = 1 << 0,
    IBV_EXP_ODP_SUPPORT_RECV     = 1 << 1,
    IBV_EXP_ODP_SUPPORT_WRITE    = 1 << 2,
    IBV_EXP_ODP_SUPPORT_READ     = 1 << 3,
    IBV_EXP_ODP_SUPPORT_ATOMIC   = 1 << 4,
    IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,
};
For example, to check whether RC supports Send operation:
if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND) {
    /* RC supports send operations with ODP MR */
}
For further information, please refer to the ibv_exp_query_device manual page.
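The masking described above can be tried without any hardware by decoding a capability word into operation names. The sketch below uses a local mirror of the mask values listed in this section (prefixed DEMO_ so it builds without the verbs headers); the function name and the output format are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local mirror of the transport capability bits listed above. */
enum demo_odp_cap_bits {
    DEMO_ODP_SUPPORT_SEND     = 1 << 0,
    DEMO_ODP_SUPPORT_RECV     = 1 << 1,
    DEMO_ODP_SUPPORT_WRITE    = 1 << 2,
    DEMO_ODP_SUPPORT_READ     = 1 << 3,
    DEMO_ODP_SUPPORT_ATOMIC   = 1 << 4,
    DEMO_ODP_SUPPORT_SRQ_RECV = 1 << 5
};

/* Write a comma-separated list of the operations set in caps into out.
 * Returns the number of operation names written. */
int demo_odp_caps_to_string(uint32_t caps, char *out, size_t out_len)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } ops[] = {
        { DEMO_ODP_SUPPORT_SEND,     "SEND" },
        { DEMO_ODP_SUPPORT_RECV,     "RECV" },
        { DEMO_ODP_SUPPORT_WRITE,    "WRITE" },
        { DEMO_ODP_SUPPORT_READ,     "READ" },
        { DEMO_ODP_SUPPORT_ATOMIC,   "ATOMIC" },
        { DEMO_ODP_SUPPORT_SRQ_RECV, "SRQ_RECV" },
    };
    size_t used = 0;
    size_t i;
    int count = 0;

    if (out == NULL || out_len == 0)
        return 0;
    out[0] = '\0';
    for (i = 0; i < sizeof(ops) / sizeof(ops[0]); i++) {
        int n;

        if (!(caps & ops[i].bit))
            continue;
        n = snprintf(out + used, out_len - used, "%s%s",
                     count ? "," : "", ops[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            out[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With a real device, the caps argument would be one of the per_transport_caps fields, for example dattr.odp_caps.per_transport_caps.rc_odp_caps.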
Registering ODP Explicit MR
ODP Explicit MR is registered after allocating the necessary resources (e.g., PD, buffer):
struct ibv_exp_reg_mr_in in;
struct ibv_mr *mr;
in.pd = pd;
in.addr = buf;
in.length = size;
in.exp_access = IBV_EXP_ACCESS_ON_DEMAND| … ;
in.comp_mask = 0;
mr = ibv_exp_reg_mr(&in);
Please be aware that the exp_access differs from one operation to the other, but the IBV_EXP_ACCESS_ON_DEMAND is set for all ODP MRs.
For further information, please refer to the ibv_exp_reg_mr manual page.
Registering ODP Implicit MR
Registering an Implicit ODP MR provides you with an implicit lkey that represents the complete address space.
To register an Implicit ODP MR, in addition to the IBV_EXP_ACCESS_ON_DEMAND access flag, use in->addr = 0 and in->length = IBV_EXP_IMPLICIT_MR_SIZE.
For further information, refer to the ibv_exp_reg_mr manual page.
De-registering ODP MR
ODP MR is deregistered the same way a regular MR is deregistered:
ibv_dereg_mr(mr);
Pre-fetching Verb
The driver can pre-fetch a given range of pages and map them for access from the HCA. The pre-fetch verb is applicable to ODP MRs only. It is done on a best effort basis and may silently ignore errors.
Example:
struct ibv_exp_prefetch_attr prefetch_attr;
prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;
prefetch_attr.addr = addr;
prefetch_attr.length = length;
prefetch_attr.comp_mask = 0;
ibv_exp_prefetch_mr(mr, &prefetch_attr);
For further information, please refer to the ibv_exp_prefetch_mr manual page.
ODP Statistics
To aid in debugging and performance measurements and tuning, ODP support includes an extensive set of statistics. The statistics are divided into 2 sets: standard statistics and debug statistics. Both sets are maintained on a per-device basis and report the total number of events since the device was registered.
The standard statistics are reported as sysfs entries with the following format:
/sys/class/infiniband_verbs/uverbs[0/1]/
    invalidations_faults_contentions
    num_invalidation_pages
    num_invalidations
    num_page_fault_pages
    num_page_faults
    num_prefetches_handled
    num_prefetch_pages
Counter Name | Description |
---|---|
invalidations_faults_contentions | Number of times that page fault events were dropped or prefetch operations were restarted due to OS page invalidations. |
num_invalidation_pages | Total number of pages invalidated during all invalidation events. |
num_invalidations | Number of invalidation events. |
num_page_fault_pages | Total number of pages faulted in by page fault events. |
num_page_faults | Number of page fault events. |
num_prefetches_handled | Number of prefetch verb calls that were completed successfully. |
num_prefetch_pages | Total number of pages that were prefetched by the prefetch verb. |
The debug statistics are reported by debugfs entries with the following format:
/sys/kernel/debug/mlx5/<pci-dev-id>/odp_stats/
    num_failed_resolutions
    num_mrs_not_found
    num_odp_mr_pages
    num_odp_mrs
Counter Name | Description |
---|---|
num_failed_resolutions | Number of failed page faults that could not be resolved due to non-existing mappings in the OS. |
num_mrs_not_found | Number of faults that specified a non-existing ODP MR. |
num_odp_mr_pages | Total size in pages of current ODP MRs. |
num_odp_mrs | Number of current ODP MRs. |
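Both sets of statistics are plain files, so they can be sampled from a program. The sketch below reads one counter from a directory such as /sys/class/infiniband_verbs/uverbs0; it assumes that each entry holds a single unsigned decimal number, and the function name demo_read_counter is hypothetical.

```c
#include <stdio.h>

/* Read one unsigned counter from "<dir>/<name>".
 * Returns 0 and stores the value in *value on success, -1 on failure. */
int demo_read_counter(const char *dir, const char *name,
                      unsigned long long *value)
{
    char path[512];
    FILE *f;
    int n;
    int ok;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* path does not fit in the buffer */

    f = fopen(path, "r");
    if (f == NULL)
        return -1; /* entry missing or not readable */

    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Because the counters report totals since the device was registered, sampling a counter such as num_page_faults before and after a workload and subtracting the two values gives the number of events caused by that workload.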
Inline-Receive
When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list; it therefore improves performance for short receive messages.
On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. Therefore, apart from querying the Inline-Receive capability and activating Inline-Receive, the feature is transparent to the user application.
When Inline-Receive is active, the user application must provide a valid virtual address for the receive buffers to allow the driver to move the inline-received message to these buffers. The validity of these addresses is not checked; therefore, the result of providing invalid virtual addresses is unpredictable. This also means that, since Physical Address Memory Regions use physical addresses, they cannot be used with Inline-Receive.
Connect-IB® supports Inline-Receive on both the requestor and the responder sides. Since data is copied at the poll CQ verb, Inline-Receive on the requestor side is possible only if the user chooses IB(V)_SIGNAL_ALL_WR.
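The copy performed at poll CQ time can be pictured as a scatter of the inline payload over the receive buffers. The sketch below is a conceptual illustration only, not the driver's code: struct demo_sge and demo_scatter_inline are hypothetical names, and the inline payload is modeled as an ordinary byte array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatter entry of a receive work request. */
struct demo_sge {
    void  *addr;   /* virtual address of the user buffer */
    size_t length; /* capacity of the buffer in bytes */
};

/* Copy len bytes of inline data into the scatter list, filling the entries
 * in order. Returns the number of bytes copied, which is less than len when
 * the list is too small to hold the message. */
size_t demo_scatter_inline(const void *inline_data, size_t len,
                           const struct demo_sge *sge, size_t num_sge)
{
    const unsigned char *src = inline_data;
    size_t copied = 0;
    size_t i;

    for (i = 0; i < num_sge && copied < len; i++) {
        size_t chunk = len - copied;

        if (chunk > sge[i].length)
            chunk = sge[i].length;
        /* The destination must be a valid virtual address: it is written
         * by the CPU here, not by the HCA. */
        memcpy(sge[i].addr, src + copied, chunk);
        copied += chunk;
    }
    return copied;
}
```

The memcpy call is the reason the receive buffers need valid virtual addresses: a buffer described only by a physical address cannot be the target of a CPU copy in user space.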
Querying Inline-Receive Capability
The user application can use the ibv_exp_query_device function to get the maximum possible Inline-Receive size. To get the size, the application needs to set the IBV_EXP_DEVICE_ATTR_INLINE_RECV_SZ bit in the ibv_exp_device_attr comp_mask.
Activating Inline-Receive
To activate Inline-Receive, set the required message size in the max_inl_recv field of the ibv_exp_qp_init_attr struct when calling the ibv_exp_create_qp function. The value returned in the same field is the actual Inline-Receive size applied.
Setting the message size may affect the WQE/CQE size.
Coherent Accelerator Processor Interface (CAPI)
Coherent Accelerator Processor Interface (CAPI) is currently at beta level and subject to changes.
The HCA can leverage the CAPI technology to implement address translation similar to the PCI-SIG PASID extension. Using CAPI, the HCA can use the host page tables rather than the Mellanox specific address translation mechanisms. The feature is similar in nature and user facing value to the on-demand paging feature, which is host independent. However, CAPI can be more efficient in some scenarios due to page tables sharing. The sharing removes the need to maintain NIC specific translation tables and allows the HCA to access pages that the software touched on the host CPU, without triggering another page fault upon HCA access.
CAPI is supported on Power 9 servers, with an IBM branded ConnectX-5 device and is turned off by default.
To enable CAPI:
- Set mlxconfig's ADVANCED_PCI_SETTINGS and IBM_CAPI_EN to true. For example:
mlxconfig -d /dev/mst/mt4121_pciconf0 set ADVANCED_PCI_SETTINGS=true
mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_CAPI_EN=true
- Power cycle the machine.
To disable CAPI:
- Set mlxconfig's IBM_CAPI_EN to false. For example:
mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_CAPI_EN=false
- Power cycle the machine.
To use CAPI, set the On-Demand-Paging flag when registering a memory region. This means that if CAPI is enabled in the system, it is used instead of the regular On-Demand-Paging support.
Tunneled Atomic
This feature updates RDMA Write Operations so that when an RDMA Write operation is issued, the payload indicates which atomic operation to perform, instead of being written to the Memory Region (MR).
To verify that this feature is enabled on your device:
ibv_devinfo -v
...
Tunneled atomic: SUPPORT
- To register an MR with Tunneled Atomic, set the mask IBV_EXP_ACCESS_TUNNELED_ATOMIC when calling ibv_exp_reg_mr from the remote side. Normally, the mask IBV_EXP_ACCESS_REMOTE_WRITE should be set as well to allow the client to access the memory remotely.
- The client should configure the remote_addr and rkey in the Send Work Request (WR) according to the Memory Region created on the server side. The send buffer should encapsulate the tunneled atomic commands. Afterwards, calling ibv_post_send with IBV_WR_RDMA_WRITE triggers the atomic operation on the server side.
To enable Tunneled Atomic:
- Run:
mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_TUNNELED_ATOMIC_EN=1
- Power-cycle the machine.
To disable Tunneled Atomic:
- Run:
mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_TUNNELED_ATOMIC_EN=0
- Power-cycle the machine.
AS Notify
This feature is supported on ConnectX-5 adapter cards family only.
AS Notify is a low-latency hardware-based thread wakeup mechanism. Instead of actively polling a Completion Queue (CQ), the user application can arm the CQ and issue a "wait" instruction to put the user thread to sleep. Sleep mode benefits are the following:
- Saving power
- Freeing up hardware resources for other threads on the same core in the case of simultaneous multi-threading
The user application is woken up by an AS_notify interrupt once a completion event takes place. Note that when the AS_notify interrupt cannot be triggered, the firmware falls back to the traditional MSI interrupt.
AS_notify is supported on IBM Power 9 servers, and is turned off by default.
Usage:
To enable the AS_notify feature in firmware:
- Set the IBM_AS_NOTIFY_EN flag to "true":
mlxconfig -d /dev/mst/mt4121_pciconf0 set IBM_AS_NOTIFY_EN=true
- Power-cycle the machine.
To be able to use the AS_notify interrupt, create a completion queue with AS_notify enabled (with mask IBV_EXP_CQ_AS_NOTIFY).