On-Demand Paging (ODP) is a technique that alleviates many of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space or track the validity of the mappings. Instead, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations that are no longer valid because of non-present pages or mapping changes. ODP does not support contiguous pages.

ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.

Explicit ODP

In Explicit ODP, applications still register memory buffers for communication, but this operation is used to define access control for IO rather than to pin down the pages. An ODP Memory Region (MR) does not need to have valid mappings at registration time.

Implicit ODP

In Implicit ODP, applications are provided with a special memory key that represents their complete address space. As a result, IO accesses referencing this key (subject to the access rights associated with the key) do not require registering any virtual address range.

For further information about ODP, refer to the Understand On Demand Paging (ODP) Community post.

On-Demand Paging is available if both the hardware and the kernel support it. To verify whether ODP is supported, call ibv_exp_query_device:

    struct ibv_exp_device_attr dattr;

    dattr.comp_mask = IBV_EXP_DEVICE_ATTR_ODP | IBV_EXP_DEVICE_ATTR_EXP_CAP_FLAGS;
    ret = ibv_exp_query_device(context, &dattr);
    if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP) {
        /* ODP is supported */
    }

Each transport has a capability field in the dattr.odp_caps structure that indicates which operations are supported by the ODP MR:

    struct ibv_exp_odp_caps {
        uint64_t general_odp_caps;
        struct {
            uint32_t rc_odp_caps;
            uint32_t uc_odp_caps;
            uint32_t ud_odp_caps;
            uint32_t dc_odp_caps;
            uint32_t xrc_odp_caps;
            uint32_t raw_eth_odp_caps;
        } per_transport_caps;
    };

To check which operations are supported for a given transport, mask the capabilities field with one of the following bits:

    enum ibv_odp_transport_cap_bits {
        IBV_EXP_ODP_SUPPORT_SEND     = 1 << 0,
        IBV_EXP_ODP_SUPPORT_RECV     = 1 << 1,
        IBV_EXP_ODP_SUPPORT_WRITE    = 1 << 2,
        IBV_EXP_ODP_SUPPORT_READ     = 1 << 3,
        IBV_EXP_ODP_SUPPORT_ATOMIC   = 1 << 4,
        IBV_EXP_ODP_SUPPORT_SRQ_RECV = 1 << 5,
    };

For example, to check whether RC supports the Send operation:

    if (dattr.odp_caps.per_transport_caps.rc_odp_caps & IBV_EXP_ODP_SUPPORT_SEND) {
        /* RC supports Send on ODP MRs */
    }
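When debugging, it can help to see every operation a transport supports at once rather than testing one bit at a time. The following is a minimal sketch of such a decoder, not part of the verbs API. The ODP_CAP_* constants are local stand-ins that mirror the bit positions listed above so that the sketch compiles without the verbs headers, and odp_caps_describe is a hypothetical helper name; in real code, use the IBV_EXP_ODP_SUPPORT_* values from the header instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins (assumption): same bit positions as the
 * IBV_EXP_ODP_SUPPORT_* values listed in this section. */
enum odp_cap_bit {
    ODP_CAP_SEND     = 1 << 0,
    ODP_CAP_RECV     = 1 << 1,
    ODP_CAP_WRITE    = 1 << 2,
    ODP_CAP_READ     = 1 << 3,
    ODP_CAP_ATOMIC   = 1 << 4,
    ODP_CAP_SRQ_RECV = 1 << 5,
};

struct odp_cap_name {
    uint32_t bit;
    const char *name;
};

static const struct odp_cap_name odp_cap_names[] = {
    { ODP_CAP_SEND,     "SEND" },
    { ODP_CAP_RECV,     "RECV" },
    { ODP_CAP_WRITE,    "WRITE" },
    { ODP_CAP_READ,     "READ" },
    { ODP_CAP_ATOMIC,   "ATOMIC" },
    { ODP_CAP_SRQ_RECV, "SRQ_RECV" },
};

/* Hypothetical helper: writes a space-separated list of the operations
 * whose bits are set in caps into buf (always NUL-terminated when len > 0).
 * Returns the number of operation names written; stops early if buf is
 * too small to hold the next name. */
static int odp_caps_describe(uint32_t caps, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(odp_cap_names) / sizeof(odp_cap_names[0]); i++) {
        int n;

        if (!(caps & odp_cap_names[i].bit))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", odp_cap_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Passing a per-transport field such as dattr.odp_caps.per_transport_caps.rc_odp_caps as caps would fill buf with the names of the operations that RC supports on ODP MRs, for example "SEND RECV" if only those two bits are set.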

For further information, please refer to the ibv_exp_query_device manual page.

An Explicit ODP MR is registered after allocating the necessary resources (e.g., PD, buffer):

    struct ibv_exp_reg_mr_in in;
    struct ibv_mr *mr;

    in.pd = pd;
    in.addr = buf;
    in.length = size;
    in.exp_access = IBV_EXP_ACCESS_ON_DEMAND | … ;
    in.comp_mask = 0;
    mr = ibv_exp_reg_mr(&in);

Note that the exp_access value differs from one operation to another, but IBV_EXP_ACCESS_ON_DEMAND is set for all ODP MRs.

For further information, please refer to the ibv_exp_reg_mr manual page.

Registering an Implicit ODP MR provides you with an implicit lkey that represents the complete address space.

To register an Implicit ODP MR, in addition to setting the IBV_EXP_ACCESS_ON_DEMAND access flag, set in.addr = 0 and in.length = IBV_EXP_IMPLICIT_MR_SIZE.

For further information, refer to the ibv_exp_reg_mr manual page.

An ODP MR is deregistered the same way a regular MR is deregistered:

    ibv_dereg_mr(mr);





The driver can prefetch a given range of pages and map them for access from the HCA. The prefetch verb is applicable to ODP MRs only. It is performed on a best-effort basis and may silently ignore errors.

Example:

    struct ibv_exp_prefetch_attr prefetch_attr;

    prefetch_attr.flags = IBV_EXP_PREFETCH_WRITE_ACCESS;
    prefetch_attr.addr = addr;
    prefetch_attr.length = length;
    prefetch_attr.comp_mask = 0;
    ibv_exp_prefetch_mr(mr, &prefetch_attr);

For further information, please refer to the ibv_exp_prefetch_mr manual page.

To aid in debugging and performance measurements and tuning, ODP support includes an extensive set of statistics. The statistics are divided into 2 sets: standard statistics and debug statistics. Both sets are maintained on a per-device basis and report the total number of events since the device was registered.

The standard statistics are reported as sysfs entries with the following format:

    /sys/class/infiniband_verbs/uverbs[0/1]/
        invalidations_faults_contentions
        num_invalidation_pages
        num_invalidations
        num_page_fault_pages
        num_page_faults
        num_prefetches_handled
        num_prefetch_pages

Counter Name                      Description
invalidations_faults_contentions  Number of times that page fault events were dropped or prefetch operations were restarted due to OS page invalidations.
num_invalidation_pages            Total number of pages invalidated during all invalidation events.
num_invalidations                 Number of invalidation events.
num_page_fault_pages              Total number of pages faulted in by page fault events.
num_page_faults                   Number of page fault events.
num_prefetches_handled            Number of prefetch verb calls that were completed successfully.
num_prefetch_pages                Total number of pages that were prefetched by the prefetch verb.
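The counters above can be sampled from a program as well as from the shell. The sketch below shows one possible way to read a counter and derive an average number of pages per event. It assumes that each sysfs entry contains a single unsigned decimal value, which this section does not state explicitly; odp_read_counter and odp_pages_per_event are hypothetical helper names, not part of any library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads one unsigned decimal counter from <dir>/<name>.
 * Assumes the file holds a single number (an assumption about the entry
 * format). Returns 0 on success, -1 if the path is too long or the file
 * cannot be opened or parsed. */
static int odp_read_counter(const char *dir, const char *name,
                            unsigned long long *value)
{
    char path[512];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;

    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Average number of pages handled per event, for example
 * num_page_fault_pages / num_page_faults. Returns 0.0 when no events
 * have been recorded, to avoid dividing by zero. */
static double odp_pages_per_event(unsigned long long pages,
                                  unsigned long long events)
{
    return events ? (double)pages / (double)events : 0.0;
}
```

For instance, calling odp_read_counter with the directory /sys/class/infiniband_verbs/uverbs0 and the names num_page_fault_pages and num_page_faults, then passing both values to odp_pages_per_event, gives a rough indication of how many pages each page fault event brings in. Because the counters are cumulative, sampling them twice and subtracting gives the activity over an interval.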

The debug statistics are reported by debugfs entries with the following format:

    /sys/kernel/debug/mlx5/<pci-dev-id>/odp_stats/
        num_failed_resolutions
        num_mrs_not_found
        num_odp_mr_pages
        num_odp_mrs