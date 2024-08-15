DOCA DevEmu PCI Generic allows the creation of a generic PCI type. The PCI Type is part of the DOCA DevEmu PCI library. It is the component responsible for configuring the capabilities and bar layout of emulated devices.

The PCI Type can be considered a template for creating emulated devices: the user first configures a type and can then use it to create multiple emulated devices that share the same configuration.

For a more concrete example, suppose you want to emulate an NVMe device. You would create a type and configure its capabilities and BARs to comply with the NVMe specification. You can then use the same type to create multiple emulated NVMe devices.

The PCIe configuration space is 256 bytes long and has a header that is 64 bytes long. Each field can be referred to as a register (e.g., device ID).

Every PCIe device is required to implement the PCIe configuration space as defined in the PCIe specification.

The host can then read and/or write to registers in the PCIe configuration space. This allows the PCIe driver and the BIOS to interact with the device and perform the required setup.

It is possible to configure registers in the PCIe configuration space header as shown in the following diagram:

Info: 0x0 is the only supported header type (general device).

The following registers are read-only, and they are used to identify the device:

Class Code – Defines the functionality of the device. Can be further split into 3 values {class : subclass : prog IF}. Example: 0x020000, where Class is 0x02 (Network Controller), Subclass is 0x00 (Ethernet Controller), and Prog IF is 0x00 (N/A).

Revision ID – Unique identifier of the device revision. The vendor allocates the ID by itself. Example: 0x01 (Rev 01).

Vendor ID – Unique identifier of the chipset vendor. The vendor allocates the ID from the PCI-SIG. Example: 0x15b3 (Nvidia).

Device ID – Unique identifier of the chipset. The vendor allocates the ID by itself. Example: 0xa2dc (BlueField-3 integrated ConnectX-7 network controller).

Subsystem Vendor ID – Unique identifier of the card vendor. The vendor allocates the ID from the PCI-SIG. Example: 0x15b3 (Nvidia).

Subsystem ID – Unique identifier of the card. The vendor allocates the ID by itself. Example: 0x0051.
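The split of the Class Code register into its three fields can be sketched as follows. This is only an illustration of the bit layout described above; the struct and function names are hypothetical and are not part of the DOCA API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields packed into the 24-bit class code. */
struct class_code_fields {
    uint8_t base_class; /* bits 23:16, e.g. 0x02 = Network Controller */
    uint8_t subclass;   /* bits 15:8,  e.g. 0x00 = Ethernet Controller */
    uint8_t prog_if;    /* bits 7:0,   programming interface */
};

/* Split a 24-bit class code such as 0x020000 into {class, subclass, prog IF}. */
static struct class_code_fields split_class_code(uint32_t class_code)
{
    assert(class_code <= 0xFFFFFFu); /* the register is 24 bits wide */

    struct class_code_fields fields = {
        .base_class = (uint8_t)(class_code >> 16),
        .subclass = (uint8_t)((class_code >> 8) & 0xFFu),
        .prog_if = (uint8_t)(class_code & 0xFFu),
    };
    return fields;
}
```

For the example value 0x020000, this yields class 0x02, subclass 0x00, and prog IF 0x00.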

While the PCIe configuration space can be used to interact with the PCIe device, it is not enough to implement the functionality that is targeted by the device. Rather, it is only relevant for the PCIe layer.

To enable protocol-specific functionality, the device configures additional memory regions referred to as base address registers (BARs) that can be used by the host to interact with the device. Unlike the PCIe configuration space, BARs are defined by the device, and interactions with them are device-specific. For example, the PCIe driver interacts with an NVMe device's PCIe configuration space according to the PCIe spec, while the NVMe driver interacts with the BAR regions according to the NVMe spec.

Read/write requests to a BAR are typically routed to hardware, but in the case of an emulated device, the requests are routed to software.

The DOCA DevEmu PCI type library provides APIs that allow software to pick the mechanism used for routing the requests to software, while taking into consideration common design patterns utilized in existing devices.

Each PCIe device can have up to 6 BARs with varying properties. During the PCIe bus enumeration process, the PCIe device must be able to advertise information about the layout of each BAR. Based on the advertised information, the BIOS/OS then allocates a memory region for each BAR and assigns the address to the relevant BAR in the PCIe configuration space header. The driver can then use the assigned memory address to perform reads/writes to the BAR.
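As an illustration of how a BAR advertises its size, the sketch below models the standard PCI sizing handshake for a 32-bit BAR: the host writes all 1s to the BAR register and reads it back, and the address bits below the BAR size read as 0. This is a host-side view and not DOCA code; the function names are hypothetical, and the model is restricted to sizes of at least 16 bytes for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value a 32-bit BAR of size 2^log_sz returns after the host writes all 1s.
 * low_bits carries the BAR's type flags (bit 0 set = I/O space; for memory
 * BARs, bits 2:1 = memory type and bit 3 = prefetchable).
 */
static uint32_t bar_probe_readback(uint8_t log_sz, uint32_t low_bits)
{
    assert(log_sz >= 4 && log_sz < 32); /* simplification of this sketch */

    uint32_t size_mask = (uint32_t)~((UINT32_C(1) << log_sz) - 1u);
    return size_mask | (low_bits & 0xFu);
}

/* Host side: recover the BAR size from the read-back value. */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    /* I/O BARs reserve the low 2 bits for flags, memory BARs the low 4. */
    uint32_t flag_mask = (readback & 0x1u) ? 0x3u : 0xFu;
    uint32_t address_bits = readback & (uint32_t)~flag_mask;

    return (uint32_t)(~address_bits + 1u);
}
```

Because only the number of writable address bits is advertised, the size is necessarily a power of 2.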

The PCIe device must be able to provide information with regards to each BAR's layout.

The layout can be split into 2 types, each with their own properties as detailed in the following subsections.

According to the PCIe specification, the following represents the I/O mapped BAR:

Additionally, the BAR register is responsible for advertising the requested size during enumeration.

Info: The size must be a power of 2.

Users can use the following API to set a BAR as I/O mapped:

doca_devemu_pci_type_set_io_bar_conf(struct doca_devemu_pci_type *pci_type, uint8_t id, uint8_t log_sz)

id – the BAR ID

log_sz – the log of the BAR size

According to the PCIe specification, the following represents the memory mapped BAR:

Additionally, the BAR register is responsible for advertising the requested size during enumeration.

Info: The size must be a power of 2.

The memory mapped BAR allows a 64-bit address to be assigned. To achieve this, users must specify the BAR memory type as 64-bit and then set the size of the next BAR (BAR ID + 1) to 0.

Setting the pre-fetchable bit indicates that reads to the BAR have no side-effects.

Users can use the following API to set a BAR as memory mapped:

doca_devemu_pci_type_set_memory_bar_conf(struct doca_devemu_pci_type *pci_type, uint8_t id, uint8_t log_sz, enum doca_devemu_pci_bar_mem_type memory_type, uint8_t prefetchable)

id – the BAR ID

log_sz – the log of the BAR size. If set to 0, then the size is considered as 0 (instead of 1).

memory_type – specifies the memory type of the BAR. If set to 64-bit, then the next BAR must have log_sz set to 0.

prefetchable – indicates whether the BAR memory is pre-fetchable or not (a value of 1 or 0 respectively)
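The 64-bit rule (a 64-bit BAR consumes the next BAR slot, which must therefore have log_sz set to 0) can be checked before applying a configuration. The sketch below validates a planned layout of the 6 BARs; struct bar_plan and bar_plan_is_valid are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BARS 6

/* Hypothetical description of one planned BAR. */
struct bar_plan {
    uint8_t log_sz; /* 0 means the BAR has size 0 (unused) */
    bool is_64bit;  /* true if the memory type is 64-bit */
};

/* Check that every 64-bit BAR is followed by a BAR with log_sz == 0. */
static bool bar_plan_is_valid(const struct bar_plan bars[MAX_BARS])
{
    assert(bars != NULL);

    for (size_t i = 0; i < MAX_BARS; i++) {
        if (!bars[i].is_64bit || bars[i].log_sz == 0)
            continue;
        /* A 64-bit BAR needs the next slot for its upper address bits. */
        if (i + 1 >= MAX_BARS || bars[i + 1].log_sz != 0)
            return false;
    }
    return true;
}
```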

BAR regions are memory regions that make up a BAR layout. They are not part of the PCIe specification; rather, they are a DOCA concept that allows the user to customize the behavior of the BAR when the host interacts with it.

The BAR region defines the behavior when the host performs a read/write to an address within the BAR, such that every address falls in some memory region as defined by the user.

All BAR regions have these configurations in common:

id – the BAR ID that the region is part of

start_addr – the start address of the region within the BAR layout relative to the BAR. 0 indicates the start of the BAR layout.

size – the size of the BAR region
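To make the role of these three fields concrete, the sketch below resolves a host access (BAR ID plus offset) to the region that covers it. The struct mirrors the common configurations listed above, but both the struct and the lookup function are hypothetical and only model the concept; they are not how DOCA represents regions internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the configuration shared by all BAR regions. */
struct bar_region {
    uint8_t id;          /* BAR ID that the region is part of */
    uint64_t start_addr; /* start of the region, relative to the BAR */
    uint64_t size;       /* size of the region in bytes */
};

/* Return the index of the region covering (bar_id, offset), or -1 if none. */
static int find_bar_region(const struct bar_region *regions, size_t count,
                           uint8_t bar_id, uint64_t offset)
{
    assert(regions != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        const struct bar_region *r = &regions[i];

        if (r->id == bar_id && offset >= r->start_addr &&
            offset - r->start_addr < r->size)
            return (int)i;
    }
    return -1;
}
```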

Currently, there are 5 BAR region types, each defining different behavior:

Stateful

DB by offset

DB by data

MSI-X table

MSI-X PBA

A stateful region can be used as shared memory whose contents are maintained in firmware. A read from the driver returns the latest value, while a write updates the value and triggers an event to software running on the DPU.

This can be useful for communication between the driver and the device, during the control path (e.g., exposing capabilities, initialization).

Info: Some limitations apply. See the Limitations section.

A read from the driver returns the latest value written to the region, whether written by the DPU or by the driver itself.





A write from the driver updates the value at the written address and notifies software running on the Arm that a write has occurred. The notification on the Arm arrives as an asynchronous event (see doca_devemu_pci_dev_event_bar_stateful_region_driver_write).

Info: The event delivered to Arm software is asynchronous, so it may arrive after the driver has completed the write.





The DPU can read the values of the stateful region using doca_devemu_pci_dev_query_bar_stateful_region_values. This returns the latest snapshot of the stateful region values. It can be particularly useful to find what was written by the driver after the "stateful region driver write event" occurs.

The DPU can write the values of the stateful region using doca_devemu_pci_dev_modify_bar_stateful_region_values. This updates the values such that subsequent reads from the driver or the DPU return these values.

The DPU is able to set default values to the stateful region. Default values come in 2 layers:

Type default values – these values are set for all devices that have the same type. This can be set only if no device currently exists.

Device default values – these values are set for a specific device and take effect on the next FLR cycle or the next hotplug of the device.

A read of the stateful region returns the first value available in the following order of priority:

The latest value written by the DPU or the driver (whichever was written last)

The device default values

The type default values

0

Doorbell (DB) regions can be used to implement a consumer-producer queue between the driver and the DPU, such that a write from the driver would trigger an event on the DPU through DPA, allowing it to fetch the written value. This can be useful for communication between the driver and the device, during the data path allowing IO processing.

While DBs are not part of the PCIe specification, they are a mechanism widely used by vendors (e.g., RDMA QP, NVMe SQ, virtio VQ, etc).

The same DB region can be used to manage multiple DBs, such that each DB can be used to implement a queue.

The DPU software can utilize DB resources individually:

Each DB resource has a unique zero-based index referred to as DB ID

Each DB resource can be managed (create/destroy/modify/query) individually

Each DB resource has a separate notification mechanism. That is, the notification on DPU is triggered for each DB separately.

The DB usually consists of a numeric value (e.g., uint32_t ) representing the consumer/producer index of the queue.

When the driver writes to the DB region, the related DB resource gets updated with the written value, and a notification is sent to the DPU.

When the driver writes to the DB BAR region, it must adhere to the following:

The size of the write must match the size of the DB value (e.g., uint32_t )

The offset within the region must be aligned to the DB stride size or the DB size

The flow looks as follows:

Driver performs a write of the DB value at some offset within the DB BAR region

DPU calculates the DB ID that the write is intended for, depending on the region type:

DB by offset – DPU calculates the DB ID based on the write offset relative to the DB BAR region

DB by data – DPU parses the written DB value and extracts the DB ID from it

DPU updates the DB resource with the matching DB ID to the value written by the driver

DPU sends a notification to the DPA application, informing it that the value of DB with DB ID has been updated by the driver

The driver should not attempt to read from the DB region. Doing so results in anomalous behavior.

The BlueField can update the value of each DB resource individually using doca_devemu_pci_db_modify_value . This produces similar side effects as though the driver updated the value using a write to the DB region.

The BlueField can read the value of each DB resource individually using one of the following methods:

Read the value from the BlueField Arm using doca_devemu_pci_db_query_value

Read the value from the DPA using doca_dpa_dev_devemu_pci_db_get_value

The first option is a time-consuming operation and is recommended only for the control path. In the data path, it is recommended to use only the second option.

The API doca_devemu_pci_type_set_bar_db_region_by_offset_conf can be used to set up a DB by offset region. When the driver writes a DB value using this region, the DPU receives a notification for the relevant DB resource, based on the write offset, such that the DB ID is calculated as follows: db_id = write_offset / db_stride_size.

Warning: The area that is part of the stride but not part of the doorbell should not be used for any read/write operation. Doing so results in undefined behavior.





The API doca_devemu_pci_type_set_bar_db_region_by_data_conf can be used to set up a DB by data region. When the driver writes a DB value using this region, the DPU receives a notification for the relevant DB resource based on the written DB value, such that there is no relation between the write offset and the DB triggered. This DB region assumes that the DB ID is embedded within the DB value written by the driver. When setting up this region, the user must specify where the Most Significant Byte (MSB) and Least Significant Byte (LSB) of the DB ID are embedded in the DB value.

The DPU follows these steps to extract the DB ID from the DB value:

Driver writes the DB value

BlueField extracts the bytes between MSB and LSB

DPU compares the MSB index with the LSB index:

If the MSB index is greater than the LSB index, the extracted value is interpreted as Little Endian

If the LSB index is greater than the MSB index, the extracted value is interpreted as Big Endian



Example:

DB size is 4 bytes, LSB is 1, and MSB is 3.

Driver writes value 0xCCDDEEFF to the DB region at index 0 in Little Endian. The value is written to memory as follows: [0]=FF [1]=EE [2]=DD [3]=CC

The relevant bytes are the following: [1]=EE [2]=DD [3]=CC

Since MSB (3) is greater than LSB (1), the value is interpreted as Little Endian: db_id = 0xCCDDEE
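The extraction steps and the example above can be reproduced with a short routine that walks from the MSB index to the LSB index, which handles both the Little Endian and the Big Endian case. This is a sketch of the described behavior, limited to DB IDs of up to 4 bytes; db_id_from_value is a hypothetical name and not a DOCA function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Extract the DB ID embedded in a DB value.
 * db_bytes holds the value as laid out in memory, msb/lsb are the byte
 * indices of the most/least significant byte of the DB ID.
 */
static uint32_t db_id_from_value(const uint8_t *db_bytes, size_t db_size,
                                 size_t msb, size_t lsb)
{
    assert(db_bytes != NULL && msb < db_size && lsb < db_size);

    size_t span = (msb > lsb ? msb - lsb : lsb - msb) + 1;
    assert(span <= sizeof(uint32_t)); /* limit of this sketch */

    uint32_t db_id = 0;
    size_t i = msb;

    /* Start at the most significant byte and move toward the LSB index. */
    for (;;) {
        db_id = (db_id << 8) | db_bytes[i];
        if (i == lsb)
            break;
        i = (msb > lsb) ? i - 1 : i + 1;
    }
    return db_id;
}
```

With the bytes FF EE DD CC in memory, MSB = 3, and LSB = 1, the routine returns 0xCCDDEE, matching the example.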

Message signaled interrupts extended (MSI-X) is commonly used by PCIe devices to send interrupts over the PCIe bus to the host driver. DOCA APIs allow users to expose the MSI-X capability as per the PCIe specification, and to later use it to send interrupts to the host driver.

To configure it, users must provide the following:

The number of MSI-X vectors, which can be set using doca_devemu_pci_type_set_num_msix

An MSI-X table region

An MSI-X PBA region

As per the PCIe specification, to expose the MSI-X capability, the device must designate a memory region within its BAR as an MSI-X table region. In DOCA, this can be done using doca_devemu_pci_type_set_bar_msix_table_region_conf.

As per the PCIe specification, to expose the MSI-X capability, the device must designate a memory region within its BAR as an MSI-X pending bit array (PBA) region. In DOCA, this can be done using doca_devemu_pci_type_set_bar_msix_pba_region_conf.
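When planning the BAR layout, it helps to know how much space the two MSI-X structures need. In the PCIe specification, each MSI-X table entry occupies 16 bytes and the PBA holds one pending bit per vector, packed into 64-bit words. The sketch below computes these minimum sizes; the helper names are hypothetical, and any additional size or alignment constraints that DOCA may place on the regions are not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_TABLE_ENTRY_BYTES 16u /* per PCIe spec: address, data, vector control */
#define MSIX_MAX_VECTORS 2048u     /* per PCIe spec: 11-bit table size field */

/* Minimum size in bytes of the MSI-X table for num_vectors vectors. */
static uint32_t msix_table_bytes(uint32_t num_vectors)
{
    assert(num_vectors >= 1 && num_vectors <= MSIX_MAX_VECTORS);
    return num_vectors * MSIX_TABLE_ENTRY_BYTES;
}

/* Minimum size in bytes of the PBA: one bit per vector, in 64-bit words. */
static uint32_t msix_pba_bytes(uint32_t num_vectors)
{
    assert(num_vectors >= 1 && num_vectors <= MSIX_MAX_VECTORS);
    return ((num_vectors + 63u) / 64u) * 8u;
}
```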

It is possible to raise an MSI-X interrupt for each vector individually. This can be done only using the DPA API doca_dpa_dev_devemu_pci_msix_raise.

Some operations require accessing memory which is set up by the host driver. DOCA's device emulation APIs allow users to access such I/O memory using the DOCA mmap (see DOCA Core Memory Subsystem).

After starting the PCIe device, it is possible to acquire an mmap that references the host memory using doca_devemu_pci_mmap_create . After creating this mmap, it is possible to configure it by providing:

Access permissions

Host memory range

DOCA devices that can access the memory

The mmap can then be used to create buffers that reference memory on the host. The buffers' addresses would not be locally accessible (i.e., the CPU cannot dereference the address); instead, the addresses would be I/O addresses as defined by the host driver.

The buffers created from the mmap can then be used with other DOCA libraries that accept a doca_buf as an input. This includes:

FLR can be handled as described in DOCA DevEmu PCI FLR. Additionally, users must ensure that the following resources are destroyed before stopping the PCIe device:

Doorbells created using doca_devemu_pci_db_create_on_dpa

MSI-X vectors created using doca_devemu_pci_msix_create_on_dpa

Memory maps created using doca_devemu_pci_mmap_create

As explained in "Driver Write", DOCA DevEmu PCI Generic supports emulated PCI devices with the following limitation: when a driver writes to a register, the written value is immediately available to subsequent reads from the same register. This immediate availability does not guarantee that any internal actions triggered by the write have completed. It is recommended to rely on the value of a different register to confirm completion of the write action. For instance, when implementing a write-to-clear operation (e.g., writing 1 to register A to clear register B), it is advisable to poll register B until it indicates the desired state, which confirms that the write action has been executed. If a device specification requires certain actions to be completed before written values are exposed to subsequent reads, such a device cannot be emulated using the DOCA DevEmu PCI Generic framework.