Performance Monitoring Counters
The performance modules in NVIDIA® BlueField® are present in several hardware blocks and each block has a certain set of supported events.
The mlx_pmc driver provides access to all of these performance modules through a sysfs interface. The driver creates a directory under /sys/class/hwmon, within which each of the blocks listed below has a subdirectory. Note that all directories under /sys/class/hwmon are named "hwmon<N>", where N is the hwmon device number corresponding to the device. This number is assigned by Linux and can change as more devices are added to the hwmon class. Each hwmon directory has a "name" node that can be used to identify the correct device. For this driver, reading the "name" file returns "bfperf".
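Because the hwmon number can change, the directory is best located by its "name" node rather than hard-coded. The following is a minimal sketch of that lookup. The function name find_bfperf_dir and its optional root argument are illustrative and not part of the driver.

```shell
# Print the hwmon directory whose "name" node reads "bfperf".
# $1 (optional): hwmon class root; defaults to /sys/class/hwmon.
# find_bfperf_dir is a hypothetical helper name, not a driver-provided tool.
find_bfperf_dir() {
    root="${1:-/sys/class/hwmon}"
    for dir in "$root"/hwmon*; do
        # Skip entries without a readable "name" node.
        [ -r "$dir/name" ] || continue
        if [ "$(cat "$dir/name")" = "bfperf" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no bfperf hwmon device found under $root" >&2
    return 1
}

# Example usage:
# BFPERF_DIR=$(find_bfperf_dir) && ls "$BFPERF_DIR"
```

The function returns a non-zero status when no matching device exists, so a calling script can stop early instead of writing to the wrong hwmon device.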
The hardware blocks that include performance modules are:
Tile (block containing 2 cores and a shared L2 cache) has 2 sets of counters, one set each for HNF and HNF_NET events. These are present as "tile" and "tilenet" directories in the sysfs interface of the driver.
TRIO (PCIe root complex) has 3 sets of counters, one each for TRIO, SMGEN and PCIE TLR events. The sysfs directories for these are called "trio", "triogen" and "pcie" respectively.
MSS (memory sub-system containing the memory controller and L3 cache)
GIC and SMMU with one set of counters each for the SMGEN events. These are simply labelled "gic" and "smmu" respectively.
The number of Tile, TRIO and MSS blocks depends on the system. There is a maximum of 8 Tile, 3 TRIO and 2 MSS blocks in BlueField, and the index of each block is added as a suffix to the sysfs directory names. For example, this is a list of directories present in a BlueField-2 system:
ubuntu@bf:/$ ls /sys/class/hwmon/hwmon0/
device l3cachehalf0 pcie0 smmu0 tile1 tilenet0 tilenet3 triogen0
ecc l3cachehalf1 pcie1 subsystem tile2 tilenet1 trio0 triogen1
gic0 name power tile0 tile3 tilenet2 trio1 uevent
The PCIe TLR statistics for each TRIO are under the "pcie" block.
The performance data of the BlueField hardware is collected using two mechanisms:
1. Programming hardware counters to monitor specific events
2. Reading registers that hold performance/event statistics
All blocks except "ecc" and "pcie" use mechanism 1.
Using Hardware Counters
For blocks that use hardware counters to collect data, each counter present in the block is represented by "event<N>" and "counter<N>" sysfs files.
For example:
ubuntu@bf:/$ ls /sys/class/hwmon/hwmon0/tile0/
counter0 counter1 counter2 counter3 event0 event1 event2 event3 event_list
An event<N> and counter<N> pair can be used to program and monitor events. The "event_list" sysfs file displays the list of events supported by that block along with the hexadecimal value corresponding to each event.
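As an illustration, the hexadecimal value of an event can be looked up by name from "event_list". This sketch assumes that each line of the file has the form "0x45: HNF_REQUESTS" (the same form the event files return when read). That layout is an assumption and should be checked against the actual file. The function name event_hex is hypothetical.

```shell
# Print the hex value for an event name from a block's event_list file.
# ASSUMPTION: each line looks like "0x45: HNF_REQUESTS"; adjust the
# parsing if the driver prints a different layout.
# $1: block directory (e.g. <BFPERF_DIR>/tile0), $2: event name.
# event_hex is a hypothetical helper name.
event_hex() {
    awk -v name="$2" -F': *' '
        $2 == name { print $1; found = 1; exit }
        END { exit !found }
    ' "$1/event_list"
}

# Example usage:
# event_hex "$BFPERF_DIR/tile0" MEMORY_READS
```

The function exits with a non-zero status when the name is not listed, which helps catch events that the block does not support.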
Use the echo command to write the event number to the event<N> file, and use the cat command to read the counter value from the corresponding counter (counter<N>).
The counters are enabled individually once the event number is written to the corresponding event file. However, the L3 cache performance counters cannot be enabled or disabled individually and can only be triggered or stopped all at the same time.
For the L3 cache blocks, therefore, all of the event files may be programmed with the necessary event numbers first, and then the "enable" file may be used to start the counters. Writing 1 to the enable file starts the counters, while writing 0 stops them.
Reading Registers
For "ecc" and "pcie" blocks, the counters cannot be started or stopped by the user. Instead, the statistics are automatically collected by the hardware and stored in registers. These register names are exposed within the directory and can be read by the user at any time.
SMGEN Performance Module
Hex Value | Name | Description |
0x0 | AW_REQ | Reserved for internal use |
0x1 | AW_BEATS | Reserved for internal use |
0x2 | AW_TRANS | Reserved for internal use |
0x3 | AW_RESP | Reserved for internal use |
0x4 | AW_STL | Reserved for internal use |
0x5 | AW_LAT | Reserved for internal use |
0x6 | AW_REQ_TBU | Reserved for internal use |
0x8 | AR_REQ | Reserved for internal use |
0x9 | AR_BEATS | Reserved for internal use |
0xa | AR_TRANS | Reserved for internal use |
0xb | AR_STL | Reserved for internal use |
0xc | AR_LAT | Reserved for internal use |
0xd | AR_REQ_TBU | Reserved for internal use |
0xe | TBU_MISS | Number of TBU misses |
0xf | TX_DAT_AF | Mesh Data channel write FIFO almost Full. This is from the TRIO toward the Arm memory. |
0x10 | RX_DAT_AF | Mesh Data channel read FIFO almost Full. This is from the Arm memory toward the TRIO. |
0x11 | RETRYQ_CRED | Reserved for internal use |
Tile HNF Performance Module
Hex Value | Name | Description |
0x45 | HNF_REQUESTS | Number of REQs that were processed in HNF |
0x46 | HNF_REJECTS | Reserved for internal use |
0x47 | ALL_BUSY | Reserved for internal use |
0x48 | MAF_BUSY | Reserved for internal use |
0x49 | MAF_REQUESTS | Reserved for internal use |
0x4a | RNF_REQUESTS | Number of REQs sent by the RN-F selected by HNF_PERF_CTL register RNF_SEL field |
0x4b | REQUEST_TYPE | Reserved for internal use |
0x4c | MEMORY_READS | Number of reads to MSS |
0x4d | MEMORY_WRITES | Number of writes to MSS |
0x4e | VICTIM_WRITE | Number of victim lines written to memory |
0x4f | POC_FULL | Reserved for internal use |
0x50 | POC_FAIL | Number of times that the POC Monitor sent RespErr Okay status to an Exclusive WriteNoSnp or CleanUnique REQ |
0x51 | POC_SUCCESS | Number of times that the POC Monitor sent RespErr ExOkay status to an Exclusive WriteNoSnp or CleanUnique REQ |
0x52 | POC_WRITES | Number of Exclusive WriteNoSnp or CleanUnique REQs processed by POC Monitor |
0x53 | POC_READS | Number of Exclusive ReadClean/ReadShared REQs processed by POC Monitor |
0x54 | FORWARD | Reserved for internal use |
0x55 | RXREQ_HNF | Reserved for internal use |
0x56 | RXRSP_HNF | Reserved for internal use |
0x57 | RXDAT_HNF | Reserved for internal use |
0x58 | TXREQ_HNF | Reserved for internal use |
0x59 | TXRSP_HNF | Reserved for internal use |
0x5a | TXDAT_HNF | Reserved for internal use |
0x5b | TXSNP_HNF | Reserved for internal use |
0x5c | INDEX_MATCH | Reserved for internal use |
0x5d | A72_ACCESS | Access requests (Reads, Writes, CopyBack, CMO, DVM) from A72 clusters |
0x5e | IO_ACCESS | Access requests (Reads, Writes) from DMA IO devices |
0x5f | TSO_WRITE | Total Store Order write Requests from DMA IO devices |
0x60 | TSO_CONFLICT | Reserved for internal use |
0x61 | DIR_HIT | Requests that hit in directory |
0x62 | HNF_ACCEPTS | Reserved for internal use |
0x63 | REQ_BUF_EMPTY | Number of cycles when request buffer is empty |
0x64 | REQ_BUF_IDLE_MAF | Reserved for internal use |
0x65 | TSO_NOARB | Reserved for internal use |
0x66 | TSO_NOARB_CYCLES | Reserved for internal use |
0x67 | MSS_NO_CREDIT | Number of cycles that a Request could not be sent to MSS due to lack of credits |
0x68 | TXDAT_NO_LCRD | Reserved for internal use |
0x69 | TXSNP_NO_LCRD | Reserved for internal use |
0x6a | TXRSP_NO_LCRD | Reserved for internal use |
0x6b | TXREQ_NO_LCRD | Reserved for internal use |
0x6c | TSO_CL_MATCH | Reserved for internal use |
0x6d | MEMORY_READS_BYPASS | Number of reads to MSS that bypass Home Node |
0x6e | TSO_NOARB_TIMEOUT | Reserved for internal use |
0x6f | ALLOCATE | Number of times that Directory entry was allocated |
0x70 | VICTIM | Number of times that Directory entry allocation did not find an Invalid way in the set |
0x71 | A72_WRITE | Write requests from A72 clusters |
0x72 | A72_Read | Read requests from A72 clusters |
0x73 | IO_WRITE | Write requests from DMA IO devices |
0x74 | IO_Reads | Read requests from DMA IO devices |
0x75 | TSO_Reject | Reserved for internal use |
0x80 | TXREQ_RN | Reserved for internal use |
0x81 | TXRSP_RN | Reserved for internal use |
0x82 | TXDAT_RN | Reserved for internal use |
0x83 | RXSNP_RN | Reserved for internal use |
0x84 | RXRSP_RN | Reserved for internal use |
0x85 | RXDAT_RN | Reserved for internal use |
TRIO Performance Module
Hex Value | Name | Description |
0xa0 | TPIO_DATA_BEAT | Data beats from Arm PIO to TRIO |
0xa1 | TDMA_DATA_BEAT | Data beats from Arm memory to PCI completion |
0xa2 | MAP_DATA_BEAT | Reserved for internal use |
0xa3 | TXMSG_DATA_BEAT | Reserved for internal use |
0xa4 | TPIO_DATA_PACKET | Data packets from Arm PIO to TRIO |
0xa5 | TDMA_DATA_PACKET | Data packets from Arm memory to PCI completion |
0xa6 | MAP_DATA_PACKET | Reserved for internal use |
0xa7 | TXMSG_DATA_PACKET | Reserved for internal use |
0xa8 | TDMA_RT_AF | The in-flight PCI DMA READ request queue is almost full |
0xa9 | TDMA_PBUF_MAC_AF | Indicates that the buffer of Arm memory reads is too full while awaiting PCIe access |
0xaa | TRIO_MAP_WRQ_BUF_EMPTY | PCIe write transaction buffer is empty |
0xab | TRIO_MAP_CPL_BUF_EMPTY | Arm PIO request completion queue is empty |
0xac | TRIO_MAP_RDQ0_BUF_EMPTY | The buffer of MAC0's read transaction is empty |
0xad | TRIO_MAP_RDQ1_BUF_EMPTY | The buffer of MAC1's read transaction is empty |
0xae | TRIO_MAP_RDQ2_BUF_EMPTY | The buffer of MAC2's read transaction is empty |
0xaf | TRIO_MAP_RDQ3_BUF_EMPTY | The buffer of MAC3's read transaction is empty |
0xb0 | TRIO_MAP_RDQ4_BUF_EMPTY | The buffer of MAC4's read transaction is empty |
0xb1 | TRIO_MAP_RDQ5_BUF_EMPTY | The buffer of MAC5's read transaction is empty |
0xb2 | TRIO_MAP_RDQ6_BUF_EMPTY | The buffer of MAC6's read transaction is empty |
0xb3 | TRIO_MAP_RDQ7_BUF_EMPTY | The buffer of MAC7's read transaction is empty |
L3 Cache Performance Module
The L3 cache interfaces with the Arm cores via the SkyMesh. The CDN is used for control data. The NDN is used for responses. The DDN is for the actual data transfer.
Hex Value | Name | Description |
0x00 | DISABLE | Reserved for internal use |
0x01 | CYCLES | Timestamp counter |
0x02 | TOTAL_RD_REQ_IN | Read Transaction control request from the CDN of the SkyMesh |
0x03 | TOTAL_WR_REQ_IN | Write transaction control request from the CDN of the SkyMesh |
0x04 | TOTAL_WR_DBID_ACK | Write transaction control responses from the NDN of the SkyMesh |
0x05 | TOTAL_WR_DATA_IN | Write transaction data from the DDN of the SkyMesh |
0x06 | TOTAL_WR_COMP | Write completion response from the NDN of the SkyMesh |
0x07 | TOTAL_RD_DATA_OUT | Read transaction data from the DDN |
0x08 | TOTAL_CDN_REQ_IN_BANK0 | CHI CDN Transactions Bank 0 |
0x09 | TOTAL_CDN_REQ_IN_BANK1 | CHI CDN Transactions Bank 1 |
0x0a | TOTAL_DDN_REQ_IN_BANK0 | CHI DDN Transactions Bank 0 |
0x0b | TOTAL_DDN_REQ_IN_BANK1 | CHI DDN Transactions Bank 1 |
0x0c | TOTAL_EMEM_RD_RES_IN_BANK0 | Total EMEM Read Response Bank 0 |
0x0d | TOTAL_EMEM_RD_RES_IN_BANK1 | Total EMEM Read Response Bank 1 |
0x0e | TOTAL_CACHE_RD_RES_IN_BANK0 | Total Cache Read Response Bank 0 |
0x0f | TOTAL_CACHE_RD_RES_IN_BANK1 | Total Cache Read Response Bank 1 |
0x10 | TOTAL_EMEM_RD_REQ_BANK0 | Total EMEM Read Request Bank 0 |
0x11 | TOTAL_EMEM_RD_REQ_BANK1 | Total EMEM Read Request Bank 1 |
0x12 | TOTAL_EMEM_WR_REQ_BANK0 | Total EMEM Write Request Bank 0 |
0x13 | TOTAL_EMEM_WR_REQ_BANK1 | Total EMEM Write Request Bank 1 |
0x14 | TOTAL_RD_REQ_OUT | EMEM Read Transactions Out |
0x15 | TOTAL_WR_REQ_OUT | EMEM Write Transactions Out |
0x16 | TOTAL_RD_RES_IN | EMEM Read Transactions In |
0x17 | HITS_BANK0 | Number of Hits Bank 0 |
0x18 | HITS_BANK1 | Number of Hits Bank 1 |
0x19 | MISSES_BANK0 | Number of Misses Bank 0 |
0x1a | MISSES_BANK1 | Number of Misses Bank 1 |
0x1b | ALLOCATIONS_BANK0 | Number of Allocations Bank 0 |
0x1c | ALLOCATIONS_BANK1 | Number of Allocations Bank 1 |
0x1d | EVICTIONS_BANK0 | Number of Evictions Bank 0 |
0x1e | EVICTIONS_BANK1 | Number of Evictions Bank 1 |
0x1f | DBID_REJECT | Reserved for internal use |
0x20 | WRDB_REJECT_BANK0 | Reserved for internal use |
0x21 | WRDB_REJECT_BANK1 | Reserved for internal use |
0x22 | CMDQ_REJECT_BANK0 | Reserved for internal use |
0x23 | CMDQ_REJECT_BANK1 | Reserved for internal use |
0x24 | COB_REJECT_BANK0 | Reserved for internal use |
0x25 | COB_REJECT_BANK1 | Reserved for internal use |
0x26 | TRB_REJECT_BANK0 | Reserved for internal use |
0x27 | TRB_REJECT_BANK1 | Reserved for internal use |
0x28 | TAG_REJECT_BANK0 | Reserved for internal use |
0x29 | TAG_REJECT_BANK1 | Reserved for internal use |
0x2a | ANY_REJECT_BANK0 | Reserved for internal use |
0x2b | ANY_REJECT_BANK1 | Reserved for internal use |
PCIe TLR Statistics
Hex Value | Name | Description |
0x0 | PCIE_TLR_IN_P_PKT_CNT | Incoming posted packets |
0x10 | PCIE_TLR_IN_NP_PKT_CNT | Incoming non-posted packets |
0x18 | PCIE_TLR_IN_C_PKT_CNT | Incoming completion packets |
0x20 | PCIE_TLR_OUT_P_PKT_CNT | Outgoing posted packets |
0x28 | PCIE_TLR_OUT_NP_PKT_CNT | Outgoing non-posted packets |
0x30 | PCIE_TLR_OUT_C_PKT_CNT | Outgoing completion packets |
0x38 | PCIE_TLR_IN_P_BYTE_CNT | Incoming posted bytes |
0x40 | PCIE_TLR_IN_NP_BYTE_CNT | Incoming non-posted bytes |
0x48 | PCIE_TLR_IN_C_BYTE_CNT | Incoming completion bytes |
0x50 | PCIE_TLR_OUT_P_BYTE_CNT | Outgoing posted bytes |
0x58 | PCIE_TLR_OUT_NP_BYTE_CNT | Outgoing non-posted bytes |
0x60 | PCIE_TLR_OUT_C_BYTE_CNT | Outgoing completion bytes |
Tile HNFNET Performance Module
Hex Value | Name | Description |
0x12 | CDN_REQ | The number of CDN requests |
0x13 | DDN_REQ | The number of DDN requests |
0x14 | NDN_REQ | The number of NDN requests |
0x15 | CDN_DIAG_N_OUT_OF_CRED | Number of cycles that north input port FIFO runs out of credits in the CDN network |
0x16 | CDN_DIAG_S_OUT_OF_CRED | Number of cycles that south input port FIFO runs out of credits in the CDN network |
0x17 | CDN_DIAG_E_OUT_OF_CRED | Number of cycles that east input port FIFO runs out of credits in the CDN network |
0x18 | CDN_DIAG_W_OUT_OF_CRED | Number of cycles that west input port FIFO runs out of credits in the CDN network |
0x19 | CDN_DIAG_C_OUT_OF_CRED | Number of cycles that core input port FIFO runs out of credits in the CDN network |
0x1a | CDN_DIAG_N_EGRESS | Packets sent out from north port in the CDN network |
0x1b | CDN_DIAG_S_EGRESS | Packets sent out from south port in the CDN network |
0x1c | CDN_DIAG_E_EGRESS | Packets sent out from east port in the CDN network |
0x1d | CDN_DIAG_W_EGRESS | Packets sent out from west port in the CDN network |
0x1e | CDN_DIAG_C_EGRESS | Packets sent out from core port in the CDN network |
0x1f | CDN_DIAG_N_INGRESS | Packets received by north port in the CDN network |
0x20 | CDN_DIAG_S_INGRESS | Packets received by south port in the CDN network |
0x21 | CDN_DIAG_E_INGRESS | Packets received by east port in the CDN network |
0x22 | CDN_DIAG_W_INGRESS | Packets received by west port in the CDN network |
0x23 | CDN_DIAG_C_INGRESS | Packets received by core port in the CDN network |
0x24 | CDN_DIAG_CORE_SENT | Packets completed from core port in the CDN network |
0x25 | DDN_DIAG_N_OUT_OF_CRED | Number of cycles that north input port FIFO runs out of credits in the DDN network |
0x26 | DDN_DIAG_S_OUT_OF_CRED | Number of cycles that south input port FIFO runs out of credits in the DDN network |
0x27 | DDN_DIAG_E_OUT_OF_CRED | Number of cycles that east input port FIFO runs out of credits in the DDN network |
0x28 | DDN_DIAG_W_OUT_OF_CRED | Number of cycles that west input port FIFO runs out of credits in the DDN network |
0x29 | DDN_DIAG_C_OUT_OF_CRED | Number of cycles that core input port FIFO runs out of credits in the DDN network |
0x2a | DDN_DIAG_N_EGRESS | Packets sent out from north port in the DDN network |
0x2b | DDN_DIAG_S_EGRESS | Packets sent out from south port in the DDN network |
0x2c | DDN_DIAG_E_EGRESS | Packets sent out from east port in the DDN network |
0x2d | DDN_DIAG_W_EGRESS | Packets sent out from west port in the DDN network |
0x2e | DDN_DIAG_C_EGRESS | Packets sent out from core port in the DDN network |
0x2f | DDN_DIAG_N_INGRESS | Packets received by north port in the DDN network |
0x30 | DDN_DIAG_S_INGRESS | Packets received by south port in the DDN network |
0x31 | DDN_DIAG_E_INGRESS | Packets received by east port in the DDN network |
0x32 | DDN_DIAG_W_INGRESS | Packets received by west port in the DDN network |
0x33 | DDN_DIAG_C_INGRESS | Packets received by core port in the DDN network |
0x34 | DDN_DIAG_CORE_SENT | Packets completed from core port in the DDN network |
0x35 | NDN_DIAG_N_OUT_OF_CRED | Number of cycles that north input port FIFO runs out of credits in the NDN network |
0x36 | NDN_DIAG_S_OUT_OF_CRED | Number of cycles that south input port FIFO runs out of credits in the NDN network |
0x37 | NDN_DIAG_E_OUT_OF_CRED | Number of cycles that east input port FIFO runs out of credits in the NDN network |
0x38 | NDN_DIAG_W_OUT_OF_CRED | Number of cycles that west input port FIFO runs out of credits in the NDN network |
0x39 | NDN_DIAG_C_OUT_OF_CRED | Number of cycles that core input port FIFO runs out of credits in the NDN network |
0x3a | NDN_DIAG_N_EGRESS | Packets sent out from north port in the NDN network |
0x3b | NDN_DIAG_S_EGRESS | Packets sent out from south port in the NDN network |
0x3c | NDN_DIAG_E_EGRESS | Packets sent out from east port in the NDN network |
0x3d | NDN_DIAG_W_EGRESS | Packets sent out from west port in the NDN network |
0x3e | NDN_DIAG_C_EGRESS | Packets sent out from core port in the NDN network |
0x3f | NDN_DIAG_N_INGRESS | Packets received by north port in the NDN network |
0x40 | NDN_DIAG_S_INGRESS | Packets received by south port in the NDN network |
0x41 | NDN_DIAG_E_INGRESS | Packets received by east port in the NDN network |
0x42 | NDN_DIAG_W_INGRESS | Packets received by west port in the NDN network |
0x43 | NDN_DIAG_C_INGRESS | Packets received by core port in the NDN network |
0x44 | NDN_DIAG_CORE_SENT | Packets completed from core port in the NDN network |
To program a counter to monitor one of the events from the event list, the event name or number needs to be written to the corresponding event file.
In the examples that follow, BFPERF_DIR refers to the /sys/class/hwmon/hwmon<N> folder corresponding to this driver.
For example, to monitor the event HNF_REQUESTS (0x45) on tile2 using counter 3:
$ echo 0x45 > <BFPERF_DIR>/tile2/event3
Or:
$ echo HNF_REQUESTS > <BFPERF_DIR>/tile2/event3
Once this is done, counter3 is reset and starts counting HNF_REQUESTS events.
To read the counter value, run:
$ cat <BFPERF_DIR>/tile2/counter3
To see what event is currently being monitored by a counter, just read the corresponding event file to get the event name and number.
$ cat <BFPERF_DIR>/tile2/event3
In this case, reading the event3 file returns "0x45: HNF_REQUESTS".
To clear the counter, write 0 to the counter file.
$ echo 0 > <BFPERF_DIR>/tile2/counter3
This resets the accumulator and the counter continues monitoring the same event that has previously been programmed, but starts the count from 0 again. Writing non-zero values to the counter files is not allowed.
To stop monitoring an event, write 0xff to the corresponding event file.
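The program, read, and stop steps above can be combined to count one event over a fixed interval. This is a sketch only. The function name pmc_sample and its argument order are illustrative, and writing to these sysfs files typically requires root privileges.

```shell
# Count one event over an interval, then stop monitoring it.
# $1: block directory (e.g. <BFPERF_DIR>/tile2)
# $2: counter index, $3: event name or hex value, $4: interval in seconds.
# pmc_sample is a hypothetical helper name.
pmc_sample() {
    echo "$3" > "$1/event$2"     # programming the event resets and starts the counter
    sleep "$4"
    cat "$1/counter$2"           # print the accumulated count
    echo 0xff > "$1/event$2"     # 0xff stops monitoring the event
}

# Example usage:
# pmc_sample "$BFPERF_DIR/tile2" 3 HNF_REQUESTS 5
```

The counter is read before the event is stopped, so the printed value covers slightly more than the requested interval. This approach applies to blocks whose counters are enabled individually, not to the l3cache blocks.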
The procedure is slightly different for the l3cache blocks because all of their counters can only be enabled, disabled, or reset together. Once the events are written to the event files, the counters must be enabled by writing "1" to the "enable" file before they start monitoring their respective events. Writing "0" to this file stops all the counters. The most reliable way to get accurate counter values is to disable the counters after a certain time period and then read the counter values.
Programming a counter to monitor a new event automatically stops all the counters. Also, enabling the counters resets the counters to 0 first.
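The l3cache sequence described above (program all events, enable, wait, disable, read) might be scripted as follows. This is a sketch under stated assumptions. The function name l3_sample is hypothetical, events are assigned to event0, event1, and so on in argument order, and the caller must not pass more events than the block has counters.

```shell
# Sample several L3 cache events over one interval.
# $1: l3cache block directory (e.g. <BFPERF_DIR>/l3cachehalf0)
# $2: interval in seconds; remaining arguments: one event per counter.
# l3_sample is a hypothetical helper name.
l3_sample() {
    dir="$1"; secs="$2"; shift 2
    echo 0 > "$dir/enable"           # make sure the counters are stopped
    i=0
    for ev in "$@"; do
        echo "$ev" > "$dir/event$i"  # program event<i> with the next event
        i=$((i + 1))
    done
    echo 1 > "$dir/enable"           # resets the counters to 0, then starts them
    sleep "$secs"
    echo 0 > "$dir/enable"           # stop before reading for stable values
    n=0
    for ev in "$@"; do
        printf '%s %s\n' "$ev" "$(cat "$dir/counter$n")"
        n=$((n + 1))
    done
}

# Example usage:
# l3_sample "$BFPERF_DIR/l3cachehalf0" 5 HITS_BANK0 MISSES_BANK0
```

All events are programmed before the single enable write because programming a new event stops the counters, and enabling them resets every count to 0.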
For blocks that have performance statistics registers (mechanism 2), all of these statistics are directly made available to be read or reset.
For example, to read the number of incoming posted packets to TRIO2:
$ cat <BFPERF_DIR>/pcie2/IN_P_PKT_CNT
The count can be reset to 0 by writing 0 to the same file. Again, non-zero writes to these files are not allowed.
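To capture every statistic of such a block at once, the files in its directory can be read in a loop. The sketch below assumes that the block directory contains only statistics files, which should be verified on the target system. The function name dump_stats is hypothetical.

```shell
# Print "NAME VALUE" for every readable regular file in a statistics block.
# $1: block directory (e.g. <BFPERF_DIR>/pcie2).
# ASSUMPTION: the directory holds only statistics files.
# dump_stats is a hypothetical helper name.
dump_stats() {
    for f in "$1"/*; do
        # Skip anything that is not a readable regular file.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example usage:
# dump_stats "$BFPERF_DIR/pcie2"
```

Running the function twice and subtracting the values gives the activity over the interval without resetting the registers.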