NVIDIA BlueField DPU BSP v4.5.0
1.0

Performance Monitoring Counters

The performance modules in NVIDIA® BlueField® are present in several hardware blocks and each block has a certain set of supported events.

The mlx_pmc driver provides access to all of these performance modules through a sysfs interface. The driver creates a directory under /sys/class/hwmon under which each of the blocks explained above has a subdirectory. Please note that all directories under /sys/class/hwmon are named as "hwmon<N>" where N is the hwmon device number corresponding to the device. This is assigned by Linux and could change with the addition of more devices to the hwmon class. Each hwmon directory has a "name" node which can be used to identify the correct device. In this case, reading the "name" file should return "bfperf".

The hardware blocks that include performance modules are:

  • Tile (block containing 2 cores and a shared L2 cache) has 2 sets of counters, one set for HNF and HNF_NET events. These are present as "tile" and "tilenet" directories in the sysfs interface of the driver.

  • TRIO (PCIe root complex) has 3 sets of counters, one each for TRIO, SMGEN and PCIE TLR events. The sysfs directories for these are called "trio", "triogen" and "pcie" respectively.

  • MSS (memory sub-system containing the memory controller and L3 cache)

  • GIC and SMMU with one set of counters each for the SMGEN events. These are simply labelled "gic" and "smmu" respectively.

The number of Tile, TRIO and MSS blocks depends on the system. There is a maximum of 8 Tile, 3 TRIO and 2 MSS blocks in BlueField, and this is added as a suffix to the sysfs directory names. For example, this is a list of directories present in a BlueField-2 system:

Copy
Copied!
            

ubuntu@dpu:/$ ls /sys/class/hwmon/hwmon0/ device l3cachehalf0 pcie0 smmu0 tile1 tilenet0 tilenet3 triogen0 ecc l3cachehalf1 pcie1 subsystem tile2 tilenet1 trio0 triogen1 gic0 name power tile0 tile3 tilenet2 trio1 uevent

The PCIe TLR statistics for each TRIO are under the "pcie" block.

The performance data of the BlueField hardware is collected using two mechanisms:

  1. Programming hardware counters to monitor specific events

  2. Reading registers that hold performance/event statistics

All blocks except "ecc" and "pcie" use the mechanism 1.

Using Hardware Counters

For blocks that use hardware counters to collect data, each counter present in the block is represented by "event<N>" and "counter<N>" sysfs files.

For example:

Copy
Copied!
            

ubuntu@dpu:/$ ls /sys/class/hwmon/hwmon0/tile0/ counter0 counter1 counter2 counter3 event0 event1 event2 event3 event_list

An event<N> and counter<N> pair can be used to program and monitor events. The "event_list" sysfs file displays the list of events supported by that block along with the hexadecimal value corresponding to each event.

Use the echo command to write the event number to the event<N> file, and use the cat command to read the counter value from the corresponding counter (counter<N>).

The counters are enabled individually once the event number is written to the corresponding event file. However, the L3 cache performance counters cannot be enabled or disabled individually and can only be triggered or stopped all at the same time.

So in the example provided, all 4 event files may be programmed with the necessary event numbers and then the "enable" file may be used to start the counters. Writing 0 to the enable file stops the counters while 1 starts them.

Reading Registers

For "ecc" and "pcie" blocks, the counters cannot be started or stopped by the user, instead the statistics are automatically collected by HW and stored in registers. These register names are exposed within the directory and can be read by the user at any time.

SMGEN Performance Module

Hex Value

Name

Description

0x0

AW_REQ

Reserved for internal use

0x1

AW_BEATS

Reserved for internal use

0x2

AW_TRANS

Reserved for internal use

0x3

AW_RESP

Reserved for internal use

0x4

AW_STL

Reserved for internal use

0x5

AW_LAT

Reserved for internal use

0x6

AW_REQ_TBU

Reserved for internal use

0x8

AR_REQ

Reserved for internal use

0x9

AR_BEATS

Reserved for internal use

0xa

AR_TRANS

Reserved for internal use

0xb

AR_STL

Reserved for internal use

0xc

AR_LAT

Reserved for internal use

0xd

AR_REQ_TBU

Reserved for internal use

0xe

TBU_MISS

The number of TBU miss

0xf

TX_DAT_AF

Mesh Data channel write FIFO almost Full.

This is from the TRIO toward the Arm memory.

0x10

RX_DAT_AF

Mesh Data channel read FIFO almost Full.

This is from the Arm memory toward the TRIO.

0x11

RETRYQ_CRED

Reserved for internal use


Tile HNF Performance Module

Hex Value

Name

Description

0x45

HNF_REQUESTS

Number of REQs that were processed in HNF

0x46

HNF_REJECTS

Reserved for internal use

0x47

ALL_BUSY

Reserved for internal use

0x48

MAF_BUSY

Reserved for internal use

0x49

MAF_REQUESTS

Reserved for internal use

0x4a

RNF_REQUESTS

Number of REQs sent by the RN-F selected by HNF_PERF_CTL register RNF_SEL field

0x4b

REQUEST_TYPE

Reserved for internal use

0x4c

MEMORY_READS

Number of reads to MSS

0x4d

MEMORY_WRITES

Number of writes to MSS

0x4e

VICTIM_WRITE

Number of victim lines written to memory

0x4f

POC_FULL

Reserved for internal use

0x50

POC_FAIL

Number of times that the POC Monitor sent RespErr Okay status to an Exclusive WriteNoSnp or CleanUnique REQ

0x51

POC_SUCCESS

Number of times that the POC Monitor sent RespErr ExOkay status to an Exclusive WriteNoSnp or CleanUnique REQ

0x52

POC_WRITES

Number of Exclusive WriteNoSnp or CleanUnique REQs processed by POC Monitor

0x53

POC_READS

Number of Exclusive ReadClean/ReadShared REQs processed by POC Monitor

0x54

FORWARD

Reserved for internal use

0x55

RXREQ_HNF

Reserved for internal use

0x56

RXRSP_HNF

Reserved for internal use

0x57

RXDAT_HNF

Reserved for internal use

0x58

TXREQ_HNF

Reserved for internal use

0x59

TXRSP_HNF

Reserved for internal use

0x5a

TXDAT_HNF

Reserved for internal use

0x5b

TXSNP_HNF

Reserved for internal use

0x5c

INDEX_MATCH

Reserved for internal use

0x5d

A72_ACCESS

Access requests (Reads, Writes, CopyBack, CMO, DVM) from A72 clusters

0x5e

IO_ACCESS

Accesses requests (Reads, Writes) from DMA IO devices

0x5f

TSO_WRITE

Total Store Order write Requests from DMA IO devices

0x60

TSO_CONFLICT

Reserved for internal use

0x61

DIR_HIT

Requests that hit in directory

0x62

HNF_ACCEPTS

Reserved for internal use

0x63

REQ_BUF_EMPTY

Number of cycles when request buffer is empty

0x64

REQ_BUF_IDLE_MAF

Reserved for internal use

0x65

TSO_NOARB

Reserved for internal use

0x66

TSO_NOARB_CYCLES

Reserved for internal use

0x67

MSS_NO_CREDIT

Number of cycles that a Request could not be sent to MSS due to lack of credits

0x68

TXDAT_NO_LCRD

Reserved for internal use

0x69

TXSNP_NO_LCRD

Reserved for internal use

0x6a

TXRSP_NO_LCRD

Reserved for internal use

0x6b

TXREQ_NO_LCRD

Reserved for internal use

0x6c

TSO_CL_MATCH

Reserved for internal use

0x6d

MEMORY_READS_BYPASS

Number of reads to MSS that bypass Home Node

0x6e

TSO_NOARB_TIMEOUT

Reserved for internal use

0x6f

ALLOCATE

Number of times that Directory entry was allocated

0x70

VICTIM

Number of times that Directory entry allocation did not find an Invalid way in the set

0x71

A72_WRITE

Write requests from A72 clusters

0x72

A72_Read

Read requests from A72 clusters

0x73

IO_WRITE

Write requests from DMA IO devices

0x74

IO_Reads

Read requests from DMA IO devices

0x75

TSO_Reject

Reserved for internal use

0x80

TXREQ_RN

Reserved for internal use

0x81

TXRSP_RN

Reserved for internal use

0x82

TXDAT_RN

Reserved for internal use

0x83

RXSNP_RN

Reserved for internal use

0x84

RXRSP_RN

Reserved for internal use

0x85

RXDAT_RN

Reserved for internal use


TRIO Performance Module

Hex Value

Name

Description

0xa0

TPIO_DATA_BEAT

Data beats from Arm PIO to TRIO

0xa1

TDMA_DATA_BEAT

Data beats from Arm memory to PCI completion

0xa2

MAP_DATA_BEAT

Reserved for internal use

0xa3

TXMSG_DATA_BEAT

Reserved for internal use

0xa4

TPIO_DATA_PACKET

Data packets from Arm PIO to TRIO

0xa5

TDMA_DATA_PACKET

Data packets from Arm memory to PCI completion

0xa6

MAP_DATA_PACKET

Reserved for internal use

0xa7

TXMSG_DATA_PACKET

Reserved for internal use

0xa8

TDMA_RT_AF

The in-flight PCI DMA READ request queue is almost full

0xa9

TDMA_PBUF_MAC_AF

Indicator of the buffer of Arm memory reads is too full awaiting PCIe access

0xaa

TRIO_MAP_WRQ_BUF_EMPTY

PCIe write transaction buffer is empty

0xab

TRIO_MAP_CPL_BUF_EMPTY

Arm PIO request completion queue is empty

0xac

TRIO_MAP_RDQ0_BUF_EMPTY

The buffer of MAC0's read transaction is empty

0xad

TRIO_MAP_RDQ1_BUF_EMPTY

The buffer of MAC1's read transaction is empty

0xae

TRIO_MAP_RDQ2_BUF_EMPTY

The buffer of MAC2's read transaction is empty

0xaf

TRIO_MAP_RDQ3_BUF_EMPTY

The buffer of MAC3's read transaction is empty

0xb0

TRIO_MAP_RDQ4_BUF_EMPTY

The buffer of MAC4's read transaction is empty

0xb1

TRIO_MAP_RDQ5_BUF_EMPTY

The buffer of MAC5's read transaction is empty

0xb2

TRIO_MAP_RDQ6_BUF_EMPTY

The buffer of MAC6's read transaction is empty

0xb3

TRIO_MAP_RDQ7_BUF_EMPTY

The buffer of MAC7's read transaction is empty


L3 Cache Performance Module

Warning

The L3 cache interfaces with the Arm cores via the SkyMesh. The CDN is used for control data. The NDN is used for responses. The DDN is for the actual data transfer.

Hex Value

Name

Description

0x00

DISABLE

Reserved for internal use

0x01

CYCLES

Timestamp counter

0x02

TOTAL_RD_REQ_IN

Read Transaction control request from the CDN of the SkyMesh

0x03

TOTAL_WR_REQ_IN

Write transaction control request from the CDN of the SkyMesh

0x04

TOTAL_WR_DBID_ACK

Write transaction control responses from the NDN of the SkyMesh

0x05

TOTAL_WR_DATA_IN

Write transaction data from the DDN of the SkyMesh

0x06

TOTAL_WR_COMP

Write completion response from the NDN of the SkyMesh

0x07

TOTAL_RD_DATA_OUT

Read transaction data from the DDN

0x08

TOTAL_CDN_REQ_IN_BANK0

CHI CDN Transactions Bank 0

0x09

TOTAL_CDN_REQ_IN_BANK1

CHI CDN Transactions Bank 1

0x0a

TOTAL_DDN_REQ_IN_BANK0

CHI DDN Transactions Bank 0

0x0b

TOTAL_DDN_REQ_IN_BANK1

CHI DDN Transactions Bank 1

0x0c

TOTAL_EMEM_RD_RES_IN_BANK0

Total EMEM Read Response Bank 0

0x0d

TOTAL_EMEM_RD_RES_IN_BANK1

Total EMEM Read Response Bank 1

0x0e

TOTAL_CACHE_RD_RES_IN_BANK0

Total Cache Read Response Bank 0

0x0f

TOTAL_CACHE_RD_RES_IN_BANK1

Total Cache Read Response Bank 1

0x10

TOTAL_EMEM_RD_REQ_BANK0

Total EMEM Read Request Bank 0

0x11

TOTAL_EMEM_RD_REQ_BANK1

Total EMEM Read Request Bank 1

0x12

TOTAL_EMEM_WR_REQ_BANK0

Total EMEM Write Request Bank 0

0x13

TOTAL_EMEM_WR_REQ_BANK1

Total EMEM Write Request Bank 1

0x14

TOTAL_RD_REQ_OUT

EMEM Read Transactions Out

0x15

TOTAL_WR_REQ_OUT

EMEM Write Transactions Out

0x16

TOTAL_RD_RES_IN

EMEM Read Transactions In

0x17

HITS_BANK0

Number of Hits Bank 0

0x18

HITS_BANK1

Number of Hits Bank 1

0x19

MISSES_BANK0

Number of Misses Bank 0

0x1a

MISSES_BANK1

Number of Misses Bank 1

0x1b

ALLOCATIONS_BANK0

Number of Allocations Bank 0

0x1c

ALLOCATIONS_BANK1

Number of Allocations Bank 1

0x1d

EVICTIONS_BANK0

Number of Evictions Bank 0

0x1e

EVICTIONS_BANK1

Number of Evictions Bank 1

0x1f

DBID_REJECT

Reserved for internal use

0x20

WRDB_REJECT_BANK0

Reserved for internal use

0x21

WRDB_REJECT_BANK1

Reserved for internal use

0x22

CMDQ_REJECT_BANK0

Reserved for internal use

0x23

CMDQ_REJECT_BANK1

Reserved for internal use

0x24

COB_REJECT_BANK0

Reserved for internal use

0x25

COB_REJECT_BANK1

Reserved for internal use

0x26

TRB_REJECT_BANK0

Reserved for internal use

0x27

TRB_REJECT_BANK1

Reserved for internal use

0x28

TAG_REJECT_BANK0

Reserved for internal use

0x29

TAG_REJECT_BANK1

Reserved for internal use

0x2a

ANY_REJECT_BANK0

Reserved for internal use

0x2b

ANY_REJECT_BANK1

Reserved for internal use


PCIe TLR Statistics

Hex Value

Name

Description

0x0

PCIE_TLR_IN_P_PKT_CNT

Incoming posted packets

0x10

PCIE_TLR_IN_NP_PKT_CNT

Incoming non-posted packets

0x18

PCIE_TLR_IN_C_PKT_CNT

Incoming completion packets

0x20

PCIE_TLR_OUT_P_PKT_CNT

Outgoing posted packets

0x28

PCIE_TLR_OUT_NP_PKT_CNT

Outgoing non-posted packets

0x30

PCIE_TLR_OUT_C_PKT_CNT

Outgoing completion packets

0x38

PCIE_TLR_IN_P_BYTE_CNT

Incoming posted bytes

0x40

PCIE_TLR_IN_NP_BYTE_CNT

Incoming non-posted bytes

0x48

PCIE_TLR_IN_C_BYTE_CNT

Incoming completion bytes

0x50

PCIE_TLR_OUT_C_BYTE_CNT

Outgoing posted bytes

0x58

PCIE_TLR_OUT_NP_BYTE_CNT

Outgoing non-posted bytes

0x60

PCIE_TLR_OUT_C_BYTE_CNT

Outgoing completion bytes


Tile HNFNET Performance Module

Hex Value

Name

Description

0x12

CDN_REQ

The number of CDN requests

0x13

DDN_REQ

The number of DDN requests

0x14

NDN_REQ

The number of NDN requests

0x15

CDN_DIAG_N_OUT_OF_CRED

Number of cycles that north input port FIFO runs out of credits in the CDN network

0x16

CDN_DIAG_S_OUT_OF_CRED

Number of cycles that south input port FIFO runs out of credits in the CDN network

0x17

CDN_DIAG_E_OUT_OF_CRED

Number of cycles that east input port FIFO runs out of credits in the CDN network

0x18

CDN_DIAG_W_OUT_OF_CRED

Number of cycles that west input port FIFO runs out of credits in the CDN network

0x19

CDN_DIAG_C_OUT_OF_CRED

Number of cycles that core input port FIFO runs out of credits in the CDN network

0x1a

CDN_DIAG_N_EGRESS

Packets sent out from north port in the CDN network

0x1b

CDN_DIAG_S_EGRESS

Packets sent out from south port in the CDN network

0x1c

CDN_DIAG_E_EGRESS

Packets sent out from east port in the CDN network

0x1d

CDN_DIAG_W_EGRESS

Packets sent out from west port in the CDN network

0x1e

CDN_DIAG_C_EGRESS

Packets sent out from core port in the CDN network

0x1f

CDN_DIAG_N_INGRESS

Packets received by north port in the CDN network

0x20

CDN_DIAG_S_INGRESS

Packets received by south port in the CDN network

0x21

CDN_DIAG_E_INGRESS

Packets received by east port in the CDN network

0x22

CDN_DIAG_W_INGRESS

Packets received by west port in the CDN network

0x23

CDN_DIAG_C_INGRESS

Packets received by core port in the CDN network

0x24

CDN_DIAG_CORE_SENT

Packets completed from core port in the CDN network

0x25

DDN_DIAG_N_OUT_OF_CRED

Number of cycles that north input port FIFO runs out of credits in the DDN network

0x26

DDN_DIAG_S_OUT_OF_CRED

Number of cycles that south input port FIFO runs out of credits in the DDN network

0x27

DDN_DIAG_E_OUT_OF_CRED

Number of cycles that east input port FIFO runs out of credits in the DDN network

0x28

DDN_DIAG_W_OUT_OF_CRED

Number of cycles that west input port FIFO runs out of credits in the DDN network

0x29

DDN_DIAG_C_OUT_OF_CRED

Number of cycles that core input port FIFO runs out of credits in the DDN network

0x2a

DDN_DIAG_N_EGRESS

Packets sent out from north port in the DDN network

0x2b

DDN_DIAG_S_EGRESS

Packets sent out from south port in the DDN network

0x2c

DDN_DIAG_E_EGRESS

Packets sent out from east port in the DDN network

0x2d

DDN_DIAG_W_EGRESS

Packets sent out from west port in the DDN network

0x2e

DDN_DIAG_C_EGRESS

Packets sent out from core port in the DDN network

0x2f

DDN_DIAG_N_INGRESS

Packets received by north port in the DDN network

0x30

DDN_DIAG_S_INGRESS

Packets received by south port in the DDN network

0x31

DDN_DIAG_E_INGRESS

Packets received by east port in the DDN network

0x32

DDN_DIAG_W_INGRESS

Packets received by west port in the DDN network

0x33

DDN_DIAG_C_INGRESS

Packets received by core port in the DDN network

0x34

DDN_DIAG_CORE_SENT

Packets completed from core port in the DDN network

0x35

NDN_DIAG_N_OUT_OF_CRED

Number of cycles that north input port FIFO runs out of credits in the NDN network

0x36

NDN_DIAG_S_OUT_OF_CRED

Number of cycles that south input port FIFO runs out of credits in the NDN network

0x37

NDN_DIAG_E_OUT_OF_CRED

Number of cycles that east input port FIFO runs out of credits in the NDN network

0x38

NDN_DIAG_W_OUT_OF_CRED

Number of cycles that west input port FIFO runs out of credits in the NDN network

0x39

NDN_DIAG_C_OUT_OF_CRED

Number of cycles that core input port FIFO runs out of credits in the NDN network

0x3a

NDN_DIAG_N_EGRESS

Packets sent out from north port in the NDN network

0x3b

NDN_DIAG_S_EGRESS

Packets sent out from south port in the NDN network

0x3c

NDN_DIAG_E_EGRESS

Packets sent out from east port in the NDN network

0x3d

NDN_DIAG_W_EGRESS

Packets sent out from west port in the NDN network

0x3e

NDN_DIAG_C_EGRESS

Packets sent out from core port in the NDN network

0x3f

NDN_DIAG_N_INGRESS

Packets received by north port in the NDN network

0x40

NDN_DIAG_S_INGRESS

Packets received by south port in the NDN network

0x41

NDN_DIAG_E_INGRESS

Packets received by east port in the NDN network

0x42

NDN_DIAG_W_INGRESS

Packets received by west port in the NDN network

0x43

NDN_DIAG_C_INGRESS

Packets received by core port in the NDN network

0x44

NDN_DIAG_CORE_SENT

Packets completed from core port in the NDN network


To program a counter to monitor one of the events from the event list, the event name or number needs to be written to the corresponding event file.

Let us call the /sys/class/hwmon/hwmon<N> folder corresponding to this driver as BFPERF_DIR.

For example, to monitor the event HNF_REQUESTS (0x45) on tile2 using counter 3:

Copy
Copied!
            

$ echo 0x45 > <BFPERF_DIR>/tile2/event3

Or:

Copy
Copied!
            

$ echo HNF_REQUESTS > <BFPERF_DIR>/tile2/event3

Once this is done, counter3 resets the counter and starts monitoring the number of HNF_REQUESTS.

To read the counter value, run:

Copy
Copied!
            

$ cat <BFPERF_DIR>/tile2/counter3

To see what event is currently being monitored by a counter, just read the corresponding event file to get the event name and number.

Copy
Copied!
            

$ cat <BFPERF_DIR>/tile2/event3

In this case, reading the event3 file returns "0x45: HNF_REQUESTS".

To clear the counter, write 0 to the counter file.

Copy
Copied!
            

$ echo 0 > <BFPERF_DIR>/tile2/counter3

This resets the accumulator and the counter continues monitoring the same event that has previously been programmed, but starts the count from 0 again. Writing non-zero values to the counter files is not allowed.

To stop monitoring an event, write 0xff to the corresponding event file.

This is slightly different for the l3cache blocks due to the restriction that all counters can only be enabled, disabled, or reset together. So once the event is written to the event file, the counters will have to be enabled to start monitoring their respective events by writing "1" to the "enable" file. Writing "0" to this file will stop all the counters. The most reliable way to get accurate counter values would be by disabling the counters after a certain time period and then proceeding to read the counter values.

Warning

Programming a counter to monitor a new event automatically stops all the counters. Also, enabling the counters resets the counters to 0 first.

For blocks that have performance statistics registers (mechanism 2), all of these statistics are directly made available to be read or reset.

For example, to read the number of incoming posted packets to TRIO2:

Copy
Copied!
            

$ cat <BFPERF_DIR>/pcie2/IN_P_PKT_CNT

The count can be reset to 0 by writing 0 to the same file. Again, non-zero writes to these files are not allowed.

© Copyright 2023, NVIDIA. Last updated on Jan 29, 2024.