Measuring Workload Performance with Hardware Performance Counters#
Many software performance analysis tools rely on event counts from hardware performance monitoring units (PMUs) to characterize workload performance. This chapter provides information about how data from PMUs can be gathered and combined to form metrics for performance optimization. For simplicity, the Linux perf tool is used, but the same metrics can be used in any tool that gathers hardware performance events from PMUs, for example, NVIDIA Nsight Systems.
Introduction to Linux perf#
The Linux perf tool is a widely available, open-source tool that is used to collect application-level and system-level performance data. perf can monitor a rich set of software and hardware performance events from different sources and, in many cases, does not require administrator privileges (for example, root).
Installing perf depends on the distribution:
Ubuntu:
apt install linux-tools-$(uname -r)
Red Hat:
dnf install perf-$(uname -r)
SLES:
zypper install perf
perf gathers data from PMUs by using the perf_event system that is provided by the Linux kernel. Data can be gathered for the lifetime of a process or for a specific period.
perf supports the following measurement types:
- Performance summary (perf stat): perf collects the total PMU event counts for a workload and provides a high-level summary of its basic performance characteristics. This is a good approach for a first-pass performance analysis.
- Event-based sampling (perf record): perf periodically samples PMU event counters and relates the data to source code locations. A sample is taken when a specially configured event counter overflows, which raises an interrupt in which the instruction pointer address and register information are captured. perf uses this information to build stack traces and function-level annotations that characterize the performance of functions of interest on specific call paths. This is a good approach for detailed performance characterization after an initial workload characterization. After gathering data with the perf record command, analyze the data with the perf report or perf annotate commands.
Configuring Perf#
By default, unprivileged users can only gather information about context-switched events, which includes most of the predefined CPU core events, such as cycle counts, instruction counts, and some software events. To give unprivileged access to all PMUs and global measurements, the following system settings need to be configured:
Note
To configure these settings, you must have root access.
- perf_event_paranoid: This setting controls privilege checks in the kernel. Setting it to -1 or 0 allows non-root users to perform per-process and system-wide performance monitoring (refer to Unprivileged users for more information).
- kptr_restrict: This setting affects how kernel addresses are exposed. Setting it to 0 assists in kernel symbol resolution.
For example:
$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
To make these settings persist across reboots, follow your Linux distribution's instructions for configuring system parameters. Typically, you edit /etc/sysctl.conf or create a file in the /etc/sysctl.d/ folder that contains the following lines:
kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0
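For example, a minimal sketch of the persistent configuration; the file name 99-perf.conf is an arbitrary choice, and sysctl --system reloads all configuration files immediately:
$ printf 'kernel.perf_event_paranoid=-1\nkernel.kptr_restrict=0\n' | sudo tee /etc/sysctl.d/99-perf.conf
$ sudo sysctl --system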
Warning
There are security implications to configuring these settings as shown in the example above. You must read and understand the relevant Linux kernel documentation and consult your system administrator.
To access the PMUs, configure the following kernel driver modules:
- Core PMU: CONFIG_ARM_PMU_ACPI, CONFIG_ARM_PMUV3
- Statistical Profiling Extension PMU: CONFIG_ARM_SPE_PMU
- Coresight System PMU: CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU, CONFIG_NVIDIA_CORESIGHT_PMU_ARCH_SYSTEM_PMU
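To confirm that these options are enabled in the running kernel, you can check the kernel configuration. This is a minimal sketch, and the configuration file location depends on your distribution:
$ grep -E 'CONFIG_ARM_PMU_ACPI|CONFIG_ARM_PMUV3|CONFIG_ARM_SPE_PMU|CORESIGHT_PMU_ARCH_SYSTEM_PMU' /boot/config-$(uname -r)
# If /boot/config-* is not present, /proc/config.gz (when available) can be checked instead:
$ zgrep CONFIG_ARM_SPE_PMU /proc/config.gz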
Gathering Hardware Performance Data with Perf#
To generate a high-level report of event counts, run the perf stat
command. For example, to count cache miss events, CPU cycles, and CPU
instructions for ten seconds:
$ perf stat -a -e cache-misses,cycles,instructions sleep 10
You can also gather information for a specific process:
$ perf stat -e cycles,stalled-cycles-backend ./stream.exe
This counts total CPU cycles and the cycles in which the CPU backend is stalled while stream.exe executes.
The -e flag accepts a comma-separated list of performance events, which can be predefined or raw events. To see the predefined events, run perf list. Raw events are specified as rXXXX, where XXXX is a hexadecimal event number. Refer to the Arm Neoverse V2 Core Technical Reference Manual for more information about event numbers.
For a more detailed analysis, gather data in event-based sampling mode by running the perf record command:
$ perf record -e cycles,instructions,dTLB-loads,dTLB-load-misses ./xhpl.exe
$ perf report
$ perf annotate
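The report and annotate steps can also be run non-interactively; the flag choices below are illustrative, and <symbol> is a hypothetical function name from your workload:
$ perf report --stdio --sort=comm,dso,symbol    # text report sorted by process, shared object, and symbol
$ perf annotate --stdio <symbol>                # per-instruction annotation of a single function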
Additional information about basic perf usage is available in the perf man pages.
Grace CPU Performance Metrics#
This section provides formulas for useful performance metrics, which are functions of hardware event counts that more fully express the performance characteristics of the system. For example, a simple count of instructions is less meaningful than the ratio of instructions per cycle, which characterizes processor utilization. These metrics can be used with any tool that gathers hardware performance event data from the Grace PMUs.
The counters are referred to by name instead of event number because most performance analysis tools provide names for common events. If your tool does not have a named counter for one of the following events, use the translation tables in the Arm Neoverse V2 Core Technical Reference Manual to convert the event names to raw event numbers. For example, FP_SCALE_OPS_SPEC has event number 0x80C0 and FP_FIXED_OPS_SPEC has event number 0x80C1, so data for the FLOPS computational intensity metric can be gathered with perf by measuring raw events 0x80C0 and 0x80C1:
perf record -e r80C0 -e r80C1 ./a.out
Cycle and Instruction Accounting#
Instructions Per Cycle: Instructions retired per cycle.
INST_RETIRED/CPU_CYCLES
Retiring: Percentage of total slots that retired operations. A high value indicates efficient CPU usage.
100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES*8)))
Warning
The stall_slot_backend and stall_slot_frontend events cannot be used to accurately measure backend and frontend boundness (refer to https://developer.arm.com/documentation/SDEN2332927/latest/). These counters are also used in perf's TopdownL1 metrics, which are therefore also inaccurate. We recommend using stall_backend and stall_frontend instead.
Bad Speculation: Percentage of total slots with operations that executed but did not retire due to a pipeline flush.
100 * ((1-OP_RETIRED/OP_SPEC) * (1-STALL_SLOT/(CPU_CYCLES * 8)) + BR_MIS_PRED * 4 / CPU_CYCLES)
Backend Stalled Cycles: Percentage of cycles that were stalled due to resource constraints in the backend unit of the processor.
(STALL_BACKEND/CPU_CYCLES)*100
Frontend Stalled Cycles: Percentage of cycles that were stalled due to resource constraints in the frontend unit of the processor.
(STALL_FRONTEND/CPU_CYCLES)*100
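As an illustration, the stall percentages can be computed directly from a perf stat run. This is a minimal sketch that assumes your perf build exposes the named events cpu_cycles, stall_backend, and stall_frontend; otherwise substitute the raw event numbers from the Arm Neoverse V2 Core Technical Reference Manual:
$ perf stat -x, -e cpu_cycles,stall_backend,stall_frontend -o stalls.csv ./a.out
$ awk -F, '
    $3=="cpu_cycles"     {cycles=$1}
    $3=="stall_backend"  {be=$1}
    $3=="stall_frontend" {fe=$1}
    END {printf "Backend Stalled Cycles:  %.1f%%\nFrontend Stalled Cycles: %.1f%%\n", 100*be/cycles, 100*fe/cycles}' stalls.csv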
Computational Intensity#
SVE FLOPS: Floating point operations per second in any precision performed by the SVE instructions.
Fused instructions count as two operations; for example, a fused multiply-add instruction increases the count by twice the number of active SVE vector lanes. Floating point operations performed by scalar or NEON instructions are not counted.
FP_SCALE_OPS_SPEC/TIME
Non-SVE FLOPS: Floating point operations per second in any precision performed by an instruction that is not an SVE instruction.
Fused instructions count as two operations; for example, a scalar fused multiply-add instruction increases the count by two, and a fused multiply-add NEON instruction increases the count by twice the number of vector lanes. Floating point operations performed by SVE instructions are not counted.
FP_FIXED_OPS_SPEC/TIME
FLOPS: Floating point operations per second in any precision performed by any instruction.
Fused instructions count as two operations.
(FP_SCALE_OPS_SPEC + FP_FIXED_OPS_SPEC)/TIME
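For example, a rough GFLOPS estimate can be derived from the two raw events mentioned above plus the wall-clock duration. A minimal sketch (the raw event numbers 0x80C0 and 0x80C1 are the ones given earlier; operations divided by nanoseconds yields giga-operations per second):
$ perf stat -x, -e duration_time,r80C0,r80C1 -o flops.csv ./a.out
$ awk -F, '
    $3=="duration_time" {ns=$1}     # wall-clock duration in nanoseconds
    $3=="r80C0"         {sve=$1}    # FP_SCALE_OPS_SPEC
    $3=="r80C1"         {fixed=$1}  # FP_FIXED_OPS_SPEC
    END {printf "FLOPS: %.2f GFLOP/s\n", (sve+fixed)/ns}' flops.csv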
Operation Mix#
Load Operations Percentage: Percentage of total operations speculatively executed as load operations.
(LD_SPEC/INST_SPEC)*100
Store Operations Percentage: Percentage of total operations speculatively executed as store operations.
(ST_SPEC/INST_SPEC)*100
Branch Operations Percentage: Percentage of total operations speculatively executed as branch operations.
((BR_IMMED_SPEC + BR_INDIRECT_SPEC)/INST_SPEC)*100
Integer Operations Percentage: Percentage of total operations speculatively executed as scalar integer operations.
(DP_SPEC/INST_SPEC)*100
Note
The DP in DP_SPEC stands for Data Processing.
Floating Point Operations Percentage: Percentage of total operations speculatively executed as scalar floating point operations.
(VFP_SPEC/INST_SPEC)*100
Synchronization Percentage: Percentage of total operations speculatively executed as synchronization operations.
((ISB_SPEC + DSB_SPEC + DMB_SPEC)/INST_SPEC)*100
Crypto Operations Percentage: Percentage of total operations speculatively executed as crypto operations.
(CRYPTO_SPEC/INST_SPEC)*100
SVE Operations (Load/Store Inclusive) Percentage: Percentage of total operations speculatively executed as scalable vector operations, including loads and stores.
(SVE_INST_SPEC/INST_SPEC)*100
Advanced SIMD Operations Percentage: Percentage of total operations speculatively executed as advanced SIMD (NEON) operations.
(ASE_INST_SPEC/INST_SPEC)*100
SIMD Percentage: Percentage of total operations speculatively executed as integer or floating point vector/SIMD operations.
((SVE_INST_SPEC + ASE_INST_SPEC)/INST_SPEC)*100
FP16 Percentage: Percentage of total operations speculatively executed as half-precision floating point operations.
Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.
(FP_HP_SPEC/INST_SPEC)*100
FP32 Percentage: Percentage of total operations speculatively executed as single-precision floating point operations.
Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.
(FP_SP_SPEC/INST_SPEC)*100
FP64 Percentage: Percentage of total operations speculatively executed as double-precision floating point operations.
Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.
(FP_DP_SPEC/INST_SPEC)*100
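Several of these _SPEC events can be collected in one perf stat run and normalized against INST_SPEC. A minimal sketch, assuming your perf build exposes the lowercase named events (for example inst_spec and ld_spec); otherwise substitute the raw event numbers from the TRM:
$ perf stat -x, -e inst_spec,ld_spec,st_spec,dp_spec,vfp_spec,ase_inst_spec,sve_inst_spec -o mix.csv ./a.out
$ awk -F, '
    NF>2 {v[$3]=$1}
    END {for (e in v) if (e != "inst_spec") printf "%-14s %6.2f%%\n", e, 100*v[e]/v["inst_spec"]}' mix.csv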
SVE Predication#
Full SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations with all active predicates.
(SVE_PRED_FULL_SPEC/INST_SPEC)*100
Partial SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations in which at least one element is FALSE.
(SVE_PRED_PARTIAL_SPEC/INST_SPEC)*100
Empty SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations with no active predicate.
(SVE_PRED_EMPTY_SPEC/INST_SPEC)*100
Cache Effectiveness#
L1 Data Cache Miss Ratio: Ratio of total level 1 data cache read or write accesses that miss.
L1D_CACHE includes reads and writes and is the sum of L1D_CACHE_RD and L1D_CACHE_WR.
L1D_CACHE_REFILL/L1D_CACHE
L1D Cache MPKI: The number of level 1 data cache accesses missed per thousand instructions executed.
(L1D_CACHE_REFILL/INST_RETIRED) * 1000
L1I Cache Miss Ratio: The ratio of level 1 instruction cache accesses missed to the total number of level 1 instruction cache accesses.
L1I_CACHE does not measure cache maintenance instructions or non-cacheable accesses.
L1I_CACHE_REFILL/L1I_CACHE
L1I Cache MPKI: The number of level 1 instruction cache accesses missed per thousand instructions executed.
(L1I_CACHE_REFILL/INST_RETIRED) * 1000
L2 Cache Miss Ratio: The ratio of level 2 cache accesses missed to the total number of level 2 cache accesses.
L2D_CACHE does not count cache maintenance operations or snoops from outside the core.
L2D_CACHE_REFILL/L2D_CACHE
L2 Cache MPKI: The number of level 2 unified cache accesses missed per thousand instructions executed.
(L2D_CACHE_REFILL/INST_RETIRED) * 1000
LL Cache Read Hit Ratio: The ratio of last level cache read accesses hit in the cache to the total number of last level cache accesses.
(LL_CACHE_RD - LL_CACHE_MISS_RD) / LL_CACHE_RD
LL Cache Read Miss Ratio: The ratio of last level cache read accesses missed to the total number of last level cache accesses.
LL_CACHE_MISS_RD / LL_CACHE_RD
LL Cache Read MPKI: The number of last level cache read accesses missed per thousand instructions executed.
(LL_CACHE_MISS_RD/INST_RETIRED) * 1000
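For example, the L2 cache miss ratio and MPKI can be computed from a single run. A minimal sketch, assuming the named events inst_retired, l2d_cache, and l2d_cache_refill are available in your perf build:
$ perf stat -x, -e inst_retired,l2d_cache,l2d_cache_refill -o l2.csv ./a.out
$ awk -F, '
    $3=="inst_retired"     {inst=$1}
    $3=="l2d_cache"        {acc=$1}
    $3=="l2d_cache_refill" {miss=$1}
    END {printf "L2 Cache Miss Ratio: %.3f\nL2 Cache MPKI: %.2f\n", miss/acc, 1000*miss/inst}' l2.csv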
TLB Effectiveness#
L1 Data TLB Miss Ratio: The ratio of level 1 data TLB accesses missed to the total number of level 1 data TLB accesses.
TLB maintenance instructions are not counted.
L1D_TLB_REFILL/L1D_TLB
L1 Data TLB MPKI: The number of level 1 data TLB accesses missed per thousand instructions executed.
(L1D_TLB_REFILL/INST_RETIRED) * 1000
L1 Instruction TLB Miss Ratio: The ratio of level 1 instruction TLB accesses missed to the total number of level 1 instruction TLB accesses.
TLB maintenance instructions are not counted.
L1I_TLB_REFILL/L1I_TLB
L1 Instruction TLB MPKI: The number of level 1 instruction TLB accesses missed per thousand instructions executed.
(L1I_TLB_REFILL/INST_RETIRED) * 1000
DTLB MPKI: The number of data TLB Walks per thousand instructions executed.
(DTLB_WALK/INST_RETIRED) * 1000
DTLB Walk Ratio: The ratio of data TLB Walks to the total number of data TLB accesses.
DTLB_WALK / L1D_TLB
ITLB MPKI: The number of instruction TLB Walks per thousand instructions executed.
(ITLB_WALK/INST_RETIRED) * 1000
ITLB Walk Ratio: The ratio of instruction TLB Walks to the total number of instruction TLB accesses.
ITLB_WALK / L1I_TLB
L2 Unified TLB Miss Ratio: The ratio of level 2 unified TLB accesses missed to the total number of level 2 unified TLB accesses.
L2D_TLB_REFILL / L2D_TLB
L2 Unified TLB MPKI: The number of level 2 unified TLB accesses missed per thousand instructions executed.
(L2D_TLB_REFILL/INST_RETIRED) * 1000
Branching#
Branch Misprediction Ratio: The ratio of branches mispredicted to the total number of branches architecturally executed.
BR_MIS_PRED_RETIRED/BR_RETIRED
Branch MPKI: The number of branch mispredictions per thousand instructions executed.
(BR_MIS_PRED_RETIRED/INST_RETIRED) * 1000
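The branch metrics follow the same pattern. A minimal sketch, assuming the named events br_retired and br_mis_pred_retired are available in your perf build:
$ perf stat -x, -e inst_retired,br_retired,br_mis_pred_retired -o br.csv ./a.out
$ awk -F, '
    $3=="inst_retired"        {inst=$1}
    $3=="br_retired"          {br=$1}
    $3=="br_mis_pred_retired" {mis=$1}
    END {printf "Branch Misprediction Ratio: %.4f\nBranch MPKI: %.2f\n", mis/br, 1000*mis/inst}' br.csv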
Statistical Profiling Extension#
Arm’s Statistical Profiling Extension (SPE) is designed to enhance software performance analysis by providing detailed profiling capabilities. This extension allows you to collect statistical data about software execution and records key execution data, including program counters, data addresses, and PMU events. SPE enhances performance analysis for branches, memory access, and so on, which makes it useful for software optimization. Refer to Statistical Profiling Extension for more information about SPE.
Grace Coresight System PMU Units#
Grace includes the Coresight System PMUs shown in the following table, which are registered by the PMU driver with the specified naming conventions.
| System | Coresight System PMU name |
|---|---|
| Scalable Coherency Fabric (SCF) | nvidia_scf_pmu_<socket-id> |
| NVLINK-C2C0 | nvidia_nvlink_c2c0_pmu_<socket-id> |
| NVLINK-C2C1 (Grace Hopper only) | nvidia_nvlink_c2c1_pmu_<socket-id> |
| PCIe | nvidia_pcie_pmu_<socket-id> |
| CNVLINK (multi-socket Grace Hopper only) | nvidia_cnvlink_pmu_<socket-id> |
Coresight System PMU events are not attributable to a core, so the perf tool must be run in system-wide mode. If multiple events must be measured, perf supports grouping events from the same PMU. For example, to monitor the SCF CYCLES, CMEM_WB_ACCESS, and CMEM_WR_ACCESS events from the SCF PMU for socket 0:
$ perf stat -a -e
duration_time,'{nvidia_scf_pmu_0/event=cycles/,nvidia_scf_pmu_0/cmem_wb_access/,nvidia_scf_pmu_0/cmem_wr_access/}'
cmem_write_test
Performance counter stats for 'system wide':
168225760 ns duration_time
10515321 nvidia_scf_pmu_0/event=cycles/
191567 nvidia_scf_pmu_0/cmem_wb_access/
0 nvidia_scf_pmu_0/cmem_wr_access/
0.168225760 seconds time elapsed
Coresight System PMU Traffic Coverage#
The traffic pattern determines which PMU is used for measuring the different access types, and this might vary depending on the chip configuration.
Grace CPU Superchip#
The following table summarizes the Coresight System PMU events that can be used to monitor different memory traffic types in the Grace CPU Superchip.
| PMU | Event | Traffic |
|---|---|---|
| SCF PMU | *SCF_CACHE* | Access to last level cache. |
| SCF PMU | CMEM* | Memory traffic from all CPUs (local and remote socket) to local memory. For example, a measurement using the socket 0 PMU counts the traffic from CPUs in socket 0 and socket 1 to the memory in socket 0. |
| SCF PMU | SOCKET* or REMOTE* | Memory traffic from CPUs in the local socket to remote memory over C2C. |
| PCIe PMU | *LOC | Memory traffic from PCIe root ports to local memory. |
| PCIe PMU | *REM | Memory traffic from PCIe root ports to remote memory. |
| NVLINK-C2C0 PMU | *LOC | Memory traffic from PCIe of the remote socket to local memory over C2C. For example, a measurement using the socket 0 PMU counts the traffic from PCIe of socket 1. Note: The write traffic is limited to PCIe relaxed-ordered writes. |
| NVLINK-C2C1 PMU | | Unused. Measurement on this PMU will return a zero count. |
| CNVLINK PMU | | Unused. Measurement on this PMU will return a zero count. |
*The prefix or suffix of the PMU event name.
CPU Traffic Measurement#
This section provides information about measuring CPU traffic.

[Figure: CPU Traffic Measurement]
The local CPU traffic generates CMEM* events on the source socket's SCF PMU. The remote CPU traffic generates the following events:
- REMOTE* or SOCKET* events on the source socket's SCF PMU.
- CMEM* events on the destination socket's SCF PMU.
Scalable Coherency Fabric PMU Accounting contains the performance metrics that can be derived from these events.
Here are some examples of CPU traffic measurements that capture the write size in bytes and read size in data beats, where each data beat transfers up to 32 bytes.
# Local traffic: Socket-0 CPU write to socket-0 DRAM.
# The workload performs a 1 GB write and the CPU and memory are pinned to socket-0.
# The result from SCF PMU in socket-0 shows 1,009,299,148 bytes of CPU writes to socket-0
# memory.
nvidia@localhost:~$ sudo perf stat -a -e
duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}'
numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/tmp/node0 bs=1M
count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.125672 s, 8.5 GB/s
Performance counter stats for 'system wide':
27,496,157 ns duration_time
1,009,299,148 nvidia_scf_pmu_0/cmem_wr_total_bytes/
1,213,938 nvidia_scf_pmu_0/cmem_rd_data/
1,355,223 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
7,960 nvidia_scf_pmu_1/remote_socket_rd_data/
0.127496157 seconds time elapsed
# Local traffic: Socket-0 CPU read from socket-0 DRAM
# The workload performs a 1 GB read and the CPU and memory are pinned to socket-0.
# The result from SCF PMU in socket-0 shows 35,572,420 data beats of CPU reads from socket-0
# memory. Converting it to bytes: 35,572,420 x 32 bytes = 1,138,317,440 bytes.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}' numactl --cpunodebind=0 --membind=0 dd of=/dev/null if=/tmp/node0 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0872731 s, 12.3 GB/s
Performance counter stats for 'system wide':
88,826,372 ns duration_time
36,057,808 nvidia_scf_pmu_0/cmem_wr_total_bytes/
35,572,420 nvidia_scf_pmu_0/cmem_rd_data/
24,173 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
4,728 nvidia_scf_pmu_1/remote_socket_rd_data/
0.088826372 seconds time elapsed
# Remote traffic: Socket-1 CPU write to socket-0 DRAM
# The workload performs a 1 GB write. The CPU is pinned to socket-1 and memory is pinned to socket-0.
# The result from SCF PMU in socket-1 shows 961,728,219 bytes of remote writes to socket-0
# memory. The remote writes are also captured by the destination (socket-0) SCF PMU that shows
# 993,278,696 bytes of any CPU writes to socket-0 memory.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}'
numactl --cpunodebind=1 --membind=0 dd if=/dev/zero of=/tmp/node0 bs=1M
count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.171349 s, 6.3 GB/s
Performance counter stats for 'system wide':
172,847,464 ns duration_time
993,278,696 nvidia_scf_pmu_0/cmem_wr_total_bytes/
1,040,902 nvidia_scf_pmu_0/cmem_rd_data/
961,728,219 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
1,065,136 nvidia_scf_pmu_1/remote_socket_rd_data/
0.172847464 seconds time elapsed
# Remote traffic: Socket-1 CPU read from socket-0 DRAM
# The workload performs a 1 GB read. The CPU is pinned to socket-1 and memory is pinned to socket-0.
# The result from SCF PMU in socket-1 shows 36,189,087 data beats of remote reads from socket-0
# memory. The remote reads are also captured by the destination (socket-0) SCF PMU that shows
# 33,542,984 data beats of any CPU read from socket-0 memory.
# Converting it to bytes:
# - Socket-1 CPU remote read: 36,189,087 x 32 bytes = 1,158,050,784 bytes.
# - Any CPU read from socket-0 memory = 33,542,984 x 32 bytes = 1,073,375,488 bytes.
nvidia@localhost:~$ sudo perf stat -a -e
duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}'
numactl --cpunodebind=1 --membind=0 dd of=/dev/null if=/tmp/node0 bs=1M
count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.133185 s, 8.1 GB/s
Performance counter stats for 'system wide':
134,526,031 ns duration_time
19,588,224 nvidia_scf_pmu_0/cmem_wr_total_bytes/
33,542,984 nvidia_scf_pmu_0/cmem_rd_data/
18,771,438 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
36,189,087 nvidia_scf_pmu_1/remote_socket_rd_data/
0.134526031 seconds time elapsed
PCIe Traffic Measurement#
This section provides information about measuring PCIe traffic.

[Figure: PCIe Traffic Measurement]
The local PCIe traffic generates *LOC events on the source socket's PCIe PMU, and the remote PCIe traffic generates the following events:
- *REM events on the source socket's PCIe PMU.
- *LOC events on the destination socket's NVLINK-C2C0 PMU.
PCIe PMU Accounting contains the performance metrics that can be derived from these events.
The PCIe PMU monitors traffic from PCIe root ports to local or remote memory. The Grace SoC supports up to 10 PCIe root ports. At least one root port must be specified to the perf tool using the root_port bitmap filter. For example, the root_port=0xF filter corresponds to root ports 0 through 3.
Example: To count the rd_bytes_loc event from PCIe root ports 0 and 1 of socket 0:
$ perf stat -a -e nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x3/
Example: To count the rd_bytes_rem event from PCIe root port 2 in socket 1 and measure the test duration (in nanoseconds):
$ perf stat -a -e duration_time,'{nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x4/}'
Here is an example of finding the root port number of a PCIe device.
# Find the root port number of the NVMe drive on each socket.
# nvme0 is attached to domain number 0018, which corresponds to socket-1 root port 0x8.
# nvme1 is attached to domain number 0008, which corresponds to socket-0 root port 0x8.
nvidia@localhost:~$ readlink -f /sys/class/nvme/nvme*
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvme/nvme0
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/0008:02:00.0/0008:03:00.0/nvme/nvme1
# Query the mount point of each nvme drive to get the path for the workloads below.
nvidia@localhost:~$ df -H | grep nvme0
/dev/nvme0n1p2 983G 625G 309G 67% /
nvidia@localhost:~$ df -H | grep nvme1
/dev/nvme1n1 472G 1.1G 447G 1% /var/data0
Here are some examples of PCIe traffic measurements when reading data from DRAM and storing the result to an NVME drive. These capture the size of read/write to local/remote memory in bytes.
# Local traffic: Socket-0 PCIE read from socket-0 DRAM.
# The workload copies data from socket-0 DRAM to socket-0 NVME in root port 0x8. The NVME would
# perform a read request to DRAM.
# The root_port filter value: 1 << root port number = 1 << 8 = 0x100
# The result from PCIe PMU in socket-0 shows 1,168,472,064 bytes of PCIe read from socket-0 memory.
nvidia@localhost:~$ sudo perf stat -a -e
duration_time,'{nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_0/wr_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0x100/,nvidia_pcie_pmu_0/wr_bytes_rem,root_port=0x100/}'
numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/var/data0/pcie_test_0 bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.96463 s, 547 MB/s
Performance counter stats for 'system wide':
1,966,391,711 ns duration_time
1,168,472,064 nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x100/
31,250,176 nvidia_pcie_pmu_0/wr_bytes_loc,root_port=0x100/
49,152 nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0x100/
0 nvidia_pcie_pmu_0/wr_bytes_rem,root_port=0x100/
1.966391711 seconds time elapsed
# Remote traffic: Socket-1 PCIE read from socket-0 DRAM.
# The workload copies data from socket-0 DRAM to socket-1 NVME in root port 0x8. The NVME would perform a read request to DRAM.
# The root_port filter value: 1 << root port number = 1 << 8 = 0x100
# The result from PCIe PMU in socket-1 shows 1,073,762,304 bytes of PCIe read from socket-0 memory.
# The remote reads are also captured by the destination (socket-0) NVLINK C2C0 PMU that shows 1,074,057,216 bytes of reads.
nvidia@localhost:~$ sudo perf stat -a -e
duration_time,'{nvidia_pcie_pmu_1/rd_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_1/wr_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x100/,nvidia_pcie_pmu_1/wr_bytes_rem,root_port=0x100/}','{nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}' numactl --membind=0 dd if=/dev/zero of=/home/nvidia/pcie_test_0 bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.73373 s, 1.5 GB/s
Performance counter stats for 'system wide':
735,201,612 ns duration_time
6,398,720 nvidia_pcie_pmu_1/rd_bytes_loc,root_port=0x100/
164,096 nvidia_pcie_pmu_1/wr_bytes_loc,root_port=0x100/
1,073,762,304 nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x100/
0 nvidia_pcie_pmu_1/wr_bytes_rem,root_port=0x100/
1,074,057,216 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
32,768 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
0.735201612 seconds time elapsed
Grace Hopper Superchip#
The following table contains a summary of the Coresight System PMU events that can be used to monitor different memory traffic types in Grace Hopper Superchip.
| PMU | Event | Traffic |
|---|---|---|
| SCF PMU | *SCF_CACHE* | Access to last level cache. |
| SCF PMU | CMEM* | Memory traffic from CPU to CPU memory. |
| SCF PMU | GMEM* | Memory traffic from CPU to GPU memory over C2C. |
| PCIe PMU | *LOC | Memory traffic from PCIe root ports to CPU or GPU memory. |
| NVLINK-C2C0 PMU | *LOC | ATS translated or EGM memory traffic from GPU to CPU memory over C2C. |
| NVLINK-C2C1 PMU | *LOC | Non-ATS translated memory traffic from GPU to CPU or GPU memory over C2C. |
| CNVLINK PMU | | Unused. Measurement on this PMU will return a zero count. |
* The prefix or suffix of a PMU event name.
Refer to Grace CPU Superchip for more information about CPU and PCIe traffic measurement.
GPU Traffic Measurement#
This section provides information about measuring the GPU traffic.

[Figure: GPU Traffic Measurement]
The GPU traffic generates *LOC events on the Grace Coresight System C2C PMUs: the NVLINK-C2C0 PMU captures ATS-translated or EGM memory traffic, and the NVLINK-C2C1 PMU captures non-ATS-translated traffic. The solid line in the figure indicates only that the path from the SoC to memory is captured; the C2C PMUs do not capture the latency effect of the C2C interconnect. Refer to NVLink C2C PMU Accounting for more information about the performance metrics that can be derived from these events.
Here are some examples of GPU traffic measurements that capture the size of read/write in bytes.
# GPU write to CPU DRAM.
# The workload performs a write by GPU copy engine to CPU DRAM.
# The result from NVLINK-C2C0 PMU shows 4,026,531,840 bytes of writes to DRAM.
nvidia@localhost:/home/nvidia/nvbandwidth$ sudo perf stat -a -e '{nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}','{nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/}'
./nvbandwidth -t 1
nvbandwidth Version: v0.4
Built from Git version: v0.4
NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.
CUDA Runtime Version: 12050
CUDA Driver Version: 12050
Driver Version: 555.42.02
Device 0: NVIDIA GH200 120GB
Running device_to_host_memcpy_ce.memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 295.07
SUM device_to_host_memcpy_ce 295.07
Performance counter stats for 'system wide':
4,234,950,656 nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/
208,418,816 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
4,026,531,840 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
26,981,632 nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/
6,337,792 nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/
20,643,840 nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/
0.777059774 seconds time elapsed
# GPU read from CPU DRAM.
# The workload performs a read by GPU copy engine from CPU DRAM.
# The result from NVLINK-C2C0 PMU shows 4,234,927,104 bytes of reads from DRAM.
nvidia@localhost:/home/nvidia/nvbandwidth$ sudo perf stat -a -e'{nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}','{nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/}'
./nvbandwidth -t 0
nvbandwidth Version: v0.4
Built from Git version: v0.4
NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.
CUDA Runtime Version: 12050
CUDA Driver Version: 12050
Driver Version: 555.42.02
Device 0: NVIDIA GH200 120GB
Running host_to_device_memcpy_ce.memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 408.42
SUM host_to_device_memcpy_ce 408.42
Performance counter stats for 'system wide':
4,436,253,696 nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/
4,234,927,104 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
201,326,592 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
26,949,888 nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/
6,356,480 nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/
20,593,408 nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/
0.771266524 seconds time elapsed
Scalable Coherency Fabric PMU Accounting#
SCF PMU Events#
The following tables provide the events exposed by the PMU driver and perf tool.
| Event Name | ID | Description |
|---|---|---|
| CYCLES | 0x100000000 | SCF clock cycles. |
| BUS_CYCLES | 0x1d | Same as CYCLES. |
| Event Name | ID | Description |
|---|---|---|
| SCF_CACHE_ALLOCATE | 0xf0 | Read / Write / Atomic / Prefetch requests which are allocated in LL Cache. |
| SCF_CACHE_REFILL | 0xf1 | Read / Atomic / StashOnce requests which miss in Last Level cache and inner caches, resulting in a request to the next level memory hierarchy (# of Read requests which miss in CTAG and DIR and fetch data from L4). |
| SCF_CACHE | 0xf2 | LL Cache Read / Write / Atomic / Snoop / Prefetch requests not resulting in decode errors. |
| SCF_CACHE_WB | 0xf3 | Number of capacity evictions from LL Cache. |
| Event Name | ID | Description |
|---|---|---|
| CMEM_RD_DATA | 0x1a5 | Counts the total number of SCF Read data beats transferred from Local Grace DRAM to Local or Remote CPU. Each data beat transfers up to 32 bytes. |
| CMEM_RD_ACCESS | 0x1a6 | Counts the number of SCF Read accesses from Local or Remote CPU to Local Grace DRAM. Increments by total new Read accesses each cycle. |
| CMEM_RD_OUTSTANDING | 0x1a7 | Counts the total number of outstanding SCF Read accesses from Local or Remote CPU to Local socket Grace DRAM. Increments by total outstanding count each cycle. |
| CMEM_WB_DATA | 0x1ab | Counts the total number of SCF Write-back data beats (clean/dirty) transferred from Local or Remote CPU to Local Grace DRAM (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| CMEM_WB_ACCESS | 0x1ac | Counts the number of SCF Write-back accesses (clean/dirty) from Local or Remote CPU to Local Grace DRAM (excluding Write-unique/non-coherent Writes). Increments by total new Write accesses each cycle. Each Write-back access is a 64-byte transfer. |
| CMEM_WR_DATA | 0x1b1 | Counts the number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes (including WriteZeros, which is counted as 64 bytes) from Local or Remote CPU to Local Grace DRAM. |
| CMEM_WR_TOTAL_BYTES | 0x1db | Counts the total number of bytes transferred to Local Grace DRAM by Write-backs, Write-unique and non-coherent Writes from Local or Remote CPU. |
| CMEM_WR_ACCESS | 0x1ca | Counts the total number of SCF Write-unique and non-coherent Write (including WriteZeros) requests from Local or Remote CPU to Local Grace DRAM. |
| Event Name | ID | Description |
|---|---|---|
| GMEM_RD_DATA | 0x16d | Counts the total number of Read data beats transferred from Local coherent GPU DRAM to Local or Remote CPU. Each data beat transfers up to 32 bytes. |
| GMEM_RD_ACCESS | 0x16e | Counts the number of SCF Read accesses from Local or Remote CPU to Local coherent GPU DRAM. Increments by total new Read accesses each cycle. |
| GMEM_RD_OUTSTANDING | 0x16f | Indicates the total number of outstanding Read requests from Local or Remote CPU to coherent GPU DRAM. Increments by outstanding Read request count each cycle. |
| GMEM_WB_DATA | 0x173 | Counts the total number of SCF Write-back data beats transferred from Local or Remote CPU to Local coherent GPU DRAM (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| GMEM_WB_ACCESS | 0x174 | Counts the number of SCF Write-back accesses (clean and dirty) from Local or Remote CPU to Local coherent GPU DRAM (excluding Write-unique/non-coherent Writes). Increments by total new Write accesses each cycle. Each Write-back access is a 64-byte transfer. |
| GMEM_WR_DATA | 0x179 | Counts the number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes from Local or Remote CPU to Local coherent GPU DRAM. |
| GMEM_WR_ACCESS | 0x17b | Counts the total number of SCF Write-unique and non-coherent Write requests from Local or Remote CPU to Local coherent GPU DRAM. |
| GMEM_WR_TOTAL_BYTES | 0x1a0 | Counts the total number of bytes transferred to Local coherent GPU DRAM by Write-backs, Write-unique and non-coherent Writes. |
| Event Name | ID | Description |
|---|---|---|
| SOCKET_0_RD_DATA | 0x101 | Counts the total number of SCF Read data beats transferred from CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory to Local CPU. Each data beat transfers up to 32 bytes. |
| SOCKET_1_RD_DATA | 0x102 | Counts the total number of SCF Read data beats transferred from CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory to Local CPU. Each data beat transfers up to 32 bytes. |
| SOCKET_2_RD_DATA | 0x103 | Counts the total number of SCF Read data beats transferred from CNVLINK socket 2 memory to Local CPU. Each data beat transfers up to 32 bytes. |
| SOCKET_3_RD_DATA | 0x104 | Counts the total number of SCF Read data beats transferred from CNVLINK socket 3 memory to Local CPU. Each data beat transfers up to 32 bytes. |
| SOCKET_0_WB_DATA | 0x109 | Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| SOCKET_1_WB_DATA | 0x10a | Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| SOCKET_2_WB_DATA | 0x10b | Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 2 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| SOCKET_3_WB_DATA | 0x10c | Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 3 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data. |
| SOCKET_0_WR_DATA | 0x17c | Counts the total number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. |
| SOCKET_1_WR_DATA | 0x17d | Counts the total number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. |
| SOCKET_2_WR_DATA | 0x17e | Counts the total number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 2 memory. |
| SOCKET_3_WR_DATA | 0x17f | Counts the total number of bytes transferred, based on size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 3 memory. |
| SOCKET_0_RD_OUTSTANDING | 0x115 | Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. Increments by total outstanding count each cycle. |
| SOCKET_1_RD_OUTSTANDING | 0x116 | Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. Increments by total outstanding count each cycle. |
| SOCKET_2_RD_OUTSTANDING | 0x117 | Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 2 memory. Increments by total outstanding count each cycle. |
| SOCKET_3_RD_OUTSTANDING | 0x118 | Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 3 memory. Increments by total outstanding count each cycle. |
| SOCKET_0_RD_ACCESS | 0x12d | Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. Increments by total new Read accesses each cycle. |
| SOCKET_1_RD_ACCESS | 0x12e | Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. Increments by total new Read accesses each cycle. |
| SOCKET_2_RD_ACCESS | 0x12f | Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 2 memory. Increments by total new Read accesses each cycle. |
| SOCKET_3_RD_ACCESS | 0x130 | Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 3 memory. Increments by total new Read accesses each cycle. |
| SOCKET_0_WB_ACCESS | 0x135 | Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. |
| SOCKET_1_WB_ACCESS | 0x136 | Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. |
| SOCKET_2_WB_ACCESS | 0x137 | Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 2 memory. |
| SOCKET_3_WB_ACCESS | 0x138 | Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 3 memory. |
| SOCKET_0_WR_ACCESS | 0x139 | Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. |
| SOCKET_1_WR_ACCESS | 0x13a | Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. |
| SOCKET_2_WR_ACCESS | 0x13b | Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 2 memory. |
| SOCKET_3_WR_ACCESS | 0x13c | Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 3 memory. |
| REMOTE_SOCKET_WR_TOTAL_BYTES | 0x1a1 | Counts the total number of bytes transferred to Remote sockets' memory by Write-backs, Write-unique, non-coherent Writes, and all atomics. For the non-coherent Writes, the number of bytes is based on the size field. |
| REMOTE_SOCKET_RD_DATA | 0x1a2 | Counts the total number of Read data beats transferred from all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Each data beat transfers up to 32 bytes. |
| REMOTE_SOCKET_RD_OUTSTANDING | 0x1a3 | Counts the total number of outstanding Read accesses to all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Increments by total outstanding count each cycle. |
| REMOTE_SOCKET_RD_ACCESS | 0x1a4 | Counts the number of Read accesses to all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Increments by total new Read accesses each cycle. |
SCF Maximum Transport Request Utilization#
| Target | Max Read Requests per Cycle | Max Write Requests per Cycle |
|---|---|---|
| CMEM | 8 (RD_ACCESS) | 8 (WR_ACCESS or WB_ACCESS) |
| GMEM | 4 (RD_ACCESS) | 4 (WR_ACCESS or WB_ACCESS) |
| REMOTE | 2 (RD_ACCESS) | 2 (WR_ACCESS or WB_ACCESS) |
SCF PMU Metrics#
This section provides additional formulas for useful performance metrics based on events provided by the Scalable Coherency Fabric (SCF) PMU.
Duration: Duration of the measurement in nanoseconds.
Event:
DURATION_TIME
Cycles: SCF cycle count
Event:
CYCLES
SCF Frequency: Frequency of SCF cycles in GHz
Metric formula:
CYCLES/DURATION_TIME
Local CPU memory write bandwidth: Write bandwidth to local CPU memory, in GBps.
Source:
Grace CPU Superchip: local and remote CPU
Single socket Grace Hopper Superchip: local CPU
Destination: local CPU memory
Metric formula:
CMEM_WR_TOTAL_BYTES/DURATION_TIME
Local CPU memory read bandwidth: Read bandwidth from local CPU memory, in GBps.
Source:
Grace CPU Superchip: local and remote CPU
Single socket Grace Hopper Superchip: local CPU
Destination: local CPU memory
Metric formula:
CMEM_RD_DATA*32/DURATION_TIME
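For example, both local CPU memory bandwidth metrics can be computed from a single system-wide measurement of socket 0. A minimal sketch that reuses the PMU and event names from the examples earlier in this chapter (bytes divided by nanoseconds gives GB/s):
$ perf stat -x, -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}' -o scf.csv -- sleep 10
$ awk -F, '
    $3=="duration_time"        {ns=$1}
    $3 ~ /cmem_wr_total_bytes/ {wr=$1}
    $3 ~ /cmem_rd_data/        {rd=$1}
    END {printf "Local CPU memory write bandwidth: %.2f GB/s\nLocal CPU memory read bandwidth:  %.2f GB/s\n", wr/ns, rd*32/ns}' scf.csv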
Local GPU memory write bandwidth: Write bandwidth to local GPU memory, in GBps.
Source: local CPU
Destination:
Grace CPU Superchip: N/A
Single socket Grace Hopper Superchip: local GPU memory
Metric formula:
GMEM_WR_TOTAL_BYTES/DURATION_TIME
Local GPU memory read bandwidth: Read bandwidth from local GPU memory, in GBps.
Source: local CPU
Destination:
Grace CPU Superchip: N/A
Single socket Grace Hopper Superchip: local GPU memory
Metric formula:
GMEM_RD_DATA*32/DURATION_TIME
Remote memory write bandwidth: Write bandwidth to remote socket memory, in GBps.
Source: local CPU
Destination:
Grace CPU Superchip: remote CPU memory
Single socket Grace Hopper Superchip: N/A
Metric formula:
REMOTE_SOCKET_WR_TOTAL_BYTES/DURATION_TIME
Remote memory read bandwidth: Read bandwidth from remote socket memory, in GBps.
Source: local CPU
Destination:
Grace CPU Superchip: remote CPU memory
Single socket Grace Hopper Superchip: N/A
Metric formula:
REMOTE_SOCKET_RD_DATA*32/DURATION_TIME
Local CPU memory write utilization percentage: Percent usage of writes to local CPU memory.
Source:
Grace CPU Superchip: local and remote CPU
Single socket Grace Hopper Superchip: local CPU
Destination: local CPU memory
Metric formula:
((CMEM_WB_ACCESS + CMEM_WR_ACCESS)/(8*CYCLES)) * 100.0
Local GPU memory write utilization percentage: Percent usage of writes to local GPU memory.
Source: local CPU
Destination:
Grace CPU Superchip: N/A
Single socket Grace Hopper Superchip: local GPU memory
Metric formula:
((GMEM_WB_ACCESS + GMEM_WR_ACCESS)/(4*CYCLES)) * 100.0
Remote memory write utilization percentage: Percent usage of writes to the remote socket memory.
Source: local CPU
Destination:
Grace CPU Superchip: remote CPU memory
Single socket Grace Hopper Superchip: N/A
Metric formula:
Socket 0 access to socket 1 memory (measured in socket 0) :
((SOCKET_1_WB_ACCESS + SOCKET_1_WR_ACCESS)/(2*CYCLES)) * 100.0
Socket 1 access to socket 0 memory (measured in socket 1) :
((SOCKET_0_WB_ACCESS + SOCKET_0_WR_ACCESS)/(2*CYCLES)) * 100.0
Local CPU memory read utilization percentage: Percent usage of reads from local CPU memory.
Source:
Grace CPU Superchip: local and remote CPU
Single socket Grace Hopper Superchip: local CPU
Destination: local CPU memory
Metric formula:
((CMEM_RD_ACCESS)/(8*CYCLES)) * 100.0
Local GPU memory read utilization percentage: Percent usage of reads from local GPU memory.
Source: local CPU
Destination:
Grace CPU Superchip: N/A
Single socket Grace Hopper Superchip: local GPU memory
Metric formula:
((GMEM_RD_ACCESS)/(4*CYCLES)) * 100.0
Remote memory read utilization percentage: Percent usage of reads from remote socket memory.
Source: local CPU
Destination:
Grace CPU Superchip: remote CPU memory
Single socket Grace Hopper Superchip: N/A
Metric formula:
Socket 0 access to socket 1 memory (measured in socket 0) :
((SOCKET_1_RD_ACCESS)/(2*CYCLES)) * 100.0
Socket 1 access to socket 0 memory (measured in socket 1) :
((SOCKET_0_RD_ACCESS)/(2 * CYCLES)) * 100.0
Local CPU memory read latency: Average latency of reads from local CPU memory, in nanoseconds.
Source:
Grace CPU Superchip: local and remote CPU
Single socket Grace Hopper Superchip: local CPU
Destination: local CPU memory
Metric formula:
(CMEM_RD_OUTSTANDING/CMEM_RD_ACCESS)/(CYCLES/DURATION)
Local GPU memory read latency: Average latency of reads from local GPU memory, in nanoseconds.
Source: local CPU
Destination:
Grace CPU Superchip: N/A
Single socket Grace Hopper Superchip: local GPU memory
Metric formula:
(GMEM_RD_OUTSTANDING/GMEM_RD_ACCESS)/(CYCLES/DURATION)
Remote memory read latency: Average latency of reads from remote memory in nanoseconds.
Source: local CPU
Destination:
Grace CPU Superchip: remote CPU memory
Single socket Grace Hopper Superchip: N/A
Metric formula:
Socket 0 access to socket 1 memory (measured in socket 0):
(SOCKET_1_RD_OUTSTANDING/SOCKET_1_RD_ACCESS)/(CYCLES/DURATION)
Socket 1 access to socket 0 memory (measured in socket 1):
(SOCKET_0_RD_OUTSTANDING/SOCKET_0_RD_ACCESS)/(CYCLES/DURATION)
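As an example, the local CPU memory read latency metric combines the outstanding and access counts with the SCF cycle count and the measurement duration. A minimal sketch for socket 0, assuming the SCF cycle counter is specified as event=cycles, as in the grouping example earlier in this chapter:
$ perf stat -x, -a -e duration_time,'{nvidia_scf_pmu_0/event=cycles/,nvidia_scf_pmu_0/cmem_rd_outstanding/,nvidia_scf_pmu_0/cmem_rd_access/}' -o lat.csv -- sleep 10
$ awk -F, '
    $3=="duration_time"        {ns=$1}
    $3 ~ /event=cycles/        {cyc=$1}
    $3 ~ /cmem_rd_outstanding/ {outs=$1}
    $3 ~ /cmem_rd_access/      {acc=$1}
    END {printf "Local CPU memory read latency: %.1f ns\n", (outs/acc)/(cyc/ns)}' lat.csv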
PCIe PMU Accounting#
PCIe PMU Events#
The following tables provide the events exposed by the PMU driver and perf tool.
| Event Name | ID | Description |
|---|---|---|
| CYCLES | 0x100000000 | PCIe PMU cycle counter. |
| WR_REQ_LOC | 0x8 | Counts the number of PCIe Write requests to Local CMEM or GMEM. |
| RD_REQ_LOC | 0x6 | Counts the number of PCIe Read requests to Local CMEM or GMEM. |
| TOTAL_REQ_LOC | 0xa | Counts the number of PCIe Read/Write requests to Local CMEM or GMEM. |
| WR_BYTES_LOC | 0x2 | Counts the number of PCIe Write bytes to Local CMEM or GMEM. |
| RD_BYTES_LOC | 0x0 | Counts the number of PCIe Read bytes to Local CMEM or GMEM. |
| TOTAL_BYTES_LOC | 0x4 | Counts the number of PCIe Read/Write bytes to Local CMEM or GMEM. |
| RD_CUM_OUTS_LOC | 0xc | Accumulates PCIe outstanding Read request cycles to Local CMEM or GMEM. |
| WR_REQ_REM | 0x9 | Counts the number of PCIe Write requests to Remote CMEM or GMEM. |
| RD_REQ_REM | 0x7 | Counts the number of PCIe Read requests to Remote CMEM or GMEM. |
| TOTAL_REQ_REM | 0xb | Counts the number of PCIe Read/Write requests to Remote CMEM or GMEM. |
| WR_BYTES_REM | 0x3 | Counts the number of PCIe Write bytes to Remote CMEM or GMEM. |
| RD_BYTES_REM | 0x1 | Counts the number of PCIe Read bytes to Remote CMEM or GMEM. |
| TOTAL_BYTES_REM | 0x5 | Counts the number of PCIe Read/Write bytes to Remote CMEM or GMEM. |
| RD_CUM_OUTS_REM | 0xd | Accumulates PCIe outstanding Read request cycles to Remote CMEM or GMEM. |
PCIe Maximum Transport Request Utilization#
| Source (Target: Local/Remote) | Max Read Requests per Cycle | Max Write Requests per Cycle |
|---|---|---|
| Root Ports 0-9 | 1 per port | 1 32B-request per port |
PCIe PMU Metrics#
This section provides additional formulas for useful performance metrics based on events provided by the PCIe PMU.
Note
All PCIe metrics that follow are affected by the root port filter. If you do not define a root port filter, the metric results will be zero (refer to PCIe Traffic Measurement for more information).
PCIe RP Frequency: Frequency of PCIe RP memory client interface cycles in GHz.
Metric formula:
CYCLES/DURATION_TIME
PCIe RP read bandwidth: PCIe root port read bandwidth, in GBps.
Source: local PCIe root port
Destination: local and remote memory
Metric formula:
(RD_BYTES_LOC + RD_BYTES_REM)/DURATION_TIME
PCIe RP write bandwidth: PCIe root port write bandwidth, in GBps.
Source: local PCIe root port
Destination: local and remote memory
Metric formula:
(WR_BYTES_LOC + WR_BYTES_REM)/DURATION_TIME
PCIe RP bidirectional bandwidth: PCIe root port read and write bandwidth, in GBps.
Source: local PCIe root port
Destination: local and remote memory
Metric formula:
(RD_BYTES_LOC + RD_BYTES_REM + WR_BYTES_LOC + WR_BYTES_REM)/DURATION_TIME
PCIe RP read utilization percentage: Average percent utilization of PCIe root port for reads.
Source: local PCIe root port
Destination: local and remote memory
Metric formula:
((RD_REQ_LOC + RD_REQ_REM)/(10 * CYCLES) ) * 100.0
PCIe RP write utilization percentage: Average percent utilization of PCIe root port for writes.
Source: local PCIe root port
Destination: local and remote memory
Metric formula:
( (WR_REQ_LOC + WR_REQ_REM)/(10 * CYCLES) ) * 100.0
PCIe RP local memory read latency: Average latency of PCIe root port reads to local memory, in nanoseconds.
Note
This metric does not include the latency of the PCIe interconnect.
Source: local PCIe root port
Destination: local memory
Metric formula:
(RD_CUM_OUTS_LOC/RD_REQ_LOC)/(CYCLES/DURATION_TIME)
PCIe RP remote memory read latency: Average latency of PCIe root port reads to remote memory, in nanoseconds.
Note
This metric does not include the latency of the PCIe interconnect.
Source: local PCIe root port
Destination: remote memory
Metric formula:
(RD_CUM_OUTS_REM/RD_REQ_REM)/(CYCLES/DURATION_TIME)
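For example, the PCIe root port read bandwidth for root ports 0 through 3 of socket 0 (root_port=0xF, as described in PCIe Traffic Measurement) can be computed as follows; a minimal sketch:
$ perf stat -x, -a -e duration_time,'{nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0xF/,nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0xF/}' -o pcie.csv -- sleep 10
$ awk -F, '
    $3=="duration_time"  {ns=$1}
    $3 ~ /rd_bytes_loc/  {loc=$1}
    $3 ~ /rd_bytes_rem/  {rem=$1}
    END {printf "PCIe RP read bandwidth: %.2f GB/s\n", (loc+rem)/ns}' pcie.csv
Because the event name in the CSV output contains a comma (the root_port term), the awk patterns match a substring of the third field rather than the full event string.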
NVLink C2C PMU Accounting#
Note
This PMU does not monitor CPU traffic. Use the SCF PMU to capture remote CPU traffic.
The counts from this PMU do not include the interconnect latency.
The NVLink C2C PMU monitors traffic from the NVLink Chip-2-Chip (C2C) interconnect to local memory. The traffic captured by this PMU is limited to the following:
- Requests from remote PCIe devices in the Grace CPU Superchip system.
- Requests from a coherent GPU in the Grace Hopper Superchip system.
Two C2C PMUs, NVLINK-C2C0 and NVLINK-C2C1, are available and cover different types of traffic. Refer to the PMU Accounting for Access Patterns on the Grace Superchip and PMU Accounting for Access Patterns on the Grace Hopper Superchip tables above for more information about the traffic patterns covered by each PMU. The NVLINK-C2C1 PMU is unused on the Grace Superchip.
NVLink C2C PMU Events#
The following tables provide the events exposed by the PMU driver and perf tool.
| Event Name | ID | Description |
|---|---|---|
| CYCLES | 0x100000000 | NVLink-C2C0 PMU cycle counter. |
| WR_REQ_LOC | 0x8 | Counts the number of GPU ATS-translated Write requests in Grace Coherent GPU or Remote PCIe Relaxed Order Write requests in Grace CPU Superchip to Local CMEM or GMEM. |
| RD_REQ_LOC | 0x6 | Counts the number of GPU ATS-translated Read requests in Grace Coherent GPU or Remote PCIe Read requests in Grace CPU Superchip to Local CMEM or GMEM. |
| TOTAL_REQ_LOC | 0xa | Counts the number of GPU ATS-translated Read/Write requests in Grace Coherent GPU or Remote PCIe Read/Relaxed Order Write requests in Grace CPU Superchip to Local CMEM or GMEM. |
| WR_BYTES_LOC | 0x2 | Counts the number of GPU ATS-translated Write bytes in Grace Coherent GPU or Remote PCIe Relaxed Order Write bytes in Grace CPU Superchip to Local CMEM or GMEM. |
| RD_BYTES_LOC | 0x0 | Counts the number of GPU ATS-translated Read bytes in Grace Coherent GPU or Remote PCIe Read bytes in Grace CPU Superchip to Local CMEM or GMEM. |
| TOTAL_BYTES_LOC | 0x4 | Counts the number of GPU ATS-translated Read/Write bytes in Grace Coherent GPU or Remote PCIe Read/Relaxed Order Write bytes in Grace CPU Superchip to Local CMEM or GMEM. |
| RD_CUM_OUTS_LOC | 0xc | Accumulates GPU ATS-translated outstanding Read request cycles in Grace Coherent GPU or Remote PCIe outstanding Read request cycles in Grace CPU Superchip to Local CMEM or GMEM. |
| WR_REQ_REM | 0x9 | Counts the number of GPU ATS-translated Write requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_REQ_REM | 0x7 | Counts the number of GPU ATS-translated Read requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| TOTAL_REQ_REM | 0xb | Counts the number of GPU ATS-translated Read/Write requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| WR_BYTES_REM | 0x3 | Counts the number of GPU ATS-translated Write bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_BYTES_REM | 0x1 | Counts the number of GPU ATS-translated Read bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| TOTAL_BYTES_REM | 0x5 | Counts the number of GPU ATS-translated Read/Write bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_CUM_OUTS_REM | 0xd | Accumulates GPU ATS-translated outstanding Read request cycles in Grace Coherent GPU to Remote CMEM or GMEM. |
| Event Name | ID | Description |
|---|---|---|
| CYCLES | 0x100000000 | NVLink-C2C1 PMU cycle counter. |
| WR_REQ_LOC | 0x8 | Counts the number of GPU Not ATS-translated Write requests in Grace Coherent GPU to Local CMEM or GMEM. |
| RD_REQ_LOC | 0x6 | Counts the number of GPU Not ATS-translated Read requests in Grace Coherent GPU to Local CMEM or GMEM. |
| TOTAL_REQ_LOC | 0xa | Counts the number of GPU Not ATS-translated Read/Write requests in Grace Coherent GPU to Local CMEM or GMEM. |
| WR_BYTES_LOC | 0x2 | Counts the number of GPU Not ATS-translated Write bytes in Grace Coherent GPU to Local CMEM or GMEM. |
| RD_BYTES_LOC | 0x0 | Counts the number of GPU Not ATS-translated Read bytes in Grace Coherent GPU to Local CMEM or GMEM. |
| TOTAL_BYTES_LOC | 0x4 | Counts the number of GPU Not ATS-translated Read/Write bytes in Grace Coherent GPU to Local CMEM or GMEM. |
| RD_CUM_OUTS_LOC | 0xc | Accumulates GPU Not ATS-translated outstanding Read request cycles in Grace Coherent GPU to Local CMEM or GMEM. |
| WR_REQ_REM | 0x9 | Counts the number of GPU Not ATS-translated Write requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_REQ_REM | 0x7 | Counts the number of GPU Not ATS-translated Read requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| TOTAL_REQ_REM | 0xb | Counts the number of GPU Not ATS-translated Read/Write requests in Grace Coherent GPU to Remote CMEM or GMEM. |
| WR_BYTES_REM | 0x3 | Counts the number of GPU Not ATS-translated Write bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_BYTES_REM | 0x1 | Counts the number of GPU Not ATS-translated Read bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| TOTAL_BYTES_REM | 0x5 | Counts the number of GPU Not ATS-translated Read/Write bytes in Grace Coherent GPU to Remote CMEM or GMEM. |
| RD_CUM_OUTS_REM | 0xd | Accumulates GPU Not ATS-translated outstanding Read request cycles in Grace Coherent GPU to Remote CMEM or GMEM. |
NVLink C2C Maximum Transport Request Utilization#
| Source (Target: Local/Remote) | Max Read Requests per Cycle | Max Write Requests per Cycle |
|---|---|---|
| Grace Hopper Superchip in one GPU mode: 10 C2C ports/GPU | 1 per port | 1 32B-request per port |
NVLink C2C PMU Metrics#
This section provides additional formulas for useful performance metrics based on events provided by the NVLINK-C2C PMU.
NVLink C2C Client Interface Frequency: Frequency of NVLink C2C memory client interface cycles in GHz.
Metric formula:
CYCLES/DURATION_TIME
GPU or remote PCIe traffic read bandwidth: Bandwidth of read traffic, in GBps.
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
(RD_BYTES_LOC)/DURATION_TIME
GPU or remote PCIe traffic write bandwidth: Bandwidth of write traffic, in GBps.
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
(WR_BYTES_LOC)/DURATION_TIME
GPU or Remote PCIe traffic bidirectional bandwidth: Bandwidth of read and write traffic, in GBps.
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
(RD_BYTES_LOC + WR_BYTES_LOC )/DURATION_TIME
GPU or Remote PCIe traffic read utilization percentage: Average percent usage of the NVLink-C2C client interface for reads.
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
((RD_REQ_LOC)/(10 * CYCLES) ) * 100.0
GPU or Remote PCIe traffic write utilization percentage: Average usage of the NVLink-C2C client interface for writes.
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
((WR_REQ_LOC)/(10 * CYCLES) ) * 100.0
GPU or Remote PCIe to local memory read latency: Average latency of reads to local memory, in nanoseconds.
Note
This metric does not include the latency of the NVLink-C2C interconnect (refer to GPU Traffic Measurement for more information).
Source: GPU or PCIe of remote socket
Destination: local memory
Metric formula:
(RD_CUM_OUTS_LOC/RD_REQ_LOC)/(CYCLES/DURATION_TIME)
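For example, the bidirectional bandwidth from the GPU (or remote PCIe) into socket 0 local memory can be derived from two events; a minimal sketch using the event names from the tables above:
$ perf stat -x, -a -e duration_time,'{nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}' -o c2c.csv -- sleep 10
$ awk -F, '
    $3=="duration_time" {ns=$1}
    $3 ~ /rd_bytes_loc/ {rd=$1}
    $3 ~ /wr_bytes_loc/ {wr=$1}
    END {printf "GPU or remote PCIe bidirectional bandwidth: %.2f GB/s\n", (rd+wr)/ns}' c2c.csv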
Profiling CPU Behavior with Nsight Systems#
The Nsight Systems tool (also referred to as nsys) profiles the system’s compute units including the CPUs and GPUs (refer to Nsight Systems for more information). The tool can trace more than 25 APIs including CUDA APIs, sample CPU instruction pointers/backtraces, sample the CPU and SoC event counts, and sample GPU hardware event counts to provide a system-wide view of a workload’s behavior.
nsys can sample CPU and SoC events and graph their rates on the nsys UI timeline. It can also generate the metrics described in the sections from Grace CPU Performance Metrics through NVLink C2C PMU Accounting and graph them on the nsys UI timeline. If CPU IP/backtrace data is gathered concurrently, you can determine when CPU and SoC events are extremely active (or inactive) and correlate that information with the IP/backtrace data to determine which part of the workload was actively running at that time.
The graphic below shows a sample nsys profile timeline. In this case, two Grace C2C0 socket metrics were collected in addition to the IPC (Instructions per Cycle) core metric on each CPU. The C2C0 metrics (C2C0 read and write utilization percentage) show GPU access to the CPU’s memory. The IPC metric shows that thread 2965407, which is running on CPU 146, is memory bound (the IPC value is ~0.05) right before the C2C0 activity. The orange-yellow tick marks under thread 2965407 represent individual instruction pointer/backtrace samples. Users can hover over these samples to get a backtrace that represents the code that the thread was executing at that time. This data can be used to understand what the workload is doing at that time.

[Figure: An Example nsys Timeline]
Use the --cpu-core-metrics, --cpu-socket-metrics, and --sample nsys CLI switches to collect the above data. Also see the --cpu-core-events, --cpu-socket-events, and --cpuctxsw nsys CLI switches, which are used to profile CPU and/or SoC performance issues. For more information, run the nsys profile --help command.
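For reference, a possible invocation that combines CPU sampling with core and socket metrics is sketched below. The switch names come from the paragraph above, but the metric-set values are placeholders (shown as <...>), and the accepted values depend on your Nsight Systems version, so check nsys profile --help:
$ nsys profile --sample=process-tree --cpuctxsw=process-tree \
    --cpu-core-metrics=<metric-set> --cpu-socket-metrics=<metric-set> \
    -o grace_profile ./a.out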
nsys also provides specific support for multi-node performance analysis.
When you run nsys in a multi-node system, it typically generates one result file per rank on the system. While you can load multiple result files into the nsys UI for visualization, the Nsight Systems Multi-Report Analysis system allows you to run statistical analysis across all of the result files (refer to Nsight Systems Multi-Report Analysis for more information).
The nsys Multi-Report Analysis system includes multiple Python scripts, called recipes, that provide different analyses (see the example after this list). Example recipes include:
- cuda_api_sum: CUDA API Summary
- cuda_gpu_kern_hist: CUDA GPU Kernel Duration Histogram
- gpu_metric_util_map: GPU Metric Utilization Heatmap
- mpi_gpu_time_util_map: MPI and GPU Time Utilization Heatmap
- network_sum: Network Traffic Summary
- nvtx_sum: NVTX Range Summary
- ucx_gpu_time_util_map: UCX and GPU Time Utilization Heatmap
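Recipes are typically run from the command line against one or more collected reports. The following is a minimal sketch that assumes a recent Nsight Systems release providing the nsys recipe subcommand and a hypothetical ./reports directory; check nsys recipe --help for the options supported by your version:
$ nsys recipe cuda_api_sum --input ./reports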
Multiple nsys reports can be viewed on a common timeline within the nsys UI. Opening multiple reports together can help identify multi-node issues (refer to Viewing Multiple Nsight Systems Reports for more information on how to open multiple reports on a common timeline). Note that opening multiple reports together can require a large amount of memory. To mitigate this issue, collect the minimum amount of data required to capture the workload’s behavior of interest.
nsys provides specific support for handling application launchers used in multi-node environments such as MPI, OpenSHMEM, UCX, NCCL, NVSHMEM, and SLURM. Refer to Viewing Multiple Nsight Systems Reports for more information.
nsys includes support for profiling many different network communication protocols, including MPI, OpenSHMEM, and UCX. It can also profile Network Interface Cards (NICs) and InfiniBand network switches (refer to Viewing Multiple Nsight Systems Reports for more information).
Additional Tools#
HPC Benchmarks#
NVIDIA provides the NVIDIA HPC Benchmarks collection via NGC, which can be used to confirm that your multi-node setup is working as expected. Check out the HPL, HPL-MxP, and HPCG benchmarks (refer to NVIDIA HPC Benchmarks for more information).
PyTorch Support#
The open source NVIDIA nvidia-resiliency-ext Python package can help when running PyTorch-based workloads in a multi-node environment. The nvidia-resiliency-ext tools provide fault tolerance, straggler detection, and checkpointing support (refer to NVIDIA nvidia-resiliency-ext for more information). The checkpoint functionality minimizes wasted time by allowing workloads to be restarted after failures.