Measuring Workload Performance with Hardware Performance Counters#

Many software performance analysis tools rely on event counts from hardware performance monitoring units (PMUs) to characterize workload performance. This chapter provides information about how data from PMUs can be gathered and combined to form metrics for performance optimization. For simplicity, the Linux perf tool is used, but the same metrics can be used in any tool that gathers hardware performance events from PMUs, for example, NVIDIA Nsight Systems.

Introduction to Linux perf#

The Linux perf tool is a widely available, open-source tool that is used to collect application-level and system-level performance data. perf can monitor a rich set of software and hardware performance events from different sources and, in many cases, does not require administrator privileges (for example, root).

Installing perf depends on the distribution:

  • Ubuntu: apt install linux-tools-$(uname -r)

  • Red Hat: dnf install perf

  • SLES: zypper install perf

perf gathers data from PMUs by using the perf_event system that is provided by the Linux kernel. Data can be gathered for the lifetime of a process or for a specific period.

perf supports the following measurement types:

  • Performance summary (perf stat): perf collects the total PMU event counts for a workload and provides a high-level summary of the basic performance characteristics.

    This is a good approach for a first-pass performance analysis.

  • Event-based sampling (perf record): perf periodically gathers PMU event counters and relates this data to source code locations.

    The event counts are gathered when a specially configured event counter overflows, which raises an interrupt during which the instruction pointer address and register state are captured. perf uses this information to build stack traces and function-level annotations that characterize the performance of functions of interest on specific call paths. This is a good approach for detailed performance characterization after an initial workload characterization. After gathering data with the perf record command, analyze it with the perf report or perf annotate commands.

Configuring Perf#

By default, unprivileged users can only gather information about context-switched events, which includes most of the predefined CPU core events, such as cycle counts, instruction counts, and some software events. To give unprivileged access to all PMUs and global measurements, the following system settings need to be configured:

Note

To configure these settings, you must have root access.

  • perf_event_paranoid: This setting controls privilege checks in the kernel.

    Setting this to -1 or 0 allows non-root users to perform per-process and system-wide performance monitoring (refer to Unprivileged users for more information).

  • kptr_restrict: This setting affects how kernel addresses are exposed.

    Setting it to 0 assists in kernel symbol resolution.

For example:

$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict

To make these settings persist across reboots, follow your Linux distribution’s instructions for configuring system parameters. Typically, you need to edit /etc/sysctl.conf or create a file in the /etc/sysctl.d/ folder that contains the following lines:

kernel.perf_event_paranoid=-1
kernel.kptr_restrict=0

Warning

There are security implications for configuring these settings as shown in the example above. You must read and understand the relevant Linux kernel documentation and consult your system administrator.
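
For example, assuming you created a file under /etc/sysctl.d/ with the lines above, the settings can be applied without rebooting and then verified:

$ sudo sysctl --system
$ sysctl kernel.perf_event_paranoid kernel.kptr_restrict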

To access the PMUs, the kernel must be built with the following PMU driver configuration options:

  • Core PMU:

    • CONFIG_ARM_PMU_ACPI

    • CONFIG_ARM_PMUV3

  • Statistical Profiling Extension PMU:

    • CONFIG_ARM_SPE_PMU

  • Coresight System PMU:

    • CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU

    • CONFIG_NVIDIA_CORESIGHT_PMU_ARCH_SYSTEM_PMU

Gathering Hardware Performance Data with Perf#

To generate a high-level report of event counts, run the perf stat command. For example, to count cache miss events, CPU cycles, and CPU instructions for ten seconds:

$ perf stat -a -e cache-misses,cycles,instructions sleep 10

You can also gather information for a specific process:

$ perf stat -e cycles,stalled-cycles-backend ./stream.exe

This counts total CPU cycles and cycles where the CPU backend is stalled while stream.exe is executing.

The -e flag accepts a comma-separated list of performance events, which can be predefined or raw events. To see the predefined events, type perf list. Raw events are specified as rXXXX, where XXXX is a hexadecimal event number. Refer to the Arm Neoverse V2 Core Technical Reference Manual for more information about event numbers.
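
For example, predefined and raw events can be mixed in one invocation. The sketch below assumes 0x0011 is the CPU_CYCLES event number on your core (verify it in the Technical Reference Manual):

$ perf stat -e instructions,r0011 ./stream.exe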

For a more detailed analysis, gather data in event-based sampling mode by running the perf record command:

$ perf record -e cycles,instructions,dTLB-loads,dTLB-load-misses ./xhpl.exe
$ perf report
$ perf annotate

Additional information about basic perf usage is available in the perf man pages.

Grace CPU Performance Metrics#

This section provides formulas for useful performance metrics, which are functions of hardware event counts that more fully express the performance characteristics of the system. For example, a simple count of instructions is less meaningful than the ratio of instructions per cycle, which characterizes how effectively the processor is being used. These metrics can be used with any tool that gathers hardware performance event data from the Grace PMUs.

The counters are given by name instead of by event number because most performance analysis tools provide names for common events. If your tool does not have a named counter for one of the following events, use the translation tables in the Arm Neoverse V2 Core Technical Reference Manual to convert the event names to raw event numbers. For example, FP_SCALE_OPS_SPEC has event number 0x80C0 and FP_FIXED_OPS_SPEC has event number 0x80C1, so data for the FLOPS computational intensity metric can be gathered with perf by measuring raw events 0x80C0 and 0x80C1:

perf record -e r80C0 -e r80C1 ./a.out

Cycle and Instruction Accounting#

  • Instructions Per Cycle: Instructions retired per cycle.

    INST_RETIRED/CPU_CYCLES
    
  • Retiring: Percentage of total slots occupied by retired operations; a high value indicates efficient CPU usage.

    100 * (OP_RETIRED/OP_SPEC) * (1 - (STALL_SLOT/(CPU_CYCLES*8)))

Warning

The stall_slot_backend and stall_slot_frontend events cannot be used to accurately measure backend and frontend boundness (refer to https://developer.arm.com/documentation/SDEN2332927/latest/). These counters are also used in the TopdownL1 metrics in perf, which are therefore also inaccurate. We recommend that you use stall_backend and stall_frontend instead.

  • Bad Speculation: Percentage of total slots occupied by speculatively executed operations that did not retire because of a pipeline flush.

    100 * ((1-OP_RETIRED/OP_SPEC) * (1-STALL_SLOT/(CPU_CYCLES * 8)) + BR_MIS_PRED * 4 / CPU_CYCLES)

  • Backend Stalled Cycles: Percentage of cycles that were stalled due to resource constraints in the backend unit of the processor.

    (STALL_BACKEND/CPU_CYCLES)*100

  • Frontend Stalled Cycles: Percentage of cycles that were stalled due to resource constraints in the frontend unit of the processor.

    (STALL_FRONTEND/CPU_CYCLES)*100
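
As a sketch, the inputs for the instructions-per-cycle and stalled-cycle metrics above can be gathered in one run. This assumes your perf build exposes the Neoverse V2 events under these names (check perf list; otherwise substitute the raw rXXXX numbers from the Technical Reference Manual) and that ./a.out is a placeholder for your workload:

$ perf stat -e cpu_cycles,inst_retired,op_spec,op_retired,stall_backend,stall_frontend ./a.out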

Computational Intensity#

  • SVE FLOPS: Floating point operations per second in any precision performed by the SVE instructions.

    Fused instructions count as two operations; for example, a fused multiply-add instruction increases the count by twice the number of active SVE vector lanes. This count does not include floating point operations performed by scalar or NEON instructions.

    FP_SCALE_OPS_SPEC/TIME

  • Non-SVE FLOPS: Floating point operations per second in any precision performed by an instruction that is not an SVE instruction.

    Fused instructions count as two operations; for example, a scalar fused multiply-add instruction increases the count by two, and a fused multiply-add NEON instruction increases the count by twice the number of vector lanes. This count does not include floating point operations performed by SVE instructions.

    FP_FIXED_OPS_SPEC/TIME

  • FLOPS: Floating point operations per second in any precision performed by any instruction.

    Fused instructions count as two operations.

    (FP_SCALE_OPS_SPEC + FP_FIXED_OPS_SPEC)/TIME
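
For example, a minimal sketch that gathers the inputs for the FLOPS metric using the raw event numbers mentioned earlier, with ./a.out as a placeholder workload; dividing the sum of the two counts by duration_time (reported in nanoseconds) gives GFLOP/s:

$ perf stat -e duration_time,r80C0,r80C1 ./a.out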

Operation Mix#

  • Load Operations Percentage: Percentage of total operations speculatively executed as load operations.

    (LD_SPEC/INST_SPEC)*100

  • Store Operations Percentage: Percentage of total operations speculatively executed as store operations.

    (ST_SPEC/INST_SPEC)*100

  • Branch Operations Percentage: Percentage of total operations speculatively executed as branch operations.

    ((BR_IMMED_SPEC + BR_INDIRECT_SPEC)/INST_SPEC)*100

  • Integer Operations Percentage: Percentage of total operations speculatively executed as scalar integer operations.

    (DP_SPEC/INST_SPEC)*100

    Note

    The DP in DP_SPEC stands for Data Processing.

  • Floating Point Operations Percentage: Percentage of total operations speculatively executed as scalar floating point operations.

    (VFP_SPEC/INST_SPEC)*100

  • Synchronization Percentage: Percentage of total operations speculatively executed as synchronization operations.

    ((ISB_SPEC + DSB_SPEC + DMB_SPEC)/INST_SPEC)*100

  • Crypto Operations Percentage: Percentage of total operations speculatively executed as crypto operations.

    (CRYPTO_SPEC/INST_SPEC)*100

  • SVE Operations (Load/Store Inclusive) Percentage: Percentage of total operations speculatively executed as scalable vector operations, including loads and stores.

    (SVE_INST_SPEC/INST_SPEC)*100

  • Advanced SIMD Operations Percentage: Percentage of total operations speculatively executed as advanced SIMD (NEON) operations.

    (ASE_INST_SPEC/INST_SPEC)*100

  • SIMD Percentage: Percentage of total operations speculatively executed as integer or floating point vector/SIMD operations.

    ((SVE_INST_SPEC + ASE_INST_SPEC)/INST_SPEC)*100

  • FP16 Percentage: Percentage of total operations speculatively executed as half-precision floating point operations.

    Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.

    (FP_HP_SPEC/INST_SPEC)*100

  • FP32 Percentage: Percentage of total operations speculatively executed as single-precision floating point operations.

    Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.

    (FP_SP_SPEC/INST_SPEC)*100

  • FP64 Percentage: Percentage of total operations speculatively executed as double-precision floating point operations.

    Includes scalar, fused, and SIMD instructions and cannot be used to measure computational intensity.

    (FP_DP_SPEC/INST_SPEC)*100
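
As a sketch, several of the operation-mix numerators can be collected together with INST_SPEC, assuming your perf build exposes these events by name (check perf list; otherwise use the raw numbers from the Technical Reference Manual); ./a.out is a placeholder workload:

$ perf stat -e inst_spec,ld_spec,st_spec,br_immed_spec,br_indirect_spec,dp_spec,vfp_spec ./a.out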

SVE Predication#

  • Full SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations with all active predicates.

    (SVE_PRED_FULL_SPEC/INST_SPEC)*100

  • Partial SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations in which at least one element is FALSE.

    (SVE_PRED_PARTIAL_SPEC/INST_SPEC)*100

  • Empty SVE Operations Percentage: Percentage of total operations speculatively executed as SVE SIMD operations with no active predicate.

    (SVE_PRED_EMPTY_SPEC/INST_SPEC)*100

Cache Effectiveness#

  • L1 Data Cache Miss Ratio: Ratio of total level 1 data cache read or write accesses that miss.

    L1D_CACHE includes reads and writes and is the sum of L1D_CACHE_RD and L1D_CACHE_WR.

    L1D_CACHE_REFILL/L1D_CACHE

  • L1D Cache MPKI: The number of level 1 data cache accesses missed per thousand instructions executed.

    (L1D_CACHE_REFILL/INST_RETIRED) * 1000

  • L1I Cache Miss Ratio: The ratio of level 1 instruction cache accesses missed to the total number of level 1 instruction cache accesses.

    L1I_CACHE does not measure cache maintenance instructions or non-cacheable accesses.

    L1I_CACHE_REFILL/L1I_CACHE

  • L1I Cache MPKI: The number of level 1 instruction cache accesses missed per thousand instructions executed.

    (L1I_CACHE_REFILL/INST_RETIRED) * 1000

  • L2 Cache Miss Ratio: The ratio of level 2 cache accesses missed to the total number of level 2 cache accesses.

    L2D_CACHE does not count cache maintenance operations or snoops from outside the core.

    L2D_CACHE_REFILL/L2D_CACHE

  • L2 Cache MPKI: The number of level 2 unified cache accesses missed per thousand instructions executed.

    (L2D_CACHE_REFILL/INST_RETIRED) * 1000

  • LL Cache Read Hit Ratio: The ratio of last level cache read accesses that hit in the cache to the total number of last level cache read accesses.

    (LL_CACHE_RD - LL_CACHE_MISS_RD) / LL_CACHE_RD

  • LL Cache Read Miss Ratio: The ratio of last level cache read accesses missed to the total number of last level cache read accesses.

    LL_CACHE_MISS_RD / LL_CACHE_RD

  • LL Cache Read MPKI: The number of last level cache read accesses missed per thousand instructions executed.

    (LL_CACHE_MISS_RD/INST_RETIRED) * 1000
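
As a sketch, the cache events used by the ratios above can be gathered in one run; perf multiplexes the counters if more events are requested than the PMU provides. This assumes the events are exposed by name (otherwise use the raw numbers from the Technical Reference Manual) and ./a.out is a placeholder workload:

$ perf stat -e inst_retired,l1d_cache,l1d_cache_refill,l1i_cache,l1i_cache_refill,l2d_cache,l2d_cache_refill,ll_cache_rd,ll_cache_miss_rd ./a.out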

TLB Effectiveness#

  • L1 Data TLB Miss Ratio: The ratio of level 1 data TLB accesses missed to the total number of level 1 data TLB accesses.

    TLB maintenance instructions are not counted.

    L1D_TLB_REFILL/L1D_TLB

  • L1 Data TLB MPKI: The number of level 1 data TLB accesses missed per thousand instructions executed.

    (L1D_TLB_REFILL/INST_RETIRED) * 1000

  • L1 Instruction TLB Miss Ratio: The ratio of level 1 instruction TLB accesses missed to the total number of level 1 instruction TLB accesses.

    TLB maintenance instructions are not counted.

    L1I_TLB_REFILL/L1I_TLB

  • L1 Instruction TLB MPKI: The number of level 1 instruction TLB accesses missed per thousand instructions executed.

    (L1I_TLB_REFILL/INST_RETIRED) * 1000

  • DTLB MPKI: The number of data TLB Walks per thousand instructions executed.

    (DTLB_WALK/INST_RETIRED) * 1000

  • DTLB Walk Ratio: The ratio of data TLB Walks to the total number of data TLB accesses.

    DTLB_WALK / L1D_TLB

  • ITLB MPKI: The number of instruction TLB Walks per thousand instructions executed.

    (ITLB_WALK/INST_RETIRED) * 1000

  • ITLB Walk Ratio: The ratio of instruction TLB Walks to the total number of instruction TLB accesses.

    ITLB_WALK / L1I_TLB

  • L2 Unified TLB Miss Ratio: The ratio of level 2 unified TLB accesses missed to the total number of level 2 unified TLB accesses.

    L2D_TLB_REFILL / L2D_TLB

  • L2 Unified TLB MPKI: The number of level 2 unified TLB accesses missed per thousand instructions executed.

    (L2D_TLB_REFILL/INST_RETIRED) * 1000
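
As a sketch, the TLB events above can be gathered in a single run under the same assumptions (named events available, counter multiplexing handled by perf, ./a.out as a placeholder workload):

$ perf stat -e inst_retired,l1d_tlb,l1d_tlb_refill,dtlb_walk,l1i_tlb,l1i_tlb_refill,itlb_walk,l2d_tlb,l2d_tlb_refill ./a.out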

Branching#

  • Branch Misprediction Ratio: The ratio of branches mispredicted to the total number of branches architecturally executed.

    BR_MIS_PRED_RETIRED/BR_RETIRED

  • Branch MPKI: The number of branch mispredictions per thousand instructions executed.

    (BR_MIS_PRED_RETIRED/INST_RETIRED) * 1000
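
As a sketch, the branch metrics above need only three events, assuming they are exposed by name and ./a.out is a placeholder workload:

$ perf stat -e inst_retired,br_retired,br_mis_pred_retired ./a.out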

Statistical Profiling Extension#

Arm’s Statistical Profiling Extension (SPE) is designed to enhance software performance analysis by providing detailed profiling capabilities. This extension allows you to collect statistical data about software execution and records key execution data, including program counters, data addresses, and PMU events. SPE enhances performance analysis for branches, memory access, and so on, which makes it useful for software optimization. Refer to Statistical Profiling Extension for more information about SPE.
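
As a sketch, SPE data can be gathered with event-based sampling through perf, assuming the SPE driver is loaded and registers the PMU as arm_spe_0 (check perf list for the exact name on your system):

$ perf record -e arm_spe_0// -- ./a.out
$ perf report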

Grace Coresight System PMU Units#

Grace includes the following Coresight System PMUs, which are registered by the PMU driver with the listed naming conventions:

  • Scalable Coherency Fabric (SCF): nvidia_scf_pmu_<socket-id>

  • NVLINK-C2C0: nvidia_nvlink_c2c0_pmu_<socket-id>

  • NVLINK-C2C1 (Grace-Hopper only): nvidia_nvlink_c2c1_pmu_<socket-id>

  • PCIe: nvidia_pcie_pmu_<socket-id>

  • CNVLINK (multi-socket Grace-Hopper only): nvidia_cnvlink_pmu_<socket-id>

Coresight System PMU events are not attributable to a core, so the perf tool must be run in system-wide mode. If a measurement requires multiple events, the perf tool supports grouping events from the same PMU. For example, to monitor the SCF CYCLES, CMEM_WB_ACCESS, and CMEM_WR_ACCESS events from the SCF PMU for socket 0:

$ perf stat -a -e duration_time,'{nvidia_scf_pmu_0/event=cycles/,nvidia_scf_pmu_0/cmem_wb_access/,nvidia_scf_pmu_0/cmem_wr_access/}' cmem_write_test
 Performance counter stats for 'system wide':
         168225760 ns duration_time
          10515321 nvidia_scf_pmu_0/event=cycles/
            191567 nvidia_scf_pmu_0/cmem_wb_access/

                 0 nvidia_scf_pmu_0/cmem_wr_access/
      0.168225760 seconds time elapsed

Coresight System PMU Traffic Coverage#

The traffic pattern determines which PMU is used for measuring the different access types, and this might vary depending on the chip configuration.

Grace CPU Superchip#

The following table contains the summary of Coresight System PMU events that can be used to monitor different memory traffic types in Grace CPU Superchip.

PMU Accounting for Access Patterns on the Grace Superchip#

  • SCF PMU

    • *SCF_CACHE*: Access to the last level cache.

    • CMEM*: Memory traffic from all CPUs (local and remote socket) to local memory. For example, a measurement using the socket 0 PMU counts the traffic from CPUs in socket 0 and socket 1 to the memory in socket 0.

    • SOCKET* or REMOTE*: Memory traffic from CPUs in the local socket to remote memory over C2C.

  • PCIe PMU

    • *LOC: Memory traffic from PCIe root ports to local memory.

    • *REM: Memory traffic from PCIe root ports to remote memory.

  • NVLINK-C2C0 PMU

    • *LOC: Memory traffic from the PCIe of the remote socket to local memory over C2C. For example, a measurement using the socket 0 PMU counts the traffic from the PCIe of socket 1. Note: The write traffic is limited to PCIe relaxed-ordered writes.

  • NVLINK-C2C1 PMU: Unused. Measurement on this PMU returns a zero count.

  • CNVLINK PMU: Unused. Measurement on this PMU returns a zero count.

The asterisk (*) denotes the prefix or suffix of the PMU event name.

CPU Traffic Measurement#

This section provides information about measuring CPU traffic.

(Figure: CPU Traffic Measurement)

The local CPU traffic generates CMEM events on the source socket’s SCF PMU. The remote CPU traffic generates the following events:

  • REMOTE* or SOCKET* events on the source socket’s SCF PMU.

  • CMEM* events on the destination socket’s SCF PMU.

Scalable Coherency Fabric PMU Accounting contains the performance metrics that can be derived from these events.

Here are some examples of CPU traffic measurements that capture the write size in bytes and read size in data beats, where each data beat transfers up to 32 bytes.

# Local traffic: Socket-0 CPU write to socket-0 DRAM.
# The workload performs a 1 GB write and the CPU and memory are pinned to socket-0.
# The result from SCF PMU in socket-0 shows 1,009,299,148 bytes of CPU writes to socket-0
# memory.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}' numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/tmp/node0 bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.125672 s, 8.5 GB/s

Performance counter stats for 'system wide':

      27,496,157 ns duration_time
      1,009,299,148 nvidia_scf_pmu_0/cmem_wr_total_bytes/
      1,213,938 nvidia_scf_pmu_0/cmem_rd_data/
      1,355,223 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
            7,960 nvidia_scf_pmu_1/remote_socket_rd_data/
      0.127496157 seconds time elapsed

# Local traffic: Socket-0 CPU read from socket-0 DRAM
# The workload performs a 1 GB read and the CPU and memory are pinned to socket-0.

# The result from SCF PMU in socket-0 shows 35,572,420 data beats of CPU reads from socket-0
# memory. Converting it to bytes: 35,572,420 x 32 bytes = 1,138,317,440 bytes.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}' numactl --cpunodebind=0 --membind=0 dd of=/dev/null if=/tmp/node0 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0872731 s, 12.3 GB/s
Performance counter stats for 'system wide':
      88,826,372 ns duration_time
      36,057,808 nvidia_scf_pmu_0/cmem_wr_total_bytes/
      35,572,420 nvidia_scf_pmu_0/cmem_rd_data/
            24,173 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
            4,728 nvidia_scf_pmu_1/remote_socket_rd_data/
      0.088826372 seconds time elapsed

# Remote traffic: Socket-1 CPU write to socket-0 DRAM
# The workload performs a 1 GB write. The CPU is pinned to socket-1 and memory is pinned to socket-0.
# The result from SCF PMU in socket-1 shows 961,728,219 bytes of remote writes to socket-0
# memory. The remote writes are also captured by the destination (socket-0) SCF PMU that shows
# 993,278,696 bytes of any CPU writes to socket-0 memory.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}' numactl --cpunodebind=1 --membind=0 dd if=/dev/zero of=/tmp/node0 bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.171349 s, 6.3 GB/s

 Performance counter stats for 'system wide':
       172,847,464 ns duration_time
       993,278,696 nvidia_scf_pmu_0/cmem_wr_total_bytes/
       1,040,902 nvidia_scf_pmu_0/cmem_rd_data/
       961,728,219 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
       1,065,136 nvidia_scf_pmu_1/remote_socket_rd_data/
       0.172847464 seconds time elapsed

# Remote traffic: Socket-1 CPU read from socket-0 DRAM
# The workload performs a 1 GB read. The CPU is pinned to socket-1 and memory is pinned to socket-0.
# The result from SCF PMU in socket-1 shows 36,189,087 data beats of remote reads from socket-0
# memory. The remote reads are also captured by the destination (socket-0) SCF PMU that shows
# 33,542,984 data beats of any CPU read from socket-0 memory.
# Converting it to bytes:
# - Socket-1 CPU remote read: 36,189,087 x 32 bytes = 1,158,050,784 bytes.
# - Any CPU read from socket-0 memory = 33,542,984 x 32 bytes = 1,073,375,488 bytes.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_scf_pmu_0/cmem_wr_total_bytes/,nvidia_scf_pmu_0/cmem_rd_data/}','{nvidia_scf_pmu_1/remote_socket_wr_total_bytes/,nvidia_scf_pmu_1/remote_socket_rd_data/}' numactl --cpunodebind=1 --membind=0 dd of=/dev/null if=/tmp/node0 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.133185 s, 8.1 GB/s

Performance counter stats for 'system wide':

     134,526,031 ns duration_time
     19,588,224 nvidia_scf_pmu_0/cmem_wr_total_bytes/
     33,542,984 nvidia_scf_pmu_0/cmem_rd_data/
     18,771,438 nvidia_scf_pmu_1/remote_socket_wr_total_bytes/
     36,189,087 nvidia_scf_pmu_1/remote_socket_rd_data/
     0.134526031 seconds time elapsed

PCIe Traffic Measurement#

This section provides information about measuring PCIe traffic.

(Figure: PCIe Traffic Measurement)

The local PCIe traffic generates *LOC events on the source socket’s PCIe PMU, and the remote PCIe traffic generates the following events:

  • *REM events on the source socket’s PCIe PMU.

  • *LOC events on the destination socket’s NVLINK-C2C0 PMU.

PCIe PMU Accounting contains the performance metrics that can be derived from these events.

The PCIe PMU monitors traffic from PCIe root ports to local or remote memory. The Grace SoC supports up to 10 PCIe root ports. At least one of these root ports needs to be specified in the perf tool by using the root_port bitmap filter. For example, the root_port=0xF filter corresponds to root ports 0 through 3.
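
For example, the bitmap for a single root port can be computed by shifting 1 left by the port number; this sketch uses port 8, the port that appears in the NVMe examples later in this section:

$ printf 'root_port=0x%x\n' $((1 << 8))
root_port=0x100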

Example: To count the rd_bytes_loc event from PCIe root ports 0 and 1 of socket 0:

$ perf stat -a -e nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x3/

Example: To count the rd_bytes_rem event from PCIe root port 2 in socket 1 and measure the test duration (in nanoseconds):

$ perf stat -a -e duration_time,'{nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x4/}'

Here is an example of finding the root port number of a PCIe device.

# Find the root port number of the NVMe drive on each socket.
# nvme0 is attached to domain number 0018, which corresponds to socket-1 root port 0x8.
# nvme1 is attached to domain number 0008, which corresponds to socket-0 root port 0x8.
nvidia@localhost:~$ readlink -f /sys/class/nvme/nvme*
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvme/nvme0
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/0008:02:00.0/0008:03:00.0/nvme/nvme1

# Query the mount point of each nvme drive to get the path for the workloads below.
nvidia@localhost:~$ df -H | grep nvme0
/dev/nvme0n1p2 983G 625G 309G 67% /
nvidia@localhost:~$ df -H | grep nvme1
/dev/nvme1n1 472G 1.1G 447G 1% /var/data0

Here are some examples of PCIe traffic measurements when reading data from DRAM and storing the result to an NVME drive. These capture the size of read/write to local/remote memory in bytes.

# Local traffic: Socket-0 PCIE read from socket-0 DRAM.
# The workload copies data from socket-0 DRAM to socket-0 NVME in root port 0x8. The NVME would
# perform a read request to DRAM.
# The root_port filter value: 1 << root port number = 1 << 8 = 0x100
# The result from PCIe PMU in socket-0 shows 1,168,472,064 bytes of PCIe read from socket-0 memory.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_0/wr_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0x100/,nvidia_pcie_pmu_0/wr_bytes_rem,root_port=0x100/}' numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/var/data0/pcie_test_0 bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.96463 s, 547 MB/s

   Performance counter stats for 'system wide':
      1,966,391,711 ns duration_time
      1,168,472,064 nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x100/
      31,250,176 nvidia_pcie_pmu_0/wr_bytes_loc,root_port=0x100/
            49,152 nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0x100/
            0 nvidia_pcie_pmu_0/wr_bytes_rem,root_port=0x100/
      1.966391711 seconds time elapsed
# Remote traffic: Socket-1 PCIE read from socket-0 DRAM.
# The workload copies data from socket-0 DRAM to socket-1 NVME in root port 0x8. The NVME would perform a read request to DRAM.
# The root_port filter value: 1 << root port number = 1 << 8 = 0x100
# The result from PCIe PMU in socket-1 shows 1,073,762,304 bytes of PCIe read from socket-0 memory.
# The remote reads are also captured by the destination (socket-0) NVLINK C2C0 PMU that shows 1,074,057,216 bytes of reads.
nvidia@localhost:~$ sudo perf stat -a -e duration_time,'{nvidia_pcie_pmu_1/rd_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_1/wr_bytes_loc,root_port=0x100/,nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x100/,nvidia_pcie_pmu_1/wr_bytes_rem,root_port=0x100/}','{nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}' numactl --membind=0 dd if=/dev/zero of=/home/nvidia/pcie_test_0 bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.73373 s, 1.5 GB/s

Performance counter stats for 'system wide':
      735,201,612 ns duration_time
      6,398,720 nvidia_pcie_pmu_1/rd_bytes_loc,root_port=0x100/
         164,096 nvidia_pcie_pmu_1/wr_bytes_loc,root_port=0x100/
       1,073,762,304 nvidia_pcie_pmu_1/rd_bytes_rem,root_port=0x100/
          0 nvidia_pcie_pmu_1/wr_bytes_rem,root_port=0x100/
      1,074,057,216 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
         32,768 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
      0.735201612 seconds time elapsed

Grace Hopper Superchip#

The following table contains a summary of the Coresight System PMU events that can be used to monitor different memory traffic types in Grace Hopper Superchip.

PMU Accounting for Access Patterns on the Grace Hopper Superchip#

  • SCF PMU

    • *SCF_CACHE*: Access to the last level cache.

    • CMEM*: Memory traffic from CPU to CPU memory.

    • GMEM*: Memory traffic from CPU to GPU memory over C2C.

  • PCIe PMU

    • *LOC: Memory traffic from PCIe root ports to CPU or GPU memory.

  • NVLINK-C2C0 PMU

    • *LOC: ATS translated or EGM memory traffic from GPU to CPU memory over C2C.

  • NVLINK-C2C1 PMU

    • *LOC: Non-ATS translated memory traffic from GPU to CPU or GPU memory over C2C.

  • CNVLINK PMU: Unused. Measurement on this PMU returns a zero count.

The asterisk (*) denotes the prefix or suffix of a PMU event name.

Refer to Grace CPU Superchip for more information about CPU and PCIe traffic measurement.

GPU Traffic Measurement#

This section provides information about measuring the GPU traffic.

(Figure: GPU Traffic Measurement)

The GPU traffic generates *LOC events on the Grace Coresight System C2C PMUs: the NVLINK-C2C0 PMU captures ATS translated or EGM memory traffic, and the NVLINK-C2C1 PMU captures non-ATS translated traffic. The solid line in the figure indicates only that the path from the SoC to memory is captured; the C2C PMUs do not capture the latency effect of the C2C interconnect. Refer to NVLink C2C PMU Accounting for more information about the performance metrics that can be derived from these events.

Here are some examples of GPU traffic measurements that capture the size of read/write in bytes.

# GPU write to CPU DRAM.
# The workload performs a write by GPU copy engine to CPU DRAM.
# The result from NVLINK-C2C0 PMU shows 4,026,531,840 bytes of writes to DRAM.

nvidia@localhost:/home/nvidia/nvbandwidth$ sudo perf stat -a -e '{nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}','{nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/}' ./nvbandwidth -t 1
nvbandwidth Version: v0.4
Built from Git version: v0.4

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12050
CUDA Driver Version: 12050
Driver Version: 555.42.02
Device 0: NVIDIA GH200 120GB
Running device_to_host_memcpy_ce.memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)

      0
0 295.07

SUM device_to_host_memcpy_ce 295.07

  Performance counter stats for 'system wide':

      4,234,950,656 nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/
      208,418,816 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
      4,026,531,840 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
      26,981,632 nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/
      6,337,792 nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/
      20,643,840 nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/
      0.777059774 seconds time elapsed
# GPU read from CPU DRAM.
# The workload performs a read by GPU copy engine from CPU DRAM.
# The result from NVLINK-C2C0 PMU shows 4,234,927,104 bytes of reads from DRAM.
nvidia@localhost:/home/nvidia/nvbandwidth$ sudo perf stat -a -e '{nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/}','{nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/,nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/}' ./nvbandwidth -t 0
nvbandwidth Version: v0.4
Built from Git version: v0.4
NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12050
CUDA Driver Version: 12050
Driver Version: 555.42.02

Device 0: NVIDIA GH200 120GB

Running host_to_device_memcpy_ce.memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)

       0
0 408.42

SUM host_to_device_memcpy_ce 408.42
 Performance counter stats for 'system wide':
      4,436,253,696 nvidia_nvlink_c2c0_pmu_0/total_bytes_loc/
      4,234,927,104 nvidia_nvlink_c2c0_pmu_0/rd_bytes_loc/
      201,326,592 nvidia_nvlink_c2c0_pmu_0/wr_bytes_loc/
      26,949,888 nvidia_nvlink_c2c1_pmu_0/total_bytes_loc/
      6,356,480 nvidia_nvlink_c2c1_pmu_0/rd_bytes_loc/
      20,593,408 nvidia_nvlink_c2c1_pmu_0/wr_bytes_loc/
      0.771266524 seconds time elapsed

Scalable Coherency Fabric PMU Accounting#

SCF PMU Events#

The following lists provide the events exposed by the PMU driver and the perf tool, grouped by function.

SCF PMU Events for Cycle Counter#

  • CYCLES (ID 0x100000000): SCF clock cycles.

  • BUS_CYCLES (ID 0x1d): Same as CYCLES.

SCF PMU Events for CPU Last-level Cache Events#

  • SCF_CACHE_ALLOCATE (ID 0xf0): Read/Write/Atomic/Prefetch requests that are allocated in the LL cache.

  • SCF_CACHE_REFILL (ID 0xf1): Read/Atomic/StashOnce requests that miss in the Last Level cache and inner caches, resulting in a request to the next level of the memory hierarchy (number of Read requests that miss in CTAG and DIR and fetch data from L4).

  • SCF_CACHE (ID 0xf2): LL cache Read/Write/Atomic/Snoop/Prefetch requests not resulting in decode errors. NOTE: Dropped L3 prefetches are not counted.

  • SCF_CACHE_WB (ID 0xf3): Number of capacity evictions from the LL cache.

SCF PMU Events for CPU Memory Transport Events#

  • CMEM_RD_DATA (ID 0x1a5): Counts the total number of SCF Read data beats transferred from Local Grace DRAM to Local or Remote CPU. Each data beat transfers up to 32 bytes.

  • CMEM_RD_ACCESS (ID 0x1a6): Counts the number of SCF Read accesses from Local or Remote CPU to Local Grace DRAM. Increments by the total new Read accesses each cycle.

  • CMEM_RD_OUTSTANDING (ID 0x1a7): Counts the total number of outstanding SCF Read accesses from Local or Remote CPU to Local socket Grace DRAM. Increments by the total outstanding count each cycle.

  • CMEM_WB_DATA (ID 0x1ab): Counts the total number of SCF Write-back data beats (clean/dirty) transferred from Local or Remote CPU to Local Grace DRAM (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • CMEM_WB_ACCESS (ID 0x1ac): Counts the number of SCF Write-back accesses (clean/dirty) from Local or Remote CPU to Local Grace DRAM (excluding Write-unique/non-coherent Writes). Increments by the total new Write accesses each cycle. Each Write-back access is a 64-byte transfer.

  • CMEM_WR_DATA (ID 0x1b1): Counts the number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes (including WriteZeros, which is counted as 64 bytes) from Local or Remote CPU to Local Grace DRAM.

  • CMEM_WR_TOTAL_BYTES (ID 0x1db): Counts the total number of bytes transferred to Local Grace DRAM by Write-backs, Write-unique, and non-coherent Writes from Local or Remote CPU. CMEM_WR_TOTAL_BYTES = CMEM_WB_DATA*32 + CMEM_WR_DATA

  • CMEM_WR_ACCESS (ID 0x1ca): Counts the total number of SCF Write-unique and non-coherent Write (including WriteZeros) requests from Local or Remote CPU to Local Grace DRAM.

SCF PMU Events for GPU Memory Transport Events in Grace Hopper Superchip#

  • GMEM_RD_DATA (ID 0x16d): Counts the total number of Read data beats transferred from Local coherent GPU DRAM to Local or Remote CPU. Each data beat transfers up to 32 bytes.

  • GMEM_RD_ACCESS (ID 0x16e): Counts the number of SCF Read accesses from Local or Remote CPU to Local coherent GPU DRAM. Increments by the total new Read accesses each cycle.

  • GMEM_RD_OUTSTANDING (ID 0x16f): Indicates the total number of outstanding Read requests from Local or Remote CPU to coherent GPU DRAM. Increments by the outstanding Read request count each cycle.

  • GMEM_WB_DATA (ID 0x173): Counts the total number of SCF Write-back data beats transferred from Local or Remote CPU to Local coherent GPU DRAM (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • GMEM_WB_ACCESS (ID 0x174): Counts the number of SCF Write-back accesses (clean and dirty) from Local or Remote CPU to Local coherent GPU DRAM (excluding Write-unique/non-coherent Writes). Increments by the total new Write accesses each cycle. Each Write-back access is a 64-byte transfer.

  • GMEM_WR_DATA (ID 0x179): Counts the number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes from Local or Remote CPU to Local coherent GPU DRAM.

  • GMEM_WR_ACCESS (ID 0x17b): Counts the total number of SCF Write-unique and non-coherent Write requests from Local or Remote CPU to Local coherent GPU DRAM.

  • GMEM_WR_TOTAL_BYTES (ID 0x1a0): Counts the total number of bytes transferred to Local coherent GPU DRAM by Write-backs, Write-unique, and non-coherent Writes. GMEM_WR_TOTAL_BYTES = GMEM_WB_DATA*32 + GMEM_WR_DATA

SCF PMU Events for Remote Socket Transport Events#

  • SOCKET_0_RD_DATA (ID 0x101): Counts the total number of SCF Read data beats transferred from CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory to Local CPU. Each data beat transfers up to 32 bytes.

  • SOCKET_1_RD_DATA (ID 0x102): Counts the total number of SCF Read data beats transferred from CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory to Local CPU. Each data beat transfers up to 32 bytes.

  • SOCKET_2_RD_DATA (ID 0x103): Counts the total number of SCF Read data beats transferred from CNVLINK socket 2 memory to Local CPU. Each data beat transfers up to 32 bytes.

  • SOCKET_3_RD_DATA (ID 0x104): Counts the total number of SCF Read data beats transferred from CNVLINK socket 3 memory to Local CPU. Each data beat transfers up to 32 bytes.

  • SOCKET_0_WB_DATA (ID 0x109): Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • SOCKET_1_WB_DATA (ID 0x10a): Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • SOCKET_2_WB_DATA (ID 0x10b): Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 2 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • SOCKET_3_WB_DATA (ID 0x10c): Counts the total number of SCF Write-back data beats transferred from Local CPU to CNVLINK socket 3 memory (excluding Write-unique/non-coherent Writes). Each wb_data beat transfers 32 bytes of data.

  • SOCKET_0_WR_DATA (ID 0x17c): Counts the total number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory.

  • SOCKET_1_WR_DATA (ID 0x17d): Counts the total number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory.

  • SOCKET_2_WR_DATA (ID 0x17e): Counts the total number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 2 memory.

  • SOCKET_3_WR_DATA (ID 0x17f): Counts the total number of bytes transferred, based on the size field, for SCF Write-unique and non-coherent Writes from Local CPU to CNVLINK socket 3 memory.

  • SOCKET_0_RD_OUTSTANDING (ID 0x115): Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. Increments by the total outstanding count each cycle.

  • SOCKET_1_RD_OUTSTANDING (ID 0x116): Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. Increments by the total outstanding count each cycle.

  • SOCKET_2_RD_OUTSTANDING (ID 0x117): Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 2 memory. Increments by the total outstanding count each cycle.

  • SOCKET_3_RD_OUTSTANDING (ID 0x118): Counts the total number of outstanding SCF Read accesses from Local CPU to CNVLINK socket 3 memory. Increments by the total outstanding count each cycle.

  • SOCKET_0_RD_ACCESS (ID 0x12d): Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory. Increments by the total new Read accesses each cycle.

  • SOCKET_1_RD_ACCESS (ID 0x12e): Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory. Increments by the total new Read accesses each cycle.

  • SOCKET_2_RD_ACCESS (ID 0x12f): Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 2 memory. Increments by the total new Read accesses each cycle.

  • SOCKET_3_RD_ACCESS (ID 0x130): Counts the number of SCF Read accesses from Local CPU to CNVLINK socket 3 memory. Increments by the total new Read accesses each cycle.

  • SOCKET_0_WB_ACCESS (ID 0x135): Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory.

  • SOCKET_1_WB_ACCESS (ID 0x136): Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory.

  • SOCKET_2_WB_ACCESS (ID 0x137): Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 2 memory.

  • SOCKET_3_WB_ACCESS (ID 0x138): Counts all Write-back (clean and dirty) accesses from Local CPU to CNVLINK socket 3 memory.

  • SOCKET_0_WR_ACCESS (ID 0x139): Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 0 memory or NVLINK-C2C socket 0 memory.

  • SOCKET_1_WR_ACCESS (ID 0x13a): Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 1 memory or NVLINK-C2C socket 1 memory.

  • SOCKET_2_WR_ACCESS (ID 0x13b): Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 2 memory.

  • SOCKET_3_WR_ACCESS (ID 0x13c): Counts the Write-unique and non-coherent Write accesses from Local CPU to CNVLINK socket 3 memory.

  • REMOTE_SOCKET_WR_TOTAL_BYTES (ID 0x1a1): Counts the total number of bytes transferred to Remote sockets’ memory by Write-backs, Write-unique, non-coherent Writes, and all atomics. For the non-coherent Writes, the number of bytes is based on the size field. Based on the Grace socket index, Remote destinations are sockets other than the Local socket index. Example: on the socket 0 SCF PMU, this event covers the socket 1, 2, and 3 memory destinations: REMOTE_SOCKET_WR_TOTAL_BYTES = (SOCKET_1_WB_DATA + SOCKET_2_WB_DATA + SOCKET_3_WB_DATA)*32 + (SOCKET_1_WR_DATA + SOCKET_2_WR_DATA + SOCKET_3_WR_DATA)

  • REMOTE_SOCKET_RD_DATA (ID 0x1a2): Counts the total number of Read data beats transferred from all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Each data beat transfers up to 32 bytes. Based on the Grace socket index, Remote destinations are sockets other than the Local socket index. Example: on the socket 0 SCF PMU, REMOTE_SOCKET_RD_DATA = SOCKET_1_RD_DATA + SOCKET_2_RD_DATA + SOCKET_3_RD_DATA. NOTE: In the Grace Superchip configuration, this event count needs to match SOCKET_0_RD_DATA or SOCKET_1_RD_DATA, depending on the measuring socket’s index.

  • REMOTE_SOCKET_RD_OUTSTANDING (ID 0x1a3): Counts the total number of outstanding Read accesses to all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Increments by the total outstanding count each cycle. Based on the Grace socket index, Remote destinations are sockets other than the Local socket index. Example: on the socket 0 SCF PMU, REMOTE_SOCKET_RD_OUTSTANDING = SOCKET_1_RD_OUTSTANDING + SOCKET_2_RD_OUTSTANDING + SOCKET_3_RD_OUTSTANDING. NOTE: In the Grace Superchip configuration, this event count needs to match SOCKET_0_RD_OUTSTANDING or SOCKET_1_RD_OUTSTANDING, depending on the measuring socket’s index.

  • REMOTE_SOCKET_RD_ACCESS (ID 0x1a4): Counts the number of Read accesses to all CNVLINK Remote socket memory or NVLINK-C2C Remote socket memory. Increments by the total new Read accesses each cycle. Based on the Grace socket index, Remote destinations are sockets other than the Local socket index. Example: on the socket 0 SCF PMU, REMOTE_SOCKET_RD_ACCESS = SOCKET_1_RD_ACCESS + SOCKET_2_RD_ACCESS + SOCKET_3_RD_ACCESS. NOTE: In the Grace Superchip configuration, this event count needs to match SOCKET_0_RD_ACCESS or SOCKET_1_RD_ACCESS, depending on the measuring socket’s index.

SCF Maximum Transport Request Utilization#

  • CMEM: a maximum of 8 Read requests per cycle (RD_ACCESS) and 8 Write requests per cycle (WR_ACCESS or WB_ACCESS).

  • GMEM: a maximum of 4 Read requests per cycle (RD_ACCESS) and 4 Write requests per cycle (WR_ACCESS or WB_ACCESS).

  • REMOTE: a maximum of 2 Read requests per cycle (RD_ACCESS) and 2 Write requests per cycle (WR_ACCESS or WB_ACCESS).

SCF PMU Metrics#

This section provides additional formulas for useful performance metrics based on events provided by the Scalable Coherency Fabric (SCF) PMU.

  • Duration: Duration of the measurement in nanoseconds.

    • Event: DURATION_TIME

  • Cycles: SCF cycle count

    • Event: CYCLES

  • SCF Frequency: Frequency of SCF cycles in GHz

    • Metric formula: CYCLES/DURATION_TIME

  • Local CPU memory write bandwidth: Write bandwidth to local CPU memory, in GBps.

    • Source:

      • Grace CPU Superchip: local and remote CPU

      • Single socket Grace Hopper Superchip: local CPU

    • Destination: local CPU memory

    • Metric formula: CMEM_WR_TOTAL_BYTES/DURATION_TIME

  • Local CPU memory read bandwidth: Read bandwidth from local CPU memory, in GBps.

    • Source:

      • Grace CPU Superchip: local and remote CPU

      • Single socket Grace Hopper Superchip: local CPU

    • Destination: local CPU memory

    • Metric formula: CMEM_RD_DATA*32/DURATION_TIME

  • Local GPU memory write bandwidth: Write bandwidth to local GPU memory, in GBps.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: N/A

      • Single socket Grace Hopper Superchip: local GPU memory

    • Metric formula: GMEM_WR_TOTAL_BYTES/DURATION_TIME

  • Local GPU memory read bandwidth: Read bandwidth from local GPU memory, in GBps.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: N/A

      • Single socket Grace Hopper Superchip: local GPU memory

    • Metric formula: GMEM_RD_DATA*32/DURATION_TIME

  • Remote memory write bandwidth: Write bandwidth to remote socket memory, in GBps.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: remote CPU memory

      • Single socket Grace Hopper Superchip: N/A

    • Metric formula: REMOTE_SOCKET_WR_TOTAL_BYTES/DURATION_TIME

  • Remote memory read bandwidth: Read bandwidth from remote socket memory, in GBps.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: remote CPU memory

      • Single socket Grace Hopper Superchip: N/A

    • Metric formula: REMOTE_SOCKET_RD_DATA*32/DURATION_TIME

  • Local CPU memory write utilization percentage: Percent usage of writes to local CPU memory.

    • Source:

      • Grace CPU Superchip: local and remote CPU

      • Single socket Grace Hopper Superchip: local CPU

    • Destination: local CPU memory

    • Metric formula: ((CMEM_WB_ACCESS + CMEM_WR_ACCESS)/(8*CYCLES)) * 100.0

  • Local GPU memory write utilization percentage: Percent usage of writes to local GPU memory.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: N/A

      • Single socket Grace Hopper Superchip: local GPU memory

    • Metric formula: ((GMEM_WB_ACCESS + GMEM_WR_ACCESS)/(4*CYCLES)) * 100.0

  • Remote memory write utilization percentage: Percent usage of writes to the remote socket memory.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: remote CPU memory

      • Single socket Grace Hopper Superchip: N/A

    • Metric formula:

      • Socket 0 access to socket 1 memory (measured in socket 0) : ((SOCKET_1_WB_ACCESS + SOCKET_1_WR_ACCESS)/(2*CYCLES)) * 100.0

      • Socket 1 access to socket 0 memory (measured in socket 1) : ((SOCKET_0_WB_ACCESS + SOCKET_0_WR_ACCESS)/(2*CYCLES)) * 100.0

  • Local CPU memory read utilization percentage: Percent usage of reads from local CPU memory.

    • Source:

      • Grace CPU Superchip: local and remote CPU

      • Single socket Grace Hopper Superchip: local CPU

    • Destination: local CPU memory

    • Metric formula: ((CMEM_RD_ACCESS)/(8*CYCLES)) * 100.0

  • Local GPU memory read utilization percentage: Percent usage of reads from local GPU memory.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: N/A

      • Single socket Grace Hopper Superchip: local GPU memory

    • Metric formula: ((GMEM_RD_ACCESS)/(4*CYCLES)) * 100.0

  • Remote memory read utilization percentage: Percent usage of reads from remote socket memory.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: remote CPU memory

      • Single socket Grace Hopper Superchip: N/A

    • Metric formula:

      • Socket 0 access to socket 1 memory (measured in socket 0) : ((SOCKET_1_RD_ACCESS)/(2*CYCLES)) * 100.0

      • Socket 1 access to socket 0 memory (measured in socket 1) : ((SOCKET_0_RD_ACCESS)/(2 * CYCLES)) * 100.0

  • Local CPU memory read latency: Average latency of reads from local CPU memory, in nanoseconds.

    • Source:

      • Grace CPU Superchip: local and remote CPU

      • Single socket Grace Hopper Superchip: local CPU

    • Destination: local CPU memory

    • Metric formula: (CMEM_RD_OUTSTANDING/CMEM_RD_ACCESS)/(CYCLES/DURATION)

  • Local GPU memory read latency: Average latency of reads from local GPU memory, in nanoseconds.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: N/A

      • Single socket Grace Hopper Superchip: local GPU memory

    • Metric formula: (GMEM_RD_OUTSTANDING/GMEM_RD_ACCESS)/(CYCLES/DURATION)

  • Remote memory read latency: Average latency of reads from remote memory in nanoseconds.

    • Source: local CPU

    • Destination:

      • Grace CPU Superchip: remote CPU memory

      • Single socket Grace Hopper Superchip: N/A

    • Metric formula:

      • Socket 0 access to socket 1 memory (measured in socket 0): (SOCKET_1_RD_OUTSTANDING/SOCKET_1_RD_ACCESS)/(CYCLES/DURATION)

      • Socket 1 access to socket 0 memory (measured in socket 1): (SOCKET_0_RD_OUTSTANDING/SOCKET_0_RD_ACCESS)/(CYCLES/DURATION)
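
As a sketch, the inputs for the local CPU memory read latency metric can be gathered system-wide in one group, following the event-grouping example earlier in this chapter; ./my_workload is a placeholder, and the latency is then (cmem_rd_outstanding/cmem_rd_access)/(cycles/duration_time):

$ perf stat -a -e duration_time,'{nvidia_scf_pmu_0/event=cycles/,nvidia_scf_pmu_0/cmem_rd_access/,nvidia_scf_pmu_0/cmem_rd_outstanding/}' ./my_workload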

PCIe PMU Accounting#

PCIe PMU Events#

The following list provides the events exposed by the PMU driver and the perf tool.

PCIe Transport Events#

  • CYCLES (ID 0x100000000): PCIe PMU cycle counter.

  • WR_REQ_LOC (ID 0x8): Counts the number of PCIe Write requests to Local CMEM or GMEM.

  • RD_REQ_LOC (ID 0x6): Counts the number of PCIe Read requests to Local CMEM or GMEM.

  • TOTAL_REQ_LOC (ID 0xa): Counts the number of PCIe Read/Write requests to Local CMEM or GMEM.

  • WR_BYTES_LOC (ID 0x2): Counts the number of PCIe Write bytes to Local CMEM or GMEM.

  • RD_BYTES_LOC (ID 0x0): Counts the number of PCIe Read bytes to Local CMEM or GMEM.

  • TOTAL_BYTES_LOC (ID 0x4): Counts the number of PCIe Read/Write bytes to Local CMEM or GMEM.

  • RD_CUM_OUTS_LOC (ID 0xc): Accumulates PCIe outstanding Read request cycles to Local CMEM or GMEM.

  • WR_REQ_REM (ID 0x9): Counts the number of PCIe Write requests to Remote CMEM or GMEM.

  • RD_REQ_REM (ID 0x7): Counts the number of PCIe Read requests to Remote CMEM or GMEM.

  • TOTAL_REQ_REM (ID 0xb): Counts the number of PCIe Read/Write requests to Remote CMEM or GMEM.

  • WR_BYTES_REM (ID 0x3): Counts the number of PCIe Write bytes to Remote CMEM or GMEM.

  • RD_BYTES_REM (ID 0x1): Counts the number of PCIe Read bytes to Remote CMEM or GMEM.

  • TOTAL_BYTES_REM (ID 0x5): Counts the number of PCIe Read/Write bytes to Remote CMEM or GMEM.

  • RD_CUM_OUTS_REM (ID 0xd): Accumulates PCIe outstanding Read request cycles to Remote CMEM or GMEM.

PCIe Maximum Transport Request Utilization#

  • Root ports 0-9 (target: Local or Remote memory): a maximum of 1 Read request per port per cycle, for a total of 10 requests per cycle (RD_REQ), and a maximum of one 32B Write request per port per cycle, for a total of ten 32B requests per cycle (WR_REQ).

PCIe PMU Metrics#

This section provides additional formulas for useful performance metrics based on events provided by the PCIe PMU.

Note

All PCIe metrics that follow are affected by the root port filter. If you do not define a root port filter, the metric results will be zero (refer to PCIe Traffic Measurement for more information).

  • PCIe RP Frequency: Frequency of PCIe RP memory client interface cycles in GHz.

    • Metric formula: CYCLES/DURATION_TIME

  • PCIe RP read bandwidth: PCIe root port read bandwidth, in GBps.

    • Source: local PCIe root port

    • Destination: local and remote memory

    • Metric formula: (RD_BYTES_LOC + RD_BYTES_REM)/DURATION_TIME

  • PCIe RP write bandwidth: PCIe root port write bandwidth, in GBps.

    • Source: local PCIe root port

    • Destination: local and remote memory

    • Metric formula: (WR_BYTES_LOC + WR_BYTES_REM)/DURATION_TIME

  • PCIe RP bidirectional bandwidth: PCIe root port read and write bandwidth, in GBps.

    • Source: local PCIe root port

    • Destination: local and remote memory

    • Metric formula: (RD_BYTES_LOC + RD_BYTES_REM + WR_BYTES_LOC + WR_BYTES_REM)/DURATION_TIME

  • PCIe RP read utilization percentage: Average percent utilization of PCIe root port for reads.

    • Source: local PCIe root port

    • Destination: local and remote memory

    • Metric formula: ((RD_REQ_LOC + RD_REQ_REM)/(10 * CYCLES) ) * 100.0

  • PCIe RP write utilization percentage: Average percent utilization of PCIe root port for writes.

    • Source: local PCIe root port

    • Destination: local and remote memory

    • Metric formula: ( (WR_REQ_LOC + WR_REQ_REM)/(10 * CYCLES) ) * 100.0

  • PCIE RP local memory read latency: Average latency of PCIe root port reads to local memory, in nanoseconds.

    Note

    This metric does not include the latency of the PCIe interconnect.

    • Source: local PCIe root port

    • Destination: local memory

    • Metric formula: (RD_CUM_OUTS_LOC/RD_REQ_LOC)/(CYCLES/DURATION_TIME)

  • PCIE RP remote memory read latency: Average latency of PCIe root port reads to remote memory, in nanoseconds.

    Note

    This metric does not include the latency of the PCIe interconnect.

    • Source: local PCIe root port

    • Destination: remote memory

    • Metric formula: (RD_CUM_OUTS_REM/RD_REQ_REM)/(CYCLES/DURATION_TIME)
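
As a sketch, the inputs for the PCIe root port read and write bandwidth metrics can be gathered system-wide; root_port=0x1 (root port 0 only) is just an example filter, and sleep 10 bounds the measurement window:

$ perf stat -a -e duration_time,'{nvidia_pcie_pmu_0/rd_bytes_loc,root_port=0x1/,nvidia_pcie_pmu_0/rd_bytes_rem,root_port=0x1/,nvidia_pcie_pmu_0/wr_bytes_loc,root_port=0x1/,nvidia_pcie_pmu_0/wr_bytes_rem,root_port=0x1/}' sleep 10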

Profiling CPU Behavior with Nsight Systems#

The Nsight Systems tool (also referred to as nsys) profiles the system’s compute units including the CPUs and GPUs (refer to Nsight Systems for more information). The tool can trace more than 25 APIs including CUDA APIs, sample CPU instruction pointers/backtraces, sample the CPU and SoC event counts, and sample GPU hardware event counts to provide a system-wide view of a workload’s behavior.

nsys can sample CPU and SoC events and graph their rates on the nsys UI timeline. It can generate the metrics described in the sections from Grace CPU Performance Metrics through NVLink C2C PMU Accounting and also graph them on the nsys UI timeline. If CPU IP/backtrace data is gathered concurrently, users can determine when CPU and SoC events are extremely active (or inactive) and correlate that information with the IP/backtrace data to determine which aspect of the workload was actively running at that time.

The graphic below shows a sample nsys profile timeline. In this case, two Grace C2C0 socket metrics were collected in addition to the IPC (Instructions per Cycle) core metric on each CPU. The C2C0 metrics (C2C0 read and write utilization percentage) show GPU access to the CPU’s memory. The IPC metric shows that thread 2965407, which is running on CPU 146, is memory bound (the IPC value is ~0.05) right before the C2C0 activity. The orange-yellow tick marks under thread 2965407 represent individual instruction pointer/backtrace samples. Users can hover over these samples to get a backtrace that represents the code that the thread was executing at that time. This data can be used to understand what the workload is doing at that time.

(Figure: An Example nsys Timeline)

Use the --cpu-core-metrics, --cpu-socket-metrics, and --sample nsys CLI switches to collect the data described above. Also see the --cpu-core-events, --cpu-socket-events, and --cpuctxsw nsys CLI switches, which are useful when profiling CPU and SoC performance issues. For more information, run the nsys profile --help command.

nsys also provides specific support for multi-node performance analysis.

When you run nsys in a multi-node system, it typically generates one result file per rank on the system. While you can load multiple result files into the nsys UI for visualization, the Nsight Systems Multi-Report Analysis system allows you to run statistical analysis across all of the result files (refer to Nsight Systems Multi-Report Analysis for more information).

The nsys Multi-Report Analysis system includes multiple Python scripts, called recipes, that provide different analyses. Example recipes include:

  • cuda_api_sum: CUDA API Summary

  • cuda_gpu_kern_hist: CUDA GPU Kernel Duration Histogram

  • gpu_metric_util_map: GPU Metric Utilization Heatmap

  • mpi_gpu_time_util_map: MPI and GPU Time Utilization Heatmap

  • network_sum: Network Traffic Summary

  • nvtx_sum: NVTX Range Summary

  • ucx_gpu_time_util_map: UCX and GPU Time Utilization Heatmap

Multiple nsys reports can be viewed on a common timeline within the nsys UI. Opening multiple reports together can help identify multi-node issues (refer to Viewing Multiple Nsight Systems Reports for more information on how to open multiple reports on a common timeline). Note that opening multiple reports together can require a large amount of memory. To mitigate this issue, collect the minimum amount of data required to capture the workload’s behavior of interest.

nsys provides specific support for handling application launchers used in multi-node environments such as MPI, OpenSHMEM, UCX, NCCL, NVSHMEM, and SLURM. Refer to Viewing Multiple Nsight Systems Reports for more information.

nsys includes support for profiling many different network communication protocols, including MPI, OpenSHMEM, and UCX. It can also profile Network Interface Cards (NICs) and InfiniBand network switches (refer to Viewing Multiple Nsight Systems Reports for more information).

Additional Tools#

HPC Benchmarks#

NVIDIA provides the NVIDIA HPC Benchmarks collection via NGC, which can be used to confirm that your multi-node setup is working as expected. Check out the HPL, HPL-MxP, and HPCG benchmarks (refer to NVIDIA HPC Benchmarks for more information).

PyTorch Support#

The open source NVIDIA nvidia-resiliency-ext Python package can help when running PyTorch-based workloads in a multi-node environment. The nvidia-resiliency-ext tools provide fault tolerance, straggler detection, and checkpointing support (refer to NVIDIA nvidia-resiliency-ext for more information). The checkpointing functionality minimizes lost work by allowing workloads to be restarted from a recent checkpoint after a failure.