DPL Program Metrics - NVIDIA Docs

Introduction

The DPL compiler produces metrics information for compiled DPL programs to provide developers with insight into how their programs are implemented in the target hardware architecture. This information is critical for identifying potential performance bottlenecks, such as expensive P4 tables, complex P4 actions, or critical processing paths through a complete program.

The information in this section describes what visibility the DPL compiler currently provides and how it can be used to guide DPL program development to achieve required performance.

Configuration and Output

By default, the DPL compiler outputs metrics information in the p4-build directory alongside other standard compiler outputs.

Enable/Disable Metrics

Default Behavior: Equivalent to including the compilation flag --enable metrics.
Disable Metrics: Generation can be turned off using the flag --disable metrics.

Generated Files

program_name.metrics.log: A text-based log summary of the compiled program's metrics.
program_name.metrics.json: A JSON-structured representation, useful for tracking metrics programmatically over time.
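As one way to track metrics over time, the sketch below compares two versions of a metrics JSON file. Because the schema of program_name.metrics.json is not described here, the code assumes nothing about key names: it flattens every numeric leaf into a path-like key and reports the ones that changed. The helper names (flatten, diff_metrics, diff_metric_files) are hypothetical and not part of any DPL tooling.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Map every numeric leaf of a nested JSON value to a path-like key."""
    out = {}
    if isinstance(node, dict):
        for key, value in node.items():
            out.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            out.update(flatten(value, f"{prefix}/{index}"))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        out[prefix] = node
    return out


def diff_metrics(old, new):
    """Return {path: (old_value, new_value)} for numeric leaves that differ.

    A value of None means the path is absent on that side.
    """
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def diff_metric_files(old_path, new_path):
    """Load two metrics JSON files (paths chosen by the caller) and diff them."""
    old = json.loads(Path(old_path).read_text())
    new = json.loads(Path(new_path).read_text())
    return diff_metrics(old, new)
```

Running diff_metric_files on the metrics JSON from a previous build and the current one gives a compact list of the numeric values that moved, which can be logged or checked in a CI job.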

Report Details (program_name.metrics.log)

The metrics log is based on a "lowered" representation of the DPL program, mapping high-level P4 constructs to hardware-aligned blocks called matchers and rules.

Matcher: Generally correlates to a P4 Table.
Rule: Generally correlates to a P4 Action.
STE (Steering Table Entry): The fundamental hardware unit of work, consisting of a match phase and an action phase.

The log file contains three main sections:

DOCA HWS Metrics Report

This section summarizes resource usage and associated costs for matchers and rules. It begins with a summary of the total estimated memory usage and total unique STEs.

Column definitions:

Column	Description
ID	A unique identifier generated by the compiler, corresponding to ID values found in the `program_name.program.json` file
Table Name	The P4 table from which the matcher was derived (blank for synthesized matchers)
Domain	The hardware domain the matcher resides in. Currently allowed domains: FDB, FDB_RX, and FDB_TX.
Table Size	The capacity of rules (entries) the table can hold. This is expected to be a power of two due to memory allocation constraints.
Action Name	The P4 action from which the rule was derived.
Memory	Matcher: Bytes allocated to implement the matcher (enough for worst-case rules). Rule: Bytes required for a single rule of that type.
Packet Processing	A unitless estimate of execution cost. Higher numbers indicate more processing time. Matcher: Cost of processing a table miss (search + branching). Rule: Cost to execute the rule.
Insertion	A unitless estimate of the cost to insert a rule into a given matcher. Complex rules with many STEs take longer to insert.
Match STEs	The number of STEs required to perform the match operation (typically 1 or 2 for BlueField targets).
Action STEs	The number of STEs required to perform all of a rule's actions. A value of 0 indicates actions are performed in the same STE as the match.
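The power-of-two expectation on Table Size can be illustrated with a short sketch. It only shows the arithmetic: how the compiler actually chooses a table's allocated size is not specified here, and rounding a requested size up to the next power of two is an assumption made for illustration.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that can hold n entries (illustrative only)."""
    if n < 1:
        raise ValueError("table size must be at least 1")
    return 1 << (n - 1).bit_length()
```

Under that assumption, a table declared with 1000 entries would appear in the report with a Table Size of 1024.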

DOCA HWS Metrics Report

This section summarizes the worst-case execution paths through the DPL program based on two different criteria:

Steps: The path requiring the highest number of STE steps.
Cost: The path with the highest cumulative packet processing cost.

Column definitions:

Column	Description
Step	The step order of the execution path, at the granularity of a hardware rule.
Table	The matcher ID visited at this step (includes P4 table name when possible).
Entry	The rule ID visited at this step (includes P4 action name when possible).
STEs	The number of STEs executed to perform the rule.
Pkt Processing	A unitless estimate of the cost of exercising the rule at this specific step.
Total	The sum of STE count and Packet Processing cost for the entire execution path.
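The two criteria can select different paths, as the sketch below illustrates. The Step record mirrors the columns described above, but the table names, action names, and numbers are invented for illustration and do not come from a real report.

```python
from typing import NamedTuple


class Step(NamedTuple):
    table: str           # matcher visited at this step
    entry: str           # rule visited at this step
    stes: int            # STEs executed to perform the rule
    pkt_processing: int  # unitless packet processing cost


def totals(path):
    """Return (total STEs, total packet processing cost) for one path."""
    return sum(s.stes for s in path), sum(s.pkt_processing for s in path)


def worst_paths(paths):
    """Pick the worst path by STE count and the worst path by cost."""
    by_steps = max(paths, key=lambda p: totals(p)[0])
    by_cost = max(paths, key=lambda p: totals(p)[1])
    return by_steps, by_cost


# Hypothetical paths: the first has more steps, the second costs more.
path_a = [Step("t1", "a1", 1, 10), Step("t2", "a2", 1, 10), Step("t3", "a3", 1, 10)]
path_b = [Step("t1", "a1", 1, 10), Step("t4", "a4", 1, 45)]

by_steps, by_cost = worst_paths([path_a, path_b])
```

Here path_a is the worst case by steps (3 STEs) while path_b is the worst case by cost (55), which is why the report lists both.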

Static entry hash distribution

When a program uses static entry tables (tables whose constant entries are known at compile time), the metrics log includes a section showing how each table's static entries are distributed across the allocated hash table. An example is shown below.

Static entry hash distribution:
Table 101 :
  0x00000000: x
  0x00000004: x
Table 102 :
  0x00000001: x
  0x00000003: x
  0x00000178: xxx
  0x00000563: x

In this example, two tables are implemented that have static entries, table 101 and table 102. For each table, a histogram of entries per (occupied) hash bucket is shown. For example, Table 102 has six entries (denoted by x's in the histogram). One entry is at bucket 0x1, one entry is at bucket 0x3, three entries are at bucket 0x178, and one entry is at bucket 0x563. We say that bucket 0x178 has hash collisions, because more than one entry is found within the hash bucket.

Hash collisions reduce performance because more memory locations must be examined to resolve each collision. The best-performing table has no hash collisions. The compiler attempts to implement static entry tables without hash collisions, but this is not always possible. Refer to the Performance User Guide for tips on improving hashing performance.
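To spot collisions automatically, the histogram can be parsed with a few lines of Python. This is a minimal sketch that assumes the section always follows the layout of the example above ("Table <id> :" lines followed by "<hex bucket>: x..." lines); the function names are hypothetical and the real log may contain variations not handled here.

```python
import re

TABLE_RE = re.compile(r"^Table\s+(\d+)\s*:")
BUCKET_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s*(x+)$")


def parse_hash_distribution(text):
    """Return {table_id: {bucket: entry_count}} from the histogram text."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        table = TABLE_RE.match(line)
        if table:
            current = int(table.group(1))
            tables[current] = {}
            continue
        bucket = BUCKET_RE.match(line)
        if bucket and current is not None:
            # One 'x' per static entry that hashed to this bucket.
            tables[current][int(bucket.group(1), 16)] = len(bucket.group(2))
    return tables


def collision_summary(buckets):
    """Summarize one table: entry count and buckets holding more than one entry."""
    return {
        "entries": sum(buckets.values()),
        "occupied_buckets": len(buckets),
        "collided_buckets": {b: n for b, n in buckets.items() if n > 1},
    }


SAMPLE = """Static entry hash distribution:
Table 101 :
  0x00000000: x
  0x00000004: x
Table 102 :
  0x00000001: x
  0x00000003: x
  0x00000178: xxx
  0x00000563: x
"""

parsed = parse_hash_distribution(SAMPLE)
summary_102 = collision_summary(parsed[102])
```

For the sample, summary_102 reports six entries in four occupied buckets, with bucket 0x178 holding three of them, matching the description above.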

Optimization Guidelines

The metrics output serves as a guide for optimizing DPL programs. Consider the following principles:

Minimize STE Execution: The total number of STEs executed has a direct adverse effect on packets per second (PPS). Fewer STEs result in faster processing times per packet.
Simplify Rules: Rule complexity impacts both memory footprint and insertion cost. Smaller, simpler rules execute faster and are faster to insert from the control plane.
Focus on Critical Paths: Complex tables or actions usually only impact performance when located on a critical program execution path. Optimizing rarely used paths (e.g., error handling) may not yield observable performance improvements.

Disclaimers

Estimates Only: The values reported in these metrics are estimates.
Runtime Variables: The compiler cannot predict system caching performance or the specific characteristics of rules inserted at runtime.
Memory Impact: Because fetch and retrieval times from system memory are not known at compile time, actual packet processing costs may vary from these static estimates.
