Metrics#

The OAM Metrics API is used internally by cuPHY-CP components to report metrics (counters, gauges, and histograms). The metrics are exposed via a Prometheus Aerial exporter.

Host Metrics#

Host metrics are provided via the Prometheus node exporter. The node exporter provides many thousands of metrics about the host hardware and OS, including but not limited to:

  • CPU statistics

  • Disk statistics

  • Filesystem statistics

  • Memory statistics

  • Network statistics

See prometheus/node_exporter and https://prometheus.io/docs/guides/node-exporter/ for detailed documentation on the node exporter.

GPU Metrics#

GPU hardware metrics are provided through the GPU Operator via the Prometheus DCGM-Exporter. The DCGM-Exporter provides many thousands of metrics about the GPU and its PCIe bus connection, including but not limited to:

  • GPU hardware clock rates

  • GPU hardware temperatures

  • GPU hardware power consumption

  • GPU memory utilization

  • GPU hardware errors including ECC

  • PCIe throughput

See NVIDIA/gpu-operator for details on the GPU operator.

See NVIDIA/gpu-monitoring-tools for detailed documentation on the DCGM-Exporter.

An example Grafana dashboard is available at https://grafana.com/grafana/dashboards/12239.

Aerial Metric Naming Conventions#

In addition to metrics available through the node exporter and DCGM-Exporter, Aerial exposes several application metrics.

Metric names follow the conventions at https://prometheus.io/docs/practices/naming/ and use the format aerial_<component>_<sub-component>_<metricdescription>_<units>.

Metric types are per https://prometheus.io/docs/concepts/metric_types/.

The component and sub-component definitions are in the table below. For each metric, the description, metric type, and metric tags are provided. Tags are a way of providing granularity to metrics without creating new metrics.

Component   Sub-Component   Description
cuphycp                     cuPHY Control Plane application
            fapi            L2/L1 interface metrics
            cplane          Fronthaul C-plane metrics
            uplane          Fronthaul U-plane metrics
            net             Generic network interface metrics
cuphy                       cuPHY L1 library
            pbch            Physical Broadcast Channel metrics
            pdsch           Physical Downlink Shared Channel metrics
            pdcch           Physical Downlink Control Channel metrics
            pusch           Physical Uplink Shared Channel metrics
            pucch           Physical Uplink Control Channel metrics
            prach           Physical Random Access Channel metrics
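
The naming format can be illustrated with a short Python sketch that splits a metric name into its component, sub-component, and remainder. The function name split_metric_name is hypothetical and not part of Aerial. The lookup table is copied from the component table above, and names whose third field is not a listed sub-component (for example aerial_cuphycp_slots_total) are treated as having none.

```python
# Components and sub-components copied from the table in this section.
SUB_COMPONENTS = {
    "cuphycp": {"fapi", "cplane", "uplane", "net"},
    "cuphy": {"pbch", "pdsch", "pdcch", "pusch", "pucch", "prach"},
}


def split_metric_name(name):
    """Split an Aerial metric name into (component, sub_component, remainder).

    Hypothetical helper for illustration; it is not an Aerial API.
    """
    parts = name.split("_")
    if len(parts) < 3 or parts[0] != "aerial" or parts[1] not in SUB_COMPONENTS:
        raise ValueError(f"not an Aerial metric name: {name!r}")
    component = parts[1]
    if parts[2] in SUB_COMPONENTS[component]:
        return component, parts[2], "_".join(parts[3:])
    # Some metrics in this document carry no listed sub-component.
    return component, None, "_".join(parts[2:])


print(split_metric_name("aerial_cuphycp_cplane_tx_bytes_total"))
print(split_metric_name("aerial_cuphy_prach_rx_preambles_total"))
print(split_metric_name("aerial_cuphycp_slots_total"))
```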

Metrics Exporter Port#

Aerial metrics are exported on port 8081. The address is configurable in the cuphycontroller YAML file via ‘aerial_metrics_backend_address’.
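
As a rough illustration of reading the exporter, the Python sketch below parses Prometheus text-format samples into (name, tags, value) tuples. The host name localhost, the /metrics path (the usual Prometheus convention), and the sample values are assumptions, not taken from an Aerial deployment. The parser is simplified and does not handle escaped quotes inside tag values.

```python
import re
import urllib.request

# One sample line of the Prometheus text format: name{tags} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, tags_dict, value) from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            tags = dict(LABEL_RE.findall(labels or ""))
            samples.append((name, tags, float(value)))
    return samples


def scrape(url="http://localhost:8081/metrics"):
    """Fetch and parse metrics; the host and /metrics path are assumptions."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode("utf-8"))


# Hypothetical exporter output, for illustration only.
example = '''# TYPE aerial_cuphycp_slots_total counter
aerial_cuphycp_slots_total{cell="0",type="UL"} 1200
aerial_cuphycp_slots_total{cell="0",type="DL"} 4800
'''
for name, tags, value in parse_samples(example):
    print(name, tags, value)
```

Calling scrape() against a running exporter would return the same tuple structure for the live metrics.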

L2/L1 Interface Metrics#

aerial_cuphycp_slots_total#

Counts the total number of processed slots.

Metric type: counter

Metric tags:

  • type: “UL” or “DL”

  • cell: “cell number”

aerial_cuphycp_fapi_rx_packets#

Counts the total number of messages L1 receives from L2.

Metric type: counter

Metric tags:

  • msg_type: “type of PDU”

  • cell: “cell number”

aerial_cuphycp_fapi_tx_packets#

Counts the total number of messages L1 transmits to L2.

Metric type: counter

Metric tags:

  • msg_type: “type of PDU”

  • cell: “cell number”

Fronthaul Interface Metrics#

aerial_cuphycp_cplane_tx_packets_total#

Counts the total number of C-plane packets transmitted by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_cplane_tx_bytes_total#

Counts the total number of C-plane bytes transmitted by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_uplane_rx_packets_total#

Counts the total number of U-plane packets received by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_uplane_rx_bytes_total#

Counts the total number of U-plane bytes received by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_uplane_tx_packets_total#

Counts the total number of U-plane packets transmitted by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_uplane_tx_bytes_total#

Counts the total number of U-plane bytes transmitted by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_uplane_lost_prbs_total#

Counts the total number of PRBs expected but not received by L1 over the O-RAN Fronthaul interface.

Metric type: counter

Metric tags:

  • cell: “cell number”

  • channel: One of “prach” or “pusch”
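
Because the fronthaul metrics above are counters, a throughput figure comes from the difference between two scrapes rather than from a single value. The Python sketch below shows one way to do that; the helper name counter_rate, the sample values, and the 10 s scrape interval are hypothetical. Treating a decrease as a restart from zero is a common convention, assumed here.

```python
def counter_rate(prev_value, curr_value, interval_s):
    """Per-second increase of a counter between two scrapes.

    If the counter went backwards, assume the process restarted and the
    counter began again from zero (an assumption, not documented behavior).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_value >= prev_value:
        delta = curr_value - prev_value
    else:
        delta = curr_value
    return delta / interval_s


# Hypothetical samples of aerial_cuphycp_uplane_tx_bytes_total, 10 s apart.
bytes_per_s = counter_rate(1_000_000_000, 2_250_000_000, 10.0)
print(f"U-plane TX: {bytes_per_s * 8 / 1e9:.2f} Gbit/s")
```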

NIC Metrics#

aerial_cuphycp_net_rx_failed_packets_total#

Counts the total number of erroneous packets received.

Metric type: counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_rx_nombuf_packets_total#

Counts the total number of receive packets dropped due to the lack of free mbufs.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_rx_dropped_packets_total#

Counts the total number of receive packets dropped by the NIC hardware.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_failed_packets_total#

Counts the total number of instances a packet failed to transmit.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_missed_interrupt_errors_total#

Counts the total number of instances accurate send scheduling missed an interrupt.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_rearm_queue_errors_total#

Counts the total number of accurate send scheduling rearm queue errors.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_clock_queue_errors_total#

Counts the total number of accurate send scheduling clock queue errors.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_timestamp_past_errors_total#

Counts the total number of accurate send scheduling errors caused by a timestamp in the past.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_timestamp_future_errors_total#

Counts the total number of accurate send scheduling errors caused by a timestamp too far in the future.

Metric type: Counter

Metric tags:

  • nic: “nic port BDF address”

aerial_cuphycp_net_tx_accu_sched_clock_queue_jitter_ns#

Current measurement of accurate send scheduling clock queue jitter, in units of nanoseconds.

Metric type: Gauge

Metric tags:

  • nic: “nic port BDF address”

Details:

This gauge shows the TX scheduling timestamp jitter, that is, how far each individual Clock Queue (CQ) completion is from UTC time.

If you set CQ completion frequency to 2MHz (tx_pp=500), you might see the following completions:
cqe 0 at 0 ns
cqe 1 at 505 ns
cqe 2 at 996 ns
cqe 3 at 1514 ns

tx_pp_jitter is the time difference between two consecutive CQ completions.
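
The arithmetic behind the example completions can be worked through in a few lines of Python. The interval between consecutive completions follows the definition above. Comparing each interval with the nominal 500 ns period is an assumed interpretation added for illustration, not a statement of how the NIC driver computes the reported value.

```python
# CQ completion timestamps from the example above, in nanoseconds.
cqe_times_ns = [0, 505, 996, 1514]
nominal_period_ns = 500  # tx_pp=500, i.e. a 2 MHz completion rate

# Time difference between each pair of consecutive completions.
intervals_ns = [b - a for a, b in zip(cqe_times_ns, cqe_times_ns[1:])]

# Deviation of each interval from the nominal period (assumed interpretation).
deviations_ns = [interval - nominal_period_ns for interval in intervals_ns]

print(intervals_ns)
print(deviations_ns)
```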

aerial_cuphycp_net_tx_accu_sched_clock_queue_wander_ns#

Current measurement of the divergence of Clock Queue (CQ) completions from UTC time over a longer time period (~8s).

Metric type: Gauge

Metric tags:

  • nic: “nic port BDF address”

Application Performance Metrics#

aerial_cuphycp_slot_processing_duration_us#

Counts the total number of slots with GPU processing duration in each 250us-wide histogram bin.

Metric type: Histogram

Metric tags:

  • cell: “cell number”

  • channel: one of “pbch”, “pdcch”, “pdsch”, “prach”, or “pusch”

  • le: histogram bin upper bound (less than or equal to), for 250us-wide bins: 250, 500, …, 2000, +inf.
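
In the Prometheus histogram format, each le bucket is cumulative: it counts all observations less than or equal to its bound. Assuming the Aerial exporter follows that convention, the Python sketch below recovers the number of slots in each individual bin. The helper name per_bin_counts and the sample counts are hypothetical, and the list of bounds is shortened for brevity.

```python
def per_bin_counts(cumulative):
    """Convert cumulative `le` bucket counts into per-bin counts.

    `cumulative` maps each upper bound (as exposed in the `le` tag) to the
    number of observations less than or equal to that bound.
    """
    counts = {}
    previous = 0
    # float() understands both numeric bounds and "+Inf".
    for bound in sorted(cumulative, key=float):
        counts[bound] = cumulative[bound] - previous
        previous = cumulative[bound]
    return counts


# Hypothetical cumulative bucket values for one cell and channel.
sample = {"250": 10, "500": 940, "750": 1000, "+Inf": 1000}
print(per_bin_counts(sample))
```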

aerial_cuphycp_slot_pusch_processing_duration_us#

Counts the total number of PUSCH slots with GPU processing duration in each 250us-wide histogram bin.

Metric type: Histogram

Metric tags:

  • cell: “cell number”

  • le: histogram bin upper bound (less than or equal to), for 250us-wide bins over the range 0 to 2000us.

aerial_cuphycp_pusch_rx_tb_bytes_total#

Counts the total number of transport block bytes received in the PUSCH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_pusch_rx_tb_total#

Counts the total number of transport blocks received in the PUSCH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_pusch_rx_tb_crc_error_total#

Counts the total number of transport blocks received with CRC errors in the PUSCH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”
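
The two PUSCH transport block counters above can be combined into a CRC error ratio over a scrape interval. The Python sketch below is one way to do that; the helper name crc_error_ratio and the sample values are hypothetical.

```python
def crc_error_ratio(tb_prev, tb_curr, err_prev, err_curr):
    """Fraction of transport blocks received with CRC errors in an interval.

    Takes two samples each of aerial_cuphycp_pusch_rx_tb_total and
    aerial_cuphycp_pusch_rx_tb_crc_error_total for the same cell.
    """
    tb_delta = tb_curr - tb_prev
    err_delta = err_curr - err_prev
    if tb_delta <= 0:
        return None  # no new transport blocks, or a counter reset
    return err_delta / tb_delta


# Hypothetical counter samples taken at two consecutive scrapes.
ratio = crc_error_ratio(10_000, 12_000, 50, 250)
print(f"PUSCH CRC error ratio: {ratio:.1%}")
```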

aerial_cuphycp_pusch_nrofuesperslot#

Records the number of UEs processed per slot in the PUSCH channel; each histogram bin counts the slots whose UE count falls in that bin.

Metric type: Histogram

Metric tags:

  • cell: “cell number”

  • le: histogram bin upper bound (less than or equal to): 2, 4, …, 24, +inf.

PRACH Metrics#

aerial_cuphy_prach_rx_preambles_total#

Counts the total number of preambles detected in the PRACH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”

PDSCH Metrics#

aerial_cuphycp_slot_pdsch_processing_duration_us#

Counts the total number of PDSCH slots with GPU processing duration in each 250us-wide histogram bin.

Metric type: Histogram

Metric tags:

  • cell: “cell number”

  • le: histogram bin upper bound (less than or equal to), for 250us-wide bins over the range 0 to 2000us.

aerial_cuphy_pdsch_tx_tb_bytes_total#

Counts the total number of transport block bytes transmitted in the PDSCH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”

aerial_cuphy_pdsch_tx_tb_total#

Counts the total number of transport blocks transmitted in the PDSCH channel.

Metric type: Counter

Metric tags:

  • cell: “cell number”

aerial_cuphycp_pdsch_nrofuesperslot#

Records the number of UEs processed per slot in the PDSCH channel; each histogram bin counts the slots whose UE count falls in that bin.

Metric type: Histogram

Metric tags:

  • cell: “cell number”

  • le: histogram bin upper bound (less than or equal to): 2, 4, …, 24, +inf.
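
A histogram such as this one can be summarized as an approximate quantile, for example the median number of UEs per slot. The Python sketch below interpolates linearly within the bin that contains the requested rank, similar in spirit to PromQL's histogram_quantile. It assumes cumulative bucket counts; the helper name estimate_quantile and the sample data (with a shortened list of bounds) are hypothetical.

```python
def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by bound and end with float("inf").
    Values are assumed to be spread evenly within each bin.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into the open bin
            in_bin = count - lower_count
            fraction = (rank - lower_count) / in_bin if in_bin else 0.0
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts of slots by number of UEs per slot.
sample = [(2, 100), (4, 300), (6, 380), (float("inf"), 400)]
print(estimate_quantile(0.5, sample))
```

The result is only as fine-grained as the bin width, so it is best read as an estimate.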