NIC Health Monitor: Link Counter Detection

Table of Contents

  1. Overview
  2. Theoretical Foundation
  3. Architecture
  4. Complete Counter Specification
  5. Counter Reading and Parsing
  6. Counter Reset Handling and Persistent State
  7. Missing Counter Handling
  8. RDMA vs TCP/IP Counter Domains
  9. Data Structures
  10. Configuration
  11. Event Management

Related Documents:


1. Overview

1.1 Problem Statement

Modern GPU clusters suffer from Grey Failures (subtle degradations) and straggler effects where a single degraded link throttles thousands of GPUs. Simple UP/DOWN polling is insufficient; a deterministic degradation detection system is required that can detect both hard failures and gradual degradation before FEC exhaustion causes catastrophic packet loss.

This document covers the Degradation Monitoring component of the NIC Health Monitor, which detects:

  • Fatal counter violations - Counters that guarantee workload failure when incremented
  • Rate-based degradation - Error rates exceeding thresholds that predict impending failure
  • Pre-failure prediction - Detecting BER climbing before FEC exhaustion

1.3 Binary Severity Model

This monitor uses a binary severity model based on workload impact:

| Severity | Meaning | Example |
|---|---|---|
| Fatal | Workload WILL fail or HAS failed | link_downed (any), excessive_buffer_overrun_errors (any) |
| Non-Fatal | Degradation detected, workload continues | Symbol errors, congestion, link flapping |

Key Design Principle: The only question that matters is “Will the running workload fail because of this?”

1.4 Counter Detection Overview Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│ LINK COUNTER DETECTION FLOW │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DATA SOURCES (sysfs) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ /sys/class/infiniband/<dev>/ports/<port>/ │ │
│ │ ├── counters/ │ │
│ │ │ ├── symbol_error → PHY bit errors (before FEC) │ │
│ │ │ ├── link_error_recovery → Link retraining events │ │
│ │ │ ├── link_downed → Port training failures (FATAL) │ │
│ │ │ ├── port_rcv_errors → Malformed packets │ │
│ │ │ ├── local_link_integrity_errors → Physical errors (FATAL) │ │
│ │ │ ├── excessive_buffer_overrun → Lossless violation (FATAL) │ │
│ │ │ └── port_xmit_discards → TX discards (congestion) │ │
│ │ │ │ │
│ │ └── hw_counters/ → Extended counters │ │
│ │ ├── roce_slow_restart → Victim flow oscillation │ │
│ │ ├── local_ack_timeout_err → ACK timeout (path issues) │ │
│ │ ├── rnr_nak_retry_err → Connection severed (FATAL) │ │
│ │ └── req_transport_retries_exceeded → IB only (FATAL) │ │
│ │ │ │
│ │ /sys/class/net/<interface>/statistics/ │ │
│ │ └── carrier_changes → Link flap counter │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DEGRADATION MONITOR (1s polling interval) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ CALCULATES (locally, for threshold comparison): │ │
│ │ ├── Δ (delta) → current_value − snapshot.value │ │
│ │ ├── Δt (elapsed) → now − snapshot.timestamp (wall-clock) │ │
│ │ └── Δ/Δt (rate) → Errors per unit time, only after Δt ≥ window │ │
│ │ │ │
│ │ PERSISTS (hostPath-backed state file): │ │
│ │ ├── Per-counter snapshot (value + timestamp for delta/velocity) │ │
│ │ ├── Per-counter breach flag (for recovery event emission) │ │
│ │ └── Boot ID (clear state + emit healthy baselines on reboot) │ │
│ │ │ │
│ │ FATAL COUNTERS (immediate event): │ │
│ │ ├── link_downed (Delta > 0) → FATAL │ │
│ │ ├── excessive_buffer_overrun (any) → FATAL │ │
│ │ ├── local_link_integrity_errors (any) → FATAL │ │
│ │ ├── rnr_nak_retry_err (any) → FATAL │ │
│ │ └── symbol_error_fatal (> 120/hour) → FATAL │ │
│ │ │ │
│ │ NON-FATAL THRESHOLDS (degradation event): │ │
│ │ ├── symbol_error > 10/sec → NON-FATAL │ │
│ │ ├── link_error_recovery > 5/min → NON-FATAL │ │
│ │ ├── roce_slow_restart > 10/sec → NON-FATAL │ │
│ │ └── carrier_changes > 2/interval → NON-FATAL │ │
│ │ │ │
│ │ RECOVERY (when previously breached counter clears): │ │
│ │ └── Admin counter reset detected → RECOVERY (IsHealthy=true) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ RAW EVENTS + RECOVERY EVENTS → PLATFORM CONNECTOR → MongoDB │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ HEALTH EVENTS ANALYZER (Escalation Rules) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ • RepeatedNICDegradation: "5+ non-fatal events in 24h → FATAL" │ │
│ │ • Pattern detection across time windows │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

2. Theoretical Foundation

2.1 The Physics of High-Speed Signaling Degradation

Modern interconnects (HDR/NDR InfiniBand, 100/200/400GbE) use PAM4 modulation (Pulse Amplitude Modulation, 4-level) to achieve high bandwidth. This represents a fundamental paradigm shift from previous generations.

2.1.1 PAM4 vs NRZ: Why Velocity Monitoring is Required

| Aspect | NRZ (EDR/100GbE) | PAM4 (NDR/400GbE) |
|---|---|---|
| Bits per symbol | 1 | 2 |
| Voltage levels | 2 (0, 1) | 4 (00, 01, 10, 11) |
| Eye height | Maximum | 1/3 of NRZ |
| SNR | High | Drastically reduced |
| Raw bit errors | Rare anomaly | Guaranteed and constant |
| Monitoring approach | “Any error is bad” | Velocity-based only |

Critical: In PAM4 systems, raw bit errors are a physical certainty. A monitor that alerts on “Any Error > 0” would be permanently alarming. The velocity-based approach is the only valid monitoring strategy for 400G+ networks. (Reference: PAM4 Test Challenges)

2.2 Signal Degradation Progression

Degradation flow: Physical impairment (cable/SFP) → Eye diagram closes (DSP struggles) → Symbol errors (PHY layer) → FEC corrections (recoverable) → CRC failures (unrecoverable) → Packet loss (FATAL)

Monitoring opportunity: Detect degradation at the symbol_error stage, before FEC exhaustion causes packet loss.

2.3 Bit Error Rate (BER), FEC, and the “Cliff Effect”

Because errors are inevitable in PAM4, Forward Error Correction (FEC) is mandatory for 200G/400G/NDR links.

| Link Health State | Bit Error Rate | Symbol Errors | Action |
|---|---|---|---|
| Healthy | < 10⁻¹⁵ | ~0 post-FEC | None |
| Failed (Fatal) | > 10⁻¹² | FEC margin exhausted | Fatal (REPLACE_VM) |

2.3.1 The FEC “Cliff Effect”

FEC masks physical degradation until the error rate exceeds correction capacity—then packet loss spikes instantly from 0% to ~100% (the “cliff”). The Degradation Monitor tracks Pre-FEC BER via symbol_error velocity, enabling node draining before the cliff is reached.

PAM4 Note (HDR/NDR): On 200G/400G adapters, non-zero raw BER is expected. Use rate-based thresholds (e.g., symbol_error > 10/sec) for degradation detection, not symbol_error > 0.

2.4 The Lossless Assumption and Deterministic Failure Horizons

Unlike general-purpose TCP/IP networks, which are architected to be resilient to packet loss, latency variation, and out-of-order delivery, RDMA fabrics—specifically InfiniBand (IB) and RDMA over Converged Ethernet (RoCE)—are designed under a “lossless” assumption. This architectural premise dictates that once a packet is admitted to the fabric, its delivery is guaranteed by credit-based flow control (in IB) or Priority Flow Control (in RoCE), relieving the transport layer of heavy congestion management overhead.

Key Insight: This reliance on near-perfect transmission introduces a binary fragility to the system. When the physical or link layer violates the lossless assumption, the impact on the application is often not merely performance degradation, but catastrophic failure. For tightly coupled distributed workloads using MPI or NCCL, a failure in a single link deterministically terminates the entire job.

2.4.1 Soft vs Hard Errors: The Determinism Boundary

The critical operational requirement is distinguishing between:

| Error Type | Characteristics | Impact |
|---|---|---|
| Soft Errors | Probabilistic, recoverable via FEC/retries | Performance degradation, workload continues |
| Hard Errors | Deterministic, exceed recovery capacity | Application failure guaranteed |

The boundary between soft and hard errors is defined by:

  1. Counter thresholds that indicate recovery mechanism exhaustion
  2. Rate of change that exceeds retry bandwidth
  3. Specific counter types that indicate fundamental violation of the lossless contract

2.4.2 The 10⁻¹² BER Threshold

The InfiniBand specification defines a compliant link as maintaining a Bit Error Rate (BER) better than 10⁻¹². This physical constant provides the basis for threshold calculations:

  • At a BER of 10⁻¹², a link running at high speed (e.g., HDR 200Gb/s) experiences a predictable number of errors per unit time
  • IBTA-compliant threshold: Maximum allowable symbol error rate is 120 errors per hour (IBTA Specification / Oracle Documentation)
  • Below this rate, FEC algorithms can typically correct errors without retransmission
  • Above this rate, the “effective” error rate (post-FEC) rises, leading to packet corruption and Link Level Retransmission (LLR) or transport layer retries

Monitoring Implication: While a single SymbolError is not fatal, a rate exceeding 120/hour (≈2/minute) is a deterministic predictor of impending link instability. Monitoring systems should treat this as a Fatal condition requiring node replacement.

2.4.3 Deterministic Failure Mechanisms

The following counters represent absolute deterministic failure when they increment:

| Counter | Mechanism | Why Deterministic |
|---|---|---|
| link_downed | Port Training State Machine fails to maintain LinkUp | Standard HPC applications do not support transparent dynamic rerouting of active QPs |
| excessive_buffer_overrun_errors | HCA internal ingress buffer overflows | Violates fundamental “lossless” contract; packet causing overrun is dropped immediately |
| rnr_nak_retry_err | Receiver Not Ready NAK retry exhausted | Terminal state of error handling; connection is severed |
| local_link_integrity_errors | Raw physical errors exceed LocalPhyErrors hardware limit | Link is operating outside design specifications |

Note: These four counters represent absolute deterministic failure. Additionally, symbol_error has a fatal threshold at > 120/hour (IBTA BER spec violation) via the symbol_error_fatal config entry. All other counters (symbol_error at > 10/sec, port_rcv_errors, etc.) are non-fatal degradation indicators.

2.5 The Transport Layer Retry Window

When hardware counters increment, they don’t directly cause application failure—they trigger a reaction in the software stack. Understanding this interaction defines the “Fatal” threshold:

┌─────────────────────────────────────────────────────────────────────────────────┐
│ TRANSPORT LAYER RETRY WINDOW │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Hardware: SymbolError ──► FEC fails ──► Packet Corrupted │
│ │ │
│ ▼ │
│ Receiver: drops packet ──► PortRcvErrors increments │
│ │ │
│ ▼ │
│ Sender: waits for ACK ──► Timeout ──► Retry (1) ──► ... ──► Retry (N) │
│ │ │ │
│ │ ▼ │
│ │ GIVE UP │
│ │ │ │
│ ▼ ▼ │
│ Application: NCCL_IB_RETRY_CNT (default: 7) exhausted │
│ │ │
│ ▼ │
│ Result: QP transitions to ERROR state ──► Application crashes │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

2.6 Transport Retry Count Exceeded (Error 12)

When the NIC sends a packet and the ACK never arrives:

Send Packet → Wait for ACK → Timeout → Retry (1) → Timeout → ... → Retry (N) → GIVE UP

After retry_cnt attempts (default: 7), the NIC tears down the connection and the application receives IBV_WC_RETRY_EXC_ERR.

Implications:

  • Confirms Logical Link is broken even if physical link is UP
  • Often indicates “Silent Drop” or Black Hole in the fabric
  • Local symptom of a remote problem

Important: Application-Triggered Timeouts. A rising local_ack_timeout_err counter does NOT necessarily indicate a local NIC fault. If a remote NCCL rank crashes or hangs, the remote NIC stops responding to RDMA requests. The local NIC retries and eventually exhausts retry_cnt, incrementing local_ack_timeout_err on the local side. This means the counter can be triggered by: (1) fabric black hole (network issue), (2) remote NIC failure, or (3) remote application crash/hang — which is not a NIC problem at all. This is why local_ack_timeout_err is classified as Non-Fatal (IsFatal=false) — it requires correlation with other signals (port state, remote node health) to determine the root cause.

What This Monitor CAN Detect: The local_ack_timeout_err and req_transport_retries_exceeded (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can’t see the application error.

Diagnostic Commands:

```bash
# Read hardware counters directly from sysfs
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_transport_retries_exceeded
# Output: 42 (non-zero value indicates connection-severing retries)

# Detailed link quality and error counters via Mellanox diagnostic tools
mlxlink -d /dev/mst/mt4126_pciconf0 --show_ber
# Output: Symbol Errors, BER counters

# Query eye opening (signal quality indicator)
mlxlink -d /dev/mst/mt4126_pciconf0 --eye_open
# Output: Eye height/width for each PAM4 lane (identifies physical cable degradation)
```

Correlation: Use with ibdiagnet to determine if issue is local (NIC) or remote (Switch/Fabric).

Fabric-wide Diagnostic Command:

```bash
# Perform comprehensive fabric-wide diagnostics (requires Subnet Manager access)
ibdiagnet -o /tmp/ibdiag_output
# Output: Summary of fabric errors, including symbol errors on switches and remote ports
```

3. Architecture

3.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern

The Degradation Monitor follows NVSentinel’s established architectural pattern where:

  1. Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
  2. Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
  3. MongoDB serves as the source of truth for event history and correlation queries

| Architectural Principle | Implementation | Purpose |
|---|---|---|
| Raw Event Reporting | Each threshold violation → immediate event | Enables centralized correlation with full historical context |
| Centralized Correlation | Health Events Analyzer MongoDB pipelines | Flexible, configurable rules without monitor code changes |
| Temporal Correlation | Analyzer rules with time windows | Detects patterns like “5 degradation events in 24 hours” |

3.2 Component Responsibilities

| Component | Responsibility | What It Does NOT Do |
|---|---|---|
| NIC Health Monitor (Degradation Check) | Poll sysfs counters, calculate deltas/rates, persist counter snapshots and breach state, emit raw events and recovery events | Aggregation, deduplication, correlation, pattern detection |
| Health Events Analyzer | Correlate events, detect patterns, escalate severity | Direct hardware access |

Local State Persistence: The Degradation Check maintains a persistent state file on the node (hostPath-backed) containing per-counter snapshots (value + timestamp), per-counter breach flags, and the host boot ID. This enables the monitor to:

  1. Compute accurate deltas and precise velocity rates by holding the persisted snapshot for the configured velocity window and computing the rate over the real elapsed time, so a 120/hour threshold is observed over a one-hour window rather than extrapolated from a single 1s sample.
  2. Seamlessly resume velocity windows after a pod restart, because the snapshot timestamp survives the restart.
  3. Emit recovery events (IsHealthy=true) when counters are reset by an administrator, by retaining the breach flag across restarts.
  4. Detect host reboots, clear all state, and emit healthy baseline events for all ports and counters, since the node may have had NICs replaced during maintenance.

This local state is strictly operational; all correlation and pattern detection remains centralized in the Health Events Analyzer.

3.3 Degradation Check Data Flow (1s polling interval)

Reads:
├── counters/ → Standard IB counters (symbol_error, link_error_recovery, etc.)
├── hw_counters/ → Extended counters (roce_slow_restart, rnr_nak_retry_err, etc.)
├── statistics/ → Ethernet statistics (rx_crc_errors, rx_missed_errors, etc.)
└── carrier_changes → Link flap counter (catches UP/DOWN events between polls)
Calculates (locally, for threshold comparison):
├── Δ (delta) → current_value − snapshot.value
├── Δt (elapsed) → now − snapshot.timestamp (real wall-clock time)
└── Δ/Δt (rate) → For velocity thresholds, evaluated only after Δt ≥ window
(window = 1s / 1m / 1h, matching the configured velocityUnit)
When threshold exceeded for the first time, emits a single RAW event with:
├── Counter name → e.g., "symbol_error_fatal"
├── Current value → e.g., 12500
├── Delta → accumulated since the snapshot was taken
├── Rate → e.g., 200/hour
└── Threshold → e.g., 120/hour
Subsequent polls while breached emit nothing (latching breach). A
recovery event is emitted only when the counter is reset (admin clear)
or the host reboots.
Fatal counter thresholds (configurable, defaults shown):
├── link_downed (Delta > 0) → QP disconnect (FATAL)
├── excessive_buffer_overrun_errors (any) → Lossless violation (FATAL)
├── local_link_integrity_errors (any) → Link outside spec (FATAL)
├── rnr_nak_retry_err (any) → Connection severed (FATAL)
└── symbol_error_fatal (> 120/hour) → IBTA BER spec violation (FATAL)
Non-fatal thresholds (configurable, defaults shown):
├── symbol_error > 10/sec
├── link_error_recovery > 5/min
├── roce_slow_restart > 10/sec
└── carrier_changes > 2/interval
Persists (to hostPath-backed state file after each poll cycle):
├── Per-counter snapshot → Value + wall-clock timestamp (for delta/velocity)
├── Per-counter breach → Whether threshold is currently exceeded (for recovery)
└── Boot ID → Detects host reboot to clear state + emit healthy baselines
Emits: Raw DEGRADATION events → Platform Connector → MongoDB
Recovery events (IsHealthy=true) when breached counter clears (e.g., admin reset)
(Pattern detection and escalation handled by Health Events Analyzer)

4. Complete Counter Specification

4.1 Complete Counter Set (“Golden Counters” + Extended)

This monitor tracks both fatal counters (deterministic workload failure) and non-fatal counters (degradation indicators). The IsFatal field in the HealthEvent distinguishes between them.

4.1.1 Standard Counters (/sys/class/infiniband/<dev>/ports/<port>/counters/)

| Counter | File Name | Degradation Meaning | IsFatal | Alert Threshold | Source |
|---|---|---|---|---|---|
| Symbol Error | symbol_error | Raw bit errors before FEC. Expected non-zero for PAM4 (HDR/NDR). | No | Rate-based (e.g., > 10/sec for warning) | Oracle/IBTA |
| Link Error Recovery | link_error_recovery | PHY-initiated link retraining (micro-flapping). Causes millisecond-scale latency spikes. | No | > 5/min (watchdog trigger) | NVIDIA UFM IB Port Counters |
| Link Downed | link_downed | Port Training State Machine failed to maintain LinkUp. | YES | Delta > 0 (Runtime) | HPE ClusterStor |
| Port Receive Errors | port_rcv_errors | Malformed packets (CRC, length errors). Saturates retry bandwidth at high rates. | No | > 10/sec (retry saturation) | NVIDIA UFM IB Port Counters |
| Local Link Integrity | local_link_integrity_errors | Raw physical errors exceeded LocalPhyErrors hardware cap. Link operating outside spec. | YES | > 0 (any) | HPE ClusterStor |
| Buffer Overrun | excessive_buffer_overrun_errors | HCA internal buffer overflow—lossless contract violated. Packet dropped immediately. | YES | > 0 (any) | IBM Redbooks |
| Port Transmit Discards | port_xmit_discards | TX discards due to congestion. | No | > 100/sec | |

4.1.2 Extended Counters (/sys/class/infiniband/<dev>/ports/<port>/hw_counters/) — Mostly Non-Fatal

Extended counters are non-fatal by default, with one exception: rnr_nak_retry_err is fatal (see below). The non-fatal counters indicate congestion, retransmissions, or recoverable transport events; RDMA’s reliable transport handles these automatically, and workloads continue with potential performance impact.

Key counters by layer (monitored for performance degradation):

| Category | Counters | IsFatal | Alert Threshold | Justification |
|---|---|---|---|---|
| Physical | symbol_error | No | > 10/sec | PHY signal degradation / Dirty fiber. |
| Link | link_error_recovery | No | > 5/min | Link Flapping / PTSM Instability. |
| Integrity | port_rcv_errors | No | > 10/sec | FCS/CRC Corruption (Bit Rot). |
| Congestion | port_xmit_discards | No | > 100/sec | Congestion Collapse / PFC breakdown. |
| Transport | roce_slow_restart | No | > 10/sec | Victim Flow / Transport Oscillation (Straggler). |
| Transport | rnr_nak_retry_err | YES | > 0 (any) | RNR NAK retry exhausted; QP enters error state (ref). |
| Timeout | local_ack_timeout_err | No | > 1/sec | Broken Path / Fabric Black Hole. Can be caused by remote app crash (see Section 2.6). |
| Interface | carrier_changes | No | > 2/interval | Physical instability visible to OS. |

Key Insights:

  • rnr_nak_retry_err > 0: FATAL - Indicates RNR NAK retry exhausted; the connection has been severed.
  • roce_slow_restart > 10/sec: Primary indicator for Grey Failures. Indicates flow oscillation and straggler behavior.
  • port_xmit_discards > 100/sec: Flow control breakdown. Network physically unable to handle load.
  • symbol_error > 10/sec: Signature of “Dirty Fiber” or microscopic dust on connectors.

4.2 Counter Locations

  • Standard IB counters: /sys/class/infiniband/<dev>/ports/<port>/counters/ (symbol_error, link_downed, local_link_integrity_errors, etc.)
  • Extended counters (Mellanox): /sys/class/infiniband/<dev>/ports/<port>/hw_counters/ (rnr_nak_retry_err, roce_slow_restart, etc.)
  • Ethernet stats (RoCE): /sys/class/net/<iface>/statistics/ (carrier_changes)

4.3 Diagnostic Commands

```bash
# Read standard counters
cat /sys/class/infiniband/mlx5_0/ports/1/counters/symbol_error
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_errors
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_discards

# Read extended hw_counters (degradation monitoring)
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/local_ack_timeout_err
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/roce_slow_restart
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err

# Fabric-wide diagnostics (requires Subnet Manager access)
ibdiagnet -o /tmp/ibdiag_output
```

4.4 Key Design Decisions

  • link_downed is Fatal. In running MPI/NCCL jobs, any increment (Delta > 0) guarantees job crash.
  • excessive_buffer_overrun_errors is Fatal. Violates fundamental “lossless” contract; packet causing overrun is dropped immediately.
  • rnr_nak_retry_err is Fatal. Indicates Receiver Not Ready NAK retry exhausted; the connection has been severed.
  • local_link_integrity_errors is Fatal. This counter is a “meta-threshold”—it only increments when raw physical errors exceed the hardware-defined LocalPhyErrors cap.
  • symbol_error thresholds account for PAM4 (HDR/NDR) behavior. Zero-tolerance is obsolete for modern links; non-zero raw BER is expected. Monitor velocity for degradation trends.
  • Most hw_counters are Non-Fatal by default—they indicate degradation that should be monitored but doesn’t immediately crash workloads. Exception: rnr_nak_retry_err is fatal.

4.5 Consolidated Deterministic Failure Thresholds (Defaults)

Configuration Note: All thresholds and severity levels are configurable. The values below are defaults based on industry specifications and vendor recommendations. See Section 10: Configuration for customization options.

Table 1: Absolute Deterministic Failure Thresholds (Default: Fatal - IsFatal=true)

Breaching these thresholds guarantees application failure or mandatory node exclusion.

| Counter Name | Type | Fatal Threshold | IsFatal | Deterministic Mechanism | Source |
|---|---|---|---|---|---|
| link_downed | Standard | Delta > 0 (Runtime) | YES | Logical path destruction; QP disconnect. Standard HPC apps don’t support transparent QP rerouting. | HPE ClusterStor |
| excessive_buffer_overrun_errors | Standard | > 0 (Any) | YES | Lossless guarantee violation; packet dropped immediately. HCA ingress buffer overflow. | IBM Redbooks |
| rnr_nak_retry_err | Extended | > 0 (Any) | YES | Receiver Not Ready NAK retry exhausted; QP transitions to error state (IBV_WC_RNR_RETRY_EXC_ERR). Connection cannot recover without application-level teardown. | ibv_modify_qp(3) - rnr_retry QP attr, NVIDIA RDMA Programming |
| local_link_integrity_errors | Standard | > 0 (Any) | YES | Physical error density exceeds hardware-defined LocalPhyErrors cap. Link outside spec. | HPE ClusterStor |
| symbol_error_fatal | Standard | > 120/hour | YES | IBTA BER spec violation (10⁻¹²). Link operating outside specification; FEC margin exhausted. | Oracle/IBTA BER Threshold |

Table 2: Predictive Thresholds (Non-Fatal - IsFatal=false)

Breaching these rates indicates degradation requiring monitoring. Workloads continue but performance may be impacted.

Threshold Source Note: These thresholds are derived from a combination of IBTA BER specifications, cloud provider operational heuristics (Azure, AWS), vendor documentation, and field experience. Specific rate values are configurable defaults, not specification mandates. See Section 10: Configuration for customization options.

| Counter Name | Type | Alert Threshold | IsFatal | Rationale | Source |
|---|---|---|---|---|---|
| symbol_error | PHY | > 10/sec | No | Physical layer degradation (Dirty Fiber). Derived from IBTA BER spec (10⁻¹²); 10/sec implies BER degraded to ~1E-8. | Oracle/IBTA BER Threshold, NVIDIA UFM IB Port Counters |
| link_error_recovery | Link | > 5/min | No | PTSM Instability. Each retrain causes 50ms-2s stall. 5/min = flapping link. | NVIDIA UFM IB Port Counters (counter definition); threshold is operational heuristic |
| port_rcv_errors | Standard | > 10/sec | No | Bit Rot / CRC Corruption. Saturates transport replay buffer. | NVIDIA UFM IB Port Counters |
| port_xmit_discards | Standard | > 100/sec | No | Congestion Collapse / PFC breakdown. | NVIDIA UFM IB Port Counters (counter definition); threshold is operational heuristic |
| roce_slow_restart | RoCE | > 10/sec | No | “Victim Flow” oscillation. Jitter impacts AllReduce synchronization. | NVIDIA DOCA Telemetry |
| local_ack_timeout_err | Transport | > 1/sec | No | ACK timeouts indicate path issues (Black Hole). Can also be caused by remote application crash (see Section 2.6). | Operational heuristic |
| carrier_changes | Interface | > 2/interval | No | Link instability (catches UP/DOWN events between polls). | Operational heuristic |

4.6 Technical Justification for Non-Fatal Thresholds

The following analysis validates the efficacy of the proposed monitoring design based on hardware specifications and empirical reliability studies.

1. Physical Layer (L1) Justifications

  • Symbol Error (symbol_error > 10/sec): A rate of 10/sec is a robust indicator of physical degradation. In modern PAM4 links, a healthy optical connection operates with a BER better than 1E-12 (roughly one error every few hours). A rate of 10/sec implies the BER has degraded by orders of magnitude (to ~1E-8). This is the classic signature of “Dirty Fiber” or microscopic dust on connectors.
  • Link Error Recovery (link_error_recovery > 5/min): Tracks the Port Training State Machine (PTSM). 5 events per minute represents a “Flapping” link. While the link recovers (non-fatal), each retrain causes 50ms to 2s of stall, decimating performance for synchronous GPU workloads.
  • Carrier Changes (carrier_changes > 2/interval): The OS-visible shadow of link recovery. Confirms that physical instability was severe enough to disrupt the driver layer.

2. Data Link Layer (L2) Justifications

  • Port Receive Errors (port_rcv_errors > 10/sec): Indicates “Bit Rot”—data corruption surviving the PHY but failing the CRC/FCS check. Triggers “Phantom Congestion” as the network repeatedly retransmits corrupted frames.
  • Port Transmit Discards (port_xmit_discards > 100/sec): Indicates flow control breakdown. The network is physically unable to handle the load, and backpressure mechanisms (PFC) are failing. Definitive signal of Congestion Collapse.

3. Transport Layer (L4) Justifications

  • RoCE Slow Restart (roce_slow_restart > 10/sec): Primary indicator for Grey Failures. Indicates a flow is timing out and resetting its congestion window repeatedly. This creates stragglers that stall the entire GPU fleet during collective operations (AllReduce).
  • Local ACK Timeout (local_ack_timeout_err > 1/sec): In a reliable lossless network, ACKs should not be lost. A persistent rate of 1/sec implies a “Fabric Black Hole” (e.g., a specific bad ECMP path).

Note on rnr_nak_retry_err: This counter is FATAL (not a non-fatal threshold). Any increment indicates the Receiver Not Ready NAK retry limit has been exhausted and the connection has been severed. This is a terminal state of error handling.

Final Verdict: These thresholds are calibrated to distinguish between background noise (standard FEC activity) and pathological hardware degradation that threatens AI training efficiency.


5. Counter Reading and Parsing

5.1 Mellanox Counter Reading

For Mellanox devices (IB and RoCE), the monitor reads:

  1. Standard Counters: /sys/class/infiniband/<dev>/ports/1/counters/
    • Fatal counters: link_downed, local_link_integrity_errors, excessive_buffer_overrun_errors
    • Two-tier counter: symbol_error — non-fatal at > 10/sec (degradation warning), fatal at > 120/hour (IBTA BER spec violation)
  2. Extended Counters: /sys/class/infiniband/<dev>/ports/1/hw_counters/
    • Fatal counter: rnr_nak_retry_err
    • Non-fatal counters for degradation monitoring

Note: Mellanox throughput counters (port_rcv_data, port_xmit_data) are in 4-byte words. Multiply by 4 to get bytes.
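
To make the reading and conversion concrete, here is a minimal sketch (not the monitor's actual implementation) that reads one counter from sysfs and applies the 4-byte-word conversion for Mellanox throughput counters; the readCounter helper and package layout are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter reads a single uint64 counter from the InfiniBand sysfs tree.
func readCounter(device string, port int, rel string) (uint64, error) {
	p := filepath.Join("/sys/class/infiniband", device, "ports", strconv.Itoa(port), rel)
	raw, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	words, err := readCounter("mlx5_0", 1, "counters/port_rcv_data")
	if err != nil {
		fmt.Println("counter unavailable:", err)
		return
	}
	// port_rcv_data / port_xmit_data are reported in 4-byte words.
	fmt.Printf("port_rcv_data: %d words = %d bytes\n", words, words*4)
}
```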

5.2 Mellanox Fatal Counter Paths

| Counter | Path | Fatal Threshold |
|---|---|---|
| symbol_error_fatal | /sys/class/infiniband/<dev>/ports/<port>/counters/symbol_error | > 120/hour |
| local_link_integrity_errors | /sys/class/infiniband/<dev>/ports/<port>/counters/local_link_integrity_errors | Delta > 0 |
| excessive_buffer_overrun_errors | /sys/class/infiniband/<dev>/ports/<port>/counters/excessive_buffer_overrun_errors | Delta > 0 |
| rnr_nak_retry_err | /sys/class/infiniband/<dev>/ports/<port>/hw_counters/rnr_nak_retry_err | Delta > 0 |

Note: symbol_error has two default config entries: symbol_error (non-fatal, > 10/sec for degradation) and symbol_error_fatal (fatal, > 120/hour per IBTA specification (10⁻¹² BER)). Both read from the same sysfs file. On PAM4 links (HDR/NDR), some non-zero symbol errors are expected; tune the fatal threshold if 120/hour is too sensitive for your environment.


6. Counter Reset Handling and Persistent State

Hardware counters may reset due to driver reloads, device resets, administrator-initiated clears (e.g., perfquery -x, echo 0 > /sys/...), or (rarely) uint64 overflow. The monitor must handle cases where Current < Previous to avoid incorrect delta calculations and must emit recovery events when a counter reset clears a previously breached threshold.

6.1 The Problem

Poll N: symbol_error = 1,000,000
Driver Reload / Counter Reset / Admin Clear
Poll N+1: symbol_error = 50
Naive Delta = 50 - 1,000,000 = NEGATIVE (or overflow to huge positive)

Additionally, if symbol_error had previously triggered a FATAL event (e.g., exceeding 120/hour), and an administrator resets the counters to remediate the issue, the monitor must detect this and emit a recovery event (IsHealthy=true) to clear the unhealthy condition on the platform.

6.2 Counter Reset Causes

| Cause | Detection | Expected Behavior |
|---|---|---|
| Driver reload (modprobe -r mlx5_core) | current < previous; syslog monitor reports correlated kernel log | Treat current as delta, check for recovery |
| Device reset (firmware/hardware initiated) | current < previous; may correlate with syslog events | Treat current as delta, check for recovery |
| Administrator clear (CSP/cluster admin) | current < previous (typically to 0); no correlated syslog event | Treat current as delta, emit recovery event if previously breached |
| Host reboot | Boot ID changes; all counters restart from 0 | Clear all persisted state, emit healthy baselines for all ports and counters (see Section 6.5) |
| uint64 overflow | current < previous (extremely rare) | Treat current as delta |

6.3 Counter Reset Handling Algorithm

Reset Detection (uses an in-memory lastPollValue per counter plus the persisted snapshot as fallback after pod restart):

  1. If lastPollValue is recorded for this counter (steady state):
    • Reset detected when current < lastPollValue
  2. Else if a persisted snapshot exists (first poll after pod restart, lastPollValue not yet rebuilt):
    • Reset detected when current < snapshot.value
  3. Else (truly first poll for this counter):
    • No reset to detect; just initialize the snapshot

Sysfs counters are kernel-maintained and monotonic between resets, so current < previous is a definitive reset signal.

Threshold Evaluation and Recovery Steps (latching breach):

  1. If reset detected:
    • If counter was previously breached → emit recovery event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE), clear breached flag
    • If not previously breached → no event
    • Update snapshot to (current, now) so the next window starts fresh from the post-reset baseline
  2. Else if counter is already breached (latched):
    • No event — breach stays latched until counter reset or boot ID change
  3. Else (no reset, not currently breached) — evaluate the threshold:
    • For delta: breach = (current − snapshot.value) > threshold, update snapshot every poll
    • For velocity: skip until now − snapshot.timestamp ≥ window, then breach = rate > threshold, update snapshot
  4. If breach detected in step 3:
    • Emit unhealthy event (IsHealthy=false, IsFatal per counter config), set breached=true in persistent state
  5. If no breach in step 3:
    • No event — still healthy

Latching breach rationale: Once a fatal counter increments (e.g. link_downed=1), the underlying physical event has happened. The fact that no further increments occur in the next poll does not mean the issue is resolved — only that no more events are accumulating right now. Recovery therefore requires explicit remediation (admin counter reset or host reboot), not merely the absence of new errors. This is consistent with the Section 6.4 admin-reset timeline.
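
A minimal Go sketch of this latching delta evaluation follows, replaying the Section 6.4 timeline for link_downed; the snapshot, breachState, and outcome types are illustrative stand-ins, not the monitor's real data structures.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the persisted snapshot and breach flag.
type snapshot struct {
	Value     uint64
	Timestamp time.Time
}

type breachState struct {
	Breached bool
}

type outcome string

const (
	noEvent       outcome = "no event"
	emitUnhealthy outcome = "unhealthy event"
	emitRecovery  outcome = "recovery event"
)

// evaluateDelta applies the reset-detection and latching-breach steps above
// to a single delta-type counter (e.g., link_downed with threshold 0).
func evaluateDelta(current uint64, now time.Time, snap *snapshot, st *breachState, threshold uint64) outcome {
	// Reset detection: sysfs counters are monotonic between resets.
	if current < snap.Value {
		*snap = snapshot{Value: current, Timestamp: now} // post-reset baseline
		if st.Breached {
			st.Breached = false
			return emitRecovery // previously breached counter cleared by the reset
		}
		return noEvent
	}
	// Latched breach: stay silent until a reset (or reboot) clears it.
	if st.Breached {
		return noEvent
	}
	// Normal delta evaluation against the previous poll's snapshot.
	delta := current - snap.Value
	*snap = snapshot{Value: current, Timestamp: now}
	if delta > threshold {
		st.Breached = true
		return emitUnhealthy
	}
	return noEvent
}

func main() {
	snap := &snapshot{Value: 0, Timestamp: time.Now()}
	st := &breachState{}
	// Replay the Section 6.4 timeline: 0, 1 (breach), 1 (latched), 0 (admin reset), 0.
	for _, v := range []uint64{0, 1, 1, 0, 0} {
		fmt.Printf("link_downed=%d -> %s\n", v, evaluateDelta(v, time.Now(), snap, st, 0))
	}
}
```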

6.4 Admin Counter Reset: Recovery Event Scenario

The following timeline illustrates why persistent breach tracking and recovery events are required:

Timeline: Admin Counter Reset Recovery
T=0s Poll: link_downed = 0 (delta=0, no breach, breached=false)
T=5s Poll: link_downed = 1 (delta=1, BREACH → emit FATAL event, breached=true)
T=10s Poll: link_downed = 1 (delta=0, still breached, no event)
T=15s --- CSP admin resets counters (perfquery -x) ---
T=20s Poll: link_downed = 0 (current < previous → reset detected)
delta = 0 (current value), threshold NOT breached
breached was true → transition to healthy
→ Emit RECOVERY event (IsHealthy=true, IsFatal=false)
→ Set breached=false in persistent state
T=25s Poll: link_downed = 0 (delta=0, not breached, no event)

Without breach tracking: After the admin reset at T=15s, the monitor would see delta=0, emit nothing, and the node would remain stuck in an unhealthy state on the platform indefinitely — even though the admin fixed the issue.

With pod restart between T=15s and T=20s: Without persistent state, the monitor loses all knowledge that link_downed was previously breached. The new pod starts fresh, sees link_downed=0, and never emits a recovery event. The persistent state file ensures the breached=true flag survives pod restarts.

6.5 Boot ID Handling

On host reboot, the node may come back with entirely different hardware (the CSP may have replaced NICs during maintenance). All kernel-maintained sysfs counters reset to zero, port states are re-established from scratch, and the device set may have changed. All persisted state from the previous boot is stale and must be discarded. The monitor must then emit healthy baseline events for all ports and counters to clear any stale FATAL conditions on the platform from the previous boot.

Algorithm:

  1. On startup, read current boot ID from /proc/sys/kernel/random/boot_id
  2. Compare to the boot ID stored in the persistent state file
  3. If boot IDs differ (host rebooted):
    • Clear ALL persisted state: counter snapshots, breach flags, port states, known devices
    • Update the stored boot ID and save the empty state
    • On the first poll cycle after reboot, emit baseline events:
      • State checks: For every port that is currently ACTIVE/LinkUp, emit a healthy event (IsHealthy=true). This clears any stale FATAL port conditions on the platform from the previous boot. Ports that are currently unhealthy (e.g., DOWN, Disabled) emit fatal/non-fatal events as usual — the node may have come back with a hardware issue.
      • Counter checks: Emit a healthy event (IsHealthy=true) for every configured counter. Since counters reset to 0 on reboot and there is no previous value to compute a delta from, all counters are below threshold on the first poll. This clears any stale counter breach conditions on the platform. The first poll also establishes the counter baseline:
        • Delta counters evaluate against the new baseline starting from the second poll (one polling interval later) — any increment above threshold triggers an unhealthy event.
        • Velocity counters wait for their full velocityUnit window (1s / 1m / 1h) to elapse against the new baseline before evaluating; they do not extrapolate from a partial sample.
    • Rationale: the node is effectively a fresh machine after reboot — NICs may have been replaced, firmware updated, cables reseated. The platform must be told that all previously-reported conditions are resolved unless new issues are detected on this boot.
  4. If boot IDs match (pod restart, same host boot):
    • Restore all persisted state (counter snapshots, breach flags, port states, known devices)
    • Resume normal boundary-crossing detection with full context

Consistency with sibling monitors: This boot ID mechanism matches the pattern used by the GPU health monitor (--state-file with boot ID) and the syslog health monitor (state.json with boot_id and journal cursors).
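
For illustration, a minimal sketch of the boot ID comparison; the abbreviated MonitorState and the reconcileBootID name are illustrative, not the monitor's actual API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Abbreviated from Section 6.6; remaining fields elided.
type MonitorState struct {
	BootID string
	// counter snapshots, breach flags, port states, known devices ...
}

func currentBootID() (string, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

// reconcileBootID clears all persisted state when the host has rebooted and
// reports whether healthy baseline events must be emitted on the first poll.
func reconcileBootID(state *MonitorState) (rebootDetected bool, err error) {
	bootID, err := currentBootID()
	if err != nil {
		return false, err
	}
	if state.BootID != bootID {
		// Host rebooted: everything persisted from the previous boot is stale.
		*state = MonitorState{BootID: bootID}
		return true, nil
	}
	// Pod restart on the same host boot: keep the persisted state.
	return false, nil
}

func main() {
	state := &MonitorState{BootID: "boot-id-from-state-file"}
	rebooted, err := reconcileBootID(state)
	if err != nil {
		fmt.Println("cannot read boot id:", err)
		return
	}
	fmt.Println("reboot detected:", rebooted)
}
```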

6.6 Persistent State File

The monitor persists its operational state to a JSON file on a hostPath-backed volume, enabling it to survive pod restarts without losing counter context.

State File Path: /var/run/nic_health_monitor/state.json

Kubernetes Volume Mount:

```yaml
volumes:
  - name: nic-state-vol
    hostPath:
      path: /var/run/nic_health_monitor
      type: DirectoryOrCreate

volumeMounts:
  - name: nic-state-vol
    mountPath: /var/run/nic_health_monitor
```

State File Structure:

```go
// MonitorState is the persistent state written to disk as JSON.
// This single state file is shared by both state checks and counter checks.
type MonitorState struct {
	Version int    `json:"version"`
	BootID  string `json:"boot_id"`

	// Counter detection state
	CounterSnapshots map[string]CounterSnapshot   `json:"counter_snapshots"`
	BreachFlags      map[string]CounterBreachFlag `json:"breach_flags"`

	// State detection state (port state and device presence)
	PortStates   map[string]PortStateSnapshot `json:"port_states"`
	KnownDevices []string                     `json:"known_devices"`
}

// CounterSnapshot stores the value and wall-clock timestamp of a counter
// reading. It plays two distinct roles depending on the configured
// thresholdType:
//   - For "delta" thresholds, the snapshot is updated every poll. Δ is
//     just current_value − snapshot.value over the polling interval.
//   - For "velocity" thresholds, the snapshot is held for the full
//     velocityUnit window (1s / 1m / 1h). Evaluation only happens once
//     elapsed ≥ window, and the rate is computed as delta / elapsed in
//     the configured unit. After evaluation the snapshot is advanced
//     to the current reading so the next window starts fresh. This
//     avoids extrapolating a 1s sample into an hourly rate.
//
// Because the timestamp is persisted, velocity windows survive pod
// restarts: the new pod resumes from the persisted (value, timestamp)
// instead of restarting the window from zero.
type CounterSnapshot struct {
	Value     uint64    `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

// CounterBreachFlag tracks whether a counter has an active threshold breach.
// This is needed because the breach state cannot be derived from the counter
// value alone — it depends on the delta at the time of the original breach,
// not the absolute value. Without this flag, the monitor cannot emit recovery
// events after an admin counter reset.
type CounterBreachFlag struct {
	Breached  bool      `json:"breached"`
	CheckName string    `json:"check_name"`
	IsFatal   bool      `json:"is_fatal"`
	Since     time.Time `json:"since"`
}

// PortStateSnapshot captures the last-known state of a port for the state
// checks. Persisting this enables recovery event emission after pod restart:
// if a port was DOWN (fatal event sent) and an admin fixes the cable while
// the pod is restarting, the new pod can detect the DOWN→ACTIVE transition
// and emit a recovery event (IsHealthy=true). Without this, the platform
// would remain stuck in the FATAL state for that port.
// Also enables device disappearance detection across restarts via KnownDevices.
//
// LinkLayer ("InfiniBand" or "Ethernet") lets each state check filter the
// shared map to its own ports on startup so the IB and Ethernet checks
// don't treat each other's entries as "disappeared" during the seed.
type PortStateSnapshot struct {
	State         string `json:"state"`          // raw sysfs value, e.g., "4: ACTIVE", "1: DOWN"
	PhysicalState string `json:"physical_state"` // raw sysfs value, e.g., "5: LinkUp", "3: Disabled"
	Device        string `json:"device"`         // e.g., "mlx5_0"
	Port          int    `json:"port"`
	LinkLayer     string `json:"link_layer,omitempty"` // "InfiniBand" | "Ethernet"
}
```

Map keys: Counter snapshots and breach flags use the key format <device>:<port>:<counter_name> (e.g., mlx5_0:1:link_downed). Port state snapshots use <device>_<port> (e.g., mlx5_0_1). KnownDevices is a flat list of device names (e.g., ["mlx5_0", "mlx5_1", ...]).

Save triggers: The state file is written after each poll cycle completes (both state and counter checks). Errors during save are logged as warnings but do not halt monitoring.

Load behavior: On startup, the monitor attempts to load the state file. If the file is missing or corrupt, the monitor starts with empty state (equivalent to first boot). A warning is logged.
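
A small sketch of this load/save behavior, assuming the layout above; the write-to-temp-then-rename step is an added assumption for crash safety, not a documented requirement.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

const stateFilePath = "/var/run/nic_health_monitor/state.json"

// Abbreviated from Section 6.6.
type MonitorState struct {
	Version int    `json:"version"`
	BootID  string `json:"boot_id"`
	// remaining fields as in Section 6.6
}

// loadState falls back to an empty state if the file is missing or corrupt.
func loadState() MonitorState {
	var st MonitorState
	raw, err := os.ReadFile(stateFilePath)
	if err == nil {
		err = json.Unmarshal(raw, &st)
	}
	if err != nil {
		log.Printf("warning: starting with empty state: %v", err)
		return MonitorState{Version: 1}
	}
	return st
}

// saveState logs failures as warnings but never halts monitoring.
func saveState(st MonitorState) {
	raw, err := json.Marshal(st)
	if err == nil {
		tmp := stateFilePath + ".tmp"
		if err = os.WriteFile(tmp, raw, 0o644); err == nil {
			err = os.Rename(tmp, stateFilePath)
		}
	}
	if err != nil {
		log.Printf("warning: could not persist state: %v", err)
	}
}

func main() {
	st := loadState()
	saveState(st)
}
```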

6.7 Rationale

  • When a counter resets, the new value represents errors accumulated since the reset
  • This is a conservative approach: we may slightly undercount errors immediately after a reset
  • Alternative (treating reset as zero delta) could miss real errors that occurred during/after reset
  • Admin-initiated resets are a legitimate remediation action — the monitor must recognize them and clear the unhealthy condition by emitting a recovery event
  • Driver reloads are logged separately by the Syslog Health Monitor, providing correlation context
  • Persistent state ensures recovery events survive pod restarts, preventing nodes from being permanently stuck in an unhealthy state
  • Per-counter timestamps enable accurate velocity calculation — the snapshot is held for the configured velocityUnit window (1s / 1m / 1h) and the rate is computed over the real wall-clock elapsed time, not extrapolated from a single poll. This means a 120/hour threshold genuinely observes one hour of data, not 3600 × the per-second rate. Persisting the timestamp lets the window survive pod restarts: the new pod resumes from the persisted snapshot instead of starting a fresh window.

7. Missing Counter Handling

Not all counters are available on all NIC versions or firmware revisions. The monitor must gracefully handle missing counters to ensure portability across different hardware generations (ConnectX-5, ConnectX-6, ConnectX-7, etc.).

7.1 Design Principles

  • Fail-open for missing counters: If a counter file does not exist, skip it silently. Do not emit errors or events (see the sketch after this list).
  • Log at startup only: On monitor initialization, log which counters are available vs. unavailable for debugging purposes. Do not repeatedly log missing counters during polling.
  • Graceful degradation: The monitor should function with whatever subset of counters is available. A node with an older NIC still benefits from the counters that do exist.
  • Configuration flexibility: Allow operators to disable specific counters via configuration if they are known to be unavailable or irrelevant for their environment.
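
A minimal sketch of this fail-open behavior, under the assumption that availability is probed once at startup with a simple stat; probeCounters and counterPath are hypothetical names, not the monitor's actual API.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func counterPath(device string, port int, rel string) string {
	return filepath.Join("/sys/class/infiniband", device, "ports", fmt.Sprint(port), rel)
}

// probeCounters runs once at initialization so availability is logged only once.
func probeCounters(device string, port int, rels []string) map[string]bool {
	available := make(map[string]bool, len(rels))
	for _, rel := range rels {
		_, err := os.Stat(counterPath(device, port, rel))
		available[rel] = err == nil
		if err != nil {
			log.Printf("%s port %d: counter %s not available, skipping", device, port, rel)
		}
	}
	return available
}

func main() {
	rels := []string{"counters/symbol_error", "hw_counters/roce_slow_restart"}
	avail := probeCounters("mlx5_0", 1, rels)
	for _, rel := range rels {
		if !avail[rel] {
			continue // fail-open: no error, no event during polling
		}
		fmt.Println("would poll", rel)
	}
}
```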

7.2 Common Counter Availability by NIC Generation

| Counter | ConnectX-5 | ConnectX-6 | ConnectX-7 |
|---|---|---|---|
| symbol_error | Yes | Yes | Yes |
| link_error_recovery | Yes | Yes | Yes |
| link_downed | Yes | Yes | Yes |
| port_rcv_errors | Yes | Yes | Yes |
| roce_slow_restart | No | Yes | Yes |
| local_ack_timeout_err | Yes | Yes | Yes |

Note: Counter availability may also depend on firmware version and driver configuration. The monitor should always verify counter existence at runtime rather than relying on static assumptions.


8. RDMA vs TCP/IP Counter Domains

Critical Architectural Note: RDMA vs TCP/IP Counter Domains

For RoCE devices, there are TWO separate counter domains:

| Counter Location | Tracks | Example Traffic |
|---|---|---|
| /sys/class/infiniband/<dev>/ports/<port>/counters/ | RDMA traffic only | ib_write_bw, distributed apps |
| /sys/class/net/<iface>/statistics/ | TCP/IP traffic only | ping, ssh, HTTP |

Field-validated observation: Running ping through a RoCE interface does NOT increment InfiniBand counters (port_rcv_data, port_xmit_data stay at 0). The ping goes through the TCP/IP stack and is tracked in Ethernet statistics instead.

Implication for monitoring: To detect RDMA-specific degradation (which affects distributed workloads), you MUST monitor the InfiniBand counters. Ethernet statistics alone will miss RDMA-layer issues like roce_slow_restart errors.

8.1 Counter Domain Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│ RDMA vs TCP/IP COUNTER DOMAINS │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ APPLICATION LAYER │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌─────────────────────────┐ │ ┌─────────────────────────┐ │
│ │ RDMA STACK │ │ │ TCP/IP STACK │ │
│ │ (NCCL, MPI, ib_*) │ │ │ (HTTP, SSH, ping) │ │
│ └───────────┬─────────────┘ │ └───────────┬─────────────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌─────────────────────────┐ │ ┌─────────────────────────┐ │
│ │ InfiniBand Counters │ │ │ Ethernet Statistics │ │
│ │ /sys/class/infiniband/ │ │ │ /sys/class/net/ │ │
│ │ <dev>/ports/<p>/counters│ │ │ <iface>/statistics/ │ │
│ │ │ │ │ │ │
│ │ • symbol_error │ │ │ • rx_bytes │ │
│ │ • port_rcv_errors │ │ │ • tx_bytes │ │
│ │ • roce_slow_restart │ │ │ • rx_errors │ │
│ │ • port_rcv_data │ │ │ • carrier_changes │ │
│ └───────────┬─────────────┘ │ └───────────┬─────────────┘ │
│ │ │ │ │
│ └─────────────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ PHYSICAL NIC HARDWARE │ │
│ │ (ConnectX-6) │ │
│ └─────────────────────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════════════ │
│ │
│ KEY INSIGHT: Monitor InfiniBand counters for RDMA workload health │
│ Ethernet stats won't catch roce_slow_restart! │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

9. Data Structures

9.1 Counter Structures

```go
// CounterSnapshot represents a point-in-time reading of all counters for a port
type CounterSnapshot struct {
	Device    string            `json:"device"`
	Port      int               `json:"port"`
	Timestamp time.Time         `json:"timestamp"`
	Counters  map[string]uint64 `json:"counters"` // counter_name -> value
}

// CounterDelta represents the change between two snapshots
type CounterDelta struct {
	Device      string             `json:"device"`
	Port        int                `json:"port"`
	IntervalSec float64            `json:"interval_sec"`
	Deltas      map[string]uint64  `json:"deltas"` // counter_name -> delta
	Rates       map[string]float64 `json:"rates"`  // counter_name -> rate/sec
}
```
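
As a usage sketch of the structures above (shown as a fragment alongside them), the function below derives a CounterDelta from two consecutive snapshots of the same port; the counter-reset fallback mirrors Section 6 (treat the current value as the delta). The computeDelta name is illustrative.

```go
// computeDelta turns two consecutive per-port snapshots into deltas and rates.
func computeDelta(prev, curr CounterSnapshot) CounterDelta {
	interval := curr.Timestamp.Sub(prev.Timestamp).Seconds()
	d := CounterDelta{
		Device:      curr.Device,
		Port:        curr.Port,
		IntervalSec: interval,
		Deltas:      make(map[string]uint64, len(curr.Counters)),
		Rates:       make(map[string]float64, len(curr.Counters)),
	}
	for name, now := range curr.Counters {
		before, ok := prev.Counters[name]
		delta := now // counter reset or first observation: treat current as delta
		if ok && now >= before {
			delta = now - before // normal monotonic case
		}
		d.Deltas[name] = delta
		if interval > 0 {
			d.Rates[name] = float64(delta) / interval
		}
	}
	return d
}
```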

9.2 Persistent State Structures

The monitor persists operational state to survive pod restarts. See Section 6.6 for the on-disk schema and field-level rationale; the structures defined there (MonitorState, CounterSnapshot, CounterBreachFlag, PortStateSnapshot) are the canonical reference.

Example state file content:

```json
{
  "version": 1,
  "boot_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "counter_snapshots": {
    "mlx5_0:1:link_downed": {"value": 3, "timestamp": "2025-06-15T10:30:00Z"},
    "mlx5_0:1:symbol_error": {"value": 1500000, "timestamp": "2025-06-15T10:30:00Z"}
  },
  "breach_flags": {
    "mlx5_0:1:link_downed": {
      "breached": true,
      "check_name": "InfiniBandStateCheck",
      "is_fatal": true,
      "since": "2025-06-15T10:25:00Z"
    }
  },
  "port_states": {
    "mlx5_0_1": {"state": "1: DOWN", "physical_state": "3: Disabled", "device": "mlx5_0", "port": 1, "link_layer": "InfiniBand"},
    "mlx5_1_1": {"state": "4: ACTIVE", "physical_state": "5: LinkUp", "device": "mlx5_1", "port": 1, "link_layer": "InfiniBand"}
  },
  "known_devices": ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
}
```

9.3 Entity Model

NICs and Ports are modeled as separate entity types to enable precise fault localization:

| Entity Type | Entity Value Format | Example | Use Case |
|---|---|---|---|
| NIC | <device_name> | mlx5_0 | Device-level failures |
| NICPort | <port_number> | 1 | Port-level counter violations |

10. Configuration

The counter monitoring system is fully configurable, allowing operators to:

  • Define which counters to monitor
  • Configure threshold types (delta-based or velocity-based)
  • Set fatal/non-fatal severity levels per counter
  • Override default thresholds for specific environments

10.1 Configuration Schema

```yaml
# NIC Health Monitor - Counter Detection Configuration
counterDetection:
  # Enable/disable counter monitoring
  enabled: true

  # Counter definitions - fully configurable
  # Each counter can specify:
  #   - name: Counter identifier (used in events)
  #   - path: Sysfs path relative to device (supports standard and hw_counters)
  #   - enabled: Enable/disable this counter (default: true)
  #   - isFatal: Whether threshold breach triggers Fatal event (default: false)
  #   - thresholdType: "delta" (absolute change) or "velocity" (rate per time unit)
  #   - threshold: Numeric threshold value
  #   - velocityUnit: For velocity thresholds: "second", "minute", "hour"
  #   - description: Human-readable description for event messages

  counters: [] # See defaults below
```

Polling interval: The polling interval is set globally on the Helm chart (pollingInterval, default 1s) and is not configured per counter. Velocity thresholds are evaluated against a window matching the configured velocityUnit (1s / 1m / 1h), independent of the polling interval — so a fast 1s poll is suitable for every counter type without producing false alerts on long windows.
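
For reference, a plausible Go mapping of this schema; the document does not define the actual configuration types, so the struct names and yaml tags below are assumptions that would suit a standard YAML unmarshaller.

```go
// CounterConfig mirrors one entry of counterDetection.counters (illustrative).
type CounterConfig struct {
	Name          string  `yaml:"name"`
	Path          string  `yaml:"path"`
	Enabled       bool    `yaml:"enabled"`
	IsFatal       bool    `yaml:"isFatal"`
	ThresholdType string  `yaml:"thresholdType"` // "delta" or "velocity"
	Threshold     float64 `yaml:"threshold"`
	VelocityUnit  string  `yaml:"velocityUnit"` // "second", "minute", "hour"
	Description   string  `yaml:"description"`
}

// CounterDetectionConfig mirrors the counterDetection block (illustrative).
type CounterDetectionConfig struct {
	Enabled  bool            `yaml:"enabled"`
	Counters []CounterConfig `yaml:"counters"`
}
```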

10.2 Default Counter Configuration

The following counters are monitored by default. Operators can override any setting or add custom counters.

```yaml
counterDetection:
  enabled: true

  counters:
    #--------------------------------------------------------------------------
    # FATAL COUNTERS (Default: IsFatal=true, RecommendedAction=REPLACE_VM)
    # These counters indicate deterministic workload failure
    #--------------------------------------------------------------------------

    - name: link_downed
      path: counters/link_downed
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment (> 0) is fatal
      description: "Port Training State Machine failed - QP disconnect"

    - name: excessive_buffer_overrun_errors
      path: counters/excessive_buffer_overrun_errors
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "HCA internal buffer overflow - lossless contract violated"

    - name: local_link_integrity_errors
      path: counters/local_link_integrity_errors
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "Physical errors exceed LocalPhyErrors hardware cap"

    - name: rnr_nak_retry_err
      path: hw_counters/rnr_nak_retry_err
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "Receiver Not Ready NAK retry exhausted - connection severed"

    #--------------------------------------------------------------------------
    # NON-FATAL COUNTERS (Default: IsFatal=false, RecommendedAction=NONE)
    # These counters indicate degradation requiring monitoring
    #--------------------------------------------------------------------------

    # Physical Layer (PHY) - two-tier symbol_error monitoring
    - name: symbol_error
      path: counters/symbol_error
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "PHY bit errors before FEC - physical layer degradation"

    - name: symbol_error_fatal
      path: counters/symbol_error
      enabled: true
      isFatal: true
      thresholdType: velocity
      threshold: 120.0
      velocityUnit: hour
      description: "Symbol errors exceed IBTA BER threshold (10E-12) - link outside spec"

    - name: link_error_recovery
      path: counters/link_error_recovery
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 5.0
      velocityUnit: minute
      description: "Link retraining events - micro-flapping"

    # Transport Layer
    - name: port_rcv_errors
      path: counters/port_rcv_errors
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "Malformed packets received"

    - name: out_of_sequence
      path: hw_counters/out_of_sequence
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 100.0
      velocityUnit: second
      description: "Fabric routing issues - out of sequence packets"

    - name: local_ack_timeout_err
      path: hw_counters/local_ack_timeout_err
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 1.0
      velocityUnit: second
      description: "ACK timeout - potential fabric black hole"

    # Congestion Indicators
    - name: port_xmit_discards
      path: counters/port_xmit_discards
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 100.0
      velocityUnit: second
      description: "TX discards due to congestion"

    - name: port_xmit_wait
      path: counters/port_xmit_wait
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10000.0
      velocityUnit: second
      description: "TX wait ticks - congestion backpressure"

    # RoCE-specific
    - name: roce_slow_restart
      path: hw_counters/roce_slow_restart
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "Victim flow oscillation"

    # Interface Level
    - name: carrier_changes
      path: /sys/class/net/{interface}/statistics/carrier_changes
      enabled: true
      isFatal: false
      thresholdType: delta
      threshold: 2 # > 2 changes per interval
      description: "Link instability - carrier state changes"
```

10.3 Custom Counter Example

Operators can add custom counters or override defaults:

```yaml
counterDetection:
  counters:
    # Override: Make symbol_error fatal for strict environments
    - name: symbol_error
      path: counters/symbol_error
      enabled: true
      isFatal: true # Override: make fatal
      thresholdType: velocity
      threshold: 120.0 # IBTA spec: 120/hour
      velocityUnit: hour # Changed from second
      description: "Symbol errors exceed IBTA BER threshold"

    # Custom: Add vendor-specific counter
    - name: custom_vendor_error
      path: hw_counters/vendor_specific_err
      enabled: true
      isFatal: false
      thresholdType: delta
      threshold: 100
      description: "Vendor-specific error counter"

    # Disable: Turn off a default counter
    - name: port_xmit_wait
      enabled: false
```

10.4 Threshold Processing Algorithm

The evaluator uses latching breach semantics: once a counter breaches its threshold, the breach flag stays set until the counter is reset (current < previous) or the host reboots. Polls while a counter is already breached emit nothing; recovery events fire only on counter reset of a previously breached counter.

On startup, before the first poll cycle:

0. Check boot ID (see Section 6.5):
   IF boot ID changed (host rebooted):
     - Clear ALL persisted state (snapshots and breach flags)
     - Set reboot_detected = true
   ELSE (pod restart, same boot):
     - Restore all persisted state
     - Set reboot_detected = false

For each configured counter (key = <device>:<port>:<counter_name>):

1. Read current_value from sysfs and capture wall-clock now.

2. IF reboot_detected AND this is the first poll for the key:
   → Emit HEALTHY baseline event (clears any stale platform condition):
     - IsHealthy = true
     - IsFatal = false
     - RecommendedAction = NONE
     - Message = "Counter {name} healthy after reboot on port {device} port {port}"
     - CheckName: state check name if isFatal, else degradation check name
   → Store snapshot = (current_value, now). Continue to next counter.

3. Reset detection (in this priority order):
   a. IF an in-memory lastPollValue exists for this key AND
      current_value < lastPollValue → reset
   b. ELSE IF a persisted snapshot exists AND
      current_value < snapshot.value → reset
   The lastPollValue path catches mid-window resets in steady state;
   the snapshot.value fallback catches resets that happened during pod
   downtime when lastPollValue is gone.
   IF reset:
     IF previously_breached:
       → Emit RECOVERY event (IsHealthy=true, IsFatal=false,
         RecommendedAction=NONE, CheckName matching the original breach)
       → Clear breach flag
     Update snapshot to (current_value, now). Continue to next counter.

4. IF previously_breached:
   → No event (latched). Skip evaluation entirely.

5. Evaluate based on thresholdType:
   IF thresholdType == "delta":
     delta = current_value − snapshot.value
     breach = (delta > threshold)
     Update snapshot to (current_value, now) every poll.
   IF thresholdType == "velocity":
     elapsed = now − snapshot.timestamp
     window = velocityUnit (1s | 1m | 1h)
     IF elapsed < window:
       → Skip evaluation; leave snapshot untouched. The window is
         not yet full. Continue to next counter.
     delta = current_value − snapshot.value
     rate = delta / elapsed expressed in velocityUnit
     breach = (rate > threshold)
     Update snapshot to (current_value, now). The next window starts
     from this reading.

6. IF breach (and we got here, so we were not previously breached):
   → Emit UNHEALTHY event:
     - IsHealthy = false
     - IsFatal = counter.isFatal
     - RecommendedAction = REPLACE_VM if counter.isFatal, else NONE
     - Message = "{port}: {name} - {description} (value=..., delta=..., rate=...)"
     - CheckName:
       isFatal=true  → state check name (InfiniBandStateCheck / EthernetStateCheck)
       isFatal=false → degradation check name
     - ComponentClass = "NIC"
   → Set breached = true in persistent breach flags.
   IF NOT breach: no event.

7. Record current_value as lastPollValue (in-memory) for the next poll.

After all counters evaluated:

8. Save persistent state to disk only if any snapshot or breach flag
   actually changed (to avoid unnecessary writes every poll).
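
The following Go sketch condenses steps 3-6 for a single counter reading. It is a simplification, not the monitor's actual implementation: reset detection uses only the persisted snapshot (the in-memory lastPollValue fast path is omitted), and persistence and event construction are stubbed. The names CounterConfig, CounterState, and evaluate are illustrative.

```go
// Sketch of the latching delta/velocity evaluation for one counter poll.
package main

import (
	"fmt"
	"time"
)

type CounterConfig struct {
	Name          string
	IsFatal       bool
	ThresholdType string        // "delta" or "velocity"
	Threshold     float64
	VelocityUnit  time.Duration // 1s, 1m, or 1h window
}

type Snapshot struct {
	Value     uint64
	Timestamp time.Time
}

type CounterState struct {
	Snapshot Snapshot
	Breached bool
}

// evaluate applies the latching threshold check for one reading and returns
// a human-readable event message, or "" if no event is emitted.
func evaluate(cfg CounterConfig, st *CounterState, current uint64, now time.Time) string {
	// Reset detection: counter went backwards (admin reset or counter clear).
	if current < st.Snapshot.Value {
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
		if st.Breached {
			st.Breached = false
			return fmt.Sprintf("RECOVERY: %s reset, condition cleared", cfg.Name)
		}
		return ""
	}

	// Latched: already breached, no further events until reset.
	if st.Breached {
		return ""
	}

	delta := float64(current - st.Snapshot.Value)
	breach := false

	switch cfg.ThresholdType {
	case "delta":
		breach = delta > cfg.Threshold
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
	case "velocity":
		elapsed := now.Sub(st.Snapshot.Timestamp)
		if elapsed < cfg.VelocityUnit {
			return "" // window not full yet; keep the snapshot untouched
		}
		rate := delta / (float64(elapsed) / float64(cfg.VelocityUnit))
		breach = rate > cfg.Threshold
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
	}

	if breach {
		st.Breached = true
		return fmt.Sprintf("UNHEALTHY: %s breached (fatal=%v)", cfg.Name, cfg.IsFatal)
	}
	return ""
}

func main() {
	cfg := CounterConfig{Name: "symbol_error", ThresholdType: "velocity",
		Threshold: 10.0, VelocityUnit: time.Second}
	st := &CounterState{Snapshot: Snapshot{Value: 0, Timestamp: time.Now().Add(-2 * time.Second)}}

	// 40 new errors over ~2 seconds → ~20 errors/sec, above the 10/sec threshold.
	fmt.Println(evaluate(cfg, st, 40, time.Now()))
}
```

Leaving the snapshot untouched while a velocity window fills is what keeps the rate deterministic: every rate is computed over at least one full velocityUnit of wall-clock time.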

10.5 Configuration Validation

The monitor validates configuration at startup:

| Validation | Requirement | Action on Failure |
| --- | --- | --- |
| Counter path exists | Path must be readable in sysfs | Log warning, skip counter |
| Threshold is non-negative | threshold >= 0 | Reject configuration |
| velocityUnit valid | Must be second, minute, or hour | Reject configuration |
| thresholdType valid | Must be delta or velocity | Reject configuration |
| Unique counter names | No duplicate name fields | Reject configuration |
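
A minimal sketch of how these checks might be applied at startup. The CounterConfig fields mirror the YAML keys; validate, the sysfs root argument, and the error messages are illustrative assumptions rather than the monitor's actual code.

```go
// Sketch: startup validation of counter configuration entries.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type CounterConfig struct {
	Name          string
	Path          string
	ThresholdType string
	Threshold     float64
	VelocityUnit  string
}

// validate rejects the whole configuration on structural errors, but only
// skips (with a warning) counters whose sysfs path is not readable.
func validate(root string, counters []CounterConfig) ([]CounterConfig, error) {
	seen := map[string]bool{}
	var accepted []CounterConfig
	for _, c := range counters {
		if seen[c.Name] {
			return nil, fmt.Errorf("duplicate counter name %q", c.Name)
		}
		seen[c.Name] = true
		if c.Threshold < 0 {
			return nil, fmt.Errorf("%s: threshold must be >= 0", c.Name)
		}
		if c.ThresholdType != "delta" && c.ThresholdType != "velocity" {
			return nil, fmt.Errorf("%s: thresholdType must be delta or velocity", c.Name)
		}
		if c.ThresholdType == "velocity" &&
			c.VelocityUnit != "second" && c.VelocityUnit != "minute" && c.VelocityUnit != "hour" {
			return nil, fmt.Errorf("%s: velocityUnit must be second, minute, or hour", c.Name)
		}
		if _, err := os.Stat(filepath.Join(root, c.Path)); err != nil {
			fmt.Fprintf(os.Stderr, "warning: skipping %s: %v\n", c.Name, err)
			continue
		}
		accepted = append(accepted, c)
	}
	return accepted, nil
}

func main() {
	counters := []CounterConfig{
		{Name: "symbol_error", Path: "counters/symbol_error",
			ThresholdType: "velocity", Threshold: 10.0, VelocityUnit: "second"},
	}
	ok, err := validate("/sys/class/infiniband/mlx5_0/ports/1", counters)
	fmt.Println(len(ok), err)
}
```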

11. Event Management

11.1 Event Construction

Example Event Fields (Fatal - link_downed):

Note: Fatal counter events use the state check name (InfiniBandStateCheck / EthernetStateCheck) so that all fatal signals for a given NIC type consolidate under a single node condition.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandStateCheck |
| ComponentClass | NIC |
| Message | "Port mlx5_0 port 1: link_downed - Port Training State Machine failed - QP disconnect (value=1, delta=1, rate=0.20/sec)" |
| IsFatal | true |
| IsHealthy | false |
| RecommendedAction | REPLACE_VM |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |

Example Event Fields (Non-Fatal - Degradation):

Note: Non-fatal counter events use the degradation check name (InfiniBandDegradationCheck / EthernetDegradationCheck) to keep degradation signals separate from fatal conditions on the node.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandDegradationCheck |
| ComponentClass | NIC |
| Message | "Port mlx5_0 port 1: symbol_error - PHY bit errors before FEC - physical layer degradation (value=100, delta=100, rate=20.00/sec)" |
| IsFatal | false |
| IsHealthy | false |
| RecommendedAction | NONE |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |

Example Event Fields (Recovery - Counter Reset by Admin):

Note: Recovery events are emitted when a previously breached counter resets (its value drops below the previous reading), typically after an administrator clears the counters. The CheckName matches the original breach event so that the recovery clears the correct condition.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandStateCheck |
| ComponentClass | NIC |
| Message | "Counter link_downed recovered on port mlx5_0 port 1" |
| IsFatal | false |
| IsHealthy | true |
| RecommendedAction | NONE |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |
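
For illustration, the example events above can be represented with a Go struct whose fields mirror the tables. The concrete type in the monitor is presumably a gRPC/protobuf message, so this shape is an assumption.

```go
// Sketch: event shape matching the field tables above, with the fatal
// link_downed example constructed as a literal.
package main

import "fmt"

type Entity struct {
	EntityType  string
	EntityValue string
}

type HealthEvent struct {
	Agent             string
	CheckName         string
	ComponentClass    string
	Message           string
	IsFatal           bool
	IsHealthy         bool
	RecommendedAction string
	EntitiesImpacted  []Entity
}

func main() {
	ev := HealthEvent{
		Agent:             "nic-health-monitor",
		CheckName:         "InfiniBandStateCheck",
		ComponentClass:    "NIC",
		Message:           "Port mlx5_0 port 1: link_downed - Port Training State Machine failed - QP disconnect (value=1, delta=1, rate=0.20/sec)",
		IsFatal:           true,
		IsHealthy:         false,
		RecommendedAction: "REPLACE_VM",
		EntitiesImpacted: []Entity{
			{EntityType: "NIC", EntityValue: "mlx5_0"},
			{EntityType: "NICPort", EntityValue: "1"},
		},
	}
	fmt.Printf("%+v\n", ev)
}
```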

11.2 Event Routing

| IsFatal | IsHealthy | Action | Use Case |
| --- | --- | --- | --- |
| true | false | Immediate gRPC dispatch to Platform Connector | link_downed, buffer overrun |
| false | false | Batched gRPC dispatch (periodic) | Symbol errors, congestion |
| false | true | Immediate gRPC dispatch to Platform Connector | Counter recovery after admin reset |
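
Continuing the HealthEvent sketch above, the routing table reduces to a single switch on IsFatal and IsHealthy. dispatchNow and enqueueBatch are stand-ins for the monitor's immediate and batched gRPC paths to the Platform Connector; the names are assumed.

```go
// Sketch: route an event according to the table above (same package as the
// HealthEvent example; dispatchNow/enqueueBatch are illustrative stubs).
func dispatchNow(ev HealthEvent)  { fmt.Println("immediate dispatch:", ev.CheckName) }
func enqueueBatch(ev HealthEvent) { fmt.Println("queued for batch:", ev.CheckName) }

func route(ev HealthEvent) {
	switch {
	case ev.IsFatal && !ev.IsHealthy:
		dispatchNow(ev) // fatal counter breach: sent immediately
	case !ev.IsFatal && ev.IsHealthy:
		dispatchNow(ev) // recovery after admin reset: sent immediately to clear the condition
	default:
		enqueueBatch(ev) // non-fatal degradation: batched periodic dispatch
	}
}
```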

Appendix A: Quick Reference - Default Counter Thresholds

Note: All counters, thresholds, and severity levels are configurable via the monitor configuration. The values below are the defaults that apply when no custom configuration is provided. See Section 10: Configuration for customization options.

Fatal Counters (Default: IsFatal = true)

| Counter | Path | Default Threshold | Default Action | Configurable |
| --- | --- | --- | --- | --- |
| link_downed | counters/ | Delta > 0 | REPLACE_VM | Yes |
| excessive_buffer_overrun_errors | counters/ | Delta > 0 | REPLACE_VM | Yes |
| local_link_integrity_errors | counters/ | Delta > 0 | REPLACE_VM | Yes |
| rnr_nak_retry_err | hw_counters/ | Delta > 0 | REPLACE_VM | Yes |
| symbol_error_fatal | counters/ | > 120/hour | REPLACE_VM | Yes |

Driver/Firmware Logs

For kernel log pattern details (fatal and non-fatal classifications, regex patterns, and kernel source references), see Syslog Detection & Correlation.

Non-Fatal Counters (Default: IsFatal = false)

| Counter | Path | Default Threshold | Default Action | Configurable |
| --- | --- | --- | --- | --- |
| symbol_error | counters/ | > 10/sec | Monitor | Yes |
| link_error_recovery | counters/ | > 5/min | Monitor | Yes |
| port_rcv_errors | counters/ | > 10/sec | Monitor | Yes |
| port_xmit_discards | counters/ | > 100/sec | Monitor | Yes |
| roce_slow_restart | hw_counters/ | > 10/sec | Monitor | Yes |
| local_ack_timeout_err | hw_counters/ | > 1/sec | Monitor | Yes |
| carrier_changes | interface | > 2/interval | Monitor | Yes |

Note: rnr_nak_retry_err is FATAL by default (see Fatal Counters table above). All counters can have their severity and threshold overridden via configuration.

Design Principle

| Source | IsFatal | Recommended Action | Purpose |
| --- | --- | --- | --- |
| Deterministic Logs | true | REPLACE_VM | Fatal driver/firmware condition |
| Port State Changes (link-state-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Fatal Counters (link-counter-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Diagnostic Logs | false | NONE | Evidence/context for investigation |

Key Insight: Deterministically fatal log events (cmd_exec timeout, etc.) are Fatal (IsFatal=true) with RecommendedAction = REPLACE_VM. Diagnostic logs (insufficient power, high temperature, module absent) are Non-Fatal (IsFatal=false). Port state changes and fatal counter breaches are likewise Fatal (IsFatal=true) with RecommendedAction = REPLACE_VM.


References

PHY & Signal Integrity

  1. PAM4 Error Correction Challenges in 400GbE (EDN)
  2. Determine Which Links Are Experiencing Significant Errors - Sun/Oracle (citing IBTA BER Threshold)

Linux Kernel & Driver

  1. sysfs-class-infiniband (Linux Kernel)

Fabric Diagnostics

  1. ibdiagnet User Manual (NVIDIA)
  2. Black Hole Detection (sFlow)
  3. InfiniBand™ Architecture Specification (IBTA)

Vendor Monitoring Guides

  1. InfiniBand Errors Dashboard - HPE ClusterStor
  2. HPC Clusters Using InfiniBand on IBM Power Systems - IBM Redbooks
  3. NVIDIA UFM InfiniBand Port Counters
  4. NVIDIA DOCA Telemetry Service Guide
  5. NVIDIA UFM Telemetry - InfiniBand Cluster Bring-Up

RDMA Programming References

  1. ibv_modify_qp(3) — Linux Manual Page (rnr_retry, retry_cnt)
  2. NVIDIA RDMA-Aware Programming - Queue Pair Bringup