NIC Health Monitor: Link Counter Detection
Related Documents:
Modern GPU clusters suffer from Grey Failures (subtle degradations) and straggler effects where a single degraded link throttles thousands of GPUs. Simple UP/DOWN polling is insufficient; a deterministic degradation detection system is required that can detect both hard failures and gradual degradation before FEC exhaustion causes catastrophic packet loss.
This document covers the Degradation Monitoring component of the NIC Health Monitor, which detects:
This monitor uses a binary severity model based on workload impact:
Key Design Principle: The only question that matters is “Will the running workload fail because of this?”
Modern interconnects (HDR/NDR InfiniBand, 100/200/400GbE) use PAM4 modulation (Pulse Amplitude Modulation, 4-level) to achieve high bandwidth. This represents a fundamental paradigm shift from previous generations.
Critical: In PAM4 systems, raw bit errors are a physical certainty. A monitor that alerts on “Any Error > 0” would be permanently alarming. The velocity-based approach is the only valid monitoring strategy for 400G+ networks. (Reference: PAM4 Test Challenges)
Degradation flow: Physical impairment (cable/SFP) → Eye diagram closes (DSP struggles) → Symbol errors (PHY layer) → FEC corrections (recoverable) → CRC failures (unrecoverable) → Packet loss (FATAL)
Monitoring opportunity: Detect degradation at the symbol_error stage, before FEC exhaustion causes packet loss.
Because errors are inevitable in PAM4, Forward Error Correction (FEC) is mandatory for 200G/400G/NDR links.
FEC masks physical degradation until the error rate exceeds correction capacity—then packet loss spikes instantly from 0% to ~100% (the “cliff”). The Degradation Monitor tracks Pre-FEC BER via symbol_error velocity, enabling node draining before the cliff is reached.
PAM4 Note (HDR/NDR): On 200G/400G adapters, non-zero raw BER is expected. Use rate-based thresholds (e.g., symbol_error > 10/sec) for degradation detection, not symbol_error > 0.
Unlike general-purpose TCP/IP networks, which are architected to be resilient to packet loss, latency variation, and out-of-order delivery, RDMA fabrics—specifically InfiniBand (IB) and RDMA over Converged Ethernet (RoCE)—are designed under a “lossless” assumption. This architectural premise dictates that once a packet is admitted to the fabric, its delivery is guaranteed by credit-based flow control (in IB) or Priority Flow Control (in RoCE), relieving the transport layer of heavy congestion management overhead.
Key Insight: This reliance on near-perfect transmission introduces a binary fragility to the system. When the physical or link layer violates the lossless assumption, the impact on the application is often not merely performance degradation, but catastrophic failure. For tightly coupled distributed workloads using MPI or NCCL, a failure in a single link deterministically terminates the entire job.
The critical operational requirement is distinguishing between:
The boundary between soft and hard errors is defined by:
The InfiniBand specification defines a compliant link as maintaining a Bit Error Rate (BER) better than 1E-12 (10^-12). This bound provides the basis for threshold calculations:
Monitoring Implication: While a single SymbolError is not fatal, a rate exceeding 120/hour (≈2/minute) is a deterministic predictor of impending link instability. Monitoring systems should treat this as a Fatal condition requiring node replacement.
The following counters represent absolute deterministic failure when they increment:
Note: These four counters represent absolute deterministic failure. Additionally, symbol_error has a fatal threshold at > 120/hour (IBTA BER spec violation) via the symbol_error_fatal config entry. All other counters (symbol_error at > 10/sec, port_rcv_errors, etc.) are non-fatal degradation indicators.
When hardware counters increment, they don’t directly cause application failure—they trigger a reaction in the software stack. Understanding this interaction defines the “Fatal” threshold:
When the NIC sends a packet and the ACK never arrives:
After retry_cnt attempts (default: 7), the NIC tears down the connection and the application receives IBV_WC_RETRY_EXC_ERR.
Implications:
Important: Application-Triggered Timeouts. A rising local_ack_timeout_err counter does NOT necessarily indicate a local NIC fault. If a remote NCCL rank crashes or hangs, the remote NIC stops responding to RDMA requests. The local NIC retries and eventually exhausts retry_cnt, incrementing local_ack_timeout_err on the local side. This means the counter can be triggered by: (1) a fabric black hole (network issue), (2) a remote NIC failure, or (3) a remote application crash or hang, which is not a NIC problem at all. This is why local_ack_timeout_err is classified as Non-Fatal (IsFatal=false): it requires correlation with other signals (port state, remote node health) to determine the root cause.
What This Monitor CAN Detect: The local_ack_timeout_err and req_transport_retries_exceeded (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can’t see the application error.
Diagnostic Commands:
Correlation: Use with ibdiagnet to determine if issue is local (NIC) or remote (Switch/Fabric).
Fabric-wide Diagnostic Command:
The Degradation Monitor follows NVSentinel’s established architectural pattern where:
Local State Persistence: The Degradation Check maintains a persistent state file on the node (hostPath-backed) containing per-counter snapshots (value + timestamp), per-counter breach flags, and the host boot ID. This enables the monitor to: (1) compute accurate deltas and precise velocity rates by holding the persisted snapshot for the configured velocity window and computing the rate over the real elapsed time, so a 120/hour threshold is observed over a one-hour window rather than extrapolated from a single 1s sample; (2) seamlessly resume velocity windows after a pod restart, because the snapshot timestamp survives the restart; (3) emit recovery events (IsHealthy=true) when counters are reset by an administrator, by retaining the breach flag across restarts; and (4) detect host reboots to clear all state and emit healthy baseline events for all ports and counters, since the node may have had NICs replaced during maintenance. This local state is strictly operational; all correlation and pattern detection remains centralized in the Health Events Analyzer.
Analyzer Escalation: The Health Events Analyzer only escalates repeated non-fatal degradation events. The RepeatedNICDegradation rule triggers when 3 non-fatal InfiniBandDegradationCheck or EthernetDegradationCheck events occur on the same NIC + NICPort within 1 hour, and recommends CONTACT_SUPPORT. Fatal counters still emit REPLACE_VM directly from the monitor on first breach.
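The escalation rule above can be pictured as a sliding-window count per NIC and port. The following is a minimal sketch under stated assumptions, not the analyzer's actual implementation: the class name, method name, and in-memory storage are hypothetical, and timestamps are plain seconds. Only the check names, the count of 3, the one-hour window, and the CONTACT_SUPPORT action come from the text.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # "within 1 hour"
REPEAT_THRESHOLD = 3    # "3 non-fatal ... events"
DEGRADATION_CHECKS = {"InfiniBandDegradationCheck", "EthernetDegradationCheck"}


class RepeatedDegradationRule:
    """Hypothetical sliding-window counter keyed by (nic, port)."""

    def __init__(self):
        self._events = defaultdict(deque)

    def observe(self, nic, port, check_name, is_fatal, timestamp):
        """Return "CONTACT_SUPPORT" when the rule fires, else None."""
        # Fatal events are handled directly by the monitor, not by this rule.
        if is_fatal or check_name not in DEGRADATION_CHECKS:
            return None
        window = self._events[(nic, port)]
        window.append(timestamp)
        # Drop events that fell out of the one-hour window.
        while window and timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= REPEAT_THRESHOLD:
            return "CONTACT_SUPPORT"
        return None
```

Keying the window on (nic, port) keeps degradation on one port from counting toward another port on the same adapter.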
This monitor tracks both fatal counters (deterministic workload failure) and non-fatal counters (degradation indicators). The IsFatal field in the HealthEvent distinguishes between them.
Standard counters (/sys/class/infiniband/<dev>/ports/<port>/counters/)

Extended counters (/sys/class/infiniband/<dev>/ports/<port>/hw_counters/): Non-Fatal

All extended counters are non-fatal by default. They indicate congestion, retransmissions, or recoverable transport events. RDMA’s reliable transport handles these automatically; workloads continue with potential performance impact.
Key Non-Fatal Counters (monitor for performance degradation):
Key Insights:
- rnr_nak_retry_err > 0: FATAL. Indicates RNR NAK retry exhausted; the connection has been severed.
- roce_slow_restart > 10/sec: Primary indicator for Grey Failures. Indicates flow oscillation and straggler behavior.
- port_xmit_discards > 100/sec: Flow control breakdown. Network physically unable to handle load.
- symbol_error > 10/sec: Signature of “Dirty Fiber” or microscopic dust on connectors.
- /sys/class/infiniband/<dev>/ports/<port>/counters/ (symbol_error, link_downed, local_link_integrity_errors, etc.)
- /sys/class/infiniband/<dev>/ports/<port>/hw_counters/ (rnr_nak_retry_err, roce_slow_restart, etc.)
- /sys/class/net/<iface>/statistics/ (carrier_changes)

- link_downed is Fatal. In running MPI/NCCL jobs, any increment (Delta > 0) guarantees job crash.
- excessive_buffer_overrun_errors is Fatal. Violates the fundamental “lossless” contract; the packet causing the overrun is dropped immediately.
- rnr_nak_retry_err is Fatal. Indicates Receiver Not Ready NAK retry exhausted; the connection has been severed.
- local_link_integrity_errors is Fatal. This counter is a “meta-threshold”: it only increments when raw physical errors exceed the hardware-defined LocalPhyErrors cap.
- symbol_error uses PAM4 (HDR/NDR) considerations. Zero-tolerance is obsolete for modern links; non-zero raw BER is expected. Monitor velocity for degradation trends.
- hw_counters are Non-Fatal by default: they indicate degradation that should be monitored but doesn’t immediately crash workloads. Exception: rnr_nak_retry_err is fatal.

Configuration Note: Thresholds are configurable. Counter severity, sysfs path, recommended action, and event description are owned by the monitor’s hardcoded counter definitions. See Section 10: Configuration for customization options.
Table 1: Absolute Deterministic Failure Thresholds (Default: Fatal - IsFatal=true)
Breaching these thresholds guarantees application failure or mandatory node exclusion.
Table 2: Predictive Thresholds (Non-Fatal - IsFatal=false)
Breaching these rates indicates degradation requiring monitoring. Workloads continue but performance may be impacted.
Threshold Source Note: These thresholds are derived from a combination of IBTA BER specifications, cloud provider operational heuristics (Azure, AWS), vendor documentation, and field experience. Specific rate values are configurable defaults, not specification mandates. See Section 10: Configuration for customization options.
The following analysis validates the efficacy of the proposed monitoring design based on hardware specifications and empirical reliability studies.
1. Physical Layer (L1) Justifications
- symbol_error > 10/sec: A rate of 10/sec is a robust indicator of physical degradation. In modern PAM4 links, a healthy optical connection operates with a BER better than 1E-12 (roughly one error every few hours). A rate of 10/sec implies the BER has degraded by orders of magnitude (to ~1E-8). This is the classic signature of “Dirty Fiber” or microscopic dust on connectors.
- link_error_recovery > 5/min: Tracks the Port Training State Machine (PTSM). 5 events per minute represents a “Flapping” link. While the link recovers (non-fatal), each retrain causes 50ms to 2s of stall, decimating performance for synchronous GPU workloads.
- carrier_changes > 2/interval: The OS-visible shadow of link recovery. Confirms that physical instability was severe enough to disrupt the driver layer.

2. Data Link Layer (L2) Justifications

- port_rcv_errors > 10/sec: Indicates “Bit Rot”, data corruption surviving the PHY but failing the CRC/FCS check. Triggers “Phantom Congestion” as the network repeatedly retransmits corrupted frames.
- port_xmit_discards > 100/sec: Indicates flow control breakdown. The network is physically unable to handle the load, and backpressure mechanisms (PFC) are failing. Definitive signal of Congestion Collapse.

3. Transport Layer (L4) Justifications

- roce_slow_restart > 10/sec: Primary indicator for Grey Failures. Indicates a flow is timing out and resetting its congestion window repeatedly. This creates stragglers that stall the entire GPU fleet during collective operations (AllReduce).
- local_ack_timeout_err > 1/sec: In a reliable lossless network, ACKs should not be lost. A persistent rate of 1/sec implies a “Fabric Black Hole” (e.g., a specific bad ECMP path).

Note on rnr_nak_retry_err: This counter is FATAL (not a non-fatal threshold). Any increment indicates the Receiver Not Ready NAK retry limit has been exhausted and the connection has been severed. This is a terminal state of error handling.
Final Verdict: These thresholds are calibrated to distinguish between background noise (standard FEC activity) and pathological hardware degradation that threatens AI training efficiency.
For Mellanox devices (IB and RoCE), the monitor reads:
/sys/class/infiniband/<dev>/ports/1/counters/

- link_downed, local_link_integrity_errors, excessive_buffer_overrun_errors
- symbol_error: non-fatal at > 10/sec (degradation warning), fatal at > 120/hour (IBTA BER spec violation)

/sys/class/infiniband/<dev>/ports/1/hw_counters/

- rnr_nak_retry_err

Note: Mellanox throughput counters (port_rcv_data, port_xmit_data) are in 4-byte words. Multiply by 4 to get bytes.

Note: symbol_error has two default config entries: symbol_error (non-fatal, > 10/sec for degradation) and symbol_error_fatal (fatal, > 120/hour per the IBTA specification, 1E-12 BER). Both read from the same sysfs file. On PAM4 links (HDR/NDR), some non-zero symbol errors are expected; tune the fatal threshold if 120/hour is too sensitive for your environment.
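Because both entries read the same sysfs file but use different windows, each has to be evaluated against its own window. The sketch below illustrates that idea only. The dictionary keys and the unit codes ("s", "m", "h") are hypothetical, not the monitor's real configuration schema; the thresholds are the defaults quoted above.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# Two entries backed by the same counter file (field names are illustrative).
SYMBOL_ERROR_ENTRIES = [
    {"name": "symbol_error", "threshold": 10, "unit": "s", "fatal": False},
    {"name": "symbol_error_fatal", "threshold": 120, "unit": "h", "fatal": True},
]


def breached(entry, delta, elapsed_seconds):
    """Return True if `delta` errors over `elapsed_seconds` exceed the entry's rate."""
    window = UNIT_SECONDS[entry["unit"]]
    if elapsed_seconds < window:
        # The window has not elapsed yet; do not extrapolate from a partial sample.
        return False
    rate = delta * window / elapsed_seconds  # errors per configured unit
    return rate > entry["threshold"]
```

With these defaults, 15 errors in one second breaches only the non-fatal entry, while 150 errors spread over a full hour breaches only the fatal one.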
Hardware counters may reset due to driver reloads, device resets, administrator-initiated clears (e.g., perfquery -R, echo 0 > /sys/...), or (rarely) uint64 overflow. The monitor must handle cases where Current < Previous to avoid incorrect delta calculations and must emit recovery events when a counter reset clears a previously breached threshold.
Additionally, if symbol_error had previously triggered a FATAL event (e.g., exceeding 120/hour), and an administrator resets the counters to remediate the issue, the monitor must detect this and emit a recovery event (IsHealthy=true) to clear the unhealthy condition on the platform.
Reset Detection (uses an in-memory lastPollValue per counter plus the persisted snapshot as fallback after pod restart):

- If lastPollValue is recorded for this counter (steady state): reset when current < lastPollValue
- Otherwise (after a pod restart, lastPollValue not yet rebuilt): reset when current < snapshot.value

Sysfs counters are kernel-maintained and monotonic between resets, so current < previous is a definitive reset signal.
Threshold Evaluation and Recovery Steps (latching breach):

1. If a reset is detected and the counter was previously breached: emit a recovery event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE) and clear the breached flag.
2. On any reset, re-baseline the snapshot to (current, now) so the next window starts fresh from the post-reset baseline.
3. Otherwise evaluate the threshold by mode:
   - delta: breach = (current − snapshot.value) > threshold, update snapshot every poll
   - velocity: skip until now − snapshot.timestamp ≥ window, then breach = rate > threshold, update snapshot
4. On a new breach: emit an unhealthy event (IsHealthy=false, IsFatal per counter config) and set breached=true in persistent state.

Latching breach rationale: Once a fatal counter increments (e.g. link_downed=1), the underlying physical event has happened. The fact that no further increments occur in the next poll does not mean the issue is resolved, only that no more events are accumulating right now. Recovery therefore requires explicit remediation (admin counter reset or host reboot), not merely the absence of new errors. This is consistent with the Section 6.4 admin-reset timeline.
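The reset-detection and latching rules above fit in one per-poll function. This is a minimal sketch, assuming a single counter, timestamps in seconds, and string return values in place of real HealthEvents. The names CounterState and evaluate are hypothetical and not taken from the monitor's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterState:
    snapshot_value: int                     # persisted
    snapshot_ts: float                      # persisted
    breached: bool = False                  # persisted
    last_poll_value: Optional[int] = None   # in-memory only; lost on pod restart


def evaluate(state, current, now, threshold, window=None):
    """Return "recovery", "breach", or None for one poll of one counter.

    window=None selects delta mode; otherwise velocity mode over `window` seconds.
    """
    previous = state.last_poll_value
    if previous is None:
        previous = state.snapshot_value     # fallback after a pod restart
    state.last_poll_value = current

    if current < previous:                  # monotonic counter went backwards: reset
        was_breached = state.breached
        state.breached = False
        state.snapshot_value, state.snapshot_ts = current, now
        return "recovery" if was_breached else None

    if window is None:                      # delta mode
        breach = (current - state.snapshot_value) > threshold
    else:                                   # velocity mode
        elapsed = now - state.snapshot_ts
        if elapsed < window:
            return None                     # hold the snapshot until the window elapses
        breach = (current - state.snapshot_value) * window / elapsed > threshold
    state.snapshot_value, state.snapshot_ts = current, now

    if breach and not state.breached:
        state.breached = True               # latch until reset or reboot
        return "breach"
    return None
```

Tracking last_poll_value separately from the snapshot matters in velocity mode: the snapshot is held for the whole window, so a reset that happens mid-window would otherwise go unnoticed until the window closes.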
The following timeline illustrates why persistent breach tracking and recovery events are required:
Without breach tracking: After the admin reset at T=15s, the monitor would see delta=0, emit nothing, and the node would remain stuck in an unhealthy state on the platform indefinitely — even though the admin fixed the issue.
With pod restart between T=15s and T=20s: Without persistent state, the monitor loses all knowledge that link_downed was previously breached. The new pod starts fresh, sees link_downed=0, and never emits a recovery event. The persistent state file ensures the breached=true flag survives pod restarts.
On host reboot, the node may come back with entirely different hardware (the CSP may have replaced NICs during maintenance). All kernel-maintained sysfs counters reset to zero, port states are re-established from scratch, and the device set may have changed. All persisted state from the previous boot is stale and must be discarded. The monitor must then emit healthy baseline events for all ports and counters to clear any stale FATAL conditions on the platform from the previous boot.
Algorithm:
1. Read the current boot ID from /proc/sys/kernel/random/boot_id and compare it with the boot ID in the persisted state.
2. If the boot ID has changed, discard all persisted state from the previous boot.
3. For each port that is ACTIVE/LinkUp, emit a healthy event (IsHealthy=true). This clears any stale FATAL port conditions on the platform from the previous boot. Ports that are currently unhealthy (e.g., DOWN, Disabled) emit fatal/non-fatal events as usual, because the node may have come back with a hardware issue.
4. Emit a healthy event (IsHealthy=true) for every configured counter. Since counters reset to 0 on reboot and there is no previous value to compute a delta from, all counters are below threshold on the first poll. This clears any stale counter breach conditions on the platform. The first poll also establishes the counter baseline: velocity counters wait for a full velocityUnit window (1s / 1m / 1h) to elapse against the new baseline before evaluating; they do not extrapolate from a partial sample.

Consistency with sibling monitors: This boot ID mechanism matches the pattern used by the GPU health monitor (--state-file with boot ID) and the syslog health monitor (state.json with boot_id and journal cursors).
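The boot ID comparison at the start of this algorithm can be sketched as follows. The state layout (a dict with "boot_id", "counters", "breaches", and "ports" keys) and the function names are assumptions for illustration; only the /proc path comes from the text.

```python
from pathlib import Path

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the kernel's boot ID with surrounding whitespace removed."""
    return Path(path).read_text().strip()


def reconcile_boot(state, current_boot_id):
    """Return (state, rebooted).

    If the persisted boot ID differs from the current one, every snapshot and
    breach flag from the previous boot is stale, so a fresh state is returned.
    """
    if state.get("boot_id") == current_boot_id:
        return state, False
    fresh = {"boot_id": current_boot_id, "counters": {}, "breaches": {}, "ports": {}}
    return fresh, True
```

A caller would invoke reconcile_boot(load_result, read_boot_id()) once at startup and, when rebooted is true, emit the healthy baseline events before the first threshold evaluation. In this sketch an empty state (no file yet) is treated the same way as a reboot.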
The monitor persists its operational state to a JSON file on a hostPath-backed volume, enabling it to survive pod restarts without losing counter context.
State File Path: /var/run/nic_health_monitor/state.json
Kubernetes Volume Mount:
State File Structure:
Map keys: Counter snapshots and breach flags use the key format <device>:<port>:<counter_name> (e.g., mlx5_0:1:link_downed). Port state snapshots use <device>_<port> (e.g., mlx5_0_1). KnownDevices is a flat list of device names (e.g., ["mlx5_0", "mlx5_1", ...]).
Save triggers: The state file is written after each poll cycle completes (both state and counter checks). Errors during save are logged as warnings but do not halt monitoring.
Load behavior: On startup, the monitor attempts to load the state file. If the file is missing or corrupt, the monitor starts with empty state (equivalent to first boot). A warning is logged.
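The key formats and the load/save behavior described above can be illustrated with a short sketch. Everything here other than the state file path and the two key formats is an assumption: the function names are hypothetical, and the write-to-temp-then-rename step is one common way to avoid a half-written file, not something the document specifies.

```python
import json
import os

STATE_PATH = "/var/run/nic_health_monitor/state.json"


def counter_key(device, port, counter):
    """Key for counter snapshots and breach flags, e.g. mlx5_0:1:link_downed."""
    return f"{device}:{port}:{counter}"


def port_key(device, port):
    """Key for port state snapshots, e.g. mlx5_0_1."""
    return f"{device}_{port}"


def load_state(path=STATE_PATH):
    """Return the persisted state, or an empty dict if it is missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            state = json.load(f)
    except (OSError, ValueError) as exc:
        print(f"warning: starting with empty state: {exc}")
        return {}
    return state if isinstance(state, dict) else {}


def save_state(state, path=STATE_PATH):
    """Write the state file; log a warning and return False on failure."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
        return True
    except OSError as exc:
        print(f"warning: failed to save state: {exc}")
        return False
```

Returning False instead of raising mirrors the stated behavior that save errors are logged but do not halt monitoring.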
Because each snapshot stores its timestamp, a 120/hour threshold genuinely observes one hour of data, not 3600 × the per-second rate. Persisting the timestamp lets the window survive pod restarts: the new pod resumes from the persisted snapshot instead of starting a fresh window.

Not all counters are available on all NIC versions or firmware revisions. The monitor must gracefully handle missing counters to ensure portability across different hardware generations (ConnectX-5, ConnectX-6, ConnectX-7, etc.).
Note: Counter availability may also depend on firmware version and driver configuration. The monitor should always verify counter existence at runtime rather than relying on static assumptions.
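A runtime existence check can be as simple as treating an unreadable file as "counter not present". The sketch below assumes each counter is a single integer in a sysfs file; the function names are hypothetical.

```python
from pathlib import Path


def read_counter(base_dir, name):
    """Return the counter value as an int, or None if it is absent or unreadable."""
    try:
        return int((Path(base_dir) / name).read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission problem, or non-numeric content.
        return None


def available_counters(base_dir, wanted):
    """Return {name: value} for the subset of `wanted` that exists under base_dir."""
    found = {}
    for name in wanted:
        value = read_counter(base_dir, name)
        if value is not None:
            found[name] = value
    return found
```

For example, available_counters("/sys/class/infiniband/mlx5_0/ports/1/hw_counters", ["rnr_nak_retry_err", "roce_slow_restart"]) would return only the counters that the installed firmware and driver expose.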
Critical Architectural Note: RDMA vs TCP/IP Counter Domains
For RoCE devices, there are TWO separate counter domains:
Field-validated observation: Running ping through a RoCE interface does NOT increment InfiniBand counters (port_rcv_data, port_xmit_data stay at 0). The ping goes through the TCP/IP stack and is tracked in Ethernet statistics instead.

Implication for monitoring: To detect RDMA-specific degradation (which affects distributed workloads), you MUST monitor the InfiniBand counters. Ethernet statistics alone will miss RDMA-layer issues like roce_slow_restart errors.
The monitor persists operational state to survive pod restarts. See Section 6.6 for the on-disk schema and field-level rationale; the structures defined there (MonitorState, CounterSnapshot, CounterBreachFlag, PortStateSnapshot) are the canonical reference.
Example state file content:
NICs and Ports are modeled as separate entity types to enable precise fault localization:
The counter monitoring system is configurable within a hardcoded allowlist of NIC error/degradation counters. Counter paths, default severity, recommended action, and event descriptions are owned by code. Operators can:
Polling interval: The polling interval is set globally on the Helm chart (pollingInterval, default 1s) and is not configured per counter. Velocity thresholds are evaluated against a window matching the configured velocityUnit (1s / 1m / 1h), independent of the polling interval, so a fast 1s poll is suitable for every counter type without producing false alerts on long windows.
The following counters are monitored by default. Operators can override enablement and threshold settings, or add additional counters from the allowed set. The monitor resolves each name to its sysfs path, severity, recommended action, and event description.
The monitor rejects arbitrary counter names at startup. Operators can add entries from the allowlist or override thresholds for existing entries:
Allowed counter selections are hardcoded in pkg/config/config.go. They include:
- counters/, such as symbol_error, link_error_recovery, link_downed, port_rcv_errors, port_xmit_discards, and port_xmit_wait.
- hw_counters/, such as rnr_nak_retry_err, local_ack_timeout_err, req_transport_retries_exceeded, out_of_sequence, and roce_slow_restart.
- statistics/, such as carrier_changes, rx_errors, rx_crc_errors, rx_missed_errors, tx_errors, and tx_carrier_errors.

The evaluator uses latching breach semantics: once a counter breaches its threshold, the breach flag stays set until the counter is reset (current < previous) or the host reboots. Polls while a counter is already breached emit nothing; recovery events fire only on counter reset of a previously breached counter.
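Rejecting names outside the allowlist at startup can be sketched as a set difference. The set below contains only the example names listed above and is not the full allowlist from pkg/config/config.go; the function name is hypothetical.

```python
# Illustrative subset of the allowlist; the authoritative list lives in code.
ALLOWED_COUNTERS = {
    "symbol_error", "link_error_recovery", "link_downed",
    "port_rcv_errors", "port_xmit_discards", "port_xmit_wait",
    "rnr_nak_retry_err", "local_ack_timeout_err",
    "req_transport_retries_exceeded", "out_of_sequence", "roce_slow_restart",
    "carrier_changes", "rx_errors", "rx_crc_errors",
    "rx_missed_errors", "tx_errors", "tx_carrier_errors",
}


def validate_counter_names(configured):
    """Raise ValueError naming every configured counter outside the allowlist."""
    unknown = sorted(set(configured) - ALLOWED_COUNTERS)
    if unknown:
        raise ValueError(f"unsupported counter(s): {', '.join(unknown)}")
```

Failing at startup, rather than silently skipping unknown names, surfaces typos in operator configuration before the monitor begins polling.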
The monitor validates configuration at startup:
Example Event Fields (Fatal - link_downed):
Note: Fatal counter events use the state check name (InfiniBandStateCheck/EthernetStateCheck) so that all fatal signals for a given NIC type consolidate under a single node condition.
Example Event Fields (Non-Fatal - Degradation):
Note: Non-fatal counter events use the degradation check name (InfiniBandDegradationCheck/EthernetDegradationCheck) to keep degradation signals separate from fatal conditions on the node.
Example Event Fields (Recovery - Counter Reset by Admin):
Note: Recovery events are emitted when a previously breached counter is reset, typically after an administrator clears the counters. The CheckName matches the original breach event to ensure the recovery clears the correct condition.
Note: Supported counter names and thresholds are configurable via the monitor configuration. Severity, sysfs path, recommended action, and event description come from the monitor’s hardcoded counter definitions. See Section 10: Configuration for customization options.
For kernel log pattern details (fatal and non-fatal classifications, regex patterns, and kernel source references), see Syslog Detection & Correlation.
Note: rnr_nak_retry_err is FATAL by default (see Fatal Counters table above). Counter thresholds can be overridden via configuration; severity changes should use platform connector / analyzer overrides.
Key Insight: Deterministically fatal events in logs (cmd_exec timeout, etc.) are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Diagnostic logs (insufficient power, High Temperature, module absent) are Non-Fatal (IsFatal=false). State and fatal counter conditions are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Repeated non-fatal counter degradation is an analyzer-generated CONTACT_SUPPORT escalation, not automatic remediation.