NIC Health Monitor: Link Counter Detection
Table of Contents
- Overview
- Theoretical Foundation
- Architecture
- Complete Counter Specification
- Counter Reading and Parsing
- Counter Reset Handling and Persistent State
- Missing Counter Handling
- RDMA vs TCP/IP Counter Domains
- Data Structures
- Configuration
- Event Management
Related Documents:
- Link State Detection - UP/DOWN state monitoring
- Syslog Detection & Correlation - Kernel log monitoring and repeat failure detection
1. Overview
1.1 Problem Statement
Modern GPU clusters suffer from Grey Failures (subtle degradations) and straggler effects where a single degraded link throttles thousands of GPUs. Simple UP/DOWN polling is insufficient; a deterministic degradation detection system is required that can detect both hard failures and gradual degradation before FEC exhaustion causes catastrophic packet loss.
1.2 Scope of Link Counter Detection
This document covers the Degradation Monitoring component of the NIC Health Monitor, which detects:
- Fatal counter violations - Counters that guarantee workload failure when incremented
- Rate-based degradation - Error rates exceeding thresholds that predict impending failure
- Pre-failure prediction - Detecting BER climbing before FEC exhaustion
1.3 Binary Severity Model
This monitor uses a binary severity model based on workload impact:
Key Design Principle: The only question that matters is “Will the running workload fail because of this?”
1.4 Counter Detection Overview Diagram
2. Theoretical Foundation
2.1 The Physics of High-Speed Signaling Degradation
Modern interconnects (HDR/NDR InfiniBand, 100/200/400GbE) use PAM4 modulation (Pulse Amplitude Modulation, 4-level) to achieve high bandwidth. This represents a fundamental paradigm shift from previous generations.
2.1.1 PAM4 vs NRZ: Why Velocity Monitoring is Required
Critical: In PAM4 systems, raw bit errors are a physical certainty. A monitor that alerts on “Any Error > 0” would be permanently alarming. The velocity-based approach is the only valid monitoring strategy for 400G+ networks. (Reference: PAM4 Test Challenges)
2.2 Signal Degradation Progression
Degradation flow: Physical impairment (cable/SFP) → Eye diagram closes (DSP struggles) → Symbol errors (PHY layer) → FEC corrections (recoverable) → CRC failures (unrecoverable) → Packet loss (FATAL)
Monitoring opportunity: Detect degradation at the symbol_error stage, before FEC exhaustion causes packet loss.
2.3 Bit Error Rate (BER), FEC, and the “Cliff Effect”
Because errors are inevitable in PAM4, Forward Error Correction (FEC) is mandatory for 200G/400G/NDR links.
2.3.1 The FEC “Cliff Effect”
FEC masks physical degradation until the error rate exceeds correction capacity—then packet loss spikes instantly from 0% to ~100% (the “cliff”). The Degradation Monitor tracks Pre-FEC BER via symbol_error velocity, enabling node draining before the cliff is reached.
PAM4 Note (HDR/NDR): On 200G/400G adapters, non-zero raw BER is expected. Use rate-based thresholds (e.g., symbol_error > 10/sec) for degradation detection, not symbol_error > 0.
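To make the distinction concrete, here is a minimal sketch in Go of a rate-based check that tolerates a non-zero error floor; breachedRate is an illustrative name, not the monitor's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// breachedRate reports whether the per-second error rate over the elapsed
// window exceeds the configured threshold. A non-zero error count alone is
// never enough to alarm on a PAM4 link.
func breachedRate(prev, curr uint64, prevTime, now time.Time, perSecThreshold float64) bool {
	elapsed := now.Sub(prevTime).Seconds()
	if elapsed <= 0 || curr < prev { // counter reset or clock skew: skip this window
		return false
	}
	return float64(curr-prev)/elapsed > perSecThreshold
}

func main() {
	start := time.Now().Add(-10 * time.Second)
	// 150 new symbol errors over 10s => 15/sec, above the 10/sec degradation default.
	fmt.Println(breachedRate(1000, 1150, start, time.Now(), 10)) // true
}
```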
2.4 The Lossless Assumption and Deterministic Failure Horizons
Unlike general-purpose TCP/IP networks, which are architected to be resilient to packet loss, latency variation, and out-of-order delivery, RDMA fabrics—specifically InfiniBand (IB) and RDMA over Converged Ethernet (RoCE)—are designed under a “lossless” assumption. This architectural premise dictates that once a packet is admitted to the fabric, its delivery is guaranteed by credit-based flow control (in IB) or Priority Flow Control (in RoCE), relieving the transport layer of heavy congestion management overhead.
Key Insight: This reliance on near-perfect transmission introduces a binary fragility to the system. When the physical or link layer violates the lossless assumption, the impact on the application is often not merely performance degradation, but catastrophic failure. For tightly coupled distributed workloads using MPI or NCCL, a failure in a single link deterministically terminates the entire job.
2.4.1 Soft vs Hard Errors: The Determinism Boundary
The critical operational requirement is distinguishing between soft errors (recoverable, performance-impacting) and hard errors (workload-terminating).
The boundary between soft and hard errors is defined by:
- Counter thresholds that indicate recovery mechanism exhaustion
- Rate of change that exceeds retry bandwidth
- Specific counter types that indicate fundamental violation of the lossless contract
2.4.2 The 1E-12 BER Threshold
The InfiniBand specification defines a compliant link as maintaining a Bit Error Rate (BER) better than 1E-12. This physical constant provides the basis for threshold calculations:
- At a BER of 1E-12, a link running at high speed (e.g., HDR 200Gb/s) experiences a predictable number of errors per unit time
- IBTA-compliant threshold: Maximum allowable symbol error rate is 120 errors per hour (IBTA Specification / Oracle Documentation)
- Below this rate, FEC algorithms can typically correct errors without retransmission
- Above this rate, the “effective” error rate (post-FEC) rises, leading to packet corruption and Link Level Retransmission (LLR) or transport layer retries
Monitoring Implication: While a single SymbolError is not fatal, a rate exceeding 120/hour (≈2/minute) is a deterministic predictor of impending link instability. Monitoring systems should treat this as a Fatal condition requiring node replacement.
2.4.3 Deterministic Failure Mechanisms
The following counters represent absolute deterministic failure when they increment:
Note: These four counters represent absolute deterministic failure. Additionally, symbol_error has a fatal threshold at > 120/hour (IBTA BER spec violation) via the symbol_error_fatal config entry. All other counters (symbol_error at > 10/sec, port_rcv_errors, etc.) are non-fatal degradation indicators.
2.5 The Transport Layer Retry Window
When hardware counters increment, they don’t directly cause application failure—they trigger a reaction in the software stack. Understanding this interaction defines the “Fatal” threshold:
2.6 Transport Retry Count Exceeded (Error 12)
When the NIC sends a packet and the ACK never arrives:
After retry_cnt attempts (default: 7), the NIC tears down the connection and the application receives IBV_WC_RETRY_EXC_ERR.
Implications:
- Confirms Logical Link is broken even if physical link is UP
- Often indicates “Silent Drop” or Black Hole in the fabric
- Local symptom of a remote problem
Important: Application-Triggered Timeouts. A rising local_ack_timeout_err counter does NOT necessarily indicate a local NIC fault. If a remote NCCL rank crashes or hangs, the remote NIC stops responding to RDMA requests. The local NIC retries and eventually exhausts retry_cnt, incrementing local_ack_timeout_err on the local side. This means the counter can be triggered by: (1) a fabric black hole (network issue), (2) remote NIC failure, or (3) remote application crash/hang — which is not a NIC problem at all. This is why local_ack_timeout_err is classified as Non-Fatal (IsFatal=false) — it requires correlation with other signals (port state, remote node health) to determine the root cause.
What This Monitor CAN Detect: The local_ack_timeout_err and req_transport_retries_exceeded (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can’t see the application error.
Diagnostic Commands:
Correlation: Use with ibdiagnet to determine if issue is local (NIC) or remote (Switch/Fabric).
Fabric-wide Diagnostic Command:
3. Architecture
3.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern
The Degradation Monitor follows NVSentinel’s established architectural pattern where:
- Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
- Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
- MongoDB serves as the source of truth for event history and correlation queries
3.2 Component Responsibilities
Local State Persistence: The Degradation Check maintains a persistent state file on the node (hostPath-backed) containing per-counter snapshots (value + timestamp), per-counter breach flags, and the host boot ID. This enables the monitor to (1) compute accurate deltas and precise velocity rates by holding the persisted snapshot for the configured velocity window and computing the rate over the real elapsed time — so a 120/hour threshold is observed over a one-hour window rather than extrapolated from a single 1s sample; (2) seamlessly resume velocity windows after a pod restart, because the snapshot timestamp survives the restart; (3) emit recovery events (IsHealthy=true) when counters are reset by an administrator, by retaining the breach flag across restarts; and (4) detect host reboots, clearing all state and emitting healthy baseline events for all ports and counters, since the node may have had NICs replaced during maintenance. This local state is strictly operational — all correlation and pattern detection remains centralized in the Health Events Analyzer.
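The following Go sketch shows how the persisted snapshot supports this windowed rate computation. The CounterSnapshot fields match the description above, but the file layout, key names, and helper names are assumptions:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// CounterSnapshot is the persisted (value, timestamp) pair held for the
// duration of the velocity window.
type CounterSnapshot struct {
	Value     uint64    `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

type persistedState struct {
	Counters map[string]CounterSnapshot `json:"counters"`
}

// rateOverWindow reports the per-hour rate only once the full window has
// elapsed since the snapshot, so a 120/hour threshold really observes an hour
// of data, resuming across pod restarts via the persisted timestamp.
func rateOverWindow(snap CounterSnapshot, current uint64, now time.Time, window time.Duration) (perHour float64, windowComplete bool) {
	elapsed := now.Sub(snap.Timestamp)
	if elapsed < window || current < snap.Value {
		return 0, false // window incomplete (or counter reset): keep holding the snapshot
	}
	return float64(current-snap.Value) / elapsed.Hours(), true
}

func main() {
	var state persistedState
	if data, err := os.ReadFile("/var/run/nic_health_monitor/state.json"); err == nil {
		_ = json.Unmarshal(data, &state)
	}
	snap, ok := state.Counters["mlx5_0:1:symbol_error"]
	if !ok {
		return // first poll for this counter: initialize the snapshot instead
	}
	if perHour, done := rateOverWindow(snap, 130, time.Now(), time.Hour); done && perHour > 120 {
		// fatal 120/hour threshold breached: emit the fatal event here
	}
}
```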
3.3 Degradation Check Data Flow (1s polling interval)
4. Complete Counter Specification
4.1 Complete Counter Set (“Golden Counters” + Extended)
This monitor tracks both fatal counters (deterministic workload failure) and non-fatal counters (degradation indicators). The IsFatal field in the HealthEvent distinguishes between them.
4.1.1 Standard Counters (/sys/class/infiniband/<dev>/ports/<port>/counters/)
4.1.2 Extended Counters (/sys/class/infiniband/<dev>/ports/<port>/hw_counters/) — Non-Fatal
All extended counters are non-fatal by default. They indicate congestion, retransmissions, or recoverable transport events. RDMA’s reliable transport handles these automatically; workloads continue with potential performance impact.
Key Non-Fatal Counters (monitor for performance degradation):
Key Insights:
- rnr_nak_retry_err > 0: FATAL - Indicates RNR NAK retry exhausted; the connection has been severed.
- roce_slow_restart > 10/sec: Primary indicator for Grey Failures. Indicates flow oscillation and straggler behavior.
- port_xmit_discards > 100/sec: Flow control breakdown. Network physically unable to handle load.
- symbol_error > 10/sec: Signature of “Dirty Fiber” or microscopic dust on connectors.
4.2 Counter Locations
- Standard IB counters: /sys/class/infiniband/<dev>/ports/<port>/counters/ (symbol_error, link_downed, local_link_integrity_errors, etc.)
- Extended counters (Mellanox): /sys/class/infiniband/<dev>/ports/<port>/hw_counters/ (rnr_nak_retry_err, roce_slow_restart, etc.)
- Ethernet stats (RoCE): /sys/class/net/<iface>/statistics/ (carrier_changes)
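A minimal Go sketch of reading one standard counter from the locations above; error handling is simplified and readCounter is an illustrative helper, not the monitor's actual API:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter returns the current value of a sysfs counter file such as
// /sys/class/infiniband/mlx5_0/ports/1/counters/symbol_error.
func readCounter(dev string, port int, name string) (uint64, error) {
	p := filepath.Join("/sys/class/infiniband", dev, "ports", strconv.Itoa(port), "counters", name)
	raw, err := os.ReadFile(p)
	if err != nil {
		return 0, err // missing file: caller skips this counter (see Section 7)
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	if v, err := readCounter("mlx5_0", 1, "symbol_error"); err == nil {
		fmt.Println("symbol_error =", v)
	}
}
```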
4.3 Diagnostic Commands
4.4 Key Design Decisions
- link_downed is Fatal. In running MPI/NCCL jobs, any increment (Delta > 0) guarantees job crash.
- excessive_buffer_overrun_errors is Fatal. Violates the fundamental “lossless” contract; the packet causing the overrun is dropped immediately.
- rnr_nak_retry_err is Fatal. Indicates Receiver Not Ready NAK retry exhausted; the connection has been severed.
- local_link_integrity_errors is Fatal. This counter is a “meta-threshold”—it only increments when raw physical errors exceed the hardware-defined LocalPhyErrors cap.
- symbol_error uses PAM4 (HDR/NDR) considerations. Zero-tolerance is obsolete for modern links; non-zero raw BER is expected. Monitor velocity for degradation trends.
- Most hw_counters are Non-Fatal by default—they indicate degradation that should be monitored but doesn’t immediately crash workloads. Exception: rnr_nak_retry_err is fatal.
4.5 Consolidated Deterministic Failure Thresholds (Defaults)
Configuration Note: All thresholds and severity levels are configurable. The values below are defaults based on industry specifications and vendor recommendations. See Section 10: Configuration for customization options.
Table 1: Absolute Deterministic Failure Thresholds (Default: Fatal - IsFatal=true)
Breaching these thresholds guarantees application failure or mandatory node exclusion.
Table 2: Predictive Thresholds (Non-Fatal - IsFatal=false)
Breaching these rates indicates degradation requiring monitoring. Workloads continue but performance may be impacted.
Threshold Source Note: These thresholds are derived from a combination of IBTA BER specifications, cloud provider operational heuristics (Azure, AWS), vendor documentation, and field experience. Specific rate values are configurable defaults, not specification mandates. See Section 10: Configuration for customization options.
4.6 Technical Justification for Non-Fatal Thresholds
The following analysis validates the efficacy of the proposed monitoring design based on hardware specifications and empirical reliability studies.
1. Physical Layer (L1) Justifications
- Symbol Error (symbol_error > 10/sec): A rate of 10/sec is a robust indicator of physical degradation. In modern PAM4 links, a healthy optical connection operates with a BER better than 1E-12 (roughly one error every few hours). A rate of 10/sec implies the BER has degraded by orders of magnitude (to ~1E-8). This is the classic signature of “Dirty Fiber” or microscopic dust on connectors.
- Link Error Recovery (link_error_recovery > 5/min): Tracks the Port Training State Machine (PTSM). 5 events per minute represents a “Flapping” link. While the link recovers (non-fatal), each retrain causes a 50ms to 2s stall, decimating performance for synchronous GPU workloads.
- Carrier Changes (carrier_changes > 2/interval): The OS-visible shadow of link recovery. Confirms that physical instability was severe enough to disrupt the driver layer.
2. Data Link Layer (L2) Justifications
- Port Receive Errors (port_rcv_errors > 10/sec): Indicates “Bit Rot”—data corruption surviving the PHY but failing the CRC/FCS check. Triggers “Phantom Congestion” as the network repeatedly retransmits corrupted frames.
- Port Transmit Discards (port_xmit_discards > 100/sec): Indicates flow control breakdown. The network is physically unable to handle the load, and backpressure mechanisms (PFC) are failing. A definitive signal of Congestion Collapse.
3. Transport Layer (L4) Justifications
- RoCE Slow Restart (roce_slow_restart > 10/sec): Primary indicator for Grey Failures. Indicates a flow is timing out and resetting its congestion window repeatedly. This creates stragglers that stall the entire GPU fleet during collective operations (AllReduce).
- Local ACK Timeout (local_ack_timeout_err > 1/sec): In a reliable lossless network, ACKs should not be lost. A persistent rate of 1/sec implies a “Fabric Black Hole” (e.g., a specific bad ECMP path).
Note on rnr_nak_retry_err: This counter is FATAL (not a non-fatal threshold). Any increment indicates the Receiver Not Ready NAK retry limit has been exhausted and the connection has been severed. This is a terminal state of error handling.
Final Verdict: These thresholds are calibrated to distinguish between background noise (standard FEC activity) and pathological hardware degradation that threatens AI training efficiency.
5. Counter Reading and Parsing
5.1 Mellanox Counter Reading
For Mellanox devices (IB and RoCE), the monitor reads:
- Standard Counters: /sys/class/infiniband/<dev>/ports/1/counters/
  - Fatal counters: link_downed, local_link_integrity_errors, excessive_buffer_overrun_errors
  - Two-tier counter: symbol_error — non-fatal at > 10/sec (degradation warning), fatal at > 120/hour (IBTA BER spec violation)
- Extended Counters: /sys/class/infiniband/<dev>/ports/1/hw_counters/
  - Fatal counter: rnr_nak_retry_err
  - Non-fatal counters for degradation monitoring
Note: Mellanox throughput counters (port_rcv_data, port_xmit_data) are in 4-byte words. Multiply by 4 to get bytes.
5.2 Mellanox Fatal Counter Paths
Note: symbol_error has two default config entries: symbol_error (non-fatal, > 10/sec for degradation) and symbol_error_fatal (fatal, > 120/hour per the IBTA specification, 1E-12 BER). Both read from the same sysfs file. On PAM4 links (HDR/NDR), some non-zero symbol errors are expected; tune the fatal threshold if 120/hour is too sensitive for your environment.
6. Counter Reset Handling
Hardware counters may reset due to driver reloads, device resets, administrator-initiated clears (e.g., perfquery -x, echo 0 > /sys/...), or (rarely) uint64 overflow. The monitor must handle cases where Current < Previous to avoid incorrect delta calculations and must emit recovery events when a counter reset clears a previously breached threshold.
6.1 The Problem
Additionally, if symbol_error had previously triggered a FATAL event (e.g., exceeding 120/hour), and an administrator resets the counters to remediate the issue, the monitor must detect this and emit a recovery event (IsHealthy=true) to clear the unhealthy condition on the platform.
6.2 Counter Reset Causes
6.3 Counter Reset Handling Algorithm
Reset Detection (uses an in-memory lastPollValue per counter, plus the persisted snapshot as a fallback after pod restart):
- If lastPollValue is recorded for this counter (steady state):
  - Reset detected when current < lastPollValue
- Else, if a persisted snapshot exists (first poll after a pod restart, lastPollValue not yet rebuilt):
  - Reset detected when current < snapshot.value
- Else (truly the first poll for this counter):
  - No reset to detect; just initialize the snapshot
Sysfs counters are kernel-maintained and monotonic between resets, so current < previous is a definitive reset signal.
Threshold Evaluation and Recovery Steps (latching breach):
1. If a reset is detected:
   - If the counter was previously breached → emit a recovery event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE) and clear the breached flag
   - If it was not previously breached → no event
   - Update the snapshot to (current, now) so the next window starts fresh from the post-reset baseline
2. Else, if the counter is already breached (latched):
   - No event — breach stays latched until counter reset or boot ID change
3. Else (no reset, not currently breached) — evaluate the threshold:
   - For delta: breach = (current − snapshot.value) > threshold; update the snapshot every poll
   - For velocity: skip until now − snapshot.timestamp ≥ window, then breach = rate > threshold; update the snapshot
4. If a breach is detected in step 3:
   - Emit an unhealthy event (IsHealthy=false, IsFatal per counter config) and set breached=true in persistent state
5. If no breach in step 3:
   - No event — still healthy
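The following condensed Go sketch ties the reset-detection and latching-breach rules above together. Type and function names (counterState, evaluateDelta) are illustrative, not the monitor's actual API, and only delta counters are shown for brevity:

```go
package main

import "time"

type snapshot struct {
	value     uint64
	timestamp time.Time
}

type counterState struct {
	lastPoll *uint64   // in-memory value from the previous poll; nil right after a pod restart
	snap     *snapshot // persisted snapshot; nil only on the counter's first poll ever
	breached bool      // latched until counter reset or boot ID change
}

type event int

const (
	eventNone event = iota
	eventUnhealthy
	eventRecovery
)

// evaluateDelta applies reset detection, latching, and delta-threshold
// evaluation for a single counter sample.
func evaluateDelta(s *counterState, current uint64, now time.Time, threshold uint64) event {
	// Step 1: reset detection, in-memory value first, persisted snapshot as fallback.
	reset := false
	switch {
	case s.lastPoll != nil:
		reset = current < *s.lastPoll
	case s.snap != nil:
		reset = current < s.snap.value
	}
	s.lastPoll = &current

	if reset {
		s.snap = &snapshot{current, now} // next window starts from the post-reset baseline
		if s.breached {
			s.breached = false
			return eventRecovery // admin cleared the counters: clear the condition
		}
		return eventNone
	}
	if s.breached {
		return eventNone // step 2: latched, no event until reset or reboot
	}
	if s.snap == nil {
		s.snap = &snapshot{current, now} // truly first poll: just initialize
		return eventNone
	}
	// Step 3 (delta variant): compare the increment since the last snapshot.
	delta := current - s.snap.value
	s.snap = &snapshot{current, now}
	if delta > threshold {
		s.breached = true
		return eventUnhealthy // step 4
	}
	return eventNone // step 5
}

func main() { _ = evaluateDelta }
```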
Latching breach rationale: Once a fatal counter increments (e.g. link_downed=1), the underlying physical event has happened. The fact that no further increments occur in the next poll does not mean the issue is resolved — only that no more events are accumulating right now. Recovery therefore requires explicit remediation (admin counter reset or host reboot), not merely the absence of new errors. This is consistent with the Section 6.4 admin-reset timeline.
6.4 Admin Counter Reset: Recovery Event Scenario
The following timeline illustrates why persistent breach tracking and recovery events are required:
Without breach tracking: After the admin reset at T=15s, the monitor would see delta=0, emit nothing, and the node would remain stuck in an unhealthy state on the platform indefinitely — even though the admin fixed the issue.
With pod restart between T=15s and T=20s: Without persistent state, the monitor loses all knowledge that link_downed was previously breached. The new pod starts fresh, sees link_downed=0, and never emits a recovery event. The persistent state file ensures the breached=true flag survives pod restarts.
6.5 Boot ID Handling
On host reboot, the node may come back with entirely different hardware (the CSP may have replaced NICs during maintenance). All kernel-maintained sysfs counters reset to zero, port states are re-established from scratch, and the device set may have changed. All persisted state from the previous boot is stale and must be discarded. The monitor must then emit healthy baseline events for all ports and counters to clear any stale FATAL conditions on the platform from the previous boot.
Algorithm:
- On startup, read the current boot ID from /proc/sys/kernel/random/boot_id
- Compare it to the boot ID stored in the persistent state file
- If the boot IDs differ (host rebooted):
  - Clear ALL persisted state: counter snapshots, breach flags, port states, known devices
  - Update the stored boot ID and save the empty state
  - On the first poll cycle after reboot, emit baseline events:
    - State checks: For every port that is currently ACTIVE/LinkUp, emit a healthy event (IsHealthy=true). This clears any stale FATAL port conditions on the platform from the previous boot. Ports that are currently unhealthy (e.g., DOWN, Disabled) emit fatal/non-fatal events as usual — the node may have come back with a hardware issue.
    - Counter checks: Emit a healthy event (IsHealthy=true) for every configured counter. Since counters reset to 0 on reboot and there is no previous value to compute a delta from, all counters are below threshold on the first poll. This clears any stale counter breach conditions on the platform. The first poll also establishes the counter baseline:
      - Delta counters evaluate against the new baseline starting from the second poll (one polling interval later) — any increment above threshold triggers an unhealthy event.
      - Velocity counters wait for their full velocityUnit window (1s / 1m / 1h) to elapse against the new baseline before evaluating; they do not extrapolate from a partial sample.
  - Rationale: the node is effectively a fresh machine after reboot — NICs may have been replaced, firmware updated, cables reseated. The platform must be told that all previously reported conditions are resolved unless new issues are detected on this boot.
- If the boot IDs match (pod restart, same host boot):
  - Restore all persisted state (counter snapshots, breach flags, port states, known devices)
  - Resume normal boundary-crossing detection with full context
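A minimal Go sketch of this startup reconciliation, assuming the MonitorState schema sketched in Section 6.6 (reduced here to the boot ID field):

```go
package main

import (
	"os"
	"strings"
)

// MonitorState is reduced here to the boot ID; the full schema is sketched in
// Section 6.6.
type MonitorState struct {
	BootID string `json:"boot_id"`
}

func currentBootID() (string, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

// reconcileBootID wipes persisted state after a host reboot and reports
// whether healthy baseline events must be emitted on the first poll cycle.
func reconcileBootID(state *MonitorState) (emitBaseline bool, err error) {
	bootID, err := currentBootID()
	if err != nil {
		return false, err
	}
	if state.BootID != bootID {
		*state = MonitorState{BootID: bootID} // host rebooted: discard all prior-boot state
		return true, nil
	}
	return false, nil // same boot: pod restart, resume with full context
}

func main() {
	state := &MonitorState{}
	if emitBaseline, err := reconcileBootID(state); err == nil && emitBaseline {
		// emit IsHealthy=true baseline events for all ports and counters here
	}
}
```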
Consistency with sibling monitors: This boot ID mechanism matches the pattern used by the GPU health monitor (--state-file with boot ID) and the syslog health monitor (state.json with boot_id and journal cursors).
6.6 Persistent State File
The monitor persists its operational state to a JSON file on a hostPath-backed volume, enabling it to survive pod restarts without losing counter context.
State File Path: /var/run/nic_health_monitor/state.json
Kubernetes Volume Mount:
State File Structure:
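The exact on-disk schema is defined by the implementation; the following Go sketch reflects the field set described in this section (per-counter snapshots, breach flags, port states, known devices, boot ID). JSON tags and exact field names are assumptions:

```go
package main

import "time"

// CounterSnapshot holds the last persisted value and when it was taken; it is
// held for the full velocity window (see Section 6.7).
type CounterSnapshot struct {
	Value     uint64    `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

// CounterBreachFlag latches a threshold breach until counter reset or reboot.
type CounterBreachFlag struct {
	Breached bool      `json:"breached"`
	Since    time.Time `json:"since,omitempty"`
}

// PortStateSnapshot records the last observed port state for the state check.
type PortStateSnapshot struct {
	State string `json:"state"` // e.g. "ACTIVE", "DOWN"
}

// MonitorState is the root object persisted to state.json.
type MonitorState struct {
	BootID       string                       `json:"boot_id"`
	Counters     map[string]CounterSnapshot   `json:"counters"`      // key: <device>:<port>:<counter_name>
	BreachFlags  map[string]CounterBreachFlag `json:"breach_flags"`  // key: <device>:<port>:<counter_name>
	PortStates   map[string]PortStateSnapshot `json:"port_states"`   // key: <device>_<port>
	KnownDevices []string                     `json:"known_devices"` // e.g. ["mlx5_0", "mlx5_1"]
}

func main() {}
```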
Map keys: Counter snapshots and breach flags use the key format <device>:<port>:<counter_name> (e.g., mlx5_0:1:link_downed). Port state snapshots use <device>_<port> (e.g., mlx5_0_1). KnownDevices is a flat list of device names (e.g., ["mlx5_0", "mlx5_1", ...]).
Save triggers: The state file is written after each poll cycle completes (both state and counter checks). Errors during save are logged as warnings but do not halt monitoring.
Load behavior: On startup, the monitor attempts to load the state file. If the file is missing or corrupt, the monitor starts with empty state (equivalent to first boot). A warning is logged.
6.7 Rationale
- When a counter resets, the new value represents errors accumulated since the reset
- This is a conservative approach: we may slightly undercount errors immediately after a reset
- Alternative (treating reset as zero delta) could miss real errors that occurred during/after reset
- Admin-initiated resets are a legitimate remediation action — the monitor must recognize them and clear the unhealthy condition by emitting a recovery event
- Driver reloads are logged separately by the Syslog Health Monitor, providing correlation context
- Persistent state ensures recovery events survive pod restarts, preventing nodes from being permanently stuck in an unhealthy state
- Per-counter timestamps enable accurate velocity calculation — the snapshot is held for the configured velocityUnit window (1s / 1m / 1h) and the rate is computed over the real wall-clock elapsed time, not extrapolated from a single poll. This means a 120/hour threshold genuinely observes one hour of data, not 3600 × the per-second rate. Persisting the timestamp lets the window survive pod restarts: the new pod resumes from the persisted snapshot instead of starting a fresh window.
7. Missing Counter Handling
Not all counters are available on all NIC versions or firmware revisions. The monitor must gracefully handle missing counters to ensure portability across different hardware generations (ConnectX-5, ConnectX-6, ConnectX-7, etc.).
7.1 Design Principles
- Fail-open for missing counters: If a counter file does not exist, skip it silently. Do not emit errors or events.
- Log at startup only: On monitor initialization, log which counters are available vs. unavailable for debugging purposes. Do not repeatedly log missing counters during polling.
- Graceful degradation: The monitor should function with whatever subset of counters is available. A node with an older NIC still benefits from the counters that do exist.
- Configuration flexibility: Allow operators to disable specific counters via configuration if they are known to be unavailable or irrelevant for their environment.
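A Go sketch of the startup availability scan these principles imply; availableCounters is an illustrative helper, not the monitor's actual API:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// availableCounters filters the configured counter names down to the files
// that actually exist for this device/port, logging the result once at
// startup; polling then skips missing counters silently.
func availableCounters(counterDir string, configured []string) []string {
	var present, missing []string
	for _, name := range configured {
		if _, err := os.Stat(filepath.Join(counterDir, name)); err == nil {
			present = append(present, name)
		} else {
			missing = append(missing, name)
		}
	}
	log.Printf("counters available=%v missing=%v dir=%s", present, missing, counterDir)
	return present
}

func main() {
	dir := "/sys/class/infiniband/mlx5_0/ports/1/hw_counters"
	_ = availableCounters(dir, []string{"rnr_nak_retry_err", "roce_slow_restart"})
}
```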
7.2 Common Counter Availability by NIC Generation
Note: Counter availability may also depend on firmware version and driver configuration. The monitor should always verify counter existence at runtime rather than relying on static assumptions.
8. RDMA vs TCP/IP Counter Domains
Critical Architectural Note: RDMA vs TCP/IP Counter Domains
For RoCE devices, there are TWO separate counter domains:
Field-validated observation: Running ping through a RoCE interface does NOT increment the InfiniBand counters (port_rcv_data and port_xmit_data stay at 0). The ping goes through the TCP/IP stack and is tracked in the Ethernet statistics instead.
Implication for monitoring: To detect RDMA-specific degradation (which affects distributed workloads), you MUST monitor the InfiniBand counters. Ethernet statistics alone will miss RDMA-layer issues like roce_slow_restart errors.
8.1 Counter Domain Diagram
9. Data Structures
9.1 Counter Structures
9.2 Persistent State Structures
The monitor persists operational state to survive pod restarts. See Section 6.6 for the on-disk schema and field-level rationale; the structures defined there (MonitorState, CounterSnapshot, CounterBreachFlag, PortStateSnapshot) are the canonical reference.
Example state file content:
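An illustrative state file, shaped after the struct sketch in Section 6.6 and the key formats documented there; all values are made up:

```json
{
  "boot_id": "3f4e9a2b-7c1d-4e8f-9a6b-0c2d4e6f8a1b",
  "counters": {
    "mlx5_0:1:link_downed": { "value": 0, "timestamp": "2025-01-01T00:00:00Z" },
    "mlx5_0:1:symbol_error": { "value": 42, "timestamp": "2025-01-01T00:00:00Z" }
  },
  "breach_flags": {
    "mlx5_0:1:link_downed": { "breached": false }
  },
  "port_states": {
    "mlx5_0_1": { "state": "ACTIVE" }
  },
  "known_devices": ["mlx5_0", "mlx5_1"]
}
```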
9.3 Entity Model
NICs and Ports are modeled as separate entity types to enable precise fault localization:
10. Configuration
The counter monitoring system is fully configurable, allowing operators to:
- Define which counters to monitor
- Configure threshold types (delta-based or velocity-based)
- Set fatal/non-fatal severity levels per counter
- Override default thresholds for specific environments
10.1 Configuration Schema
Polling interval: The polling interval is set globally on the Helm chart (pollingInterval, default 1s) and is not configured per counter. Velocity thresholds are evaluated against a window matching the configured velocityUnit (1s / 1m / 1h), independent of the polling interval — so a fast 1s poll is suitable for every counter type without producing false alerts on long windows.
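A hypothetical rendering of the per-counter schema as a Go struct; the field concepts (thresholdType, velocityUnit, isFatal) come from this document, but the names and tags are assumptions:

```go
package main

// CounterConfig is a hypothetical shape for one monitored counter entry.
type CounterConfig struct {
	Name          string  `yaml:"name"`          // sysfs file name, e.g. "symbol_error"
	Path          string  `yaml:"path"`          // "counters" or "hw_counters"
	ThresholdType string  `yaml:"thresholdType"` // "delta" or "velocity"
	Threshold     float64 `yaml:"threshold"`     // per-delta or per-velocityUnit limit
	VelocityUnit  string  `yaml:"velocityUnit"`  // "1s", "1m", or "1h"; unused for delta
	IsFatal       bool    `yaml:"isFatal"`       // true => workload-failing condition
	Enabled       bool    `yaml:"enabled"`       // operators may disable unavailable counters
}

func main() {}
```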
10.2 Default Counter Configuration
The following counters are monitored by default. Operators can override any setting or add custom counters.
10.3 Custom Counter Example
Operators can add custom counters or override defaults:
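For example, continuing the CounterConfig sketch from Section 10.1, an operator-supplied entry mirroring the symbol_error_fatal default described in Section 5.2; all values are illustrative defaults, not mandates:

```go
// Continuing the CounterConfig sketch above.
var symbolErrorFatal = CounterConfig{
	Name:          "symbol_error",
	Path:          "counters", // same sysfs file as the non-fatal symbol_error entry
	ThresholdType: "velocity",
	Threshold:     120, // 120/hour, per the IBTA BER discussion in Section 2.4.2
	VelocityUnit:  "1h",
	IsFatal:       true,
	Enabled:       true,
}
```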
10.4 Threshold Processing Algorithm
The evaluator uses latching breach semantics (see the algorithm in Section 6.3): once a counter breaches its threshold, the breach flag stays set until the counter is reset (current < previous) or the host reboots. Polls while a counter is already breached emit nothing; recovery events fire only on counter reset of a previously breached counter.
10.5 Configuration Validation
The monitor validates configuration at startup:
11. Event Management
11.1 Event Construction
Example Event Fields (Fatal - link_downed):
Note: Fatal counter events use the state check name (InfiniBandStateCheck / EthernetStateCheck) so that all fatal signals for a given NIC type consolidate under a single node condition.
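An illustrative fatal event, using the field names that appear throughout this document (CheckName, IsHealthy, IsFatal, RecommendedAction); the full platform HealthEvent schema is assumed, and Message is a hypothetical field:

```go
package main

import "fmt"

// HealthEvent is reduced here to the fields discussed in this document.
type HealthEvent struct {
	CheckName         string
	IsHealthy         bool
	IsFatal           bool
	RecommendedAction string
	Message           string // hypothetical free-text field
}

func main() {
	fatalLinkDowned := HealthEvent{
		CheckName:         "InfiniBandStateCheck", // fatal counters consolidate under the state check
		IsHealthy:         false,
		IsFatal:           true,
		RecommendedAction: "REPLACE_VM",
		Message:           "link_downed delta > 0 on mlx5_0 port 1",
	}
	fmt.Printf("%+v\n", fatalLinkDowned)
}
```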
Example Event Fields (Non-Fatal - Degradation):
Note: Non-fatal counter events use the degradation check name (InfiniBandDegradationCheck / EthernetDegradationCheck) to keep degradation signals separate from fatal conditions on the node.
Example Event Fields (Recovery - Counter Reset by Admin):
Note: Recovery events are emitted when a previously breached counter returns below its threshold — typically after an administrator clears the counters. The CheckName matches the original breach event to ensure the recovery clears the correct condition.
11.2 Event Routing
Appendix A: Quick Reference - Default Counter Thresholds
Note: All counters, thresholds, and severity levels are configurable via the monitor configuration. The values below are the defaults that apply when no custom configuration is provided. See Section 10: Configuration for customization options.
Fatal Counters (Default: IsFatal = true)
Driver/Firmware Logs
For kernel log pattern details (fatal and non-fatal classifications, regex patterns, and kernel source references), see Syslog Detection & Correlation.
Non-Fatal Counters (Default: IsFatal = false)
Note: rnr_nak_retry_err is FATAL by default (see the Fatal Counters table above). All counters can have their severity and threshold overridden via configuration.
Design Principle
Key Insight: Deterministically fatal events in logs (cmd_exec timeout, etc.) are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Diagnostic logs (insufficient power, High Temperature, module absent) are Non-Fatal (IsFatal=false). State and counter conditions are also Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM.
References
PHY & Signal Integrity
- PAM4 Error Correction Challenges in 400GbE (EDN)
- Determine Which Links Are Experiencing Significant Errors - Sun/Oracle (citing IBTA BER Threshold)
Linux Kernel & Driver
Fabric Diagnostics
- ibdiagnet User Manual (NVIDIA)
- Black Hole Detection (sFlow)
- InfiniBand™ Architecture Specification (IBTA)
Vendor Monitoring Guides
- InfiniBand Errors Dashboard - HPE ClusterStor
- HPC Clusters Using InfiniBand on IBM Power Systems - IBM Redbooks
- NVIDIA UFM InfiniBand Port Counters
- NVIDIA DOCA Telemetry Service Guide
- NVIDIA UFM Telemetry - InfiniBand Cluster Bring-Up
RDMA Programming References
- ibv_modify_qp(3) — Linux Manual Page (rnr_retry, retry_cnt)
- NVIDIA RDMA-Aware Programming - Queue Pair Bringup