NIC Health Monitor: Link Counter Detection

Table of Contents

  1. Overview
  2. Theoretical Foundation
  3. Architecture
  4. Complete Counter Specification
  5. Counter Reading and Parsing
  6. Counter Reset Handling and Persistent State
  7. Missing Counter Handling
  8. RDMA vs TCP/IP Counter Domains
  9. Data Structures
  10. Configuration
  11. Event Management

Related Documents:


1. Overview

1.1 Problem Statement

Modern GPU clusters suffer from Grey Failures (subtle degradations) and straggler effects where a single degraded link throttles thousands of GPUs. Simple UP/DOWN polling is insufficient; a deterministic degradation detection system is required that can detect both hard failures and gradual degradation before FEC exhaustion causes catastrophic packet loss.

This document covers the Degradation Monitoring component of the NIC Health Monitor, which detects:

  • Fatal counter violations - Counters that guarantee workload failure when incremented
  • Rate-based degradation - Error rates exceeding thresholds that predict impending failure
  • Pre-failure prediction - Detecting BER climbing before FEC exhaustion

1.3 Binary Severity Model

This monitor uses a binary severity model based on workload impact:

| Severity | Meaning | Example |
|---|---|---|
| Fatal | Workload WILL fail or HAS failed | link_downed (any), excessive_buffer_overrun_errors (any) |
| Non-Fatal | Degradation detected, workload continues | Symbol errors, congestion, link flapping |

Key Design Principle: The only question that matters is “Will the running workload fail because of this?”

1.4 Counter Detection Overview Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│ LINK COUNTER DETECTION FLOW │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DATA SOURCES (sysfs) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ /sys/class/infiniband/<dev>/ports/<port>/ │ │
│ │ ├── counters/ │ │
│ │ │ ├── symbol_error → PHY bit errors (before FEC) │ │
│ │ │ ├── link_error_recovery → Link retraining events │ │
│ │ │ ├── link_downed → Port training failures (FATAL) │ │
│ │ │ ├── port_rcv_errors → Malformed packets │ │
│ │ │ ├── local_link_integrity_errors → Physical errors (FATAL) │ │
│ │ │ ├── excessive_buffer_overrun → Lossless violation (FATAL) │ │
│ │ │ └── port_xmit_discards → TX discards (congestion) │ │
│ │ │ │ │
│ │ └── hw_counters/ → Extended counters │ │
│ │ ├── roce_slow_restart → Victim flow oscillation │ │
│ │ ├── local_ack_timeout_err → ACK timeout (path issues) │ │
│ │ ├── rnr_nak_retry_err → Connection severed (FATAL) │ │
│ │ └── req_transport_retries_exceeded → IB only (FATAL) │ │
│ │ │ │
│ │ /sys/class/net/<interface>/statistics/ │ │
│ │ └── carrier_changes → Link flap counter │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DEGRADATION MONITOR (1s polling interval) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ CALCULATES (locally, for threshold comparison): │ │
│ │ ├── Δ (delta) → current_value − snapshot.value │ │
│ │ ├── Δt (elapsed) → now − snapshot.timestamp (wall-clock) │ │
│ │ └── Δ/Δt (rate) → Errors per unit time, only after Δt ≥ window │ │
│ │ │ │
│ │ PERSISTS (hostPath-backed state file): │ │
│ │ ├── Per-counter snapshot (value + timestamp for delta/velocity) │ │
│ │ ├── Per-counter breach flag (for recovery event emission) │ │
│ │ └── Boot ID (clear state + emit healthy baselines on reboot) │ │
│ │ │ │
│ │ FATAL COUNTERS (immediate event): │ │
│ │ ├── link_downed (Delta > 0) → FATAL │ │
│ │ ├── excessive_buffer_overrun (any) → FATAL │ │
│ │ ├── local_link_integrity_errors (any) → FATAL │ │
│ │ ├── rnr_nak_retry_err (any) → FATAL │ │
│ │ └── symbol_error_fatal (> 120/hour) → FATAL │ │
│ │ │ │
│ │ NON-FATAL THRESHOLDS (degradation event): │ │
│ │ ├── symbol_error > 10/sec → NON-FATAL │ │
│ │ ├── link_error_recovery > 5/min → NON-FATAL │ │
│ │ ├── roce_slow_restart > 10/sec → NON-FATAL │ │
│ │ └── carrier_changes > 2/interval → NON-FATAL │ │
│ │ │ │
│ │ RECOVERY (when previously breached counter clears): │ │
│ │ └── Admin counter reset detected → RECOVERY (IsHealthy=true) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ RAW EVENTS + RECOVERY EVENTS → PLATFORM CONNECTOR → MongoDB │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ HEALTH EVENTS ANALYZER (Escalation Rules) │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ • RepeatedNICDegradation: "5+ non-fatal events in 24h → FATAL" │ │
│ │ • Pattern detection across time windows │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

2. Theoretical Foundation

2.1 The Physics of High-Speed Signaling Degradation

Modern interconnects (HDR/NDR InfiniBand, 100/200/400GbE) use PAM4 modulation (Pulse Amplitude Modulation, 4-level) to achieve high bandwidth. This represents a fundamental paradigm shift from previous generations.

2.1.1 PAM4 vs NRZ: Why Velocity Monitoring is Required

| Aspect | NRZ (EDR/100GbE) | PAM4 (NDR/400GbE) |
|---|---|---|
| Bits per symbol | 1 | 2 |
| Voltage levels | 2 (0, 1) | 4 (00, 01, 10, 11) |
| Eye height | Maximum | 1/3 of NRZ |
| SNR | High | Drastically reduced |
| Raw bit errors | Rare anomaly | Guaranteed and constant |
| Monitoring approach | “Any error is bad” | Velocity-based only |

Critical: In PAM4 systems, raw bit errors are a physical certainty. A monitor that alerts on “Any Error > 0” would be permanently alarming. The velocity-based approach is the only valid monitoring strategy for 400G+ networks. (Reference: PAM4 Test Challenges)

2.2 Signal Degradation Progression

Degradation flow: Physical impairment (cable/SFP) → Eye diagram closes (DSP struggles) → Symbol errors (PHY layer) → FEC corrections (recoverable) → CRC failures (unrecoverable) → Packet loss (FATAL)

Monitoring opportunity: Detect degradation at the symbol_error stage, before FEC exhaustion causes packet loss.

2.3 Bit Error Rate (BER), FEC, and the “Cliff Effect”

Because errors are inevitable in PAM4, Forward Error Correction (FEC) is mandatory for 200G/400G/NDR links.

| Link Health State | Bit Error Rate | Symbol Errors | Action |
|---|---|---|---|
| Healthy | < 10⁻¹⁵ | ~0 post-FEC | None |
| Failed (Fatal) | > 10⁻¹² | FEC margin exhausted | Fatal (REPLACE_VM) |

2.3.1 The FEC “Cliff Effect”

FEC masks physical degradation until the error rate exceeds correction capacity—then packet loss spikes instantly from 0% to ~100% (the “cliff”). The Degradation Monitor tracks Pre-FEC BER via symbol_error velocity, enabling node draining before the cliff is reached.

PAM4 Note (HDR/NDR): On 200G/400G adapters, non-zero raw BER is expected. Use rate-based thresholds (e.g., symbol_error > 10/sec) for degradation detection, not symbol_error > 0.

2.4 The Lossless Assumption and Deterministic Failure Horizons

Unlike general-purpose TCP/IP networks, which are architected to be resilient to packet loss, latency variation, and out-of-order delivery, RDMA fabrics—specifically InfiniBand (IB) and RDMA over Converged Ethernet (RoCE)—are designed under a “lossless” assumption. This architectural premise dictates that once a packet is admitted to the fabric, its delivery is guaranteed by credit-based flow control (in IB) or Priority Flow Control (in RoCE), relieving the transport layer of heavy congestion management overhead.

Key Insight: This reliance on near-perfect transmission introduces a binary fragility to the system. When the physical or link layer violates the lossless assumption, the impact on the application is often not merely performance degradation, but catastrophic failure. For tightly coupled distributed workloads using MPI or NCCL, a failure in a single link deterministically terminates the entire job.

2.4.1 Soft vs Hard Errors: The Determinism Boundary

The critical operational requirement is distinguishing between:

| Error Type | Characteristics | Impact |
|---|---|---|
| Soft Errors | Probabilistic, recoverable via FEC/retries | Performance degradation, workload continues |
| Hard Errors | Deterministic, exceed recovery capacity | Application failure guaranteed |

The boundary between soft and hard errors is defined by:

  1. Counter thresholds that indicate recovery mechanism exhaustion
  2. Rate of change that exceeds retry bandwidth
  3. Specific counter types that indicate fundamental violation of the lossless contract

2.4.2 The 10⁻¹² BER Threshold

The InfiniBand specification defines a compliant link as maintaining a Bit Error Rate (BER) better than 10⁻¹². This physical constant provides the basis for threshold calculations:

  • At a BER of 10⁻¹², a link running at high speed (e.g., HDR 200Gb/s) experiences a predictable number of errors per unit time
  • IBTA-compliant threshold: Maximum allowable symbol error rate is 120 errors per hour (IBTA Specification / Oracle Documentation)
  • Below this rate, FEC algorithms can typically correct errors without retransmission
  • Above this rate, the “effective” error rate (post-FEC) rises, leading to packet corruption and Link Level Retransmission (LLR) or transport layer retries

Monitoring Implication: While a single SymbolError is not fatal, a rate exceeding 120/hour (≈2/minute) is a deterministic predictor of impending link instability. Monitoring systems should treat this as a Fatal condition requiring node replacement.

2.4.3 Deterministic Failure Mechanisms

The following counters represent absolute deterministic failure when they increment:

| Counter | Mechanism | Why Deterministic |
|---|---|---|
| link_downed | Port Training State Machine fails to maintain LinkUp | Standard HPC applications do not support transparent dynamic rerouting of active QPs |
| excessive_buffer_overrun_errors | HCA internal ingress buffer overflows | Violates fundamental “lossless” contract; packet causing overrun is dropped immediately |
| rnr_nak_retry_err | Receiver Not Ready NAK retry exhausted | Terminal state of error handling; connection is severed |
| local_link_integrity_errors | Raw physical errors exceed LocalPhyErrors hardware limit | Link is operating outside design specifications |

Note: These four counters represent absolute deterministic failure. Additionally, symbol_error has a fatal threshold at > 120/hour (IBTA BER spec violation) via the symbol_error_fatal config entry. All other counters (symbol_error at > 10/sec, port_rcv_errors, etc.) are non-fatal degradation indicators.

2.5 The Transport Layer Retry Window

When hardware counters increment, they don’t directly cause application failure—they trigger a reaction in the software stack. Understanding this interaction defines the “Fatal” threshold:

┌─────────────────────────────────────────────────────────────────────────────────┐
│ TRANSPORT LAYER RETRY WINDOW │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Hardware: SymbolError ──► FEC fails ──► Packet Corrupted │
│ │ │
│ ▼ │
│ Receiver: drops packet ──► PortRcvErrors increments │
│ │ │
│ ▼ │
│ Sender: waits for ACK ──► Timeout ──► Retry (1) ──► ... ──► Retry (N) │
│ │ │ │
│ │ ▼ │
│ │ GIVE UP │
│ │ │ │
│ ▼ ▼ │
│ Application: NCCL_IB_RETRY_CNT (default: 7) exhausted │
│ │ │
│ ▼ │
│ Result: QP transitions to ERROR state ──► Application crashes │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

2.6 Transport Retry Count Exceeded (Error 12)

When the NIC sends a packet and the ACK never arrives:

Send Packet → Wait for ACK → Timeout → Retry (1) → Timeout → ... → Retry (N) → GIVE UP

After retry_cnt attempts (default: 7), the NIC tears down the connection and the application receives IBV_WC_RETRY_EXC_ERR.

Implications:

  • Confirms Logical Link is broken even if physical link is UP
  • Often indicates “Silent Drop” or Black Hole in the fabric
  • Local symptom of a remote problem

Important: Application-Triggered Timeouts. A rising local_ack_timeout_err counter does NOT necessarily indicate a local NIC fault. If a remote NCCL rank crashes or hangs, the remote NIC stops responding to RDMA requests. The local NIC retries and eventually exhausts retry_cnt, incrementing local_ack_timeout_err on the local side. This means the counter can be triggered by: (1) fabric black hole (network issue), (2) remote NIC failure, or (3) remote application crash/hang — which is not a NIC problem at all. This is why local_ack_timeout_err is classified as Non-Fatal (IsFatal=false) — it requires correlation with other signals (port state, remote node health) to determine the root cause.

What This Monitor CAN Detect: The local_ack_timeout_err and req_transport_retries_exceeded (native IB) hardware counters track these retry events at the NIC level. Rising counter values indicate transport-layer problems even if we can’t see the application error.

Diagnostic Commands:

```bash
# Read hardware counters directly from sysfs
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_transport_retries_exceeded
# Output: 42 (non-zero value indicates connection-severing retries)

# Detailed link quality and error counters via Mellanox diagnostic tools
mlxlink -d /dev/mst/mt4126_pciconf0 --show_ber
# Output: Symbol Errors, BER counters

# Query eye opening (signal quality indicator)
mlxlink -d /dev/mst/mt4126_pciconf0 --eye_open
# Output: Eye height/width for each PAM4 lane (identifies physical cable degradation)
```

Correlation: Use with ibdiagnet to determine if issue is local (NIC) or remote (Switch/Fabric).

Fabric-wide Diagnostic Command:

```bash
# Perform comprehensive fabric-wide diagnostics (requires Subnet Manager access)
ibdiagnet -o /tmp/ibdiag_output
# Output: Summary of fabric errors, including symbol errors on switches and remote ports
```

3. Architecture

3.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern

The Degradation Monitor follows NVSentinel’s established architectural pattern where:

  1. Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
  2. Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
  3. MongoDB serves as the source of truth for event history and correlation queries

| Architectural Principle | Implementation | Purpose |
|---|---|---|
| Raw Event Reporting | Each threshold violation → immediate event | Enables centralized correlation with full historical context |
| Centralized Correlation | Health Events Analyzer MongoDB pipelines | Flexible, configurable rules without monitor code changes |
| Temporal Correlation | Analyzer rules with time windows | Detects patterns like “5 degradation events in 24 hours” |

3.2 Component Responsibilities

| Component | Responsibility | What It Does NOT Do |
|---|---|---|
| NIC Health Monitor (Degradation Check) | Poll sysfs counters, calculate deltas/rates, persist counter snapshots and breach state, emit raw events and recovery events | Aggregation, deduplication, correlation, pattern detection |
| Health Events Analyzer | Correlate events, detect patterns, escalate severity | Direct hardware access |

Local State Persistence: The Degradation Check maintains a persistent state file on the node (hostPath-backed) containing per-counter snapshots (value + timestamp), per-counter breach flags, and the host boot ID. This enables the monitor to:

  1. Compute accurate deltas and precise velocity rates by holding the persisted snapshot for the configured velocity window and computing the rate over the real elapsed time, so a 120/hour threshold is observed over a one-hour window rather than extrapolated from a single 1s sample.
  2. Seamlessly resume velocity windows after a pod restart, because the snapshot timestamp survives the restart.
  3. Emit recovery events (IsHealthy=true) when counters are reset by an administrator, by retaining the breach flag across restarts.
  4. Detect host reboots, clear all state, and emit healthy baseline events for all ports and counters, since the node may have had NICs replaced during maintenance.

This local state is strictly operational; all correlation and pattern detection remains centralized in the Health Events Analyzer.

3.3 Degradation Check Data Flow (1s polling interval)

Reads:
├── counters/ → Standard IB counters (symbol_error, link_error_recovery, etc.)
├── hw_counters/ → Extended counters (roce_slow_restart, rnr_nak_retry_err, etc.)
├── statistics/ → Ethernet statistics (rx_crc_errors, rx_missed_errors, etc.)
└── carrier_changes → Link flap counter (catches UP/DOWN events between polls)
Calculates (locally, for threshold comparison):
├── Δ (delta) → current_value − snapshot.value
├── Δt (elapsed) → now − snapshot.timestamp (real wall-clock time)
└── Δ/Δt (rate) → For velocity thresholds, evaluated only after Δt ≥ window
(window = 1s / 1m / 1h, matching the configured velocityUnit)
When threshold exceeded for the first time, emits a single RAW event with:
├── Counter name → e.g., "symbol_error_fatal"
├── Current value → e.g., 12500
├── Delta → accumulated since the snapshot was taken
├── Rate → e.g., 200/hour
└── Threshold → e.g., 120/hour
Subsequent polls while breached emit nothing (latching breach). A
recovery event is emitted only when the counter is reset (admin clear)
or the host reboots.
Fatal counter thresholds (configurable, defaults shown):
├── link_downed (Delta > 0) → QP disconnect (FATAL)
├── excessive_buffer_overrun_errors (any) → Lossless violation (FATAL)
├── local_link_integrity_errors (any) → Link outside spec (FATAL)
├── rnr_nak_retry_err (any) → Connection severed (FATAL)
└── symbol_error_fatal (> 120/hour) → IBTA BER spec violation (FATAL)
Non-fatal thresholds (configurable, defaults shown):
├── symbol_error > 10/sec
├── link_error_recovery > 5/min
├── roce_slow_restart > 10/sec
└── carrier_changes > 2/interval
Persists (to hostPath-backed state file after each poll cycle):
├── Per-counter snapshot → Value + wall-clock timestamp (for delta/velocity)
├── Per-counter breach → Whether threshold is currently exceeded (for recovery)
└── Boot ID → Detects host reboot to clear state + emit healthy baselines
Emits: Raw DEGRADATION events → Platform Connector → MongoDB
Recovery events (IsHealthy=true) when breached counter clears (e.g., admin reset)
(Pattern detection and escalation handled by Health Events Analyzer)

4. Complete Counter Specification

4.1 Complete Counter Set (“Golden Counters” + Extended)

This monitor tracks both fatal counters (deterministic workload failure) and non-fatal counters (degradation indicators). The IsFatal field in the HealthEvent distinguishes between them.

4.1.1 Standard Counters (/sys/class/infiniband/<dev>/ports/<port>/counters/)

| Counter | File Name | Degradation Meaning | IsFatal | Alert Threshold | Source |
|---|---|---|---|---|---|
| Symbol Error | symbol_error | Raw bit errors before FEC. Expected non-zero for PAM4 (HDR/NDR). | No | Rate-based (e.g., > 10/sec for warning) | Oracle/IBTA |
| Link Error Recovery | link_error_recovery | PHY-initiated link retraining (micro-flapping). Causes millisecond-scale latency spikes. | No | > 5/min (watchdog trigger) | NVIDIA UFM IB Port Counters |
| Link Downed | link_downed | Port Training State Machine failed to maintain LinkUp. | YES | Delta > 0 (Runtime) | HPE ClusterStor |
| Port Receive Errors | port_rcv_errors | Malformed packets (CRC, length errors). Saturates retry bandwidth at high rates. | No | > 10/sec (retry saturation) | NVIDIA UFM IB Port Counters |
| Local Link Integrity | local_link_integrity_errors | Raw physical errors exceeded LocalPhyErrors hardware cap. Link operating outside spec. | YES | > 0 (any) | HPE ClusterStor |
| Buffer Overrun | excessive_buffer_overrun_errors | HCA internal buffer overflow—lossless contract violated. Packet dropped immediately. | YES | > 0 (any) | IBM Redbooks |
| Port Transmit Discards | port_xmit_discards | TX discards due to congestion. | No | > 100/sec | |

4.1.2 Extended Counters (/sys/class/infiniband/<dev>/ports/<port>/hw_counters/) — Mostly Non-Fatal

Extended counters are non-fatal by default, with one exception: rnr_nak_retry_err is fatal (see below). The non-fatal counters indicate congestion, retransmissions, or recoverable transport events; RDMA’s reliable transport handles these automatically, and workloads continue with potential performance impact.

Key counters by layer (monitored for performance degradation):

| Category | Counters | IsFatal | Alert Threshold | Justification |
|---|---|---|---|---|
| Physical | symbol_error | No | > 10/sec | PHY signal degradation / Dirty fiber. |
| Link | link_error_recovery | No | > 5/min | Link Flapping / PTSM Instability. |
| Integrity | port_rcv_errors | No | > 10/sec | FCS/CRC Corruption (Bit Rot). |
| Congestion | port_xmit_discards | No | > 100/sec | Congestion Collapse / PFC breakdown. |
| Transport | roce_slow_restart | No | > 10/sec | Victim Flow / Transport Oscillation (Straggler). |
| Transport | rnr_nak_retry_err | YES | > 0 (any) | RNR NAK retry exhausted; QP enters error state (ref). |
| Timeout | local_ack_timeout_err | No | > 1/sec | Broken Path / Fabric Black Hole. Can be caused by remote app crash (see Section 2.6). |
| Interface | carrier_changes | No | > 2/interval | Physical instability visible to OS. |

Key Insights:

  • rnr_nak_retry_err > 0: FATAL - Indicates RNR NAK retry exhausted; the connection has been severed.
  • roce_slow_restart > 10/sec: Primary indicator for Grey Failures. Indicates flow oscillation and straggler behavior.
  • port_xmit_discards > 100/sec: Flow control breakdown. Network physically unable to handle load.
  • symbol_error > 10/sec: Signature of “Dirty Fiber” or microscopic dust on connectors.

4.2 Counter Locations

  • Standard IB counters: /sys/class/infiniband/<dev>/ports/<port>/counters/ (symbol_error, link_downed, local_link_integrity_errors, etc.)
  • Extended counters (Mellanox): /sys/class/infiniband/<dev>/ports/<port>/hw_counters/ (rnr_nak_retry_err, roce_slow_restart, etc.)
  • Ethernet stats (RoCE): /sys/class/net/<iface>/statistics/ (carrier_changes)

4.3 Diagnostic Commands

```bash
# Read standard counters
cat /sys/class/infiniband/mlx5_0/ports/1/counters/symbol_error
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_errors
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_discards

# Read extended hw_counters (degradation monitoring)
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/local_ack_timeout_err
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/roce_slow_restart
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err

# Fabric-wide diagnostics (requires Subnet Manager access)
ibdiagnet -o /tmp/ibdiag_output
```

4.4 Key Design Decisions

  • link_downed is Fatal. In running MPI/NCCL jobs, any increment (Delta > 0) guarantees job crash.
  • excessive_buffer_overrun_errors is Fatal. Violates fundamental “lossless” contract; packet causing overrun is dropped immediately.
  • rnr_nak_retry_err is Fatal. Indicates Receiver Not Ready NAK retry exhausted; the connection has been severed.
  • local_link_integrity_errors is Fatal. This counter is a “meta-threshold”—it only increments when raw physical errors exceed the hardware-defined LocalPhyErrors cap.
  • symbol_error thresholds account for PAM4 (HDR/NDR) behavior. Zero-tolerance is obsolete for modern links; non-zero raw BER is expected. Monitor velocity for degradation trends.
  • Most hw_counters are Non-Fatal by default—they indicate degradation that should be monitored but doesn’t immediately crash workloads. Exception: rnr_nak_retry_err is fatal.

4.5 Consolidated Deterministic Failure Thresholds (Defaults)

Configuration Note: All thresholds and severity levels are configurable. The values below are defaults based on industry specifications and vendor recommendations. See Section 10: Configuration for customization options.

Table 1: Absolute Deterministic Failure Thresholds (Default: Fatal - IsFatal=true)

Breaching these thresholds guarantees application failure or mandatory node exclusion.

| Counter Name | Type | Fatal Threshold | IsFatal | Deterministic Mechanism | Source |
|---|---|---|---|---|---|
| link_downed | Standard | Delta > 0 (Runtime) | YES | Logical path destruction; QP disconnect. Standard HPC apps don’t support transparent QP rerouting. | HPE ClusterStor |
| excessive_buffer_overrun_errors | Standard | > 0 (Any) | YES | Lossless guarantee violation; packet dropped immediately. HCA ingress buffer overflow. | IBM Redbooks |
| rnr_nak_retry_err | Extended | > 0 (Any) | YES | Receiver Not Ready NAK retry exhausted; QP transitions to error state (IBV_WC_RNR_RETRY_EXC_ERR). Connection cannot recover without application-level teardown. | ibv_modify_qp(3) - rnr_retry QP attr, NVIDIA RDMA Programming |
| local_link_integrity_errors | Standard | > 0 (Any) | YES | Physical error density exceeds hardware-defined LocalPhyErrors cap. Link outside spec. | HPE ClusterStor |
| symbol_error_fatal | Standard | > 120/hour | YES | IBTA BER spec violation (10⁻¹²). Link operating outside specification; FEC margin exhausted. | Oracle/IBTA BER Threshold |

Table 2: Predictive Thresholds (Non-Fatal - IsFatal=false)

Breaching these rates indicates degradation requiring monitoring. Workloads continue but performance may be impacted.

Threshold Source Note: These thresholds are derived from a combination of IBTA BER specifications, cloud provider operational heuristics (Azure, AWS), vendor documentation, and field experience. Specific rate values are configurable defaults, not specification mandates. See Section 10: Configuration for customization options.

| Counter Name | Type | Alert Threshold | IsFatal | Rationale | Source |
|---|---|---|---|---|---|
| symbol_error | PHY | > 10/sec | No | Physical layer degradation (Dirty Fiber). Derived from IBTA BER spec (10⁻¹²); 10/sec implies BER degraded to ~1E-8. | Oracle/IBTA BER Threshold, NVIDIA UFM IB Port Counters |
| link_error_recovery | Link | > 5/min | No | PTSM Instability. Each retrain causes 50ms-2s stall. 5/min = flapping link. | NVIDIA UFM IB Port Counters (counter definition); threshold is operational heuristic |
| port_rcv_errors | Standard | > 10/sec | No | Bit Rot / CRC Corruption. Saturates transport replay buffer. | NVIDIA UFM IB Port Counters |
| port_xmit_discards | Standard | > 100/sec | No | Congestion Collapse / PFC breakdown. | NVIDIA UFM IB Port Counters (counter definition); threshold is operational heuristic |
| roce_slow_restart | RoCE | > 10/sec | No | “Victim Flow” oscillation. Jitter impacts AllReduce synchronization. | NVIDIA DOCA Telemetry |
| local_ack_timeout_err | Transport | > 1/sec | No | ACK timeouts indicate path issues (Black Hole). Can also be caused by remote application crash (see Section 2.6). | Operational heuristic |
| carrier_changes | Interface | > 2/interval | No | Link instability (catches UP/DOWN events between polls). | Operational heuristic |

4.6 Technical Justification for Non-Fatal Thresholds

The following analysis validates the efficacy of the proposed monitoring design based on hardware specifications and empirical reliability studies.

1. Physical Layer (L1) Justifications

  • Symbol Error (symbol_error > 10/sec): A rate of 10/sec is a robust indicator of physical degradation. In modern PAM4 links, a healthy optical connection operates with a BER better than 1E-12 (roughly one error every few hours). A rate of 10/sec implies the BER has degraded by orders of magnitude (to ~1E-8). This is the classic signature of “Dirty Fiber” or microscopic dust on connectors.
  • Link Error Recovery (link_error_recovery > 5/min): Tracks the Port Training State Machine (PTSM). 5 events per minute represents a “Flapping” link. While the link recovers (non-fatal), each retrain causes 50ms to 2s of stall, decimating performance for synchronous GPU workloads.
  • Carrier Changes (carrier_changes > 2/interval): The OS-visible shadow of link recovery. Confirms that physical instability was severe enough to disrupt the driver layer.

2. Data Link Layer (L2) Justifications

  • Port Receive Errors (port_rcv_errors > 10/sec): Indicates “Bit Rot”—data corruption surviving the PHY but failing the CRC/FCS check. Triggers “Phantom Congestion” as the network repeatedly retransmits corrupted frames.
  • Port Transmit Discards (port_xmit_discards > 100/sec): Indicates flow control breakdown. The network is physically unable to handle the load, and backpressure mechanisms (PFC) are failing. Definitive signal of Congestion Collapse.

3. Transport Layer (L4) Justifications

  • RoCE Slow Restart (roce_slow_restart > 10/sec): Primary indicator for Grey Failures. Indicates a flow is timing out and resetting its congestion window repeatedly. This creates stragglers that stall the entire GPU fleet during collective operations (AllReduce).
  • Local ACK Timeout (local_ack_timeout_err > 1/sec): In a reliable lossless network, ACKs should not be lost. A persistent rate of 1/sec implies a “Fabric Black Hole” (e.g., a specific bad ECMP path).

Note on rnr_nak_retry_err: This counter is FATAL (not a non-fatal threshold). Any increment indicates the Receiver Not Ready NAK retry limit has been exhausted and the connection has been severed. This is a terminal state of error handling.

Final Verdict: These thresholds are calibrated to distinguish between background noise (standard FEC activity) and pathological hardware degradation that threatens AI training efficiency.


5. Counter Reading and Parsing

5.1 Mellanox Counter Reading

For Mellanox devices (IB and RoCE), the monitor reads:

  1. Standard Counters: /sys/class/infiniband/<dev>/ports/1/counters/
    • Fatal counters: link_downed, local_link_integrity_errors, excessive_buffer_overrun_errors
    • Two-tier counter: symbol_error — non-fatal at > 10/sec (degradation warning), fatal at > 120/hour (IBTA BER spec violation)
  2. Extended Counters: /sys/class/infiniband/<dev>/ports/1/hw_counters/
    • Fatal counter: rnr_nak_retry_err
    • Non-fatal counters for degradation monitoring

Note: Mellanox throughput counters (port_rcv_data, port_xmit_data) are in 4-byte words. Multiply by 4 to get bytes.
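
To make the reading and conversion concrete, here is a minimal sketch (not the monitor's actual implementation) that reads one counter from sysfs and applies the 4-byte-word conversion for Mellanox throughput counters; the readCounter helper and package layout are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter reads a single uint64 counter from the InfiniBand sysfs tree.
func readCounter(device string, port int, rel string) (uint64, error) {
	p := filepath.Join("/sys/class/infiniband", device, "ports", strconv.Itoa(port), rel)
	raw, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	words, err := readCounter("mlx5_0", 1, "counters/port_rcv_data")
	if err != nil {
		fmt.Println("counter unavailable:", err)
		return
	}
	// port_rcv_data / port_xmit_data are reported in 4-byte words.
	fmt.Printf("port_rcv_data: %d words = %d bytes\n", words, words*4)
}
```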

5.2 Mellanox Fatal Counter Paths

| Counter | Path | Fatal Threshold |
|---|---|---|
| symbol_error_fatal | /sys/class/infiniband/<dev>/ports/<port>/counters/symbol_error | > 120/hour |
| local_link_integrity_errors | /sys/class/infiniband/<dev>/ports/<port>/counters/local_link_integrity_errors | Delta > 0 |
| excessive_buffer_overrun_errors | /sys/class/infiniband/<dev>/ports/<port>/counters/excessive_buffer_overrun_errors | Delta > 0 |
| rnr_nak_retry_err | /sys/class/infiniband/<dev>/ports/<port>/hw_counters/rnr_nak_retry_err | Delta > 0 |

Note: symbol_error has two default config entries: symbol_error (non-fatal, > 10/sec for degradation) and symbol_error_fatal (fatal, > 120/hour per IBTA specification (10⁻¹² BER)). Both read from the same sysfs file. On PAM4 links (HDR/NDR), some non-zero symbol errors are expected; tune the fatal threshold if 120/hour is too sensitive for your environment.


6. Counter Reset Handling and Persistent State

Hardware counters may reset due to driver reloads, device resets, administrator-initiated clears (e.g., perfquery -x, echo 0 > /sys/...), or (rarely) uint64 overflow. The monitor must handle cases where Current < Previous to avoid incorrect delta calculations and must emit recovery events when a counter reset clears a previously breached threshold.

6.1 The Problem

Poll N: symbol_error = 1,000,000
Driver Reload / Counter Reset / Admin Clear
Poll N+1: symbol_error = 50
Naive Delta = 50 - 1,000,000 = NEGATIVE (or overflow to huge positive)

Additionally, if symbol_error had previously triggered a FATAL event (e.g., exceeding 120/hour), and an administrator resets the counters to remediate the issue, the monitor must detect this and emit a recovery event (IsHealthy=true) to clear the unhealthy condition on the platform.

6.2 Counter Reset Causes

| Cause | Detection | Expected Behavior |
|---|---|---|
| Driver reload (modprobe -r mlx5_core) | current < previous; syslog monitor reports correlated kernel log | Treat current as delta, check for recovery |
| Device reset (firmware/hardware initiated) | current < previous; may correlate with syslog events | Treat current as delta, check for recovery |
| Administrator clear (CSP/cluster admin) | current < previous (typically to 0); no correlated syslog event | Treat current as delta, emit recovery event if previously breached |
| Host reboot | Boot ID changes; all counters restart from 0 | Clear all persisted state, emit healthy baselines for all ports and counters (see Section 6.5) |
| uint64 overflow | current < previous (extremely rare) | Treat current as delta |

6.3 Counter Reset Handling Algorithm

Reset Detection (uses an in-memory lastPollValue per counter plus the persisted snapshot as fallback after pod restart):

  1. If lastPollValue is recorded for this counter (steady state):
    • Reset detected when current < lastPollValue
  2. Else if a persisted snapshot exists (first poll after pod restart, lastPollValue not yet rebuilt):
    • Reset detected when current < snapshot.value
  3. Else (truly first poll for this counter):
    • No reset to detect; just initialize the snapshot

Sysfs counters are kernel-maintained and monotonic between resets, so current < previous is a definitive reset signal.

Threshold Evaluation and Recovery Steps (latching breach):

  1. If reset detected:
    • If counter was previously breached → emit recovery event (IsHealthy=true, IsFatal=false, RecommendedAction=NONE), clear breached flag
    • If not previously breached → no event
    • Update snapshot to (current, now) so the next window starts fresh from the post-reset baseline
  2. Else if counter is already breached (latched):
    • No event — breach stays latched until counter reset or boot ID change
  3. Else (no reset, not currently breached) — evaluate the threshold:
    • For delta: breach = (current − snapshot.value) > threshold, update snapshot every poll
    • For velocity: skip until now − snapshot.timestamp ≥ window, then breach = rate > threshold, update snapshot
  4. If breach detected in step 3:
    • Emit unhealthy event (IsHealthy=false, IsFatal per counter config), set breached=true in persistent state
  5. If no breach in step 3:
    • No event — still healthy

Latching breach rationale: Once a fatal counter increments (e.g. link_downed=1), the underlying physical event has happened. The fact that no further increments occur in the next poll does not mean the issue is resolved — only that no more events are accumulating right now. Recovery therefore requires explicit remediation (admin counter reset or host reboot), not merely the absence of new errors. This is consistent with the Section 6.4 admin-reset timeline.
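
A minimal Go sketch of this latching delta evaluation follows, replaying the Section 6.4 timeline for link_downed; the snapshot, breachState, and outcome types are illustrative stand-ins, not the monitor's real data structures.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the persisted snapshot and breach flag.
type snapshot struct {
	Value     uint64
	Timestamp time.Time
}

type breachState struct {
	Breached bool
}

type outcome string

const (
	noEvent       outcome = "no event"
	emitUnhealthy outcome = "unhealthy event"
	emitRecovery  outcome = "recovery event"
)

// evaluateDelta applies the reset-detection and latching-breach steps above
// to a single delta-type counter (e.g., link_downed with threshold 0).
func evaluateDelta(current uint64, now time.Time, snap *snapshot, st *breachState, threshold uint64) outcome {
	// Reset detection: sysfs counters are monotonic between resets.
	if current < snap.Value {
		*snap = snapshot{Value: current, Timestamp: now} // post-reset baseline
		if st.Breached {
			st.Breached = false
			return emitRecovery // previously breached counter cleared by the reset
		}
		return noEvent
	}
	// Latched breach: stay silent until a reset (or reboot) clears it.
	if st.Breached {
		return noEvent
	}
	// Normal delta evaluation against the previous poll's snapshot.
	delta := current - snap.Value
	*snap = snapshot{Value: current, Timestamp: now}
	if delta > threshold {
		st.Breached = true
		return emitUnhealthy
	}
	return noEvent
}

func main() {
	snap := &snapshot{Value: 0, Timestamp: time.Now()}
	st := &breachState{}
	// Replay the Section 6.4 timeline: 0, 1 (breach), 1 (latched), 0 (admin reset), 0.
	for _, v := range []uint64{0, 1, 1, 0, 0} {
		fmt.Printf("link_downed=%d -> %s\n", v, evaluateDelta(v, time.Now(), snap, st, 0))
	}
}
```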

6.4 Admin Counter Reset: Recovery Event Scenario

The following timeline illustrates why persistent breach tracking and recovery events are required:

Timeline: Admin Counter Reset Recovery
T=0s Poll: link_downed = 0 (delta=0, no breach, breached=false)
T=5s Poll: link_downed = 1 (delta=1, BREACH → emit FATAL event, breached=true)
T=10s Poll: link_downed = 1 (delta=0, still breached, no event)
T=15s --- CSP admin resets counters (perfquery -x) ---
T=20s Poll: link_downed = 0 (current < previous → reset detected)
delta = 0 (current value), threshold NOT breached
breached was true → transition to healthy
→ Emit RECOVERY event (IsHealthy=true, IsFatal=false)
→ Set breached=false in persistent state
T=25s Poll: link_downed = 0 (delta=0, not breached, no event)

Without breach tracking: After the admin reset at T=15s, the monitor would see delta=0, emit nothing, and the node would remain stuck in an unhealthy state on the platform indefinitely — even though the admin fixed the issue.

With pod restart between T=15s and T=20s: Without persistent state, the monitor loses all knowledge that link_downed was previously breached. The new pod starts fresh, sees link_downed=0, and never emits a recovery event. The persistent state file ensures the breached=true flag survives pod restarts.

6.5 Boot ID Handling

On host reboot, the node may come back with entirely different hardware (the CSP may have replaced NICs during maintenance). All kernel-maintained sysfs counters reset to zero, port states are re-established from scratch, and the device set may have changed. All persisted state from the previous boot is stale and must be discarded. The monitor must then emit healthy baseline events for all ports and counters to clear any stale FATAL conditions on the platform from the previous boot.

Algorithm:

  1. On startup, read current boot ID from /proc/sys/kernel/random/boot_id
  2. Compare to the boot ID stored in the persistent state file
  3. If boot IDs differ (host rebooted):
    • Clear ALL persisted state: counter snapshots, breach flags, port states, known devices
    • Update the stored boot ID and save the empty state
    • On the first poll cycle after reboot, emit baseline events:
      • State checks: For every port that is currently ACTIVE/LinkUp, emit a healthy event (IsHealthy=true). This clears any stale FATAL port conditions on the platform from the previous boot. Ports that are currently unhealthy (e.g., DOWN, Disabled) emit fatal/non-fatal events as usual — the node may have come back with a hardware issue.
      • Counter checks: Emit a healthy event (IsHealthy=true) for every configured counter. Since counters reset to 0 on reboot and there is no previous value to compute a delta from, all counters are below threshold on the first poll. This clears any stale counter breach conditions on the platform. The first poll also establishes the counter baseline:
        • Delta counters evaluate against the new baseline starting from the second poll (one polling interval later) — any increment above threshold triggers an unhealthy event.
        • Velocity counters wait for their full velocityUnit window (1s / 1m / 1h) to elapse against the new baseline before evaluating; they do not extrapolate from a partial sample.
    • Rationale: the node is effectively a fresh machine after reboot — NICs may have been replaced, firmware updated, cables reseated. The platform must be told that all previously-reported conditions are resolved unless new issues are detected on this boot.
  4. If boot IDs match (pod restart, same host boot):
    • Restore all persisted state (counter snapshots, breach flags, port states, known devices)
    • Resume normal boundary-crossing detection with full context

Consistency with sibling monitors: This boot ID mechanism matches the pattern used by the GPU health monitor (--state-file with boot ID) and the syslog health monitor (state.json with boot_id and journal cursors).
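
For illustration, a minimal sketch of the boot ID comparison; the abbreviated MonitorState and the reconcileBootID name are illustrative, not the monitor's actual API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Abbreviated from Section 6.6; remaining fields elided.
type MonitorState struct {
	BootID string
	// counter snapshots, breach flags, port states, known devices ...
}

func currentBootID() (string, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

// reconcileBootID clears all persisted state when the host has rebooted and
// reports whether healthy baseline events must be emitted on the first poll.
func reconcileBootID(state *MonitorState) (rebootDetected bool, err error) {
	bootID, err := currentBootID()
	if err != nil {
		return false, err
	}
	if state.BootID != bootID {
		// Host rebooted: everything persisted from the previous boot is stale.
		*state = MonitorState{BootID: bootID}
		return true, nil
	}
	// Pod restart on the same host boot: keep the persisted state.
	return false, nil
}

func main() {
	state := &MonitorState{BootID: "boot-id-from-state-file"}
	rebooted, err := reconcileBootID(state)
	if err != nil {
		fmt.Println("cannot read boot id:", err)
		return
	}
	fmt.Println("reboot detected:", rebooted)
}
```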

6.6 Persistent State File

The monitor persists its operational state to a JSON file on a hostPath-backed volume, enabling it to survive pod restarts without losing counter context.

State File Path: /var/run/nic_health_monitor/state.json

Kubernetes Volume Mount:

```yaml
volumes:
  - name: nic-state-vol
    hostPath:
      path: /var/run/nic_health_monitor
      type: DirectoryOrCreate

volumeMounts:
  - name: nic-state-vol
    mountPath: /var/run/nic_health_monitor
```

State File Structure:

```go
// MonitorState is the persistent state written to disk as JSON.
// This single state file is shared by both state checks and counter checks.
type MonitorState struct {
	Version int    `json:"version"`
	BootID  string `json:"boot_id"`

	// Counter detection state
	CounterSnapshots map[string]CounterSnapshot   `json:"counter_snapshots"`
	BreachFlags      map[string]CounterBreachFlag `json:"breach_flags"`

	// State detection state (port state and device presence)
	PortStates   map[string]PortStateSnapshot `json:"port_states"`
	KnownDevices []string                     `json:"known_devices"`
}

// CounterSnapshot stores the value and wall-clock timestamp of a counter
// reading. It plays two distinct roles depending on the configured
// thresholdType:
//   - For "delta" thresholds, the snapshot is updated every poll. Δ is
//     just current_value − snapshot.value over the polling interval.
//   - For "velocity" thresholds, the snapshot is held for the full
//     velocityUnit window (1s / 1m / 1h). Evaluation only happens once
//     elapsed ≥ window, and the rate is computed as delta / elapsed in
//     the configured unit. After evaluation the snapshot is advanced
//     to the current reading so the next window starts fresh. This
//     avoids extrapolating a 1s sample into an hourly rate.
//
// Because the timestamp is persisted, velocity windows survive pod
// restarts: the new pod resumes from the persisted (value, timestamp)
// instead of restarting the window from zero.
type CounterSnapshot struct {
	Value     uint64    `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

// CounterBreachFlag tracks whether a counter has an active threshold breach.
// This is needed because the breach state cannot be derived from the counter
// value alone — it depends on the delta at the time of the original breach,
// not the absolute value. Without this flag, the monitor cannot emit recovery
// events after an admin counter reset.
type CounterBreachFlag struct {
	Breached  bool      `json:"breached"`
	CheckName string    `json:"check_name"`
	IsFatal   bool      `json:"is_fatal"`
	Since     time.Time `json:"since"`
}

// PortStateSnapshot captures the last-known state of a port for the state
// checks. Persisting this enables recovery event emission after pod restart:
// if a port was DOWN (fatal event sent) and an admin fixes the cable while
// the pod is restarting, the new pod can detect the DOWN→ACTIVE transition
// and emit a recovery event (IsHealthy=true). Without this, the platform
// would remain stuck in the FATAL state for that port.
// Also enables device disappearance detection across restarts via KnownDevices.
//
// LinkLayer ("InfiniBand" or "Ethernet") lets each state check filter the
// shared map to its own ports on startup so the IB and Ethernet checks
// don't treat each other's entries as "disappeared" during the seed.
type PortStateSnapshot struct {
	State         string `json:"state"`          // raw sysfs value, e.g., "4: ACTIVE", "1: DOWN"
	PhysicalState string `json:"physical_state"` // raw sysfs value, e.g., "5: LinkUp", "3: Disabled"
	Device        string `json:"device"`         // e.g., "mlx5_0"
	Port          int    `json:"port"`
	LinkLayer     string `json:"link_layer,omitempty"` // "InfiniBand" | "Ethernet"
}
```

Map keys: Counter snapshots and breach flags use the key format <device>:<port>:<counter_name> (e.g., mlx5_0:1:link_downed). Port state snapshots use <device>_<port> (e.g., mlx5_0_1). KnownDevices is a flat list of device names (e.g., ["mlx5_0", "mlx5_1", ...]).

Save triggers: The state file is written after each poll cycle completes (both state and counter checks). Errors during save are logged as warnings but do not halt monitoring.

Load behavior: On startup, the monitor attempts to load the state file. If the file is missing or corrupt, the monitor starts with empty state (equivalent to first boot). A warning is logged.
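
A small sketch of this load/save behavior, assuming the layout above; the write-to-temp-then-rename step is an added assumption for crash safety, not a documented requirement.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

const stateFilePath = "/var/run/nic_health_monitor/state.json"

// Abbreviated from Section 6.6.
type MonitorState struct {
	Version int    `json:"version"`
	BootID  string `json:"boot_id"`
	// remaining fields as in Section 6.6
}

// loadState falls back to an empty state if the file is missing or corrupt.
func loadState() MonitorState {
	var st MonitorState
	raw, err := os.ReadFile(stateFilePath)
	if err == nil {
		err = json.Unmarshal(raw, &st)
	}
	if err != nil {
		log.Printf("warning: starting with empty state: %v", err)
		return MonitorState{Version: 1}
	}
	return st
}

// saveState logs failures as warnings but never halts monitoring.
func saveState(st MonitorState) {
	raw, err := json.Marshal(st)
	if err == nil {
		tmp := stateFilePath + ".tmp"
		if err = os.WriteFile(tmp, raw, 0o644); err == nil {
			err = os.Rename(tmp, stateFilePath)
		}
	}
	if err != nil {
		log.Printf("warning: could not persist state: %v", err)
	}
}

func main() {
	st := loadState()
	saveState(st)
}
```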

6.7 Rationale

  • When a counter resets, the new value represents errors accumulated since the reset
  • This is a conservative approach: we may slightly undercount errors immediately after a reset
  • Alternative (treating reset as zero delta) could miss real errors that occurred during/after reset
  • Admin-initiated resets are a legitimate remediation action — the monitor must recognize them and clear the unhealthy condition by emitting a recovery event
  • Driver reloads are logged separately by the Syslog Health Monitor, providing correlation context
  • Persistent state ensures recovery events survive pod restarts, preventing nodes from being permanently stuck in an unhealthy state
  • Per-counter timestamps enable accurate velocity calculation — the snapshot is held for the configured velocityUnit window (1s / 1m / 1h) and the rate is computed over the real wall-clock elapsed time, not extrapolated from a single poll. This means a 120/hour threshold genuinely observes one hour of data, not 3600 × the per-second rate. Persisting the timestamp lets the window survive pod restarts: the new pod resumes from the persisted snapshot instead of starting a fresh window.

7. Missing Counter Handling

Not all counters are available on all NIC versions or firmware revisions. The monitor must gracefully handle missing counters to ensure portability across different hardware generations (ConnectX-5, ConnectX-6, ConnectX-7, etc.).

7.1 Design Principles

  • Fail-open for missing counters: If a counter file does not exist, skip it silently. Do not emit errors or events (see the sketch after this list).
  • Log at startup only: On monitor initialization, log which counters are available vs. unavailable for debugging purposes. Do not repeatedly log missing counters during polling.
  • Graceful degradation: The monitor should function with whatever subset of counters is available. A node with an older NIC still benefits from the counters that do exist.
  • Configuration flexibility: Allow operators to disable specific counters via configuration if they are known to be unavailable or irrelevant for their environment.
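
A minimal sketch of this fail-open behavior, under the assumption that availability is probed once at startup with a simple stat; probeCounters and counterPath are hypothetical names, not the monitor's actual API.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func counterPath(device string, port int, rel string) string {
	return filepath.Join("/sys/class/infiniband", device, "ports", fmt.Sprint(port), rel)
}

// probeCounters runs once at initialization so availability is logged only once.
func probeCounters(device string, port int, rels []string) map[string]bool {
	available := make(map[string]bool, len(rels))
	for _, rel := range rels {
		_, err := os.Stat(counterPath(device, port, rel))
		available[rel] = err == nil
		if err != nil {
			log.Printf("%s port %d: counter %s not available, skipping", device, port, rel)
		}
	}
	return available
}

func main() {
	rels := []string{"counters/symbol_error", "hw_counters/roce_slow_restart"}
	avail := probeCounters("mlx5_0", 1, rels)
	for _, rel := range rels {
		if !avail[rel] {
			continue // fail-open: no error, no event during polling
		}
		fmt.Println("would poll", rel)
	}
}
```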

7.2 Common Counter Availability by NIC Generation

| Counter | ConnectX-5 | ConnectX-6 | ConnectX-7 |
|---|---|---|---|
| symbol_error | Yes | Yes | Yes |
| link_error_recovery | Yes | Yes | Yes |
| link_downed | Yes | Yes | Yes |
| port_rcv_errors | Yes | Yes | Yes |
| roce_slow_restart | No | Yes | Yes |
| local_ack_timeout_err | Yes | Yes | Yes |

Note: Counter availability may also depend on firmware version and driver configuration. The monitor should always verify counter existence at runtime rather than relying on static assumptions.


8. RDMA vs TCP/IP Counter Domains

Critical Architectural Note: RDMA vs TCP/IP Counter Domains

For RoCE devices, there are TWO separate counter domains:

| Counter Location | Tracks | Example Traffic |
|---|---|---|
| /sys/class/infiniband/<dev>/ports/<port>/counters/ | RDMA traffic only | ib_write_bw, distributed apps |
| /sys/class/net/<iface>/statistics/ | TCP/IP traffic only | ping, ssh, HTTP |

Field-validated observation: Running ping through a RoCE interface does NOT increment InfiniBand counters (port_rcv_data, port_xmit_data stay at 0). The ping goes through the TCP/IP stack and is tracked in Ethernet statistics instead.

Implication for monitoring: To detect RDMA-specific degradation (which affects distributed workloads), you MUST monitor the InfiniBand counters. Ethernet statistics alone will miss RDMA-layer issues like roce_slow_restart errors.

8.1 Counter Domain Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│ RDMA vs TCP/IP COUNTER DOMAINS │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ APPLICATION LAYER │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌─────────────────────────┐ │ ┌─────────────────────────┐ │
│ │ RDMA STACK │ │ │ TCP/IP STACK │ │
│ │ (NCCL, MPI, ib_*) │ │ │ (HTTP, SSH, ping) │ │
│ └───────────┬─────────────┘ │ └───────────┬─────────────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌─────────────────────────┐ │ ┌─────────────────────────┐ │
│ │ InfiniBand Counters │ │ │ Ethernet Statistics │ │
│ │ /sys/class/infiniband/ │ │ │ /sys/class/net/ │ │
│ │ <dev>/ports/<p>/counters│ │ │ <iface>/statistics/ │ │
│ │ │ │ │ │ │
│ │ • symbol_error │ │ │ • rx_bytes │ │
│ │ • port_rcv_errors │ │ │ • tx_bytes │ │
│ │ • roce_slow_restart │ │ │ • rx_errors │ │
│ │ • port_rcv_data │ │ │ • carrier_changes │ │
│ └───────────┬─────────────┘ │ └───────────┬─────────────┘ │
│ │ │ │ │
│ └─────────────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ PHYSICAL NIC HARDWARE │ │
│ │ (ConnectX-6) │ │
│ └─────────────────────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════════════ │
│ │
│ KEY INSIGHT: Monitor InfiniBand counters for RDMA workload health │
│ Ethernet stats won't catch roce_slow_restart! │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

9. Data Structures

9.1 Counter Structures

```go
// CounterSnapshot represents a point-in-time reading of all counters for a port
type CounterSnapshot struct {
	Device    string            `json:"device"`
	Port      int               `json:"port"`
	Timestamp time.Time         `json:"timestamp"`
	Counters  map[string]uint64 `json:"counters"` // counter_name -> value
}

// CounterDelta represents the change between two snapshots
type CounterDelta struct {
	Device      string             `json:"device"`
	Port        int                `json:"port"`
	IntervalSec float64            `json:"interval_sec"`
	Deltas      map[string]uint64  `json:"deltas"` // counter_name -> delta
	Rates       map[string]float64 `json:"rates"`  // counter_name -> rate/sec
}
```
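
As a usage sketch of the structures above (shown as a fragment alongside them), the function below derives a CounterDelta from two consecutive snapshots of the same port; the counter-reset fallback mirrors Section 6 (treat the current value as the delta). The computeDelta name is illustrative.

```go
// computeDelta turns two consecutive per-port snapshots into deltas and rates.
func computeDelta(prev, curr CounterSnapshot) CounterDelta {
	interval := curr.Timestamp.Sub(prev.Timestamp).Seconds()
	d := CounterDelta{
		Device:      curr.Device,
		Port:        curr.Port,
		IntervalSec: interval,
		Deltas:      make(map[string]uint64, len(curr.Counters)),
		Rates:       make(map[string]float64, len(curr.Counters)),
	}
	for name, now := range curr.Counters {
		before, ok := prev.Counters[name]
		delta := now // counter reset or first observation: treat current as delta
		if ok && now >= before {
			delta = now - before // normal monotonic case
		}
		d.Deltas[name] = delta
		if interval > 0 {
			d.Rates[name] = float64(delta) / interval
		}
	}
	return d
}
```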

9.2 Persistent State Structures

The monitor persists operational state to survive pod restarts. See Section 6.6 for the on-disk schema and field-level rationale; the structures defined there (MonitorState, CounterSnapshot, CounterBreachFlag, PortStateSnapshot) are the canonical reference.

Example state file content:

```json
{
  "version": 1,
  "boot_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "counter_snapshots": {
    "mlx5_0:1:link_downed": {"value": 3, "timestamp": "2025-06-15T10:30:00Z"},
    "mlx5_0:1:symbol_error": {"value": 1500000, "timestamp": "2025-06-15T10:30:00Z"}
  },
  "breach_flags": {
    "mlx5_0:1:link_downed": {
      "breached": true,
      "check_name": "InfiniBandStateCheck",
      "is_fatal": true,
      "since": "2025-06-15T10:25:00Z"
    }
  },
  "port_states": {
    "mlx5_0_1": {"state": "1: DOWN", "physical_state": "3: Disabled", "device": "mlx5_0", "port": 1, "link_layer": "InfiniBand"},
    "mlx5_1_1": {"state": "4: ACTIVE", "physical_state": "5: LinkUp", "device": "mlx5_1", "port": 1, "link_layer": "InfiniBand"}
  },
  "known_devices": ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
}
```

9.3 Entity Model

NICs and Ports are modeled as separate entity types to enable precise fault localization:

| Entity Type | Entity Value Format | Example | Use Case |
|---|---|---|---|
| NIC | <device_name> | mlx5_0 | Device-level failures |
| NICPort | <port_number> | 1 | Port-level counter violations |

10. Configuration

The counter monitoring system is fully configurable, allowing operators to:

  • Define which counters to monitor
  • Configure threshold types (delta-based or velocity-based)
  • Set fatal/non-fatal severity levels per counter
  • Override default thresholds for specific environments

10.1 Configuration Schema

```yaml
# NIC Health Monitor - Counter Detection Configuration
counterDetection:
  # Enable/disable counter monitoring
  enabled: true

  # Counter definitions - fully configurable
  # Each counter can specify:
  #   - name: Counter identifier (used in events)
  #   - path: Sysfs path relative to device (supports standard and hw_counters)
  #   - enabled: Enable/disable this counter (default: true)
  #   - isFatal: Whether threshold breach triggers Fatal event (default: false)
  #   - thresholdType: "delta" (absolute change) or "velocity" (rate per time unit)
  #   - threshold: Numeric threshold value
  #   - velocityUnit: For velocity thresholds: "second", "minute", "hour"
  #   - description: Human-readable description for event messages

  counters: [] # See defaults below
```

Polling interval: The polling interval is set globally on the Helm chart (pollingInterval, default 1s) and is not configured per counter. Velocity thresholds are evaluated against a window matching the configured velocityUnit (1s / 1m / 1h), independent of the polling interval — so a fast 1s poll is suitable for every counter type without producing false alerts on long windows.
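
For reference, a plausible Go mapping of this schema; the document does not define the actual configuration types, so the struct names and yaml tags below are assumptions that would suit a standard YAML unmarshaller.

```go
// CounterConfig mirrors one entry of counterDetection.counters (illustrative).
type CounterConfig struct {
	Name          string  `yaml:"name"`
	Path          string  `yaml:"path"`
	Enabled       bool    `yaml:"enabled"`
	IsFatal       bool    `yaml:"isFatal"`
	ThresholdType string  `yaml:"thresholdType"` // "delta" or "velocity"
	Threshold     float64 `yaml:"threshold"`
	VelocityUnit  string  `yaml:"velocityUnit"` // "second", "minute", "hour"
	Description   string  `yaml:"description"`
}

// CounterDetectionConfig mirrors the counterDetection block (illustrative).
type CounterDetectionConfig struct {
	Enabled  bool            `yaml:"enabled"`
	Counters []CounterConfig `yaml:"counters"`
}
```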

10.2 Default Counter Configuration

The following counters are monitored by default. Operators can override any setting or add custom counters.

```yaml
counterDetection:
  enabled: true

  counters:
    #--------------------------------------------------------------------------
    # FATAL COUNTERS (Default: IsFatal=true, RecommendedAction=REPLACE_VM)
    # These counters indicate deterministic workload failure
    #--------------------------------------------------------------------------

    - name: link_downed
      path: counters/link_downed
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment (> 0) is fatal
      description: "Port Training State Machine failed - QP disconnect"

    - name: excessive_buffer_overrun_errors
      path: counters/excessive_buffer_overrun_errors
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "HCA internal buffer overflow - lossless contract violated"

    - name: local_link_integrity_errors
      path: counters/local_link_integrity_errors
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "Physical errors exceed LocalPhyErrors hardware cap"

    - name: rnr_nak_retry_err
      path: hw_counters/rnr_nak_retry_err
      enabled: true
      isFatal: true
      thresholdType: delta
      threshold: 0 # Any increment is fatal
      description: "Receiver Not Ready NAK retry exhausted - connection severed"

    #--------------------------------------------------------------------------
    # NON-FATAL COUNTERS (Default: IsFatal=false, RecommendedAction=NONE)
    # These counters indicate degradation requiring monitoring
    #--------------------------------------------------------------------------

    # Physical Layer (PHY) - two-tier symbol_error monitoring
    - name: symbol_error
      path: counters/symbol_error
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "PHY bit errors before FEC - physical layer degradation"

    - name: symbol_error_fatal
      path: counters/symbol_error
      enabled: true
      isFatal: true
      thresholdType: velocity
      threshold: 120.0
      velocityUnit: hour
      description: "Symbol errors exceed IBTA BER threshold (10E-12) - link outside spec"

    - name: link_error_recovery
      path: counters/link_error_recovery
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 5.0
      velocityUnit: minute
      description: "Link retraining events - micro-flapping"

    # Transport Layer
    - name: port_rcv_errors
      path: counters/port_rcv_errors
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "Malformed packets received"

    - name: out_of_sequence
      path: hw_counters/out_of_sequence
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 100.0
      velocityUnit: second
      description: "Fabric routing issues - out of sequence packets"

    - name: local_ack_timeout_err
      path: hw_counters/local_ack_timeout_err
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 1.0
      velocityUnit: second
      description: "ACK timeout - potential fabric black hole"

    # Congestion Indicators
    - name: port_xmit_discards
      path: counters/port_xmit_discards
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 100.0
      velocityUnit: second
      description: "TX discards due to congestion"

    - name: port_xmit_wait
      path: counters/port_xmit_wait
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10000.0
      velocityUnit: second
      description: "TX wait ticks - congestion backpressure"

    # RoCE-specific
    - name: roce_slow_restart
      path: hw_counters/roce_slow_restart
      enabled: true
      isFatal: false
      thresholdType: velocity
      threshold: 10.0
      velocityUnit: second
      description: "Victim flow oscillation"

    # Interface Level
    - name: carrier_changes
      path: /sys/class/net/{interface}/statistics/carrier_changes
      enabled: true
      isFatal: false
      thresholdType: delta
      threshold: 2 # > 2 changes per interval
      description: "Link instability - carrier state changes"
```

10.3 Custom Counter Example

Operators can add custom counters or override defaults:

```yaml
counterDetection:
  counters:
    # Override: Make symbol_error fatal for strict environments
    - name: symbol_error
      path: counters/symbol_error
      enabled: true
      isFatal: true # Override: make fatal
      thresholdType: velocity
      threshold: 120.0 # IBTA spec: 120/hour
      velocityUnit: hour # Changed from second
      description: "Symbol errors exceed IBTA BER threshold"

    # Custom: Add vendor-specific counter
    - name: custom_vendor_error
      path: hw_counters/vendor_specific_err
      enabled: true
      isFatal: false
      thresholdType: delta
      threshold: 100
      description: "Vendor-specific error counter"

    # Disable: Turn off a default counter
    - name: port_xmit_wait
      enabled: false
```

10.4 Threshold Processing Algorithm

The evaluator uses latching breach semantics: once a counter breaches its threshold, the breach flag stays set until the counter is reset (current < previous) or the host reboots. Polls while a counter is already breached emit nothing; recovery events fire only on counter reset of a previously breached counter.

On startup, before the first poll cycle:

0. Check boot ID (see Section 6.5):
   IF boot ID changed (host rebooted):
     - Clear ALL persisted state (snapshots and breach flags)
     - Set reboot_detected = true
   ELSE (pod restart, same boot):
     - Restore all persisted state
     - Set reboot_detected = false

For each configured counter (key = <device>:<port>:<counter_name>):

1. Read current_value from sysfs and capture wall-clock now.

2. IF reboot_detected AND this is the first poll for the key:
   → Emit HEALTHY baseline event (clears any stale platform condition):
     - IsHealthy = true
     - IsFatal = false
     - RecommendedAction = NONE
     - Message = "Counter {name} healthy after reboot on port {device} port {port}"
     - CheckName: state check name if isFatal, else degradation check name
   → Store snapshot = (current_value, now). Continue to next counter.

3. Reset detection (in this priority order):
   a. IF an in-memory lastPollValue exists for this key AND
      current_value < lastPollValue → reset
   b. ELSE IF a persisted snapshot exists AND
      current_value < snapshot.value → reset
   The lastPollValue path catches mid-window resets in steady state;
   the snapshot.value fallback catches resets that happened during pod
   downtime when lastPollValue is gone.
   IF reset:
     IF previously_breached:
       → Emit RECOVERY event (IsHealthy=true, IsFatal=false,
         RecommendedAction=NONE, CheckName matching the original breach)
       → Clear breach flag
     Update snapshot to (current_value, now). Continue to next counter.

4. IF previously_breached:
   → No event (latched). Skip evaluation entirely.

5. Evaluate based on thresholdType:
   IF thresholdType == "delta":
     delta = current_value − snapshot.value
     breach = (delta > threshold)
     Update snapshot to (current_value, now) every poll.
   IF thresholdType == "velocity":
     elapsed = now − snapshot.timestamp
     window = velocityUnit (1s | 1m | 1h)
     IF elapsed < window:
       → Skip evaluation; leave snapshot untouched. The window is
         not yet full. Continue to next counter.
     delta = current_value − snapshot.value
     rate = delta / elapsed expressed in velocityUnit
     breach = (rate > threshold)
     Update snapshot to (current_value, now). The next window starts
     from this reading.

6. IF breach (and we got here, so we were not previously breached):
   → Emit UNHEALTHY event:
     - IsHealthy = false
     - IsFatal = counter.isFatal
     - RecommendedAction = REPLACE_VM if counter.isFatal, else NONE
     - Message = "{port}: {name} - {description} (value=..., delta=..., rate=...)"
     - CheckName:
       isFatal=true  → state check name (InfiniBandStateCheck / EthernetStateCheck)
       isFatal=false → degradation check name
     - ComponentClass = "NIC"
   → Set breached = true in persistent breach flags.
   IF NOT breach: no event.

7. Record current_value as lastPollValue (in-memory) for the next poll.

After all counters evaluated:

8. Save persistent state to disk only if any snapshot or breach flag
   actually changed (to avoid unnecessary writes every poll).
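
The following Go sketch condenses steps 3-6 for a single counter reading. It is a simplification, not the monitor's actual implementation: reset detection uses only the persisted snapshot (the in-memory lastPollValue fast path is omitted), and persistence and event construction are stubbed. The names CounterConfig, CounterState, and evaluate are illustrative.

```go
// Sketch of the latching delta/velocity evaluation for one counter poll.
package main

import (
	"fmt"
	"time"
)

type CounterConfig struct {
	Name          string
	IsFatal       bool
	ThresholdType string        // "delta" or "velocity"
	Threshold     float64
	VelocityUnit  time.Duration // 1s, 1m, or 1h window
}

type Snapshot struct {
	Value     uint64
	Timestamp time.Time
}

type CounterState struct {
	Snapshot Snapshot
	Breached bool
}

// evaluate applies the latching threshold check for one reading and returns
// a human-readable event message, or "" if no event is emitted.
func evaluate(cfg CounterConfig, st *CounterState, current uint64, now time.Time) string {
	// Reset detection: counter went backwards (admin reset or counter clear).
	if current < st.Snapshot.Value {
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
		if st.Breached {
			st.Breached = false
			return fmt.Sprintf("RECOVERY: %s reset, condition cleared", cfg.Name)
		}
		return ""
	}

	// Latched: already breached, no further events until reset.
	if st.Breached {
		return ""
	}

	delta := float64(current - st.Snapshot.Value)
	breach := false

	switch cfg.ThresholdType {
	case "delta":
		breach = delta > cfg.Threshold
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
	case "velocity":
		elapsed := now.Sub(st.Snapshot.Timestamp)
		if elapsed < cfg.VelocityUnit {
			return "" // window not full yet; keep the snapshot untouched
		}
		rate := delta / (float64(elapsed) / float64(cfg.VelocityUnit))
		breach = rate > cfg.Threshold
		st.Snapshot = Snapshot{Value: current, Timestamp: now}
	}

	if breach {
		st.Breached = true
		return fmt.Sprintf("UNHEALTHY: %s breached (fatal=%v)", cfg.Name, cfg.IsFatal)
	}
	return ""
}

func main() {
	cfg := CounterConfig{Name: "symbol_error", ThresholdType: "velocity",
		Threshold: 10.0, VelocityUnit: time.Second}
	st := &CounterState{Snapshot: Snapshot{Value: 0, Timestamp: time.Now().Add(-2 * time.Second)}}

	// 40 new errors over ~2 seconds → ~20 errors/sec, above the 10/sec threshold.
	fmt.Println(evaluate(cfg, st, 40, time.Now()))
}
```

Leaving the snapshot untouched while a velocity window fills is what keeps the rate deterministic: every rate is computed over at least one full velocityUnit of wall-clock time.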

10.5 Configuration Validation

The monitor validates configuration at startup:

| Validation | Requirement | Action on Failure |
| --- | --- | --- |
| Counter path exists | Path must be readable in sysfs | Log warning, skip counter |
| Threshold is non-negative | threshold >= 0 | Reject configuration |
| velocityUnit valid | Must be second, minute, or hour | Reject configuration |
| thresholdType valid | Must be delta or velocity | Reject configuration |
| Unique counter names | No duplicate name fields | Reject configuration |
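
A minimal sketch of how these checks might be applied at startup. The CounterConfig fields mirror the YAML keys; validate, the sysfs root argument, and the error messages are illustrative assumptions rather than the monitor's actual code.

```go
// Sketch: startup validation of counter configuration entries.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type CounterConfig struct {
	Name          string
	Path          string
	ThresholdType string
	Threshold     float64
	VelocityUnit  string
}

// validate rejects the whole configuration on structural errors, but only
// skips (with a warning) counters whose sysfs path is not readable.
func validate(root string, counters []CounterConfig) ([]CounterConfig, error) {
	seen := map[string]bool{}
	var accepted []CounterConfig
	for _, c := range counters {
		if seen[c.Name] {
			return nil, fmt.Errorf("duplicate counter name %q", c.Name)
		}
		seen[c.Name] = true
		if c.Threshold < 0 {
			return nil, fmt.Errorf("%s: threshold must be >= 0", c.Name)
		}
		if c.ThresholdType != "delta" && c.ThresholdType != "velocity" {
			return nil, fmt.Errorf("%s: thresholdType must be delta or velocity", c.Name)
		}
		if c.ThresholdType == "velocity" &&
			c.VelocityUnit != "second" && c.VelocityUnit != "minute" && c.VelocityUnit != "hour" {
			return nil, fmt.Errorf("%s: velocityUnit must be second, minute, or hour", c.Name)
		}
		if _, err := os.Stat(filepath.Join(root, c.Path)); err != nil {
			fmt.Fprintf(os.Stderr, "warning: skipping %s: %v\n", c.Name, err)
			continue
		}
		accepted = append(accepted, c)
	}
	return accepted, nil
}

func main() {
	counters := []CounterConfig{
		{Name: "symbol_error", Path: "counters/symbol_error",
			ThresholdType: "velocity", Threshold: 10.0, VelocityUnit: "second"},
	}
	ok, err := validate("/sys/class/infiniband/mlx5_0/ports/1", counters)
	fmt.Println(len(ok), err)
}
```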

11. Event Management

11.1 Event Construction

Example Event Fields (Fatal - link_downed):

Note: Fatal counter events use the state check name (InfiniBandStateCheck / EthernetStateCheck) so that all fatal signals for a given NIC type consolidate under a single node condition.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandStateCheck |
| ComponentClass | NIC |
| Message | "Port mlx5_0 port 1: link_downed - Port Training State Machine failed - QP disconnect (value=1, delta=1, rate=0.20/sec)" |
| IsFatal | true |
| IsHealthy | false |
| RecommendedAction | REPLACE_VM |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |

Example Event Fields (Non-Fatal - Degradation):

Note: Non-fatal counter events use the degradation check name (InfiniBandDegradationCheck / EthernetDegradationCheck) to keep degradation signals separate from fatal conditions on the node.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandDegradationCheck |
| ComponentClass | NIC |
| Message | "Port mlx5_0 port 1: symbol_error - PHY bit errors before FEC - physical layer degradation (value=100, delta=100, rate=20.00/sec)" |
| IsFatal | false |
| IsHealthy | false |
| RecommendedAction | NONE |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |

Example Event Fields (Recovery - Counter Reset by Admin):

Note: Recovery events are emitted when a previously breached counter resets (its value drops below the previous reading), typically after an administrator clears the counters. The CheckName matches the original breach event so that the recovery clears the correct condition.

| Field | Value |
| --- | --- |
| Agent | nic-health-monitor |
| CheckName | InfiniBandStateCheck |
| ComponentClass | NIC |
| Message | "Counter link_downed recovered on port mlx5_0 port 1" |
| IsFatal | false |
| IsHealthy | true |
| RecommendedAction | NONE |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}, {EntityType: "NICPort", EntityValue: "1"}] |
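
For illustration, the example events above can be represented with a Go struct whose fields mirror the tables. The concrete type in the monitor is presumably a gRPC/protobuf message, so this shape is an assumption.

```go
// Sketch: event shape matching the field tables above, with the fatal
// link_downed example constructed as a literal.
package main

import "fmt"

type Entity struct {
	EntityType  string
	EntityValue string
}

type HealthEvent struct {
	Agent             string
	CheckName         string
	ComponentClass    string
	Message           string
	IsFatal           bool
	IsHealthy         bool
	RecommendedAction string
	EntitiesImpacted  []Entity
}

func main() {
	ev := HealthEvent{
		Agent:             "nic-health-monitor",
		CheckName:         "InfiniBandStateCheck",
		ComponentClass:    "NIC",
		Message:           "Port mlx5_0 port 1: link_downed - Port Training State Machine failed - QP disconnect (value=1, delta=1, rate=0.20/sec)",
		IsFatal:           true,
		IsHealthy:         false,
		RecommendedAction: "REPLACE_VM",
		EntitiesImpacted: []Entity{
			{EntityType: "NIC", EntityValue: "mlx5_0"},
			{EntityType: "NICPort", EntityValue: "1"},
		},
	}
	fmt.Printf("%+v\n", ev)
}
```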

11.2 Event Routing

| IsFatal | IsHealthy | Action | Use Case |
| --- | --- | --- | --- |
| true | false | Immediate gRPC dispatch to Platform Connector | link_downed, buffer overrun |
| false | false | Batched gRPC dispatch (periodic) | Symbol errors, congestion |
| false | true | Immediate gRPC dispatch to Platform Connector | Counter recovery after admin reset |
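
Continuing the HealthEvent sketch above, the routing table reduces to a single switch on IsFatal and IsHealthy. dispatchNow and enqueueBatch are stand-ins for the monitor's immediate and batched gRPC paths to the Platform Connector; the names are assumed.

```go
// Sketch: route an event according to the table above (same package as the
// HealthEvent example; dispatchNow/enqueueBatch are illustrative stubs).
func dispatchNow(ev HealthEvent)  { fmt.Println("immediate dispatch:", ev.CheckName) }
func enqueueBatch(ev HealthEvent) { fmt.Println("queued for batch:", ev.CheckName) }

func route(ev HealthEvent) {
	switch {
	case ev.IsFatal && !ev.IsHealthy:
		dispatchNow(ev) // fatal counter breach: sent immediately
	case !ev.IsFatal && ev.IsHealthy:
		dispatchNow(ev) // recovery after admin reset: sent immediately to clear the condition
	default:
		enqueueBatch(ev) // non-fatal degradation: batched periodic dispatch
	}
}
```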

Appendix A: Quick Reference - Default Counter Thresholds

Note: All counters, thresholds, and severity levels are configurable via the monitor configuration. The values below are the defaults that apply when no custom configuration is provided. See Section 10: Configuration for customization options.

Fatal Counters (Default: IsFatal = true)

| Counter | Path | Default Threshold | Default Action | Configurable |
| --- | --- | --- | --- | --- |
| link_downed | counters/ | Delta > 0 | REPLACE_VM | Yes |
| excessive_buffer_overrun_errors | counters/ | Delta > 0 | REPLACE_VM | Yes |
| local_link_integrity_errors | counters/ | Delta > 0 | REPLACE_VM | Yes |
| rnr_nak_retry_err | hw_counters/ | Delta > 0 | REPLACE_VM | Yes |
| symbol_error_fatal | counters/ | > 120/hour | REPLACE_VM | Yes |

Driver/Firmware Logs

For kernel log pattern details (fatal and non-fatal classifications, regex patterns, and kernel source references), see Syslog Detection & Correlation.

Non-Fatal Counters (Default: IsFatal = false)

| Counter | Path | Default Threshold | Default Action | Configurable |
| --- | --- | --- | --- | --- |
| symbol_error | counters/ | > 10/sec | Monitor | Yes |
| link_error_recovery | counters/ | > 5/min | Monitor | Yes |
| port_rcv_errors | counters/ | > 10/sec | Monitor | Yes |
| port_xmit_discards | counters/ | > 100/sec | Monitor | Yes |
| roce_slow_restart | hw_counters/ | > 10/sec | Monitor | Yes |
| local_ack_timeout_err | hw_counters/ | > 1/sec | Monitor | Yes |
| carrier_changes | interface | > 2/interval | Monitor | Yes |

Note: rnr_nak_retry_err is FATAL by default (see Fatal Counters table above). All counters can have their severity and threshold overridden via configuration.

Design Principle

| Source | IsFatal | Recommended Action | Purpose |
| --- | --- | --- | --- |
| Deterministic Logs | true | REPLACE_VM | Fatal driver/firmware condition |
| Port State Changes (link-state-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Fatal Counters (link-counter-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Diagnostic Logs | false | NONE | Evidence/context for investigation |

Key Insight: Deterministically fatal log events (cmd_exec timeout, etc.) are Fatal (IsFatal=true) with RecommendedAction = REPLACE_VM. Diagnostic logs (insufficient power, high temperature, module absent) are Non-Fatal (IsFatal=false). Port state changes and fatal counter breaches are likewise Fatal (IsFatal=true) with RecommendedAction = REPLACE_VM.


References

PHY & Signal Integrity

  1. PAM4 Error Correction Challenges in 400GbE (EDN)
  2. Determine Which Links Are Experiencing Significant Errors - Sun/Oracle (citing IBTA BER Threshold)

Linux Kernel & Driver

  1. sysfs-class-infiniband (Linux Kernel)

Fabric Diagnostics

  1. ibdiagnet User Manual (NVIDIA)
  2. Black Hole Detection (sFlow)
  3. InfiniBand™ Architecture Specification (IBTA)

Vendor Monitoring Guides

  1. InfiniBand Errors Dashboard - HPE ClusterStor
  2. HPC Clusters Using InfiniBand on IBM Power Systems - IBM Redbooks
  3. NVIDIA UFM InfiniBand Port Counters
  4. NVIDIA DOCA Telemetry Service Guide
  5. NVIDIA UFM Telemetry - InfiniBand Cluster Bring-Up

RDMA Programming References

  1. ibv_modify_qp(3) — Linux Manual Page (rnr_retry, retry_cnt)
  2. NVIDIA RDMA-Aware Programming - Queue Pair Bringup