NIC Health Monitor Design
Overview
The NIC Health Monitor is a comprehensive monitoring solution for detecting and reporting network interface failures in GPU clusters. It addresses the challenge of Grey Failures—subtle degradations where a single degraded link can throttle thousands of GPUs in distributed training workloads.
This documentation is organized into three focused areas; see Quick Navigation below for links to each.
Architecture Overview
Design Principles
“Report Raw, Correlate Centrally” Pattern
The NIC Health Monitor follows NVSentinel’s established architectural pattern:
Design Principle: Health monitors report raw events as-is. All aggregation, correlation, link flap detection, and stabilization-window logic is handled centrally by the Health Events Analyzer using configurable MongoDB aggregation rules.
Each monitor maintains minimal persistent local state on the node (via hostPath-backed state files) to survive pod restarts:
- NIC monitor: port state snapshots, counter snapshots, breach flags, known device lists, and boot ID
- Syslog monitor: journal cursors and boot ID
This local state is used strictly for delta/velocity calculation, health boundary transition detection, recovery event emission, and resumption; it is not used for correlation or pattern detection.
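As a concrete sketch of the persistent local state described above, the hostPath-backed state file might hold something like the following (all type and field names are illustrative assumptions, not the actual NVSentinel structures):

```go
package nicmonitor

import (
	"encoding/json"
	"os"
	"time"
)

// CounterSnapshot records the last observed value of one hardware counter,
// with its own timestamp so velocity (errors/hour) can be computed precisely.
type CounterSnapshot struct {
	Value     uint64    `json:"value"`
	Timestamp time.Time `json:"timestamp"`
}

// NICMonitorState is everything the NIC monitor needs to survive a pod restart.
type NICMonitorState struct {
	BootID       string                                `json:"boot_id"`       // state cleared and baselines re-emitted when this changes
	KnownDevices []string                              `json:"known_devices"` // enables device-disappearance detection across restarts
	PortStates   map[string]string                     `json:"port_states"`   // last seen state per port
	Counters     map[string]map[string]CounterSnapshot `json:"counters"`      // device -> counter name -> snapshot
	BreachFlags  map[string]bool                       `json:"breach_flags"`  // set on threshold breach; enables recovery events
}

// loadState reads the state file, returning an empty state on first run.
func loadState(path string) (NICMonitorState, error) {
	var s NICMonitorState
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return s, nil
	}
	if err != nil {
		return s, err
	}
	return s, json.Unmarshal(data, &s)
}
```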
Binary Severity Model
This monitor uses a binary severity model based on workload impact:
Key Design Principle: The only question that matters is “Will the running workload fail because of this?”
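To make the binary model concrete, here is a minimal sketch of the decision, continuing the illustrative `nicmonitor` package above (the event shape and `classify` helper are assumptions for illustration; the real platform schema may differ):

```go
// HealthEvent is a hypothetical event shape illustrating binary severity.
type HealthEvent struct {
	CheckName string
	Device    string
	IsFatal   bool // true only when the running workload will fail because of this
	IsHealthy bool // true for recovery/baseline events
	Message   string
}

// classify applies the single question that matters: will the running
// workload fail because of this condition? There is no "warning" tier.
func classify(check, device, msg string, workloadWillFail bool) HealthEvent {
	return HealthEvent{CheckName: check, Device: device, IsFatal: workloadWillFail, Message: msg}
}
```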
Detection Methods Summary
Three-Layer Detection Approach
The monitor combines three complementary layers: link state detection, link counter detection, and kernel log (syslog) detection. Each layer is summarized under Event Sources Summary below.
Coverage Map
Key Capabilities
- Deterministic failure thresholds from IBTA specifications, cloud provider heuristics (Azure, AWS), and vendor documentation
- Fully configurable counters and thresholds - operators can define which counters to monitor, set custom thresholds (delta or velocity-based), and configure fatal/non-fatal severity per counter (see the configuration sketch after this list)
- Rate-based degradation detection via centralized Health Events Analyzer rules
- Pre-failure prediction by detecting BER climbing before FEC exhaustion (IBTA 10⁻¹² BER threshold: 120 errors/hour)
- Kernel log monitoring integrated into the existing syslog-health-monitor with NIC-specific check patterns
- Centralized event correlation via Health Events Analyzer MongoDB aggregation pipelines
- Link flap detection via Health Events Analyzer rules (e.g., “link_downed 3+ times in 10 minutes”; see the aggregation sketch after this list)
- Persistent local state shared across state and counter checks — port state snapshots and known device list (state recovery and device disappearance detection across restarts), counter snapshots with per-counter timestamps (precise velocity calculation), breach flags (recovery event emission after admin counter resets), and boot ID (clear all state and emit healthy baselines on host reboot, since NICs may have been replaced)
- Zero-configuration NIC role classification via a two-level decision (NUMA locality + the `nvidia-smi topo -m` matrix, consumed from the NVSentinel metadata collector’s `gpu_metadata.json`). Works across x86 DGX/HGX (A100, H100, L40S), Grace-based superchips (GB200, GH200), and OEM/cloud platforms. See Link State Detection, Section 4.
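As referenced above, a hedged sketch of what a per-counter threshold configuration could look like (field names, defaults, and the `breached` helper are illustrative assumptions; `symbol_error` and `link_downed` are standard InfiniBand port counters):

```go
package nicmonitor

// CounterThreshold is a hypothetical per-counter configuration entry.
type CounterThreshold struct {
	Counter   string  // e.g. "symbol_error" or "link_downed"
	Mode      string  // "delta" (absolute increase) or "velocity" (rate per hour)
	Threshold float64 // breach when the delta or rate exceeds this value
	IsFatal   bool    // severity emitted on breach
}

// Example entries loosely derived from the IBTA guidance cited above
// (errors/hour = BER x line rate in bits/s x 3600 s):
var exampleThresholds = []CounterThreshold{
	{Counter: "symbol_error", Mode: "velocity", Threshold: 120, IsFatal: true},
	{Counter: "link_downed", Mode: "delta", Threshold: 1, IsFatal: false}, // raw input for the central link-flap rule
}

// breached applies one threshold to two snapshots taken `hours` apart.
func breached(t CounterThreshold, prev, curr uint64, hours float64) bool {
	delta := float64(curr - prev)
	if t.Mode == "velocity" {
		return hours > 0 && delta/hours > t.Threshold
	}
	return delta >= t.Threshold
}
```

And a sketch of the kind of centralized link-flap rule the Health Events Analyzer could express as a MongoDB aggregation pipeline (collection and field names are assumptions; the real rules are configurable documents):

```go
package analyzer

import (
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// linkFlapPipeline flags any (node, device) pair that reported link_downed
// three or more times within the last ten minutes.
func linkFlapPipeline() mongo.Pipeline {
	windowStart := time.Now().Add(-10 * time.Minute)
	return mongo.Pipeline{
		{{Key: "$match", Value: bson.D{
			{Key: "check_name", Value: "link_downed"},
			{Key: "timestamp", Value: bson.D{{Key: "$gte", Value: windowStart}}},
		}}},
		{{Key: "$group", Value: bson.D{
			{Key: "_id", Value: bson.D{{Key: "node", Value: "$node"}, {Key: "device", Value: "$device"}}},
			{Key: "flaps", Value: bson.D{{Key: "$sum", Value: 1}}},
		}}},
		{{Key: "$match", Value: bson.D{{Key: "flaps", Value: bson.D{{Key: "$gte", Value: 3}}}}}},
	}
}
```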
Event Sources Summary
State Detection (Fatal)
Counter Detection (Fatal - Defaults)
Note: All counter thresholds and severity levels are configurable. See Link Counter Detection for customization options.
Syslog Detection (Fatal & Non-Fatal)
Design Note: Deterministically fatal events in logs trigger `REPLACE_VM` (emitted as `IsFatal=true`). Diagnostic logs are published as non-fatal events (`IsFatal=false`) for correlation and do not directly trigger automated remediation.
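A minimal sketch of how NIC-specific check patterns might map kernel log lines to severity (the regexes and the fatal/non-fatal split here are illustrative assumptions, not the shipped pattern set):

```go
package sysloghealth

import "regexp"

// nicLogChecks pairs a kernel log pattern with the severity it implies.
var nicLogChecks = []struct {
	Pattern *regexp.Regexp
	IsFatal bool // fatal -> REPLACE_VM recommendation; non-fatal -> published for correlation only
}{
	// A NIC whose firmware health checks fail will take the workload down with it.
	{regexp.MustCompile(`mlx5_core .* device's health compromised`), true},
	// Cable/module events are diagnostic inputs for central correlation.
	{regexp.MustCompile(`mlx5_core .* Port module event`), false},
}
```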
Non-Fatal Event Sources (Degradation Monitoring)
Recovery and Healthy Baseline Events (IsHealthy = true)
The NIC Health Monitor emits healthy events (`IsHealthy=true`) in two scenarios to clear stale unhealthy conditions on the platform: recovery, when a previously breached port or counter returns to a healthy state, and baseline, when all state is cleared after a host reboot.
Design Note: Without recovery events, a node that was marked FATAL on the platform would remain stuck in that state indefinitely after the issue is resolved. Persistent state (see Link Counter Detection, Section 6.6) ensures recovery events survive pod restarts. On host reboot, all state is cleared and baseline events are emitted for every port and counter — healthy for those currently healthy, unhealthy for those currently unhealthy — because the node may have entirely new hardware and the platform needs a complete picture of the current state.
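Continuing the illustrative `nicmonitor` sketch from earlier (the `emit` helper and state shapes are hypothetical), the recovery and reboot paths could look like:

```go
// emit publishes an event to the platform; stubbed here for the sketch.
func emit(e HealthEvent) { /* ... */ }

// checkRecovery emits IsHealthy=true once a previously breached counter is
// back under its threshold (e.g., after an admin counter reset). Because
// BreachFlags is persisted, recovery still fires after a pod restart.
func checkRecovery(state *NICMonitorState, device, counter string, underThreshold bool) {
	key := device + "/" + counter
	if state.BreachFlags[key] && underThreshold {
		emit(HealthEvent{CheckName: counter, Device: device, IsHealthy: true})
		state.BreachFlags[key] = false
	}
}

// handleBoot clears all state when the boot ID changes; the caller then
// re-enumerates every port and counter and emits a full set of baseline
// events (healthy or unhealthy), since the NICs may have been replaced.
func handleBoot(state *NICMonitorState, currentBootID string) (rebooted bool) {
	if state.BootID == currentBootID {
		return false
	}
	*state = NICMonitorState{BootID: currentBootID, BreachFlags: map[string]bool{}}
	return true
}
```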
Supported Hardware
Current Scope: This initial implementation focuses on Mellanox/NVIDIA InfiniBand and RoCE devices only. The architecture is designed to be extensible for future support of additional NIC vendors.
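The extensibility seam might amount to little more than a vendor interface; the sketch below assumes that design and is not an existing NVSentinel API:

```go
package nicmonitor

// NICProvider is a hypothetical abstraction point for adding NIC vendors
// beyond the initial Mellanox/NVIDIA InfiniBand and RoCE implementation.
type NICProvider interface {
	Discover() ([]string, error)                            // enumerate devices this provider manages
	PortState(device string) (string, error)                // current link/port state
	ReadCounters(device string) (map[string]uint64, error)  // raw hardware error counters
}
```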
Future Work
- AWS EFA Support: Device names matching `rdmap\d+s\d+`
- Plain Ethernet: `operstate = down` detection via `/sys/class/net/<interface>/operstate`
- TCPXO Support: TCP Express Offload support
Quick Navigation
References
PHY & Signal Integrity
- PAM4 Error Correction Challenges in 400GbE (EDN)
- Determine Which Links Are Experiencing Significant Errors (Sun/Oracle, citing the IBTA BER threshold)
Linux Kernel & Driver
Fabric Diagnostics
- ibdiagnet User Manual (NVIDIA)
- Black Hole Detection (sFlow)
- InfiniBand™ Architecture Specification (IBTA)
Vendor Monitoring Guides
- InfiniBand Errors Dashboard (HPE ClusterStor)
- HPC Clusters Using InfiniBand on IBM Power Systems (IBM Redbooks)
- NVIDIA UFM InfiniBand Port Counters
- NVIDIA DOCA Telemetry Service Guide
- NVIDIA NCCL Environment Variables (NCCL_IB_RETRY_CNT)