NIC Health Monitor Design


Overview

The NIC Health Monitor is a comprehensive monitoring solution for detecting and reporting network interface failures in GPU clusters. It addresses the challenge of Grey Failures—subtle degradations where a single degraded link can throttle thousands of GPUs in distributed training workloads.

This documentation is organized into three focused areas:

| Document | Focus | Key Capabilities |
|---|---|---|
| Link State Detection | UP/DOWN monitoring, device disappearance | Binary state changes, NIC role classification (compute/storage/management), uncabled port anomalies |
| Link Counter Detection | Error rate monitoring, threshold violations | BER tracking, FEC exhaustion prediction, congestion detection |
| Syslog Detection & Correlation | Kernel log monitoring, repeat failures | Driver/firmware errors, correlated failure patterns |

Architecture Overview

┌────────────────────────────────────────────────────────────────────────────────┐
│ NVSentinel NIC MONITORING ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ PER-NODE DAEMONSETS (Raw Event Reporters) │ │
│ ├──────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ NIC HEALTH MONITOR │ │ SYSLOG HEALTH MONITOR │ │ │
│ │ │ ══════════════════ │ │ ════════════════════ │ │ │
│ │ │ │ │ │ │ │
│ │ │ DATA SOURCES: │ │ DATA SOURCE: │ │ │
│ │ │ • /sys/class/infiniband/ │ │ • journald / /var/log/journal │ │ │
│ │ │ • /sys/class/net/ │ │ │ │ │
│ │ │ │ │ CHECK: │ │ │
│ │ │ CHECKS: │ │ • SysLogsNICDriverError │ │ │
│ │ │ • InfiniBandStateCheck │ │ (mlx5_core cmd_exec timeout, │ │ │
│ │ │ • InfiniBandDegradationChk │ │ health poll failed, │ │ │
│ │ │ • EthernetStateCheck │ │ unrecoverable, PCIe fatal) │ │ │
│ │ │ • EthernetDegradationCheck │ │ │ │ │
│ │ │ │ │ BEHAVIOR: │ │ │
│ │ │ BEHAVIOR: │ │ • Reports RAW events as-is │ │ │
│ │ │ • Reports RAW events as-is │ │ • Persistent local state │ │ │
│ │ │ • Persistent local state │ │ (journal cursors, boot ID) │ │ │
│ │ │ (port states, counters, │ │ • Correlation centralized │ │ │
│ │ │ boot ID, breach flags, │ │ │ │ │
│ │ │ known devices) │ │ │ │ │
│ │ │ • Correlation centralized │ │ │ │ │
│ │ └─────────────┬───────────────┘ └───────────────┬─────────────────┘ │ │
│ │ │ │ │ │
│ └────────────────┼──────────────────────────────────┼─────────────────────┘ │
│ │ │ │
│ └────────────────┬─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ PLATFORM CONNECTOR │ │
│ │ ══════════════════ │ │
│ │ • Receives raw events │ │
│ │ • Persists to MongoDB │ │
│ │ • Triggers downstream │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CENTRALIZED DEPLOYMENT (Correlation & Intelligence) │ │
│ ├──────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ HEALTH EVENTS ANALYZER │ │ │
│ │ │ ══════════════════════ │ │ │
│ │ │ │ │ │
│ │ │ OPTIONAL CORRELATION (MongoDB Aggregation Pipelines): │ │ │
│ │ │ │ │ │
│ │ │ • Correlate kernel log warnings with port state events │ │ │
│ │ │ • Provide diagnostic context for operator investigation │ │ │
│ │ │ │ │ │
│ │ │ NOTE: Deterministic kernel logs (cmd_exec timeout, etc.) = Fatal │ │ │
│ │ │ Diagnostic logs = Non-Fatal │ │ │
│ │ │ │ │ │
│ │ │ OUTPUT: │ │ │
│ │ │ • Deterministic failure detection and remediation triggering │ │ │
│ │ │ • Diagnostic correlation for operator investigation │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘

Design Principles

“Report Raw, Correlate Centrally” Pattern

The NIC Health Monitor follows NVSentinel’s established architectural pattern:

| Component | Function | Data Source | Event Flow |
|---|---|---|---|
| NIC Health Monitor | Detect UP/DOWN transitions and counter threshold violations; emit recovery events on admin counter resets; emit healthy baselines after host reboot | sysfs state files, counters | Raw + recovery events → Platform Connector |
| Syslog Health Monitor | Detect driver/firmware events from kernel logs | journald/dmesg | Raw events → Platform Connector |
| Health Events Analyzer | Correlate events, detect patterns, escalate to fatal | MongoDB | Correlated events → Platform Connector |

Design Principle: Health monitors report raw events as-is. All aggregation, correlation, link flap detection, and stabilization window logic is handled centrally by the Health Events Analyzer using configurable MongoDB aggregation rules. Each monitor maintains minimal persistent local state on the node (via hostPath-backed state files) to survive pod restarts — port state snapshots, counter snapshots, breach flags, known device lists, and boot ID for the NIC monitor; journal cursors and boot ID for the syslog monitor. This local state is strictly for delta/velocity calculation, health boundary transition detection, recovery event emission, and resumption; it is not used for correlation or pattern detection.
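The "minimal local state" contract above can be sketched as follows: a hypothetical poll helper that persists only the last counter snapshot and reports raw deltas, leaving all correlation to the analyzer. The file path and field names are illustrative, not NVSentinel's actual schema.

```python
import json
import os
import tempfile

STATE_FILE = "/tmp/nic-monitor-state.json"  # stands in for a hostPath-backed file

def load_state(path=STATE_FILE):
    """Load the persisted counter snapshot; start empty on first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_state(state, path=STATE_FILE):
    """Write atomically so a pod restart mid-write cannot corrupt the snapshot."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def poll_counter(name, current_value, state):
    """Return a raw event dict if the counter moved; correlation happens centrally."""
    previous = state.get(name, 0)
    delta = current_value - previous
    state[name] = current_value
    if delta > 0:
        return {"check": name, "delta": delta, "value": current_value}
    return None
```

Note that the helper emits an event for any positive delta; deciding whether a run of such events constitutes a link flap is deliberately left to the centralized rules.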

Binary Severity Model

This monitor uses a binary severity model based on workload impact:

| Severity | Meaning | Example |
|---|---|---|
| Fatal | Workload WILL fail or HAS failed | NIC DOWN, cmd_exec timeout, unrecoverable hardware error |
| Non-Fatal | Degradation detected, workload continues | Symbol errors, congestion, insufficient power, NETDEV WATCHDOG |

Key Design Principle: The only question that matters is “Will the running workload fail because of this?”
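In code, the binary model reduces to a single membership test. The condition names below are hypothetical stand-ins for the monitor's real check identifiers:

```python
# Hypothetical condition identifiers; the monitor's actual check names may differ.
FATAL_CONDITIONS = {"nic_down", "cmd_exec_timeout", "unrecoverable_hw_error"}

def classify(condition: str) -> str:
    """Binary model: will the running workload fail because of this?"""
    return "Fatal" if condition in FATAL_CONDITIONS else "Non-Fatal"
```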


Detection Methods Summary

Three-Layer Detection Approach

┌────────────────────────────────────────────────────────────────────────────┐
│ THREE-LAYER NIC MONITORING │
├────────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: LINK STATE DETECTION │
│ ═══════════════════════════════ │
│ • Polling interval: 1 second │
│ • Data source: /sys/class/infiniband/<dev>/ports/<port>/state, phys_state │
│ • Detects: Hard DOWN, device disappearance, uncabled port anomalies │
│ • Documentation: link-state-detection.md │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 2: LINK COUNTER DETECTION │
│ ═══════════════════════════════ │
│ • Polling interval: 1 second │
│ • Velocity thresholds gate themselves on the configured velocityUnit │
│ (1s / 1m / 1h), so a fast poll is safe for every counter type. │
│ • Data source: /sys/class/infiniband/<dev>/ports/<port>/counters/ │
│ • Detects: Symbol errors, link flaps, buffer overruns, transport errors │
│ • Documentation: link-counter-detection.md │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ LAYER 3: SYSLOG DETECTION │
│ ═══════════════════════════ │
│ • Trigger: Event-driven (journald watch) │
│ • Data source: Kernel ring buffer (dmesg), journald │
│ • Detects: cmd_exec timeout, health poll failed, PCIe errors │
│ • Documentation: syslog-detection-correlation.md │
│ │
└────────────────────────────────────────────────────────────────────────────┘
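Layer 1's sysfs reads are straightforward to sketch. Per the kernel's sysfs-class-infiniband ABI, the `state` attribute holds a value like `4: ACTIVE`; the helper below parses that format (the function names are illustrative):

```python
from pathlib import Path

IB_ROOT = Path("/sys/class/infiniband")

def parse_state(raw: str) -> tuple[int, str]:
    """sysfs exposes values like '4: ACTIVE'; split into numeric code and label."""
    code, _, label = raw.strip().partition(": ")
    return int(code), label

def port_is_up(dev: str, port: int, root: Path = IB_ROOT) -> bool:
    """Read /sys/class/infiniband/<dev>/ports/<port>/state and test for ACTIVE."""
    state_file = root / dev / "ports" / str(port) / "state"
    code, label = parse_state(state_file.read_text())
    return label == "ACTIVE"
```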

Coverage Map

┌─────────────────────────────────────────────────────────────────┐
│ Coverage Map │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PACKET PATH (Data Moving Through NIC): │
│ ├── Link State Detection: Binary UP/DOWN │
│ └── Link Counter Detection: Error rates, degradation │
│ │
│ DRIVER/FIRMWARE INTERFACE (OS <--> NIC Communication): │
│ └── Syslog Detection: mlx5_core errors, PCIe AER events │
│ │
│ Driver/firmware failures can occur while packet path appears │
│ healthy. All three layers are needed for complete coverage. │
│ │
└─────────────────────────────────────────────────────────────────┘

Key Capabilities

  1. Deterministic failure thresholds from IBTA specifications, cloud provider heuristics (Azure, AWS), and vendor documentation
  2. Fully configurable counters and thresholds - operators can define which counters to monitor, set custom thresholds (delta or velocity-based), and configure fatal/non-fatal severity per counter
  3. Rate-based degradation detection via centralized Health Events Analyzer rules
  4. Pre-failure prediction by detecting BER climbing before FEC exhaustion (IBTA 10^-12 BER threshold: 120 errors/hour)
  5. Kernel log monitoring integrated into the existing syslog-health-monitor with NIC-specific check patterns
  6. Centralized event correlation via Health Events Analyzer MongoDB aggregation pipelines
  7. Link flap detection via Health Events Analyzer rules (e.g., “link_downed 3+ times in 10 minutes”)
  8. Persistent local state shared across state and counter checks — port state snapshots and known device list (state recovery and device disappearance detection across restarts), counter snapshots with per-counter timestamps (precise velocity calculation), breach flags (recovery event emission after admin counter resets), and boot ID (clear all state and emit healthy baselines on host reboot, since NICs may have been replaced)
  9. Zero-configuration NIC role classification via a two-level decision (NUMA locality + nvidia-smi topo -m matrix, consumed from the NVSentinel metadata collector’s gpu_metadata.json). Works across x86 DGX/HGX (A100, H100, L40S), Grace-based superchips (GB200, GH200), and OEM/cloud platforms. See Link State Detection, Section 4.
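The BER-to-errors-per-hour conversion in capability 4 follows from a simple budget: errors/hour = BER × line rate (bits/s) × 3600. The exact hourly figure depends on the assumed line rate; the 120/hour number in this document comes from the cited Sun/Oracle guidance. As one illustration, a QDR link with a 32 Gb/s data rate yields roughly 115 errors/hour at 10^-12 BER:

```python
def error_budget_per_hour(ber: float, line_rate_bps: float) -> float:
    """Errors/hour implied by a BER threshold at a given line rate:
    errors = BER * bits transferred in one hour."""
    return ber * line_rate_bps * 3600

# Illustrative: 4x QDR at a 32 Gb/s data rate (assumption, not the source of
# the document's 120/hour figure) gives ~115 errors/hour at 1e-12 BER.
qdr_budget = error_budget_per_hour(1e-12, 32e9)
```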

Event Sources Summary

State Detection (Fatal)

| Source | Fatal Conditions | Recommended Action |
|---|---|---|
| State Monitor | state=DOWN, phys_state=Disabled, device disappeared, uncabled port anomaly | RecommendedAction_REPLACE_VM |

Counter Detection (Fatal - Defaults)

| Source | Default Fatal Conditions | Recommended Action |
|---|---|---|
| Degradation Monitor | link_downed (Delta > 0), excessive_buffer_overrun_errors (any), local_link_integrity_errors (any), rnr_nak_retry_err (any) | RecommendedAction_REPLACE_VM |

Note: All counter thresholds and severity levels are configurable. See Link Counter Detection for customization options.
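A per-counter configuration of the kind described might look like the sketch below. The schema (field names `mode`, `threshold`, `unit`, `fatal`) is hypothetical, chosen only to show how delta/velocity mode and severity can vary per counter:

```python
# Hypothetical config schema; the real monitor's field names may differ.
COUNTER_CONFIG = {
    "link_downed":     {"mode": "delta",    "threshold": 0,   "fatal": True},
    "symbol_error":    {"mode": "velocity", "threshold": 120, "unit": "1h", "fatal": False},
    "port_rcv_errors": {"mode": "velocity", "threshold": 10,  "unit": "1s", "fatal": False},
}

def evaluate(counter: str, observed: float, cfg=COUNTER_CONFIG):
    """Return (breached, fatal) for an observed delta or velocity value."""
    rule = cfg[counter]
    breached = observed > rule["threshold"]
    return breached, breached and rule["fatal"]
```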

Syslog Detection (Fatal & Non-Fatal)

| Source | Conditions | Severity | Purpose |
|---|---|---|---|
| Log Watcher | mlx5_core.*cmd_exec timeout, health poll failed, unrecoverable | Fatal | Deterministic hardware/driver failure |
| Log Watcher | insufficient power, module absent, ACCESS_REG failed, NETDEV WATCHDOG (TX queue timeout, auto-recoverable) | Non-Fatal | Diagnostic context for correlation |

Design Note: Deterministically fatal events in logs trigger REPLACE_VM (emitted as IsFatal=true). Diagnostic logs are published as non-fatal events (IsFatal=false) for correlation and do not directly trigger automated remediation.
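The fatal/non-fatal split in the log watcher amounts to a small set of regexes. The patterns below are modeled on the message fragments named in this document; actual mlx5_core message formats vary by driver version, so treat them as illustrative:

```python
import re

# Patterns modeled on the messages cited in this document (illustrative,
# not the monitor's actual pattern set; driver strings vary by version).
FATAL_PATTERNS = [
    re.compile(r"mlx5_core .*cmd_exec.*timeout", re.IGNORECASE),
    re.compile(r"mlx5_core .*health poll failed", re.IGNORECASE),
    re.compile(r"mlx5_core .*unrecoverable", re.IGNORECASE),
]

def classify_log_line(line: str) -> bool:
    """True (IsFatal) if the line matches a deterministic failure pattern;
    anything else is published as a non-fatal diagnostic event."""
    return any(p.search(line) for p in FATAL_PATTERNS)
```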

Non-Fatal Event Sources (Degradation Monitoring)

| Source | Non-Fatal Conditions | Action |
|---|---|---|
| Degradation Monitor | symbol_error (elevated rate), link_error_recovery (>5/min), port_rcv_errors (>10/sec) | Monitor for escalation |
| Degradation Monitor | carrier_changes (>2/interval), rx_missed_errors (host bottleneck) | Monitor for escalation |
| Degradation Monitor | roce_slow_restart (>10/sec), local_ack_timeout_err (>1/sec) | Monitor for escalation |
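The mixed units above (per-second, per-minute) work with a 1-second poll because of the velocityUnit gating described in Layer 2: each raw per-poll delta is scaled to the unit its threshold is expressed in. A minimal sketch:

```python
# Scale a per-poll delta to the rate unit the threshold is expressed in,
# so a 1-second poll can evaluate 1s / 1m / 1h thresholds uniformly.
UNIT_SECONDS = {"1s": 1, "1m": 60, "1h": 3600}

def velocity(delta: float, elapsed_s: float, unit: str) -> float:
    """Observed rate of a counter, expressed in the configured velocityUnit."""
    return delta / elapsed_s * UNIT_SECONDS[unit]
```

For example, 2 new link_error_recovery events in a 1-second poll project to 120/min, well above the >5/min threshold, while 1 symbol error in the same window projects to 3600/hour.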

Recovery and Healthy Baseline Events (IsHealthy = true)

The NIC Health Monitor emits healthy events (IsHealthy=true) in two scenarios to clear stale unhealthy conditions on the platform:

| Trigger | Source | Behavior | Purpose |
|---|---|---|---|
| Admin counter reset | Degradation Monitor | When a previously-breached counter returns below threshold (typically after CSP clears counters via perfquery -x), emit recovery event per counter | Clears stale counter breach conditions |
| Port recovery | State Monitor | When a previously-DOWN port transitions to ACTIVE/LinkUp, emit recovery event per port | Clears stale port FATAL conditions |
| Host reboot | Both | On boot ID change, clear all persisted state and emit baseline events: healthy events for all ACTIVE/LinkUp ports and all counters (at 0 after reboot); fatal/non-fatal events for any ports that are unhealthy post-reboot. Counter unhealthy events follow on the second poll if errors are accumulating. | Clears all stale conditions from previous boot — NICs may have been replaced during maintenance |

Design Note: Without recovery events, a node that was marked FATAL on the platform would remain stuck in that state indefinitely after the issue is resolved. Persistent state (see Link Counter Detection, Section 6.6) ensures recovery events survive pod restarts. On host reboot, all state is cleared and baseline events are emitted for every port and counter — healthy for those currently healthy, unhealthy for those currently unhealthy — because the node may have entirely new hardware and the platform needs a complete picture of the current state.
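Reboot detection itself is cheap: the kernel exposes a UUID at `/proc/sys/kernel/random/boot_id` that changes on every boot, so comparing it against the persisted copy identifies a reboot across pod restarts. A sketch (the state-dict key is illustrative):

```python
from pathlib import Path

BOOT_ID_FILE = Path("/proc/sys/kernel/random/boot_id")

def host_rebooted(state: dict, boot_id_file: Path = BOOT_ID_FILE) -> bool:
    """Compare the kernel's boot_id with the persisted one; a mismatch means the
    host rebooted and all cached NIC state must be discarded. A first run
    (no persisted boot_id) is not treated as a reboot."""
    current = boot_id_file.read_text().strip()
    rebooted = state.get("boot_id") not in (None, current)
    state["boot_id"] = current
    return rebooted
```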


Supported Hardware

Current Scope: This initial implementation focuses on Mellanox/NVIDIA InfiniBand and RoCE devices only. The architecture is designed to be extensible for future support of additional NIC vendors.

| Vendor | Detection | Counter Monitoring | Fatal Detection |
|---|---|---|---|
| Mellanox (IB/RoCE) | Device name mlx5_* or driver symlink | Yes - fatal counters | Counters + State + Kernel Logs |

Future Work

  • AWS EFA Support: Device names matching rdmap\d+s\d+
  • Plain Ethernet: operstate = down detection via /sys/class/net/<interface>/operstate
  • TCPXO Support: TCP Express Offload support

Quick Navigation

| Topic | Document | Section |
|---|---|---|
| UP/DOWN state monitoring | Link State Detection | Section 3 |
| Device disappearance / PCI checks | Link State Detection | Section 7 |
| Management NIC exclusion (NUMA) | Link State Detection | Section 4.1 |
| NIC role classification (topo-based) | Link State Detection | Section 4.2 |
| Metadata collector requirements | Link State Detection | Section 12.2 |
| Uncabled port detection | Link State Detection | Section 4.3 |
| SR-IOV VF handling | Link State Detection | Section 8 |
| BER/FEC theory | Link Counter Detection | Section 2 |
| Counter thresholds | Link Counter Detection | Section 4 |
| Counter reset handling | Link Counter Detection | Section 6 |
| Admin reset recovery events | Link Counter Detection | Section 6.4 |
| Persistent state file | Link Counter Detection | Section 6.6 |
| Boot ID handling | Link Counter Detection | Section 6.5 |
| Driver error patterns | Syslog Detection & Correlation | Section 5 |
| Repeat failure detection | Syslog Detection & Correlation | Section 7 |
| Health Events Analyzer rules | Syslog Detection & Correlation | Appendix B |

References

PHY & Signal Integrity

  1. PAM4 Error Correction Challenges in 400GbE (EDN)
  2. Determine Which Links Are Experiencing Significant Errors - Sun/Oracle (citing IBTA BER Threshold)

Linux Kernel & Driver

  1. sysfs-class-infiniband (Linux Kernel)
  2. RHEL8 mlx5_core Stack Overflow (Red Hat)

Fabric Diagnostics

  1. ibdiagnet User Manual (NVIDIA)
  2. Black Hole Detection (sFlow)
  3. InfiniBand™ Architecture Specification (IBTA)

Vendor Monitoring Guides

  1. InfiniBand Errors Dashboard - HPE ClusterStor
  2. HPC Clusters Using InfiniBand on IBM Power Systems - IBM Redbooks
  3. NVIDIA UFM InfiniBand Port Counters
  4. NVIDIA DOCA Telemetry Service Guide
  5. NVIDIA NCCL Environment Variables (NCCL_IB_RETRY_CNT)

GPU System References

  1. DGX A100 User Guide
  2. DGX H100 User Guide
  3. DGX B200 User Guide
  4. GB200 NVL2