NIC Health Monitor: Link State Detection
Table of Contents
- Overview
- Architecture
- State Monitoring Specification
- Management NIC Exclusion, NIC Role Classification, and Uncabled Port Detection
- Device Discovery and Parsing
- State Change and Flap Detection
- Device Disappearance Handling
- SR-IOV Virtual Function Handling
- RoCE State Monitoring
- Supported Hardware
- Data Structures
- Configuration
- Event Management
Related Documents:
- Link Counter Detection - Counter-based degradation monitoring
- Syslog Detection & Correlation - Kernel log monitoring and repeat failure detection
1. Overview
1.1 Problem Statement
Modern GPU clusters suffer from Grey Failures (subtle degradations) and straggler effects where a single degraded link throttles thousands of GPUs. Simple UP/DOWN polling is the first line of defense for detecting hard failures where the NIC becomes completely unavailable.
1.2 Scope of Link State Detection
This document covers the State Monitoring component of the NIC Health Monitor, which detects:
- Hard UP/DOWN transitions - Link completely lost, no connectivity
- Device disappearance - NIC no longer visible in sysfs (fell off PCIe bus)
- Physical state changes - Port disabled, polling, or in error recovery
- Uncabled port anomaly detection - Card has fewer active ports than its peers (via homogeneity check)
- Management NIC auto-exclusion - NICs on NUMA nodes with no compute GPU are automatically excluded
1.3 Binary Severity Model
This monitor uses a binary severity model based on workload impact:
Key Design Principle: The only question that matters is “Will the running workload fail because of this?“
1.4 State Detection Overview Diagram
2. Architecture
2.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern
The State Monitor follows NVSentinel’s established architectural pattern where:
- Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
- Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
- MongoDB serves as the source of truth for event history and correlation queries
2.2 Component Responsibilities
Local State Persistence: The State Check persists port state snapshots (state and phys_state per port) and the known device list to the shared NIC health monitor state file (hostPath-backed, see Link Counter Detection, Section 6.6). This enables the monitor to (1) emit recovery events (IsHealthy=true) after pod restart when a previously-DOWN port has been fixed, (2) detect device disappearance across pod restarts by comparing the current device list against the persisted known devices, and (3) on host reboot (boot ID change), clear all state and emit healthy baseline events for all currently-healthy ports to clear stale FATAL conditions on the platform — since the node may have had NICs replaced during maintenance (see Link Counter Detection, Section 6.5).
2.3 State Check Data Flow (1s polling interval)
2.4 System Context
3. State Monitoring Specification
3.1 Port States (Full Enumeration)
Port states are defined by the Linux kernel InfiniBand sysfs interface. Reference: Linux Kernel sysfs-class-infiniband ABI
3.2 State Transitions
Logical State Flow: DOWN (1) → INIT (2) → ARMED (3) → ACTIVE (4)
- DOWN: No connectivity (FATAL)
- INIT: Initializing — normal transient state during startup. Every port passes through INIT during boot and Subnet Manager configuration. For InfiniBand ports, classified as Non-Fatal (IsFatal=false) because INIT can persist while waiting for SM configuration. For Ethernet/RoCE ports, INIT is a brief sub-second transient during link training and is not reported (logged at DEBUG level only). If an IB port remains stuck in INIT, it won’t satisfy the ACTIVE/LinkUp condition, causing the card’s active port count to fall below its peers, which is caught as a Fatal condition by the card homogeneity check (see Section 4.3).
- ARMED: Waiting for Subnet Manager — same rationale as INIT. For InfiniBand ports, classified as Non-Fatal (IsFatal=false). For Ethernet/RoCE ports, this state is rare/transient and is not reported. Prolonged ARMED state on IB is caught by the card homogeneity check.
- ACTIVE: Normal operational state (HEALTHY)
Physical State Substates: Sleep (1), Polling (2), Disabled (3), Training (4), LinkUp (5), LinkErrorRecovery (6)
- Polling (2): Transient state during link training. Every port passes through Polling when establishing a connection. Classified as Non-Fatal (IsFatal=false). If a port remains in Polling, it won’t count as active in the card homogeneity check, so the card’s active port count will fall below the peer mode and be caught as a Fatal anomaly (see Section 4.3).
- LinkErrorRecovery (6): Active error recovery in progress. Classified as Non-Fatal (IsFatal=false) because the HCA firmware is actively retrying. If recovery fails and the port remains unhealthy, the card homogeneity check (Section 4.3) escalates to Fatal by detecting fewer active ports than peers.
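The severity rules above can be sketched as a small classifier. This is an illustrative reduction, not the monitor's actual API; the function and verdict names are hypothetical:

```go
package main

import "fmt"

// Verdict is what the monitor would do for one observed logical state
// (names are illustrative, not the real event model).
type Verdict string

const (
	Fatal      Verdict = "FATAL"
	NonFatal   Verdict = "NON-FATAL"
	Healthy    Verdict = "HEALTHY"
	Unreported Verdict = "DEBUG-ONLY"
)

// classifyState sketches Section 3.2: DOWN is fatal on any link layer;
// INIT and ARMED are non-fatal on InfiniBand (the port may be waiting for
// the Subnet Manager) but sub-second transients on Ethernet/RoCE, so they
// are logged rather than reported.
func classifyState(state, linkLayer string) Verdict {
	switch state {
	case "ACTIVE":
		return Healthy
	case "DOWN":
		return Fatal
	case "INIT", "ARMED":
		if linkLayer == "InfiniBand" {
			return NonFatal
		}
		return Unreported
	}
	return NonFatal
}

func main() {
	fmt.Println(classifyState("DOWN", "InfiniBand")) // FATAL
	fmt.Println(classifyState("INIT", "InfiniBand")) // NON-FATAL
	fmt.Println(classifyState("INIT", "Ethernet"))   // DEBUG-ONLY
}
```

A port stuck in a non-fatal state is not lost: it falls out of the active-port count and is escalated by the homogeneity check described later.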
3.3 Diagnostic Commands
3.4 State-Based Event Generation Algorithm
Port Health Evaluation Steps:
- Read port state from /sys/class/infiniband/<dev>/ports/<port>/state and phys_state
- Load previous port state from the persistent state file (or in-memory if available from a prior poll in this pod’s lifetime)
- Determine health status:
  - If state = ACTIVE AND phys_state = LinkUp → Healthy
  - Otherwise → Unhealthy (the specific state/phys_state combination determines the message)
- Emit event only on health boundary crossing:
  - First poll after host reboot (boot ID changed — state cleared):
    - All persisted state has been discarded (see Link Counter Detection, Section 6.5)
    - Healthy ports (ACTIVE/LinkUp): Emit healthy event (IsHealthy=true) — this clears any stale FATAL conditions on the platform from the previous boot (the node may have had NICs replaced, cables reseated, etc.)
    - Unhealthy ports on anomalous cards: Emit fatal event as usual
    - Unhealthy ports on expected cards: Suppressed (uncabled port, not a failure)
  - First poll with no persisted state (fresh node, corrupt/missing state file):
    - Same behavior as the reboot case above
  - First poll with persisted previous state (pod restart, same boot):
    - Compare current health against the persisted previous state
    - Emit events on boundary crossings as with subsequent polls below (this is the key benefit of persistence — a port that was DOWN before restart and is now ACTIVE triggers a recovery event)
  - Subsequent polls: Only emit when wasHealthy ≠ isHealthy
    - Healthy → Unhealthy: FATAL event with consolidated message (e.g., “state DOWN, phys_state Disabled - no connectivity”)
    - Unhealthy → Healthy: HEALTHY event (e.g., “healthy (ACTIVE, LinkUp)”)
    - Unhealthy → Unhealthy (e.g., DOWN/Disabled → DOWN/Polling): No event — still unhealthy, intermediate transition suppressed
    - Healthy → Healthy: No event — still healthy
- One consolidated event per port per transition:
  - Logical state and physical state are combined into a single message
  - For Ethernet/RoCE, the operstate is also included in the same event
  - EntitiesImpacted includes both NIC and Port entities
  - RecommendedAction = REPLACE_VM for fatal events
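The boundary-crossing rule at the heart of the algorithm can be sketched in a few lines. This is an illustrative reduction; the type and function names (PortState, isHealthy, shouldEmit) are hypothetical, not the monitor's actual API:

```go
package main

import "fmt"

// PortState is a snapshot of one port's sysfs state (illustrative fields).
type PortState struct {
	State     string // e.g. "ACTIVE", "DOWN", "INIT"
	PhysState string // e.g. "LinkUp", "Polling", "Disabled"
}

// isHealthy mirrors the health rule: healthy iff state=ACTIVE and phys_state=LinkUp.
func isHealthy(p PortState) bool {
	return p.State == "ACTIVE" && p.PhysState == "LinkUp"
}

// shouldEmit reports whether a health-boundary event is due, given the
// previous snapshot (nil when no prior state exists — first poll).
func shouldEmit(prev *PortState, cur PortState) bool {
	if prev == nil {
		// First poll: the baseline event logic (reboot vs fresh node) applies.
		return true
	}
	// Otherwise, emit only when the port crosses the healthy/unhealthy boundary.
	return isHealthy(*prev) != isHealthy(cur)
}

func main() {
	down := PortState{"DOWN", "Disabled"}
	polling := PortState{"DOWN", "Polling"}
	up := PortState{"ACTIVE", "LinkUp"}

	fmt.Println(shouldEmit(&down, polling)) // false: unhealthy→unhealthy, suppressed
	fmt.Println(shouldEmit(&down, up))      // true: recovery event
	fmt.Println(shouldEmit(&up, up))        // false: still healthy
}
```

Note how the DOWN/Disabled → DOWN/Polling transition is suppressed: both snapshots are unhealthy, so no boundary is crossed.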
4. Management NIC Exclusion, NIC Role Classification, and Uncabled Port Detection
This section describes three zero-configuration mechanisms that replace the previous gpu_port_config / AtLeastPorts / AtLeastRate approach. These mechanisms require no per-GPU-type configuration and work automatically across DGX, HGX, Grace-based superchips (GB200/GH200), OEM servers, and cloud VMs.
- Section 4.1: NUMA-based management NIC exclusion (exclude NICs on non-GPU NUMA nodes)
- Section 4.2: NIC role classification (topo matrix + link layer + default-route exclusion)
- Section 4.3: Role-based card homogeneity (detect uncabled ports and failures within each role group)
The classification of each NIC uses a three-step decision built from four complementary signals:
- Step 1 — Management gate (NUMA locality, Section 4.1): Is the NIC on a CPU socket that hosts GPUs? If not, exclude it.
- Step 2 — Compute vs Storage (topo matrix + link layer, Section 4.2): For NICs that pass Step 1, consult the nvidia-smi topo -m GPU↔NIC relationship. If the topo matrix shows PCIe proximity (PIX/PXB), classify as Compute. Otherwise, use the NIC’s link layer as a tiebreaker: InfiniBand NICs are Compute fabric; Ethernet NICs are Storage.
- Step 3 — Default route exclusion (Section 4.2): If the NIC carries the host’s default IP route, classify as Management regardless of topo or link layer. This catches management NICs that share a NUMA node with GPUs (e.g., on-prem L40S, GB200). The classifier reads the host’s /proc/net/route (bind-mounted at /nvsentinel/proc/net/route) at startup to resolve the default route interface.
These steps use four complementary signals, each covering platforms where the others fail:
Removing any one signal causes at least one platform to misclassify. Together they cover x86 SXM (DGX/HGX), x86 PCIe (L40S), Grace (GB200/GH200), on-prem datacenter, and OEM/cloud platforms.
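The three-step decision can be sketched as a single function. This is an illustrative condensation; the names (Role, classifyNIC) and parameter shapes are hypothetical, not the monitor's actual types:

```go
package main

import "fmt"

// Role is the three-tier classification (illustrative).
type Role int

const (
	Management Role = iota // excluded from monitoring
	Compute                // compute fabric NIC
	Storage                // storage NIC (monitored in its own group)
)

// classifyNIC sketches the three-step decision from Section 4.
// gpuNUMASet comes from gpu_metadata.json; topoLevels is the NIC's row of the
// topo matrix (one entry per GPU); linkLayer is read from sysfs;
// isDefaultRoute is resolved from /proc/net/route.
func classifyNIC(nicNUMA int, gpuNUMASet map[int]bool, topoLevels []string,
	linkLayer string, isDefaultRoute bool) Role {
	// Step 1 — management gate: NIC on a CPU socket hosting no compute GPU.
	if !gpuNUMASet[nicNUMA] {
		return Management
	}
	// Step 3 — default-route exclusion wins regardless of topo or link layer.
	if isDefaultRoute {
		return Management
	}
	// Step 2 — PIX/PXB to any GPU means a shared PCIe switch: compute fabric.
	for _, lvl := range topoLevels {
		if lvl == "PIX" || lvl == "PXB" {
			return Compute
		}
	}
	// Tiebreaker: InfiniBand link layer is compute fabric; Ethernet is storage.
	if linkLayer == "InfiniBand" {
		return Compute
	}
	return Storage
}

func main() {
	gpuNUMAs := map[int]bool{0: true, 1: true}
	fmt.Println(classifyNIC(3, gpuNUMAs, []string{"SYS"}, "Ethernet", false) == Management)      // GPU-less socket
	fmt.Println(classifyNIC(0, gpuNUMAs, []string{"PIX", "SYS"}, "Ethernet", false) == Compute)  // shared PCIe switch
	fmt.Println(classifyNIC(0, gpuNUMAs, []string{"NODE"}, "InfiniBand", false) == Compute)      // link-layer tiebreak
	fmt.Println(classifyNIC(0, gpuNUMAs, []string{"NODE"}, "Ethernet", false) == Storage)        // storage NIC
}
```

The ordering matters: the default-route check must run before the topology rules, or a management NIC sharing a GPU's NUMA node would be misfiled as Storage.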
Hard dependency on metadata: The NIC Health Monitor requires the raw GPU↔NIC topology matrix (and the GPU list) published by the metadata collector in /var/lib/nvsentinel/gpu_metadata.json. The monitor fails to start if the file is missing or unreadable, or if nic_topology is absent/empty. There is no silent-fallback mode. This is enforced at startup by topology.LoadFromMetadata(), which is called before any polling begins; failure returns an error that causes the process to exit. See Section 12.1.

Responsibility split: The metadata collector publishes raw facts: per-GPU NUMA nodes (from the nvidia-smi topo -m NUMA Affinity column) and the raw per-NIC topology-level matrix (one entry per GPU in gpus[] order). The NIC Health Monitor reads these together with per-NIC NUMA nodes (from its own sysfs access — the collector does not enumerate InfiniBand devices) and performs the compute/storage/management classification locally. The monitor never invokes nvidia-smi itself; the matrix and GPU NUMA are produced once by the collector and cached in JSON.

Why nvidia-smi topo -m text parsing: The GPU↔NIC topology relationship is not available through any structured API. NVML exposes three topology functions (DeviceGetTopologyCommonAncestor, DeviceGetTopologyNearestGpus, SystemGetTopologyGpuSet), but all operate exclusively on nvmlDevice_t handles, which represent GPUs only — NVML has no concept of NIC/InfiniBand devices. DCGM’s dcgmGetDeviceTopology has the same GPU-only limitation. The nvidia-smi topo subcommand does not support --format=json/xml/yaml (unlike nvidia-smi --query-gpu); the only output format is the whitespace-aligned ASCII matrix (-m). No existing open-source library parses the full GPU↔NIC matrix from this output — HAMi’s parseNvidiaNumaInfo only extracts GPU NUMA affinity (not NIC columns) and was itself replaced with sysfs reads due to parsing fragility. The metadata collector therefore includes a purpose-built parser with handling for known format variations (ANSI escape codes, NICn legend remapping, wrapped headers, Grace NUMA ranges).
4.1 Management NIC Exclusion (NUMA-Based)
4.1.1 The Problem
DGX systems (e.g., DGX A100) have Mellanox ConnectX management NICs that appear in /sys/class/infiniband/ alongside compute fabric NICs. If monitored, a management NIC going DOWN would trigger IsFatal=true with RecommendedAction_REPLACE_VM — an incorrect remediation for a NIC that doesn’t affect GPU workloads. The design doc’s severity model (Fatal = “workload WILL fail”) is specifically designed for compute and storage NICs, not management NICs.
4.1.2 Detection Mechanism
Management NICs on DGX systems are placed on CPU sockets that have no compute GPUs. The monitor exploits this by checking whether each NIC’s NUMA node has a compute GPU on it:
- Read gpus[].numa_node from /var/lib/nvsentinel/gpu_metadata.json (the metadata collector parses this from the nvidia-smi topo -m NUMA Affinity column and publishes it per GPU).
- Build gpu_numa_set from the distinct numa_node values across all GPUs (ignoring -1 / unknown).
- For each mlx5_* NIC discovered in /sys/class/infiniband/, read /sys/class/infiniband/<dev>/device/numa_node.
- If nic_numa ∉ gpu_numa_set → exclude (management NIC on separate socket).
Edge case — GPU: If gpus[].numa_node = -1 (unknown, common in VMs or single-socket systems), that GPU is excluded from the gpu_numa_set. If all GPUs have -1, the set is empty and the NIC Health Monitor fails to start — without GPU NUMA information the NUMA gate cannot distinguish management NICs from compute NICs, and monitoring everything would risk false REPLACE_VM on management NIC failures.
Edge case — NIC: If a NIC’s numa_node = -1 (unknown), the NIC is excluded. Under-monitoring (missing a NIC failure) is preferable to over-monitoring (issuing a false REPLACE_VM on a management NIC that happens to go down).
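The NUMA gate, including both -1 edge cases, can be sketched as follows. The function names (nicNUMANode, excludeNIC) are illustrative, not the monitor's real API:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// nicNUMANode reads /sys/class/infiniband/<dev>/device/numa_node, returning -1
// when the file is missing or unparsable (treated as "unknown" per Section 4.1.2).
func nicNUMANode(sysfsRoot, dev string) int {
	raw, err := os.ReadFile(filepath.Join(sysfsRoot, dev, "device", "numa_node"))
	if err != nil {
		return -1
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return -1
	}
	return n
}

// excludeNIC applies the gate: exclude when the NIC's NUMA node hosts no GPU,
// or when its node is unknown (-1) — under-monitoring beats a false REPLACE_VM.
func excludeNIC(nicNUMA int, gpuNUMASet map[int]bool) bool {
	return nicNUMA < 0 || !gpuNUMASet[nicNUMA]
}

func main() {
	gpuNUMAs := map[int]bool{0: true, 1: true}
	fmt.Println(excludeNIC(3, gpuNUMAs))  // true: management NIC on GPU-less socket
	fmt.Println(excludeNIC(-1, gpuNUMAs)) // true: unknown NUMA, excluded
	fmt.Println(excludeNIC(0, gpuNUMAs))  // false: monitored
}
```

The all-GPUs-at-minus-one case (empty gpu_numa_set) is a startup failure rather than a per-NIC decision, so it sits outside this sketch.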
4.1.3 Field Validation
Design Note: Storage NICs (e.g., H100 Slot1/Slot2 ConnectX-7 cards) share a NUMA node with compute GPUs. They are intentionally not excluded because storage NIC failures also impact workloads (I/O hangs, checkpoint failures). The NUMA check only excludes NICs on NUMA nodes with zero compute GPUs.
4.2 NIC Role Classification (Topo Matrix)
4.2.1 The Problem
DGX/HGX systems have both compute fabric NICs (OSFP ports on the GPU tray) and storage NICs (Slot1/Slot2 on the CPU motherboard). These are the same hardware (ConnectX-7) but serve different roles, may have different port counts, and run at different speeds. The card homogeneity check (Section 4.3) must compare NICs of the same role — compute against compute, storage against storage — to avoid false positives.
4.2.2 Detection Mechanism: nvidia-smi topo -m Matrix Lookup
The metadata collector runs nvidia-smi topo -m on the node at startup, parses the GPU↔NIC relationship matrix into a raw per-NIC array of topology levels (one entry per GPU in gpus[] order), and publishes it to /var/lib/nvsentinel/gpu_metadata.json under the nic_topology field. The NIC Health Monitor consumes this matrix and applies the classification rules below to each NIC locally — no sysfs path walking, no PCIe-depth heuristics, and no direct invocation of nvidia-smi in the monitor.
The mapping from NVIDIA topology levels (the nvmlGpuTopologyLevel_t enum, displayed as nvidia-smi topo -m abbreviations) to NIC roles is:
Classification algorithm (applied per NIC after discovery):
Precedence explained:
- PIX/PXB → Compute: The topo matrix authoritatively identifies NICs that share a PCIe switch with a GPU. This is the primary signal on SXM systems (DGX/HGX A100, H100).
- Default route → Management: Runs before topology classification. The classifier reads /proc/net/route at startup, finds the default route interface, and maps it to an IB device via /sys/class/net/<iface>/device/infiniband/. This prevents the management NIC from being monitored as Storage, avoiding false REPLACE_VM for control-plane network failures. If /proc/net/route is unavailable or the interface has no IB backing, the check is silently skipped.
- InfiniBand → Compute: On platforms where no NIC has PIX/PXB to a GPU (PCIe-only GPUs like L40S, or Grace where GPUs aren’t on PCIe), the link layer distinguishes compute fabric NICs (InfiniBand) from storage/management NICs (Ethernet). This is the decisive signal on on-prem L40S and GB200.
- NODE/PHB → Storage: NICs that share a NUMA node or host bridge with a GPU but don’t share a PCIe switch and aren’t InfiniBand. Typical storage NIC layout on H100 OCI (Slot1/Slot2 ConnectX-7 Ethernet cards).
- All-SYS fallback → Storage: NICs on a GPU NUMA node but with no PCIe relationship and an Ethernet link layer. Safe default: monitored.
4.2.3 Three-Tier Classification
Combined with the NUMA gate from Section 4.1, the monitor assigns each NIC to one of three roles:
Key design property: On every validated platform, InfiniBand NICs and Ethernet NICs end up in separate classification groups (Compute vs Storage). This ensures the card homogeneity check (Section 4.3) never compares IB compute fabric NICs against Ethernet storage/management NICs, preventing false positives from hardware diversity (e.g., different port counts, different link speeds).
4.2.4 Field Validation
Verified against real hardware on five distinct platforms covering x86 SXM (A100, H100), x86 PCIe (L40S OCI, on-prem L40S), and Grace (GB200). The link-layer check improves classification on on-prem and GB200 compared to the previous sysfs PCIe path-walk algorithm, while producing identical results on all other platforms.
A100 OCI RoCE (4-socket AMD EPYC, 8 GPUs, 18 PF NICs):
Result: 2 Management + 16 Compute + 0 Storage. 18/18 match current algorithm.
H100 OCI (2-socket Intel Xeon Platinum 8480+, 8 GPUs, 18 PF NICs):
Result: 0 Management + 16 Compute + 2 Storage. Matches documented storage NIC layout on OCI H100.
L40S OCI (2-socket Intel, 4 PCIe GPUs, 6 PF NICs — all Ethernet/RoCE):
Every NIC shows NODE to some GPUs and SYS to others; no NIC has any PIX or PXB (L40S is PCIe-attached, not SXM — there are no shared PCIe switches). All 6 NICs are Ethernet (RoCE). The link-layer check does not promote any to Compute (no InfiniBand). NODE → Storage for all.
Result: 0 Management + 0 Compute + 6 Storage. All NICs monitored in a single Storage homogeneity group. This is correct because OCI L40S uses RoCE for all cluster networking — there is no separate compute fabric link layer.
On-prem L40S (2-socket Intel, 8 PCIe GPUs, 5 PF NICs: 1 Ethernet mgmt + 4 IB compute):
On-prem datacenter nodes with PCIe GPUs and native InfiniBand for the compute fabric typically have a separate Ethernet NIC for pod networking. The topo matrix shows NODE to local GPUs for all 5 NICs (PCIe-only system, no shared switches). Without the link-layer check, all 5 would be classified as Storage (same group), corrupting the homogeneity check if port counts differ.
With the link-layer check:
Result: 0 Management + 4 Compute + 1 Storage. The 4 IB NICs are in the Compute homogeneity group; the Ethernet management NIC is in a separate Storage group. No cross-comparison between IB and Ethernet, preventing false positives from hardware diversity.
With the default-route check: mlx5_0 (carries default route) → Management (excluded). Result: 1 Management + 4 Compute + 0 Storage.
GB200 NVL4 (2-socket Grace Neoverse-V2, 4 GPUs, 6 PF NICs: 4 ConnectX-7 IB + 2 BlueField-3 DPU):
Every NIC↔GPU cell is SYS (GPUs are on NVLink-C2C, not PCIe — no shared PCIe ancestor exists). All NIC NUMAs are in the GPU NUMA set. No PIX/PXB or NODE/PHB relationships exist. The link-layer check and HCA-based DPU detection are the only signals that can distinguish roles:
Result: 2 Management (BlueField DPUs) + 4 Compute (IB ConnectX-7) + 0 Storage. The 4 IB NICs are monitored in the Compute homogeneity group; the 2 DPUs are excluded.
Known BlueField HCA types excluded: MT41682 (BlueField-2), MT41686 (BlueField-2 variant), MT41692 (BlueField-3). Unrecognised HCA types fall through to Storage (monitored) — the safe direction for future hardware.
With the default-route check: roceP6p3s0 (carries default route) would be excluded by Step 3 before the HCA check even runs. Same result, different detection path.
4.3 Uncabled Port Detection (Role-Based Card Homogeneity)
4.3.1 The Problem
Some NIC cards have multiple ports, but not all ports are cabled. For example, dual-port ConnectX cards may have only port 1 cabled and port 2 unused. The monitor must distinguish between a genuinely failed port and an intentionally uncabled one — without requiring static configuration like gpu_port_config.
Additionally, compute and storage NICs may have different port counts (e.g., dual-port compute cards vs single-port storage cards). The homogeneity check must compare NICs within the same role group to avoid false positives.
4.3.2 Detection Mechanism
NICs are grouped by role (Compute or Storage, from Section 4.2), then within each role group:
- Group NICs by physical card (PCI bus:device address — e.g., 0000:47:00 groups 0000:47:00.0 and 0000:47:00.1)
- Count active (ACTIVE + LinkUp) ports per card
- Calculate the mode (most common active-port count) within the role group
- Any card with fewer active ports than its role’s mode → FATAL event
4.3.3 Algorithm
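A minimal sketch of the per-role-group homogeneity check, assuming the grouping and port counting have already been done upstream. The function name (cardAnomalies) and the tie-break on equal frequencies (larger count wins) are illustrative choices, not the monitor's actual implementation:

```go
package main

import "fmt"

// cardAnomalies flags cards whose active-port count falls below the role
// group's mode. activeByCard maps PCI bus:device (e.g. "0000:47:00") to its
// count of ACTIVE+LinkUp ports.
func cardAnomalies(activeByCard map[string]int) []string {
	// Compute the mode (most common active-port count) across the group.
	freq := map[int]int{}
	for _, n := range activeByCard {
		freq[n]++
	}
	mode, best := 0, 0
	for n, c := range freq {
		// On a frequency tie, prefer the larger port count (illustrative choice).
		if c > best || (c == best && n > mode) {
			mode, best = n, c
		}
	}
	var bad []string
	for card, n := range activeByCard {
		if n < mode {
			bad = append(bad, fmt.Sprintf(
				"Card %s has %d active ports, expected %d (peer mode)", card, n, mode))
		}
	}
	return bad
}

func main() {
	// 8 single-port cards, one fully down → mode is 1, one FATAL anomaly.
	cards := map[string]int{}
	for i := 0; i < 7; i++ {
		cards[fmt.Sprintf("0000:%02x:00", 0x10+i)] = 1
	}
	cards["0000:47:00"] = 0
	fmt.Println(len(cardAnomalies(cards))) // 1
}
```

Because the check runs per role group, a dual-port compute card is never compared against a single-port storage card.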
4.3.4 Field Validation
H100 OCI (compute dual-port + storage single-port):
L40 (dual-port compute NICs, 1 port cabled per card):
Probability analysis: For the mode to be incorrect (masking real failures), more than half of the cards in a role group would need to be independently failed at startup. With a ~1% per-NIC failure rate, the probability of 4+ out of 8 NICs failing simultaneously is ~0.00003% — effectively impossible.
4.4 Design Decision: Why Speed Degradation Detection Was Removed
The previous design included a speed degradation check that compared the sysfs rate against an expected rate from gpu_port_config. This was removed for the following reasons:
- Required per-GPU-type static configuration (gpu_port_config) that doesn’t exist for non-DGX systems (L40, T4, cloud VMs, OEM servers)
- Cannot distinguish compute from storage NICs: On H100 DGX, compute NICs run at 400 Gb/s (InfiniBand) while storage NICs may run at different speeds (Ethernet). Applying the same rate threshold to both causes false positives
- Counter checks already detect the underlying degradation: When a cable degrades enough to cause speed fallback, the physical layer generates errors. The symbol_error and link_error_recovery counters (see Link Counter Detection) detect this degradation before or during the retrain event
- Sysfs does not expose the expected/supported speed: The rate file only shows the current negotiated speed, not the maximum supported speed of the NIC or cable
Note: Speed degradation remains a real failure mode in GPU clusters. A 400G link dropping to 200G halves collective operation throughput. However, this is better addressed by counter-based degradation monitoring (Layer 2) which detects the physical signal degradation that causes the speed fallback, rather than by comparing the negotiated speed against a static configuration value.
5. Device Discovery and Parsing
5.1 Discovery Logic
The NIC Health Monitor discovers and parses InfiniBand/RoCE devices by iterating over sysfs:
- Iterating over /sys/class/infiniband
- Parsing hca_type, fw_ver, and board_id
- Enumerating ports and reading link_layer, state, and phys_state
- Identifying device type (PF vs VF) for proper alerting
5.2 Device Discovery Diagram
5.3 Vendor Detection
The monitor detects Mellanox devices using the following logic:
- Check if the device name matches mlx5_\d+ (Mellanox).
- Fallback: Check the driver symlink in /sys/class/infiniband/<dev>/device/driver for mlx5_core.
6. State Change and Flap Detection
The NIC Health Monitor reports health boundary events — one event per port when the port transitions between healthy and unhealthy states. Intermediate transitions (e.g., DOWN/Disabled → DOWN/Polling) are suppressed. The Health Events Analyzer performs pattern detection to distinguish between persistent drops and transient flapping.
6.1 Architecture
- NIC Health Monitor reports health boundary crossings (healthy→fatal, fatal→healthy)
- Events flow to MongoDB via Platform Connector
- Health Events Analyzer applies correlation rules to detect patterns
6.2 Port Drop Detection (Analyzer Rule: NICPortDrop)
An InfiniBand port is marked as “Dropped” when the Analyzer detects:
- The port has been reporting state=DOWN for at least 4 minutes
- No link_downed delta events during this period (indicating no recovery attempts)
6.3 Port Flap Detection (Analyzer Rule: RepeatedNICLinkFlap)
An InfiniBand port is marked as “Flapping” (Severity: FATAL) when the Analyzer detects:
- 3+ link_downed events within 10 minutes on the same NICPort entity
- This indicates repeated DOWN→ACTIVE transitions (unstable hardware)
6.4 Link Flap Detection Diagram
Effect: The Analyzer emits a new fatal event with RecommendedAction_REPLACE_VM. The stabilization window logic (similar to sticky XID) can be implemented as an Analyzer rule to prevent rapid re-alerting.
7. Device Disappearance Handling
7.1 Purpose
When the State Monitor detects a device has disappeared from /sys/class/infiniband/, this is treated as a FATAL condition requiring VM replacement.
7.2 Detection
Device disappearance is detected through three complementary mechanisms:
Case 1: Runtime disappearance (monitor has in-memory or persisted state, same boot)
The monitor tracks devices across polling cycles via an in-memory device set and a persisted KnownDevices list (see Link Counter Detection, Section 6.6). If a previously-seen device is no longer present in /sys/class/infiniband/, a FATAL event is generated immediately with the exact device name.
This works both during normal operation (in-memory state from prior poll) and after pod restart on the same boot (persisted KnownDevices loaded from the state file). Without persistence, a device that disappeared while the pod was restarting would go undetected — the new pod would have no knowledge the device ever existed.
- Example: mlx5_3 was in the persisted KnownDevices but is absent from sysfs on startup → EntityType: "NIC", EntityValue: "mlx5_3"
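Case 1 reduces to a set difference between the persisted device list and the devices currently visible in sysfs. A minimal sketch (the function name missingDevices is hypothetical):

```go
package main

import "fmt"

// missingDevices diffs the persisted KnownDevices list against the devices
// currently present in /sys/class/infiniband/; each missing device becomes
// a FATAL event with the exact device name.
func missingDevices(known, current []string) []string {
	seen := map[string]bool{}
	for _, d := range current {
		seen[d] = true
	}
	var missing []string
	for _, d := range known {
		if !seen[d] {
			missing = append(missing, d)
		}
	}
	return missing
}

func main() {
	known := []string{"mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"}
	current := []string{"mlx5_0", "mlx5_1", "mlx5_2"} // mlx5_3 fell off the PCIe bus
	fmt.Println(missingDevices(known, current))       // [mlx5_3]
}
```

Devices present now but absent from KnownDevices (newly added hardware) are not errors; they simply join the persisted list on the next state write.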
Case 2: Device missing after host reboot (boot ID changed — state cleared)
On host reboot, all persisted state (including KnownDevices) is cleared because the node may have had NICs replaced (see Link Counter Detection, Section 6.5). The monitor cannot compare against prior devices because they may be entirely different hardware. Device disappearance detection after reboot falls through to Case 3 (homogeneity check).
Case 3: Device missing on startup (no persisted state — fresh node, post-reboot, or corrupt state file)
On the first poll cycle after startup with no persisted state, the monitor uses the card homogeneity check (see Section 4.3) to detect anomalies without requiring prior state or static configuration. This covers fresh nodes, post-reboot startups (where state was cleared), and corrupt state files. After the first poll, all runtime state changes (cable pulls, link failures, recoveries) are handled by the per-port boundary-crossing transition detection, making repeated homogeneity checks unnecessary:
- Group all monitored PF NICs by physical card (PCI bus:device)
- Count active (ACTIVE/LinkUp) ports per card
- Calculate the mode (most common active-port count) across all cards
- Any card with fewer active ports than the mode → FATAL event
This startup homogeneity check requires no persisted state and works immediately as a fallback. It detects missing ports by comparing against peer NICs on the same node rather than against a static expected count.
- Example: 8 single-port NIC cards, 7 are ACTIVE, 1 is DOWN → mode is 1, the DOWN card has 0 active → FATAL
- Message: “Card 0000:XX:00 has 0 active ports, expected 1 (peer mode)”
Why the homogeneity assumption is safe: Compute fabric NICs are all the same model on GPU cluster nodes (DGX, HGX, or OEM). This approach works for both InfiniBand and Ethernet (RoCE) NICs. Management NICs on separate NUMA nodes are excluded before this check runs (see Section 4.1). For the mode to be incorrect, more than half of the NICs would need to be independently failed at startup — a probability of ~0.00003% for an 8-NIC system.
7.3 Event Classification
Design Note: All device disappearances are treated as FATAL because in production environments, unexpected device loss indicates a hardware issue requiring investigation and VM replacement. The monitor does not differentiate between “clean” removals (driver unload) and “dirty” removals (hardware crash).
8. SR-IOV Virtual Function Handling
8.1 Background: Why VFs Being DOWN is Expected
SR-IOV (Single Root I/O Virtualization) is a technology that allows a single physical NIC to appear as multiple virtual NICs. Understanding this is critical for correct alerting behavior.
Note: Clusters with the NVIDIA Network Operator installed will have SR-IOV enabled by default. This applies to both VM-based and baremetal container environments. In baremetal Kubernetes with SR-IOV, unassigned VFs will still appear as DOWN — the filtering logic applies equally to both deployment types.
The Problem Without Understanding SR-IOV:
Why VFs are DOWN by default: When SR-IOV is enabled, Virtual Functions are pre-created by the driver but remain in DOWN state until they are:
- Assigned to a VM or container via passthrough/device allocation
- Administratively enabled (for InfiniBand, also requires Subnet Manager configuration)
Unassigned VFs are essentially “empty slots” waiting for workloads. A DOWN VF is not a hardware failure—it’s normal SR-IOV behavior.
8.2 Key Terminology
8.3 VF Lifecycle
8.4 Alerting Decision Matrix
8.5 Auto-Detection: PF vs VF
The Linux kernel provides clear indicators in sysfs:
8.6 Real Example from Field Validation (34-NIC System)
8.7 Implementation
To determine if a DOWN state is expected, the monitor detects if a device is an SR-IOV Virtual Function (VF) or Physical Function (PF).
- Method 1 (Primary): Check for a physfn symlink in the device directory. If present, it’s a VF.
- Method 2 (Secondary): Check for the sriov_totalvfs file. If present, it’s a PF.
VFs are expected to be DOWN when unassigned. PFs are expected to be ACTIVE.
9. RoCE State Monitoring
RoCE (RDMA over Converged Ethernet) devices appear in both /sys/class/net and /sys/class/infiniband. The monitor accesses RoCE devices via the InfiniBand interface (/sys/class/infiniband/). The following monitoring applies to RoCE:
- State monitoring: state, phys_state (via the InfiniBand sysfs interface)
- Device identification: Check the link_layer file for “Ethernet”
9.1 GID Table Information (RoCE Routing Diagnostics)
The GID (Global Identifier) table is critical for RoCE routing. Each device exposes GIDs at:
- /sys/class/infiniband/<dev>/ports/<port>/gids/<index>
- /sys/class/infiniband/<dev>/ports/<port>/gid_attrs/types/<index>
GID Types (Linux Kernel sysfs ABI):
- IB/RoCE v1 = InfiniBand and RoCE v1 (GRH-based, layer 2)
- RoCE v2 = RoCE v2 (UDP-encapsulated, layer 3, firewall-friendly)
At the API level (ibv_gid_type), there are three distinct types:
- IBV_GID_TYPE_IB (InfiniBand)
- IBV_GID_TYPE_ROCE_V1 (RoCE v1)
- IBV_GID_TYPE_ROCE_V2 (RoCE v2)
Example GID table from 34-NIC system:
Diagnostic value:
- Empty GID table → Error 61 (ENODATA) during QP setup
- Missing IPv4 GIDs → routing issues for RoCE v2
- GID type mismatch between peers → connection failures
Helper Functions:
- getGIDCount: Enumerates /sys/class/infiniband/<dev>/ports/<port>/gids/ to count valid GIDs.
- getNetDevForIBDevice: Discovers the network interface (e.g., eth0, rdma4) associated with an IB device by reading /sys/class/infiniband/<dev>/device/net/. This is critical for reading Ethernet statistics on RoCE devices.
10. Supported Hardware
Current Scope: This initial implementation focuses on Mellanox/NVIDIA InfiniBand and RoCE devices only. The architecture is designed to be extensible for future support of additional NIC vendors.
10.1 Future Work
- AWS EFA Support: Device names matching rdmap\d+s\d+
- Plain Ethernet: operstate = down detection via /sys/class/net/<interface>/operstate
- TCPXO Support: TCP Express Offload support
11. Data Structures
11.1 State Monitoring Structures
11.2 Entity Model
NICs and Ports are modeled as separate entity types to enable precise fault localization:
Rationale: A single NIC (e.g., mlx5_0) can have multiple ports. Port-level events include both the NIC and Port entities in EntitiesImpacted, enabling:
- Precise fault localization (NIC + Port together identify the exact failing component)
- Precise cable replacement (which port’s cable is faulty)
- Targeted firmware diagnostics
- Accurate capacity planning (one failed port vs entire NIC)
12. Configuration
12.1 State Monitoring Configuration
Configuration is split between a ConfigMap mounted at /etc/nic-health-monitor/config.toml (rendered TOML, sourced from the Helm values.yaml shown below) and command-line flags that govern runtime paths and polling cadence. Both surfaces are documented below.
Helm values (YAML) — covers sysfs mount points and device filtering:
Command-line flags — cover runtime wiring that changes per deployment:
GPU metadata is a hard startup dependency — see Section 4 for the fail-fast conditions and Section 12.2 for the required fields.
SR-IOV Virtual Function handling
VFs are detected via the device/physfn sysfs symlink and skipped
unconditionally. There is no configuration knob — unassigned VFs are
expected to stay DOWN by design and monitoring them would produce false
positives.
Note: The previous gpu_port_config and MonitorNetworkType configuration options have been removed. Management NIC exclusion is automatic via NUMA detection (Section 4.1). NIC role classification uses the topo matrix published by the metadata collector (Section 4.2). Uncabled port detection uses the card homogeneity check (Section 4.3). Both InfiniBand and Ethernet (RoCE) NICs are monitored equally — no link-layer filtering is required.
12.2 Metadata Collector Requirements
The NIC Health Monitor is a hard consumer of topology data produced by the NVSentinel metadata collector. The collector must run on every node before (or alongside) the NIC Health Monitor DaemonSet and must populate the following fields in /var/lib/nvsentinel/gpu_metadata.json:
nic_topology format: Keys are InfiniBand device names (e.g., mlx5_0, ibp3s0). Values are a slice of topology-level strings — one entry per GPU listed in gpus[], in the same order. Each entry is one of "X", "PIX", "PXB", "PHB", "NODE", "SYS", or "NV<n>" (an NVLink bond count). The collector publishes this matrix verbatim; interpretation is the NIC Health Monitor’s responsibility.
Example gpu_metadata.json excerpt:
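A minimal illustrative excerpt (GPU count, device names, and NUMA values are hypothetical), consistent with the nic_topology format described above — one topology-level entry per GPU, in gpus[] order:

```json
{
  "gpus": [
    { "index": 0, "numa_node": 0 },
    { "index": 1, "numa_node": 1 }
  ],
  "nic_topology": {
    "mlx5_0": ["PIX", "SYS"],
    "mlx5_1": ["SYS", "PIX"],
    "mlx5_2": ["NODE", "SYS"]
  }
}
```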
Ordering guarantee: The NIC Health Monitor DaemonSet pod manifest must declare a dependency (via init container, readiness gate, or pod startup ordering) such that the metadata collector completes its write before the NIC monitor starts. If this ordering is violated, the NIC monitor will fail at startup with a clear error pointing at the missing file.
13. Event Management
13.1 State Event Construction
Events are emitted only on health boundary crossings — one consolidated event per port per transition. Logical state and physical state are combined into a single message.
Example Event Fields (Fatal - IB Port DOWN):
Example Event Fields (Fatal - RoCE Port DOWN):
Example Event Fields (Healthy - Recovery):
Example Event Fields (Fatal - Device Disappeared):
Appendix A: Quick Reference - Fatal Condition Classification
The key question: “Will the workload fail because of this?”
Fatal State Conditions (IsFatal = true)
Non-Fatal State Conditions (IsFatal = false)
Fatal Counters (IsFatal = true)
Driver/Firmware Logs
For kernel log pattern details (fatal and non-fatal classifications, regex patterns, log line examples, and kernel source references), see Syslog Detection & Correlation. This document focuses on link state detection; syslog monitoring is covered in its own dedicated document to keep each document focused on a single problem.
State Detection Paths
References
- Linux Kernel sysfs-class-infiniband documentation
- DGX A100 User Guide
- DGX H100 User Guide
- DGX B200 User Guide
- GB200 NVL2