NIC Health Monitor: Syslog Detection
NIC hardware polling monitors (state checks, counter reads) can miss critical failures that occur at the driver/firmware interface level. A NIC can appear healthy (link UP, counters normal) while the driver is completely unable to communicate with the firmware, leading to silent workload failures.
This document covers the Syslog Health Monitor component for NIC driver error monitoring, which detects:
- Command interface failures: command timeouts ("timeout. Will cause a leak of a command resource"), firmware hangs
- Health failures: "health poll failed", unrecoverable errors
- PCIe and power warnings reported by mlx5_core (e.g., insufficient power); fatal PCIe link loss / device disappearance is covered by link-state-detection
- Transmit timeouts (mlx5e_tx_timeout)
Following gpud’s design and forensic analysis of mlx5_core telemetry, kernel log events are classified based on their determinism of failure:
Key Design Principle: Only deterministically fatal events in the logs are raised as Fatal (IsFatal=true). All other events are raised as Non-Fatal (IsFatal=false) to provide diagnostic context without triggering immediate remediation.
The syslog-health-monitor’s NIC check follows NVSentinel’s established architectural pattern where:
The combined monitoring approach ensures no blind spots by covering both layers:
Why this matters: Simple monitors that only check link state miss Driver/Firmware Interface failures, leading to silent failures where the hardware appears active but the data path is broken.
The NVIDIA Mellanox ConnectX series (ConnectX-5 through ConnectX-7) function as sophisticated, autonomous offload engines managing RDMA and complex flow steering. The host-to-NIC interaction is governed by a split-driver model where mlx5_core handles device initialization, health monitoring, and the command interface, while mlx5_ib or mlx5_en operate as protocol clients.
The primary control pathway is the Command Interface. The driver writes command blocks (e.g., CREATE_MKEY, MODIFY_QP) to PCI BAR-mapped memory and notifies the firmware via a “doorbell” register.
- When the firmware does not complete a command in time, the driver logs: timeout. Will cause a leak of a command resource.
- This makes cmd_exec timeouts irreversibly fatal to the driver’s device management capability.
The NIC implements a dedicated ring buffer, the Event Queue (EQ), for control plane events (e.g., thermal excursions, port changes). Additionally, a background Health Poller thread periodically (every 1s) reads a “Health Syndrome” register. This dual mechanism ensures that even if the interrupt system fails, the driver detects unrecoverable hardware syndromes.
mlx5_core Command Timeout Resource Leak
This is a severe driver-level error indicating Driver/Firmware Interface failure. (Reference: RHEL mlx5_core issues)
Mechanism:
1. mlx5_core driver sends command to NIC firmware via mailbox
2. Firmware does not respond before the command timeout expires
3. Driver writes "timeout. Will cause a leak of a command resource" to dmesg
Consequence: Usually requires driver reload (systemctl restart openibd) or full node reboot. This is always Fatal: the workload cannot proceed if it cannot issue commands to the NIC. The driver intentionally “leaks” command resources because it cannot safely reclaim memory without risk of corruption if the firmware eventually responds.
NIC driver and firmware errors are monitored by adding a new check to the existing syslog-health-monitor DaemonSet. This follows the established NVSentinel pattern where all kernel log monitoring is centralized in the syslog-health-monitor.
The syslog-health-monitor already has a modular handler architecture for different check types:
The NICDriverErrorHandler follows the same pattern as the existing XID handler.
Log Line Processing Steps:
1. Match the line against the NIC driver error patterns (e.g., mlx5_core.*timeout\. Will cause a leak of a command resource)
2. Classify the matched pattern and set IsFatal accordingly
3. Extract the PCI BDF from the log line (e.g., 0000:3b:00.0)
4. Resolve the /sys/bus/pci/devices/<BDF>/driver symlink
   - mlx5_core → attach NIC entity (e.g., mlx5_0) to the event
   - Any other driver → no NIC entity is attached. The patterns are mlx5-specific (they require mlx5_core or mlx5_cmd_out_err in the message), so non-mlx5 matches are not expected.
5. Emit the event with:
   - Agent = "syslog-health-monitor"
   - CheckName = "SysLogsNICDriverError"
   - EntityType = "NIC", EntityValue = <device_name>
Why live sysfs lookup instead of metadata collector: The metadata collector publishes relatively static GPU/NVSwitch inventory. The BDF itself is usually stable during a boot, but the kernel state used for NIC enrichment is not: /sys/bus/pci/devices/<BDF>/driver can disappear or reappear when mlx5_core is unloaded, reloaded, rebound, or recovered, and the exposed infiniband/* or net/* names can be recreated. The syslog handler therefore reads the current sysfs state at event-processing time. This lookup is best-effort enrichment only; if it fails, the raw syslog event is still emitted without a NIC entity.
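The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the monitor's actual handler: the function names (resolve_nic_name, process_line), the dictionary layout of the event, the ErrorCode field name, and the BDF regex are assumptions made for the example. Only the command-timeout regex, the Agent/CheckName/EntityType values, and the sysfs path come from this document.

```python
import os
import re

# Regex from the processing steps; it matches the fatal command-timeout message.
CMD_TIMEOUT_RE = re.compile(
    r"mlx5_core.*timeout\. Will cause a leak of a command resource"
)
# Assumed shape of a PCI address (domain:bus:device.function), e.g. 0000:3b:00.0.
BDF_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def resolve_nic_name(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Best-effort lookup of a NIC name for a BDF; returns None on any failure."""
    dev_dir = os.path.join(sysfs_root, bdf)
    try:
        # The driver symlink can vanish while mlx5_core is reloaded or rebound.
        driver = os.path.basename(os.readlink(os.path.join(dev_dir, "driver")))
        if driver != "mlx5_core":
            return None
        # Prefer the RDMA device name, fall back to the netdev name.
        for sub in ("infiniband", "net"):
            sub_dir = os.path.join(dev_dir, sub)
            if os.path.isdir(sub_dir):
                names = sorted(os.listdir(sub_dir))
                if names:
                    return names[0]
    except OSError:
        pass
    return None


def process_line(line, sysfs_root="/sys/bus/pci/devices"):
    """Return an event dict for a matching line, or None if the line is unrelated."""
    if not CMD_TIMEOUT_RE.search(line):
        return None
    event = {
        "Agent": "syslog-health-monitor",
        "CheckName": "SysLogsNICDriverError",
        "IsFatal": True,
        "ErrorCode": ["cmd_exec_timeout"],  # hypothetical field name
        "Message": line,
        "Entities": [],
    }
    match = BDF_RE.search(line)
    if match:
        name = resolve_nic_name(match.group(0), sysfs_root)
        if name:
            event["Entities"].append({"EntityType": "NIC", "EntityValue": name})
    # Enrichment is optional: the event is emitted even without a NIC entity.
    return event
```

Passing sysfs_root as a parameter keeps the lookup testable against a temporary directory tree, and the broad OSError handler reflects the best-effort nature of the enrichment.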
The NIC driver error handler monitors for the following patterns:
Following gpud’s design, kernel log events are classified as Non-Fatal (IsFatal=false) by default. Only deterministic hardware state changes (port drops/flaps) that will cause workload failure are escalated to affect component health.
Full Log Line Examples:
Design Note: While gpud internally treats many of these as non-fatal, NVSentinel escalates deterministically fatal signals to Fatal (IsFatal=true) to trigger proactive remediation (REPLACE_VM) before workload failure cascades. Non-fatal signals remain as IsFatal=false for diagnostic correlation.
Patterns are classified according to their operational impact:
Key Principle: Kernel logs provide diagnostic context, not remediation triggers. The decision to drain/replace a node is based on actual port state (via link state detection), not log messages alone.
Fatal syslog events are emitted immediately with their configured remediation action. The Health Events Analyzer can still correlate those raw events with port state changes to provide incident context. Non-Fatal diagnostic syslog events are emitted as IsFatal=false and can also be correlated with state changes for operator investigation.
The Health Events Analyzer can correlate kernel log events with port state events:
Some kernel messages can be filtered to reduce noise:
- ACCESS_REG failed: Common on systems with restricted PFs (DGX, Umbriel). Use --infiniband-exclude-devices to exclude problematic devices from monitoring.
- insufficient power: Often transient during BIOS/BMC power negotiation. Can be filtered by uptime context if needed.
Unlike a local correlation engine, repeat/cross-signal pattern detection is handled by the Health Events Analyzer:
- The syslog-health-monitor sends raw mlx5_core events to Platform Connector → MongoDB
- An earlier kernel log event and a port state=DOWN at 10:00:05 can be correlated later to provide diagnostic context
Kernel log events are emitted with their configured severity. The Health Events Analyzer can correlate them with port state events for diagnostic context:
Key Design Principle: Kernel log events for diagnostic patterns remain Non-Fatal (IsFatal=false). Component health changes are triggered by port state (link-state-detection), not by kernel logs.
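As an illustration of the correlation idea, the sketch below pairs kernel log events with port state events from the same node that occur close together in time. It is not the Health Events Analyzer's implementation: the event dictionaries, the node and ts keys, and the 30-second window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def correlate(log_events, port_events, window=timedelta(seconds=30)):
    """Pair each kernel log event with port state events on the same node
    whose timestamps fall within `window` of the log event."""
    pairs = []
    for log in log_events:
        for port in port_events:
            same_node = log["node"] == port["node"]
            close_in_time = abs(port["ts"] - log["ts"]) <= window
            if same_node and close_in_time:
                pairs.append((log, port))
    return pairs


# Hypothetical events: a non-fatal log event followed by a port going DOWN.
logs = [{"node": "node-a", "pattern": "module_unplugged",
         "ts": datetime(2024, 1, 1, 10, 0, 0)}]
ports = [{"node": "node-a", "state": "DOWN",
          "ts": datetime(2024, 1, 1, 10, 0, 5)}]
matched = correlate(logs, ports)
```

The pairing only adds context for operators. In line with the principle above, the port state event remains the signal that changes component health.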
NIC driver error pattern definitions are hardcoded in the syslog-health-monitor. Configuration selects supported pattern names, enables/disables them, and can optionally override the processing strategy. Regexes, severity, recommended action, and descriptions are intentionally not user-configurable.
TOML Shape:
Fields:
Operational Note: Operators can disable noisy supported patterns or use STORE_ONLY for observability-only collection. Severity and recommended action changes should be handled through platform connector / analyzer overrides rather than by changing monitor regex definitions.
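The configuration contract described above (select supported pattern names, enable or disable them, optionally override the processing strategy, and nothing else) could be validated along the following lines. The key names name, enabled, and strategy are hypothetical and are not taken from the actual TOML shape; the pattern names and the STORE_ONLY value are the ones used in this document.

```python
# Pattern names used elsewhere in this document; the definitions themselves
# (regex, severity, recommended action) stay hardcoded in the monitor.
SUPPORTED_PATTERNS = (
    "cmd_exec_timeout",
    "health_poll_failed",
    "unrecoverable_err",
    "netdev_watchdog",
    "port_module_high_temp",
    "pci_power_insufficient",
    "module_unplugged",
    "access_reg_failed",
)
# Hypothetical config keys; anything else (regex, severity, ...) is rejected.
ALLOWED_KEYS = {"name", "enabled", "strategy"}


def resolve_patterns(config_entries):
    """Return {pattern_name: strategy_override_or_None} for enabled patterns."""
    active = {}
    for entry in config_entries:
        unknown_keys = set(entry) - ALLOWED_KEYS
        if unknown_keys:
            raise ValueError(f"non-configurable keys: {sorted(unknown_keys)}")
        name = entry["name"]
        if name not in SUPPORTED_PATTERNS:
            raise ValueError(f"unsupported NIC pattern name: {name}")
        if entry.get("enabled", True):
            # None means "keep the monitor's default processing strategy".
            active[name] = entry.get("strategy")
    return active


active = resolve_patterns([
    {"name": "access_reg_failed", "enabled": False},
    {"name": "netdev_watchdog", "strategy": "STORE_ONLY"},
])
```

Rejecting unknown keys makes an attempt to override a regex or severity fail loudly instead of being silently ignored. How patterns that are absent from the configuration should behave is left open in this sketch.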
Events are emitted with severity based on their determinism of failure:
Example Event Fields (Fatal - command timeout resource leak):
Example Event Fields (Non-Fatal - module unplugged):
Kernel log events serve as both direct failure signals and diagnostic context:
Key Design Principle: Deterministically fatal events in logs (command timeout resource leaks, unrecoverable hardware error) trigger REPLACE_VM directly. Non-fatal events (insufficient power, module unplugged) remain as IsFatal=false for diagnostic correlation.
The syslog monitoring capability operates at the driver and kernel level only:
Application-level logs and remote failures are out of scope:
The following table shows which hardware failures this monitor detects and how they may impact applications:
Kernel log events are classified by their determinism of failure:
Key Principle: Deterministically fatal events in logs trigger REPLACE_VM. Diagnostic logs remain as Non-Fatal (IsFatal=false) for diagnostic context.
The following patterns are monitored in the kernel ring buffer (dmesg/kmsg):
Key Insight: Kernel logs for deterministic failures (command timeout resource leaks, etc.) are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Diagnostic logs (insufficient power, High Temperature, module unplugged) are Non-Fatal (IsFatal=false). State and counter conditions are also Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM.
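The fatal / non-fatal split can be pictured as a small lookup table. The sketch below uses the pattern names that appear in this document; the Classification type, the classify function, and the use of None for "no remediation action" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Classification:
    is_fatal: bool
    recommended_action: Optional[str]  # None: diagnostic only (assumption)


FATAL = Classification(True, "REPLACE_VM")
DIAGNOSTIC = Classification(False, None)

# Pattern names as used in this document, grouped by determinism of failure.
CLASSIFICATION = {
    "cmd_exec_timeout": FATAL,
    "health_poll_failed": FATAL,
    "unrecoverable_err": FATAL,
    "netdev_watchdog": DIAGNOSTIC,
    "port_module_high_temp": DIAGNOSTIC,
    "pci_power_insufficient": DIAGNOSTIC,
    "module_unplugged": DIAGNOSTIC,
    "access_reg_failed": DIAGNOSTIC,
}


def classify(pattern_name):
    """Unknown patterns fall back to Non-Fatal, the documented default."""
    return CLASSIFICATION.get(pattern_name, DIAGNOSTIC)
```

Defaulting unknown names to the diagnostic class means a new pattern can never trigger remediation unless it is explicitly listed as fatal.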
The Health Events Analyzer should only add value where the raw monitors do not already emit a deterministic fatal event. Fatal NIC state/counter events and fatal NIC driver syslog patterns already trigger remediation directly, so repeated rules are limited to non-fatal recurrence signals and use CONTACT_SUPPORT.
Design Note: The count threshold of 3 is externally grounded by the Linux mlx5 health-poll miss threshold (MAX_MISSES = 3) and gpud’s InfiniBand flap threshold of 3 reverts-to-active. The 1-hour window is an NVSentinel operational choice to catch clustered recurrence while avoiding stale daily maintenance or boot noise.
RepeatedNICDriverError:
- Source: syslog-health-monitor, SysLogsNICDriverError, IsFatal=false.
- Patterns: netdev_watchdog, port_module_high_temp, pci_power_insufficient, module_unplugged.
- Grouping: same errorcode[0] pattern name.
- Action: CONTACT_SUPPORT.
RepeatedNICDegradation:
- Source: nic-health-monitor, InfiniBandDegradationCheck or EthernetDegradationCheck, IsFatal=false, IsHealthy=false.
- Grouping: same NIC + same NICPort.
- Action: CONTACT_SUPPORT.
access_reg_failed is intentionally excluded from RepeatedNICDriverError. Public gpud context identifies repeated mlx5_cmd_out_err.*ACCESS_REG.*failed messages as restricted-PF/query noise on some systems, not a standalone hardware recurrence signal.
Fatal syslog patterns are also intentionally excluded:
- cmd_exec_timeout
- health_poll_failed
- unrecoverable_err
Those are already emitted as fatal by the syslog health monitor and do not need analyzer-side repeat detection.
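The recurrence logic behind RepeatedNICDriverError can be illustrated with a sliding-window counter. This is a sketch only: the real rules are evaluated by the Health Events Analyzer from its Helm-configured rule definitions, and the RepeatDetector class and its grouping key (node plus pattern name) are assumptions. The threshold of 3, the 1-hour window, the included pattern names, and the CONTACT_SUPPORT action come from this document.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Non-fatal patterns eligible for repeat detection; access_reg_failed and the
# fatal patterns are deliberately absent.
REPEAT_PATTERNS = {
    "netdev_watchdog",
    "port_module_high_temp",
    "pci_power_insufficient",
    "module_unplugged",
}
THRESHOLD = 3
WINDOW = timedelta(hours=1)


class RepeatDetector:
    def __init__(self):
        # (node, pattern) -> timestamps of recent matching events
        self._seen = defaultdict(deque)

    def observe(self, node, pattern, is_fatal, ts):
        """Record one event; return "CONTACT_SUPPORT" once the threshold is
        reached inside the window, otherwise None."""
        if is_fatal or pattern not in REPEAT_PATTERNS:
            return None
        recent = self._seen[(node, pattern)]
        recent.append(ts)
        # Drop timestamps that have aged out of the window.
        while recent and ts - recent[0] > WINDOW:
            recent.popleft()
        return "CONTACT_SUPPORT" if len(recent) >= THRESHOLD else None


detector = RepeatDetector()
start = datetime(2024, 1, 1, 10, 0, 0)
results = [
    detector.observe("node-a", "netdev_watchdog", False, start + timedelta(minutes=m))
    for m in (0, 10, 20)
]
```

In this example the first two observations return None and the third returns CONTACT_SUPPORT, because all three fall inside the same hour.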
- Health poller uses MAX_MISSES = 3 and logs "device's health compromised - reached miss count": Linux mlx5 health.c.
- Command timeout message "timeout. Will cause a leak of a command resource": Linux mlx5 cmd.c.
- Event messages High Temperature, Cable unplugged, and insufficient PCIe slot power: Linux mlx5 events.c.
- TX timeout handling: Linux mlx5 reporter_tx.c.
- gpud treats mlx5_cmd_out_err.*ACCESS_REG.*failed as restricted-device/query noise: leptonai/gpud#1164.
These rules are configured in the health-events-analyzer Helm chart values:
- ACCESS_REG failed
- unrecoverable
- insufficient power, High Temperature, module unplugged