NIC Health Monitor: Syslog Detection
Table of Contents
- Overview
- Architecture
- Driver/Firmware Interface Foundations
- Integration with Syslog Health Monitor
- Monitored Kernel Patterns
- Event Correlation (via Health Events Analyzer)
- Repeat Failure Detection
- Configuration
- Event Management
- Monitoring Scope and Limitations
- Appendix A: Quick Reference - Kernel Log Patterns
- Appendix B: Health Events Analyzer Rules for NIC Monitoring
Related Documents:
- Link State Detection - UP/DOWN state monitoring
- Link Counter Detection - Counter-based degradation monitoring
1. Overview
1.1 Problem Statement
NIC hardware polling monitors (state checks, counter reads) can miss critical failures that occur at the driver/firmware interface level. A NIC can appear healthy (link UP, counters normal) while the driver is completely unable to communicate with the firmware, leading to silent workload failures.
1.2 Scope of Syslog Detection
This document covers the Syslog Health Monitor component for NIC driver error monitoring, which detects:
- Driver/Firmware communication failures - command timeouts ("timeout. Will cause a leak of a command resource"), firmware hangs
- Hardware health check failures - "health poll failed", unrecoverable errors
- Driver/firmware-level PCIe events - surfaced via mlx5_core (e.g., insufficient power); fatal PCIe link loss / device disappearance is covered by link-state-detection
- Thermal and power issues - High temperature warnings, insufficient power
- Network watchdog timeouts - TX queue stalls (Non-Fatal diagnostic context; auto-recovery via mlx5e_tx_timeout)
1.3 Why Syslog Monitoring is Essential
1.4 Severity Model for Kernel Log Events
Following gpud’s design and forensic analysis of mlx5_core telemetry, kernel log events are classified based on their determinism of failure:
Key Design Principle: Only deterministically fatal events in the logs are raised as Fatal (IsFatal=true). All other events are raised as Non-Fatal (IsFatal=false) to provide diagnostic context without triggering immediate remediation.
1.5 Syslog Detection Overview Diagram
2. Architecture
2.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern
The syslog-health-monitor’s NIC check follows NVSentinel’s established architectural pattern where:
- Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
- Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
- MongoDB serves as the source of truth for event history and correlation queries
2.2 Component Responsibilities
2.3 NIC Driver Error Check Data Flow
2.4 System Context
2.5 Packet Path vs Driver/Firmware Interface Coverage
The combined monitoring approach ensures no blind spots by covering both layers:
Why this matters: Simple monitors that only check link state miss Driver/Firmware Interface failures, leading to silent failures where the hardware appears active but the data path is broken.
3. Driver/Firmware Interface Foundations
The NVIDIA Mellanox ConnectX series (ConnectX-5 through ConnectX-7) functions as a sophisticated, autonomous offload engine managing RDMA and complex flow steering. The host-to-NIC interaction is governed by a split-driver model in which mlx5_core handles device initialization, health monitoring, and the command interface, while mlx5_ib and mlx5_en operate as protocol clients.
3.1 The Command Interface (CMD IF)
The primary control pathway is the Command Interface. The driver writes command blocks (e.g., CREATE_MKEY, MODIFY_QP) to PCI BAR-mapped memory and notifies the firmware via a “doorbell” register.
- Fatality Mechanism: If the firmware hangs, the driver’s watchdog expires (typically 60s), logging "timeout. Will cause a leak of a command resource".
- Resource Leaks: Upon timeout, the driver intentionally “leaks” the command’s DMA-mapped memory. Freeing it could lead to silent memory corruption if the firmware later writes to that physical address. This makes cmd_exec timeouts irreversibly fatal to the driver’s device management capability.
3.2 Driver/Firmware Communication Diagram
3.3 Asynchronous Event Queue (EQ) and Health Poller
The NIC implements a dedicated ring buffer (EQ) for control plane events (e.g., thermal excursions, port changes). Additionally, a background Health Poller thread periodically (1s) reads a “Health Syndrome” register. This dual mechanism ensures that even if the interrupt system fails, the driver detects unrecoverable hardware syndromes.
3.4 The mlx5_core Command Timeout Resource Leak
This is a severe driver-level error indicating Driver/Firmware Interface failure. (Reference: RHEL mlx5_core issues)
Mechanism:
- mlx5_core driver sends command to NIC firmware via mailbox
- Driver waits for firmware to toggle “Ownership bit” indicating completion
- If firmware has crashed/hung, bit never toggles
- Driver’s watchdog expires → logs "timeout. Will cause a leak of a command resource" to dmesg
Consequence: Usually requires driver reload (systemctl restart openibd) or full node reboot. This is always Fatal—workload cannot proceed if it cannot issue commands to the NIC. The driver intentionally “leaks” command resources because it cannot safely reclaim memory without risk of corruption if the firmware eventually responds.
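The mechanism above can be illustrated with a toy model. This is only a sketch of the wait-then-leak logic described in this section, not the kernel's implementation: the names exec_command, FakeClock, and the string mailbox handles are hypothetical, and the 60s timeout is the typical value quoted in Section 3.1.

```python
TIMEOUT_MSG = "timeout. Will cause a leak of a command resource"


class FakeClock:
    """Deterministic clock (hypothetical helper) so the model runs instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


def exec_command(owned_by_hw, clock, leaked, mailbox, timeout_s=60.0, poll_s=1.0):
    """Wait for firmware to release ownership of the mailbox.

    owned_by_hw: callable returning True while firmware still owns the command.
    leaked: list collecting mailboxes that must never be freed or reused.
    """
    deadline = clock.now() + timeout_s
    while clock.now() < deadline:
        if not owned_by_hw():
            return "completed"  # firmware answered; the mailbox can be reclaimed
        clock.sleep(poll_s)
    # Firmware never answered. Freeing the mailbox would be unsafe because the
    # firmware might still write to it later, so it is deliberately leaked.
    leaked.append(mailbox)
    return TIMEOUT_MSG
```

With a firmware stub that never releases ownership, the call returns the timeout message and the mailbox ends up in the leaked list, which is the state the syslog monitor later observes as a Fatal log line.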
4. Integration with Syslog Health Monitor
NIC driver and firmware errors are monitored by adding a new check to the existing syslog-health-monitor DaemonSet. This follows the established NVSentinel pattern where all kernel log monitoring is centralized in the syslog-health-monitor.
4.1 Handler Architecture
The syslog-health-monitor already has a modular handler architecture for different check types:
4.2 Configuration (values.yaml)
4.3 Verification Command (Synthetic Fault Injection)
4.4 NIC Driver Error Handler Algorithm
The NICDriverErrorHandler follows the same pattern as the existing XID handler.
Log Line Processing Steps:
- Receive kernel log line from journald stream
- Match against NIC error patterns (see Section 5 for pattern list):
  - Patterns are loaded from a configurable TOML file (see Section 8.3), not hardcoded in source
  - Check if line matches any regex pattern (e.g., mlx5_core.*timeout\. Will cause a leak of a command resource)
  - If no match → skip line, return nil
- Determine severity:
  - Look up pattern in severity table (Section 5.2)
  - Set IsFatal accordingly
- Extract PCI address from message using regex (e.g., 0000:3b:00.0)
- Best-effort NIC entity enrichment by resolving the /sys/bus/pci/devices/<BDF>/driver symlink:
  - If the driver is mlx5_core → attach NIC entity (e.g., mlx5_0) to the event
  - Otherwise → emit the event with no NIC entity attached
  - The handler does not drop events based on driver type. All shipped patterns are mlx5-specific by regex (e.g., they require mlx5_core or mlx5_cmd_out_err in the message), so non-mlx5 matches are not expected for built-in patterns. Custom patterns must follow the same convention; see Section 8.3.
- Generate HealthEvent with:
  - Agent = "syslog-health-monitor"
  - CheckName = "SysLogsNICDriverError"
  - EntityType = "NIC", EntityValue = <device_name>
- Send event to Platform Connector (no local aggregation)
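The processing steps above can be sketched as follows. This is a minimal illustration, not the handler's actual code: the pattern names, the in-code PATTERNS table (the design loads patterns from TOML), the "insufficient power" regex, and the Pattern and Message event fields are assumptions made for the example. Only the command-timeout regex, Agent, CheckName, and EntityType values come from this document.

```python
import re

# Illustrative pattern table: (name, compiled regex, is_fatal).
# In the design these are loaded from a TOML file, not defined in code.
PATTERNS = [
    ("cmd_timeout_resource_leak",
     re.compile(r"mlx5_core.*timeout\. Will cause a leak of a command resource"),
     True),
    ("insufficient_power",  # hypothetical regex for a Non-Fatal pattern
     re.compile(r"mlx5_core.*insufficient power"),
     False),
]

# PCI address in domain:bus:device.function form, e.g. 0000:3b:00.0
BDF_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def process_line(line, resolve_nic=lambda bdf: None):
    """Return an event dict for a matching kernel log line, else None.

    resolve_nic: callable mapping a BDF to a NIC name (or None). It stands in
    for the best-effort sysfs lookup described in the steps above.
    """
    for name, regex, is_fatal in PATTERNS:
        if regex.search(line):
            break
    else:
        return None  # no pattern matched: skip the line

    event = {
        "Agent": "syslog-health-monitor",
        "CheckName": "SysLogsNICDriverError",
        "IsFatal": is_fatal,
        "Pattern": name,     # hypothetical field
        "Message": line,     # hypothetical field
    }
    match = BDF_RE.search(line)
    device = resolve_nic(match.group(0)) if match else None
    if device is not None:
        # Enrichment succeeded: attach the NIC entity.
        event["EntityType"] = "NIC"
        event["EntityValue"] = device
    return event
```

Note that a failed lookup never suppresses the event; the line is still reported, just without a NIC entity.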
Why live sysfs lookup instead of metadata collector: The metadata collector publishes relatively static GPU/NVSwitch inventory. The BDF itself is usually stable during a boot, but the kernel state used for NIC enrichment is not: /sys/bus/pci/devices/<BDF>/driver can disappear or reappear when mlx5_core is unloaded, reloaded, rebound, or recovered, and the exposed infiniband/* or net/* names can be recreated. The syslog handler therefore reads the current sysfs state at event-processing time. This lookup is best-effort enrichment only; if it fails, the raw syslog event is still emitted without a NIC entity.
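A best-effort lookup of this kind might look like the sketch below. The function name resolve_nic_entity and the choice to return the first infiniband/* name are assumptions for illustration; the sysfs_root parameter exists only so the function can be exercised against a temporary directory.

```python
import os


def resolve_nic_entity(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NIC name (e.g. mlx5_0) for a BDF, or None on any failure.

    The state is read at call time because the driver symlink and the
    infiniband/* names can change across driver reloads.
    """
    dev_dir = os.path.join(sysfs_root, bdf)
    try:
        # The driver entry is a symlink whose target's basename is the driver name.
        driver = os.path.basename(os.readlink(os.path.join(dev_dir, "driver")))
        if driver != "mlx5_core":
            return None
        names = sorted(os.listdir(os.path.join(dev_dir, "infiniband")))
    except OSError:
        # Device gone, driver unbound, or no infiniband directory:
        # the caller still emits the raw event without a NIC entity.
        return None
    return names[0] if names else None
```

Catching OSError broadly is deliberate here: every failure mode leads to the same outcome, an event without enrichment.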
5. Monitored Kernel Patterns
The NIC driver error handler monitors for the following patterns:
5.1 Pattern Table
Following gpud’s design, kernel log events are classified as Non-Fatal (IsFatal=false) by default. Only deterministic hardware state changes (port drops/flaps) that will cause workload failure are escalated to affect component health.
Full Log Line Examples:
Design Note: While gpud internally treats many of these as non-fatal, NVSentinel escalates deterministically fatal signals to Fatal (IsFatal=true) to trigger proactive remediation (REPLACE_VM) before workload failure cascades. Non-fatal signals remain as IsFatal=false for diagnostic correlation.
5.2 Pattern Classification
Patterns are classified according to their operational impact:
Key Principle: Kernel logs provide diagnostic context, not remediation triggers. The decision to drain/replace a node is based on actual port state (via link state detection), not log messages alone.
5.3 Diagnostic Commands
6. Event Correlation (via Health Events Analyzer)
Fatal syslog events are emitted immediately with their configured remediation action. The Health Events Analyzer can still correlate those raw events with port state changes to provide incident context. Non-Fatal diagnostic syslog events are emitted as IsFatal=false and can also be correlated with state changes for operator investigation.
6.1 Correlation Examples
The Health Events Analyzer can correlate kernel log events with port state events:
6.2 Noise Filtering
Some kernel messages can be filtered to reduce noise:
- ACCESS_REG failed: Common on systems with restricted PFs (DGX, Umbriel). Use --infiniband-exclude-devices to exclude problematic devices from monitoring.
- insufficient power: Often transient during BIOS/BMC power negotiation. Can be filtered by uptime context if needed.
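The uptime-based filter mentioned for insufficient power could be sketched as below. The 300-second grace period, the pattern name "insufficient_power", and both function names are assumptions chosen for the example, not values defined by this design.

```python
BOOT_GRACE_SECONDS = 300.0  # assumption: treat the first 5 minutes as boot-time


def read_uptime_seconds(path="/proc/uptime"):
    """Read system uptime; the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


def is_boot_time_noise(pattern_name, uptime_seconds, grace=BOOT_GRACE_SECONDS):
    """True if the event is an 'insufficient power' message seen shortly after boot."""
    return pattern_name == "insufficient_power" and uptime_seconds < grace
```

Only the one known-transient pattern is filtered; every other pattern passes through regardless of uptime.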
7. Repeat Failure Detection
The syslog-health-monitor does not run a local correlation engine. Repeat and cross-signal pattern detection is handled by the Health Events Analyzer:
- Raw Event Flow: Syslog-health-monitor sends raw mlx5_core events to Platform Connector → MongoDB
- Correlation Rules: Health Events Analyzer queries MongoDB and correlates syslog events with port state events
- Example: command timeout with resource leak at 10:00:01 is emitted immediately as Fatal; port state=DOWN at 10:00:05 can be correlated later to provide diagnostic context
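The time-window correlation in the example above can be sketched in memory as follows. The real analyzer queries MongoDB; the Event shape, the 30-second window, and the check name "NICLinkState" used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    node: str
    check: str      # e.g. "SysLogsNICDriverError" or a port-state check name
    message: str
    at: datetime


def correlate(syslog_events, port_events, window=timedelta(seconds=30)):
    """Pair each syslog event with port-state events on the same node
    that occurred within `window` of it (before or after)."""
    pairs = []
    for s in syslog_events:
        for p in port_events:
            if p.node == s.node and abs(p.at - s.at) <= window:
                pairs.append((s, p))
    return pairs
```

For the timeline above, a syslog event at 10:00:01 and a "NICLinkState" event at 10:00:05 on the same node fall inside the window and are returned as one pair; the pairing adds context only and does not change the severity of either event.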
7.1 Correlation Purpose
Kernel log events are emitted with their configured severity. The Health Events Analyzer can correlate them with port state events for diagnostic context:
Key Design Principle: Kernel log events for diagnostic patterns remain Non-Fatal (IsFatal=false). Component health changes are triggered by port state (link-state-detection), not by kernel logs.
7.2 Repeat Failure Detection Diagram
8. Configuration
8.1 Syslog Health Monitor Configuration
8.2 Health Events Analyzer Configuration (for NIC rules)
8.3 NIC Driver Pattern Configuration
NIC driver error patterns are defined in a TOML configuration file, allowing operators to disable, add, or reclassify patterns without code changes.
TOML Shape:
Fields:
Custom patterns should be NIC/mlx5-specific. Generic PCIe/AER patterns such as "PCIe Bus Error.*Fatal" are intentionally not shipped because they can match GPUs, NVMe, or root ports. The handler does not BDF-gate generic patterns; if you add one, ensure its regex includes an mlx5-specific prefix (mlx5_core, mlx5_cmd_out_err, etc.) so it cannot match other devices. This follows gpud’s same-decision approach in kmsg_matcher.go.
Operational Note: Operators can disable noisy patterns, add site-specific patterns, or reclassify severity (e.g., promote a Non-Fatal pattern to Fatal) by editing the TOML file and restarting the syslog-health-monitor pod. No code changes required.
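Operators adding custom patterns may want a quick pre-flight check for the mlx5-prefix convention. The sketch below is a hypothetical lint helper, not part of the monitor: validate_custom_pattern and the MLX5_PREFIXES tuple are assumptions (the document lists the prefixes only as "mlx5_core, mlx5_cmd_out_err, etc."), the substring test is a heuristic, and Python's regex syntax may differ in places from the engine the monitor uses.

```python
import re

# Assumption: partial list of prefixes taken from the text above.
MLX5_PREFIXES = ("mlx5_core", "mlx5_cmd_out_err")


def validate_custom_pattern(regex_source):
    """Return a list of problems with a custom pattern; empty means acceptable."""
    problems = []
    try:
        re.compile(regex_source)
    except re.error as exc:
        problems.append(f"invalid regex: {exc}")
    # Heuristic: the regex text itself must mention an mlx5-specific prefix,
    # otherwise it could match log lines from GPUs, NVMe, or root ports.
    if not any(prefix in regex_source for prefix in MLX5_PREFIXES):
        problems.append("regex has no mlx5-specific prefix and could match other devices")
    return problems
```

The generic "PCIe Bus Error.*Fatal" pattern is flagged by this check, while the same pattern prefixed with mlx5_core.* passes.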
9. Event Management
9.1 Event Construction
Events are emitted with severity based on their determinism of failure:
Example Event Fields (Fatal - command timeout resource leak):
Example Event Fields (Non-Fatal - module unplugged):
9.2 Event Purpose
Kernel log events serve as both direct failure signals and diagnostic context:
Key Design Principle: Deterministically fatal events in logs (command timeout resource leaks, unrecoverable hardware error) trigger REPLACE_VM directly. Non-fatal events (insufficient power, module unplugged) remain as IsFatal=false for diagnostic correlation.
10. Monitoring Scope and Limitations
10.1 What This Monitor CAN Detect
The syslog monitoring capability operates at the driver and kernel level only:
10.2 What This Monitor CANNOT Detect
Application-level logs and remote failures are out of scope:
10.3 Hardware Failures and Application Impact
The following table shows which hardware failures this monitor detects and how they may impact applications:
10.4 Event Severity Classification
Kernel log events are classified by their determinism of failure:
Key Principle: Deterministically fatal events in logs trigger REPLACE_VM. Diagnostic logs remain as Non-Fatal (IsFatal=false) for diagnostic context.
Appendix A: Quick Reference - Kernel Log Patterns
The following patterns are monitored in the kernel ring buffer (dmesg/kmsg):
Kernel Log Pattern Summary
Design Principle
Key Insight: Kernel logs for deterministic failures (command timeout resource leaks, etc.) are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Diagnostic logs (insufficient power, High Temperature, module unplugged) are Non-Fatal (IsFatal=false). State and counter conditions are also Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM.
Appendix B: Health Events Analyzer Rules for NIC Monitoring (Optional)
The following example rules show how the Health Events Analyzer could implement NIC-specific correlation logic. These rules are optional and follow the same pattern as the existing XID correlation rules.
Design Note: Following gpud’s design, the primary health determination is via port state monitoring (link-state-detection), not kernel log events or analyzer rules. These rules provide optional diagnostic correlation, not automatic remediation triggers.
B.1 Link Flap Detection Rule
B.2 Repeated NIC Degradation Escalation Rule
B.3 Configuration in values.yaml
These rules would be added to the health-events-analyzer Helm chart values:
References
Linux Kernel & Driver
- sysfs-class-infiniband (Linux Kernel)
- RHEL8 mlx5_core Stack Overflow (Red Hat)
- mlx5_core cmd.c - Command Interface (Linux Kernel) - command timeout resource leaks, ACCESS_REG failed
- mlx5_core health.c - Health Poller (Linux Kernel) - health compromise miss count, unrecoverable
- mlx5_core events.c - Async Events (Linux Kernel) - insufficient power, High Temperature, module unplugged
- PCIe AER HOWTO (Linux Kernel) - PCIe Bus Error log format and BDF identification