
Copyright © 2026, NVIDIA Corporation.


NIC Health Monitor: Syslog Detection



Table of Contents

  1. Overview
  2. Architecture
  3. Driver/Firmware Interface Foundations
  4. Integration with Syslog Health Monitor
  5. Monitored Kernel Patterns
  6. Event Correlation (via Health Events Analyzer)
  7. Repeat Failure Detection
  8. Configuration
  9. Event Management
  10. Monitoring Scope and Limitations
    • 10.4 Event Severity Classification
  • Appendix A: Quick Reference - Kernel Log Patterns
  • Appendix B: Health Events Analyzer Rules for NIC Monitoring

Related Documents:

  • Link State Detection - UP/DOWN state monitoring
  • Link Counter Detection - Counter-based degradation monitoring

1. Overview

1.1 Problem Statement

NIC hardware polling monitors (state checks, counter reads) can miss critical failures that occur at the driver/firmware interface level. A NIC can appear healthy (link UP, counters normal) while the driver is completely unable to communicate with the firmware, leading to silent workload failures.

1.2 Scope of Syslog Detection

This document covers the Syslog Health Monitor component for NIC driver error monitoring, which detects:

  • Driver/Firmware communication failures - command timeouts ("timeout. Will cause a leak of a command resource"), firmware hangs
  • Hardware health check failures - health poll failed, unrecoverable errors
  • Driver/firmware-level PCIe events - surfaced via mlx5_core (e.g., insufficient power); fatal PCIe link loss / device disappearance is covered by link-state-detection
  • Thermal and power issues - High temperature warnings, insufficient power
  • Network watchdog timeouts - TX queue stalls (Non-Fatal diagnostic context; auto-recovery via mlx5e_tx_timeout)

1.3 Why Syslog Monitoring is Essential

┌──────────────────────────────────────────────────────────────────────────────┐
│ EXAMPLE: Firmware Failure (NIC Health Monitor sees NOTHING wrong)            │
│                                                                              │
│ NIC Health Monitor:    state=ACTIVE, phys_state=LinkUp   <── Looks healthy!  │
│ Syslog Health Monitor: "timeout. Will cause a leak..."   <── DETECTS FAILURE │
│                                                                              │
│ Without monitoring kernel logs, this node would appear healthy while unable  │
│ to process any NIC commands. Workloads would fail with mysterious timeouts.  │
└──────────────────────────────────────────────────────────────────────────────┘

1.4 Severity Model for Kernel Log Events

Following gpud’s design and forensic analysis of mlx5_core telemetry, kernel log events are classified based on their determinism of failure:

| Severity | Meaning | Example |
|---|---|---|
| Fatal | Deterministically fatal hardware/driver state | timeout. Will cause a leak, unrecoverable hardware error |
| Non-Fatal | Diagnostic context or transient issues | insufficient power, High Temperature, ACCESS_REG failed, module unplugged |

Key Design Principle: Only deterministically fatal events in the logs are raised as Fatal (IsFatal=true). All other events are raised as Non-Fatal (IsFatal=false) to provide diagnostic context without triggering immediate remediation.

1.5 Syslog Detection Overview Diagram

┌────────────────────────────────────────────────────────────────────────────┐
│ SYSLOG DETECTION FLOW │
├────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ DATA SOURCE │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ journald / /var/log/journal │ │
│ │ ├── Kernel ring buffer (dmesg) │ │
│ │ │ ├── mlx5_core driver messages │ │
│ │ │ ├── mlx5_core driver/firmware messages │ │
│ │ │ └── Network watchdog messages │ │
│ │ └── Systemd unit logs │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ SYSLOG HEALTH MONITOR (Event-driven) │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ CHECK: SysLogsNICDriverError │ │
│ │ │ │
│ │ FATAL PATTERNS (IsFatal=true, RecommendedAction=REPLACE_VM): │ │
│ │ ├── mlx5_core command timeout leak → FATAL (control plane broken) │ │
│ │ ├── mlx5_core health poll failed → FATAL (firmware dead) │ │
│ │ └── mlx5_core unrecoverable → FATAL (hardware failure) │ │
│ │ │ │
│ │ NON-FATAL PATTERNS (IsFatal=false, diagnostic context): │ │
│ │ ├── High Temperature → Non-Fatal (thermal) │ │
│ │ ├── Detected insufficient power → Non-Fatal (power negotiation)│ │
│ │ ├── module unplugged → Non-Fatal (SFP unplugged) │ │
│ │ ├── ACCESS_REG.*failed → Non-Fatal (monitoring noise) │ │
│ │ └── NETDEV WATCHDOG.*mlx5_core → Non-Fatal (TX stall) │ │
│ │ │ │
│ │ EXISTING CHECKS (GPU): │ │
│ │ ├── SysLogsXIDError │ │
│ │ ├── SysLogsSXIDError │ │
│ │ └── SysLogsGPUFallenOff │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ RAW EVENTS → PLATFORM CONNECTOR → MongoDB │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ HEALTH EVENTS ANALYZER (Correlation Rules) │ │
│ ├──────────────────────────────────────────────────────────────────────┤ │
│ │ • Fatal kernel logs trigger immediate REPLACE_VM │ │
│ │ • Non-fatal logs correlated with port state for diagnostics │ │
│ │ • Provides diagnostic context for operator investigation │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────┘

2. Architecture

2.1 Design Rationale: NVSentinel’s “Report Raw, Correlate Centrally” Pattern

The syslog-health-monitor’s NIC check follows NVSentinel’s established architectural pattern where:

  1. Health Monitors (DaemonSets) report raw events as-is to the Platform Connector
  2. Health Events Analyzer (Centralized Deployment) performs all correlation, aggregation, and pattern detection
  3. MongoDB serves as the source of truth for event history and correlation queries

| Architectural Principle | Implementation | Purpose |
|---|---|---|
| Raw Event Reporting | Each kernel log match → immediate event | Enables centralized correlation with full historical context |
| Centralized Correlation | Health Events Analyzer MongoDB pipelines | Flexible, configurable rules without monitor code changes |
| Temporal Correlation | Analyzer rules with time windows | Correlates command timeout + link_down within seconds |

2.2 Component Responsibilities

| Component | Responsibility | What It Does NOT Do |
|---|---|---|
| Syslog Health Monitor | Watch journald for NIC driver errors, send raw events | Pattern correlation, burst detection |
| Health Events Analyzer | Correlate events, detect patterns, escalate severity | Direct log access |

2.3 NIC Driver Error Check Data Flow

Reads:
└── journald → Kernel log entries (via existing syslog-health-monitor infrastructure)

NEW CHECK: SysLogsNICDriverError
  Pattern matching for:
  ├── mlx5_core command timeout leak → Firmware communication failure (FATAL)
  ├── mlx5_core health poll failed   → NIC health check failed (FATAL)
  ├── mlx5_core unrecoverable        → Hardware in error state (FATAL)
  ├── module unplugged               → Transceiver removed (Non-Fatal)
  ├── NETDEV WATCHDOG.*mlx5_core     → TX stall, auto-recovery (Non-Fatal)
  ├── High Temperature               → Thermal warning (Non-Fatal)
  ├── ACCESS_REG.*failed             → Monitoring noise (Non-Fatal)
  └── Detected insufficient power    → Power negotiation status (Non-Fatal)
  Implementation:
  └── Add NICDriverErrorHandler to existing syslog-health-monitor
      (similar to existing XIDHandler, SXIDHandler, GPUFallenHandler)

Emits: HealthEvents (IsFatal=true/false) → Platform Connector → MongoDB

2.4 System Context

┌────────────────────────────────────────────────────────────────────────────────┐
│ NVSentinel NIC SYSLOG MONITORING ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ PER-NODE DAEMONSET │ │
│ ├──────────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ NIC HEALTH MONITOR │ │ SYSLOG HEALTH MONITOR │ │ │
│ │ │ ══════════════════ │ │ ════════════════════ │ │ │
│ │ │ │ │ │ │ │
│ │ │ DATA SOURCES: │ │ DATA SOURCE: │ │ │
│ │ │ • /sys/class/infiniband/ │ │ • journald / /var/log/journal │ │ │
│ │ │ • /sys/class/net/ │ │ │ │ │
│ │ │ │ │ NEW CHECK: │ │ │
│ │ │ CHECKS: │ │ • SysLogsNICDriverError │ │ │
│ │ │ • InfiniBandStateCheck │ │ (mlx5_core command timeout, │ │ │
│ │ │ • InfiniBandDegradationChk │ │ health poll failed, │ │ │
│ │ │ • EthernetStateCheck │ │ unrecoverable) │ │ │
│ │ │ • EthernetDegradationCheck │ │ │ │ │
│ │ │ │ │ EXISTING CHECKS: │ │ │
│ │ │ │ │ • SysLogsXIDError │ │ │
│ │ │ │ │ • SysLogsSXIDError │ │ │
│ │ │ │ │ • SysLogsGPUFallenOff │ │ │
│ │ └─────────────┬───────────────┘ └───────────────┬─────────────────┘ │ │
│ │ │ │ │ │
│ └────────────────┼──────────────────────────────────┼─────────────────────┘ │
│ │ │ │
│ └────────────────┬─────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ PLATFORM CONNECTOR │ │
│ │ ══════════════════ │ │
│ │ • Receives raw events │ │
│ │ • Persists to MongoDB │ │
│ │ • Triggers downstream │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ HEALTH EVENTS ANALYZER │ │
│ │ ══════════════════════ │ │
│ │ │ │
│ │ OPTIONAL CORRELATION RULES: │ │
│ │ • Correlate kernel log warnings with port state events │ │
│ │ • Provide diagnostic context for operators │ │
│ │ │ │
│ │ NOTE: Port state determines health (not kernel logs) │ │
│ │ │ │
│ │ OUTPUT: │ │
│ │ • Diagnostic correlation for operator investigation │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────┘

2.5 Packet Path vs Driver/Firmware Interface Coverage

The combined monitoring approach ensures no blind spots by covering both layers:

┌─────────────────────────────────────────────────────────────────┐
│                          Coverage Map                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ PACKET PATH (Data Moving Through NIC):                          │
│ ├── NIC Health Monitor State Check: Binary UP/DOWN              │
│ └── NIC Health Monitor Degradation Check: Error rates           │
│                                                                 │
│ DRIVER/FIRMWARE INTERFACE (OS <--> NIC Communication):          │
│ └── Syslog Health Monitor: mlx5_core driver/firmware events     │
│                                                                 │
│ Driver/firmware failures can occur while packet path appears    │
│ healthy.                                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Why this matters: Monitors that only check link state miss driver/firmware interface failures. The result is a silent failure: the hardware appears active, but the driver can no longer issue commands to the NIC.


3. Driver/Firmware Interface Foundations

The NVIDIA Mellanox ConnectX series (ConnectX-5 through ConnectX-7) functions as a set of sophisticated, autonomous offload engines managing RDMA and complex flow steering. The host-to-NIC interaction is governed by a split-driver model: mlx5_core handles device initialization, health monitoring, and the command interface, while mlx5_ib and mlx5_en operate as protocol clients.

3.1 The Command Interface (CMD IF)

The primary control pathway is the Command Interface. The driver writes command blocks (e.g., CREATE_MKEY, MODIFY_QP) to PCI BAR-mapped memory and notifies the firmware via a “doorbell” register.

  • Fatality Mechanism: If the firmware hangs, the driver’s watchdog expires (typically 60s) and logs "timeout. Will cause a leak of a command resource".
  • Resource Leaks: Upon timeout, the driver intentionally “leaks” the command’s DMA-mapped memory. Freeing it could lead to silent memory corruption if the firmware later writes to that physical address. This makes cmd_exec timeouts irreversibly fatal to the driver’s device management capability.

3.2 Driver/Firmware Communication Diagram

┌────────────────────────────────────────────────────────────────────────────┐
│ DRIVER/FIRMWARE COMMUNICATION │
├────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ HOST (Linux Kernel) │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ mlx5_core │ │ mlx5_ib │ │ mlx5_en │ │ │
│ │ │ (Core Driver) │ │ (IB Protocol) │ │ (Ethernet) │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ • Initialization│ │ • RDMA ops │ │ • TCP/IP ops │ │ │
│ │ │ • Health Monitor│ │ • QP management │ │ • Packet I/O │ │ │
│ │ │ • Command IF │ │ │ │ │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │ │
│ │ │ │ │ │ │
│ │ └────────────────────┼────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────┐ │ │
│ │ │ COMMAND INTERFACE │ │ │
│ │ │ ═══════════════════ │ │ │
│ │ │ │ │ │
│ │ │ 1. Write command to │ │ │
│ │ │ PCI BAR memory │ │ │
│ │ │ 2. Ring doorbell │ │ │
│ │ │ 3. Wait for ownership bit │ │ │
│ │ │ (60s timeout) │ │ │
│ │ │ │ │ │
│ │ │ IF timeout: │ │ │
│ │ │ └── "timeout. Will cause..." │ │ │
│ │ │ logged to dmesg │ │ │
│ │ │ (IRREVERSIBLY FATAL) │ │ │
│ │ └───────────────┬─────────────────┘ │ │
│ └──────────────────────────────┼───────────────────────────────────────┘ │
│ │ │
│ │ PCIe │
│ │ │
│ ┌──────────────────────────────┼───────────────────────────────────────┐ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────────────────────────────┐ │ │
│ │ │ NIC FIRMWARE │ │ │
│ │ │ ════════════ │ │ │
│ │ │ │ │ │
│ │ │ • Processes commands (CREATE_MKEY, MODIFY_QP, etc.) │ │ │
│ │ │ • Toggles ownership bit on completion │ │ │
│ │ │ • Health Syndrome register (polled by driver) │ │ │
│ │ │ • Asynchronous Event Queue (EQ) for port changes, thermal │ │ │
│ │ │ │ │ │
│ │ │ IF firmware hangs: │ │ │
│ │ │ └── Ownership bit never toggles │ │ │
│ │ │ Driver watchdog fires → "timeout. Will cause..." │ │ │
│ │ │ │ │ │
│ │ └────────────────────────────────────────────────────────────────┘ │ │
│ │ CONNECTX NIC HARDWARE │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────┘

3.3 Asynchronous Event Queue (EQ) and Health Poller

The NIC implements a dedicated ring buffer (EQ) for control plane events (e.g., thermal excursions, port changes). Additionally, a background Health Poller thread periodically (1s) reads a “Health Syndrome” register. This dual mechanism ensures that even if the interrupt system fails, the driver detects unrecoverable hardware syndromes.
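
The poller's miss-count behaviour can be pictured with the toy model below. It is a sketch under stated assumptions, not the driver's code: the `HealthPoller` class, the `MAX_MISSES` value, and the idea of comparing a firmware heartbeat counter between polls are all hypothetical, chosen only to mirror the "reached miss count" wording of the health_poll_failed log line.

```python
from dataclasses import dataclass

# Assumption for illustration only; the real threshold is defined by the driver.
MAX_MISSES = 3


@dataclass
class HealthPoller:
    """Toy model of a periodic health poll (hypothetical, not mlx5_core code)."""

    prev_counter: int = -1  # last heartbeat value seen; -1 means "none yet"
    misses: int = 0         # consecutive polls with no heartbeat progress

    def poll(self, counter: int, syndrome: int) -> str:
        """Process one poll tick and return 'ok', 'syndrome', or 'miss_count'."""
        # Assumption: a non-zero syndrome means the firmware reported an error.
        if syndrome != 0:
            return "syndrome"
        # A heartbeat that did not advance since the last poll counts as a miss.
        if counter == self.prev_counter:
            self.misses += 1
        else:
            self.misses = 0
        self.prev_counter = counter
        return "miss_count" if self.misses >= MAX_MISSES else "ok"
```

In this model a stalled heartbeat is only reported after several consecutive misses, which is why a single slow poll does not raise an event.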

3.4 The mlx5_core Command Timeout Resource Leak

This is a severe driver-level error indicating Driver/Firmware Interface failure. (Reference: RHEL mlx5_core issues)

Mechanism:

  1. mlx5_core driver sends command to NIC firmware via mailbox
  2. Driver waits for firmware to toggle “Ownership bit” indicating completion
  3. If firmware has crashed/hung, bit never toggles
  4. Driver’s watchdog expires → logs "timeout. Will cause a leak of a command resource" to dmesg

Consequence: Usually requires driver reload (systemctl restart openibd) or full node reboot. This is always Fatal—workload cannot proceed if it cannot issue commands to the NIC. The driver intentionally “leaks” command resources because it cannot safely reclaim memory without risk of corruption if the firmware eventually responds.
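
To make the "leak instead of free" decision concrete, here is a minimal simulation. Everything in it is hypothetical (the `CommandSlots` class, the `fw_done` callback standing in for the ownership bit, the slot pool); it only illustrates why a timed-out command permanently removes a slot from circulation rather than returning it to the pool.

```python
import time
from typing import Callable


class CommandSlots:
    """Toy model of a fixed pool of command slots (not the driver's code)."""

    def __init__(self, count: int) -> None:
        self.free = set(range(count))
        self.leaked = set()

    def execute(self, fw_done: Callable[[], bool], timeout_s: float,
                poll_s: float = 0.001) -> bool:
        """Run one command; return True on completion, False on timeout."""
        slot = self.free.pop()  # raises KeyError once every slot is in use or leaked
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if fw_done():            # stands in for the ownership bit toggling
                self.free.add(slot)  # safe to reuse: firmware is done with it
                return True
            time.sleep(poll_s)
        # Timeout: the firmware may still write to this slot later, so it is
        # never returned to the free pool. This is the intentional "leak".
        self.leaked.add(slot)
        return False
```

Each timeout shrinks the usable pool, which matches the document's point that these timeouts are not recoverable without a driver reload or reboot.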


4. Integration with Syslog Health Monitor

NIC driver and firmware errors are monitored by adding a new check to the existing syslog-health-monitor DaemonSet. This follows the established NVSentinel pattern where all kernel log monitoring is centralized in the syslog-health-monitor.

4.1 Handler Architecture

The syslog-health-monitor already has a modular handler architecture for different check types:

| Check Name | Handler | Purpose |
|---|---|---|
| XIDErrorCheck | XID Handler | GPU XID errors |
| SXIDErrorCheck | SXID Handler | NVSwitch SXID errors |
| GPUFallenOffCheck | GPU Fallen Handler | GPU disappeared from bus |
| NICDriverErrorCheck | NIC Driver Handler | NEW: NIC driver/firmware errors |

4.2 Configuration (values.yaml)

syslog-health-monitor:
  enabledChecks:
    - SysLogsXIDError
    - SysLogsSXIDError
    - SysLogsGPUFallenOff
    - SysLogsNICDriverError  # NEW: NIC driver/firmware errors

4.3 Verification Command (Synthetic Fault Injection)

# Generate a synthetic kernel message (requires root).
# This injects a fake mlx5_core error into dmesg.
$ echo "<3>mlx5_core 0000:3b:00.0: ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource" > /dev/kmsg

4.4 NIC Driver Error Handler Algorithm

The NICDriverErrorHandler follows the same pattern as the existing XID handler.

Log Line Processing Steps:

  1. Receive kernel log line from journald stream
  2. Match against NIC error patterns (see Section 5 for pattern list):
    • Pattern definitions (regex, severity, recommended action, description) are owned by Go code
    • The TOML file only selects supported pattern names and enables/disables them (see Section 8.3)
    • Check if line matches any regex pattern (e.g., mlx5_core.*timeout\. Will cause a leak of a command resource)
    • If no match → skip line, return nil
  3. Determine severity:
    • Look up the pattern definition in code (summarized in Section 5.2)
    • Set IsFatal accordingly
  4. Extract PCI address from message using regex (e.g., 0000:3b:00.0)
  5. Best-effort NIC entity enrichment by resolving /sys/bus/pci/devices/<BDF>/driver symlink
    • If the driver is mlx5_core → attach NIC entity (e.g., mlx5_0) to the event
    • Otherwise → emit the event with no NIC entity attached
    • The handler does not drop events based on driver type. Supported patterns are mlx5-specific by regex (e.g., they require mlx5_core or mlx5_cmd_out_err in the message), so non-mlx5 matches are not expected.
  6. Generate HealthEvent with:
    • Agent = "syslog-health-monitor"
    • CheckName = "SysLogsNICDriverError"
    • EntityType = "NIC", EntityValue = <device_name>
  7. Send event to Platform Connector (no local aggregation)

Why live sysfs lookup instead of metadata collector: The metadata collector publishes relatively static GPU/NVSwitch inventory. The BDF itself is usually stable during a boot, but the kernel state used for NIC enrichment is not: /sys/bus/pci/devices/<BDF>/driver can disappear or reappear when mlx5_core is unloaded, reloaded, rebound, or recovered, and the exposed infiniband/* or net/* names can be recreated. The syslog handler therefore reads the current sysfs state at event-processing time. This lookup is best-effort enrichment only; if it fails, the raw syslog event is still emitted without a NIC entity.
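
The processing steps above can be sketched as follows. This is an illustration in Python, not the Go handler itself: the two-entry `PATTERNS` list, the `handle_line` and `resolve_nic_entity` names, the `sysfs_root` parameter, and the choice of the first entry under `infiniband/` as the device name are assumptions made for the example.

```python
import os
import re
from typing import Optional

# Two illustrative patterns; the real registry is owned by the monitor's Go code.
PATTERNS = [
    ("cmd_exec_timeout",
     re.compile(r"mlx5_core.*timeout\. Will cause a leak of a command resource"),
     True),
    ("module_unplugged",
     re.compile(r"mlx5_core.*Port module event.*Cable unplugged"),
     False),
]
# PCI address in domain:bus:device.function form, e.g. 0000:3b:00.0
BDF_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def resolve_nic_entity(bdf: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Best-effort lookup of the NIC name for a PCI address; None on any failure."""
    dev = os.path.join(sysfs_root, "bus/pci/devices", bdf)
    try:
        driver = os.path.basename(os.readlink(os.path.join(dev, "driver")))
        if driver != "mlx5_core":
            return None
        names = sorted(os.listdir(os.path.join(dev, "infiniband")))
    except OSError:
        return None
    return names[0] if names else None


def handle_line(line: str, sysfs_root: str = "/sys") -> Optional[dict]:
    """Turn one kernel log line into an event dict, or None if nothing matches."""
    for name, regex, is_fatal in PATTERNS:
        if regex.search(line):
            break
    else:
        return None  # no pattern matched: skip the line
    event = {
        "Agent": "syslog-health-monitor",
        "CheckName": "SysLogsNICDriverError",
        "Pattern": name,
        "IsFatal": is_fatal,
        "Message": line,
        "EntitiesImpacted": [],
    }
    match = BDF_RE.search(line)
    entity = resolve_nic_entity(match.group(0), sysfs_root) if match else None
    if entity:  # enrichment is optional; the event is emitted either way
        event["EntitiesImpacted"].append({"EntityType": "NIC", "EntityValue": entity})
    return event
```

Note that a failed sysfs lookup only leaves `EntitiesImpacted` empty; it never suppresses the event, which matches the best-effort enrichment described above.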


5. Monitored Kernel Patterns

The NIC driver error handler monitors for the following patterns:

5.1 Pattern Table

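Following gpud’s design, kernel log events are classified as Non-Fatal (IsFatal=false) by default. Only the three deterministically fatal patterns (cmd_exec_timeout, health_poll_failed, unrecoverable_err) are raised as Fatal (IsFatal=true). Deterministic hardware state changes such as port drops and flaps are detected separately by link state detection.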

| Event Name | Regex Pattern | Severity | Kernel Source | Systemic Impact |
|---|---|---|---|---|
| pci_power_insufficient | mlx5_core.*Detected insufficient power on the PCIe slot | Non-Fatal | events.c#L295-L299 | Power negotiation issue. Often transient during boot BIOS/BMC handshake. |
| port_module_high_temp | mlx5_core.*Port module event.*High Temperature | Non-Fatal | events.c#L252-L254 | Thermal warning. May indicate cooling issue or impending PHY throttling. |
| cmd_exec_timeout | mlx5_core.*timeout\. Will cause a leak of a command resource | Fatal | cmd.c | Control plane broken. Driver cannot manage device. Always Fatal. |
| health_poll_failed | mlx5_core.*device's health compromised.*reached miss count | Fatal | health.c | Firmware heartbeat lost. Device is non-functional. Always Fatal. |
| unrecoverable_err | mlx5_core.*unrecoverable hardware error | Fatal | health.c | Hardware admission of failure. Always Fatal. |
| access_reg_failed | mlx5_cmd_out_err.*ACCESS_REG.*failed | Non-Fatal | cmd.c | Monitoring tool conflict on restricted PFs. Non-Fatal noise. |
| netdev_watchdog | NETDEV WATCHDOG:.*mlx5_core.*transmit queue.*timed out | Non-Fatal | sch_generic.c (generic kernel mechanism) | TX queue stall with auto-recovery via mlx5e_tx_timeout. Non-Fatal. |
| module_unplugged | mlx5_core.*Port module event.*Cable unplugged | Non-Fatal | events.c | SFP/transceiver unplugged. Informational (though port will be DOWN). |

Full Log Line Examples:

| Event Name | Full Log Line Example |
|---|---|
| pci_power_insufficient | mlx5_core 0000:12:00.0: mlx5_pcie_event:299: Detected insufficient power on the PCIe slot (27W). |
| port_module_high_temp | mlx5_core 0000:5c:00.0: Port module event[error]: module 0, Cable error, High Temperature |
| cmd_exec_timeout | mlx5_core 0000:03:00.0: wait_func:964:(pid 112): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource |
| health_poll_failed | mlx5_core 0000:d2:00.0: poll_health:174: device's health compromised - reached miss count. |
| unrecoverable_err | mlx5_core: INFO: synd 0x8: unrecoverable hardware error. |
| access_reg_failed | mlx5_cmd_out_err:838: ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2) |
| netdev_watchdog | NETDEV WATCHDOG: eth0 (mlx5_core): transmit queue 0 timed out |
| module_unplugged | mlx5_core 0000:12:00.0: Port module event: module 0, Cable unplugged |
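
As a quick sanity check, the regexes and severities from the pattern table can be transcribed into a small lookup and run against the example log lines. The sketch below is in Python for illustration only; the actual definitions live in the monitor's Go code, and the `NIC_PATTERNS` and `classify` names are hypothetical.

```python
import re
from typing import Optional, Tuple

# (event name, regex, is_fatal), transcribed from the pattern table above.
NIC_PATTERNS = [
    ("pci_power_insufficient",
     r"mlx5_core.*Detected insufficient power on the PCIe slot", False),
    ("port_module_high_temp",
     r"mlx5_core.*Port module event.*High Temperature", False),
    ("cmd_exec_timeout",
     r"mlx5_core.*timeout\. Will cause a leak of a command resource", True),
    ("health_poll_failed",
     r"mlx5_core.*device's health compromised.*reached miss count", True),
    ("unrecoverable_err",
     r"mlx5_core.*unrecoverable hardware error", True),
    ("access_reg_failed",
     r"mlx5_cmd_out_err.*ACCESS_REG.*failed", False),
    ("netdev_watchdog",
     r"NETDEV WATCHDOG:.*mlx5_core.*transmit queue.*timed out", False),
    ("module_unplugged",
     r"mlx5_core.*Port module event.*Cable unplugged", False),
]
COMPILED = [(name, re.compile(rx), fatal) for name, rx, fatal in NIC_PATTERNS]


def classify(line: str) -> Optional[Tuple[str, bool]]:
    """Return (event name, is_fatal) for the first matching pattern, else None."""
    for name, regex, fatal in COMPILED:
        if regex.search(line):
            return name, fatal
    return None
```

For example, `classify("NETDEV WATCHDOG: eth0 (mlx5_core): transmit queue 0 timed out")` yields the `netdev_watchdog` entry with `is_fatal` set to `False`, while a line from an unrelated driver yields `None`.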

Design Note: While gpud internally treats many of these as non-fatal, NVSentinel escalates deterministically fatal signals to Fatal (IsFatal=true) to trigger proactive remediation (REPLACE_VM) before workload failure cascades. Non-fatal signals remain as IsFatal=false for diagnostic correlation.

5.2 Pattern Classification

Patterns are classified according to their operational impact:

| Category | Patterns | Severity | Recommended Action |
|---|---|---|---|
| Always Fatal (Device) | command timeout with resource leak, health poll failed, unrecoverable | Fatal | REPLACE_VM |
| Non-Fatal / Evidence | insufficient power, High Temperature, ACCESS_REG failed, module unplugged, NETDEV WATCHDOG | Non-Fatal | NONE (Diagnostic) |

Key Principle: Non-Fatal kernel logs provide diagnostic context, not remediation triggers. Apart from the Always Fatal patterns above, the decision to drain or replace a node is based on actual port state (via link state detection), not on log messages alone.

5.3 Diagnostic Commands

# Check for driver/firmware errors in the kernel log
$ dmesg | grep -E "(mlx5_core|PCIe|AER)"
# Example output:
# [ 42.123] mlx5_core 0000:3b:00.0: ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource

# Check journald for NIC errors
$ journalctl -k | grep -E "(mlx5_core|PCIe)"

# Watch for real-time NIC driver messages
$ journalctl -k -f | grep --line-buffered mlx5_core

6. Event Correlation (via Health Events Analyzer)

Fatal syslog events are emitted immediately with their configured remediation action. The Health Events Analyzer can still correlate those raw events with port state changes to provide incident context. Non-Fatal diagnostic syslog events are emitted as IsFatal=false and can also be correlated with state changes for operator investigation.

6.1 Correlation Examples

The Health Events Analyzer can correlate kernel log events with port state events:

| Kernel Log Event | Port State Event | Correlation Insight |
|---|---|---|
| High Temperature + health poll failed | Port DOWN | Thermal-induced failure |
| command timeout with resource leak | Port DOWN | Driver/firmware failure |
| insufficient power | Port DOWN | Power delivery issue |

6.2 Noise Filtering

Some kernel messages can be filtered to reduce noise:

  • ACCESS_REG failed: Common on systems with restricted PFs (DGX, Umbriel). Use --infiniband-exclude-devices to exclude problematic devices from monitoring.
  • insufficient power: Often transient during BIOS/BMC power negotiation. Can be filtered by uptime context if needed.
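
A filter on uptime context could look like the sketch below. The function names, the 300-second grace period, and the use of `/proc/uptime` as the uptime source are assumptions for illustration; the document does not specify where or how such a filter is applied.

```python
def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot from a Linux /proc/uptime style file."""
    with open(path) as f:
        return float(f.read().split()[0])


def should_suppress(pattern_name: str, uptime_seconds: float,
                    boot_grace_seconds: float = 300.0) -> bool:
    """Suppress only insufficient-power events seen shortly after boot.

    The grace period is a hypothetical value; all other patterns always pass.
    """
    return (pattern_name == "pci_power_insufficient"
            and uptime_seconds < boot_grace_seconds)
```

Restricting the filter to a single pattern name keeps Fatal patterns such as cmd_exec_timeout unaffected, regardless of when they occur.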

7. Repeat Failure Detection

Repeat and cross-signal pattern detection is handled centrally by the Health Events Analyzer rather than by a correlation engine local to each node:

  1. Raw Event Flow: Syslog-health-monitor sends raw mlx5_core events to Platform Connector → MongoDB
  2. Correlation Rules: Health Events Analyzer queries MongoDB and correlates syslog events with port state events
  3. Example: command timeout with resource leak at 10:00:01 is emitted immediately as Fatal; port state=DOWN at 10:00:05 can be correlated later to provide diagnostic context
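
The time-window correlation in step 3 can be sketched in memory as follows. This is not the analyzer's implementation, which runs as MongoDB queries; the `Event` fields, the `correlate` function, and the 10-second default window are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    node: str
    source: str   # "syslog" or "port_state" (hypothetical labels)
    kind: str     # e.g. "cmd_exec_timeout" or "DOWN"
    at: datetime


def correlate(events: List[Event],
              window: timedelta = timedelta(seconds=10)) -> List[Tuple[Event, Event]]:
    """Pair each syslog event with port DOWN events on the same node that
    occur at or after it, within `window`."""
    pairs = []
    for s in events:
        if s.source != "syslog":
            continue
        for p in events:
            if (p.source == "port_state" and p.kind == "DOWN"
                    and p.node == s.node
                    and timedelta(0) <= p.at - s.at <= window):
                pairs.append((s, p))
    return pairs
```

With a command timeout at 10:00:01 and a port DOWN at 10:00:05 on the same node, `correlate` returns one pair; a DOWN event on another node, or one a minute later, is not paired.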

7.1 Correlation Purpose

Kernel log events are emitted with their configured severity. The Health Events Analyzer can correlate them with port state events for diagnostic context:

1. Diagnostic Correlation:
   ├── Purpose: Link kernel log events to port state changes
   ├── Input: Kernel log events + Port state events
   └── Output: Correlated diagnostic information for operators

2. Port Drop/Flap Detection (via Link State Detection):
   ├── Source: NIC Health Monitor (port state monitoring)
   ├── Condition: Port drops or flaps detected via sysfs
   └── Effect: Component health set to Unhealthy + REPLACE_VM

3. Sticky Window (similar to gpud):
   ├── Purpose: Keep component unhealthy for stabilization period after port recovery
   ├── Effect: Prevents confusing Unhealthy→Healthy flips
   └── Window: Configurable (e.g., 10 minutes after port recovery)

Key Design Principle: Kernel log events for diagnostic patterns remain Non-Fatal (IsFatal=false). Component health changes are triggered by port state (link-state-detection), not by kernel logs.
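
The sticky window from item 3 can be modelled as a small state holder. This is a sketch under assumptions: the `StickyHealth` class and its method names are hypothetical, and the 10-minute default simply reuses the example value given above.

```python
from datetime import datetime, timedelta
from typing import Optional


class StickyHealth:
    """Keeps a component unhealthy for `window` after its port recovers."""

    def __init__(self, window: timedelta = timedelta(minutes=10)) -> None:
        self.window = window
        self.port_up = True
        self.recovered_at: Optional[datetime] = None

    def observe(self, port_up: bool, now: datetime) -> None:
        """Record the latest port state observation."""
        if port_up and not self.port_up:
            self.recovered_at = now   # DOWN -> UP: start the sticky window
        if not port_up:
            self.recovered_at = None  # any DOWN cancels a pending recovery
        self.port_up = port_up

    def is_healthy(self, now: datetime) -> bool:
        """Healthy only if the port is up and the sticky window has elapsed."""
        if not self.port_up:
            return False
        if self.recovered_at is None:
            return True
        return now - self.recovered_at >= self.window
```

Because every DOWN observation resets the window, a flapping port stays Unhealthy until it has remained up for the full stabilization period.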

7.2 Repeat Failure Detection Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│ REPEAT FAILURE DETECTION (via Health Events Analyzer) │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIME EVENT SOURCE EVENT TYPE │
│ ──── ──────────── ────────── │
│ │
│ T+0:00 Syslog Monitor command timeout leak (mlx5_0) │
│ │ │
│ └──────────────────────────────────────────────────────────────────►│
│ │
│ T+0:05 NIC Health Monitor state=DOWN (mlx5_0_port1) │
│ │ │
│ └──────────────────────────────────────────────────────────────────►│
│ │
│ ┌─────────────────────────────────────┐ │
│ MongoDB ◄────────┤ RAW EVENTS PERSISTED │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ HEALTH EVENTS ANALYZER │ │
│ │ │ │
│ │ Rule: NICDriverErrorCorrelation │ │
│ │ │ │
│ │ MongoDB Aggregation (pseudo-query for illustration): │ │
│ │ db.health_events.find({ │ │
│ │ timestamp: {$gt: NOW() - 10s}, │ │
│ │ nodename: 'node-xyz', │ │
│ │ $or: [{message: /timeout. Will cause a leak/}, │ │
│ │ {message: /state=DOWN/}] │ │
│ │ }) │ │
│ │ │ │
│ │ Result: │ │
│ │ 1. command timeout leak at T+0:00 │ │
│ │ 2. state=DOWN at T+0:05 │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────┐ │ │
│ │ │ CORRELATION DETECTED │ │ │
│ │ │ │ │ │
│ │ │ Pattern: command timeout + link_down │ │ │
│ │ │ within 5 seconds │ │ │
│ │ │ │ │ │
│ │ │ OUTPUT: DIAGNOSTIC CORRELATION │ │ │
│ │ │ Message: "Driver warning correlated with │ │ │
│ │ │ port state change" │ │ │
│ │ │ Purpose: Context for operator investigation │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘

8. Configuration

8.1 Syslog Health Monitor Configuration

# values.yaml for syslog-health-monitor
syslog-health-monitor:
  enabledChecks:
    - SysLogsXIDError
    - SysLogsSXIDError
    - SysLogsGPUFallenOff
    - SysLogsNICDriverError  # NEW: NIC driver/firmware errors

8.2 Health Events Analyzer Configuration (for NIC rules)

# health-events-analyzer/values.yaml additions
enableRepeatedNICDriverErrorRule: true
enableRepeatedNICDegradationRule: true

8.3 NIC Driver Pattern Configuration

NIC driver error pattern definitions are hardcoded in the syslog-health-monitor. Configuration selects supported pattern names, enables/disables them, and can optionally override the processing strategy. Regexes, severity, recommended action, and descriptions are intentionally not user-configurable.

TOML Shape:

[[nicDriverDetection.patterns]]
name = "cmd_exec_timeout"
enabled = true
processingStrategy = "EXECUTE_REMEDIATION"

[[nicDriverDetection.patterns]]
name = "netdev_watchdog"
enabled = true
processingStrategy = "EXECUTE_REMEDIATION"

Fields:

| Field | Type | Description |
|---|---|---|
| name | string | Supported pattern identifier; must exist in the Go-owned pattern registry |
| enabled | bool | Whether the pattern is active; set false to disable without removing |
| processingStrategy | string | Optional per-pattern override (EXECUTE_REMEDIATION or STORE_ONLY); falls back to the monitor-wide strategy when omitted |

Operational Note: Operators can disable noisy supported patterns or use STORE_ONLY for observability-only collection. Severity and recommended action changes should be handled through platform connector / analyzer overrides rather than by changing monitor regex definitions.
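
The selection and fallback behavior described above can be sketched as follows. This is an illustrative sketch only: `SUPPORTED_PATTERNS` is a hypothetical stand-in for the hardcoded registry (populated only with pattern names that appear in this document), `resolve_patterns` is a hypothetical helper, and rejecting unknown names with an error is an assumption about how validation behaves.

```python
# Hypothetical stand-in for the hardcoded pattern registry; only names
# that appear in this document are listed.
SUPPORTED_PATTERNS = {
    "cmd_exec_timeout", "health_poll_failed", "unrecoverable_err",
    "netdev_watchdog", "port_module_high_temp", "pci_power_insufficient",
    "module_unplugged", "access_reg_failed",
}
VALID_STRATEGIES = {"EXECUTE_REMEDIATION", "STORE_ONLY"}


def resolve_patterns(pattern_configs, monitor_strategy):
    """Map each enabled pattern name to its effective processing strategy.

    `pattern_configs` is the already-parsed list of
    [[nicDriverDetection.patterns]] tables (one dict per table).
    """
    if monitor_strategy not in VALID_STRATEGIES:
        raise ValueError(f"invalid monitor-wide strategy: {monitor_strategy}")
    resolved = {}
    for cfg in pattern_configs:
        name = cfg["name"]
        if name not in SUPPORTED_PATTERNS:
            # Assumption: unknown names are a configuration error.
            raise ValueError(f"unsupported pattern name: {name}")
        if not cfg["enabled"]:
            continue  # disabled without being removed from the config
        # Per-pattern override falls back to the monitor-wide strategy.
        strategy = cfg.get("processingStrategy", monitor_strategy)
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"invalid strategy for {name}: {strategy}")
        resolved[name] = strategy
    return resolved
```

Note that only the name, the enabled flag, and the strategy pass through this step; regexes, severity, and recommended action stay with the registry entry and are never read from configuration.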


9. Event Management

9.1 Event Construction

Each event is emitted with a severity that reflects how deterministically the logged condition indicates a failure:

Example Event Fields (Fatal - command timeout resource leak):

| Field | Value |
|---|---|
| Agent | syslog-health-monitor |
| CheckName | SysLogsNICDriverError |
| ComponentClass | NIC |
| Message | "mlx5_core 0000:3b:00.0: ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource" |
| IsFatal | true |
| IsHealthy | false |
| RecommendedAction | REPLACE_VM |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}] |

Example Event Fields (Non-Fatal - module unplugged):

| Field | Value |
|---|---|
| Agent | syslog-health-monitor |
| CheckName | SysLogsNICDriverError |
| ComponentClass | NIC |
| Message | "mlx5_core 0000:12:00.0: Port module event: module 0, Cable unplugged" |
| IsFatal | false |
| IsHealthy | false |
| RecommendedAction | NONE (informational) |
| EntitiesImpacted | [{EntityType: "NIC", EntityValue: "mlx5_0"}] |

9.2 Event Purpose

Kernel log events serve as both direct failure signals and diagnostic context:

| Event Severity | Purpose | Action |
|---|---|---|
| Fatal | Deterministic hardware/driver failure | Immediate remediation via REPLACE_VM |
| Non-Fatal | Diagnostic information / Evidence | Logged for investigation; provides context for state-based failures |

Key Design Principle: Deterministically fatal events in logs (command timeout resource leaks, unrecoverable hardware error) trigger REPLACE_VM directly. Non-fatal events (insufficient power, module unplugged) remain as IsFatal=false for diagnostic correlation.
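
As a rough sketch of how the example event fields above could be assembled from a kernel log line: the `build_event` helper and the `pci_to_device` lookup are hypothetical, and how the monitor actually maps a PCI address to a device name such as mlx5_0 is not described in this document.

```python
import re

# Matches the PCI address that follows the driver name, e.g. "0000:3b:00.0".
PCI_ADDR_RE = re.compile(
    r"mlx5_core ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def build_event(message, is_fatal, pci_to_device):
    """Assemble the event fields shown in the example tables.

    `pci_to_device` is a hypothetical mapping from PCI address to NIC
    device name, supplied by the caller.
    """
    match = PCI_ADDR_RE.search(message)
    device = pci_to_device.get(match.group(1)) if match else None
    return {
        "Agent": "syslog-health-monitor",
        "CheckName": "SysLogsNICDriverError",
        "ComponentClass": "NIC",
        "Message": message,
        "IsFatal": is_fatal,
        "IsHealthy": False,
        # Fatal log events map directly to REPLACE_VM; others are informational.
        "RecommendedAction": "REPLACE_VM" if is_fatal else "NONE",
        "EntitiesImpacted": (
            [{"EntityType": "NIC", "EntityValue": device}] if device else []
        ),
    }
```

Both fatal and non-fatal events carry IsHealthy=false; only IsFatal and RecommendedAction differ between the two example tables.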


10. Monitoring Scope and Limitations

10.1 What This Monitor CAN Detect

The syslog monitoring capability operates at the driver and kernel level only:

| Data Source | Monitor | Detection Capability |
|---|---|---|
| journald/dmesg | Syslog Health Monitor | Driver errors (mlx5_core), firmware failures, PCIe events |

10.2 What This Monitor CANNOT Detect

Application-level logs and remote failures are out of scope:

| Category | Examples | Why Out of Scope |
|---|---|---|
| Application logs | RDMA library errors, framework failures | Not in kernel ring buffer |
| Remote node failures | Peer node crash, peer NIC hang | No local kernel/hardware signature |
| Fabric issues | Switch failures, routing black holes | Requires fabric-level monitoring |
| Subnet Manager issues | SM unreachable (may partially detect via LID changes) | Fabric management layer, not local NIC |

10.3 Hardware Failures and Application Impact

The following table shows which hardware failures this monitor detects and how they may impact applications:

| Hardware Failure | Detection Method | Potential Application Impact |
|---|---|---|
| Firmware freeze | Kernel log: timeout. Will cause a leak of a command resource | All NIC operations stall |
| Driver crash | Kernel log: device's health compromised - reached miss count | NIC becomes unusable |
| TX stall | Kernel log: NETDEV WATCHDOG | Network transmission fails |

10.4 Event Severity Classification

Kernel log events are classified by how deterministically they indicate a failure:

| Kernel Log Event | Severity | Purpose | Recommended Action |
|---|---|---|---|
| mlx5_core.*timeout. Will cause a leak | Fatal | Control plane broken | REPLACE_VM |
| mlx5_core.*device's health compromised | Fatal | Firmware heartbeat lost | REPLACE_VM |
| mlx5_core unrecoverable | Fatal | Hardware admission of failure | REPLACE_VM |
| NETDEV WATCHDOG | Non-Fatal | TX stall with auto-recovery | NONE |
| Detected insufficient power | Non-Fatal | Power negotiation status | NONE |
| High Temperature | Non-Fatal | Thermal warning | NONE |
| module unplugged | Non-Fatal | SFP unplugged | NONE |
| ACCESS_REG failed | Non-Fatal | Monitoring noise | NONE |

Key Principle: Deterministically fatal events in logs trigger REPLACE_VM. Diagnostic logs remain as Non-Fatal (IsFatal=false) for diagnostic context.


Appendix A: Quick Reference - Kernel Log Patterns

The following patterns are monitored in the kernel ring buffer (dmesg/kmsg):

Kernel Log Pattern Summary

| Pattern | Severity | Meaning | Action |
|---|---|---|---|
| mlx5_core.*timeout\. Will cause a leak of a command resource | Fatal | Firmware/driver command timeout | REPLACE_VM |
| mlx5_core.*device's health compromised.*reached miss count | Fatal | Health check missed | REPLACE_VM |
| mlx5_core.*unrecoverable hardware error | Fatal | Hardware error detected | REPLACE_VM |
| NETDEV WATCHDOG:.*mlx5_core.*timed out | Non-Fatal | TX queue timeout | Investigate; persistent stalls caught by link-state-detection |
| mlx5_core.*Detected insufficient power | Non-Fatal | Power negotiation status | Investigate; can be transient |
| mlx5_core.*Port module event.*High Temp | Non-Fatal | Thermal warning | Investigate; check cooling |
| mlx5_cmd_out_err.*ACCESS_REG.*failed | Non-Fatal | Restricted PF access | Filter/Ignore |
| mlx5_core.*Port module event.*Cable unplugged | Non-Fatal | SFP/transceiver unplugged | Monitor port state |
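
To illustrate how the patterns in this summary could be applied to a log line, here is a minimal classification sketch. The regexes are copied from the summary; the pattern names reuse identifiers that appear elsewhere in this document, but the pairing of each name with a regex is an inference, and the `classify` helper is hypothetical rather than the monitor's actual code.

```python
import re

# (pattern name, regex, is_fatal). Regexes come from the pattern summary;
# the name-to-regex pairing is an assumption made for illustration.
NIC_PATTERNS = [
    ("cmd_exec_timeout",
     r"mlx5_core.*timeout\. Will cause a leak of a command resource", True),
    ("health_poll_failed",
     r"mlx5_core.*device's health compromised.*reached miss count", True),
    ("unrecoverable_err", r"mlx5_core.*unrecoverable hardware error", True),
    ("netdev_watchdog", r"NETDEV WATCHDOG:.*mlx5_core.*timed out", False),
    ("pci_power_insufficient", r"mlx5_core.*Detected insufficient power", False),
    ("port_module_high_temp", r"mlx5_core.*Port module event.*High Temp", False),
    ("access_reg_failed", r"mlx5_cmd_out_err.*ACCESS_REG.*failed", False),
    ("module_unplugged", r"mlx5_core.*Port module event.*Cable unplugged", False),
]
COMPILED = [(name, re.compile(rx), fatal) for name, rx, fatal in NIC_PATTERNS]


def classify(line):
    """Return (pattern_name, is_fatal) for the first matching pattern,
    or None when the line matches no NIC driver pattern."""
    for name, regex, fatal in COMPILED:
        if regex.search(line):
            return name, fatal
    return None
```

Fatal matches would carry RecommendedAction REPLACE_VM, while non-fatal matches stay IsFatal=false and serve as diagnostic context.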

Design Principle

| Source | IsFatal | Recommended Action | Purpose |
|---|---|---|---|
| Deterministic Logs | true | REPLACE_VM | Fatal driver/firmware condition |
| Port State Changes (link-state-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Fatal Counters (link-counter-detection) | true | REPLACE_VM | Fatal NIC condition detected |
| Diagnostic Logs | false | NONE | Evidence/context for investigation |

Key Insight: Kernel logs for deterministic failures (command timeout resource leaks, etc.) are Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM. Diagnostic logs (insufficient power, High Temperature, module unplugged) are Non-Fatal (IsFatal=false). State and counter conditions are also Fatal (IsFatal=true) with RecommendedAction_REPLACE_VM.


Appendix B: Health Events Analyzer Rules for NIC Monitoring

The Health Events Analyzer should only add value where the raw monitors do not already emit a deterministic fatal event. Fatal NIC state/counter events and fatal NIC driver syslog patterns already trigger remediation directly, so the repeat-detection rules below are limited to non-fatal recurrence signals and use CONTACT_SUPPORT.

Design Note: The count threshold of 3 is externally grounded by the Linux mlx5 health-poll miss threshold (MAX_MISSES = 3) and gpud’s InfiniBand flap threshold of 3 reverts-to-active. The 1-hour window is an NVSentinel operational choice to catch clustered recurrence while avoiding stale daily maintenance or boot noise.

B.1 Minimal Repeated NIC Escalation Rules

RepeatedNICDriverError:

  • Input events: syslog-health-monitor, SysLogsNICDriverError, IsFatal=false.
  • Included pattern names: netdev_watchdog, port_module_high_temp, pci_power_insufficient, module_unplugged.
  • Grouping: same node + same errorcode[0] pattern name.
  • Threshold: 3 events in 1 hour.
  • Action: CONTACT_SUPPORT.
  • Purpose: escalate repeated driver/firmware diagnostic warnings when auto-recovery or environmental stabilization is failing.

RepeatedNICDegradation:

  • Input events: nic-health-monitor, InfiniBandDegradationCheck or EthernetDegradationCheck, IsFatal=false, IsHealthy=false.
  • Grouping: same node + same NIC + same NICPort.
  • Threshold: 3 events in 1 hour.
  • Action: CONTACT_SUPPORT.
  • Purpose: escalate repeated non-fatal counter degradation on the same physical port.

access_reg_failed is intentionally excluded from RepeatedNICDriverError. Public gpud context identifies repeated mlx5_cmd_out_err.*ACCESS_REG.*failed messages as restricted-PF/query noise on some systems, not a standalone hardware recurrence signal.

Fatal syslog patterns are also intentionally excluded:

  • cmd_exec_timeout
  • health_poll_failed
  • unrecoverable_err

Those are already emitted as fatal by the syslog health monitor and do not need analyzer-side repeat detection.
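
The RepeatedNICDriverError rule above can be sketched as a sliding-window count per (node, pattern name) group. This is an illustrative approximation only: the real rule runs inside the Health Events Analyzer against the datastore, and the `repeated_driver_errors` function and its tuple-based input are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta

# Pattern names included in the rule; access_reg_failed and the fatal
# patterns are deliberately absent.
INCLUDED_PATTERNS = {
    "netdev_watchdog", "port_module_high_temp",
    "pci_power_insufficient", "module_unplugged",
}


def repeated_driver_errors(events, threshold=3, window=timedelta(hours=1)):
    """Return the (nodename, pattern_name) groups that should escalate.

    `events` is an iterable of (nodename, pattern_name, timestamp) tuples
    for non-fatal SysLogsNICDriverError events.
    """
    by_group = defaultdict(list)
    for node, pattern, ts in events:
        if pattern in INCLUDED_PATTERNS:
            by_group[(node, pattern)].append(ts)

    escalate = set()
    for group, stamps in by_group.items():
        stamps.sort()
        # Compare each event with the one (threshold - 1) positions earlier:
        # if they fit in the window, `threshold` events occurred within it.
        for i in range(threshold - 1, len(stamps)):
            if stamps[i] - stamps[i - threshold + 1] <= window:
                escalate.add(group)
                break
    return escalate
```

Each returned group would produce one CONTACT_SUPPORT escalation. The RepeatedNICDegradation rule follows the same shape with the grouping key changed to node, NIC, and NICPort.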

B.2 External Source Basis

  • Linux mlx5 health poll uses MAX_MISSES = 3 and logs device's health compromised - reached miss count: Linux mlx5 health.c.
  • Linux mlx5 command timeout logs timeout. Will cause a leak of a command resource: Linux mlx5 cmd.c.
  • Linux mlx5 async events log High Temperature, Cable unplugged, and insufficient PCIe slot power: Linux mlx5 events.c.
  • Linux mlx5 TX timeout has a devlink health reporter recovery path: Linux mlx5 reporter_tx.c.
  • gpud uses 3 reverts-to-active as the InfiniBand port-flap threshold: leptonai/gpud#971.
  • gpud ACCESS_REG context identifies mlx5_cmd_out_err.*ACCESS_REG.*failed as restricted-device/query noise: leptonai/gpud#1164.

B.3 Configuration in values.yaml

These rules are configured in the health-events-analyzer Helm chart values:

    enableRepeatedNICDriverErrorRule: true
    enableRepeatedNICDegradationRule: true

References

Linux Kernel & Driver

  1. sysfs-class-infiniband (Linux Kernel)
  2. RHEL8 mlx5_core Stack Overflow (Red Hat)
  3. mlx5_core cmd.c - Command Interface (Linux Kernel) - command timeout resource leaks, ACCESS_REG failed
  4. mlx5_core health.c - Health Poller (Linux Kernel) - health compromise miss count, unrecoverable
  5. mlx5_core events.c - Async Events (Linux Kernel) - insufficient power, High Temperature, module unplugged
  6. PCIe AER HOWTO (Linux Kernel) - PCIe Bus Error log format and BDF identification

Vendor Documentation

  1. ibdiagnet User Manual (NVIDIA)