Syslog Health Monitor Configuration
Overview
The Syslog Health Monitor module watches system logs for GPU errors (XID/SXID), GPU-fallen-off, and GPU reset events by reading journald logs. This document covers all Helm configuration options for system administrators.
Configuration Reference
Module Enable/Disable
Controls whether the syslog-health-monitor module is deployed in the cluster.
Resources
Defines CPU and memory resource requests and limits for the syslog-health-monitor container.
Logging
Sets the verbosity level for syslog-health-monitor logs.
Enabled Checks
Configures which health checks are active. The module monitors journald logs for specific error patterns. The only supported checks at this time are SysLogsXIDError, SysLogsSXIDError and SysLogsGPUFallenOff.
Check Types
SysLogsXIDError
Monitors for XID (GPU error) and GPU reset messages in system logs. XIDs are NVIDIA GPU error codes that indicate hardware or software issues.
SysLogsSXIDError
Monitors for SXID messages specific to NVSwitch errors in multi-GPU configurations.
SysLogsGPUFallenOff
Monitors for GPU-fallen-off events, in which the GPU becomes unresponsive or inaccessible to the system.
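To make the three check types concrete, the following is a minimal sketch of how a log line might be mapped to a check name. The regular expressions are assumptions based on commonly seen NVIDIA driver log lines (for example `NVRM: Xid (PCI:...): 79, ...`); they are not the module's actual matching rules, and GPU reset messages are not covered here. The function name `matching_checks` is hypothetical.

```python
import re

# Hypothetical patterns for illustration only; the monitor's real
# matching rules are not documented in this reference.
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")
SXID_RE = re.compile(r"SXid \(PCI:[0-9a-fA-F:.]+\): (\d+)")
FALLEN_OFF_RE = re.compile(r"fallen off the bus", re.IGNORECASE)


def matching_checks(line: str) -> set[str]:
    """Return the names of the checks whose pattern matches a log line."""
    checks = set()
    if XID_RE.search(line):
        checks.add("SysLogsXIDError")
    if SXID_RE.search(line):
        checks.add("SysLogsSXIDError")
    if FALLEN_OFF_RE.search(line):
        checks.add("SysLogsGPUFallenOff")
    return checks
```

A line can match more than one pattern (an XID line may also mention the bus failure), which is why the sketch returns a set rather than a single name.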
XID Analyzer Sidecar
Optional sidecar container that provides enhanced XID error analysis and mapping.
Configuration
Parameters
enabled
Enables the XID analyzer sidecar container. When disabled, the monitor uses an embedded XID mapping file.
image.repository
Container image for the XID analyzer sidecar service.
image.tag
Image tag for the XID analyzer sidecar.
image.pullPolicy
Pull policy for the sidecar image.
XID Parsing Modes
Embedded Parser (Default)
When xidSideCar.enabled: false, the monitor uses an embedded Excel-based XID mapping file for parsing and analysis.
Characteristics:
- No additional container needed
- Uses static XID mapping data
- Suitable for most deployments
Sidecar Parser
When xidSideCar.enabled: true, the monitor sends XID messages to the sidecar service via HTTP for analysis.
Characteristics:
- Dedicated analysis service
- Dynamic XID interpretation
- Can provide enhanced error context
Example with Sidecar Enabled
XID Analyzer Sidecar API
When the sidecar is enabled, the syslog health monitor communicates with it via an HTTP REST API.
Endpoint
The sidecar must listen on port 8080. Because it runs in the same pod as the syslog health monitor, the monitor reaches it at localhost:8080.
Request Format
Request Fields
- xid_message (string, required) - Raw XID error message from system logs
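As an illustration of the request described above, here is a minimal client sketch using only the Python standard library. The endpoint path `/decode-xid` and port 8080 come from this document; the `Content-Type` header, the timeout, and the function names are assumptions.

```python
import json
import urllib.request

# Address from this document: the sidecar shares the pod's network namespace.
SIDECAR_URL = "http://localhost:8080/decode-xid"


def build_decode_request(xid_message: str) -> urllib.request.Request:
    """Build a POST request carrying the raw XID message as JSON."""
    body = json.dumps({"xid_message": xid_message}).encode("utf-8")
    return urllib.request.Request(
        SIDECAR_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def decode_xid(xid_message: str, timeout: float = 5.0) -> dict:
    """Send the request and return the parsed JSON response body."""
    request = build_decode_request(xid_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx/5xx responses, so a caller that wants the error body would need to catch that exception and read it from there.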
Response Format
Success Response
Error Response
Response Fields
Top Level
- success (boolean, required) - Indicates whether parsing was successful
- result (object, optional) - XID details object, present when success is true
- error (string, optional) - Error message, present when success is false
Result Object
- number (integer) - XID error code number (e.g., 48, 64, 79)
- name (string) - Human-readable name of the XID error
- mnemonic (string) - Short mnemonic code for the error type
- context (string) - Additional context about the error
- resolution (string) - Recommended resolution action that should be performed by the system
- investigatory_action (string) - Steps to investigate the error
- pcie_bdf (string) - PCIe Bus:Device.Function identifier
- driver (string) - Driver version
- machine (string) - Machine architecture
- decoded_xid_string (string) - Human-readable decoded error message
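The response fields listed above could be consumed along these lines. This is a sketch, not the monitor's implementation: the `XidResult` class and `parse_response` function are hypothetical names, and because the document does not say which result fields are always present, every field is given a default.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class XidResult:
    """Mirror of the documented result object; all fields treated as optional."""
    number: Optional[int] = None
    name: str = ""
    mnemonic: str = ""
    context: str = ""
    resolution: str = ""
    investigatory_action: str = ""
    pcie_bdf: str = ""
    driver: str = ""
    machine: str = ""
    decoded_xid_string: str = ""


def parse_response(payload: dict) -> XidResult:
    """Turn a decoded JSON response into an XidResult or raise ValueError."""
    if not payload.get("success"):
        raise ValueError(payload.get("error") or "sidecar reported failure without an error message")
    result = payload.get("result")
    if not isinstance(result, dict):
        raise ValueError("success response is missing the result object")
    # Ignore keys this sketch does not know about instead of failing on them.
    known = {f.name for f in fields(XidResult)}
    return XidResult(**{k: v for k, v in result.items() if k in known})
```

Dropping unknown keys keeps the parser tolerant of a sidecar that returns extra fields.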
Implementation Requirements
The XID analyzer sidecar must:
- Listen on port 8080
- Implement the POST /decode-xid endpoint
- Accept JSON requests with an xid_message field
- Return JSON responses in the documented format
- Handle malformed or unparseable XID messages gracefully
- Return appropriate HTTP status codes (200 for success, 4xx/5xx for errors)
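The requirements above can be sketched as a small standard-library HTTP server. This is a stub under stated assumptions, not a real analyzer: the regular expression for extracting the XID number, the choice of 400 versus 422 for the two error cases, and the names `handle_decode`, `DecodeHandler`, and `serve` are all hypothetical, and only the `number` field of the result object is populated.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed log shape, e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..."
XID_RE = re.compile(r"Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")


def handle_decode(raw_body: bytes) -> tuple[int, dict]:
    """Return (HTTP status, response payload) for a /decode-xid request body."""
    try:
        request = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"success": False, "error": "request body is not valid JSON"}
    message = request.get("xid_message") if isinstance(request, dict) else None
    if not isinstance(message, str):
        return 400, {"success": False, "error": "xid_message (string) is required"}
    match = XID_RE.search(message)
    if match is None:
        return 422, {"success": False, "error": "no XID found in xid_message"}
    return 200, {"success": True, "result": {"number": int(match.group(1))}}


class DecodeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/decode-xid":
            status, payload = 404, {"success": False, "error": "unknown endpoint"}
        else:
            length = int(self.headers.get("Content-Length") or 0)
            status, payload = handle_decode(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the stub sidecar until interrupted."""
    HTTPServer((host, port), DecodeHandler).serve_forever()
```

Calling `serve()` starts the stub on port 8080. Keeping the decoding logic in `handle_decode`, separate from the HTTP handler, makes the malformed-input paths easy to exercise without opening a socket.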
Kata Containers Support
The module automatically deploys separate DaemonSets for standard and Kata Container nodes.
Kata Mode Differences
For nodes labeled with nvsentinel.dgxc.nvidia.com/kata: "true":
- Adds containerd service filter to journald queries
- Removes the SysLogsSXIDError check (not supported in Kata environment)
- Uses a separate DaemonSet with a -kata suffix
Configuration is automatic based on node labels. No manual configuration needed.
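The Kata differences listed above amount to a small transformation of the standard settings. The sketch below shows the check filtering and the DaemonSet naming only; the function names are hypothetical, and the journald containerd service filter is omitted because its exact form is not given in this document.

```python
# Checks this document lists as unsupported in Kata environments.
KATA_UNSUPPORTED_CHECKS = {"SysLogsSXIDError"}


def effective_checks(enabled_checks: list[str], kata: bool) -> list[str]:
    """Return the checks that would run on a node, dropping unsupported ones for Kata."""
    if not kata:
        return list(enabled_checks)
    return [check for check in enabled_checks if check not in KATA_UNSUPPORTED_CHECKS]


def daemonset_name(base_name: str, kata: bool) -> str:
    """Append the -kata suffix for the Kata DaemonSet."""
    return f"{base_name}-kata" if kata else base_name
```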