Platform Connectors Configuration

Overview

The Platform Connectors module is the central communication hub for NVSentinel. It receives health events from monitors over gRPC, runs them through a transformer pipeline, stores them in the database, and propagates them to Kubernetes as node conditions and events. This document covers all Helm configuration options for system administrators.

Configuration Reference

Resources

Defines CPU and memory resource requests and limits for the platform-connectors pod.

platformConnector:
  resources:
    limits:
      cpu: 200m
      memory: 512Mi
    requests:
      cpu: 200m
      memory: 512Mi

Logging

Sets the verbosity level for platform-connectors logs.

platformConnector:
  logLevel: info # Options: debug, info, warn, error

Transformer Pipeline

Configures the event transformation pipeline that processes health events before storage and Kubernetes propagation.

platformConnector:
  pipeline:
    - name: MetadataAugmentor
      enabled: false
      config: /etc/config/metadata.toml
    - name: OverrideTransformer
      enabled: false
      config: /etc/config/overrides.toml

  transformers:
    MetadataAugmentor:
      cacheSize: 50
      cacheTTLSeconds: 3600
      allowedLabels:
        - "topology.kubernetes.io/zone"

    OverrideTransformer:
      rules: []

Parameters

pipeline

Array of transformers to execute in order:

  • name: Transformer identifier (MetadataAugmentor, OverrideTransformer)
  • enabled: Enable/disable the transformer
  • config: Path to transformer-specific configuration file

transformers

Transformer-specific configurations, nested by transformer name.

Note: Transformers execute sequentially. MetadataAugmentor should run first to provide node metadata for subsequent transformers.
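The ordering note above can be illustrated with a minimal Python sketch. It is not NVSentinel's implementation: the registry, the two stand-in functions, and the sample zone value are hypothetical, and events are modelled as plain dictionaries. It only shows why list order matters when a later transformer reads what an earlier one wrote.

```python
from typing import Callable

# Hypothetical transformer type: takes an event dict and returns a new event dict.
Transformer = Callable[[dict], dict]


def augment_metadata(event: dict) -> dict:
    """Stand-in for MetadataAugmentor: attaches node metadata to the event."""
    return {**event, "metadata": {"topology.kubernetes.io/zone": "zone-a"}}


def apply_overrides(event: dict) -> dict:
    """Stand-in for OverrideTransformer: reads metadata set by an earlier stage."""
    if event.get("metadata", {}).get("topology.kubernetes.io/zone") == "zone-a":
        return {**event, "isFatal": False}
    return event


# Hypothetical registry mapping pipeline entry names to implementations.
REGISTRY: dict[str, Transformer] = {
    "MetadataAugmentor": augment_metadata,
    "OverrideTransformer": apply_overrides,
}


def run_pipeline(pipeline: list[dict], event: dict) -> dict:
    """Apply each enabled transformer in list order, skipping disabled entries."""
    for entry in pipeline:
        if entry.get("enabled", False):
            event = REGISTRY[entry["name"]](event)
    return event


pipeline = [
    {"name": "MetadataAugmentor", "enabled": True},
    {"name": "OverrideTransformer", "enabled": True},
]
result = run_pipeline(pipeline, {"nodeName": "node-1", "isFatal": True})
```

With this order the override stage sees the zone metadata and clears isFatal. Reversing the list leaves isFatal unchanged, because the metadata is not yet present when the override stage runs.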

Metadata Augmentor Configuration

Enriches health events with node labels and metadata from Kubernetes.

platformConnector:
  transformers:
    MetadataAugmentor:
      cacheSize: 50
      cacheTTLSeconds: 3600
      allowedLabels:
        - "topology.kubernetes.io/zone"
        - "topology.kubernetes.io/region"
        - "node.kubernetes.io/instance-type"

Parameters

cacheSize

Number of node metadata entries to cache in memory.

cacheTTLSeconds

Time-to-live for cached node metadata entries in seconds.

allowedLabels

List of node label keys to include in health event enrichment. Only labels in this list are read from nodes and added to events.

Note: The complete default list is defined in distros/kubernetes/nvsentinel/values.yaml.
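To make the interaction between cacheSize and cacheTTLSeconds concrete, here is a small Python sketch of a bounded cache with per-entry expiry. The class name, the injectable clock, and the least-recently-used eviction policy are assumptions made for illustration; the source does not say how NVSentinel evicts entries once the cache is full.

```python
import time
from collections import OrderedDict


class NodeMetadataCache:
    """Bounded cache with per-entry expiry (illustrative only)."""

    def __init__(self, cache_size: int, ttl_seconds: float, clock=time.monotonic):
        self._max = cache_size      # plays the role of cacheSize
        self._ttl = ttl_seconds     # plays the role of cacheTTLSeconds
        self._clock = clock         # injectable so expiry can be tested
        self._entries = OrderedDict()  # node name -> (expires_at, labels)

    def get(self, node_name: str):
        """Return cached labels, or None if missing or expired."""
        item = self._entries.get(node_name)
        if item is None:
            return None
        expires_at, labels = item
        if self._clock() >= expires_at:
            del self._entries[node_name]  # expired: force a fresh lookup
            return None
        self._entries.move_to_end(node_name)  # mark as recently used
        return labels

    def put(self, node_name: str, labels: dict) -> None:
        """Store labels and evict the least recently used entry if over capacity."""
        self._entries[node_name] = (self._clock() + self._ttl, labels)
        self._entries.move_to_end(node_name)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # assumed LRU eviction
```

Under these assumptions, a larger cacheSize reduces repeated node lookups on clusters with many nodes, while a shorter cacheTTLSeconds makes label changes visible sooner at the cost of more lookups.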

Example

platformConnector:
  transformers:
    MetadataAugmentor:
      cacheSize: 100
      cacheTTLSeconds: 3600
      allowedLabels:
        - "topology.kubernetes.io/zone"
        - "topology.kubernetes.io/region"
        - "custom.company.com/rack-id"
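The allow-list behaviour described for allowedLabels amounts to a key filter over the node's labels. The Python sketch below uses the three keys from the example above; the function name and the sample label values are hypothetical.

```python
ALLOWED_LABELS = [
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
    "custom.company.com/rack-id",
]


def filter_labels(node_labels: dict[str, str], allowed: list[str]) -> dict[str, str]:
    """Keep only the node labels whose keys appear in the allow-list."""
    allowed_set = set(allowed)
    return {key: value for key, value in node_labels.items() if key in allowed_set}


# Hypothetical labels on a node; the values are made up for illustration.
node_labels = {
    "topology.kubernetes.io/zone": "us-east-1a",
    "custom.company.com/rack-id": "rack-42",
    "kubernetes.io/hostname": "node-1",  # not in the allow-list, so it is dropped
}
enriched = filter_labels(node_labels, ALLOWED_LABELS)
```

Labels that are allowed but absent from the node (here, the region key) simply do not appear in the result.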

Override Transformer Configuration

Applies rules written in CEL (Common Expression Language) to modify health event properties (isFatal, isHealthy, recommendedAction).

platformConnector:
  transformers:
    OverrideTransformer:
      rules:
        - name: "suppress-xid-109"
          when: 'event.agent == "syslog-health-monitor" && "109" in event.errorCode'
          override:
            isFatal: false
            recommendedAction: "NONE"

Parameters

rules

Array of override rules evaluated in order (first match wins):

  • name: Human-readable rule name for logging
  • when: CEL expression that evaluates to boolean
  • override: Properties to modify (isFatal, isHealthy, recommendedAction)
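The "first match wins" evaluation can be sketched in Python as follows. This is an illustration under stated assumptions, not the transformer's actual code: Python callables stand in for CEL expressions, events are plain dictionaries, and the second rule ("catch-all-syslog") is hypothetical.

```python
from typing import Optional


def apply_first_match(event: dict, rules: list[dict]) -> tuple[dict, Optional[str]]:
    """Apply the overrides of the first rule whose condition holds.

    Returns the (possibly modified) event and the matching rule's name,
    or the original event and None when no rule matches.
    """
    for rule in rules:
        if rule["when"](event):
            return {**event, **rule["override"]}, rule["name"]
    return event, None


rules = [
    {
        "name": "suppress-xid-109",
        # Python stand-in for the CEL expression in the example above.
        "when": lambda e: e["agent"] == "syslog-health-monitor" and "109" in e["errorCode"],
        "override": {"isFatal": False, "recommendedAction": "NONE"},
    },
    {
        "name": "catch-all-syslog",  # hypothetical rule, never reached for XID 109 events
        "when": lambda e: e["agent"] == "syslog-health-monitor",
        "override": {"isHealthy": True},
    },
]

event = {
    "agent": "syslog-health-monitor",
    "errorCode": ["109"],
    "isFatal": True,
    "isHealthy": False,
    "recommendedAction": "RESTART_VM",  # placeholder value for illustration
}
updated, matched = apply_first_match(event, rules)
```

Because evaluation stops at the first match, more specific rules should be listed before broader ones; properties not named in override keep their original values.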

CEL Expression Context

CEL expressions have access to the event object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| event.nodeName | string | Node where event occurred |
| event.agent | string | Health monitor that generated event |
| event.componentClass | string | Component class (e.g., “GPU”, “Network”) |
| event.checkName | string | Name of the health check |
| event.message | string | Human-readable error message |
| event.errorCode | []string | Array of error codes |
| event.entitiesImpacted | []Entity | Affected entities (GPUs, NICs, etc.) |
| event.isFatal | bool | Whether error is fatal |
| event.isHealthy | bool | Overall health status |
| event.recommendedAction | string | Recommended remediation action |
| event.metadata | map | Node metadata from MetadataAugmentor |

Entity fields: Each entity in entitiesImpacted has:

  • entityType - Type of entity (e.g., “GPU”, “NIC”)
  • entityValue - Entity identifier (e.g., GPU UUID, PCI address)
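As a reading aid, the event context above can be mirrored by a pair of Python dataclasses, with two predicates that correspond to CEL expressions. The default values, the string value type of metadata, and the sample field values are assumptions. The second predicate corresponds to the standard CEL `exists` macro; whether that macro is enabled in NVSentinel's CEL environment is not stated in the source.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entityType: str   # e.g. "GPU", "NIC"
    entityValue: str  # e.g. GPU UUID, PCI address


@dataclass
class Event:
    nodeName: str
    agent: str
    componentClass: str
    checkName: str
    message: str
    errorCode: list[str] = field(default_factory=list)
    entitiesImpacted: list[Entity] = field(default_factory=list)
    isFatal: bool = False            # defaults here are arbitrary
    isHealthy: bool = True
    recommendedAction: str = "NONE"
    metadata: dict[str, str] = field(default_factory=dict)


def is_xid_109(event: Event) -> bool:
    """Mirrors: event.agent == "syslog-health-monitor" && "109" in event.errorCode"""
    return event.agent == "syslog-health-monitor" and "109" in event.errorCode


def affects_gpu(event: Event) -> bool:
    """Mirrors: event.entitiesImpacted.exists(e, e.entityType == "GPU")"""
    return any(e.entityType == "GPU" for e in event.entitiesImpacted)


# Sample event; every value below is made up for illustration.
sample = Event(
    nodeName="node-1",
    agent="syslog-health-monitor",
    componentClass="GPU",
    checkName="example-check",
    message="example message",
    errorCode=["109"],
    entitiesImpacted=[Entity(entityType="GPU", entityValue="GPU-0000")],
    isFatal=True,
    isHealthy=False,
)
```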

Examples

Suppress known errors:

platformConnector:
  transformers:
    OverrideTransformer:
      rules:
        - name: "suppress-xid-109"
          when: 'event.agent == "syslog-health-monitor" && "109" in event.errorCode'
          override:
            isFatal: false
            recommendedAction: "NONE"

Kubernetes Connector

Configures the Kubernetes API client for creating node conditions and events.

platformConnector:
  k8sConnector:
    enabled: true
    maxNodeConditionMessageLength: 1024
    qps: 5.0
    burst: 10

Parameters

enabled

Enables Kubernetes connector for creating node conditions and events.

maxNodeConditionMessageLength

Maximum length of node condition messages in characters.

qps

Queries per second allowed to the Kubernetes API server.

burst

Maximum burst of queries allowed to the Kubernetes API server.
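One common way qps and burst work together is a token bucket: tokens refill at qps per second up to a ceiling of burst, and each request spends one token. The Python sketch below assumes the connector's client limiter behaves this way, as Kubernetes client libraries typically do; the class and method names are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter (illustrative only)."""

    def __init__(self, qps: float, burst: int, clock=time.monotonic):
        self._qps = qps                 # sustained refill rate, tokens per second
        self._burst = burst             # bucket capacity
        self._clock = clock             # injectable so behaviour can be tested
        self._tokens = float(burst)     # start with a full bucket
        self._last = clock()

    def try_acquire(self) -> bool:
        """Spend one token if available; return False when the caller must wait."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._burst, self._tokens + elapsed * self._qps)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Under this model, the defaults (qps 5.0, burst 10) allow ten back-to-back API calls after an idle period, then settle to five calls per second. Raising both values trades API server load for faster propagation when many events arrive at once.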

Example

platformConnector:
  k8sConnector:
    enabled: true
    maxNodeConditionMessageLength: 1024
    qps: 10.0
    burst: 20
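The maxNodeConditionMessageLength setting caps how long a node condition message can be. A minimal Python sketch of such a cap follows; the function name and the "..." suffix are assumptions, since the source does not say whether the connector marks truncated messages or simply cuts them.

```python
def truncate_message(message: str, max_length: int, suffix: str = "...") -> str:
    """Limit a message to max_length characters, marking truncation with a suffix."""
    if len(message) <= max_length:
        return message
    if max_length <= len(suffix):
        # No room for the suffix: fall back to a plain cut.
        return message[:max_length]
    return message[: max_length - len(suffix)] + suffix


long_message = "x" * 2000
capped = truncate_message(long_message, 1024)
```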