NVSentinel Metrics

This document outlines all Prometheus metrics exposed by NVSentinel components.

Fault Quarantine Module

Event Processing Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_quarantine_events_received_total | Counter | - | Total number of events received from the watcher |
| fault_quarantine_events_successfully_processed_total | Counter | - | Total number of events successfully processed |
| fault_quarantine_processing_errors_total | Counter | error_type | Total number of errors encountered during event processing |
| fault_quarantine_event_backlog_count | Gauge | - | Number of health events that fault quarantine has yet to process |
| fault_quarantine_event_handling_duration_seconds | Histogram | - | Histogram of event handling durations |

Node Quarantine Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_quarantine_nodes_quarantined_total | Counter | node | Total number of nodes quarantined |
| fault_quarantine_nodes_unquarantined_total | Counter | node | Total number of nodes unquarantined |
| fault_quarantine_nodes_manually_uncordoned_total | Counter | node | Total number of manual uncordons of nodes |
| fault_quarantine_current_quarantined_nodes | Gauge | node | Current number of quarantined nodes |

Taint and Cordon Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_quarantine_taints_applied_total | Counter | taint_key, taint_effect | Total number of taints applied to nodes |
| fault_quarantine_taints_removed_total | Counter | taint_key, taint_effect | Total number of taints removed from nodes |
| fault_quarantine_cordons_applied_total | Counter | - | Total number of cordons applied to nodes |
| fault_quarantine_cordons_removed_total | Counter | - | Total number of cordons removed from nodes |
| fault_quarantine_node_quarantine_duration_seconds | Histogram | - | Time from health event generation to node quarantine completion. Buckets: Prometheus DefBuckets |
| fault_quarantine_node_remediation_duration_seconds | Histogram | - | End-to-end node remediation time: generatedTimestamp (from the original unhealthy event) to node unquarantine. Emitted on both auto unquarantine (via a healthy event) and manual uncordon. Buckets: ExponentialBuckets(start=10s, factor=1.5, count=27), max ~4.4 days |
| fault_quarantine_node_remediation_duration_excluding_drain_seconds | Histogram | - | Remediation time excluding node-drainer duration: (unquarantineTime - generatedTimestamp) - (drainFinishTimestamp - quarantineFinishTimestamp). Emitted only when both quarantineFinishTimestamp and drainFinishTimestamp are present in the original event document. Buckets: ExponentialBuckets(start=10s, factor=1.5, count=19), max ~4.1 hours |
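
Because these are standard Prometheus histograms, percentiles are computed at query time from the _bucket series. A minimal sketch of recording rules for the quarantine and remediation histograms above (the rule group and recorded series names are illustrative, not shipped with NVSentinel):

```yaml
# Hypothetical recording rules; only the *_bucket series names come from
# the table above.
groups:
  - name: nvsentinel-remediation-latency
    rules:
      # Median time from health event generation to quarantine completion.
      - record: nvsentinel:node_quarantine_duration_seconds:p50
        expr: histogram_quantile(0.50, sum by (le) (rate(fault_quarantine_node_quarantine_duration_seconds_bucket[10m])))
      # 95th percentile of end-to-end remediation time.
      - record: nvsentinel:node_remediation_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(fault_quarantine_node_remediation_duration_seconds_bucket[10m])))
```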

Ruleset Evaluation Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_quarantine_ruleset_evaluations_total | Counter | ruleset, status | Total number of ruleset evaluations. Status values: passed, failed |

Circuit Breaker Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_quarantine_breaker_state | Gauge | state | State of the fault quarantine breaker |
| fault_quarantine_breaker_utilization | Gauge | - | Utilization of the fault quarantine breaker |
| fault_quarantine_get_total_nodes_duration_seconds | Histogram | result | Duration of getTotalNodesWithRetry calls in seconds |
| fault_quarantine_get_total_nodes_errors_total | Counter | error_type | Total number of errors from getTotalNodesWithRetry |
| fault_quarantine_get_total_nodes_retry_attempts | Histogram | - | Number of retry attempts needed for getTotalNodesWithRetry (buckets: 0, 1, 2, 3, 5, 10) |
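
The backlog gauge and error counters above also lend themselves to simple alerting. A hedged sketch, with alert names and thresholds chosen purely for illustration:

```yaml
# Illustrative alerting rules; only the metric names come from the tables above.
groups:
  - name: nvsentinel-fault-quarantine-alerts
    rules:
      # Fires if fault quarantine falls behind on health events.
      - alert: FaultQuarantineBacklogGrowing
        expr: fault_quarantine_event_backlog_count > 100
        for: 10m
      # Fires on a sustained event-processing error rate, per error type.
      - alert: FaultQuarantineProcessingErrors
        expr: rate(fault_quarantine_processing_errors_total[5m]) > 0
        for: 15m
```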

Node Drainer Module

Event Processing Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| node_drainer_events_received_total | Counter | - | Total number of events received from the watcher |
| node_drainer_events_replayed_total | Counter | - | Total number of in-progress events replayed at startup |
| node_drainer_events_processed_total | Counter | drain_status, node | Total number of events processed, by drain status outcome. Drain status values: drained, cancelled, skipped |
| node_drainer_cancelled_event_total | Counter | node, check_name | Total number of cancelled drain events (due to manual uncordon or healthy recovery) |
| node_drainer_processing_errors_total | Counter | error_type, node | Total number of errors encountered during event processing and node draining |
| node_drainer_event_handling_duration_seconds | Histogram | - | Histogram of event handling durations |
| node_drainer_queue_depth | Gauge | - | Total number of pending events in the queue |
| node_drainer_pod_eviction_duration_seconds | Histogram | - | Time from event receipt by node-drainer to successful pod eviction completion. Buckets: ExponentialBuckets(start=0.1s, factor=2, count=23), up to ~3 days |

Node Draining Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| node_drainer_waiting_for_timeout | Gauge | node | Whether the node drainer operation is waiting for the timeout before force deletion (1=waiting, 0=not waiting) |
| node_drainer_force_delete_pods_after_timeout | Counter | node, namespace | Total number of node drainer operations that reached the timeout and force deleted pods |

Fault Remediation Module

Event Processing Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_remediation_events_received_total | Counter | - | Total number of events received from the watcher |
| fault_remediation_events_processed_total | Counter | cr_status, node_name | Total number of remediation events processed, by CR creation status. CR status values: created, skipped |
| fault_remediation_processing_errors_total | Counter | error_type, node_name | Total number of errors encountered during event processing |
| fault_remediation_unsupported_actions_total | Counter | action, node_name | Total number of health events with currently unsupported remediation actions |
| fault_remediation_event_handling_duration_seconds | Histogram | - | Histogram of event handling durations |
| fault_remediation_cr_generate_duration_seconds | Histogram | - | Time from drain completion (or quarantine completion if the drain timestamp is unavailable) to maintenance CR creation. Buckets: Prometheus DefBuckets |

Log Collector Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fault_remediation_log_collector_jobs_total | Counter | node_name, status | Total number of log collector jobs. Status values: success, failure, timeout |
| fault_remediation_log_collector_job_duration_seconds | Histogram | node_name, status | Duration of log collector jobs in seconds |
| fault_remediation_log_collector_errors_total | Counter | error_type, node_name | Total number of errors encountered in log collector operations |

File Server Metrics

HTTP Request Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| http_response_count_total | Counter | method, status, app | Total HTTP responses by method and status code |
| http_request_duration_seconds | Histogram | method, status | HTTP request duration in seconds |

Log Rotation Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| fileserver_log_rotation_successful_total | Counter | - | Total successful log cleanup operations |
| fileserver_log_rotation_failed_total | Counter | - | Total failed log cleanup operations |
| fileserver_disk_space_free_bytes | Gauge | - | Free disk space in bytes |

Labeler Module

Event Processing Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| labeler_events_processed_total | Counter | status | Total number of pod events processed. Status values: success, failed |
| labeler_node_update_failures_total | Counter | - | Total number of node update failures during reconciliation |
| labeler_event_handling_duration_seconds | Histogram | - | Histogram of event handling durations |

Janitor

Action Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| janitor_actions_count | Counter | action_type, status, node | Total number of janitor actions by type and status. Action types: reboot, terminate. Status values: started, succeeded, failed |
| janitor_action_mttr_seconds | Histogram | action_type | Time from CR creation to action completion (Mean Time To Repair). Buckets: ExponentialBuckets(start=10s, factor=2, count=10) for log-scale MTTR measurement |

Platform Connectors

Kubernetes Connector Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| k8s_platform_connector_node_condition_update_total | Counter | status | Total number of node condition updates by status. Status values: success, failed |
| k8s_platform_connector_node_event_operations_total | Counter | node_name, operation, status | Total number of node event operations by type and status. Operation values: create, update. Status values: success, failed |
| k8s_platform_connector_node_condition_update_duration_milliseconds | Histogram | - | Duration of node condition updates in milliseconds. Buckets: LinearBuckets(start=0, width=10, count=500) |
| k8s_platform_connector_node_event_update_create_duration_milliseconds | Histogram | - | Duration of node event updates/creations in milliseconds. Buckets: LinearBuckets(start=0, width=10, count=500) |

Workqueue Metrics

These metrics track the performance of the internal ring-buffer workqueue:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| platform_connector_workqueue_depth_<name> | Gauge | workqueue | Current depth of the platform connector workqueue |
| platform_connector_workqueue_adds_total_<name> | Counter | workqueue | Total number of adds handled by the platform connector workqueue |
| platform_connector_workqueue_latency_seconds_<name> | Histogram | workqueue | How long an item stays in the platform connector workqueue before being requested. Buckets: LinearBuckets(start=0, width=10, count=500) |
| platform_connector_workqueue_work_duration_seconds_<name> | Histogram | workqueue | How long processing an item from the platform connector workqueue takes. Buckets: LinearBuckets(start=0, width=10, count=500) |
| platform_connector_workqueue_retries_total_<name> | Counter | workqueue | Total number of retries handled by the platform connector workqueue |
| platform_connector_workqueue_longest_running_processor_seconds_<name> | Gauge | workqueue | How many seconds the longest running processor for the platform connector workqueue has been running |
| platform_connector_workqueue_unfinished_work_seconds_<name> | Gauge | workqueue | Total time in seconds of work in progress in the platform connector workqueue |

Note: <name> in the metric names is replaced with the actual workqueue name at runtime; for instance, a workqueue named events (a hypothetical example) would expose platform_connector_workqueue_depth_events.


Health Monitors

GPU Health Monitor

These metrics track GPU health events detected via DCGM (Data Center GPU Manager):

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dcgm_health_events_publish_time_to_grpc_channel | Histogram | operation_name | Time spent publishing DCGM health events on the gRPC channel |
| health_events_insertion_to_uds_succeed | Counter | - | Total number of successful insertions of health events to UDS |
| health_events_insertion_to_uds_error | Counter | - | Total number of failed insertions of health events to UDS |
| dcgm_health_active_events | Gauge | event_type, gpu_id, severity | Number of currently active health events, by severity. Severity values: fatal, non_fatal |
| dcgm_api_latency | Histogram | operation_name | Time spent calling DCGM APIs |
| dcgm_reconcile_time | Histogram | - | Time spent running a single DCGM reconcile loop |
| number_of_health_watches | Gauge | - | Number of DCGM health watches available |
| number_of_fields | Gauge | - | Number of available DCGM fields to monitor |
| callback_failures | Counter | class_name, func_name | Number of times a callback function has thrown an exception |
| callback_success | Counter | class_name, func_name | Number of times a callback function has completed successfully |
| dcgm_api_failures | Counter | error_name | Number of DCGM API errors |
| dcgm_health_check_unknown_system_skipped | Counter | - | Number of DCGM health check incidents skipped due to an unrecognized system value |

Syslog Health Monitor

The syslog health monitor tracks GPU-related errors detected from system logs.

XID Error Metrics

XID errors are error codes reported by the NVIDIA GPU driver:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| syslog_health_monitor_xid_errors | Counter | node, err_code | Total number of XID errors found |
| syslog_health_monitor_xid_processing_errors | Counter | error_type, node | Total number of errors encountered during XID processing |
| syslog_health_monitor_xid_processing_latency_seconds | Histogram | - | Histogram of XID processing latency |

SXID Error Metrics

SXID errors are NVSwitch-related errors:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| syslog_health_monitor_sxid_errors | Counter | node, err_code, link, nvswitch | Total number of SXID errors found |

GPU Fallen Off Bus Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| syslog_health_monitor_gpu_fallen_errors | Counter | node | Total number of "GPU has fallen off the bus" errors detected |

CSP Health Monitor

The CSP health monitor tracks cloud provider maintenance events and node health issues.

CSP Client Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| csp_health_monitor_csp_events_received_total | Counter | csp | Total number of raw events received from the CSP API/source |
| csp_health_monitor_csp_polling_duration_seconds | Histogram | csp | Duration of CSP polling cycles |
| csp_health_monitor_csp_api_errors_total | Counter | csp, error_type | Total number of errors encountered during CSP API calls |
| csp_health_monitor_csp_api_polling_duration_seconds | Histogram | csp, api | Duration of CSP API polling cycles |
| csp_health_monitor_csp_monitor_errors_total | Counter | csp, error_type | Total number of errors initializing or starting CSP monitors |
| csp_health_monitor_csp_events_by_type_unsupported_total | Counter | csp, event_type | Total number of raw CSP events received, partitioned by event type code |

Event Processing Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| csp_health_monitor_main_events_to_normalize_total | Counter | csp | Total number of events passed to the normalizer |
| csp_health_monitor_main_normalization_errors_total | Counter | csp | Total number of errors during event normalization |
| csp_health_monitor_main_events_received_total | Counter | csp | Total number of normalized events received by the main processor |
| csp_health_monitor_main_events_processed_success_total | Counter | csp | Total number of events successfully processed |
| csp_health_monitor_main_processing_errors_total | Counter | csp, error_type | Total number of errors during event processing |
| csp_health_monitor_main_event_processing_duration_seconds | Histogram | csp | Duration of processing a single event |

Datastore Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| csp_health_monitor_main_datastore_upsert_attempts_total | Counter | csp | Total number of attempts to upsert maintenance events |
| csp_health_monitor_main_datastore_upsert_total | Counter | csp, status | Total number of maintenance event upserts by status. Status values: success, failed |

Trigger Engine Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| csp_health_monitor_trigger_poll_cycles_total | Counter | - | Total number of polling cycles executed by the trigger engine |
| csp_health_monitor_trigger_poll_errors_total | Counter | - | Total number of errors during a trigger engine poll cycle |
| csp_health_monitor_trigger_events_found_total | Counter | trigger_type | Total number of events found that potentially need a trigger |
| csp_health_monitor_trigger_attempts_total | Counter | trigger_type | Total number of trigger attempts made |
| csp_health_monitor_trigger_success_total | Counter | trigger_type | Total number of successful triggers |
| csp_health_monitor_trigger_failures_total | Counter | trigger_type, failure_reason | Total number of failed trigger attempts |
| csp_health_monitor_trigger_datastore_query_duration_seconds | Histogram | query_type | Duration of datastore queries performed by the trigger engine |
| csp_health_monitor_trigger_datastore_query_errors_total | Counter | query_type | Total number of errors during datastore queries |
| csp_health_monitor_trigger_datastore_update_errors_total | Counter | trigger_type | Total number of errors updating event status after a trigger |
| csp_health_monitor_trigger_uds_send_duration_seconds | Histogram | - | Duration of sending health events via UDS |
| csp_health_monitor_trigger_uds_send_errors_total | Counter | - | Total number of errors encountered when sending events via UDS |
| csp_health_monitor_node_not_ready_timeout_total | Counter | node_name | Total number of nodes that remained not ready after the timeout period |
| csp_health_monitor_node_readiness_monitoring_started_total | Counter | node_name | Total number of times background node readiness monitoring was started |

Metrics Configuration

Scraping Metrics

All NVSentinel components expose Prometheus metrics on a metrics endpoint (typically :2112/metrics). The metrics can be scraped by Prometheus using standard scrape configurations.
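
Without the Prometheus Operator, a plain scrape job is enough. A minimal sketch, assuming hypothetical in-cluster service addresses (substitute the ones from your deployment):

```yaml
scrape_configs:
  - job_name: nvsentinel
    metrics_path: /metrics
    static_configs:
      - targets:
          # Hypothetical addresses; NVSentinel components typically expose :2112/metrics.
          - fault-quarantine.nvsentinel.svc:2112
          - node-drainer.nvsentinel.svc:2112
```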

Helm Chart Configuration

The NVSentinel Helm chart automatically creates a PodMonitor resource for Prometheus Operator integration:

```bash
helm install nvsentinel ./distros/kubernetes/nvsentinel \
  --namespace nvsentinel --create-namespace
```

The PodMonitor is configured to scrape all NVSentinel component pods on their metrics endpoints (/metrics on port metrics).
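
The generated resource is roughly equivalent to the sketch below; the pod selector label is an assumption here, so check the rendered chart for the actual selector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: nvsentinel          # illustrative; the chart sets its own name
  namespace: nvsentinel
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: nvsentinel   # assumed label; verify against the chart
  podMetricsEndpoints:
    - port: metrics         # per the chart: port "metrics", path /metrics
      path: /metrics
```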

Annotation-based Discovery

Components can be configured to include Prometheus scrape annotations:

```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "2112"
  prometheus.io/path: "/metrics"
```
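
With those annotations set, the stock Kubernetes annotation-driven discovery job picks the pods up automatically. This is the standard relabeling pattern, not NVSentinel-specific configuration:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path, e.g. prometheus.io/path: "/metrics".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to use the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```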

Metric Types Reference

  • Counter: A cumulative metric that only increases or resets to zero on restart
  • Gauge: A metric that can arbitrarily go up and down
  • Histogram: Samples observations and counts them in configurable buckets
  • Summary: Similar to histogram but calculates configurable quantiles over a sliding time window
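
As a quick orientation for querying these types, the recording rules below (names are illustrative, using node-drainer metrics as examples) show the usual pattern for each: rate() over counters, direct reads of gauges, and histogram_quantile() over histogram buckets:

```yaml
groups:
  - name: nvsentinel-metric-type-examples   # hypothetical group name
    rules:
      # Counter: always rate() over a window; never graph the raw total.
      - record: nvsentinel:drainer_events_received:rate5m
        expr: rate(node_drainer_events_received_total[5m])
      # Gauge: usable as-is, e.g. the current queue depth.
      - record: nvsentinel:drainer_queue_depth
        expr: node_drainer_queue_depth
      # Histogram: percentiles from the _bucket series at query time.
      - record: nvsentinel:drainer_event_handling_seconds:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(node_drainer_event_handling_duration_seconds_bucket[5m])))
```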

Common Label Values

Status Labels

  • success / failed - Operation outcome
  • started / succeeded / failed - Action lifecycle status

Action Types

  • reboot - Node reboot action
  • terminate - Node termination action

CSP Labels

  • gcp - Google Cloud Platform
  • aws - Amazon Web Services

Trigger Types

  • quarantine - Node quarantine trigger
  • healthy - Node healthy trigger