Runbook: Health Event Analyzer High Error Rate
Symptoms
Prometheus Alert: HealthEventAnalyzerHighErrorRate
Metric: High ratio of health_event_analyzer_event_processing_errors to health_event_analyzer_events_received_total
Common Causes:
- Malformed MongoDB aggregation queries in the health-events-analyzer configuration
- MongoDB connection errors (network issues, service unavailable, authentication failures)
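The ratio behind the alert can be illustrated with a minimal Python sketch. The threshold below is a hypothetical value chosen for illustration; the actual threshold is defined in the Prometheus alert rule and is not stated in this runbook.

```python
# Hypothetical threshold for illustration only; check the real alert rule.
ASSUMED_THRESHOLD = 0.05


def error_ratio(errors: float, received: float) -> float:
    """Return errors / received, or 0.0 when no events were received."""
    if received <= 0:
        return 0.0
    return errors / received


def is_high_error_rate(errors: float, received: float,
                       threshold: float = ASSUMED_THRESHOLD) -> bool:
    """True when the error ratio is strictly above the threshold."""
    return error_ratio(errors, received) > threshold
```

Here errors stands for the increase in health_event_analyzer_event_processing_errors and received for the increase in health_event_analyzer_events_received_total over the same time window.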
Overview
The Health Events Analyzer (HEA) processes health events from MongoDB and applies rules defined in the health-events-analyzer-config ConfigMap. These rules contain MongoDB aggregation pipeline queries. Errors can occur due to:
- Syntax errors in the aggregation pipeline queries
- Typos in field names (e.g., generatedtimestap instead of generatedtimestamp)
- Invalid MongoDB operators (e.g., $notequal instead of $ne)
- MongoDB connectivity issues
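The first three causes can often be caught before a rule is deployed. The following is a rough Python sketch of such a check, not part of the Health Events Analyzer itself. It assumes the pipeline has already been parsed into Python lists and dicts; the operator set is a deliberately small, non-exhaustive subset, and the field allowlist is hypothetical and must be adapted to the real event schema.

```python
from typing import Any, List, Set

# Small, non-exhaustive subset of MongoDB stages and operators (assumption:
# extend this set with whatever your rules legitimately use).
KNOWN_OPERATORS: Set[str] = {
    "$match", "$group", "$sort", "$limit", "$project", "$count",
    "$and", "$or", "$not", "$eq", "$ne", "$gt", "$gte", "$lt", "$lte",
    "$in", "$nin", "$exists", "$sum",
}


def find_unknown_operators(node: Any,
                           known: Set[str] = KNOWN_OPERATORS) -> List[str]:
    """Recursively collect every '$'-prefixed dict key that is not in `known`."""
    unknown: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(key, str) and key.startswith("$") and key not in known:
                unknown.append(key)
            unknown.extend(find_unknown_operators(value, known))
    elif isinstance(node, list):
        for item in node:
            unknown.extend(find_unknown_operators(item, known))
    return unknown


def _field_keys(node: Any) -> List[str]:
    """Collect dict keys that do not start with '$' (treated as field names)."""
    fields: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(key, str) and not key.startswith("$"):
                fields.append(key)
            fields.extend(_field_keys(value))
    elif isinstance(node, list):
        for item in node:
            fields.extend(_field_keys(item))
    return fields


def find_unknown_match_fields(pipeline: List[dict],
                              known_fields: Set[str]) -> List[str]:
    """Return field names used in $match stages that are not in `known_fields`."""
    unknown: List[str] = []
    for stage in pipeline:
        if isinstance(stage, dict) and "$match" in stage:
            unknown.extend(f for f in _field_keys(stage["$match"])
                           if f not in known_fields)
    return unknown


# Example using the two mistakes mentioned above.
bad_pipeline = [{"$match": {"generatedtimestap": {"$notequal": 0}}}]
print(find_unknown_operators(bad_pipeline))
print(find_unknown_match_fields(bad_pipeline, {"generatedtimestamp"}))
```

Only $match stages are checked for field names because stages such as $group and $project legitimately introduce new output names. Field names are compared as whole strings, so dotted paths must appear in the allowlist exactly as they are written in the rule.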
Diagnosis Steps
Issue 1: Malformed Database Queries in Configuration
The Health Events Analyzer executes rules defined in the configuration file. These rules contain queries written in MongoDB aggregation pipeline syntax. A typo or syntax error in a rule causes a processing failure every time that rule is evaluated against an incoming event.
Diagnosis:
Solution:
Issue 2: MongoDB Connection Error
The Health Events Analyzer connects to MongoDB to listen for inserted events and to publish new health events. Connection errors prevent event processing and increase the health_event_analyzer_event_processing_errors metric for the execute_pipeline_error error type.
Follow the MongoDB Connection Error runbook to diagnose and resolve MongoDB connection issues.
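As a quick first check alongside that runbook, basic network reachability can be probed with a short Python sketch using only the standard library. The host name in the usage comment is hypothetical; 27017 is the default MongoDB port, but the deployment may use a different one.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Usage (hypothetical service name, replace with the real MongoDB host):
# can_reach("mongodb.example.svc", 27017)
```

A successful TCP connection only shows that the port is reachable. It does not verify authentication, TLS settings, or that the server is accepting queries.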
If MongoDB pods are in a healthy state, check for connectivity errors in the health-events-analyzer pod: