Runbook: Stale Events Troubleshooting
Overview
NVSentinel uses MongoDB change streams to watch for new health events. Each change stream uses a resume token to track its position in the event stream. In some cases, these tokens can become stale, and the fault handling modules then fail to resume the change stream. This runbook guides you through clearing stale resume tokens.
Prerequisites:
- kubectl access to the cluster
- mongosh (MongoDB Shell) installed locally
- Access to the nvsentinel namespace
Symptoms
You may see errors in the fault handling module logs (fault-quarantine, fault-remediation, or node-drainer) such as:
In other cases, modules may have been disabled for a long time and need to start fresh without processing accumulated events:
- Disabled modules - Fault handling modules were disabled for a long time while health events continued accumulating
- Circuit breaker tripped - The circuit breaker was tripped for an extended period, accumulating many unprocessed events
Procedure
1. Connect to MongoDB
Follow the Datastore Connection Runbook to connect to MongoDB:
2. Clear All Resume Tokens
Once connected, clear all resume tokens:
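A minimal sketch of this step, assuming the tokens live in a single collection that can be emptied with `deleteMany({})`. The helper name `clear_resume_tokens`, the `MONGOSH` override, and both arguments are illustrative assumptions; this runbook does not define the connection URI or the collection name, so take them from your deployment.

```shell
# Sketch only: the connection URI and the collection name are deployment-specific
# and are passed in as arguments; neither value is defined by this runbook.
MONGOSH="${MONGOSH:-mongosh}"   # override for a dry run, e.g. MONGOSH=echo

clear_resume_tokens() {
  uri="$1"
  collection="$2"
  if [ -z "$uri" ] || [ -z "$collection" ]; then
    echo "usage: clear_resume_tokens <mongodb-uri> <collection>" >&2
    return 2
  fi
  # deleteMany with an empty filter removes every document in the collection.
  "$MONGOSH" "$uri" --quiet --eval "db.getCollection('$collection').deleteMany({})"
}
```

Running it with `MONGOSH=echo` first prints the command that would be executed, which is a cheap way to confirm the target before deleting anything.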
3. Restart Fault Handling Deployments
Restart the deployments to start processing events from the current point in the change stream:
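One way to script the restart is sketched below. It assumes the deployment names match the module names used in this runbook (fault-quarantine, fault-remediation, node-drainer) and that they run in the nvsentinel namespace; adjust `DEPLOYMENTS` if your installation names them differently. The function name and the `KUBECTL` override are illustrative.

```shell
KUBECTL="${KUBECTL:-kubectl}"          # override for a dry run, e.g. KUBECTL=echo
NAMESPACE="${NAMESPACE:-nvsentinel}"
# Assumption: deployment names match the module names listed under Symptoms.
DEPLOYMENTS="${DEPLOYMENTS:-fault-quarantine fault-remediation node-drainer}"

restart_fault_handlers() {
  for d in $DEPLOYMENTS; do
    # Triggers a rolling restart; stop at the first failure.
    "$KUBECTL" rollout restart "deployment/$d" -n "$NAMESPACE" || return 1
  done
}
```

Call `restart_fault_handlers` after the resume tokens have been cleared, so the restarted pods open their change streams without a stored token.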
Wait for the rollout to complete:
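As a sketch, `kubectl rollout status` can block until each deployment finishes rolling out. The deployment names are again assumed to match the module names in this runbook, and the 300s default timeout is an arbitrary illustrative value.

```shell
KUBECTL="${KUBECTL:-kubectl}"          # override for a dry run, e.g. KUBECTL=echo
NAMESPACE="${NAMESPACE:-nvsentinel}"
# Assumption: deployment names match the module names listed under Symptoms.
DEPLOYMENTS="${DEPLOYMENTS:-fault-quarantine fault-remediation node-drainer}"

wait_for_rollouts() {
  timeout="${1:-300s}"   # illustrative default; pass another value as the first argument
  for d in $DEPLOYMENTS; do
    # Blocks until the rollout completes or the timeout expires.
    "$KUBECTL" rollout status "deployment/$d" -n "$NAMESPACE" --timeout="$timeout" || return 1
  done
}
```

A non-zero return means at least one deployment did not become ready within the timeout and should be inspected before moving on.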
4. Verify Recovery
Check that the modules are processing events without errors:
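A rough way to spot-check the logs is sketched below. It assumes the same deployment names as the module names in this runbook and simply searches the last 100 log lines for the word "error"; the actual log format is not specified here, so treat a match as a prompt to read the surrounding output, not as a definitive failure signal.

```shell
KUBECTL="${KUBECTL:-kubectl}"          # override for a dry run, e.g. KUBECTL=echo
NAMESPACE="${NAMESPACE:-nvsentinel}"
# Assumption: deployment names match the module names listed under Symptoms.
DEPLOYMENTS="${DEPLOYMENTS:-fault-quarantine fault-remediation node-drainer}"

check_recent_logs() {
  for d in $DEPLOYMENTS; do
    echo "== $d =="
    # Fetch the logs first so a kubectl failure is not mistaken for a clean log.
    logs=$("$KUBECTL" logs "deployment/$d" -n "$NAMESPACE" --tail=100) || return 1
    # Print lines mentioning "error" (case-insensitive); report when none match.
    printf '%s\n' "$logs" | grep -i "error" || echo "no error lines in the last 100 log lines"
  done
}
```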
The modules will now start processing events from the current point in the change stream. They will not retroactively process old events that accumulated while the resume tokens were stale.