Runbook: MongoDB Connection Error
Overview
MongoDB persists the health events created by health monitors such as syslog-health-monitor and gpu-health-monitor, and retains them for 30 days. The fault-handling modules (fault-quarantine, node-drainer, fault-remediation) connect to MongoDB to process these events and to cordon, uncordon, or remediate the node. When MongoDB pods crash or fail to start, all event processing in NVSentinel stops.
Common Causes
MongoDB connection failures typically have one of the following causes:
1. TLS Certificates Not Installed or Expired
- Missing mongodb-tls-secret or mongo-app-client-cert-secret secrets
- Expired certificates
2. Initialization Job Not Completed
- create-mongodb-database job failed or is still running
- The job creates the required database, collections, and indexes
- Without its successful completion, the MongoDB pod cannot initialize properly
3. Storage Class Is Missing
- PersistentVolumeClaims (PVCs) use a StorageClass to dynamically provision volumes. Without a valid StorageClass, PVCs cannot bind to storage, and MongoDB pods cannot start until their PVCs are bound.
Quick Diagnosis
Run these commands to identify the issue:
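A minimal sketch of checks covering the three common causes listed above. The namespace `nvsentinel` and the function name `quick_diagnosis` are assumptions made for illustration; the secret and job names are the ones named in this runbook.

```shell
# Assumed namespace; override with NAMESPACE=<your-namespace> if it differs.
NAMESPACE="${NAMESPACE:-nvsentinel}"

quick_diagnosis() {
  # Pod status: look for CrashLoopBackOff, Pending, or Init errors on MongoDB pods.
  kubectl get pods -n "$NAMESPACE"

  # Cause 1: the TLS secrets must exist.
  kubectl get secret mongodb-tls-secret mongo-app-client-cert-secret -n "$NAMESPACE"

  # Cause 2: the initialization job should show 1/1 completions.
  kubectl get job create-mongodb-database -n "$NAMESPACE"

  # Cause 3: PVCs should be Bound, and a usable StorageClass must exist.
  kubectl get pvc -n "$NAMESPACE"
  kubectl get storageclass
}
```

Call `quick_diagnosis` after sourcing the snippet. A "NotFound" error on the secrets or the job, or a PVC stuck in Pending, points to the matching cause.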
Detailed Troubleshooting
Issue 1: Expired Certificates/Secrets
Diagnosis:
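One way to check whether a certificate held in a secret has expired is to decode it and print its end date. This is a sketch, not a prescribed procedure: it assumes the secret stores its certificate under the standard `tls.crt` key, that `openssl` is available locally, and the helper name `secret_cert_enddate` is hypothetical.

```shell
# Print the notAfter date of the certificate stored in a TLS secret.
# $1 = namespace, $2 = secret name
secret_cert_enddate() {
  kubectl get secret "$2" -n "$1" -o jsonpath='{.data.tls\.crt}' \
    | base64 --decode \
    | openssl x509 -noout -enddate
}
```

For example, `secret_cert_enddate nvsentinel mongodb-tls-secret` or `secret_cert_enddate cert-manager cert-manager-webhook-ca`, where both namespaces are assumptions (`cert-manager` is the default install namespace). A `notAfter` date in the past means the certificate is expired.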
Solution:
Scenario A: Mongo secrets do NOT exist (cert-manager webhook is the problem)
If Step 1 showed no mongo secrets, or Step 3 showed cert-manager-webhook-ca is expired:
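A possible recovery for this scenario, sketched under assumptions: cert-manager runs in the `cert-manager` namespace, its webhook deployment is named `cert-manager-webhook` (the default for the static manifests; Helm releases may use a different prefix), and the function name is hypothetical. The idea is to remove the expired webhook CA secret and restart cert-manager so that it is regenerated.

```shell
# Assumed cert-manager namespace; override with CM_NAMESPACE=<namespace>.
CM_NAMESPACE="${CM_NAMESPACE:-cert-manager}"

refresh_cert_manager_webhook() {
  # Remove the expired webhook CA; cert-manager recreates it on startup.
  kubectl delete secret cert-manager-webhook-ca -n "$CM_NAMESPACE" --ignore-not-found

  # Restart every deployment in the cert-manager namespace.
  kubectl rollout restart deployment -n "$CM_NAMESPACE"

  # Wait for the webhook to become available again.
  kubectl rollout status deployment/cert-manager-webhook -n "$CM_NAMESPACE" --timeout=120s
}
```

Once the webhook is healthy, cert-manager should be able to issue the missing mongo secrets. Confirm that they exist before restarting the MongoDB pods.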
Scenario B: Mongo secrets exist but are expired (cert-manager is healthy)
If Step 1 showed mongo secrets exist, and Step 4 showed they are expired, but Step 3 showed cert-manager webhook is healthy:
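A possible recovery for this scenario, assuming the two secrets are issued by cert-manager Certificate resources in an `nvsentinel` namespace (both are assumptions, as is the function name). Deleting a secret that a Certificate owns causes cert-manager to reissue it.

```shell
# Assumed namespace; override with NAMESPACE=<your-namespace> if it differs.
NAMESPACE="${NAMESPACE:-nvsentinel}"

reissue_mongo_certs() {
  # Delete the expired secrets so cert-manager reissues them.
  kubectl delete secret mongodb-tls-secret mongo-app-client-cert-secret -n "$NAMESPACE"

  # List the Certificate resources; READY should return to True after reissue.
  kubectl get certificate -n "$NAMESPACE"
}
```

Pods that mount these secrets, including MongoDB and the fault-handling modules, may need a restart to pick up the new certificates.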