Runbook: Log Collection Job Failures
Symptoms
- Prometheus alert: HighLogCollectionFailureRate
- Failed log collector jobs visible in Kubernetes
- Missing diagnostic logs for faulted nodes
- Metric fault_remediation_log_collector_jobs_total{status="failure"} increasing
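To make "increasing" concrete, here is a minimal sketch, assuming you have already sampled the failure counter at regular intervals (for example by querying Prometheus). The function names are hypothetical and not part of the log collector itself. It totals the growth across the samples and treats any drop as a counter reset.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across time-ordered counter samples, tolerating resets.

    A counter only goes up, except when the exporting process restarts and
    the value drops back toward zero. After a drop, the new value itself is
    counted as the increase since the reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: everything observed since the restart is new.
            total += current
    return total


def is_increasing(samples: Sequence[float]) -> bool:
    """True if the failure counter grew at all over the sampled window."""
    return counter_increase(samples) > 0
```

A flat series such as `[7, 7, 7]` is reported as not increasing, so the same check can be reused later to confirm that failures have stopped.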
Diagnosis Steps
1. Check Recent Job Status
2. Examine Job Logs
3. Check Common Issues
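The first diagnosis step can be sketched as a small filter over the Job list. This assumes the Job list has been exported as JSON (for example with `kubectl get jobs -o json`) and relies only on the standard Kubernetes Job status fields `failed` and `conditions`. The job names in the sample are hypothetical.

```python
import json
from typing import Any, Dict, List


def failed_jobs(job_list: Dict[str, Any]) -> List[str]:
    """Return names of Jobs that report a Failed condition or failed pods."""
    names = []
    for job in job_list.get("items", []):
        status = job.get("status", {})
        has_failed_condition = any(
            cond.get("type") == "Failed" and cond.get("status") == "True"
            for cond in status.get("conditions") or []
        )
        # `failed` is omitted from the JSON when no pod has failed.
        if has_failed_condition or (status.get("failed") or 0) > 0:
            names.append(job["metadata"]["name"])
    return names


# Hypothetical sample shaped like a Kubernetes JobList object.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "log-collector-node-a"},
     "status": {"failed": 1,
                "conditions": [{"type": "Failed", "status": "True"}]}},
    {"metadata": {"name": "log-collector-node-b"},
     "status": {"succeeded": 1,
                "conditions": [{"type": "Complete", "status": "True"}]}}
  ]
}
""")

if __name__ == "__main__":
    print(failed_jobs(SAMPLE))
```

The names it returns are the Jobs whose logs are worth examining in the second step.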
Common Failure Causes and Solutions
Issue 1: NVIDIA Driver Pod Not Found
Error in logs:
Cause: GPU operator not deployed or driver pod not running on the node
Solution:
Issue 2: Timeout Errors
Error in logs:
Cause: Log collection taking longer than configured timeout
Solution:
Issue 3: Upload Failures
Error in logs:
Cause: File server not accessible or not running
Solution:
Issue 4: Permission Errors
Error in logs:
Cause: Insufficient privileges or security context issues
Solution:
Issue 5: Disk Space Issues on File Server
Error in logs:
Cause: File server persistent volume is full
Solution:
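Before freeing or expanding space, it can help to confirm how full the volume actually is. The following is a minimal sketch, assuming Python is available wherever the persistent volume is mounted. The mount path and the 90% threshold are assumptions to adjust for your deployment.

```python
import shutil


def volume_usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """True when usage is at or above the (assumed) alerting threshold."""
    return volume_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "/data" is a hypothetical mount path for the file server volume.
    mount_path = "."
    print(f"{mount_path}: {volume_usage_percent(mount_path):.1f}% used")
```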
Resolution Steps
- Identify and fix the root cause using the diagnosis steps above
- Manually retry failed collection (if needed):
- Verify fix:
- Verify logs are uploaded: