Caution
GENERATED CONTENT WARNING
This is LLM-generated content and is provided as a suggestion/placeholder while the actual documentation is being created.
Health Check
Overview
Purpose of periodic health checks and cadence
Tools used: system logs, nvidia-smi, DCGM (placeholder)
Preflight
Confirm system time and NTP status
Verify storage availability and SMART health
Validate network connectivity and DNS
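The preflight items above could be scripted roughly as follows. This is a minimal sketch, not a definitive procedure: it assumes a Linux host with systemd (for `timedatectl show`), and the function names are hypothetical. SMART health is left out because `smartctl` typically needs elevated privileges and its output varies by device.

```python
import shutil
import socket
import subprocess


def ntp_synchronized():
    """Return True/False from timedatectl, or None if it cannot be queried.

    Assumes a systemd host; other time daemons need a different check.
    """
    try:
        result = subprocess.run(
            ["timedatectl", "show", "--property=NTPSynchronized", "--value"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() == "yes"


def disk_free_fraction(path="/"):
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def can_resolve(hostname):
    """True if the system resolver returns at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    print("NTP synchronized:", ntp_synchronized())
    print("Free space on /: {:.1%}".format(disk_free_fraction("/")))
    print("DNS resolves localhost:", can_resolve("localhost"))
```

A `None` result from `ntp_synchronized()` means the check could not run, which is worth reporting separately from a definite "not synchronized".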
GPU Health
Temperature, clock, and power limits
ECC and memory error review (if applicable)
Utilization baselines
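One possible way to collect the GPU readings listed above is to query `nvidia-smi` in CSV mode and parse the result. The sketch below assumes the query fields shown are available on the installed driver (`nvidia-smi --help-query-gpu` lists the supported ones). The 85 °C threshold and the function names are hypothetical placeholders, not vendor limits. DCGM is not covered here.

```python
import csv
import io
import subprocess

# Query fields; availability depends on the GPU and driver version.
FIELDS = [
    "index",
    "temperature.gpu",
    "power.draw",
    "power.limit",
    "clocks.sm",
    "utilization.gpu",
    "ecc.errors.uncorrected.volatile.total",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Values that are not numeric (for example "[N/A]") become None.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue
        row = {}
        for name, raw in zip(FIELDS, record):
            try:
                row[name] = float(raw.strip())
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return parse_gpu_csv(result.stdout)


def hot_gpus(rows, limit_c=85.0):
    """Indices of GPUs whose temperature exceeds limit_c (hypothetical limit)."""
    return [
        int(row["index"])
        for row in rows
        if row["temperature.gpu"] is not None and row["temperature.gpu"] > limit_c
    ]
```

Keeping the parser separate from the `nvidia-smi` call means captured output can be replayed on a machine without a GPU, which helps when comparing against utilization baselines.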
System Metrics
CPU, memory, and I/O saturation checks
Background services and process review
Log anomalies and alert status
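As a rough illustration of the CPU and memory checks above, the following sketch reads the load average and `/proc/meminfo`. It assumes a Linux host. The helper names are hypothetical, and what counts as "saturated" is left to the operator. I/O, service, and log review are not covered.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def mem_available_fraction(info):
    """Share of total memory the kernel estimates is available to new work."""
    return info["MemAvailable"] / info["MemTotal"]


def load_per_cpu():
    """1-minute load average divided by the number of logical CPUs."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.cpu_count() or 1)


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    print("Memory available: {:.1%}".format(mem_available_fraction(meminfo)))
    print("Load per CPU: {:.2f}".format(load_per_cpu()))
```

A load-per-CPU value that stays well above 1.0 suggests runnable work is queuing, though the right alert level depends on the workload.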
Reporting
Capture a concise health report
Store snapshots for trend analysis
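The reporting steps above might be implemented by writing each health report as a timestamped JSON file. This is a sketch under assumptions: the `health-<UTC timestamp>.json` naming scheme, the directory layout, and the function names are all hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_snapshot(metrics, directory):
    """Write `metrics` (a JSON-serializable dict) to a timestamped file.

    Returns the path of the file that was written.
    """
    now = datetime.now(timezone.utc)
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "health-{}.json".format(now.strftime("%Y%m%dT%H%M%SZ"))
    payload = {"timestamp": now.isoformat(), "metrics": metrics}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def load_snapshots(directory):
    """Load all snapshots in filename order, which is also time order."""
    paths = sorted(Path(directory).glob("health-*.json"))
    return [json.loads(path.read_text()) for path in paths]
```

Because the filename has one-second resolution, two snapshots written within the same second would overwrite each other. A periodic job running at a coarser cadence avoids this.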