Caution

GENERATED CONTENT WARNING

This is LLM-generated content and is provided as a suggestion/placeholder while the actual documentation is being created.

Health Check

Overview

  • Purpose and cadence of periodic health checks

  • Tools used: system logs, nvidia-smi, DCGM (placeholder)

Preflight

  • Confirm system time and NTP status

  • Verify storage availability and SMART health

  • Validate network connectivity and DNS
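
The preflight items above could be scripted roughly as follows. This is a minimal sketch, not an established procedure: it assumes a Linux host with systemd (for `timedatectl`), and the function names are hypothetical. SMART health usually needs `smartctl` with root privileges, so it is left out here and only free disk space is checked.

```python
import shutil
import socket
import subprocess
from typing import Optional


def parse_ntp_synced(output: str) -> bool:
    """Interpret the output of `timedatectl show -p NTPSynchronized --value`."""
    return output.strip().lower() == "yes"


def ntp_synced() -> Optional[bool]:
    """Return True/False on systemd hosts, or None if timedatectl cannot be run."""
    try:
        result = subprocess.run(
            ["timedatectl", "show", "-p", "NTPSynchronized", "--value"],
            capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_ntp_synced(result.stdout)


def disk_free_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def dns_resolves(hostname: str) -> bool:
    """True if the system resolver returns at least one address for `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True
```

A caller might treat `ntp_synced() is None` as "unknown" rather than as a failure, and pick a free-space threshold (for example 10%) that suits the site; both choices are assumptions, not requirements from this outline.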

GPU Health

  • Temperature, clock, and power limits

  • ECC and memory error review (if applicable)

  • Utilization baselines
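
One way to collect the GPU readings listed above is to parse the CSV output of `nvidia-smi --query-gpu`. The sketch below is illustrative only: the field selection, the helper names, and the 85 °C threshold are assumptions, and fields that a GPU does not support (such as ECC counters on consumer cards) are stored as `None`.

```python
import csv
import io
import subprocess
from typing import Dict, List, Optional

# Query fields passed to nvidia-smi; the selection is an assumption for this sketch.
FIELDS = [
    "index",
    "temperature.gpu",
    "power.draw",
    "power.limit",
    "utilization.gpu",
    "ecc.errors.uncorrected.volatile.total",
]


def parse_gpu_csv(text: str) -> List[Dict[str, Optional[float]]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        row: Dict[str, Optional[float]] = {}
        for name, raw in zip(FIELDS, record):
            try:
                row[name] = float(raw.strip())
            except ValueError:
                row[name] = None  # e.g. "[N/A]" for unsupported fields
        rows.append(row)
    return rows


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi and return parsed readings; raises if the tool is missing."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return parse_gpu_csv(result.stdout)


def flag_gpu(row: Dict[str, Optional[float]], max_temp_c: float = 85.0) -> List[str]:
    """Return human-readable issues for one GPU; the threshold is a placeholder."""
    issues = []
    temp = row.get("temperature.gpu")
    if temp is not None and temp > max_temp_c:
        issues.append(f"temperature {temp:.0f} C exceeds {max_temp_c:.0f} C")
    ecc = row.get("ecc.errors.uncorrected.volatile.total")
    if ecc is not None and ecc > 0:
        issues.append(f"{ecc:.0f} uncorrected ECC errors")
    return issues
```

Keeping the parser separate from the subprocess call makes it possible to test the logic on captured output from a machine that has GPUs. Utilization baselines would come from comparing `utilization.gpu` across stored snapshots rather than from a single reading.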

System Metrics

  • CPU, memory, and I/O saturation checks

  • Background services and process review

  • Log anomalies and alert status
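
As a rough illustration of the CPU and memory checks above, the following sketch reads the load average and `/proc/meminfo` on Linux. The helper names are hypothetical, and what counts as "saturated" is left to the caller. I/O saturation, process review, and log scanning are not covered here.

```python
import os
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_available_fraction(meminfo: Dict[str, int]) -> float:
    """Fraction of total memory the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def load_per_cpu() -> float:
    """One-minute load average divided by the logical CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the live meminfo file on a Linux host."""
    with open(path, encoding="ascii") as handle:
        return parse_meminfo(handle.read())
```

A sustained `load_per_cpu()` above 1.0 often indicates more runnable work than cores, but the right alert level depends on the workload.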

Reporting

  • Capture a concise health report

  • Store snapshots for trend analysis
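
The reporting steps above could be implemented along these lines. This is a sketch under assumptions: the report schema, the `health-<timestamp>.json` file naming, and the function names are all hypothetical, chosen so that snapshots sort chronologically by name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional


def build_report(checks: Dict[str, bool], now: Optional[datetime] = None) -> dict:
    """Summarize named pass/fail checks into a small JSON-serializable report."""
    if now is None:
        now = datetime.now(timezone.utc)
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "degraded" if failed else "ok",
        "failed": failed,
        "checks": checks,
    }


def write_snapshot(report: dict, directory: str) -> Path:
    """Write the report to <directory>/health-<timestamp>.json and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Colons are removed because they are not valid in file names on some systems.
    stamp = report["timestamp"].replace(":", "")
    path = out_dir / f"health-{stamp}.json"
    path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

For trend analysis, the snapshot directory can later be read back in file-name order and the `checks` values compared across runs.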