NVIDIA Autonomous Job Recovery Installation

Introduction

NVIDIA Mission Control provides Autonomous Job Recovery (AJR) for clusters managed by Base Command Manager (BCM).

AJR can:

  • Monitor Kubernetes jobs and pods.

  • Monitor Slurm jobs and scheduler health, diagnose issues, and take corrective actions to minimize downtime.

  • Visualize monitoring data in Grafana, using Loki for log collection.

  • Integrate with BCM and be installed on clusters where Kubernetes and Slurm are already present.
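
Because AJR is installed on clusters where Kubernetes and Slurm are already present, a quick check for both can be useful before starting. The following is a minimal sketch, not part of AJR or BCM: it assumes that the `kubectl` and `sinfo` client commands are a reasonable proxy for Kubernetes and Slurm being set up, and the helper name `find_missing_tools` is hypothetical. Finding a command on the PATH shows only that the client tool is installed, not that the cluster itself is healthy.

```python
import shutil

# Assumption: these client commands stand in for "Kubernetes and Slurm are present".
# This mapping is illustrative and is not an official AJR prerequisite list.
REQUIRED_TOOLS = {"Kubernetes": "kubectl", "Slurm": "sinfo"}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of components whose client command is not on PATH."""
    return [name for name, cmd in tools.items() if which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("kubectl and sinfo are both on PATH.")
```

Run it on the node where the installation will be performed; an empty result means both commands were found.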

For installation instructions, see NVIDIA Autonomous Job Recovery in the NVIDIA Base Command Manager Mission Control Manual.