NVIDIA Autonomous Job Recovery Installation
Introduction
NVIDIA Mission Control provides Autonomous Job Recovery (AJR) for clusters managed by Base Command Manager (BCM).
AJR can:
Monitor Kubernetes jobs and pods.
Monitor Slurm jobs and scheduler health, diagnose issues, and take corrective actions to minimize downtime.
Visualize monitoring data in Grafana, using Loki for log collection.
Integrate with BCM and be installed on clusters where Kubernetes and Slurm are already present.
For installation instructions, see NVIDIA Autonomous Hardware Recovery in the NVIDIA Base Command Manager Mission Control Manual.