Installation Guide for Autonomous Job Recovery#
Introduction#
This guide provides installation instructions for Autonomous Job Recovery (AJR). AJR is included in the NVIDIA Mission Control “On-Prem” software bundle for DGX SuperPOD B200 and DGX SuperPOD GB200. AJR is a suite of microservices deployed in a Kubernetes (k8s) cluster, designed to improve AI cluster efficiency by integrating all components of the AI training lifecycle into a unified, automated flow that minimizes downtime. Its core services support both manual and automated recovery processes. AJR autonomously handles most failures in the AI workflow, initiating immediate recovery without input from model engineers. This capability increases automation-driven productivity and enables the scaling of successful AI training practices across current and future generations of AI supercomputers.
Prerequisites#
The following prerequisites for Autonomous Job Recovery (AJR) are installed by NVIDIA Mission Control using the installation wizard:
- BCM license that allows AJR installation
- Kubernetes is deployed and configured with the cm-kubernetes-setup wizard. In addition to Kubernetes, the following configuration changes and additional packages must be applied:
  - Kyverno is disabled during Kubernetes installation
  - Prometheus Operator Stack is installed
  - Prometheus Adapter is installed
  - Grafana Loki is installed
  - Grafana Promtail is installed
    - Collection of /var/log logs is disabled
  - Kubernetes Metrics Server is installed
  - Kubernetes State Metrics is installed
  - Local Path storage class is enabled with the default NFS-based storage path
  - Ingress NGINX Controller is installed
- Slurm is deployed with the cm-wlm-setup wizard
- MySQL is installed on the head node
To obtain the NVCR (nvcr.io, NVIDIA container registry) token used to pull the images:
Log in to https://ngc.nvidia.com/signin using your credentials and select your organization.
Generate a token at https://org.ngc.nvidia.com/setup/api-keys
Please refer to NGC documentation for details: https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#ngc-api-keys
An example configuration is shown. Some options or packages might look different.

All prerequisites must be installed before you begin the AJR installation.
Before the Autonomous Job Recovery installation#
The following additional steps are required before the AJR installation can begin:
- A Kubernetes namespace for Heimdall must be created with a specific label.
- Create a file named create-are-namespace.yaml on the active head node with the following content:
apiVersion: v1
kind: Namespace
metadata:
  name: heimdall
  labels:
    zarf.dev/agent: ignore
Apply this file on the active head node:
root@bcm11-head-01:~# kubectl apply -f create-are-namespace.yaml
namespace/heimdall created
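Before starting the wizard, it can be useful to confirm that the namespace actually carries the label. The following is a minimal sketch for a bash session on the active head node; the helper name namespace_label_ok is hypothetical and is not part of AJR or BCM. It assumes kubectl is configured for the cluster.

```shell
# Hypothetical helper (not part of AJR): succeeds only when the given
# namespace has the label zarf.dev/agent=ignore.
namespace_label_ok() {
  local ns="$1" value
  # Dots inside a label key must be escaped in kubectl JSONPath expressions.
  value="$(kubectl get namespace "$ns" -o jsonpath='{.metadata.labels.zarf\.dev/agent}')" || return 1
  [ "$value" = "ignore" ]
}
```

For example, `namespace_label_ok heimdall && echo "label present"` prints the message only when the namespace exists and the label matches.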
Autonomous Job Recovery Installation#
After all prerequisites are met, start the AJR installation with the cm-mission-control-setup wizard. Select Install NVIDIA Mission Control autonomous job recovery when the wizard starts. Some steps may vary depending on the installation wizard version.

AJR requires a MySQL database. By default, the database is installed on the head node. The installation wizard prompts for admin credentials to create the AJR MySQL database and user.

When prompted, provide credentials for the Helm chart repository. You must supply an NVIDIA Container Registry (NVCR) personal access token for the operator and container images.

Optionally, provide credentials for the Loki API.

Select the log collection application (currently, only Grafana Promtail is available).
If NVIDIA Mission Control autonomous hardware recovery has been installed, the AJR installation wizard prompts you to enable integration with it. This is optional but enhances AJR's capabilities.

If integration is desired, select ‘yes’ and proceed to the configuration step.

The default values do not need to be changed. Refer to the NVIDIA Mission Control autonomous hardware recovery documentation for how to obtain the API token. Proceed to the next step after the token is entered.
In the next step, select Save the configuration and deploy. The installation wizard saves the collected information to the specified file and starts the AJR installation process. After installation completes, verify the installation by accessing the Grafana dashboards or the AJR UI.
Autonomous Job Recovery post-installation steps#
For the Heimdall efficiency Grafana dashboard to operate correctly, the following additional steps must be performed.
Create the MySQL user heimdall-kpis-reader.
a. Log in to the MySQL database as the root user and execute the following statements. Replace ‘password’ with a randomly generated password.
mysql> CREATE USER 'heimdall-kpis-reader' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.02 sec)
mysql> GRANT SELECT ON heimdall.job_efficiency_records TO 'heimdall-kpis-reader';
Query OK, 0 rows affected (0.01 sec)
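One way to produce the random password and fill it into the two statements above is sketched below. The helper names gen_password and render_reader_sql are hypothetical and not part of AJR; the sketch assumes a bash session on a Linux head node where /dev/urandom, od, and tr are available.

```shell
# Hypothetical helpers (not part of AJR) for creating heimdall-kpis-reader.

# Print 32 random hexadecimal characters read from /dev/urandom.
# Hex output contains no quotes, so it is safe inside a quoted SQL string.
gen_password() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Print the two statements from this guide with the given password filled in.
render_reader_sql() {
  local password="$1"
  cat <<SQL
CREATE USER 'heimdall-kpis-reader' IDENTIFIED BY '${password}';
GRANT SELECT ON heimdall.job_efficiency_records TO 'heimdall-kpis-reader';
SQL
}
```

For example, `pw="$(gen_password)"; render_reader_sql "$pw" | mysql -u root -p` would apply the statements, assuming the mysql client is installed on the head node. Keep the value of pw in a safe place, because the same password is needed for the Grafana data source.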
After AJR is installed, create a new MySQL data source in Grafana using the heimdall-kpis-reader username and the password specified above. Without it, the Heimdall efficiency Grafana dashboard will not function properly.
a. To add the new MySQL data source, open the Grafana UI at https://headnode_ip_address/grafana, replacing headnode_ip_address with the actual IP address of the head node. Refer to the Observability Stack Configuration guide for Grafana setup details and authorization information.
Log in to Grafana and click Data Sources in the Connections section. Select Add new data source and select the MySQL data source in the SQL section.
On the following screen, configure the following fields:
For the name, use ‘mysql’
For the Host URL, use ‘master:3306’
For the database, enter ‘heimdall’
For the username, use ‘heimdall-kpis-reader’
For the password, use the password specified during the creation of the heimdall-kpis-reader user.
Leave the rest of the fields with the default settings
Click the Save and Test button. A green popup with the “Database connection ok” message should appear.
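If Save and Test fails, the credentials can be checked outside Grafana from the command line. The sketch below is an assumption-laden example, not part of AJR: the helper name check_reader_access is hypothetical, the mysql client must be installed, and the host name master (taken from the Host URL above) must resolve from wherever the command is run.

```shell
# Hypothetical helper (not part of AJR): run a read-only query as
# heimdall-kpis-reader. Host and port default to the values used for the
# Grafana Host URL in this guide; mysql prompts for the password (-p).
check_reader_access() {
  local host="${1:-master}" port="${2:-3306}"
  mysql -h "$host" -P "$port" -u heimdall-kpis-reader -p \
    -e 'SELECT COUNT(*) FROM heimdall.job_efficiency_records;'
}
```

Running `check_reader_access` with no arguments uses master:3306. If the query returns a row count, the user and grant are in place; an access-denied error points back to the CREATE USER and GRANT statements rather than to the Grafana configuration.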
Autonomous Job Recovery uninstallation#
Start the AJR uninstallation with the cm-mission-control-setup wizard on the active head node. Select ‘Uninstall’ and confirm on the next page.


The uninstallation wizard uninstalls AJR and removes the ‘heimdall’ namespace. Existing data in the MySQL database and the ‘heimdall-kpis-reader’ MySQL user are not affected. To perform a complete uninstall, log in to the MySQL database as the root user and run the following statements:
DROP USER IF EXISTS 'heimdall-kpis-reader'@'%';
DROP USER IF EXISTS 'heimdall_user'@'%';
DROP DATABASE heimdall;
This removes the ‘heimdall-kpis-reader’ and ‘heimdall_user’ MySQL users and the related ‘heimdall’ database.
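To confirm that nothing is left behind after a complete uninstall, two read-only statements can be run as the MySQL root user. This is a minimal sketch; the helper name render_cleanup_check_sql is hypothetical and not part of AJR.

```shell
# Hypothetical helper (not part of AJR): print read-only statements that
# list any remaining AJR MySQL users and the heimdall database.
render_cleanup_check_sql() {
  cat <<'SQL'
SELECT User, Host FROM mysql.user WHERE User IN ('heimdall-kpis-reader', 'heimdall_user');
SHOW DATABASES LIKE 'heimdall';
SQL
}
```

For example, `render_cleanup_check_sql | mysql -u root -p` runs both statements, assuming the mysql client is available. Both are expected to return no rows once the users and the database have been dropped.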