Installation Guide for Autonomous Job Recovery

Introduction

This guide provides installation instructions for Autonomous Job Recovery (AJR). AJR is included in the NVIDIA Mission Control “On-Prem” software bundle for DGX SuperPOD B200 and DGX SuperPOD GB200. AJR is a suite of microservices deployed in a Kubernetes (k8s) cluster, designed to improve AI cluster efficiency by integrating all components of the AI training lifecycle into a unified, automated flow that minimizes downtime. Its core services support both manual and automated recovery processes. AJR autonomously handles most failures in the AI workflow and initiates recovery immediately, without input from model engineers. This increases productivity through automation and allows successful AI training practices to scale across current and future generations of AI supercomputers.

Prerequisites

The following prerequisites for Autonomous Job Recovery (AJR) are installed by NVIDIA Mission Control using the installation wizard:

  • BCM license that allows AJR installation

  • Kubernetes is deployed and configured with the cm-kubernetes-setup wizard. In addition to Kubernetes, the following configuration changes and additional packages must be applied:

    • Kyverno is disabled during Kubernetes installation

    • Prometheus Operator Stack is installed

    • Prometheus Adapter is installed

    • Grafana Loki is installed

    • Grafana Promtail is installed

      • Collection of /var/log logs is disabled

    • Kubernetes Metrics Server is installed

    • Kubernetes State Metrics is installed

    • Local Path storage class is enabled with the default NFS-based storage path

    • Ingress NGINX Controller is installed

    • Slurm is deployed with the cm-wlm-setup wizard

    • MySQL is installed on the head node

  • To obtain the NVCR (nvcr.io, NVIDIA container registry) token used to pull the images:

An example configuration is shown. Some options or packages might look different.

Example configuration of prerequisites for AJR installation in Mission Control.

All prerequisites must be installed before you begin the AJR installation.
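
Before starting, it may help to confirm that the command-line tools referenced in this guide are present on the head node. The following is a minimal sketch, not part of Mission Control: the helper name missing_tools is hypothetical, and the tool list is simply the set of commands mentioned in this guide. It checks only that the commands resolve on PATH, not that the underlying components are configured correctly.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands are not found on PATH.
# missing_tools is a hypothetical helper, not part of Mission Control.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands referenced in this guide; adjust the list as needed.
missing="$(missing_tools kubectl mysql cm-kubernetes-setup cm-wlm-setup cm-mission-control-setup)"
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All listed tools were found."
fi
```

An empty result says nothing about the Kubernetes add-ons in the list above (Prometheus, Loki, and so on); those still need to be confirmed in the wizard or with kubectl.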

Before the Autonomous Job Recovery installation

The following additional steps are required before the AJR installation can begin:

  1. A Kubernetes namespace for Heimdall must be created with a particular label.

    1. Create a file named create-are-namespace.yaml on the active head node with the following content:

apiVersion: v1
kind: Namespace
metadata:
  name: heimdall
  labels:
    zarf.dev/agent: ignore

    2. Apply this file on the active head node:

root@bcm11-head-01:~# kubectl apply -f create-are-namespace.yaml
namespace/heimdall created
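
To confirm that the namespace exists and carries the expected label, a check along the following lines can be run on the active head node. This is a sketch under the assumption that kubectl is already configured for the cluster; namespace_has_label is a hypothetical helper name. It lists namespaces matching the label selector and looks for heimdall among them.

```shell
#!/usr/bin/env bash
# Sketch: check that a namespace carries a given label.
# namespace_has_label is a hypothetical helper; it assumes kubectl is
# configured for the cluster on the active head node.
namespace_has_label() {
    # $1 = namespace name, $2 = label selector such as key=value
    kubectl get namespaces -l "$2" -o name | grep -qx "namespace/$1"
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
    if namespace_has_label heimdall zarf.dev/agent=ignore; then
        echo "heimdall namespace is labeled zarf.dev/agent=ignore"
    else
        echo "heimdall namespace is missing or not labeled" >&2
    fi
fi
```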

Autonomous Job Recovery Installation

After all prerequisites are met, start the AJR installation with the cm-mission-control-setup wizard. Select Install NVIDIA Mission Control autonomous job recovery when the wizard starts. Some steps may vary depending on the installation wizard version.

Mission Control wizard showing the Install NVIDIA Mission Control autonomous job recovery option.

AJR requires a MySQL database. By default, the database is installed on the head node. The installation wizard prompts for admin credentials to create the AJR MySQL database and user:

Prompt for MySQL admin credentials used to create the AJR database and user.

When prompted, provide credentials for the Helm chart repository. You must supply an NVIDIA Container Registry (NVCR) personal access token for the operator and container images.

Fields for providing NVCR credentials for Helm chart repository and container images.

Optionally, provide credentials for the Loki API.

Optional fields to enter Loki API access credentials.

Select the log collection application (currently, only Grafana Promtail is available).

If NVIDIA Mission Control autonomous hardware recovery (AHR) has been installed, the AJR installation wizard prompts you to enable integration with it. This is optional but enhances the capabilities of AJR.

AHR Integration prompt.

If integration is desired, select ‘yes’ and proceed to the configuration step:

Field to configure AHR integration.

The default values do not need to be changed. Refer to the NVIDIA Mission Control autonomous hardware recovery documentation for how to obtain the API token. After entering the token, proceed to the next step.

In the next step, select Save the configuration and deploy. The installation wizard saves the collected information to the specified file and starts the AJR installation process. After installation completes, verify the installation by accessing the Grafana dashboards or the AJR UI.
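
Before opening the dashboards, it can be useful to confirm from the command line that the workloads started. The sketch below rests on two assumptions: that the AJR pods run in the heimdall namespace created earlier (this guide implies it but does not state it), and that the default output of kubectl get pods shows the pod status in the third column. count_not_running is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count pods in a namespace that are not Running or Completed.
# count_not_running is a hypothetical helper; the heimdall namespace as the
# location of the AJR pods is an assumption.
count_not_running() {
    # $1 = namespace. Column 3 of the default "kubectl get pods" output is
    # STATUS. Running does not guarantee that every container is ready.
    kubectl get pods -n "$1" --no-headers |
        awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
    echo "Pods not Running or Completed in heimdall: $(count_not_running heimdall)"
fi
```

A count of zero is only a first signal; the Grafana dashboards and the AJR UI remain the verification described above.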

Autonomous Job Recovery post-installation steps

For the Heimdall efficiency Grafana dashboard to operate correctly, the following additional steps are required.

  1. Create the MySQL user heimdall-kpis-reader.

    a. Log in to MySQL as the root user and execute the following statements. Replace ‘password’ with a randomly generated password.

mysql> CREATE USER 'heimdall-kpis-reader' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.02 sec)

mysql> GRANT SELECT ON heimdall.job_efficiency_records TO 'heimdall-kpis-reader';
Query OK, 0 rows affected (0.01 sec)

  2. After AJR is installed, create a new MySQL data source in Grafana using the heimdall-kpis-reader username and the password specified above. Without it, the Heimdall efficiency Grafana dashboard will not function properly.

    a. To add a new MySQL data source, open the Grafana UI at https://headnode_ip_address/grafana, replacing headnode_ip_address with the actual IP address of the head node. Refer to the Observability Stack Configuration guide for Grafana setup details and authorization information.

    Grafana login screen.

    b. Log in to Grafana and click Data Sources in the Connections section. Select Add new data source, then click the MySQL data source in the SQL section.

    Datasource selection screen.

    c. On the next screen, configure the following fields:

    • For the name, use ‘mysql’

    • For the Host URL, use ‘master:3306’

    • For the database, enter ‘heimdall’

    • For the username, use ‘heimdall-kpis-reader’

    • For the password, use the password specified during the creation of the heimdall-kpis-reader user.

    • Leave the rest of the fields with the default settings

    • Click the Save and Test button. A green popup with the “Database connection ok” message should appear.
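
The password for the heimdall-kpis-reader account is entered twice in the steps above: once in the CREATE USER statement and once in the Grafana data source form. One way to produce it is sketched below. The helper names gen_password and reader_sql are hypothetical, and reading from /dev/urandom is just one reasonable choice; any password generator that meets local policy would do.

```shell
#!/usr/bin/env bash
# Sketch: generate a random password and print the CREATE USER and GRANT
# statements with it filled in. gen_password and reader_sql are
# hypothetical helpers, not part of AJR.
gen_password() {
    # 16 bytes from /dev/urandom rendered as 32 hex characters, which
    # avoids quoting problems inside the SQL string literal.
    head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

reader_sql() {
    # $1 = password for the heimdall-kpis-reader account
    printf "CREATE USER 'heimdall-kpis-reader' IDENTIFIED BY '%s';\n" "$1"
    printf "GRANT SELECT ON heimdall.job_efficiency_records TO 'heimdall-kpis-reader';\n"
}

# Print the statements for review; the password appears on the terminal.
pw="$(gen_password)"
reader_sql "$pw"
```

The printed statements can be pasted into the MySQL root session, and the same password is then used in the Grafana data source form. Because the password is shown in clear text, run this only in a session where that is acceptable.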

Autonomous Job Recovery uninstallation

To uninstall AJR, run the cm-mission-control-setup wizard on the active head node. Select ‘Uninstall’ and confirm on the next page.

Initial uninstall or upgrade of AJR.

Confirm AJR uninstall.

The uninstallation wizard uninstalls AJR and removes the ‘heimdall’ namespace. Existing data in the MySQL database and the ‘heimdall-kpis-reader’ MySQL user are not affected. To perform a complete uninstall, log in to MySQL as the root user and execute the following statements:

DROP USER IF EXISTS 'heimdall-kpis-reader'@'%';

DROP USER IF EXISTS 'heimdall_user'@'%';

DROP DATABASE heimdall;

This removes the ‘heimdall-kpis-reader’ and ‘heimdall_user’ MySQL users and the related ‘heimdall’ database.
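
After a complete uninstall, the result can be checked from the head node. The following sketch assumes kubectl is configured for the cluster; namespace_exists and leftover_users_sql are hypothetical helper names. The first looks for the heimdall namespace, and the second prints a query against the mysql.user table that lists any AJR accounts still present.

```shell
#!/usr/bin/env bash
# Sketch: post-uninstall checks. namespace_exists and leftover_users_sql
# are hypothetical helpers, not part of AJR.
namespace_exists() {
    # $1 = namespace name; succeeds if kubectl lists it.
    kubectl get namespaces -o name | grep -qx "namespace/$1"
}

leftover_users_sql() {
    # Prints a query listing any remaining AJR accounts; an empty result
    # set means both users were dropped.
    printf "SELECT user, host FROM mysql.user WHERE user IN ('heimdall-kpis-reader', 'heimdall_user');\n"
}

# Run the namespace check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
    if namespace_exists heimdall; then
        echo "heimdall namespace is still present" >&2
    else
        echo "heimdall namespace is gone"
    fi
fi

# Print the query so it can be run in a MySQL root session.
leftover_users_sql
```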