Introducing Run:ai

This document describes how to deploy the Run:ai Atlas Platform on NVIDIA DGX BasePOD™ configurations through a simplified deployment using NVIDIA Base Command™ Manager (BCM) software.

The Run:ai Atlas platform enables IT organizations to architect their AI environment with cloud-like resource accessibility and management on any infrastructure. It also enables researchers to use the machine learning (ML) and data science tools of their choice. The platform builds on proven distributed computing and scheduling concepts from High Performance Computing (HPC) and is implemented as a Kubernetes (K8s) plug-in. It accelerates data science workflows and gives IT teams the visibility to manage valuable resources more efficiently and ultimately reduce idle GPU time.

The platform consists of the Run:ai cluster (Figure 1, right) and the Run:ai control plane, or backend (Figure 1, left). For installation on BCM, Run:ai hosts the control plane component, and the customer deploys the cluster components onto the K8s cluster or clusters created by BCM.

Figure 1: Run:ai Architecture

The Run:ai scheduler extends the capabilities of the default K8s scheduler without replacing it. It applies business rules, based on project quotas, to schedule workloads submitted by researchers and data scientists. Fractional GPU, a Run:ai technology, lets researchers allocate a subset of a GPU rather than the whole device, improving utilization of the underlying resources. Additionally, the Run:ai agent and other monitoring tools send monitoring data to the Run:ai control plane.
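As a minimal sketch of how a fractional GPU request might look, the pod spec below routes a workload to the Run:ai scheduler and asks for half a GPU. The annotation key, scheduler name, pod name, and container image shown here are illustrative assumptions; consult the Run:ai documentation for the exact keys used by your platform version.

```yaml
# Hypothetical pod spec illustrating a Run:ai fractional GPU request.
apiVersion: v1
kind: Pod
metadata:
  name: frac-gpu-job            # illustrative name
  annotations:
    gpu-fraction: "0.5"         # request half of a single GPU (assumed annotation key)
spec:
  schedulerName: runai-scheduler  # hand the pod to the Run:ai scheduler, not the default one
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.05-py3  # example image
```

Because the Run:ai scheduler extends rather than replaces the default K8s scheduler, pods without the `schedulerName` field continue to be placed by the standard scheduler.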

Run:ai gives researchers and data scientists several ways to submit and interact with their workloads: the Run:ai CLI, YAML files sent directly to K8s, the K8s or Run:ai API, or the Run:ai researcher user interface (UI). The researcher UI supports workload submission from templates that prefill the required fields, giving administrators a streamlined process for onboarding new users to the platform. Templates delivered through the web UI make it simple to deploy Jupyter notebooks, and the UI lets researchers and data scientists connect directly to a notebook once it is running.
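As an illustration of the CLI path, the commands below sketch submitting a one-GPU job under a project quota. The job name, image, and project shown are hypothetical, and flag names can vary between Run:ai CLI versions, so treat this as a sketch rather than a definitive invocation.

```shell
# Submit a one-GPU training job under the "team-a" project (names are illustrative).
runai submit train-job \
    --image nvcr.io/nvidia/pytorch:23.05-py3 \
    --gpu 1 \
    --project team-a

# Check the job's status once the scheduler has placed it.
runai describe job train-job --project team-a
```

The `--project` flag is what ties the workload to a project quota, which the Run:ai scheduler uses when applying its business rules.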

The Run:ai Atlas platform gives IT teams the control and visibility they need into their AI and ML environment while abstracting away the complex underlying infrastructure, enabling researchers and data scientists to use the tools of their choice to drive innovation within the enterprise.

The following section details the preferred method for installing Run:ai on DGX BasePOD configurations.