Run:ai Install and Configuration

In the previous steps, we configured all of the prerequisites for a Run:ai self-hosted installation. The most up-to-date documentation can be found here: https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation

Run:ai Control Plane Install

Create the runai-backend namespace:

kubectl create ns runai-backend

Create a Kubernetes TLS secret that presents the certificate for secure communications:

kubectl create secret tls runai-backend-tls -n runai-backend \
  --cert ./fullchain.pem \
  --key ./private-key.pem
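
Before creating the secret, it can help to confirm that the certificate file really contains a chain. The following is a minimal sketch, assuming fullchain.pem is PEM-encoded and that a full chain means the server certificate followed by at least one intermediate. The helper name count_pem_certs is hypothetical and not part of Run:ai or kubectl.

```shell
# Hypothetical helper: count the PEM certificate blocks in a file.
count_pem_certs() {
  # grep -c prints the number of matching lines; "--" stops option parsing
  # because the pattern starts with dashes, and "|| true" keeps a
  # zero-match result from being reported as a failure.
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1" || true
}

# Example usage (assumes ./fullchain.pem exists in the current directory):
#   if [ "$(count_pem_certs ./fullchain.pem)" -lt 2 ]; then
#     echo "fullchain.pem holds fewer than two certificates" >&2
#   fi
```

A count of one usually means that only the server certificate is present, in which case some clients may fail to validate the endpoint.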

Create a Kubernetes docker-registry secret that will be used to pull the Run:ai images from the private repository:

export RUNAI_TOKEN=<JWT Token>
kubectl create secret docker-registry runai-reg-creds \
  --docker-server=https://runai.jfrog.io \
  --docker-username=self-hosted-image-puller-prod \
  --docker-password="$RUNAI_TOKEN" \
  --docker-email=support@run.ai \
  --namespace=runai-backend
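
A truncated or mis-pasted token is an easy mistake to make and only shows up later as image pull errors. As a rough sanity check, the sketch below tests whether a value has the shape of a signed compact JWT: three non-empty base64url segments separated by dots. It does not validate the token itself, and the helper name looks_like_jwt is hypothetical.

```shell
# Hypothetical helper: succeed only if the argument has the shape of a
# signed compact JWT (header.payload.signature, base64url characters only).
looks_like_jwt() {
  printf '%s\n' "$1" |
    grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}

# Example usage, run after exporting RUNAI_TOKEN:
#   looks_like_jwt "$RUNAI_TOKEN" || echo "RUNAI_TOKEN does not look like a JWT" >&2
```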

Using Helm, add the Run:ai Control Plane repository to the local Helm chart repositories, update the repository contents, and deploy the Run:ai Control Plane:

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update

helm upgrade -i runai-backend -n runai-backend \
  runai-backend/control-plane \
  --set global.domain=<YOUR_RUNAI_CONTROL_PLANE_URL> \
  --set global.management.user=<YOUR_RUN_AI_USERNAME> \
  --set global.management.password=<YOUR_RUN_AI_PASSWORD>

Wait for the deployment to complete. Then, using a browser, navigate to https://<YOUR_RUNAI_CONTROL_PLANE_URL> and log in with the initial credentials: User <YOUR_RUN_AI_USERNAME> (e.g. test@run.ai) and Password <YOUR_RUN_AI_PASSWORD>.
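
One way to wait for the deployment from the terminal is kubectl wait. The sketch below is an assumption about how you might script it rather than part of the official procedure. The helper name wait_for_deployments and the default timeout of 600s are hypothetical, and it only covers Deployments, so StatefulSets and Jobs in the namespace are not checked.

```shell
# Hypothetical helper: block until every Deployment in a namespace reports
# the Available condition, or until the timeout expires.
wait_for_deployments() {
  namespace=$1
  timeout=${2:-600s}   # assumed default; adjust to your environment
  kubectl wait deployment --all \
    --for=condition=Available \
    --namespace "$namespace" \
    --timeout="$timeout"
}

# Example usage (requires access to the cluster):
#   wait_for_deployments runai-backend 900s
```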

Run:ai Cluster Setup

Upon logging in to the GUI for the first time, you will be greeted with a wizard that walks you through the initial cluster deployment process. In the wizard, fill in the following:

  1. Cluster Name — A way to identify the cluster when working with multiple clusters.

  2. Run:ai Version — Select the default unless you need a particular version.

  3. Cluster Location: Same as control plane — Only select Remote Control Plane if the Control Plane and cluster are not hosted on the same server.

  4. Click Continue.

_images/physical-ai-runai-cluster-wizard.png

Figure 10 Run:ai cluster setup wizard showing cluster name, version, and location fields

A summary screen will appear next. Copy the command text to your clipboard and continue with the following steps.

Create the runai namespace:

kubectl create ns runai

Create the ingress secret for secure communications with the FQDN for HTTPS access:

kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert ./fullchain.pem \
  --key ./private-key.pem

Now paste the command from the clipboard into the terminal. It adds the Run:ai Cluster repository to the local Helm repositories, updates it to fetch the latest chart definitions, and then installs the Run:ai Cluster.

Note

Your secrets, URLs, and UIDs will be specific to your control plane.

helm repo add runai \
  https://runai.jfrog.io/artifactory/api/helm/run-ai-charts \
  --force-update
helm repo update

helm upgrade -i runai-cluster runai/runai-cluster -n runai \
  --set controlPlane.url=<YOUR_RUNAI_CONTROL_PLANE_URL> \
  --set controlPlane.clientSecret=<YOUR_CLIENT_SECRET> \
  --set cluster.uid=<YOUR_CLUSTER_UID> \
  --set cluster.url=<YOUR_RUNAI_CONTROL_PLANE_URL> \
  --version="2.22.52" \
  --create-namespace

Wait for the process to complete. In the Run:ai UI, click Done in the wizard; you will then be taken to the Clusters screen. Once the cluster shows a Status of “Connected”, the install is complete.
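
If the cluster does not reach “Connected”, looking at the pods in the runai namespace is a reasonable first step. The sketch below filters the output of kubectl get pods down to the pods that are not healthy. It assumes the default column layout (NAME, READY, STATUS, ...), and the helper name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: read "kubectl get pods --no-headers" output on stdin
# and print name, status, and ready count for every pod that is neither
# Completed nor Running with all of its containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "1/2"
    if ($3 == "Completed") next      # finished Job pods are fine
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}

# Example usage (requires access to the cluster):
#   kubectl get pods --namespace runai --no-headers | unhealthy_pods
```

An empty result means every pod is either Running with all containers ready or Completed.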

Run:ai Cluster — Optional Additional Components

Distributed Training Install

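If you need to run TensorFlow, PyTorch, XGBoost, MPI v2, or JAX distributed workloads, run these commands to install the Kubeflow Training Operator and MPI Operator, along with the necessary Custom Resource Definitions, into the cluster. The patch enables only the TensorFlow, PyTorch, XGBoost, and JAX schemes in the Training Operator; its MPIJob CRD is then deleted so that the MPI Operator can install its own v2beta1 version.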

kubectl apply --server-side -k \
  "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"

kubectl patch deployment training-operator -n kubeflow --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob","--enable-scheme=pytorchjob","--enable-scheme=xgboostjob","--enable-scheme=jaxjob"]}]'

kubectl delete crd mpijobs.kubeflow.org

kubectl apply --server-side -f \
  https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
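
To confirm that the Custom Resource Definitions are in place, you can compare the CRDs registered in the cluster with the ones you expect. The sketch below assumes the CRD names follow the usual plural.group pattern for the job kinds listed above (for example tfjobs.kubeflow.org); verify these names against your cluster. The helper name missing_crds is hypothetical.

```shell
# Assumed CRD names for the job kinds enabled above.
EXPECTED_CRDS="tfjobs.kubeflow.org pytorchjobs.kubeflow.org xgboostjobs.kubeflow.org jaxjobs.kubeflow.org mpijobs.kubeflow.org"

# Hypothetical helper: read "kubectl get crd -o name" output on stdin and
# print each expected CRD that is not present.
missing_crds() {
  installed=$(cat)
  for crd in $EXPECTED_CRDS; do
    # -x matches the whole line, -F treats the dots literally.
    printf '%s\n' "$installed" |
      grep -qxF "customresourcedefinition.apiextensions.k8s.io/$crd" ||
      printf '%s\n' "$crd"
  done
}

# Example usage (requires access to the cluster):
#   kubectl get crd -o name | missing_crds
```

No output means all of the expected CRDs were found.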

Inference Install

For inference, we will leverage Knative-Serving and Kourier. If you need a different networking option with Knative-Serving, refer to the Knative-Serving installation documentation for the available options.

Run these commands to install and configure Knative-Serving (CRDs, core components, and the HPA autoscaling extension) and Kourier:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.19.6/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.19.6/serving-core.yaml
kubectl apply -f https://github.com/knative-extensions/net-kourier/releases/download/knative-v1.19.5/kourier.yaml

kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.19.6/serving-hpa.yaml

kubectl patch configmap/config-autoscaler \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"enable-scale-to-zero":"true"}}' && \
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled","kubernetes.podspec-volumes-emptydir":"enabled","kubernetes.podspec-securitycontext":"enabled","kubernetes.containerspec-addcapabilities":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled","kubernetes.podspec-persistent-volume-write":"enabled","multi-container":"enabled","kubernetes.podspec-init-containers":"enabled","kubernetes.podspec-fieldref":"enabled"}}'
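
The config-features patch is a long single-line JSON string, which makes a missing or misspelled flag hard to spot. As an alternative, the sketch below builds the same JSON from a list with one flag per line. The flag names are copied from the command above; the variable FEATURE_FLAGS and the helper build_features_patch are hypothetical names.

```shell
# Feature flags copied from the patch command above, one per line.
FEATURE_FLAGS="
kubernetes.podspec-schedulername
kubernetes.podspec-nodeselector
kubernetes.podspec-affinity
kubernetes.podspec-tolerations
kubernetes.podspec-volumes-emptydir
kubernetes.podspec-securitycontext
kubernetes.containerspec-addcapabilities
kubernetes.podspec-persistent-volume-claim
kubernetes.podspec-persistent-volume-write
multi-container
kubernetes.podspec-init-containers
kubernetes.podspec-fieldref
"

# Hypothetical helper: print a merge patch that sets every flag passed as
# an argument to "enabled".
build_features_patch() {
  body=""
  for flag in "$@"; do
    # Prefix a comma for every entry after the first.
    body="${body}${body:+,}\"${flag}\":\"enabled\""
  done
  printf '{"data":{%s}}\n' "$body"
}

# Example usage (requires access to the cluster). $FEATURE_FLAGS is left
# unquoted on purpose so that each flag becomes a separate argument:
#   kubectl patch configmap/config-features \
#     --namespace knative-serving \
#     --type merge \
#     --patch "$(build_features_patch $FEATURE_FLAGS)"
```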

To enable external access to the inference endpoints, configure Knative-Serving to use your domain by running this command:

# Replace <runai-inference.mycorp.local> with your FQDN for Inference
kubectl patch configmap/config-domain \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"<runai-inference.mycorp.local>":""}}'

Next, retrieve the external IP address of the Kourier ingress service:

kubectl --namespace kourier-system get service kourier

Finally, set up a wildcard DNS record that points to the IP address obtained in the previous step. The DNS record should use the following format: *.<k8s-namespace>.<runai-inference.mycorp.local>. You may need to defer this step until you have identified the namespace that will host the workloads.
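
The two values needed for the DNS record can also be produced from the terminal. This is a minimal sketch, assuming the Kourier service is of type LoadBalancer and reports an IP address rather than a hostname. The helper names wildcard_record and kourier_external_ip, and the namespace runai-team-a in the usage line, are hypothetical.

```shell
# Hypothetical helper: print the wildcard DNS record name for a workload
# namespace ($1) and an inference domain ($2).
wildcard_record() {
  printf '*.%s.%s\n' "$1" "$2"
}

# Hypothetical helper: print only the external IP of the Kourier service.
kourier_external_ip() {
  kubectl --namespace kourier-system get service kourier \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Example usage (requires access to the cluster):
#   echo "$(wildcard_record runai-team-a runai-inference.mycorp.local) -> $(kourier_external_ip)"
```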

Distributed Inference Setup

If you need distributed inference, run these commands to install Leader Worker Set (LWS) and its CRDs.

CHART_VERSION=0.6.2
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=$CHART_VERSION \
  --namespace lws-system \
  --create-namespace \
  --wait --timeout 300s

User Access

The default Admin account should not be used for general access to the Run:ai system. Create individual accounts for the local administrators who need access by following these instructions:

https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/authentication/users#creating-a-local-user

Run:ai can also integrate with an identity provider via either OIDC or SAML 2.0. Refer to the official documentation to configure it:

https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/authentication/sso

Run:ai Install Complete

The Run:ai installation is now complete. The Run:ai Cluster and its associated Run:ai Projects will be configured once OSMO is installed, as covered in the OSMO installation section.