NVIDIA KAI Scheduler Integration Guide#

KAI Scheduler is an open-source, Kubernetes-native scheduler for AI workloads at large scale. To use the KAI Scheduler for NVCF workloads, apply the following configuration after installing the KAI Scheduler in the cluster and enabling Optimized AI Workload Scheduling on the cluster. Once these cluster configuration changes are in place, deployed NVCF workloads are automatically bin-packed.

KAI Scheduler Installation

Note

Upgrading to the latest KAI Scheduler release is recommended to pick up the latest fixes and security patches.

Installation#
helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version v0.12.6
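
To confirm the release installed cleanly, you can check the Helm release and the scheduler pods in the kai-scheduler namespace. This is a minimal verification sketch; the exact pod names vary with the chart version:

helm list -n kai-scheduler
kubectl get pods -n kai-scheduler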

KAI Scheduler Configuration

nvcf-kai-scheduler-config.yaml#
# Parent queue at the top of the queue hierarchy
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
    name: default-parent-queue
spec:
    resources:
        cpu:
            quota: -1
            limit: -1
            overQuotaWeight: 1
        gpu:
            quota: -1
            limit: -1
            overQuotaWeight: 1
        memory:
            quota: -1
            limit: -1
            overQuotaWeight: 1
---
# Child queue attached to the parent queue above
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
    name: default-queue
spec:
    parentQueue: default-parent-queue
    resources:
        cpu:
            quota: -1
            limit: -1
            overQuotaWeight: 1
        gpu:
            quota: -1
            limit: -1
            overQuotaWeight: 1
        memory:
            quota: -1
            limit: -1
            overQuotaWeight: 1

Save the configuration template to nvcf-kai-scheduler-config.yaml and apply it with:

kubectl create -f nvcf-kai-scheduler-config.yaml
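
To verify that both queues were created, list the Queue resources by their full CRD name. This is a quick sanity check and assumes the scheduling.run.ai CRDs were installed along with the KAI Scheduler chart:

kubectl get queues.scheduling.run.ai

Once this configuration is in place, deployed NVCF workloads are bin-packed automatically, so no per-workload changes should be required. For reference only, a pod scheduled directly through KAI Scheduler would typically set schedulerName: kai-scheduler and a queue label on the pod; the kai.scheduler/queue label key below is taken from the KAI Scheduler quickstart and may differ between releases, so treat this manifest as an illustrative sketch rather than a required step:

apiVersion: v1
kind: Pod
metadata:
    name: example-workload                     # illustrative name only
    labels:
        kai.scheduler/queue: default-queue     # assumed label key; confirm against your KAI Scheduler release
spec:
    schedulerName: kai-scheduler               # hand scheduling of this pod to KAI Scheduler
    containers:
        - name: main
          image: nginx
          resources:
              requests:
                  cpu: "1"
                  memory: 1Gi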