MIG Support in Kubernetes
This document provides steps on getting started and running some example CUDA workloads on MIG-enabled GPUs in a Kubernetes cluster.
Software Prerequisites
The deployment workflow requires the prerequisites listed below. Once these prerequisites have been met, you can proceed to deploy a MIG-capable version of the NVIDIA k8s-device-plugin and the gpu-feature-discovery component in your cluster, so that Kubernetes can schedule pods on the available MIG devices.
- You already have a Kubernetes deployment up and running with access to at least one NVIDIA A100 GPU.
- The node with the NVIDIA A100 GPU is running the following versions of NVIDIA software:
  - NVIDIA datacenter driver >= 450.80.02
  - NVIDIA Container Toolkit (nvidia-docker2) >= 2.5.0 (and corresponding libnvidia-container >= 1.3.3)
  - NVIDIA k8s-device-plugin: v0.7.0+
  - NVIDIA gpu-feature-discovery: v0.2.0+
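You can check that the node meets the driver requirement with nvidia-smi, for example:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader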
Getting Started
Install Kubernetes
As a first step, ensure that you have a Kubernetes deployment set up with a control plane and nodes joined to the cluster. Follow the Install Kubernetes guide for getting started with setting up a Kubernetes cluster.
Configuration Strategy
The NVIDIA device plugin supports three MIG strategies that control how MIG devices are exposed to Kubernetes: none (MIG devices are ignored), single (all GPUs on a node expose a single type of MIG device, enumerated under the usual nvidia.com/gpu resource), and mixed (every MIG profile in use is enumerated as its own resource type, e.g. nvidia.com/mig-1g.5gb). The examples in this document use the single strategy.
Setting up MIG Geometry
You can use NVML (or its command-line interface nvidia-smi) to configure the desired MIG geometry. For automation, we recommend tooling such as mig-parted, which allows configuring MIG mode and creating the desired profiles on the GPUs.
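For reference, the equivalent manual flow with nvidia-smi would be to enable MIG mode on the GPU (this may require a GPU reset) and then create the GPU instances along with their compute instances (the -C flag). A sketch using profile ID 19, which corresponds to 1g.5gb in the listing below:
$ sudo nvidia-smi -i 0 -mig 1
$ sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C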
In this step, let’s use mig-parted to configure the A100 into seven MIG devices (using the 1g.5gb profile):
$ sudo nvidia-mig-parted apply -f config.yaml -c all-1g.5gb
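The command above references a config.yaml file. A minimal sketch of what it might contain, following mig-parted’s configuration file format (the all-1g.5gb label matches the -c argument above):
version: v1
mig-configs:
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7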
Now, the A100 should be configured into 7 MIG devices:
$ sudo nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                     ID       ID         Start:Size |
|====================================================|
|    0  MIG 1g.5gb    19       7          0:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       8          1:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       9          2:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       10         3:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       11         4:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       12         5:1        |
+----------------------------------------------------+
|    0  MIG 1g.5gb    19       13         6:1        |
+----------------------------------------------------+
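The corresponding compute instances (which mig-parted creates automatically) can be listed similarly:
$ sudo nvidia-smi mig -lci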
Deploying the NVIDIA Device Plugin and GFD
NVIDIA Device Plugin
Depending on the MIG configuration strategy used for the cluster, deploy the NVIDIA device plugin with the appropriate options. In this example, we assume that the user has chosen the single MIG strategy for the cluster.
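If the device plugin’s Helm repository has not been added yet, add and update it first:
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update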
$ helm install \
--generate-name \
--set migStrategy=single \
nvdp/nvidia-device-plugin
At this point, the nvidia-device-plugin DaemonSet should be deployed and should have enumerated the MIG devices to Kubernetes:
2021/04/26 23:19:15 Loading NVML
2021/04/26 23:19:15 Starting FS watcher.
2021/04/26 23:19:15 Starting OS watcher.
2021/04/26 23:19:15 Retreiving plugins.
2021/04/26 23:19:16 Starting GRPC server for 'nvidia.com/gpu'
2021/04/26 23:19:16 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/04/26 23:19:16 Registered device plugin for 'nvidia.com/gpu' with Kubelet
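You can also confirm that all seven MIG devices are advertised as allocatable nvidia.com/gpu resources on the node (replace <node-name> with the name of your GPU node):
$ kubectl describe node <node-name> | grep nvidia.com/gpu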
GPU Feature Discovery
Next, we deploy the GPU Feature Discovery (GFD) plugin to label the GPU nodes so that users can select specific MIG devices as resources in their pod spec. Note that the GFD Helm chart also deploys Node Feature Discovery (NFD) as a prerequisite:
$ helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery \
&& helm repo update
$ helm install \
--generate-name \
--set migStrategy=single \
nvgfd/gpu-feature-discovery
At this point, we can verify that all pods are running:
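$ kubectl get pods -A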
NAMESPACE                NAME                                       READY   STATUS    RESTARTS   AGE
kube-system              calico-kube-controllers-6d8ccdbf46-wjst8   1/1     Running   1          4h58m
kube-system              calico-node-qp5wf                          1/1     Running   1          4h58m
kube-system              coredns-558bd4d5db-c6nhk                   1/1     Running   1          4h59m
kube-system              coredns-558bd4d5db-cgjr7                   1/1     Running   1          4h59m
kube-system              etcd-ipp1-0552                             1/1     Running   1          5h
kube-system              kube-apiserver-ipp1-0552                   1/1     Running   1          5h
kube-system              kube-controller-manager-ipp1-0552          1/1     Running   1          5h
kube-system              kube-proxy-d7tqd                           1/1     Running   1          4h59m
kube-system              kube-scheduler-ipp1-0552                   1/1     Running   1          5h
kube-system              nvidia-device-plugin-1619479152-646qm      1/1     Running   0          115m
node-feature-discovery   gpu-feature-discovery-1619479450-f7rvv     1/1     Running   0          110m
node-feature-discovery   nfd-master-74f76f6c68-zgt9d                1/1     Running   0          110m
node-feature-discovery   nfd-worker-pkdn2                           1/1     Running   0          110m
We can also confirm that the node has been labeled:
$ kubectl get node -o json | jq '.items[].metadata.labels'
with labels:
...
"node-role.kubernetes.io/master": "",
"node.kubernetes.io/exclude-from-external-load-balancers": "",
"nvidia.com/cuda.driver.major": "460",
"nvidia.com/cuda.driver.minor": "73",
"nvidia.com/cuda.driver.rev": "01",
"nvidia.com/cuda.runtime.major": "11",
"nvidia.com/cuda.runtime.minor": "2",
"nvidia.com/gfd.timestamp": "1619479472",
"nvidia.com/gpu.compute.major": "8",
"nvidia.com/gpu.compute.minor": "0",
"nvidia.com/gpu.count": "7",
"nvidia.com/gpu.engines.copy": "1",
"nvidia.com/gpu.engines.decoder": "0",
"nvidia.com/gpu.engines.encoder": "0",
"nvidia.com/gpu.engines.jpeg": "0",
"nvidia.com/gpu.engines.ofa": "0",
"nvidia.com/gpu.family": "ampere",
"nvidia.com/gpu.machine": "SYS-1019GP-TT-02-NC24B",
"nvidia.com/gpu.memory": "4864",
"nvidia.com/gpu.multiprocessors": "14",
"nvidia.com/gpu.product": "A100-PCIE-40GB-MIG-1g.5gb",
"nvidia.com/gpu.slices.ci": "1",
"nvidia.com/gpu.slices.gi": "1",
"nvidia.com/mig.strategy": "single"
}
We can now proceed to run some sample workloads.
Running Sample CUDA Workloads
CUDA VectorAdd
Let’s run a simple CUDA sample, in this case vectorAdd, by requesting a GPU resource as you would normally do in Kubernetes. Kubernetes will schedule the pod on a single MIG device, and we use a nodeSelector to direct the pod to the node with the MIG devices.
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    # Must match the nvidia.com/gpu.product label reported on your node
    nvidia.com/gpu.product: A100-PCIE-40GB-MIG-1g.5gb
EOF
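Once the pod completes, its logs should show output similar to the following:
$ kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done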
Concurrent Job Launch
Now, let’s try a more complex example, using Argo Workflows to launch concurrent jobs on MIG devices. In this example, the A100 has been reconfigured into 2 MIG devices using the 3g.20gb profile.
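Assuming your mig-parted configuration file defines an all-3g.20gb label (analogous to the all-1g.5gb label used earlier), the GPU can be reconfigured with:
$ sudo nvidia-mig-parted apply -f config.yaml -c all-3g.20gb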
First, install the Argo Workflows components into your Kubernetes cluster.
$ kubectl create ns argo \
&& kubectl apply -n argo \
-f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml
Next, download the latest Argo CLI from the releases page and follow the instructions to install the binary.
Now, we will craft an Argo example that launches multiple CUDA containers onto the MIG devices on the GPU. We will reuse the same vectorAdd example from before. Here is the job description, saved as vector-add.yaml:
$ cat << EOF > vector-add.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argo-mig-example-
spec:
  entrypoint: argo-mig-result-example
  templates:
  - name: argo-mig-result-example
    steps:
    - - name: generate
        template: gen-mig-device-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: argo-mig
        template: argo-mig
        arguments:
          parameters:
          - name: argo-mig
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"
  # Generate a list of numbers in JSON format
  - name: gen-mig-device-list
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, 2)], sys.stdout)
  - name: argo-mig
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
    inputs:
      parameters:
      - name: argo-mig
    container:
      image: nvidia/samples:vectoradd-cuda11.2.1
      resources:
        limits:
          nvidia.com/gpu: 1
    nodeSelector:
      # Must match the nvidia.com/gpu.product label reported on your node
      nvidia.com/gpu.product: A100-PCIE-40GB-MIG-3g.20gb
EOF
Launch the workflow:
$ argo submit -n argo --watch vector-add.yaml
Argo will print out the pods that have been launched:
Name:                argo-mig-example-z6mqd
Namespace:           argo
ServiceAccount:      default
Status:              Succeeded
Conditions:
 Completed           True
Created:             Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Started:             Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Finished:            Wed Mar 24 14:45:11 -0700 (now)
Duration:            20 seconds
Progress:            3/3
ResourcesDuration:   9s*(1 cpu),9s*(100Mi memory),1s*(1 nvidia.com/gpu)

STEP                        TEMPLATE                 PODNAME                           DURATION  MESSAGE
 ✔ argo-mig-example-z6mqd   argo-mig-result-example
 ├───✔ generate             gen-mig-device-list      argo-mig-example-z6mqd-562792713  8s
 └─┬─✔ argo-mig(0:0)(0)     argo-mig                 argo-mig-example-z6mqd-845918106  2s
   └─✔ argo-mig(1:1)(0)     argo-mig                 argo-mig-example-z6mqd-870679174  2s
If you observe the logs, you can see that the vectorAdd sample has completed on both devices:
$ argo logs -n argo @latest
argo-mig-example-z6mqd-562792713: [0, 1]
argo-mig-example-z6mqd-870679174: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-870679174: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-870679174: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-870679174: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-870679174: Test PASSED
argo-mig-example-z6mqd-870679174: Done
argo-mig-example-z6mqd-845918106: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-845918106: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-845918106: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-845918106: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-845918106: Test PASSED
argo-mig-example-z6mqd-845918106: Done