GPU Operator with MIG

About Multi-Instance GPU
Enabling MIG During Installation
Configuring MIG Profiles
Reconfiguring MIG Profiles
Verification: Running Sample CUDA Workloads
- CUDA VectorAdd
- Concurrent Job Launch
Disabling MIG
MIG Manager with Preinstalled Drivers
- Install
- Managing Host GPU Clients
Architecture

About Multi-Instance GPU

Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into separate GPU Instances for CUDA applications. Refer to the MIG User Guide for more details on MIG.

This document provides an overview of how to use the GPU Operator with nodes that support MIG.

Enabling MIG During Installation

In this example workflow, we start with a MIG strategy of single. The mixed strategy can also be specified and used in a similar manner.

Note

In a CSP IaaS environment such as Google Cloud, ensure that the mig-manager variable WITH_REBOOT is set to “true”. Refer to the note in the MIG User Guide for more information on the constraints with enabling MIG mode.

We can use the following option to install the GPU Operator:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set mig.strategy=single

Note

mig.strategy should be set to mixed when MIG mode is not enabled on all GPUs on a node.

Note

Starting with v1.9, MIG Manager supports preinstalled drivers. If drivers are preinstalled, add the option --set driver.enabled=false during installation. See MIG Manager with Preinstalled Drivers for more details.

At this point, all the pods, including nvidia-mig-manager, are deployed on nodes that have MIG-capable GPUs:

$ kubectl get pods -n gpu-operator

NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-operator-d6ccd4d8d-9cgzr                                  1/1     Running     2          6m58s
gpu-operator-node-feature-discovery-master-867c4f7bfb-4nlq7   1/1     Running     0          6m58s
gpu-operator-node-feature-discovery-worker-6rvr2              1/1     Running     1          6m58s
gpu-feature-discovery-sclxr                                   1/1     Running     0          6m39s
nvidia-container-toolkit-daemonset-tnh82                      1/1     Running     0          6m39s
nvidia-cuda-validator-qt6wq                                   0/1     Completed   0          3m11s
nvidia-dcgm-exporter-dh46q                                    1/1     Running     0          6m39s
nvidia-device-plugin-daemonset-t6qkz                          1/1     Running     0          6m39s
nvidia-device-plugin-validator-sd5f7                          0/1     Completed   0          105s
nvidia-driver-daemonset-f7ktr                                 1/1     Running     0          6m40s
nvidia-mig-manager-gzg8n                                      1/1     Running     0          79s
nvidia-operator-validator-vsccj                               1/1     Running     0          6m39s
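
Scanning the pod list by eye becomes tedious on larger clusters. The following is a minimal sketch of a filter for the table above. The function name unhealthy_pods is hypothetical, and it assumes the single-namespace column layout shown here (NAME, READY, STATUS, ...), so it does not apply to output produced with --all-namespaces.

```shell
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# prints "NAME STATUS" for every pod that is neither Running nor Completed.
# Exits with status 1 if at least one such pod is found, 0 otherwise.
unhealthy_pods() {
  awk 'NR > 1 && NF >= 3 && $3 != "Running" && $3 != "Completed" {
         print $1, $3
         bad = 1
       }
       END { exit bad }'
}
```

For example, `kubectl get pods -n gpu-operator | unhealthy_pods` prints nothing and returns 0 when every pod is in one of the two expected states.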

You can also check the labels applied to the node:

$ kubectl get node -o json | jq '.items[].metadata.labels'

"nvidia.com/cuda.driver.major": "460",
"nvidia.com/cuda.driver.minor": "73",
"nvidia.com/cuda.driver.rev": "01",
"nvidia.com/cuda.runtime.major": "11",
"nvidia.com/cuda.runtime.minor": "2",
"nvidia.com/gfd.timestamp": "1621375725",
"nvidia.com/gpu.compute.major": "8",
"nvidia.com/gpu.compute.minor": "0",
"nvidia.com/gpu.count": "1",
"nvidia.com/gpu.deploy.container-toolkit": "true",
"nvidia.com/gpu.deploy.dcgm-exporter": "true",
"nvidia.com/gpu.deploy.device-plugin": "true",
"nvidia.com/gpu.deploy.driver": "true",
"nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
"nvidia.com/gpu.deploy.mig-manager": "true",
"nvidia.com/gpu.deploy.operator-validator": "true",
"nvidia.com/gpu.family": "ampere",
"nvidia.com/gpu.machine": "Google-Compute-Engine",
"nvidia.com/gpu.memory": "40536",
"nvidia.com/gpu.present": "true",
"nvidia.com/gpu.product": "A100-SXM4-40GB",
"nvidia.com/mig.strategy": "single"

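When only one label matters, it can be read directly instead of dumping the whole label set. The sketch below uses kubectl's JSONPath output, in which dots inside a label key have to be escaped with a backslash. The function name node_label is hypothetical and not part of the GPU Operator.

```shell
# Hypothetical helper: print the value of a single label on a node.
# Usage: node_label <node-name> <label-key>
node_label() {
  local node=$1
  local key
  # Escape every "." in the key so JSONPath treats it as part of the name.
  key=$(printf '%s' "$2" | sed 's/\./\\./g')
  kubectl get node "$node" -o "jsonpath={.metadata.labels.${key}}"
}
```

For example, `node_label $NODE nvidia.com/mig.strategy` would be expected to print single for the node shown above.
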
Warning

The MIG Manager currently requires that all user workloads on the GPUs being configured be stopped. In some cases, the node may need to be rebooted (especially in CSP IaaS environments), so the node may need to be cordoned before changing the MIG mode or the MIG geometry on the GPUs.

This requirement may be relaxed in future releases.

Configuring MIG Profiles

Now, let’s configure the GPU into a supported MIG geometry by setting the mig.config label on the GPU node.

Note

The mig-manager daemonset reads the supported MIG profiles from a ConfigMap called mig-parted-config in the GPU Operator namespace. Refer to this ConfigMap for the values you can use in the label below, or modify the ConfigMap to suit your use case.
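
To see which values are valid for the label, it helps to list the configuration names defined in that file. The following is a minimal sketch. The sample configuration is an assumption modeled on the mig-parted file layout (a top-level mig-configs map keyed by configuration name), and the actual contents of the ConfigMap in your cluster may differ. Both function names are hypothetical.

```shell
# Illustrative sample only; the real ConfigMap may define other entries.
sample_mig_config() {
  cat <<'EOF'
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF
}

# Hypothetical helper: read a mig-parted style config on stdin and print
# the names defined directly under the top-level "mig-configs:" key.
list_mig_configs() {
  awk '
    /^mig-configs:/ { in_section = 1; next }
    /^[^ #]/        { in_section = 0 }
    in_section && /^  [^ -][^:]*:/ {
      name = $1
      sub(/:$/, "", name)
      print name
    }
  '
}
```

Running `sample_mig_config | list_mig_configs` prints the three names in the sample, one per line. Each name is a candidate value for the nvidia.com/mig.config label.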

In this example, we use the 1g.5gb profile:

$ kubectl label nodes $NODE nvidia.com/mig.config=all-1g.5gb

The MIG manager applies a mig.config.state label to the node and then terminates all the GPU pods in preparation for enabling MIG mode and configuring the GPU into the desired MIG geometry:

"nvidia.com/mig.config": "all-1g.5gb",
"nvidia.com/mig.config.state": "pending"

kube-system              kube-scheduler-a100-mig-k8s                                   1/1     Running       1          45m
gpu-operator             nvidia-dcgm-exporter-dh46q                                    1/1     Terminating   0          13m
gpu-operator             gpu-feature-discovery-sclxr                                   1/1     Terminating   0          13m
gpu-operator             nvidia-device-plugin-daemonset-t6qkz                          1/1     Terminating   0          13m

Note

As described above, if the WITH_REBOOT option is set then the MIG manager will proceed to reboot the node:

"nvidia.com/mig.config": "all-1g.5gb",
"nvidia.com/mig.config.state": "rebooting"

Once the MIG manager has completed applying the configuration changes (including a node reboot if required), the node labels should appear as shown below:

"nvidia.com/cuda.driver.major": "460",
"nvidia.com/cuda.driver.minor": "73",
"nvidia.com/cuda.driver.rev": "01",
"nvidia.com/cuda.runtime.major": "11",
"nvidia.com/cuda.runtime.minor": "2",
"nvidia.com/gfd.timestamp": "1621442537",
"nvidia.com/gpu.compute.major": "8",
"nvidia.com/gpu.compute.minor": "0",
"nvidia.com/gpu.count": "7",
"nvidia.com/gpu.deploy.container-toolkit": "true",
"nvidia.com/gpu.deploy.dcgm-exporter": "true",
"nvidia.com/gpu.deploy.device-plugin": "true",
"nvidia.com/gpu.deploy.driver": "true",
"nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
"nvidia.com/gpu.deploy.mig-manager": "true",
"nvidia.com/gpu.deploy.operator-validator": "true",
"nvidia.com/gpu.engines.copy": "1",
"nvidia.com/gpu.engines.decoder": "0",
"nvidia.com/gpu.engines.encoder": "0",
"nvidia.com/gpu.engines.jpeg": "0",
"nvidia.com/gpu.engines.ofa": "0",
"nvidia.com/gpu.family": "ampere",
"nvidia.com/gpu.machine": "Google-Compute-Engine",
"nvidia.com/gpu.memory": "4864",
"nvidia.com/gpu.multiprocessors": "14",
"nvidia.com/gpu.present": "true",
"nvidia.com/gpu.product": "A100-SXM4-40GB-MIG-1g.5gb",
"nvidia.com/gpu.slices.ci": "1",
"nvidia.com/gpu.slices.gi": "1",
"nvidia.com/mig.config": "all-1g.5gb",
"nvidia.com/mig.config.state": "success",
"nvidia.com/mig.strategy": "single"

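Because the reconfiguration can take a while, and may include a reboot, scripts usually need to wait for the state label to settle. The following is a minimal polling sketch. The function name wait_for_mig_config and its timeout arguments are hypothetical, and it assumes the failure state is spelled failed, by analogy with the success value shown above.

```shell
# Hypothetical helper: poll the nvidia.com/mig.config.state label on a node.
# Usage: wait_for_mig_config <node-name> [timeout-seconds] [interval-seconds]
# Returns 0 on "success", 1 on "failed" (assumed spelling), 2 on timeout.
wait_for_mig_config() {
  local node=$1
  local timeout=${2:-600}
  local interval=${3:-5}
  local elapsed=0
  local state=""
  while [ "$elapsed" -le "$timeout" ]; do
    # Errors are ignored because the API server may be unreachable mid-reboot.
    state=$(kubectl get node "$node" \
      -o 'jsonpath={.metadata.labels.nvidia\.com/mig\.config\.state}' 2>/dev/null)
    case $state in
      success) echo "MIG configuration applied on $node"; return 0 ;;
      failed)  echo "MIG configuration failed on $node" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for $node (last state: ${state:-unknown})" >&2
  return 2
}
```

A typical call would be `wait_for_mig_config $NODE 900` directly after the kubectl label command.
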
The labels gpu.count and gpu.slices indicate that the devices are configured. We can also run nvidia-smi in the driver container to verify that the GPU has been configured:

$ sudo docker exec 629b93e200d9eea35be35a1b30991d007e48497d52a38e18a472945e44e52a8e nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-5c89852c-d268-c3f3-1b07-005d5ae1dc3f/14/0)
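
The listing above can also be checked mechanically, for instance to compare the number of MIG devices with the gpu.count label. The following is a small sketch. The function name count_mig_devices is hypothetical, and it assumes the `nvidia-smi -L` line format shown here, where each MIG device line starts with the word MIG.

```shell
# Hypothetical helper: count MIG device lines in `nvidia-smi -L` output
# read from stdin. Prints 0 (and still succeeds) when there are none.
count_mig_devices() {
  grep -c '^[[:space:]]*MIG ' || true
}
```

Piping the docker exec command above into count_mig_devices should print 7 for the all-1g.5gb configuration on this GPU.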

Finally, verify that the GPU Operator pods are in running state:

NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-operator-d6ccd4d8d-hhhq4                                  1/1     Running     4          38m
gpu-operator-node-feature-discovery-master-867c4f7bfb-jt95x   1/1     Running     1          38m
gpu-operator-node-feature-discovery-worker-rjpfb              1/1     Running     3          38m
gpu-feature-discovery-drzft                                   1/1     Running     0          97s
nvidia-container-toolkit-daemonset-885b5                      1/1     Running     1          38m
nvidia-cuda-validator-kh4tv                                   0/1     Completed   0          94s
nvidia-dcgm-exporter-6d5kd                                    1/1     Running     0          97s
nvidia-device-plugin-daemonset-kspv5                          1/1     Running     0          97s
nvidia-device-plugin-validator-mpgv9                          0/1     Completed   0          83s
nvidia-driver-daemonset-mgmdb                                 1/1     Running     3          38m
nvidia-mig-manager-svv7b                                      1/1     Running     1          35m
nvidia-operator-validator-w44q8                               1/1     Running     0          97s

Reconfiguring MIG Profiles

The MIG manager supports dynamic reconfiguration of the MIG geometry. In this example, let’s reconfigure the GPU into a 3g.20gb profile:

$ kubectl label nodes $NODE nvidia.com/mig.config=all-3g.20gb --overwrite

We can see from the logs of the MIG manager that it has reconfigured the GPU into the new MIG geometry:

Applying the selected MIG config to the node
time="2021-05-19T16:42:14Z" level=debug msg="Parsing config file..."
time="2021-05-19T16:42:14Z" level=debug msg="Selecting specific MIG config..."
time="2021-05-19T16:42:14Z" level=debug msg="Running apply-start hook"
time="2021-05-19T16:42:14Z" level=debug msg="Checking current MIG mode..."
time="2021-05-19T16:42:14Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2021-05-19T16:42:14Z" level=debug msg="  GPU 0: 0x20B010DE"
time="2021-05-19T16:42:14Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2021-05-19T16:42:14Z" level=debug msg="    MIG capable: true\n"
time="2021-05-19T16:42:14Z" level=debug msg="    Current MIG mode: Enabled"
time="2021-05-19T16:42:14Z" level=debug msg="Checking current MIG device configuration..."
time="2021-05-19T16:42:14Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2021-05-19T16:42:14Z" level=debug msg="  GPU 0: 0x20B010DE"
time="2021-05-19T16:42:14Z" level=debug msg="    Asserting MIG config: map[1g.5gb:7]"
time="2021-05-19T16:42:14Z" level=debug msg="Running pre-apply-config hook"
time="2021-05-19T16:42:14Z" level=debug msg="Applying MIG device configuration..."
time="2021-05-19T16:42:14Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2021-05-19T16:42:14Z" level=debug msg="  GPU 0: 0x20B010DE"
time="2021-05-19T16:42:14Z" level=debug msg="    MIG capable: true\n"
time="2021-05-19T16:42:14Z" level=debug msg="    Updating MIG config: map[1g.5gb:7]"
time="2021-05-19T16:42:14Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting all GPU clients previouly shutdown by reenabling their component-specific nodeSelector labels
node/pramarao-a100-mig-k8s labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'

And the node labels have been updated appropriately:

"nvidia.com/gpu.product": "A100-SXM4-40GB-MIG-3g.20gb",
"nvidia.com/gpu.slices.ci": "3",
"nvidia.com/gpu.slices.gi": "3",
"nvidia.com/mig.config": "all-3g.20gb",

Verification: Running Sample CUDA Workloads

CUDA VectorAdd

Let’s run a simple CUDA sample, vectorAdd, by requesting a GPU resource as you normally would in Kubernetes. Kubernetes schedules the pod on a single MIG device, and a nodeSelector directs the pod to the node with the MIG devices. The nodeSelector value must match the node's current nvidia.com/gpu.product label; the manifest below assumes the GPU is configured with the 1g.5gb profile.

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
EOF
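
Once the pod has been created, its result can be checked from the logs. The sketch below waits for the pod to reach the Succeeded phase and then looks for the Test PASSED line that the vectorAdd sample prints. The function name vectoradd_passed is hypothetical, and the `--for=jsonpath` form of kubectl wait assumes a reasonably recent kubectl.

```shell
# Hypothetical helper: succeed only if the pod finished and its log
# contains "Test PASSED". Usage: vectoradd_passed [pod-name]
vectoradd_passed() {
  local pod=${1:-cuda-vectoradd}
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      "pod/$pod" --timeout=120s >/dev/null &&
    kubectl logs "$pod" | grep -q 'Test PASSED'
}
```

For example, `vectoradd_passed && echo OK` prints OK once the sample has run to completion on a MIG device.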

Concurrent Job Launch

Now, let’s try a more complex example that uses Argo Workflows to launch concurrent jobs on MIG devices. In this example, the A100 has been configured into 2 MIG devices using the 3g.20gb profile.

First, install the Argo Workflows components into your Kubernetes cluster.

$ kubectl create ns argo \
    && kubectl apply -n argo \
    -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml

Next, download the latest Argo CLI from the releases page and follow the instructions to install the binary.

Now, we will craft an Argo example that launches multiple CUDA containers onto the MIG devices on the GPU. We will reuse the same vectorAdd example from before. Here is the job description, saved as vector-add.yaml:

$ cat << EOF > vector-add.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argo-mig-example-
spec:
  entrypoint: argo-mig-result-example
  templates:
  - name: argo-mig-result-example
    steps:
    - - name: generate
        template: gen-mig-device-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: argo-mig
        template: argo-mig
        arguments:
          parameters:
          - name: argo-mig
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-mig-device-list
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, 2)], sys.stdout)

  - name: argo-mig
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
    inputs:
      parameters:
      - name: argo-mig
    container:
      image: nvidia/samples:vectoradd-cuda11.2.1
      resources:
        limits:
          nvidia.com/gpu: 1
    nodeSelector:
      nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb
EOF

Launch the workflow:

$ argo submit -n argo --watch vector-add.yaml

Argo will print out the pods that have been launched:

Name:                argo-mig-example-z6mqd
Namespace:           argo
ServiceAccount:      default
Status:              Succeeded
Conditions:
Completed           True
Created:             Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Started:             Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Finished:            Wed Mar 24 14:45:11 -0700 (now)
Duration:            20 seconds
Progress:            3/3
ResourcesDuration:   9s*(1 cpu),9s*(100Mi memory),1s*(1 nvidia.com/gpu)

STEP                       TEMPLATE                 PODNAME                           DURATION  MESSAGE
✔ argo-mig-example-z6mqd  argo-mig-result-example
├───✔ generate            gen-mig-device-list      argo-mig-example-z6mqd-562792713  8s
└─┬─✔ argo-mig(0:0)(0)    argo-mig                 argo-mig-example-z6mqd-845918106  2s
  └─✔ argo-mig(1:1)(0)    argo-mig                 argo-mig-example-z6mqd-870679174  2s

If you observe the logs, you can see that the vector-add sample has completed on both devices:

$ argo logs -n argo @latest

argo-mig-example-z6mqd-562792713: [0, 1]
argo-mig-example-z6mqd-870679174: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-870679174: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-870679174: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-870679174: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-870679174: Test PASSED
argo-mig-example-z6mqd-870679174: Done
argo-mig-example-z6mqd-845918106: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-845918106: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-845918106: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-845918106: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-845918106: Test PASSED
argo-mig-example-z6mqd-845918106: Done
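
To confirm that every job passed without reading the whole log, the output can be summarized. The following is a minimal sketch that counts the distinct pods reporting Test PASSED. The function name count_passed_pods is hypothetical, and it assumes the `<pod-name>: <message>` line format shown above.

```shell
# Hypothetical helper: read `argo logs` output on stdin and print the
# number of distinct pods that logged "Test PASSED".
count_passed_pods() {
  awk -F': ' '/Test PASSED/ && !seen[$1]++ { n++ }
              END { print n + 0 }'
}
```

For the run above, `argo logs -n argo @latest | count_passed_pods` should print 2, one for each 3g.20gb MIG device.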

Disabling MIG

You can disable MIG on a node by setting the nvidia.com/mig.config label to all-disabled:

$ kubectl label nodes $NODE nvidia.com/mig.config=all-disabled --overwrite

MIG Manager with Preinstalled Drivers

Starting with v1.9, MIG Manager supports preinstalled drivers. Everything detailed in this document still applies; however, there are a few additional details to consider.

Install

During GPU Operator installation, driver.enabled=false must be set. The following options can be used to install the GPU Operator:

$ helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false

Managing Host GPU Clients

The MIG Manager stops all operator-managed pods that have access to GPUs when applying a MIG reconfiguration. When drivers are preinstalled, there may be GPU clients on the host that also need to be stopped.

When drivers are preinstalled, the MIG Manager tries to stop and restart a list of systemd services on the host across a MIG reconfiguration. The list of services is specified in a ConfigMap provided to the MIG Manager daemonset. By default, the GPU Operator creates a ConfigMap, named default-gpu-clients, containing a default list of systemd services.

Below is a sample GPU clients file, clients.yaml, used when creating the default-gpu-clients ConfigMap:

version: v1
systemd-services:
  - nvsm.service
  - nvsm-mqtt.service
  - nvsm-core.service
  - nvsm-api-gateway.service
  - nvsm-notifier.service
  - nv_peer_mem.service
  - nvidia-dcgm.service
  - dcgm.service
  - dcgm-exporter.service

In the future, the GPU clients file will be extended to allow specifying more than just systemd services.

The user may modify the default list by directly editing the default-gpu-clients ConfigMap post-install. The user can also create their own custom ConfigMap to be used by the MIG Manager by performing the following steps:

Create the gpu-operator namespace:

$ kubectl create namespace gpu-operator

Create a ConfigMap containing the custom clients.yaml file with a list of GPU clients:

$ kubectl create configmap -n gpu-operator gpu-clients --from-file=clients.yaml

Install the GPU Operator:

$ helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set migManager.gpuClientsConfig.name=gpu-clients \
    --set driver.enabled=false
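
The custom clients.yaml used in the steps above can be written by hand or generated. The following is a minimal sketch that produces a file in the same layout as the sample GPU clients file shown earlier. The function name write_gpu_clients is hypothetical, and the service names you pass in depend on what actually runs on your hosts.

```shell
# Hypothetical helper: write a GPU clients file listing systemd services.
# Usage: write_gpu_clients <output-file> <service> [<service> ...]
write_gpu_clients() {
  local file=$1
  local svc
  shift
  {
    echo 'version: v1'
    echo 'systemd-services:'
    for svc in "$@"; do
      printf '  - %s\n' "$svc"
    done
  } > "$file"
}
```

For example, `write_gpu_clients clients.yaml nvsm.service dcgm.service` creates a two-entry file that can then be passed to the kubectl create configmap command with --from-file=clients.yaml.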

Architecture

The MIG manager is designed as a controller within Kubernetes. It watches for changes to the nvidia.com/mig.config label on the node and then applies the user-requested MIG configuration. When the label changes, the MIG Manager first stops all GPU pods (including the device plugin, GPU Feature Discovery, and dcgm-exporter). If drivers are preinstalled, it then stops all host GPU clients listed in the clients.yaml file of the GPU clients ConfigMap. Finally, it applies the MIG reconfiguration and restarts the GPU pods (and any host GPU clients it stopped). The MIG reconfiguration may also involve a node reboot if one is required to enable MIG mode.

The available MIG profiles are specified in a ConfigMap provided to the MIG manager daemonset. The user may choose one of these profiles as the value of the mig.config label to trigger a reconfiguration of the MIG geometry.

The MIG manager relies on the mig-parted tool to apply the configuration changes to the GPU, including enabling MIG mode (with a node reboot as required by some scenarios).

flowchart
  subgraph mig[MIG Manager]
    direction TB
    A[Controller] <--> B[MIG-Parted]
  end
  A -- on change --> C
  subgraph recon[Reconfiguration]
    C["Config is Pending or Rebooting"] --> D["Stop Operator Pods"] --> E["Enable MIG Mode and Reboot if Required"] --> F["Use mig-parted to Configure MIG Geometry"] --> G["Restart Operator Pods"]
  end
  H["Set mig.config label to Success"]
  I["Set mig.config label to Failed"]
  G --> H
  G -- on failure --> I