GPU Operator with MIG#
About Multi-Instance GPU#
Multi-Instance GPU (MIG) enables GPUs based on the NVIDIA Ampere and later architectures, such as NVIDIA A100, to be partitioned into separate and secure GPU instances for CUDA applications. Refer to the MIG User Guide for more information about MIG.
GPU Operator deploys MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. You must enable MIG during installation by choosing a MIG strategy before you can configure MIG.
Refer to the architecture section for more information about how MIG is implemented in the GPU Operator.
Enabling MIG During Installation#
Use the following steps to enable MIG and deploy MIG Manager.
Install the Operator:
$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v26.3.0 \
    --set mig.strategy=single
This example sets single as the MIG strategy. Available MIG strategy options:
- single: MIG mode is enabled on all GPUs on a node.
- mixed: MIG mode is not necessarily enabled on all GPUs on a node, and the GPUs can expose different MIG profiles.
In a cloud service provider (CSP) environment such as Google Cloud, also specify --set migManager.env[0].name=WITH_REBOOT --set-string migManager.env[0].value=true to ensure that the node reboots and can apply the MIG configuration.
MIG Manager supports preinstalled drivers, meaning drivers that are not managed by the GPU Operator and that you installed directly on the host. If drivers are preinstalled, also specify --set driver.enabled=false. Refer to MIG Manager with Preinstalled Drivers for more details.
After several minutes, all GPU Operator pods, including nvidia-mig-manager, are deployed on nodes that have MIG-capable GPUs.
Note
MIG Manager requires that no user workloads are running on the GPUs being configured. In some cases, such as in a CSP environment, the node might need to be rebooted, so consider cordoning the node before changing the MIG mode or the MIG geometry on the GPUs.
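For example, you can cordon and drain a node with standard kubectl commands before changing the label. This is a minimal sketch, where <node-name> is a placeholder for your node:
$ kubectl cordon <node-name>
$ kubectl drain <node-name> --ignore-daemonsets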
Optional: Display the pods in the Operator namespace:
$ kubectl get pods -n gpu-operator
Example Output
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-qmwb2                                   1/1     Running     0          14m
gpu-operator-7bbf8bb6b7-xz664                                 1/1     Running     0          14m
gpu-operator-node-feature-discovery-gc-79d6d968bb-sg4t6       1/1     Running     0          14m
gpu-operator-node-feature-discovery-master-6d9f8d497c-7cwrp   1/1     Running     0          14m
gpu-operator-node-feature-discovery-worker-x5z62              1/1     Running     0          14m
nvidia-container-toolkit-daemonset-pkcpr                      1/1     Running     0          14m
nvidia-cuda-validator-wt6bc                                   0/1     Completed   0          12m
nvidia-dcgm-exporter-zsskv                                    1/1     Running     0          14m
nvidia-device-plugin-daemonset-924x6                          1/1     Running     0          14m
nvidia-driver-daemonset-klj5s                                 1/1     Running     0          14m
nvidia-mig-manager-8d6wz                                      1/1     Running     0          12m
nvidia-operator-validator-fnsmk                               1/1     Running     0          14m
Optional: Display the labels applied to the node:
$ kubectl get node -o json | jq '.items[].metadata.labels'
Partial Output
"nvidia.com/gpu.present": "true", "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3", "nvidia.com/gpu.replicas": "1", "nvidia.com/gpu.sharing-strategy": "none", "nvidia.com/mig.capable": "true", "nvidia.com/mig.config": "all-disabled", "nvidia.com/mig.config.state": "success", "nvidia.com/mig.strategy": "single", "nvidia.com/mps.capable": "false" }
Configuring MIG Profiles#
When MIG is enabled, nodes are labeled with nvidia.com/mig.config: all-disabled by default.
To use a profile on a node, update the label value with the desired profile, for example, nvidia.com/mig.config=all-1g.10gb.
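For example, the following command applies the all-1g.10gb profile to a node, where <node-name> is the name of the node to configure:
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite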
Introduced in GPU Operator v26.3.0, MIG Manager generates the MIG configuration for a node at runtime from the available hardware.
On startup, MIG Manager discovers the MIG profiles for each MIG-capable GPU on a node by using the NVIDIA Management Library (NVML) and then writes the configuration to a ConfigMap for each MIG-capable node in your cluster.
The ConfigMap is named <node-name>-mig-config, where <node-name> is the name of each MIG-capable node.
Each ConfigMap contains a complete mig-parted config, including all-disabled, all-enabled, per-profile configs such as all-1g.10gb, and all-balanced with device-filter support for mixed GPU types.
When a new MIG-capable GPU is added to a node, the new GPU is automatically added to the ConfigMap.
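You can inspect the generated configuration for a node by reading the ConfigMap directly. This sketch assumes the ConfigMap is created in the Operator namespace, gpu-operator:
$ kubectl get configmap -n gpu-operator <node-name>-mig-config -o yaml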
If you need custom profiles, you can use a custom MIG configuration instead of the generated one. You can use the Helm chart to create a ConfigMap from values at install time, or create and reference your own ConfigMap. For an example, refer to Example: Custom MIG Configuration During Installation.
Note
Generated MIG configurations might not be available with older drivers, such as the 535 branch GPU drivers, because those drivers do not support querying MIG profiles while MIG mode is disabled. In those cases, the GPU Operator uses a static ConfigMap, default-mig-parted-config, for MIG profiles.
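To review the static profiles in that case, you can read the ConfigMap, again assuming the Operator namespace of gpu-operator:
$ kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml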
Example: Single MIG Strategy#
The following steps show how to use the single MIG strategy and configure the 1g.10gb profile on one node.
Configure the MIG strategy to single if you are unsure of the current strategy:
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]'
Label the nodes with the profile to configure:
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite
MIG Manager proceeds to apply a mig.config.state label to the node and terminates all the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry.
Optional: Display the node labels:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | jq .
Partial Output
"nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3", "nvidia.com/gpu.replicas": "1", "nvidia.com/gpu.sharing-strategy": "none", "nvidia.com/mig.capable": "true", "nvidia.com/mig.config": "all-1g.10gb", "nvidia.com/mig.config.state": "pending", "nvidia.com/mig.strategy": "single" }
When the WITH_REBOOT option is set, MIG Manager sets the label to nvidia.com/mig.config.state: rebooting.
Confirm that MIG Manager completed the configuration by checking the node labels:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | jq .
Check for the following labels:
- nvidia.com/gpu.count: 7 (the value differs according to the GPU model)
- nvidia.com/gpu.slices.ci: 1
- nvidia.com/gpu.slices.gi: 1
- nvidia.com/mig.config.state: success
Partial Output
"nvidia.com/gpu.count": "7", "nvidia.com/gpu.present": "true", "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-1g.10gb", "nvidia.com/gpu.slices.ci": "1", "nvidia.com/gpu.slices.gi": "1", "nvidia.com/mig.capable": "true", "nvidia.com/mig.config": "all-1g.10gb", "nvidia.com/mig.config.state": "success", "nvidia.com/mig.strategy": "single"
Optional: Run the nvidia-smi command in the driver container to verify that the MIG configuration has been applied:
$ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L
Example Output
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-b4895dbf-9350-2524-a89b-98161ddd9fe4)
  MIG 1g.10gb     Device  0: (UUID: MIG-3f6f389f-b0cc-5e5c-8e32-eaa8fd067902)
  MIG 1g.10gb     Device  1: (UUID: MIG-35f93699-4b53-5a19-8289-80b8418eec60)
  MIG 1g.10gb     Device  2: (UUID: MIG-9d14fb21-4ae1-546f-a636-011582899c39)
  MIG 1g.10gb     Device  3: (UUID: MIG-0f709664-740c-52b0-ae79-3e4c9ede6d3b)
  MIG 1g.10gb     Device  4: (UUID: MIG-5d23f73a-d378-50ac-a6f5-3bf5184773bb)
  MIG 1g.10gb     Device  5: (UUID: MIG-6cea15c7-8a56-578c-b965-0e73cb6dfc10)
  MIG 1g.10gb     Device  6: (UUID: MIG-981c86e9-3607-57d7-9426-295347e4b925)
Example: Mixed MIG Strategy#
The following steps show how to use the mixed MIG strategy and configure the all-balanced profile on one node.
Configure the MIG strategy to mixed if you are unsure of the current strategy:
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'
Label the nodes with the profile to configure:
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-balanced --overwrite
MIG Manager proceeds to apply a mig.config.state label to the node and terminates all the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry.
Confirm that MIG Manager completed the configuration by checking the node labels:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | jq .
Check for labels like the following. The profiles and GPU counts differ according to the GPU model.
- nvidia.com/mig-1g.10gb.count: 2
- nvidia.com/mig-2g.20gb.count: 1
- nvidia.com/mig-3g.40gb.count: 1
- nvidia.com/mig.config.state: success
Partial Output
"nvidia.com/gpu.present": "true", "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3", "nvidia.com/gpu.replicas": "0", "nvidia.com/gpu.sharing-strategy": "none", "nvidia.com/mig-1g.10gb.count": "2", "nvidia.com/mig-1g.10gb.engines.copy": "1", "nvidia.com/mig-1g.10gb.engines.decoder": "1", "nvidia.com/mig-1g.10gb.engines.encoder": "0", "nvidia.com/mig-1g.10gb.engines.jpeg": "1", "nvidia.com/mig-1g.10gb.engines.ofa": "0", "nvidia.com/mig-1g.10gb.memory": "9984", "nvidia.com/mig-1g.10gb.multiprocessors": "16", "nvidia.com/mig-1g.10gb.product": "NVIDIA-H100-80GB-HBM3-MIG-1g.10gb", "nvidia.com/mig-1g.10gb.replicas": "1", "nvidia.com/mig-1g.10gb.sharing-strategy": "none", "nvidia.com/mig-1g.10gb.slices.ci": "1", "nvidia.com/mig-1g.10gb.slices.gi": "1", "nvidia.com/mig-2g.20gb.count": "1", "nvidia.com/mig-2g.20gb.engines.copy": "2", "nvidia.com/mig-2g.20gb.engines.decoder": "2", "nvidia.com/mig-2g.20gb.engines.encoder": "0", "nvidia.com/mig-2g.20gb.engines.jpeg": "2", "nvidia.com/mig-2g.20gb.engines.ofa": "0", "nvidia.com/mig-2g.20gb.memory": "20096", "nvidia.com/mig-2g.20gb.multiprocessors": "32", "nvidia.com/mig-2g.20gb.product": "NVIDIA-H100-80GB-HBM3-MIG-2g.20gb", "nvidia.com/mig-2g.20gb.replicas": "1", "nvidia.com/mig-2g.20gb.sharing-strategy": "none", "nvidia.com/mig-2g.20gb.slices.ci": "2", "nvidia.com/mig-2g.20gb.slices.gi": "2", "nvidia.com/mig-3g.40gb.count": "1", "nvidia.com/mig-3g.40gb.engines.copy": "3", "nvidia.com/mig-3g.40gb.engines.decoder": "3", "nvidia.com/mig-3g.40gb.engines.encoder": "0", "nvidia.com/mig-3g.40gb.engines.jpeg": "3", "nvidia.com/mig-3g.40gb.engines.ofa": "0", "nvidia.com/mig-3g.40gb.memory": "40320", "nvidia.com/mig-3g.40gb.multiprocessors": "60", "nvidia.com/mig-3g.40gb.product": "NVIDIA-H100-80GB-HBM3-MIG-3g.40gb", "nvidia.com/mig-3g.40gb.replicas": "1", "nvidia.com/mig-3g.40gb.sharing-strategy": "none", "nvidia.com/mig-3g.40gb.slices.ci": "3", "nvidia.com/mig-3g.40gb.slices.gi": "3", "nvidia.com/mig.capable": "true", "nvidia.com/mig.config": "all-balanced", "nvidia.com/mig.config.state": "success", "nvidia.com/mig.strategy": "mixed", "nvidia.com/mps.capable": "false" }
Optional: Run the nvidia-smi command in the driver container to verify that the GPU has been configured:
$ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L
Example Output
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-b4895dbf-9350-2524-a89b-98161ddd9fe4)
  MIG 3g.40gb     Device  0: (UUID: MIG-7089d0f3-293f-58c9-8f8c-5ea666eedbde)
  MIG 2g.20gb     Device  1: (UUID: MIG-56c30729-347f-5dd6-8da0-c3cc59e969e0)
  MIG 1g.10gb     Device  2: (UUID: MIG-9d14fb21-4ae1-546f-a636-011582899c39)
  MIG 1g.10gb     Device  3: (UUID: MIG-0f709664-740c-52b0-ae79-3e4c9ede6d3b)
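With the mixed strategy, each MIG profile is advertised as its own extended resource, as the nvidia.com/mig-1g.10gb.count labels above indicate. The following pod spec is a minimal sketch that requests one 1g.10gb device; the pod name is illustrative, and the image is the same sample used in the verification section later in this page:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-mig
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        # Request one MIG device of the 1g.10gb profile.
        nvidia.com/mig-1g.10gb: 1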
Example: Reconfiguring MIG Profiles#
MIG Manager supports dynamic reconfiguration of the MIG geometry.
The following steps show how to update a GPU on a node to the 3g.40gb profile with the single MIG strategy.
Label the node with the profile:
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-3g.40gb --overwrite
Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied:
$ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager
Example Output
Applying the selected MIG config to the node
time="2024-05-14T18:31:26Z" level=debug msg="Parsing config file..."
time="2024-05-14T18:31:26Z" level=debug msg="Selecting specific MIG config..."
time="2024-05-14T18:31:26Z" level=debug msg="Running apply-start hook"
time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG mode..."
time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-14T18:31:26Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-14T18:31:26Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2024-05-14T18:31:26Z" level=debug msg="    MIG capable: true\n"
time="2024-05-14T18:31:26Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG device configuration..."
time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-14T18:31:26Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-14T18:31:26Z" level=debug msg="    Asserting MIG config: map[3g.40gb:2]"
time="2024-05-14T18:31:26Z" level=debug msg="Running pre-apply-config hook"
time="2024-05-14T18:31:26Z" level=debug msg="Applying MIG device configuration..."
time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-14T18:31:26Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-14T18:31:26Z" level=debug msg="    MIG capable: true\n"
time="2024-05-14T18:31:26Z" level=debug msg="    Updating MIG config: map[3g.40gb:2]"
MIG configuration applied successfully
time="2024-05-14T18:31:27Z" level=debug msg="Running apply-exit hook"
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-kmncw" deleted
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/node-name labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'
Optional: Display the node labels to confirm that the GPU count (2), slices (3), and profile are set:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | jq .
Partial Output
"nvidia.com/gpu.count": "2", "nvidia.com/gpu.present": "true", "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-3g.40gb", "nvidia.com/gpu.replicas": "1", "nvidia.com/gpu.sharing-strategy": "none", "nvidia.com/gpu.slices.ci": "3", "nvidia.com/gpu.slices.gi": "3", "nvidia.com/mig.capable": "true", "nvidia.com/mig.config": "all-3g.40gb", "nvidia.com/mig.config.state": "success", "nvidia.com/mig.strategy": "single", "nvidia.com/mps.capable": "false" }
Example: Custom MIG Configuration During Installation#
If you need to use custom profiles, you can create a custom ConfigMap during installation by passing in a name and data for the ConfigMap with the Helm command.
The MIG Manager daemonset is configured to use this ConfigMap instead of the auto-generated one.
In your values.yaml file, set migManager.config.create to true, set migManager.config.name, and add the ConfigMap data under migManager.config.data, like the following example:
migManager:
  config:
    name: custom-mig-config
    create: true
    data:
      config.yaml: |-
        version: v1
        mig-configs:
          all-disabled:
            - devices: all
              mig-enabled: false
          custom-mig:
            - devices: [0]
              mig-enabled: false
            - devices: [1]
              mig-enabled: true
              mig-devices:
                "1g.10gb": 2
            - devices: [2]
              mig-enabled: true
              mig-devices:
                "2g.20gb": 2
                "3g.40gb": 1
            - devices: [3]
              mig-enabled: true
              mig-devices:
                "3g.40gb": 1
                "4g.40gb": 1
Note
Custom ConfigMaps must contain a key named “config.yaml”.
Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap:
$ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator --version=v26.3.0 \
    -f values.yaml
If the custom configuration specifies more than one instance profile, set the strategy to mixed:
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'
Label the nodes with the profile to configure:
$ kubectl label nodes <node-name> nvidia.com/mig.config=custom-mig --overwrite
Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied:
$ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager
Example Output
Applying the selected MIG config to the node
time="2024-05-15T13:40:08Z" level=debug msg="Parsing config file..."
time="2024-05-15T13:40:08Z" level=debug msg="Selecting specific MIG config..."
time="2024-05-15T13:40:08Z" level=debug msg="Running apply-start hook"
time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG mode..."
time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-15T13:40:08Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-15T13:40:08Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2024-05-15T13:40:08Z" level=debug msg="    MIG capable: true\n"
time="2024-05-15T13:40:08Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG device configuration..."
time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-15T13:40:08Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-15T13:40:08Z" level=debug msg="    Asserting MIG config: map[1g.10gb:5 2g.20gb:1]"
time="2024-05-15T13:40:08Z" level=debug msg="Running pre-apply-config hook"
time="2024-05-15T13:40:08Z" level=debug msg="Applying MIG device configuration..."
time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-05-15T13:40:08Z" level=debug msg="  GPU 0: 0x233010DE"
time="2024-05-15T13:40:08Z" level=debug msg="    MIG capable: true\n"
time="2024-05-15T13:40:08Z" level=debug msg="    Updating MIG config: map[1g.10gb:5 2g.20gb:1]"
time="2024-05-15T13:40:09Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Example: Custom MIG Configuration#
You can create and apply a ConfigMap yourself if the default profiles do not meet your needs.
Create a file, such as custom-mig-config.yaml, with contents like the following example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      five-1g-one-2g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 5
            "2g.20gb": 1
Note
Custom ConfigMaps must contain a key named “config.yaml”.
Apply the manifest:
$ kubectl apply -n gpu-operator -f custom-mig-config.yaml
If the custom configuration specifies more than one instance profile, set the strategy to mixed:
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]'
Patch the cluster policy so MIG Manager uses the custom ConfigMap:
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'
Label the nodes with the profile to configure:
$ kubectl label nodes <node-name> nvidia.com/mig.config=five-1g-one-2g --overwrite
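Optional: After MIG Manager reports success, display the MIG-related node labels to confirm the custom geometry. This is a sketch; the jq filter is illustrative:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | \
    jq 'with_entries(select(.key | startswith("nvidia.com/mig")))'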
Verification: Running Sample CUDA Workloads#
CUDA VectorAdd#
Let’s run a simple CUDA sample, in this case vectorAdd, by requesting a GPU resource as you would
normally do in Kubernetes. Kubernetes schedules the pod on a single MIG device, and we use a
nodeSelector to ensure that the pod is scheduled on a node with MIG devices.
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
name: cuda-vectoradd
spec:
restartPolicy: OnFailure
containers:
- name: vectoradd
image: nvidia/samples:vectoradd-cuda11.2.1
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
EOF
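Optional: Verify that the sample ran to completion by checking the pod logs; the final lines should report Test PASSED:
$ kubectl logs pod/cuda-vectoradd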
Concurrent Job Launch#
Now, let’s try a more complex example that uses Argo Workflows to launch concurrent
jobs on MIG devices. In this example, the A100 has been configured into two MIG devices using the 3g.20gb profile.
First, install the Argo Workflows components into your Kubernetes cluster.
$ kubectl create ns argo \
&& kubectl apply -n argo \
-f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml
Next, download the latest Argo CLI from the releases page and follow the instructions to install the binary.
Now, we will craft an Argo example that launches multiple CUDA containers onto the MIG devices on the GPU.
We will reuse the same vectorAdd example from before. Here is the job description, saved as vector-add.yaml:
$ cat << EOF > vector-add.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: argo-mig-example-
spec:
entrypoint: argo-mig-result-example
templates:
- name: argo-mig-result-example
steps:
- - name: generate
template: gen-mig-device-list
# Iterate over the list of numbers generated by the generate step above
- - name: argo-mig
template: argo-mig
arguments:
parameters:
- name: argo-mig
value: "{{item}}"
withParam: "{{steps.generate.outputs.result}}"
# Generate a list of numbers in JSON format
- name: gen-mig-device-list
script:
image: python:alpine3.6
command: [python]
source: |
import json
import sys
json.dump([i for i in range(0, 2)], sys.stdout)
- name: argo-mig
retryStrategy:
limit: 10
retryPolicy: "Always"
inputs:
parameters:
- name: argo-mig
container:
image: nvidia/samples:vectoradd-cuda11.2.1
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb
EOF
Launch the workflow:
$ argo submit -n argo --watch vector-add.yaml
Argo will print out the pods that have been launched:
Name: argo-mig-example-z6mqd
Namespace: argo
ServiceAccount: default
Status: Succeeded
Conditions:
Completed True
Created: Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Started: Wed Mar 24 14:44:51 -0700 (20 seconds ago)
Finished: Wed Mar 24 14:45:11 -0700 (now)
Duration: 20 seconds
Progress: 3/3
ResourcesDuration: 9s*(1 cpu),9s*(100Mi memory),1s*(1 nvidia.com/gpu)
STEP TEMPLATE PODNAME DURATION MESSAGE
✔ argo-mig-example-z6mqd argo-mig-result-example
├───✔ generate gen-mig-device-list argo-mig-example-z6mqd-562792713 8s
└─┬─✔ argo-mig(0:0)(0) argo-mig argo-mig-example-z6mqd-845918106 2s
└─✔ argo-mig(1:1)(0) argo-mig argo-mig-example-z6mqd-870679174 2s
If you observe the logs, you can see that the vector-add sample has completed on both devices:
$ argo logs -n argo @latest
argo-mig-example-z6mqd-562792713: [0, 1]
argo-mig-example-z6mqd-870679174: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-870679174: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-870679174: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-870679174: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-870679174: Test PASSED
argo-mig-example-z6mqd-870679174: Done
argo-mig-example-z6mqd-845918106: [Vector addition of 50000 elements]
argo-mig-example-z6mqd-845918106: Copy input data from the host memory to the CUDA device
argo-mig-example-z6mqd-845918106: CUDA kernel launch with 196 blocks of 256 threads
argo-mig-example-z6mqd-845918106: Copy output data from the CUDA device to the host memory
argo-mig-example-z6mqd-845918106: Test PASSED
argo-mig-example-z6mqd-845918106: Done
Disabling MIG#
You can disable MIG on a node by setting the nvidia.com/mig.config label to all-disabled:
$ kubectl label nodes <node-name> nvidia.com/mig.config=all-disabled --overwrite
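Optional: Confirm that MIG is disabled by checking the node labels, following the same pattern as the earlier examples:
$ kubectl get node <node-name> -o=jsonpath='{.metadata.labels}' | jq .
Check for nvidia.com/mig.config: all-disabled and nvidia.com/mig.config.state: success.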
MIG Manager with Preinstalled Drivers#
MIG Manager supports preinstalled drivers. The information in the preceding sections still applies; however, there are a few additional details to consider.
Install#
During GPU Operator installation, driver.enabled=false must be set. The following options
can be used to install the GPU Operator:
$ helm install gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v26.3.0 \
--set driver.enabled=false
Managing Host GPU Clients#
MIG Manager stops all operator-managed pods that have access to GPUs when applying a MIG reconfiguration. When drivers are preinstalled, there can be GPU clients on the host that also need to be stopped.
When drivers are preinstalled, MIG Manager attempts to stop and restart a list of systemd services on the host across a MIG reconfiguration.
The list of services is specified in the default-gpu-clients ConfigMap.
The following sample GPU clients file, clients.yaml, is used to create the default-gpu-clients ConfigMap:
version: v1
systemd-services:
- nvsm.service
- nvsm-mqtt.service
- nvsm-core.service
- nvsm-api-gateway.service
- nvsm-notifier.service
- nv_peer_mem.service
- nvidia-dcgm.service
- dcgm.service
- dcgm-exporter.service
You can modify the list by editing the ConfigMap after installation, as shown in the following example. Alternatively, you can create a custom ConfigMap for use by MIG Manager by performing the steps after the example.
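For example, to edit the default list in place, assuming the default Operator namespace of gpu-operator:
$ kubectl edit configmap -n gpu-operator default-gpu-clients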
Create the gpu-operator namespace:
$ kubectl create namespace gpu-operator
Create a ConfigMap containing the custom clients.yaml file with a list of GPU clients:
$ kubectl create configmap -n gpu-operator gpu-clients --from-file=clients.yaml
Install the GPU Operator:
$ helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v26.3.0 \
    --set migManager.gpuClientsConfig.name=gpu-clients \
    --set driver.enabled=false
Architecture#
MIG Manager is designed as a controller within Kubernetes. It watches for changes to the
nvidia.com/mig.config label on the node and then applies the user-requested MIG configuration.
When the label changes, MIG Manager first stops all GPU pods, including the device plugin, GPU Feature Discovery,
and DCGM Exporter.
MIG Manager then stops all host GPU clients listed in the clients.yaml ConfigMap if drivers are preinstalled.
Finally, it applies the MIG reconfiguration and restarts the GPU pods and, when applicable, the host GPU clients.
The MIG reconfiguration can also involve rebooting a node if a reboot is required to enable MIG mode.
The default MIG profiles are specified in the <node-name>-mig-config ConfigMap.
This ConfigMap is auto-generated by the MIG Manager for each MIG-capable node and contains the standard MIG profiles for the available GPUs on the node.
You can also configure the Operator to use a custom ConfigMap instead of the auto-generated one.
You can apply one of these profiles to the nvidia.com/mig.config node label to trigger a reconfiguration of the MIG geometry.
MIG Manager uses the mig-parted tool to apply the configuration changes to the GPU, including enabling MIG mode, and reboots the node when a reboot is required.
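To observe a reconfiguration as it progresses, you can watch the relevant node labels. This is a minimal sketch using the label-column flag of kubectl get:
$ kubectl get node <node-name> --watch \
    -L nvidia.com/mig.config \
    -L nvidia.com/mig.config.state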