MIG Support in OpenShift Container Platform

Introduction

NVIDIA Mult-Instance GPU (MIG) is useful anytime you have an application that does not require the full power of an entire GPU. The new NVIDIA Ampere architecture’s MIG feature allows you to split your hardware resources into multiple GPU instances, each exposed to the operating system as an independent CUDA-enabled GPU. The NVIDIA GPU Operator version 1.7.0 and above provides MIG feature support for the A100 and A30 Ampere cards. These GPU instances are designed to support multiple independent CUDA applications (up to 7), so they operate completely isolated from each other using dedicated hardware resources.

The compute units of the GPU, in addition to its memory, can be partitioned into multiple MIG instances. Each of these instances presents as a stand-alone GPU device from the system perspective and can be bound to any application, container, or virtual machine running on the node.

From the perspective of the software consuming the GPU each of these MIG instances looks like its own individual GPU.

MIG geometry

The NVIDIA GPU Operator version 1.7.0 and above enables OpenShift Container Platform administrators to dynamically reconfigure the geometry of the MIG partitioning. The geometry of the MIG partitioning is how hardware resources are bound to MIG instances, so it directly influences their performance and the number of instances that can be allocated. The A100-40GB, for example, has eight compute units and 40 GB of RAM. When the MIG mode is enabled, the eighth instance is reserved for resource management.

The table below provides a summary of the MIG instance properties of the NVIDIA A100-40GB product:

Profile

Memory

Compute Units

Maximum number of homogeneous instances

1g.5gb

5 GB

1

7

2g.10gb

10 GB

2

3

3g.20gb

20 GB

3

2

4g.20gb

20 GB

4

1

7g.40gb

40 GB

7

1

In addition to homogeneous instances, some heterogeneous combinations can be chosen. See the Multi-Instance GPU User Guide documentation for an exhaustive listing.

Here is an example, again for the A100-40GB, with heterogeneous (or “mixed”) geometries:

  • 2x 1g.5gb

  • 1x 2g.10gb

  • 1x 3g.10gb

Prerequisites

The deployment workflow requires these prerequisites.

  1. You already have a OpenShift Container Platform cluster up and running with access to at least one MIG-capable GPU.

  2. You have followed the guidance in Installation and Upgrade Overview proceeding as far as creating the cluster policy <create-cluster-policy>.

Note

The node must be free (drained) of GPU workloads before any reconfiguration is triggered. For guidance on draining a node see, the OpenShift Container Platform documentation Understanding how to evacuate pods on nodes.

Configuring MIG Devices in OpenShift

MIG advertisement strategies

The NVIDIA GPU Operator exposes GPUs to Kubernetes as extended resources that can be requested and exposed into Pods and containers. The first step of the MIG configuration is to decide what Strategy you want. The advertisement strategies are described here:

  • Single defines a homogeneous advertisement strategy, with MIG instances exposed as usual GPUs. This strategy exposes the MIG instances as nvidia.com/gpu resources, identically, as usual non-MIG capable (or with MIG disabled) devices. In this strategy, all the GPUs in a single node must be configured in a homogenous manner (same number of compute units, same memory size). This strategy is best for a large cluster where the infrastructure teams can configure “node pools” of different MIG geometries and make them available to users. Another advantage of this strategy is backward compatibility where the existing application does not have to be modified to be scheduled this way.

    Examples for the A100-40GB:

    • 1g.5gb: 7 nvidia.com/gpu instances, or

    • 2g.10gb: 3 nvidia.com/gpu instances, or

    • 3g.20gb: 2 nvidia.com/gpuinstances, or

    • 7g.40gb: 1 nvidia.com/gpu instances

      _images/Mig-profile-A100.png
  • Mixed defines a heterogeneous advertisement strategy. There is no constraint on the geometry; all the combinations allowed by the GPU are permitted. This strategy is appropriate for a smaller cluster, where on a single node with multiple GPUs, each GPU can be configured in a different MIG geometry.

    Examples for the A100-40GB:

    • All the single configurations are possible

    • A “balanced” configuration:

      • 1g.5gb: 2 nvidia.com/mig-1g.5gb instances, and

      • 2g.10gb: 1 nvidia.com/mig-2g.10gb instance, and

      • 3g.20gb: 1 nvidia.com/mig-3g.20gb instance

      _images/mig-mixed-profile-A100.png

Version 1.8 and greater of the NVIDIA GPU Operator supports updating the Strategy in the ClusterPolicy after deployment.

The default configmap defines the combination of single (homogeneous) and mixed (heterogeneous) profiles that are supported for A100-40GB, A100-80GB and A30-24GB. The configmap allows administrators to declaratively define a set of possible MIG configurations they would like applied to all GPUs on a node. The tables below describe these configurations:

Single configuration

GPU Type

Custom label

Profile

MIG instances

A100-40GB

all-1g.5gb

1g.5gb

7

all-2g.10gb

2g.10gb

3

all-3g.20gb

3g.20gb

2

all-7g.40gb

7g.40gb

1

A100-80GB

all-1g.10gb

1g.10gb

7

all-2g.20gb

2g.20gb

3

all-3g.40gb

3g.40gb

2

all-7g.80gb

7g.80gb

1

A30-24GB

all-1g.6gb

1g.6gb

4

all-2g.12gb

2g.12gb

2

all-4g.24gb

4g.24gb

1

All-balanced is composed of 3 distinct configurations, with a device-filter filtering, based on the device UID. The possible supported combinations are described below:

Balanced configuration

GPU Type

Custom label

Profile and MIG instances

A100-40GB

all-balanced

1g.5gb: 2

2g.10gb:1

3g.20gb:1

A100-80GB

all-balanced

1g.10gb:2

2g.20gb:1

3g.40gb:1

A30-24GB

all-balanced

1g.6gb: 2

2g.12gb:1

Set the MIG advertisement strategy and apply the MIG partitioning

Having decided on your advertisement strategy you need to set this by editing the default cluster policy and then apply the MIG partitioning profile.

For example to set the advertisement strategy to mixed and the MIG partitioning profile to 3x 2g.10gb MIG devices follow the step below:

  1. In the OpenShift Container Platform CLI run the following:

    $ STRATEGY=mixed && \
      oc patch clusterpolicy/gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": '$STRATEGY'}]'
    

    Note

    This may take a while so be patient and wait at least 10-20 minutes before digging deeper into any form of troubleshooting.

  2. In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, then click the NVIDIA GPU Operator.

  3. Select the ClusterPolicy tab. The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator displays State:ready once the installation succeeded.

    _images/cluster_policy_suceed.png
  4. Apply the desired MIG partitioning profile. To configure 3x 2g.10gb MIG devices run the following:

    $ MIG_CONFIGURATION=all-2g.10gb && \
      oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
    
  5. Wait for the mig-manager to perform the reconfiguration:

    $ oc -n nvidia-gpu-operator logs ds/nvidia-mig-manager --all-containers -f --prefix
    

    The status of the reconfiguration should change from success → pending → success.

  6. Verify the new configuration is applied:

    $ oc get pods -n nvidia-gpu-operator -lapp=nvidia-driver-daemonset -owide
    

    Select the name of the Pod on the MIG GPU enabled node and run the following:

    $ oc rsh -n nvidia-gpu-operator $POD_NAME nvidia-smi mig -lgi
    
    +----------------------------------------------------+
    | GPU instances:                                     |
    | GPU   Name          Profile  Instance   Placement  |
    |                       ID       ID       Start:Size |
    |====================================================|
    |   0  MIG 2g.10gb       19        3          4:2    |
    +----------------------------------------------------+
    |   0  MIG 2g.10gb       19        5          0:2    |
    +----------------------------------------------------+
    |   0  MIG 2g.10gb       19        6          2:2    |
    +----------------------------------------------------+
    

    With the profile in step 4 applied the A100 is configured into 3 MIG devices.

  7. Check the node has been labeled:

    $ oc get nodes/$NODE_NAME --show-labels | tr ',' '\n' | grep nvidia.com
    

    with labels:

    nvidia.com/gpu.present=true
    nvidia.com/cuda.driver.major=470
    nvidia.com/cuda.driver.minor=57
    nvidia.com/cuda.driver.rev=02
    nvidia.com/cuda.runtime.major=11
    nvidia.com/cuda.runtime.minor=4
    nvidia.com/gpu.compute.major=8
    nvidia.com/gpu.compute.minor=0
    nvidia.com/gpu.count=1
    nvidia.com/gpu.family=ampere
    nvidia.com/gpu.machine=...
    nvidia.com/gpu.memory=40536
    nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
    nvidia.com/mig-2g.10gb.count=3
    nvidia.com/mig-2g.10gb.engines.copy=2
    nvidia.com/mig-2g.10gb.engines.decoder=1
    nvidia.com/mig-2g.10gb.engines.encoder=0
    nvidia.com/mig-2g.10gb.engines.jpeg=0
    nvidia.com/mig-2g.10gb.engines.ofa=0
    nvidia.com/mig-2g.10gb.memory=9984
    nvidia.com/mig-2g.10gb.multiprocessors=28
    nvidia.com/mig-2g.10gb.slices.ci=2
    nvidia.com/mig-2g.10gb.slices.gi=2
    nvidia.com/mig.config.state=success
    nvidia.com/mig.config=all-2g.10gb
    nvidia.com/mig.strategy=mixed
    [...]
    

    Note

    The extract above shows the strategy is set to mixed with the MIG configuration set to all-2g.10gb.

  8. Verify that the MIG instances are exposed:

    $ oc get node/$NODE_NAME -ojsonpath={.status.allocatable} | jq . | grep nvidia
    
    "nvidia.com/mig-2g.10gb": "3",
    

    Note

    You can ignore values set to 0.

Creating and applying a custom MIG configuration

Follow the guidance below to create a new slicing profile.

  1. Prepare a custom configmap resource file for example custom_configmap.yaml. Use the configmap as guidance to help you build that custom configuration. For more documentation about the file format see mig-parted.

    Note

    For a list of all supported combinations and placements of profiles on A100 and A30, refer to the section on supported profiles.

  2. Create the custom configuration within the nvidia-gpu-operator namespace:

    $ CONFIG_FILE=/path/to/custom_configmap.yaml && \
      oc create configmap custom-mig-parted-config \
         --from-file=config.yaml=$CONFIG_FILE \
         -n nvidia-gpu-operator
    
  3. Edit the cluster policy and enter the name of the config map in the field spec.migManager.config.name:

    $ oc edit clusterpolicy
      spec:
        migManager:
          config:
            name: custom-mig-parted-config
    
  4. Label the node with this newly created profile following the guidance in Set the MIG advertisement strategy and apply the MIG partitioning.

Example: Mixed MIG strategy

Introduction and default MIG configuration

For each MIG configuration, you specify a strategy and a MIG configuration label.

This example shows how to configure a mixed strategy with the all-balanced configuration on one NVIDIA DGX H100 host with 8 x H100 80GB GPUs. The DGX H100 host runs a single node installation of OpenShift.

By default, MIG is disabled and is configured with the single strategy:

$ oc describe node | grep nvidia.com/mig

Example Output

nvidia.com/mig.capable=true
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single

With the default configuration, the host supports up to 8 pods with GPUs:

$ oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu|Allocatable:|Requests +Limits"

Example Output

Name:               myworker.redhat.com
Roles:              control-plane,master,worker
Capacity:
nvidia.com/gpu:     8
Allocatable:
nvidia.com/gpu:     8
Resource           Requests      Limits
nvidia.com/gpu     0             0

Procedure

The following steps show how to apply the mixed strategy with the MIG configuration label all-balanced.

With this strategy and label, each H100 GPU enables these MIG profiles:

  • 2 x 1g.10gb

  • 1 x 2g.20gb

  • 1 x 3g.40g

For the NVIDIA DGX H100 that has 8 H100 GPUs, performing the steps results in the following GPU capacity on the cluster:

  • 16 x 1g.10gb (8 x 2)

  • 8 x 2g.20gb (8 x 1)

  • 8 x 3g.40gb (8 x 1)

  1. Specify the host name, strategy, and configuration label in environment variables:

    $ NODE_NAME=myworker.redhat.com
    $ STRATEGY=mixed
    $ MIG_CONFIGURATION=all-balanced
    
  2. Apply the strategy:

    $ oc patch clusterpolicy/gpu-cluster-policy --type='json' \
        -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": '$STRATEGY'}]'
    
  3. Label the node with the configuration label:

    $ oc label node $NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
    

    MIG manager applies a mig.config.state label to the GPU and then terminates all the GPU pods in preparation to enable MIG mode and configure the GPU into the specified configuration.

  4. Optional: Verify that MIG manager configured the GPUs:

    $ oc describe node | grep nvidia.com/mig.config
    

    Example Output

    nvidia.com/mig.config=all-balanced
    nvidia.com/mig.config.state=success
    
  5. Confirm that the GPU resources are available:

    $ oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
    

    The following sample output shows the expected 32 GPU resources:

    • 16 x 1g.10gb

    • 8 x 1g.10gb

    • 8 x 3g.40gb

    Name:               myworker.redhat.com
    Roles:              control-plane,master,worker
    Capacity:
    nvidia.com/gpu:          0
    nvidia.com/mig-1g.10gb:  16
    nvidia.com/mig-2g.20gb:  8
    nvidia.com/mig-3g.40gb:  8
    Allocatable:
    nvidia.com/gpu:          0
    nvidia.com/mig-1g.10gb:  16
    nvidia.com/mig-2g.20gb:  8
    nvidia.com/mig-3g.40gb:  8
    Resource                Requests      Limits
    nvidia.com/mig-1g.10gb  0             0
    nvidia.com/mig-2g.20gb  0             0
    nvidia.com/mig-3g.40gb  0             0
    
  6. Optional: Start a pod to run the nvidia-smi command and display the GPU resources.

    1. Start the pod:

      $ cat <<EOF | oc apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: command-nvidia-smi
      spec:
        restartPolicy: Never
        containers:
        - name: cuda-container
          image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
          command: ["/bin/sh","-c"]
          args: ["nvidia-smi"]
      EOF
      
    2. Confirm the pod ran successfully:

      $ oc get pods
      

      Example Output

      NAME                 READY   STATUS      RESTARTS   AGE
      command-nvidia-smi   0/1     Completed   0          3m34s
      
    3. Confirm that the nvidia-smi output includes 32 MIG devices:

      $ oc logs command-nvidia-smi
      

      Example Output

      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                   On |
      | N/A   25C    P0              71W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                   On |
      | N/A   26C    P0              70W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                   On |
      | N/A   31C    P0              72W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                   On |
      | N/A   29C    P0              71W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                   On |
      | N/A   26C    P0              71W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                   On |
      | N/A   25C    P0              70W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                   On |
      | N/A   29C    P0              73W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                   On |
      | N/A   31C    P0              72W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      
      +---------------------------------------------------------------------------------------+
      | MIG devices:                                                                          |
      +------------------+--------------------------------+-----------+-----------------------+
      | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
      |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
      |                  |                                |        ECC|                       |
      |==================+================================+===========+=======================|
      |  0    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  0    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  0    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  0   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4    5   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4   13   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4   14   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5    5   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5   13   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5   14   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
      |                  |               0MiB / 32767MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
      |                  |               0MiB / 16383MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      |  No running processes found                                                           |
      +---------------------------------------------------------------------------------------+
      
    4. Delete the sample pod:

      $ oc delete pod command-nvidia-smi
      

      Example Output

      pod "command-nvidia-smi" deleted
      

Example: Single MIG strategy

This example shows how to configure a single strategy with the all-3g.40gb configuration on one NVIDIA DGX H100 host with 8 x H100 80GB GPUs. The DGX H100 host runs a single node installation of OpenShift.

For information about the initial default MIG configuration and viewing it, refer to the beginning of Example: Mixed MIG strategy.

  1. Specify the host name, strategy, and configuration label in environment variables:

    $ NODE_NAME=myworker.redhat.com
    $ STRATEGY=single
    $ MIG_CONFIGURATION=all-3g.40gb
    
  2. Apply the strategy:

    $ oc patch clusterpolicy/gpu-cluster-policy --type='json' \
        -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": '$STRATEGY'}]'
    
  3. Label the node with the configuration label:

    $ oc label node $NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
    

    MIG manager applies a mig.config.state label to the GPU and then terminates all the GPU pods in preparation to enable MIG mode and configure the GPU into the specified configuration.

  4. Confirm that the GPU resources are available:

    $ oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
    

    The following sample output shows the expected 16 GPUs:

    Name:               myworker.redhat.com
    Roles:              control-plane,master,worker
    Capacity:
    nvidia.com/gpu:          16
    nvidia.com/mig-1g.10gb:  0
    nvidia.com/mig-2g.20gb:  0
    nvidia.com/mig-3g.40gb:  0
    Allocatable:
    nvidia.com/gpu:          16
    nvidia.com/mig-1g.10gb:  0
    nvidia.com/mig-2g.20gb:  0
    nvidia.com/mig-3g.40gb:  0
    Resource                Requests      Limits
    nvidia.com/mig-1g.10gb  0             0
    nvidia.com/mig-2g.20gb  0             0
    nvidia.com/mig-3g.40gb  0             0
    
  5. Optional: Start a pod to run the nvidia-smi command and display the GPU resources.

    1. Start the pod:

      $ cat <<EOF | oc apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: command-nvidia-smi
      spec:
        restartPolicy: Never
        containers:
        - name: cuda-container
          image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
          command: ["/bin/sh","-c"]
          args: ["nvidia-smi"]
      EOF
      
    2. Confirm the pod ran successfully:

      $ oc get pods
      

      Example Output

      NAME                 READY   STATUS      RESTARTS   AGE
      command-nvidia-smi   0/1     Completed   0          3m34s
      
    3. Confirm that the nvidia-smi output includes 16 MIG devices:

      $ oc logs command-nvidia-smi
      

      Example Output

      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                   On |
      | N/A   25C    P0              75W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                   On |
      | N/A   27C    P0              74W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                   On |
      | N/A   32C    P0              75W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                   On |
      | N/A   30C    P0              74W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                   On |
      | N/A   27C    P0              75W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                   On |
      | N/A   25C    P0              73W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                   On |
      | N/A   30C    P0              77W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                   On |
      | N/A   31C    P0              76W / 700W |                  N/A |     N/A      Default |
      |                                         |                      |              Enabled |
      +-----------------------------------------+----------------------+----------------------+
      
      +---------------------------------------------------------------------------------------+
      | MIG devices:                                                                          |
      +------------------+--------------------------------+-----------+-----------------------+
      | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
      |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
      |                  |                                |        ECC|                       |
      |==================+================================+===========+=======================|
      |  0    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  0    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  1    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  2    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  3    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  4    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  5    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  6    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      |  7    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
      |                  |               0MiB / 65535MiB  |           |                       |
      +------------------+--------------------------------+-----------+-----------------------+
      
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      |  No running processes found                                                           |
      +---------------------------------------------------------------------------------------+
      
    4. Delete the sample pod:

      $ oc delete pod command-nvidia-smi
      

      Example Output

      pod "command-nvidia-smi" deleted
      

Running a sample GPU application

Let’s run a simple CUDA sample, in this case vectorAdd by requesting a GPU resource as you would normally do in Kubernetes.

If the cluster is configured with the mixed advertisement strategy.

  1. Request the MIG instance with nvidia.com/mig-2g.10gb: 1 as follows:

    Note

    There is no need for a nodeSelector, as the Pod is necessarily scheduled on a 2g.10gb MIG instance.

    $ cat << EOF | oc create -f -
    
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvidia/samples:vectoradd-cuda11.2.1"
        resources:
          limits:
            nvidia.com/mig-2g.10gb: 1
    
    pod/cuda-vectoradd created
    
  2. Check the logs of the container:

    $ oc logs cuda-vectoradd
    
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    

If the cluster is configured with the single advertisement strategy.

  1. Request the MIG instance with nvidia.com/gpu: 1 and enforce the Pod scheduling on a node with a 2g.10gb MIG instance with the nodeSelector stanza as follows:

    $ cat << EOF | oc create -f -
    
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvidia/samples:vectoradd-cuda11.2.1"
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
    EOF
    

Disable the MIG mode

To turn MIG mode off so that you can utilize the full capacity of the GPU run the following:

$ MIG_CONFIGURATION=all-disabled && \
  oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite

Troubleshooting

The MIG reconfiguration is handled exclusively by the controller deployed within the nvidia-mig-manager DaemonSet. Inspecting the logs of these Pods should give a clue about what went wrong.

  1. Check the logs of the container:

    $ oc logs nvidia-mig-manager
    

    The cluster administrator is expected to drain the node from any GPU workload, before requesting the MIG reconfiguration. If the node is not properly drained, the nvidia-mig-manager will fail with this error in the logs:

     Updating MIG config: map[2g.10gb:3]
    Error clearing MigConfig: error destroying Compute instance for profile '(0, 0)': In use by another client
    Error clearing MIG config on GPU 0, erroneous devices may persist
    Error setting MIGConfig: error attempting multiple config orderings: all orderings failed
    Restarting all GPU clients previously shutdown by reenabling their component-specific nodeSelector labels
    Changing the 'nvidia.com/mig.config.state' node label to 'failed'
    

Resolve this issue by:

  1. Correctly draining the node. For guidance on draining a node see, the OpenShift Container Platform documentation Understanding how to evacuate pods on nodes.

  2. Retrigger the reconfiguration by forcing the label update:

    $ oc label node/$NODE_NAME nvidia.com/mig.config- --overwrite
    
    $ oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite