Time-slicing NVIDIA GPUs in OpenShift
Introduction
The latest generations of NVIDIA GPUs provide a mode of operation called Multi-Instance GPU (MIG). MIG allows you to partition a GPU into several smaller, predefined instances, each of which looks like a mini-GPU that provides memory and fault isolation at the hardware layer. Users can share access to a GPU by running their workloads on one of these predefined instances instead of the full GPU.
This document describes a new mechanism for enabling time-sharing of GPUs in OpenShift. It allows a cluster administrator to define a set of replicas for a GPU, each of which can be handed out independently to a pod to run workloads on.
Unlike MIG, there is no memory or fault-isolation between replicas, but for some workloads this is better than not being able to share at all. Under the hood, Compute Unified Device Architecture (CUDA) time-slicing is used to multiplex workloads from replicas of the same underlying GPU.
Configuring GPUs with time slicing
The following sections show you how to configure time slicing for NVIDIA Tesla T4 GPUs, which do not support MIG but can easily run multiple small workloads concurrently.
Enabling GPU Feature Discovery
GPU Feature Discovery (GFD) exposes the GPU type as node labels, which users can reference in node selectors to help the scheduler place pods. GFD is enabled by default when you create a ClusterPolicy custom resource. If you have disabled it, you can re-enable it with the following command:
$ oc patch clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator \
--type json \
--patch '[{"op": "replace", "path": "/spec/gfd/enable", "value": true}]'
Creating the slicing configurations
Before enabling a time-slicing configuration, you must tell the device plugin which configurations are available. Define them in a ConfigMap, with one entry per GPU model:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: nvidia-gpu-operator
data:
  A100-SXM4-40GB: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
          - name: nvidia.com/mig-1g.5gb
            replicas: 1
          - name: nvidia.com/mig-2g.10gb
            replicas: 2
          - name: nvidia.com/mig-3g.20gb
            replicas: 3
          - name: nvidia.com/mig-7g.40gb
            replicas: 7
  A100-SXM4-80GB: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
          - name: nvidia.com/mig-1g.10gb
            replicas: 1
          - name: nvidia.com/mig-2g.20gb
            replicas: 2
          - name: nvidia.com/mig-3g.40gb
            replicas: 3
          - name: nvidia.com/mig-7g.80gb
            replicas: 7
  Tesla-T4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
Save the manifest as device-plugin-config.yaml and create the ConfigMap:
$ oc create -f device-plugin-config.yaml
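You can confirm that the ConfigMap was created:
$ oc get configmap device-plugin-config -n nvidia-gpu-operator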
Tell the GPU Operator which ConfigMap to use for the device plugin configuration by patching the ClusterPolicy custom resource:
$ oc patch clusterpolicy gpu-cluster-policy \
    -n nvidia-gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "device-plugin-config"}}}}'
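You can then verify that the ClusterPolicy references the ConfigMap, for example with a JSONPath query:
$ oc get clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator \
    -o jsonpath='{.spec.devicePlugin.config.name}{"\n"}'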
Apply the configuration to all nodes that have Tesla T4 GPUs. GFD labels the nodes with the GPU product, in this example Tesla-T4, so you can use a node selector to label all of the matching nodes at once. You can also set devicePlugin.config.default=Tesla-T4, which applies the configuration across the cluster by default without requiring node-specific labels.
$ oc label --overwrite node \
    --selector=nvidia.com/gpu.product=Tesla-T4 \
    nvidia.com/device-plugin.config=Tesla-T4
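To see which nodes picked up the label, list them by selector:
$ oc get node --selector=nvidia.com/device-plugin.config=Tesla-T4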
After a few seconds, the configuration is applied and you can verify that GPU resource replicas have been created. The configuration above creates eight replicas for Tesla T4 GPUs, so the nvidia.com/gpu extended resource reports a capacity of 8.
$ oc get node --selector=nvidia.com/gpu.product=Tesla-T4-SHARED -o json \
    | jq '.items[0].status.capacity'
Example output
{ "attachable-volumes-aws-ebs": "39", "cpu": "4", "ephemeral-storage": "125293548Ki", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "16105592Ki", "nvidia.com/gpu": "8", "pods": "250" }
Note that a -SHARED suffix has been added to the nvidia.com/gpu.product label to reflect that time slicing is enabled. You can change this behavior in the configuration with the renameByDefault option. For example, the Tesla T4 configuration would look like this:
version: v1
sharing:
  timeSlicing:
    renameByDefault: false
    resources:
      - name: nvidia.com/gpu
        replicas: 8
Verify that GFD labels have been added to indicate time-sharing.
$ oc get node --selector=nvidia.com/gpu.product=Tesla-T4-SHARED -o json \
    | jq '.items[0].metadata.labels' | grep nvidia
Example output
"nvidia.com/cuda.driver.major": "510", "nvidia.com/cuda.driver.minor": "73", "nvidia.com/cuda.driver.rev": "08", "nvidia.com/cuda.runtime.major": "11", "nvidia.com/cuda.runtime.minor": "7", "nvidia.com/device-plugin.config": "Tesla-T4", "nvidia.com/gfd.timestamp": "1655482336", "nvidia.com/gpu.compute.major": "7", "nvidia.com/gpu.compute.minor": "5", "nvidia.com/gpu.count": "1", "nvidia.com/gpu.deploy.container-toolkit": "true", "nvidia.com/gpu.deploy.dcgm": "true", "nvidia.com/gpu.deploy.dcgm-exporter": "true", "nvidia.com/gpu.deploy.device-plugin": "true", "nvidia.com/gpu.deploy.driver": "true", "nvidia.com/gpu.deploy.gpu-feature-discovery": "true", "nvidia.com/gpu.deploy.node-status-exporter": "true", "nvidia.com/gpu.deploy.nvsm": "", "nvidia.com/gpu.deploy.operator-validator": "true", "nvidia.com/gpu.family": "turing", "nvidia.com/gpu.machine": "g4dn.xlarge", "nvidia.com/gpu.memory": "16106127360", "nvidia.com/gpu.present": "true", "nvidia.com/gpu.product": "Tesla-T4-SHARED", "nvidia.com/gpu.replicas": "8", "nvidia.com/mig.strategy": "single",
If you remove the label, the node configuration is reset to its default.
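With the replicas advertised, a workload can request one of them like a regular GPU. The following pod manifest is a minimal sketch; the pod name and container image are illustrative, and the node selector assumes the Tesla-T4-SHARED product label shown above:
apiVersion: v1
kind: Pod
metadata:
  name: time-sliced-gpu-test   # illustrative name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4-SHARED
  containers:
    - name: cuda-vectoradd
      # Illustrative image; any CUDA-capable workload image works here.
      image: nvidia/samples:vectoradd-cuda11.2.1
      resources:
        limits:
          nvidia.com/gpu: 1   # one time-sliced replica, not a full GPU
Because each replica is a time slice of the same physical GPU, requesting more than one of them does not provide proportionally more compute (see the note at the end of this document).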
Applying the configuration to a MachineSet
With OpenShift, you can leverage the Machine Management feature to dynamically provision nodes on platforms that support it.
For example, an administrator can create a MachineSet for nodes with Tesla T4 GPUs configured with time-slicing enabled. This provides a pool of replicas for workloads that don’t require a full T4 GPU.
Consider a MachineSet named worker-gpu-nvidia-t4-us-east-1a with a Machine Autoscaler configured. You want new nodes to have time slicing enabled automatically, that is, you want the nvidia.com/device-plugin.config=Tesla-T4 label applied to every new node. You can do this by setting the label in the MachineSet template:
$ oc patch machineset worker-gpu-nvidia-t4-us-east-1a \
    -n openshift-machine-api --type merge \
    --patch '{"spec": {"template": {"spec": {"metadata": {"labels": {"nvidia.com/device-plugin.config": "Tesla-T4"}}}}}}'
Now, any new machine created by the Machine Autoscaler for this MachineSet will carry the label and therefore have time slicing enabled.
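To confirm that the template carries the label, you can inspect it with a JSONPath query, for example:
$ oc get machineset worker-gpu-nvidia-t4-us-east-1a -n openshift-machine-api \
    -o jsonpath='{.spec.template.spec.metadata.labels}{"\n"}'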
Sample ConfigMap values
The following table shows sample values for a ConfigMap that contains multiple config.yaml entries (small, medium, and large).
| Field | Description | Small | Medium | Large |
|---|---|---|---|---|
| replicas | The number of replicas that can be specified for each named resource. | 2 | 5 | 10 |
| renameByDefault | When renameByDefault is set to true, each shared resource is advertised under the name <resource-name>.shared instead of the plain resource name. | false | false | false |
| failRequestsGreaterThanOne | This flag controls whether requests for more than one shared GPU replica are rejected, treating a request of 1 as shared access rather than an exclusive allocation (see the note below). | false | false | false |
Note
Unlike with standard GPU requests, requesting more than one shared GPU does not guarantee that you will have access to a proportional amount of compute power. It only specifies that you will have access to a GPU that is shared by other clients, each of which is free to run as many processes on the underlying GPU as it wants. Internally, the GPU simply gives an equal share of time to all GPU processes across all of the clients. The failRequestsGreaterThanOne flag is meant to help users understand this subtlety, by treating a request of 1 as an access request rather than an exclusive resource request. Setting failRequestsGreaterThanOne=true is recommended, but it is set to false by default to retain backwards compatibility.
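For example, to follow that recommendation, you could add the flag to the Tesla-T4 entry of the ConfigMap shown earlier. This is a sketch assuming the same eight-replica configuration:
version: v1
sharing:
  timeSlicing:
    renameByDefault: false
    failRequestsGreaterThanOne: true
    resources:
      - name: nvidia.com/gpu
        replicas: 8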