NVIDIA GPUs with Google Cloud’s Anthos

Changelog

  • 3/22/2020 (author: PR):
    • Fixed URLs

  • 11/30/2020 (author: PR/DF):
    • Added information on Anthos on bare metal

  • 11/25/2020 (author: PR):
    • Migrated docs to new format

  • 8/14/2020 (author: PR):
    • Initial Version

Introduction

Google Cloud’s Anthos is a modern application management platform that lets users build, deploy, and manage applications anywhere in a secure, consistent manner. The platform provides a consistent development and operations experience across deployments while reducing operational overhead and improving developer productivity. Anthos runs in hybrid and multi-cloud environments that spans Google Cloud, on-premise, and is generally available on Amazon Web Services (AWS). Support for Anthos on Microsoft Azure is in preview. For more information on Anthos, see the product overview.

Systems with NVIDIA GPUs can be deployed in various configurations for use with Google Cloud’s Anthos. The purpose of this document is to provide users with steps on getting started with using NVIDIA GPUs with Anthos in these various configurations.

Deployment Configurations

Anthos can be deployed in different configurations. Depending on your deployment, choose one of the sections below to get started with NVIDIA GPUs in Google Cloud’s Anthos:

  1. Anthos Clusters on Bare Metal with NVIDIA DGX Systems and GPU-Accelerated Servers

  2. Anthos Clusters with VMware and NVIDIA GPU-Accelerated Servers

Supported Platforms

GPUs

The following GPUs are supported:

  • NVIDIA A100, T4 and V100

DGX Systems

The following NVIDIA DGX systems are supported:

  • NVIDIA DGX A100

  • NVIDIA DGX-2 and DGX-1 (Volta)

Linux Distributions

The following Linux distributions are supported:

  • Ubuntu 18.04.z, 20.04.z LTS

For more information on the Anthos Ready platforms, visit this page.

Getting Support

For support issues related to using GPUs with Anthos, please open a ticket on the NVIDIA GPU Operator GitHub project. Your feedback is appreciated.

DGX customers can visit the NVIDIA DGX Systems Support Portal.

Anthos Clusters on Bare Metal with NVIDIA DGX Systems and GPU-Accelerated Servers

Anthos on bare metal with DGX A100 or NVIDIA GPU-accelerated servers systems enables a consistent development and operational experience across deployments, while reducing expensive overhead and improving developer productivity. Refer to the Anthos documentation for more information on Anthos cluster environments.

Installation Flow

The basic steps described in this document follows this workflow:

  1. Configure nodes

    • Ensure each node (including the control plane) meets the pre-requisites, including time synchronization, correct versions of Docker and other conditions.

  2. Configure networking (Optional)

    • Ensure network connectivity between control plane and nodes - ideally the VIPs, control plane and the nodes in the cluster are in the same network subnet.

  3. Configure an admin workstation and set up Anthos to create the cluster

    • Set up the cluster using Anthos on bare-metal

  4. Setup NVIDIA software on GPU nodes

    • Set up the NVIDIA software components on the GPU nodes to ensure that your cluster can run CUDA applications.

At the end of the installation flow, you should have a user cluster with GPU-enabled nodes that you can use to deploy applications.

Configure Nodes

These steps are required on each node in the cluster (including the control plane).

Time Synchronization

  • Ensure apparmor is stopped:

    $ apt-get install -y apparmor-utils policycoreutils
    
    $ systemctl --now enable apparmor \
       && systemctl stop apparmor
    
  • Synchronize the time on each node:

    • Check the current time

      $ timedatectl
      
                     Local time: Fri 2020-11-20 10:38:06 PST
                 Universal time: Fri 2020-11-20 18:38:06 UTC
                       RTC time: Fri 2020-11-20 18:38:08
                      Time zone: US/Pacific (PST, -0800)
      System clock synchronized: no
                    NTP service: active
                RTC in local TZ: no
      
    • Configure the NTP server in /etc/systemd/timesyncd.conf:

      NTP=time.google.com
      
    • Adjust the system clock:

      $ timedatectl set-local-rtc 0 --adjust-system-clock
      
    • Restart the service

      $ systemctl restart systemd-timesyncd.service
      
    • Verify the synchronization with the time server

      $ timedatectl
      
                     Local time: Fri 2020-11-20 11:03:22 PST
                 Universal time: Fri 2020-11-20 19:03:22 UTC
                       RTC time: Fri 2020-11-20 19:03:22
                      Time zone: US/Pacific (PST, -0800)
      System clock synchronized: yes
                    NTP service: active
                RTC in local TZ: no
      

Test Network Connectivity

  • Ensure you can nslookup on hostname

    $ systemctl restart systemd-resolved \
       && ping us.archive.ubuntu.com
    
    ping: us.archive.ubuntu.com: Temporary failure in name resolution
    
  • Check the nameserver in resolve.conf

    $ cat <<EOF > /etc/resolv.conf
    nameserver 8.8.8.8
    EOF
    
  • And re-test ping

    $ ping us.archive.ubuntu.com
    
    PING us.archive.ubuntu.com (91.189.91.38) 56(84) bytes of data.
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=1 ttl=49 time=73.4 ms
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=2 ttl=49 time=73.3 ms
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=3 ttl=49 time=73.4 ms
    

Install Docker

Follow these steps to install Docker. On DGX systems, Docker may already be installed using the docker-ce package. In this case, use docker.io as the base installation package for Docker to ensure a successful cluster setup with Anthos.

  • Stop services using docker:

    $ systemctl stop kubelet \
       && systemctl stop docker \
       && systemctl stop containerd \
       && systemctl stop containerd.io
    
  • Purge the existing packages of Docker and nvidia-docker2 if any:

    $ systemctl stop run-docker-netns-default.mount \
       && systemctl stop docker.haproxy
    
    $ dpkg -r nv-docker-options \
       && dpkg --purge nv-docker-options \
       && dpkg -r nvidia-docker2 \
       && dpkg --purge nvidia-docker2 \
       && dpkg -r docker-ce \
       && dpkg --purge docker-ce \
       && dpkg -r docker-ce-cli \
       && dpkg -r containerd \
       && dpkg --purge containerd \
       && dpkg -r containerd.io \
       && dpkg --purge
    
  • Re-install Docker

    $ apt-get update \
       && apt-get install -y apt-transport-https \
          ca-certificates \
          curl \
          software-properties-common \
          inetutils-traceroute \
          conntrack
    
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
    
    $ add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) stable"
    
    $ apt-get update \
       && apt-get install -y docker.io
    
    $ systemctl --now enable docker
    
Install nvidia-docker2 on GPU Nodes

Note

This step should be performed on the GPU nodes only

For DGX systems, re-install nvidia-docker2 from the DGX repositories:

$ apt-get install -y nvidia-docker2

Since Kubernetes does not support the --gpus option with Docker yet, the nvidia runtime should be setup as the default container runtime for Docker on the GPU node. This can be done by adding the default-runtime line into the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:

{
   "default-runtime": "nvidia",
   "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
      }
   }
}

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker

For non-DGX systems, refer to the NVIDIA Container Toolkit installation guide to setup nvidia-docker2.

Configure Networking (Optional)

Note

The following steps are provided as a reference for configuring the network so that the control plane and the nodes are on the same subnet by using tunnels and DNAT. If the nodes in your cluster are on the same subnet, then you may skip this step.

In the example below:

  • The control plane is at 10.117.29.41

  • The GPU node or admin workstation is at 10.110.20.149

  • The control plane VIP is 10.0.0.8

If the machines are on a different subnet than each other or the control plane VIP then tunnel routes can be used to establish connectivity.

There are two scenarios to consider:

  1. If the machines are on the same subnet, but the VIP is on a different subnet, then add the correct IP route (using ip route add 10.0.0.8 via <contro-plane-ip> from the GPU node or admin-workstation

  2. If the machines and VIP are on different subnets, then a tunnel is also needed to enable the above route command to succeed where <control-plane-ip> is the control plane tunnel 192.168.210.1.

Control Plane

Setup tunneling:

$ ip tunnel add tun0 mode ipip local 10.117.29.41 remote 10.110.20.149
$ ip addr add 192.168.200.1/24 dev tun0
$ ip link set tun0 up

Update DNAT to support the control plane VIP over the tunnel:

$ iptables -t nat -I PREROUTING  -p udp -d 192.168.210.1  --dport 6081 -j DNAT --to-destination 10.117.29.41

GPU Node or Admin Workstation

Establish connectivity with the control plane:

$ ip tunnel add tun1 mode ipip local 10.110.20.149  remote 10.117.29.41
$ ip addr add 192.168.210.2/24 dev tun1
$ ip link set tun1 up
$ ip route add 10.0.0.8/32 via 192.168.210.1

Setup DNAT:

$ iptables -t nat -I OUTPUT -p udp -d 10.117.29.41  --dport 6081 -j DNAT --to-destination 192.168.210.1

Configure Admin Workstation

Configure the admin workstation prior to setting up the cluster.

Download the Google Cloud SDK:

$ wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-314.0.0-linux-x86_64.tar.gz \
   && tar -xf google-cloud-sdk-314.0.0-linux-x86_64.tar.gz
$ google-cloud-sdk/install.sh

Install the Anthos authentication components:

$ gcloud components install anthos-auth

See the Anthos installtion overview for detailed instructions for installing Anthos in an on-premise environment and setup your cluster.

Setup NVIDIA Software on GPU Nodes

Once the Anthos cluster has been set up, you can proceed to deploy the NVIDIA software components on the GPU nodes.

NVIDIA Drivers

Note

DGX systems include the NVIDIA drivers. This step can be skipped.

For complete instructions on setting up NVIDIA drivers, visit the quickstart guide at https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html. The guide covers a number of pre-installation requirements and steps on supported Linux distributions for a successful install of the driver.

NVIDIA Device Plugin

To use GPUs in Kubernetes, the NVIDIA Device Plugin is required. The NVIDIA Device Plugin is a daemonset that automatically enumerates the number of GPUs on each node of the cluster and allows pods to be run on GPUs.

The preferred method to deploy the device plugin is as a daemonset using helm.

Add the nvidia-device-plugin helm repository:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update

Deploy the device plugin:

$ helm install --generate-name nvdp/nvidia-device-plugin

For more user configurable options while deploying the daemonset, refer to the device plugin README

Node Feature Discovery

For detecting the hardware configuration and system configuration, we will deploy the Node Feature Discovery add-on:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.6.0/nfd-master.yaml.template
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.6.0/nfd-worker-daemonset.yaml.template

See the NFD documentation for more information on NFD.

Anthos Clusters with VMware and NVIDIA GPU Accelerated Servers

Anthos running on-premise has requirements for which vSphere versions are supported along with network and storage requirements. Please see the Anthos version compatibility matrix for more information: https://cloud.google.com/anthos/gke/docs/on-prem/versioning-and-upgrades#version_compatibility_matrix.

This guide assumes that the user already has an installed Anthos on-premise cluster in a vSphere environment. Please see https://cloud.google.com/anthos/gke/docs/on-prem/how-to/install-overview-basic for detailed instructions for installing Anthos in an on-premise environment.

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuration of multiple software components such as drivers, container runtimes or other libraries which are difficult and prone to errors. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs.

In the VMware vSphere configuration, Anthos uses the NVIDIA GPU Operator to configure GPU nodes in the Kubernetes cluster so that the nodes can be used to schedule CUDA applications. The GPU Operator itself is deployed using Helm. The rest of this section provides users with steps on getting started.

Configuring PCIe Passthrough

For the GPU to be accessible to the VM, first you must enable PCI Passthrough on the ESXi host. This can be done from the vSphere client. This will require a reboot of the ESXi host to complete the process and therefore the host should be put into maintenance mode and any VMs running on the ESXi host evacuated to another. If you only have a single ESXi host, then the VMs will need to be restarted after the reboot.

From the vSphere client, select an ESXi host from the Inventory of VMware vSphere Client. In the Configure tab, click Hardware > PCI Devices. This will show you the passthrough-enabled devices (you will most likely find none at this time).

../_images/image01.png

Click CONFIGURE PASSTHROUGH to launch the Edit PCI Device Availability window. Look for the GPU device and select the checkbox next to it (the GPU device will be recognizable as having NVIDIA Corporation in the Vendor Name view). Select the GPU devices (you may have more than one) and click OK.

../_images/image02.png

At this point, the GPU(s) will appear as Available (pending). You will need to select Reboot This Host and complete the reboot before proceeding to the next step.

../_images/image03.png

It is a VMware best practice to reboot an ESXi host only when it is in maintenance mode and after all the VMs have been migrated to other hosts. If you have only 1 ESXi host, then you can reboot without migrating the VMs, though shutting them down gracefully first is always a good idea.

../_images/image04.png

Once the server has rebooted. Make sure to remove maintenance mode (if it was used) or restart the VMs that needed to be stopped (when only a single ESXi host is used).

Adding GPUs to a Node

Creating a Node Pool for the GPU Node

Note

This is an optional step.

Node Pools are a good way to specify pools of Kubernetes worker nodes which may have different or unique attributes. In this case, we have the opportunity to create a node pool which contains workers that manually have a GPU assigned to it. See here for more information regarding node pools with Anthos on-premise.

First, edit your user cluster config.yaml file on the admin workstation and add an additional node pool:

- name: user-cluster1-gpu
  cpus: 4
  memoryMB: 8192
  replicas: 1
  labels:
    hardware: gpu

After adding the node pool to your configuration, use the gkectl update command push the change:

$ gkectl update cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
   --config [USER_CLUSTER_KUBECONFIG]
Reading config with version "v1"
Update summary for cluster user-cluster1-bundledlb:
   Node pool(s) to be created: [user-cluster1-gpu]
Do you want to continue? [Y/n]: Y
Updating cluster "user-cluster1-bundledlb"...
Creating node MachineDeployment(s) in user cluster...  DONE
Done updating the user cluster

Add GPUs to Nodes in vSphere

Select an existing user-cluster node to add a GPU to (if you created a node pool with the previous step then you would choose a node from that pool). Make sure that this VM is on the host with the GPU (if you have vMotion enabled this could be as simple as right clicking on the VM and selecting Migrate).

To configure a PCI device on a virtual machine, from the Inventory in vSphere Client, right-click the virtual machine and select Power->Power Off.

../_images/image05.png

After the VM is powered off, right-click the virtual machine and click Edit Settings.

../_images/image06.png

Within the Edit Settings window, click ADD NEW DEVICE.

../_images/image07.png

Choose PCI Device from the dropdown.

../_images/image08.png

You may need to select the GPU or if it’s the only device available it may be automatically selected for you. If you don’t see the GPU, it’s possible your VM is not currently on the ESXi host with the passthrough device configured.

../_images/image09.png

Expand the Memory section and make sure to select the option for Reserve all Guest Memory (All locked).

../_images/image10.png

Click OK.

Before the VM can be started, the VM/Host Rule for VM anti-affinity must be deleted. (Note that this step may not be necessary if your cluster’s config.yaml contains antiAffinityGroups.enabled: False). From the vSphere Inventory list, click on the cluster then the Configure tab and then under Configuration select VM/Host Rules. Select the rule containing your node and delete it.

../_images/image11.png

Now you can power on the VM, right click on the VM and select Power>Power On.

../_images/image12.png

If vSphere presents you with Power On Recommendations then select OK.

../_images/image13.png

The following steps should be performed from your Admin Workstation or other Linux system which has the ability to use kubectl to work with the cluster.

Install NVIDIA GPU Operator

Install Helm

The preferred method to deploy the GPU Operator is using helm.

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Now, add the NVIDIA Helm repository:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update

Install the GPU Operator

The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment.

Chart Customization Options

The following options are available when using the Helm chart. These options can be used with --set when installing via Helm.

Parameter

Description

Default

nfd.enabled

Deploys Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.

true

operator.defaultRuntime

DEPRECATED as of v1.9

docker

mig.strategy

Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.

single

psp.enabled

The GPU operator deploys PodSecurityPolicies if enabled.

false

driver.enabled

By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.

true

driver.repository

The images are downloaded from NGC. Specify another image repository when using custom driver images.

nvcr.io/nvidia

driver.version

Version of the NVIDIA datacenter driver supported by the Operator.

Depends on the version of the Operator. See the Component Matrix for more information on supported drivers.

driver.rdma.enabled

Controls whether the driver daemonset should build and load the nvidia-peermem kernel module.

false

driver.rdma.useHostMofed

Indicate if MOFED is directly pre-installed on the host. This is used to build and load nvidia-peermem kernel module.

false

toolkit.enabled

By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.

true

migManager.enabled

The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (for e.g. A100).

true

Namespace

Prior to GPU Operator v1.9, the operator was installed in the default namespace while all operands were installed in the gpu-operator-resources namespace.

Starting with GPU Operator v1.9, both the operator and operands get installed in the same namespace. The namespace is configurable and is determined during installation. For example, to install the GPU Operator in the gpu-operator namespace:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace
     nvidia/gpu-operator

If a namespace is not specified during installation, all GPU Operator components will be installed in the default namespace.

Common Deployment Scenarios

In this section, we present some common deployment recipes when using the Helm chart to install the GPU Operator.

Bare-metal/Passthrough with default configurations on Ubuntu

In this scenario, the default configuration options are used:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator
Bare-metal/Passthrough with default configurations on CentOS

In this scenario, the CentOS toolkit image is used:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set toolkit.version=1.7.1-centos7

Note

  • For CentOS 8 systems, use toolkit.version=1.7.1-centos8.

  • Replace 1.7.1 toolkit version used here with the latest one available here.


NVIDIA vGPU

Note

The GPU Operator with NVIDIA vGPUs requires additional steps to build a private driver image prior to install. Refer to the document NVIDIA vGPU for detailed instructions on the workflow and required values of the variables used in this command.

The command below will install the GPU Operator with its default configuration for vGPU:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.repository=$PRIVATE_REGISTRY \
     --set driver.version=$VERSION \
     --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
     --set driver.licensingConfig.configMapName=licensing-config

NVIDIA AI Enterprise

Note

The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the document NVIDIA AI Enterprise for instructions prior to running the below commands.

Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key for accessing the NVIDIA Enterprise Collection that you generated:

$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie \
    --username='$oauthtoken' --password=api-key \
    && helm repo update

Install the NVIDIA GPU Operator:

$ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator-resources

Bare-metal/Passthrough with pre-installed NVIDIA drivers

In this example, the user has already pre-installed NVIDIA drivers as part of the system image:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false

Bare-metal/Passthrough with pre-installed drivers and NVIDIA Container Toolkit

In this example, the user has already pre-installed the NVIDIA drivers and NVIDIA Container Toolkit (nvidia-docker2) as part of the system image.

Note

These steps should be followed when using the GPU Operator v1.8+ on DGX systems such as DGX A100.

Before installing the operator, ensure that the following configurations are modified depending on the container runtime configured in your cluster.

Docker:

  • Update the Docker configuration to add nvidia as the default runtime. The nvidia runtime should be setup as the default container runtime for Docker on GPU nodes. This can be done by adding the default-runtime line into the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
          }
        }
    }
    

    Restart the Docker daemon to complete the installation after setting the default runtime:

    $ sudo systemctl restart docker
    

Containerd:

  • Update containerd to use nvidia as the default runtime and add nvidia runtime configuration. This can be done by adding below config to /etc/containerd/config.toml and restarting containerd service.

    version = 2
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
    
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              privileged_without_host_devices = false
              runtime_engine = ""
              runtime_root = ""
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
    

    Restart the Containerd daemon to complete the installation after setting the default runtime:

    $ sudo systemctl restart containerd
    

Install the GPU operator with the following options:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false

Bare-metal/Passthrough with pre-installed NVIDIA Container Toolkit (but no drivers)

In this example, the user has already pre-installed the NVIDIA Container Toolkit (nvidia-docker2) as part of the system image.

Before installing the operator, ensure that the following configurations are modified depending on the container runtime configured in your cluster.

Docker:

  • Update the Docker configuration to add nvidia as the default runtime. The nvidia runtime should be setup as the default container runtime for Docker on GPU nodes. This can be done by adding the default-runtime line into the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
          }
        }
    }
    

    Restart the Docker daemon to complete the installation after setting the default runtime:

    $ sudo systemctl restart docker
    

Containerd:

  • Update containerd to use nvidia as the default runtime and add nvidia runtime configuration. This can be done by adding below config to /etc/containerd/config.toml and restarting containerd service.

    version = 2
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
    
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              privileged_without_host_devices = false
              runtime_engine = ""
              runtime_root = ""
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
    

    Restart the Containerd daemon to complete the installation after setting the default runtime:

    $ sudo systemctl restart containerd
    

Configure toolkit to use the root directory of the driver installation as /run/nvidia/driver, which is the path mounted by driver container.

$ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml

Once these steps are complete, now install the GPU operator with the following options (which will provision a driver):

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set toolkit.enabled=false

Custom driver image (based off a specific driver version)

If you want to use custom driver container images (for e.g. using 465.27), then you would need to build a new driver container image. Follow these steps:

  • Rebuild the driver container by specifying the $DRIVER_VERSION argument when building the Docker image. For reference, the driver container Dockerfiles are available on the Git repo here

  • Build the container using the appropriate Dockerfile. For example:

    $ docker build --pull -t \
        --build-arg DRIVER_VERSION=455.28 \
        nvidia/driver:455.28-ubuntu20.04 \
        --file Dockerfile .
    

    Ensure that the driver container is tagged as shown in the example by using the driver:<version>-<os> schema.

  • Specify the new driver image and repository by overriding the defaults in the Helm install command. For example:

    $ helm install --wait --generate-name \
         -n gpu-operator --create-namespace \
         nvidia/gpu-operator \
         --set driver.repository=docker.io/nvidia \
         --set driver.version="465.27"
    

Note that these instructions are provided for reference and evaluation purposes. Not using the standard releases of the GPU Operator from NVIDIA would mean limited support for such custom configurations.


Custom configuration for runtime containerd

When containerd is the container runtime used, the following configuration options are also respected:

toolkit:
   env:
   - name: CONTAINERD_CONFIG
   value: /etc/containerd/config.toml
   - name: CONTAINERD_SOCKET
   value: /run/containerd/containerd.sock
   - name: CONTAINERD_RUNTIME_CLASS
   value: nvidia
   - name: CONTAINERD_SET_AS_DEFAULT
   value: true

These options are defined as follows:

  • CONTAINERD_CONFIGThe path on the host to the containerd config

    you would like to have updated with support for the nvidia-container-runtime. By default this will point to /etc/containerd/config.toml (the default location for containerd). It should be customized if your containerd installation is not in the default location.

  • CONTAINERD_SOCKETThe path on the host to the socket file used to

    communicate with containerd. The operator will use this to send a SIGHUP signal to the containerd daemon to reload its config. By default this will point to /run/containerd/containerd.sock (the default location for containerd). It should be customized if your containerd installation is not in the default location.

  • CONTAINERD_RUNTIME_CLASSThe name of the

    Runtime Class you would like to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.

  • CONTAINERD_SET_AS_DEFAULTA flag indicating whether you want to set

    nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will be run with the nvidia-container-runtime. The default value is true.


Proxy Environments

Refer to the section Install GPU Operator in Proxy Environments for more information on how to install the Operator on clusters behind a HTTP proxy.


Air-gapped Environments

Refer to the section Install GPU Operator in Air-gapped Environments for more information on how to install the Operator in air-gapped environments.


Multi-Instance GPU (MIG)

Refer to the document GPU Operator with MIG for more information on how use the Operator with Multi-Instance GPU (MIG) on NVIDIA Ampere products. For guidance on configuring MIG support for the NVIDIA GPU Operator in an OpenShift Container Platform cluster, see the user guide.


Outdated Kernels

Refer to the section Considerations when Installing with Outdated Kernels in Cluster for more information on how to install the Operator successfully when nodes in the cluster are not running the latest kernel


Verify GPU Operator Install

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running and the validation is complete:

$ kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-crrsq                                   1/1     Running     0          60s
gpu-operator-7fb75556c7-x8spj                                 1/1     Running     0          5m13s
gpu-operator-node-feature-discovery-master-58d884d5cc-w7q7b   1/1     Running     0          5m13s
gpu-operator-node-feature-discovery-worker-6rht2              1/1     Running     0          5m13s
gpu-operator-node-feature-discovery-worker-9r8js              1/1     Running     0          5m13s
nvidia-container-toolkit-daemonset-lhgqf                      1/1     Running     0          4m53s
nvidia-cuda-validator-rhvbb                                   0/1     Completed   0          54s
nvidia-dcgm-5jqzg                                             1/1     Running     0          60s
nvidia-dcgm-exporter-h964h                                    1/1     Running     0          60s
nvidia-device-plugin-daemonset-d9ntc                          1/1     Running     0          60s
nvidia-device-plugin-validator-cm2fd                          0/1     Completed   0          48s
nvidia-driver-daemonset-5xj6g                                 1/1     Running     0          4m53s
nvidia-mig-manager-89z9b                                      1/1     Running     0          4m53s
nvidia-operator-validator-bwx99                               1/1     Running     0          58s

We can now proceed to running some sample GPU workloads to verify that the Operator (and its components) are working correctly.

Running GPU Applications

Jupyter Notebooks

This section of the guide walks through how to run a sample Jupyter notebook on the Kubernetes cluster.

  1. Create a yaml file for the pod and service for the notebook:

    $ LOADBALANCERIP=<ip address to be used to expose the service>
    
    $ cat << EOF | kubectl create -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      type: LoadBalancer
      loadBalancerIP: $LOADBALANCERIP
      ports:
      - port: 80
        name: http
        targetPort: 8888
        nodePort: 30001
      selector:
        app: tf-notebook
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      securityContext:
        fsGroup: 0
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
          name: notebook
    EOF
    
  2. View the logs of the tf-notebook pod to obtain the token:

    $ kubectl logs tf-notebook
    
    [I 19:07:43.061 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 19:07:43.423 NotebookApp] Serving notebooks from local directory: /tf
    [I 19:07:43.423 NotebookApp] The Jupyter Notebook is running at:
    [I 19:07:43.423 NotebookApp] http://tf-notebook:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:07:43.423 NotebookApp]  or http://127.0.0.1:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:07:43.423 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 19:07:43.429 NotebookApp]
    
       To access the notebook, open this file in a browser:
          file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
       Or copy and paste one of these URLs:
          http://tf-notebook:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
       or http://127.0.0.1:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:08:24.180 NotebookApp] 302 GET / (172.16.20.30) 0.61ms
    [I 19:08:24.182 NotebookApp] 302 GET /tree? (172.16.20.30) 0.57ms
    
  3. From a web browser, navigate to http://<LOADBALANCERIP> and enter the token where prompted to login: Depending on your environment you may not have web browser access to the exposed service. You may be able to use SSH Port Forwarding/Tunneling to achieve this.

    ../_images/image14.png
  4. Once logged in, navigate click on the tenserflow-tutorials folder and then on the first file, classification.ipynb:

    ../_images/image15.png
  5. This will launch a new tab with the Notebook loaded. You can now run through the Notebook by clicking on the Run button. The notebook will step through each section and execute the code as you go. Continue pressing Run until you reach the end of the notebook and observe the execution of the classification program.

    ../_images/image16.png
  6. Once the notebook is complete you can check the logs of the tf-notebook pod to confirm it was using the GPU:

    =========snip===============
    [I 19:17:58.116 NotebookApp] Saving file at /tensorflow-tutorials/classification.ipynb
    2020-05-21 19:21:01.422482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
    2020-05-21 19:21:01.436767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.437469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
    pciBusID: 0000:13:00.0 name: Tesla P4 computeCapability: 6.1
    coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
    2020-05-21 19:21:01.438477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.462370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    2020-05-21 19:21:01.475269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
    2020-05-21 19:21:01.478104: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
    2020-05-21 19:21:01.501057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
    2020-05-21 19:21:01.503901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
    2020-05-21 19:21:01.544763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2020-05-21 19:21:01.545022: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.545746: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.546356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
    2020-05-21 19:21:01.546705: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2020-05-21 19:21:01.558283: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2194840000 Hz
    2020-05-21 19:21:01.558919: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f6f2c000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-05-21 19:21:01.558982: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2020-05-21 19:21:01.645786: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.646387: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53ab350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    2020-05-21 19:21:01.646430: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P4, Compute Capability 6.1
    2020-05-21 19:21:01.647005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.647444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
    pciBusID: 0000:13:00.0 name: Tesla P4 computeCapability: 6.1
    coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
    2020-05-21 19:21:01.647523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.647570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    2020-05-21 19:21:01.647611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
    2020-05-21 19:21:01.647647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
    2020-05-21 19:21:01.647683: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
    2020-05-21 19:21:01.647722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
    2020-05-21 19:21:01.647758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2020-05-21 19:21:01.647847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.648311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.648720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
    2020-05-21 19:21:01.649158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.650302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-21 19:21:01.650362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
    2020-05-21 19:21:01.650392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
    2020-05-21 19:21:01.650860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.651341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.651773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7048 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:13:00.0, compute capability: 6.1)
    2020-05-21 19:21:03.601093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    [I 19:21:58.132 NotebookApp] Saving file at /tensorflow-tutorials/classification.ipynb
    

Uninstall and Cleanup

You can remove the tf-notebook and service with the following commands:

$ kubectl delete pod tf-notebook
$ kubectl delete svc tf-notebook

You can remove the GPU operator with the command:

$ helm uninstall $(helm list | grep gpu-operator | awk '{print $1}')
release "gpu-operator-1590086955" uninstalled

You can now stop the VM, remove the PCI device, remove the memory reservation, and restart the VM.

You do not need to remove the PCI passthrough device from the host.

Known Issues

This section outlines some known issues with using Google Cloud’s Anthos with NVIDIA GPUs.

  1. Attaching a GPU to a Anthos on-prem worker node requires manually editing the VM from vSphere. These changes will not survive an Anthos on-prem upgrade process. When the node with the GPU is deleted as part of the update process, the new VM replacing it will not have the GPU added. The GPU must be added back to a new VM manually again. While the NVIDIA GPU seems to be able to handle that event gracefully, the workload backed by the GPU may need to be initiated again manually.

  2. Attaching a VM to the GPU means that the VM can no longer be migrated to another ESXi host. The VM will essentially be pinned to the ESXi host which hosts the GPU. vMotion and VMware HA features cannot be used.

  3. VMs that use a PCI Passthrough device require that their full memory allocation be locked. This will cause a Virtual machine memory usage alarm on the VM which can safely be ignored.

    ../_images/image17.png