NVIDIA GPUs with Google Anthos

Changelog

  • 3/22/2021 (author: PR):
    • Fixed URLs

  • 11/30/2020 (author: PR/DF):
    • Added information on Anthos on bare metal

  • 11/25/2020 (author: PR):
    • Migrated docs to new format

  • 8/14/2020 (author: PR):
    • Initial Version

Introduction

Google Cloud’s Anthos is a modern application management platform that lets users build, deploy, and manage applications anywhere in a secure, consistent manner. The platform provides a consistent development and operations experience across deployments while reducing operational overhead and improving developer productivity. Anthos runs in hybrid and multi-cloud environments that span Google Cloud and on-premise deployments, and is generally available on Amazon Web Services (AWS). Support for Anthos on Microsoft Azure is in preview. For more information on Anthos, see the product overview.

Systems with NVIDIA GPUs can be deployed in various configurations for use with Google Cloud’s Anthos. This document provides the steps for getting started with NVIDIA GPUs and Anthos in these configurations.

Deployment Configurations

Anthos can be deployed in different configurations. Depending on your deployment, choose one of the sections below to get started with NVIDIA GPUs in Google Cloud’s Anthos:

  1. Anthos Clusters on Bare Metal with NVIDIA DGX Systems and GPU-Accelerated Servers

  2. Anthos Clusters with VMware and NVIDIA GPU-Accelerated Servers

Supported Platforms

GPUs

The following GPUs are supported:

  • NVIDIA A100, T4 and V100

DGX Systems

The following NVIDIA DGX systems are supported:

  • NVIDIA DGX A100

  • NVIDIA DGX-2 and DGX-1 (Volta)

Linux Distributions

The following Linux distributions are supported:

  • Ubuntu 18.04.z, 20.04.z LTS

For more information on the Anthos Ready platforms, visit this page.

Getting Support

For support issues related to using GPUs with Anthos, please open a ticket on the NVIDIA GPU Operator GitHub project. Your feedback is appreciated.

DGX customers can visit the NVIDIA DGX Systems Support Portal.

Anthos Clusters on Bare Metal with NVIDIA DGX Systems and GPU-Accelerated Servers

Anthos on bare metal with NVIDIA DGX A100 systems or NVIDIA GPU-accelerated servers enables a consistent development and operational experience across deployments, while reducing expensive overhead and improving developer productivity. Refer to the Anthos documentation for more information on Anthos cluster environments.

Installation Flow

The basic steps described in this document follow this workflow:

  1. Configure nodes

    • Ensure that each node (including the control plane) meets the prerequisites, including time synchronization, the correct version of Docker, and other conditions.

  2. Configure networking (Optional)

    • Ensure network connectivity between the control plane and the nodes; ideally, the VIPs, the control plane, and the nodes in the cluster are on the same network subnet.

  3. Configure an admin workstation and set up Anthos to create the cluster

    • Set up the cluster using Anthos on bare metal

  4. Setup NVIDIA software on GPU nodes

    • Set up the NVIDIA software components on the GPU nodes to ensure that your cluster can run CUDA applications.

At the end of the installation flow, you should have a user cluster with GPU-enabled nodes that you can use to deploy applications.

Configure Nodes

These steps are required on each node in the cluster (including the control plane).

Time Synchronization

  • Ensure apparmor is stopped:

    $ apt-get install -y apparmor-utils policycoreutils
    
    $ systemctl --now enable apparmor \
       && systemctl stop apparmor
    
  • Synchronize the time on each node:

    • Check the current time

      $ timedatectl
      
                     Local time: Fri 2020-11-20 10:38:06 PST
                 Universal time: Fri 2020-11-20 18:38:06 UTC
                       RTC time: Fri 2020-11-20 18:38:08
                      Time zone: US/Pacific (PST, -0800)
      System clock synchronized: no
                    NTP service: active
                RTC in local TZ: no
      
    • Configure the NTP server in /etc/systemd/timesyncd.conf:

      NTP=time.google.com
      
    • Adjust the system clock:

      $ timedatectl set-local-rtc 0 --adjust-system-clock
      
    • Restart the service

      $ systemctl restart systemd-timesyncd.service
      
    • Verify the synchronization with the time server

      $ timedatectl
      
                     Local time: Fri 2020-11-20 11:03:22 PST
                 Universal time: Fri 2020-11-20 19:03:22 UTC
                       RTC time: Fri 2020-11-20 19:03:22
                      Time zone: US/Pacific (PST, -0800)
      System clock synchronized: yes
                    NTP service: active
                RTC in local TZ: no
      

Test Network Connectivity

  • Ensure that hostname resolution works:

    $ systemctl restart systemd-resolved \
       && ping us.archive.ubuntu.com
    
    ping: us.archive.ubuntu.com: Temporary failure in name resolution
    
  • If name resolution fails, set the nameserver in /etc/resolv.conf:

    $ cat <<EOF > /etc/resolv.conf
    nameserver 8.8.8.8
    EOF
    
  • And re-test ping

    $ ping us.archive.ubuntu.com
    
    PING us.archive.ubuntu.com (91.189.91.38) 56(84) bytes of data.
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=1 ttl=49 time=73.4 ms
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=2 ttl=49 time=73.3 ms
    64 bytes from banjo.canonical.com (91.189.91.38): icmp_seq=3 ttl=49 time=73.4 ms
    

Install Docker

Follow these steps to install Docker. On DGX systems, Docker may already be installed using the docker-ce package. In this case, use docker.io as the base installation package for Docker to ensure a successful cluster setup with Anthos.

  • Stop services using docker:

    $ systemctl stop kubelet \
       && systemctl stop docker \
       && systemctl stop containerd \
       && systemctl stop containerd.io
    
  • Purge any existing Docker and nvidia-docker2 packages:

    $ systemctl stop run-docker-netns-default.mount \
       && systemctl stop docker.haproxy
    
    $ dpkg -r nv-docker-options \
       && dpkg --purge nv-docker-options \
       && dpkg -r nvidia-docker2 \
       && dpkg --purge nvidia-docker2 \
       && dpkg -r docker-ce \
       && dpkg --purge docker-ce \
       && dpkg -r docker-ce-cli \
       && dpkg -r containerd \
       && dpkg --purge containerd \
       && dpkg -r containerd.io \
       && dpkg --purge containerd.io
    
  • Re-install Docker

    $ apt-get update \
       && apt-get install -y apt-transport-https \
          ca-certificates \
          curl \
          software-properties-common \
          inetutils-traceroute \
          conntrack
    
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
    
    $ add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) stable"
    
    $ apt-get update \
       && apt-get install -y docker.io
    
    $ systemctl --now enable docker
    
Install nvidia-docker on GPU Nodes

Note

This step should be performed on the GPU nodes only.

For DGX systems, re-install nvidia-docker2 from the DGX repositories:

$ apt-get install -y nvidia-docker2

Since Kubernetes does not yet support the --gpus option with Docker, the nvidia runtime should be set up as the default container runtime for Docker on the GPU node. This can be done by adding the default-runtime line to the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:

{
   "default-runtime": "nvidia",
   "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
      }
   }
}

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker
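
To confirm that the nvidia runtime is picked up as the default, one quick check is to run a CUDA base image without any GPU-specific flags and verify that nvidia-smi sees the GPUs (the image tag below is only an example; use one that matches your installed driver):

$ docker run --rm nvidia/cuda:11.0-base nvidia-smi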

For non-DGX systems, refer to the NVIDIA Container Toolkit installation guide to set up nvidia-docker2.
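
As a rough sketch of that flow on Ubuntu, the toolkit documentation at the time of writing uses commands like the following; treat the installation guide as authoritative, since repository names and URLs may have changed:

$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID)

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -

$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
   | tee /etc/apt/sources.list.d/nvidia-docker.list

$ apt-get update \
   && apt-get install -y nvidia-docker2

$ systemctl restart docker

The same default-runtime change shown above also applies on these systems.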

Configure Networking (Optional)

Note

The following steps are provided as a reference for configuring the network so that the control plane and the nodes are on the same subnet by using tunnels and DNAT. If the nodes in your cluster are on the same subnet, then you may skip this step.

In the example below:

  • The control plane is at 10.117.29.41

  • The GPU node or admin workstation is at 10.110.20.149

  • The control plane VIP is 10.0.0.8

If the machines are on different subnets from each other or from the control plane VIP, then tunnel routes can be used to establish connectivity.

There are two scenarios to consider:

  1. If the machines are on the same subnet, but the VIP is on a different subnet, then add the correct IP route (using ip route add 10.0.0.8 via <control-plane-ip>) from the GPU node or admin workstation.

  2. If the machines and VIP are on different subnets, then a tunnel is also needed for the above route command to succeed, where <control-plane-ip> is the control plane tunnel address 192.168.210.1.

Control Plane

Setup tunneling:

$ ip tunnel add tun0 mode ipip local 10.117.29.41 remote 10.110.20.149
$ ip addr add 192.168.210.1/24 dev tun0
$ ip link set tun0 up

Update DNAT to support the control plane VIP over the tunnel:

$ iptables -t nat -I PREROUTING  -p udp -d 192.168.210.1  --dport 6081 -j DNAT --to-destination 10.117.29.41

GPU Node or Admin Workstation

Establish connectivity with the control plane:

$ ip tunnel add tun1 mode ipip local 10.110.20.149  remote 10.117.29.41
$ ip addr add 192.168.210.2/24 dev tun1
$ ip link set tun1 up
$ ip route add 10.0.0.8/32 via 192.168.210.1

Setup DNAT:

$ iptables -t nat -I OUTPUT -p udp -d 10.117.29.41  --dport 6081 -j DNAT --to-destination 192.168.210.1
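
To sanity-check the example configuration above, the control plane tunnel address should now be reachable from the GPU node or admin workstation, and the route to the VIP should resolve via the tunnel (the VIP itself typically does not respond until the cluster is up):

$ ping -c 3 192.168.210.1

$ ip route get 10.0.0.8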

Configure Admin Workstation

Configure the admin workstation prior to setting up the cluster.

Download the Google Cloud SDK:

$ wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-314.0.0-linux-x86_64.tar.gz \
   && tar -xf google-cloud-sdk-314.0.0-linux-x86_64.tar.gz
$ google-cloud-sdk/install.sh

Install the Anthos authentication components:

$ gcloud components install anthos-auth
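
Depending on your environment, you may also need to initialize and authenticate the SDK before creating the cluster (a minimal sketch; follow the Anthos documentation for the exact authentication flow required for your project):

$ gcloud init

$ gcloud auth login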

See the Anthos installation overview for detailed instructions on installing Anthos in an on-premise environment and setting up your cluster.

Setup NVIDIA Software on GPU Nodes

Once the Anthos cluster has been set up, you can proceed to deploy the NVIDIA software components on the GPU nodes.

NVIDIA Drivers

Note

DGX systems include the NVIDIA drivers. This step can be skipped.

For complete instructions on setting up NVIDIA drivers, visit the quickstart guide at https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html. The guide covers a number of pre-installation requirements and steps on supported Linux distributions for a successful install of the driver.
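
As an illustration only, the network-repository flow on Ubuntu from that guide looks roughly like the following; package and repository names change over time, so follow the quickstart guide for the authoritative steps:

$ apt-get install -y linux-headers-$(uname -r)

$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID | sed -e 's/\.//g')

$ wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb \
   && dpkg -i cuda-keyring_1.0-1_all.deb

$ apt-get update \
   && apt-get install -y cuda-drivers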

NVIDIA Device Plugin

To use GPUs in Kubernetes, the NVIDIA Device Plugin is required. The NVIDIA Device Plugin is a daemonset that automatically enumerates the GPUs on each node of the cluster and allows pods to be run on GPUs.

The preferred method to deploy the device plugin is as a daemonset using helm.

Add the nvidia-device-plugin helm repository:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update

Deploy the device plugin:

$ helm install --generate-name nvdp/nvidia-device-plugin

For more user-configurable options while deploying the daemonset, refer to the device plugin README.
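
To verify that GPUs are schedulable once the device plugin is running, you can request one from a test pod. The manifest below is a sketch that uses the CUDA vector-add sample image referenced in the device plugin README (the image tag is an example and may change):

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

$ kubectl logs gpu-test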

Node Feature Discovery

To detect the hardware and system configuration, deploy the Node Feature Discovery (NFD) add-on:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.6.0/nfd-master.yaml.template
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/v0.6.0/nfd-worker-daemonset.yaml.template

See the NFD documentation for more information on NFD.
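
A quick way to confirm that NFD is labeling the nodes is to look for feature.node.kubernetes.io labels, for example:

$ kubectl describe nodes | grep feature.node.kubernetes.io | head -n 5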

Anthos Clusters with VMware and NVIDIA GPU-Accelerated Servers

Anthos running on-premise has requirements for supported vSphere versions, as well as network and storage requirements. Please see the Anthos version compatibility matrix for more information: https://cloud.google.com/anthos/gke/docs/on-prem/versioning-and-upgrades#version_compatibility_matrix.

This guide assumes that the user already has an installed Anthos on-premise cluster in a vSphere environment. Please see https://cloud.google.com/anthos/gke/docs/on-prem/how-to/install-overview-basic for detailed instructions for installing Anthos in an on-premise environment.

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters, and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires the configuration of multiple software components such as drivers, container runtimes, and other libraries, which is difficult and prone to errors. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs.

In the VMware vSphere configuration, Anthos uses the NVIDIA GPU Operator to configure GPU nodes in the Kubernetes cluster so that the nodes can be used to schedule CUDA applications. The GPU Operator itself is deployed using Helm. The rest of this section provides the steps for getting started.

Configuring PCIe Passthrough

For the GPU to be accessible to the VM, you must first enable PCI passthrough on the ESXi host. This can be done from the vSphere client. Completing the process requires a reboot of the ESXi host, so the host should be put into maintenance mode and any VMs running on it evacuated to another host. If you only have a single ESXi host, then the VMs will need to be restarted after the reboot.

From the vSphere client, select an ESXi host from the Inventory of VMware vSphere Client. In the Configure tab, click Hardware > PCI Devices. This will show you the passthrough-enabled devices (you will most likely find none at this time).

_images/image01.png

Click CONFIGURE PASSTHROUGH to launch the Edit PCI Device Availability window. Look for the GPU device and select the checkbox next to it (the GPU device will be recognizable as having NVIDIA Corporation in the Vendor Name view). Select the GPU devices (you may have more than one) and click OK.

_images/image02.png

At this point, the GPU(s) will appear as Available (pending). You will need to select Reboot This Host and complete the reboot before proceeding to the next step.

_images/image03.png

It is a VMware best practice to reboot an ESXi host only when it is in maintenance mode and after all the VMs have been migrated to other hosts. If you have only one ESXi host, then you can reboot without migrating the VMs, though shutting them down gracefully first is always a good idea.

_images/image04.png

Once the server has rebooted, make sure to exit maintenance mode (if it was used) or restart the VMs that had to be stopped (when only a single ESXi host is used).

Adding GPUs to a Node

Creating a Node Pool for the GPU Node

Note

This is an optional step.

Node pools are a good way to specify pools of Kubernetes worker nodes that may have different or unique attributes. In this case, we have the opportunity to create a node pool containing workers that have a GPU manually assigned to them. See managing node pools in the Google GKE documentation for more information regarding node pools with Anthos on-premise.

First, edit your user cluster config.yaml file on the admin workstation and add an additional node pool:

- name: user-cluster1-gpu
  cpus: 4
  memoryMB: 8192
  replicas: 1
  labels:
    hardware: gpu

After adding the node pool to your configuration, use the gkectl update command to push the change:

$ gkectl update cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
   --config [USER_CLUSTER_CONFIG]
Reading config with version "v1"
Update summary for cluster user-cluster1-bundledlb:
   Node pool(s) to be created: [user-cluster1-gpu]
Do you want to continue? [Y/n]: Y
Updating cluster "user-cluster1-bundledlb"...
Creating node MachineDeployment(s) in user cluster...  DONE
Done updating the user cluster

Add GPUs to Nodes in vSphere

Select an existing user-cluster node to add a GPU to (if you created a node pool in the previous step, choose a node from that pool). Make sure that this VM is on the host with the GPU (if you have vMotion enabled, this could be as simple as right-clicking on the VM and selecting Migrate).

To configure a PCI device on a virtual machine, from the Inventory in vSphere Client, right-click the virtual machine and select Power > Power Off.

_images/image05.png

After the VM is powered off, right-click the virtual machine and click Edit Settings.

_images/image06.png

Within the Edit Settings window, click ADD NEW DEVICE.

_images/image07.png

Choose PCI Device from the dropdown.

_images/image08.png

You may need to select the GPU, or, if it is the only device available, it may be automatically selected for you. If you don’t see the GPU, it is possible that your VM is not currently on the ESXi host with the passthrough device configured.

_images/image09.png

Expand the Memory section and make sure to select the option for Reserve all Guest Memory (All locked).

_images/image10.png

Click OK.

Before the VM can be started, the VM/Host Rule for VM anti-affinity must be deleted. (Note that this step may not be necessary if your cluster’s config.yaml contains antiAffinityGroups.enabled: False.) From the vSphere Inventory list, click on the cluster, then the Configure tab, and then under Configuration select VM/Host Rules. Select the rule containing your node and delete it.

_images/image11.png
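
As noted above, deleting the rule may be unnecessary when anti-affinity groups are disabled in the user cluster configuration; that setting looks roughly like the following in config.yaml (consult the Anthos on-prem documentation for the exact schema of your config version):

antiAffinityGroups:
  enabled: false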

Now you can power on the VM: right-click the VM and select Power > Power On.

_images/image12.png

If vSphere presents you with Power On Recommendations then select OK.

_images/image13.png

The following steps should be performed from your admin workstation or another Linux system that can use kubectl to work with the cluster.
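
The GPU Operator chart is published in the NVIDIA Helm repository. If that repository has not been added on this system yet, add it first (this assumes Helm v3 is already installed):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update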

Install the NVIDIA GPU Operator:

$ helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator

Refer to Installing the NVIDIA GPU Operator in the NVIDIA GPU Operator documentation for installation options.
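
Before moving on, one way to confirm that the operator components have come up is to check the pods in the namespace used above:

$ kubectl get pods -n gpu-operator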

Running GPU Applications

Jupyter Notebooks

This section of the guide walks through how to run a sample Jupyter notebook on the Kubernetes cluster.

  1. Create the pod and service for the notebook:

    $ LOADBALANCERIP=<ip address to be used to expose the service>
    
    $ cat << EOF | kubectl create -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      type: LoadBalancer
      loadBalancerIP: $LOADBALANCERIP
      ports:
      - port: 80
        name: http
        targetPort: 8888
        nodePort: 30001
      selector:
        app: tf-notebook
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      securityContext:
        fsGroup: 0
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
          name: notebook
    EOF
    
  2. View the logs of the tf-notebook pod to obtain the token:

    $ kubectl logs tf-notebook
    
    [I 19:07:43.061 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 19:07:43.423 NotebookApp] Serving notebooks from local directory: /tf
    [I 19:07:43.423 NotebookApp] The Jupyter Notebook is running at:
    [I 19:07:43.423 NotebookApp] http://tf-notebook:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:07:43.423 NotebookApp]  or http://127.0.0.1:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:07:43.423 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 19:07:43.429 NotebookApp]
    
       To access the notebook, open this file in a browser:
          file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
       Or copy and paste one of these URLs:
          http://tf-notebook:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
       or http://127.0.0.1:8888/?token=fc5d8b9d6f29d5ddad62e8c731f83fc8e90a2d817588d772
    [I 19:08:24.180 NotebookApp] 302 GET / (172.16.20.30) 0.61ms
    [I 19:08:24.182 NotebookApp] 302 GET /tree? (172.16.20.30) 0.57ms
    
  3. From a web browser, navigate to http://<LOADBALANCERIP> and enter the token where prompted to log in. Depending on your environment, you may not have web browser access to the exposed service; in that case, you may be able to use SSH port forwarding/tunneling to reach it.

    _images/image14.png
  4. Once logged in, click on the tensorflow-tutorials folder and then on the first file, classification.ipynb:

    _images/image15.png
  5. This will launch a new tab with the Notebook loaded. You can now run through the Notebook by clicking on the Run button. The notebook will step through each section and execute the code as you go. Continue pressing Run until you reach the end of the notebook and observe the execution of the classification program.

    _images/image16.png
  6. Once the notebook is complete, you can check the logs of the tf-notebook pod to confirm that it was using the GPU:

    =========snip===============
    [I 19:17:58.116 NotebookApp] Saving file at /tensorflow-tutorials/classification.ipynb
    2020-05-21 19:21:01.422482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
    2020-05-21 19:21:01.436767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.437469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
    pciBusID: 0000:13:00.0 name: Tesla P4 computeCapability: 6.1
    coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
    2020-05-21 19:21:01.438477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.462370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    2020-05-21 19:21:01.475269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
    2020-05-21 19:21:01.478104: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
    2020-05-21 19:21:01.501057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
    2020-05-21 19:21:01.503901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
    2020-05-21 19:21:01.544763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2020-05-21 19:21:01.545022: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.545746: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.546356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
    2020-05-21 19:21:01.546705: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2020-05-21 19:21:01.558283: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2194840000 Hz
    2020-05-21 19:21:01.558919: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f6f2c000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-05-21 19:21:01.558982: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2020-05-21 19:21:01.645786: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.646387: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53ab350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    2020-05-21 19:21:01.646430: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P4, Compute Capability 6.1
    2020-05-21 19:21:01.647005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.647444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
    pciBusID: 0000:13:00.0 name: Tesla P4 computeCapability: 6.1
    coreClock: 1.1135GHz coreCount: 20 deviceMemorySize: 7.43GiB deviceMemoryBandwidth: 178.99GiB/s
    2020-05-21 19:21:01.647523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.647570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    2020-05-21 19:21:01.647611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
    2020-05-21 19:21:01.647647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
    2020-05-21 19:21:01.647683: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
    2020-05-21 19:21:01.647722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
    2020-05-21 19:21:01.647758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2020-05-21 19:21:01.647847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.648311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.648720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
    2020-05-21 19:21:01.649158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
    2020-05-21 19:21:01.650302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-21 19:21:01.650362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
    2020-05-21 19:21:01.650392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
    2020-05-21 19:21:01.650860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.651341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-05-21 19:21:01.651773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7048 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:13:00.0, compute capability: 6.1)
    2020-05-21 19:21:03.601093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    [I 19:21:58.132 NotebookApp] Saving file at /tensorflow-tutorials/classification.ipynb
    

Uninstall and Cleanup

You can remove the tf-notebook and service with the following commands:

$ kubectl delete pod tf-notebook
$ kubectl delete svc tf-notebook

You can remove the GPU operator with the command:

$ helm uninstall $(helm list | grep gpu-operator | awk '{print $1}')
release "gpu-operator-1590086955" uninstalled

You can now stop the VM, remove the PCI device, remove the memory reservation, and restart the VM.

You do not need to remove the PCI passthrough device from the host.

Known Issues

This section outlines some known issues with using Google Cloud’s Anthos with NVIDIA GPUs.

  1. Attaching a GPU to an Anthos on-prem worker node requires manually editing the VM from vSphere. These changes will not survive an Anthos on-prem upgrade process. When the node with the GPU is deleted as part of the upgrade process, the new VM replacing it will not have the GPU added; the GPU must be added back to the new VM manually. While the NVIDIA GPU seems to be able to handle that event gracefully, the workload backed by the GPU may need to be initiated again manually.

  2. Attaching a GPU to a VM means that the VM can no longer be migrated to another ESXi host. The VM will essentially be pinned to the ESXi host that hosts the GPU. vMotion and VMware HA features cannot be used.

  3. VMs that use a PCI Passthrough device require that their full memory allocation be locked. This will cause a Virtual machine memory usage alarm on the VM which can safely be ignored.

    _images/image17.png