NVIDIA GPU Operator with Amazon EKS#

Approaches for Working with Amazon EKS#

You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways.

Default EKS configuration without the GPU Operator#

By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types that support NVIDIA GPUs.

Using the default configuration has the following limitations:

The pre-installed NVIDIA GPU driver version and NVIDIA container runtime version lag the release schedule from NVIDIA.
You must deploy the NVIDIA device plugin and you assume responsibility for upgrading the plugin.

If these limitations are acceptable to you, refer to Amazon EKS optimized Amazon Linux AMIs in the Amazon EKS documentation for information about configuring your cluster. You do not need to install the NVIDIA GPU Operator.

EKS Node Group with the GPU Operator#

To overcome the limitations with the first approach, you can create a node group for your cluster. Configure the node group with instance types that have NVIDIA GPUs and use an AMI with an operating system that the GPU Operator supports. The Operator does not support a mix of some nodes running Amazon Linux 2 and others running a supported operating system in the same cluster.

In this case, the Operator manages the lifecycle of all the operands, including the NVIDIA GPU driver containers. This approach enables you to run the most recent NVIDIA GPU drivers and use the Operator to manage upgrades of the driver and other software components such as the NVIDIA device plugin, NVIDIA Container Toolkit, and NVIDIA MIG Manager.

This approach provides the most up-to-date software, and the Operator reduces the administrative overhead.

EKS Node Groups in Brief and Client Applications#

When you configure an Amazon EKS node group, you can configure self-managed nodes or managed node groups.

Amazon EKS supports many clients for creating a node group.

For self-managed nodes, you can use the eksctl CLI or the AWS Management Console. Refer to the Amazon EKS documentation for concepts and procedures.

For managed node groups, you can use the AWS Management Console. The Amazon EKS documentation describes how to use the eksctl CLI, but for managed node groups the CLI supports only Amazon Linux 2, and the Operator does not support that operating system. Refer to the Amazon EKS documentation for concepts and procedures.

Terraform supports creating self-managed and managed node groups. Refer to AWS EKS Terraform module in the Terraform Registry for more information.

About Using the Operator with Amazon EKS#

To use the NVIDIA GPU Operator with Amazon Elastic Kubernetes Service (EKS) without any limitations, you perform the following high-level actions:

1. Create a self-managed or managed node group with instance types that have NVIDIA GPUs.

   Refer to the following resources in the Amazon EC2 documentation to help you choose the instance type to meet your needs:
   - Table of accelerated computing instance types for information about GPU model and count, RAM, and storage.
   - Maximum IP addresses per network interface for accelerated computing instance types. Make sure the instance type supports enough IP addresses for your workload. For example, the g4dn.xlarge instance type supports 29 IP addresses for pods on the node.

2. Use an Amazon EKS optimized Amazon Machine Image (AMI) with Ubuntu 20.04 or 22.04 on the nodes in the node group.

   AMI support is specific to an AWS region and Kubernetes version. See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as ami-00687acd80b7a620a.

3. Use your preferred client application to create the node group.
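
The figure of 29 IP addresses for g4dn.xlarge matches the pod-capacity formula commonly used with the Amazon VPC CNI plugin in its default mode: network interfaces x (IPv4 addresses per interface - 1) + 2. The following is a minimal sketch that assumes that formula and default CNI settings (no prefix delegation). The function name max_pods is hypothetical, and the interface and address counts for an instance type come from the Amazon EC2 documentation.

```shell
# Hypothetical helper: estimate pod capacity for an instance type.
# Assumes the default Amazon VPC CNI mode (no prefix delegation):
#   max pods = ENIs * (IPv4 addresses per ENI - 1) + 2
max_pods() (
  enis="$1"
  ips_per_eni="$2"
  echo $(( enis * (ips_per_eni - 1) + 2 ))
)

# g4dn.xlarge: 3 network interfaces, 10 IPv4 addresses per interface
max_pods 3 10
```

You can substitute the counts for another instance type to check whether it has enough pod IP addresses for your workload.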

Example: Create a Self-Managed Node Group with eksctl#

Prerequisites#

You have access to the AWS Management Console or you installed and configured the AWS CLI. Refer to Installing or updating to the latest version of the AWS CLI and Configuring the AWS CLI in the AWS CLI documentation.
You installed the eksctl CLI if you prefer it as your client application. The CLI is available from https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html#eksctl-install-update.
You have the AMI value from https://cloud-images.ubuntu.com/aws-eks/.
You have the EC2 instance type to use for your nodes.

Procedure#

The following steps show how to create an Amazon EKS cluster with the eksctl CLI. The steps create a self-managed node group that uses an Amazon EKS optimized AMI.

Create a file, such as cluster-config.yaml, with contents like the following example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster
  region: us-west-2
  version: "1.25"
nodeGroups:
  - name: demo-gpu-workers
    instanceType: g4dn.xlarge
    ami: ami-0770ab88ec35aa875
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 3
    maxSize: 3
    volumeSize: 100
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub

Replace the values for the cluster name, Kubernetes version, and so on. To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script.

Tip

The default volume size for each node is 20 GB. Containers with frameworks for AI/ML workloads are often very large. The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers.
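
If you prefer to generate the file from shell variables instead of editing it by hand, the following sketch shows one way to do so. The variable names and the render_cluster_config function are hypothetical, the values are the ones from the sample above, and the ssh section is omitted for brevity. The ${CLUSTER_NAME} and ${NODE_LABELS} references are escaped so that they remain literal in the output; they must be resolved on the node by the bootstrap helper script, not by your local shell.

```shell
# Hypothetical variable names; replace the values for your environment.
CLUSTER="demo-cluster"
REGION="us-west-2"
K8S_VERSION="1.25"
INSTANCE_TYPE="g4dn.xlarge"
AMI_ID="ami-0770ab88ec35aa875"   # sample value; look up the AMI for your region and version

# Print an eksctl ClusterConfig to standard output.
render_cluster_config() {
  cat <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER}
  region: ${REGION}
  version: "${K8S_VERSION}"
nodeGroups:
  - name: demo-gpu-workers
    instanceType: ${INSTANCE_TYPE}
    ami: ${AMI_ID}
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 3
    maxSize: 3
    volumeSize: 100
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh \${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=\${NODE_LABELS}"
EOF
}

render_cluster_config
```

Redirect the output to a file, for example render_cluster_config > cluster-config.yaml, before you create the cluster.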

Create the Amazon EKS cluster with the node group:

$ eksctl create cluster -f cluster-config.yaml

Creating the cluster requires several minutes.

Example Output

2022-08-19 17:51:04 [i]  eksctl version 0.105.0
2022-08-19 17:51:04 [i]  using region us-west-2
2022-08-19 17:51:04 [i]  setting availability zones to [us-west-2d us-west-2c us-west-2a]
2022-08-19 17:51:04 [i]  subnets for us-west-2d - public:192.168.0.0/19 private:192.168.96.0/19
...
[✓]  EKS cluster "demo-cluster" in "us-west-2" region is ready

Optional: View the cluster name:

$ eksctl get cluster

Example Output

NAME          REGION     EKSCTL CREATED
demo-cluster  us-west-2  True

Next Steps#

By default, the eksctl CLI adds the Kubernetes configuration information to your ~/.kube/config file. You can run kubectl get nodes -o wide to view the nodes in the Amazon EKS cluster.
You are ready to install the NVIDIA GPU Operator with Helm.

If you specified a Kubernetes version earlier than 1.25, then specify --set psp.enabled=true when you run the helm install command.
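
The version rule above can be sketched as a small shell helper. The function name gpu_operator_install_cmd is hypothetical. The helm arguments follow the commonly documented GPU Operator installation and assume that you already added the NVIDIA Helm repository; verify them against the installation guide for your Operator release. The helper only prints the command so that you can review it before running it.

```shell
# Assumed prerequisite, run once:
#   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Hypothetical helper: print the helm install command for a Kubernetes version.
# Adds --set psp.enabled=true only for versions earlier than 1.25.
gpu_operator_install_cmd() (
  k8s_version="$1"
  major="${k8s_version%%.*}"   # "1.24.3" -> "1"
  minor="${k8s_version#*.}"    # "1.24.3" -> "24.3"
  minor="${minor%%.*}"         # "24.3"   -> "24"
  cmd="helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator"
  if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 25 ]; }; then
    cmd="$cmd --set psp.enabled=true"
  fi
  echo "$cmd"
)

# Print the command for the sample cluster, which uses Kubernetes 1.25.
gpu_operator_install_cmd 1.25
```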
