NVIDIA GPU Operator with Amazon EKS
Approaches for Working with Amazon EKS
You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways.
Default EKS configuration without the GPU Operator
By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types that support NVIDIA GPUs.
Using the default configuration has the following limitations:
The pre-installed NVIDIA GPU driver version and NVIDIA container runtime version lag the release schedule from NVIDIA.
You must deploy the NVIDIA device plugin and you assume responsibility for upgrading the plugin.
If these limitations are acceptable to you, refer to Amazon EKS optimized Amazon Linux AMIs in the Amazon EKS documentation for information about configuring your cluster. You do not need to install the NVIDIA GPU Operator.
EKS Node Group with the GPU Operator
To overcome the limitations with the first approach, you can create a node group for your cluster. Configure the node group with instance types that have NVIDIA GPUs and use an AMI with an operating system that the GPU Operator supports. The Operator does not support a mix of some nodes running Amazon Linux 2 and others running a supported operating system in the same cluster.
In this case, the Operator manages the lifecycle of all the operands, including the NVIDIA GPU driver containers. This approach enables you to run the most recent NVIDIA GPU drivers and use the Operator to manage upgrades of the driver and other software components such as the NVIDIA device plugin, NVIDIA Container Toolkit, and NVIDIA MIG Manager.
This approach provides the most up-to-date software and the Operator reduces the administrative overhead.
EKS Node Groups in Brief and Client Applications
When you configure an Amazon EKS node group, you can configure self-managed nodes or managed node groups.
Amazon EKS supports many clients for creating a node group.
For self-managed nodes, you can use the eksctl CLI or the AWS Management Console.
Refer to the preceding URL for concepts and procedures.
For managed node groups, you can use the AWS Management Console.
The Amazon EKS documentation describes how to use the eksctl CLI, but the CLI does not support operating systems other than Amazon Linux 2, and the Operator does not support that operating system.
Refer to the preceding URL for concepts and procedures.
Terraform supports creating self-managed and managed node groups. Refer to AWS EKS Terraform module in the Terraform Registry for more information.
About Using the Operator with Amazon EKS
To use the NVIDIA GPU Operator with Amazon Elastic Kubernetes Service (EKS) without any limitations, you perform the following high-level actions:
Create a self-managed or managed node group with instance types that have NVIDIA GPUs.
Refer to the following resources in the Amazon EC2 documentation to help you choose the instance type to meet your needs:
Table of accelerated computing instance types for information about GPU model and count, RAM, and storage.
Table of maximum network interfaces for accelerated computing instance types. Make sure the instance type supports enough IP addresses for your workload. For example, the g4dn.xlarge instance type supports 29 IP addresses for pods on the node.
Use an Amazon EKS optimized Amazon Machine Image (AMI) with Ubuntu 20.04 or 22.04 on the nodes in the node group.
AMI support is specific to an AWS region and Kubernetes version. See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as ami-00687acd80b7a620a.
Use your preferred client application to create the node group.
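The figure of 29 pod IP addresses for g4dn.xlarge can be reproduced with a short calculation. The sketch below assumes the formula commonly used for the Amazon VPC CNI plugin in its default mode, and it assumes that g4dn.xlarge supports 3 network interfaces with 10 IPv4 addresses each; confirm both values in the Amazon EC2 table for your instance type. The function name max_pods is hypothetical.

```shell
#!/bin/sh
# Estimate how many pod IP addresses an instance type offers.
# Assumed formula: interfaces * (addresses per interface - 1) + 2
# Arguments: <network interfaces> <IPv4 addresses per interface>
max_pods() {
    enis="$1"
    ips_per_eni="$2"
    # One address per interface is reserved as that interface's primary
    # address. The trailing 2 is part of the assumed formula and is usually
    # attributed to pods that run on the host network.
    echo $(( enis * (ips_per_eni - 1) + 2 ))
}

# g4dn.xlarge (assumed limits): 3 interfaces, 10 IPv4 addresses each
max_pods 3 10
```

Substitute the limits for another instance type to check whether it offers enough addresses for your workload before you create the node group.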
Example: Create a Self-Managed Node Group with eksctl
Prerequisites
You have access to the AWS Management Console or you installed and configured the AWS CLI. Refer to Installing or updating to the latest version of the AWS CLI and Configuring the AWS CLI in the AWS CLI documentation.
You installed the eksctl CLI if you prefer it as your client application. The CLI is available from https://eksctl.io/introduction/#installation.
You have the AMI value from https://cloud-images.ubuntu.com/aws-eks/.
You have the EC2 instance type to use for your nodes.
Procedure
The following steps show how to create an Amazon EKS cluster with the eksctl CLI.
The steps create a self-managed node group that uses an Amazon EKS optimized AMI.
Create a file, such as cluster-config.yaml, with contents like the following example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster
  region: us-west-2
  version: "1.25"
nodeGroups:
  - name: demo-gpu-workers
    instanceType: g4dn.xlarge
    ami: ami-0770ab88ec35aa875
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 3
    maxSize: 3
    volumeSize: 100
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub

Replace the values for the cluster name, Kubernetes version, and so on. To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script.
Tip
The default volume size for each node is 20 GB. Containers with frameworks for AI/ML workloads are often very large. The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers.
Create the Amazon EKS cluster with the node group:
$ eksctl create cluster -f cluster-config.yaml
Creating the cluster requires several minutes.
Example Output
2022-08-19 17:51:04 [i] eksctl version 0.105.0
2022-08-19 17:51:04 [i] using region us-west-2
2022-08-19 17:51:04 [i] setting availability zones to [us-west-2d us-west-2c us-west-2a]
2022-08-19 17:51:04 [i] subnets for us-west-2d - public:192.168.0.0/19 private:192.168.96.0/19
...
[✓] EKS cluster "demo-cluster" in "us-west-2" region is ready
Optional: View the cluster name:
$ eksctl get cluster
Example Output
NAME            REGION       EKSCTL CREATED
demo-cluster    us-west-2    True
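Step 1 of the procedure asks you to replace the cluster name, Kubernetes version, and other values in the configuration file. One possible way to do that, shown here as a sketch and not as part of the documented procedure, is to generate the file from shell variables. The EKS_* variable names are hypothetical. The point to notice is the escaping: ${CLUSTER_NAME} and ${NODE_LABELS} must reach the file unexpanded because the bootstrap helper script resolves them on the node.

```shell
#!/bin/sh
# Hypothetical variable names; set them to your own values.
EKS_CLUSTER_NAME="demo-cluster"
EKS_REGION="us-west-2"
EKS_K8S_VERSION="1.25"
EKS_AMI_ID="ami-0770ab88ec35aa875"   # AMI IDs are region- and version-specific
EKS_INSTANCE_TYPE="g4dn.xlarge"

# Unquoted heredoc: the EKS_* variables expand when this script runs.
# The backslash-escaped variables are written literally so that they are
# resolved later, on the node, by the bootstrap helper script.
cat > cluster-config.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${EKS_CLUSTER_NAME}
  region: ${EKS_REGION}
  version: "${EKS_K8S_VERSION}"
nodeGroups:
  - name: demo-gpu-workers
    instanceType: ${EKS_INSTANCE_TYPE}
    ami: ${EKS_AMI_ID}
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 3
    maxSize: 3
    volumeSize: 100
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh \${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=\${NODE_LABELS}"
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub
EOF

echo "wrote cluster-config.yaml for ${EKS_CLUSTER_NAME}"
```

The generated file has the same contents as the sample in step 1, so you can pass it to eksctl create cluster -f cluster-config.yaml unchanged.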
Next Steps
By default, the eksctl CLI adds the Kubernetes configuration information to your ~/.kube/config file. You can run kubectl get nodes -o wide to view the nodes in the Amazon EKS cluster.
You are ready to install the NVIDIA GPU Operator with Helm.
If you specified a Kubernetes version less than 1.25, then specify --set psp.enabled=true when you run the helm install command.
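The version rule above can be expressed as a small helper that assembles the helm install command. This is a sketch under assumptions: the chart reference nvidia/gpu-operator, the gpu-operator namespace, and the remaining flags follow NVIDIA's general Helm installation instructions as commonly published, so verify them against the installation guide for your Operator release. The function name gpu_operator_install_cmd is hypothetical, and the function only prints the command; it does not contact a cluster.

```shell
#!/bin/sh
# Assumed one-time setup before installing (run these yourself):
#   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
#   helm repo update

# Print the helm install command for a Kubernetes version such as "1.24".
gpu_operator_install_cmd() {
    k8s_version="$1"
    major="${k8s_version%%.*}"   # text before the first dot
    minor="${k8s_version#*.}"    # text after the first dot
    minor="${minor%%.*}"         # drop a patch component such as ".3"

    cmd="helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator"

    # Kubernetes versions below 1.25 get the pod security policy setting.
    if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 25 ]; }; then
        cmd="$cmd --set psp.enabled=true"
    fi
    echo "$cmd"
}

gpu_operator_install_cmd 1.24
gpu_operator_install_cmd 1.25
```

The first call prints a command that ends with --set psp.enabled=true, and the second call prints the command without that setting. Review the printed command before you run it against your cluster.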