NVIDIA GPU Operator with Amazon EKS

Approaches for Working with Amazon EKS

You can run workloads that use NVIDIA GPUs in Amazon EKS in at least two ways.

Default EKS configuration without the GPU Operator

By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types that support NVIDIA GPUs.

Using the default configuration has the following limitations:

  • The pre-installed NVIDIA GPU driver version and NVIDIA container runtime version lag behind the NVIDIA release schedule.

  • You must deploy the NVIDIA device plugin yourself and you assume responsibility for upgrading it.

If these limitations are acceptable to you, refer to Amazon EKS optimized Amazon Linux AMIs in the Amazon EKS documentation for information about configuring your cluster. You do not need to install the NVIDIA GPU Operator.
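
If you choose this approach, the following commands are a minimal sketch of deploying the device plugin with its Helm chart. The repository URL and chart name come from the NVIDIA/k8s-device-plugin project; check that project for the current release before you install.

    $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    $ helm repo update
    # Install the device plugin as a DaemonSet in its own namespace.
    $ helm install nvdp nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin \
        --create-namespace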

EKS Node Group with the GPU Operator

To overcome the limitations of the first approach, you can create a node group for your cluster. Configure the node group with instance types that have NVIDIA GPUs and use an AMI with an operating system that the GPU Operator supports. The Operator does not support a cluster that mixes nodes running Amazon Linux 2 with nodes running a supported operating system.

In this case, the Operator manages the lifecycle of all the operands, including the NVIDIA GPU driver containers. This approach enables you to run the most recent NVIDIA GPU drivers and use the Operator to manage upgrades of the driver and other software components such as the NVIDIA device plugin, NVIDIA Container Toolkit, and NVIDIA MIG Manager.

This approach provides the most up-to-date software, and the Operator reduces the administrative overhead.

EKS Node Groups in Brief and Client Applications

When you configure an Amazon EKS node group, you can configure self-managed nodes or managed node groups.

Amazon EKS supports many clients for creating a node group.

For self-managed nodes, you can use the eksctl CLI or the AWS Management Console. Refer to the Amazon EKS documentation for concepts and procedures.

For managed node groups, you can use the AWS Management Console; refer to the Amazon EKS documentation for concepts and procedures. The documentation also describes how to use the eksctl CLI, but for managed node groups the CLI supports only Amazon Linux 2, and the Operator does not support that operating system.

Terraform supports creating self-managed and managed node groups. Refer to AWS EKS Terraform module in the Terraform Registry for more information.

About Using the Operator with Amazon EKS

To use the NVIDIA GPU Operator with Amazon Elastic Kubernetes Service (EKS) without the preceding limitations, you perform the following high-level actions:

  • Create a self-managed or managed node group with instance types that have NVIDIA GPUs.

    Refer to the following resources in the Amazon EC2 documentation to help you choose an instance type that meets your needs:

    • Table of accelerated computing instance types for information about GPU model and count, RAM, and storage.

    • Table of maximum network interfaces for accelerated computing instance types. Make sure the instance type supports enough IP addresses for your workload. For example, the g4dn.xlarge instance type supports 29 IP addresses for pods on the node. A sketch of querying these limits with the AWS CLI follows this list.

  • Use an Amazon EKS optimized Amazon Machine Image (AMI) with Ubuntu 20.04 or 22.04 on the nodes in the node group.

    AMI support is specific to an AWS Region and Kubernetes version. See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as ami-00687acd80b7a620a.

  • Use your preferred client application to create the node group.
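
As a minimal sketch of the instance type check, the following AWS CLI query reads the two values that determine the pod IP address limit; the field names follow the describe-instance-types response, and g4dn.xlarge is the instance type from the example above:

    $ aws ec2 describe-instance-types --instance-types g4dn.xlarge \
        --query "InstanceTypes[0].NetworkInfo.{MaxENIs:MaximumNetworkInterfaces,IPv4PerENI:Ipv4AddressesPerInterface}"

With the default AWS VPC CNI settings, the maximum number of pods is ENIs × (IPv4 addresses per ENI − 1) + 2. For g4dn.xlarge, 3 network interfaces with 10 IPv4 addresses each yield 3 × (10 − 1) + 2 = 29, the figure noted in the list above.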

Example: Create a Self-Managed Node Group with eksctl

Prerequisites

  • You installed the eksctl CLI and configured your AWS credentials with permissions to create an Amazon EKS cluster.

  • You have an SSH public key, such as ~/.ssh/id_rsa.pub, for SSH access to the nodes.

Procedure

The following steps show how to create an Amazon EKS cluster with the eksctl CLI. The steps create a self-managed node group that uses an Amazon EKS optimized AMI.

  1. Create a file, such as cluster-config.yaml, with contents like the following example:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: demo-cluster
      region: us-west-2
      version: "1.25"
    nodeGroups:
      - name: demo-gpu-workers
        instanceType: g4dn.xlarge
        # The AMI ID is specific to the AWS Region and Kubernetes version.
        ami: ami-0770ab88ec35aa875
        amiFamily: Ubuntu2004
        minSize: 1
        desiredCapacity: 3
        maxSize: 3
        # Larger than the 20 GB default; see the tip that follows.
        volumeSize: 100
        # Sourcing the helper script sets the environment variables
        # used in the bootstrap command.
        overrideBootstrapCommand: |
          #!/bin/bash
          source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
          /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"
        ssh:
          allow: true
          publicKeyPath: ~/.ssh/id_rsa.pub
    

    Replace the values for the cluster name, Kubernetes version, and so on. To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script.

    Tip

    The default volume size for each node is 20 GB. Container images for AI/ML frameworks are often very large, so the sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers.

  2. Create the Amazon EKS cluster with the node group:

    $ eksctl create cluster -f cluster-config.yaml
    

    Creating the cluster takes several minutes.

    Example Output

    2022-08-19 17:51:04 [i]  eksctl version 0.105.0
    2022-08-19 17:51:04 [i]  using region us-west-2
    2022-08-19 17:51:04 [i]  setting availability zones to [us-west-2d us-west-2c us-west-2a]
    2022-08-19 17:51:04 [i]  subnets for us-west-2d - public:192.168.0.0/19 private:192.168.96.0/19
    ...
    [✓]  EKS cluster "demo-cluster" in "us-west-2" region is ready
    
  3. Optional: View the cluster name:

    $ eksctl get cluster
    

    Example Output

    NAME          REGION     EKSCTL CREATED
    demo-cluster  us-west-2  True
    

Next Steps

  • By default, the eksctl CLI adds the Kubernetes configuration information to your ~/.kube/config file. You can run kubectl get nodes -o wide to view the nodes in the Amazon EKS cluster.

  • You are ready to install the NVIDIA GPU Operator with Helm.

    If you specified a Kubernetes version less than 1.25, then specify --set psp.enabled=true when you run the helm install command.
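
    For reference, the following commands are a minimal sketch of the Helm installation, using the repository URL and chart name that the GPU Operator documentation publishes:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
    # For Kubernetes versions less than 1.25, append --set psp.enabled=true.
    $ helm install --wait gpu-operator \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator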