Setting Up Amazon EKS

This section describes how to set up an NVIDIA AI Enterprise-supported Amazon EKS cluster, and the associated Amazon Web Services, that the Cloud Native Service Add-On Pack is deployed on and integrates with.

Prerequisites

Note

The following steps must be performed using an AWS account with admin privileges.

  1. First, using the hardware specifications from the AI Workflows documentation, provision an EKS instance meeting the minimum cluster version below, following the instructions in the NVIDIA AI Enterprise Cloud Guide.

    • Minimum Cluster Version: 1.23

    • Minimum Cloud Native Service Add-On Pack Version: 0.4.0

  2. Once your cluster has been created, ensure you can access the cluster via the kubeconfig and eksctl on your system.

  3. Retrieve the cluster name using the following command:

    $ aws eks list-clusters
    

    You should see an output similar to the following:

    "clusters": [
        "<cluster-name>"
    ]
    

    Make a note of this cluster name, as you will reference this throughout the rest of the steps.

  4. Create an NGC API Key if you have not done so already, and ensure you can access the Enterprise Catalog.

  5. Once you have created an NGC API Key, install and configure the NGC CLI, if you have not done so already, using the instructions in the NGC CLI documentation.
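
The minimum cluster version from step 1 can be checked from the command line once the cluster exists. The following is a minimal sketch, not part of the original guide: version_at_least is a hypothetical helper name, and it assumes a version string in "major.minor" form, such as the value of cluster.version reported by aws eks describe-cluster.

```shell
# Hypothetical helper (not part of the NVIDIA or AWS tooling).
# Succeeds when version $1 ("major.minor") is at least version $2.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    if [ "$have_major" -gt "$want_major" ]; then return 0; fi
    if [ "$have_major" -lt "$want_major" ]; then return 1; fi
    [ "$have_minor" -ge "$want_minor" ]
}

# Example usage against a live cluster (requires configured AWS credentials):
#   version=$(aws eks describe-cluster --name <cluster-name> \
#       --query 'cluster.version' --output text)
#   version_at_least "$version" 1.23 || echo "cluster version $version is too old"
```

Only the major and minor components are compared, which matches how the minimum cluster version is stated above.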

EKS Configuration

Follow the steps below to create and configure the components that the Cloud Native Service Add-On Pack is deployed on and integrates with.

IAM OIDC Provider

An IAM OIDC provider is required to manage the service accounts required for configuration of the EKS cluster and Amazon-managed services.

To create an IAM OIDC provider, run the following command with your cluster name:

eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve

You should see an output similar to the following:

2023-04-04 13:55:08 [ℹ]  will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 [ℹ]  created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
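
One way to confirm the association is to compare the ID at the end of the cluster's OIDC issuer URL with the providers registered in IAM. The sketch below is illustrative only: oidc_id_from_issuer is a hypothetical helper, and it assumes the issuer URL ends in the provider ID, as EKS issuer URLs of the form https://oidc.eks.<region>.amazonaws.com/id/<id> do.

```shell
# Hypothetical helper: print the trailing ID of an OIDC issuer URL,
# e.g. https://oidc.eks.<region>.amazonaws.com/id/<id>  ->  <id>
oidc_id_from_issuer() {
    printf '%s\n' "${1##*/}"
}

# Example usage (requires configured AWS credentials):
#   issuer=$(aws eks describe-cluster --name <cluster-name> \
#       --query 'cluster.identity.oidc.issuer' --output text)
#   aws iam list-open-id-connect-providers | grep "$(oidc_id_from_issuer "$issuer")"
```

If the grep in the usage example prints a provider ARN, the cluster's OIDC provider is registered with IAM.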

Storage Configuration

A storage class must be available on the EKS cluster for the Cloud Native Service Add-On Pack to use. This guide uses the gp2 storage class. For detailed information on how to enable the gp2 storage class, refer to the AWS documentation.

Once the gp2 storage class has been created, the following configuration is required to create a service account that can create Persistent Volumes using the gp2 storage class on the cluster.

  1. First, retrieve your AWS account ID using the command below. The commands in this guide refer to it as the cluster ID (<cluster-id>), and you will use it in later steps.

    $ aws sts get-caller-identity --query 'Account' --output text
    

    You should see an output similar to the following:

    298485221437
    
  2. Next, create a service account with the EBS CSI Driver role for the EBS CSI driver add-on using the command below. To make sure the role name is unique and to avoid conflicts with existing roles, append the timestamp to the role name.

    $ eksctl create iamserviceaccount --cluster <cluster-name> --region <region> --name ebs-csi-controller-sa --namespace kube-system --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
    

    You should see an output similar to the following:

    2023-04-04 15:18:05 [ℹ]  1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
    2023-04-04 15:18:05 [!]  serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
    2023-04-04 15:18:05 [ℹ]  1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
    2023-04-04 15:18:05 [ℹ]  building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 [ℹ]  deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:35 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    

    Confirm that the service account was created with the following command:

    $ eksctl get iamserviceaccount --cluster <cluster-name>
    

    You should see an output similar to the following:

    NAMESPACE    NAME                   ROLE ARN
    kube-system  ebs-csi-controller-sa  arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
    

    Confirm that the IAM role also appears in the AWS Console on the IAM > Roles page.

  3. Next, create the add-on for the service account role for your cluster using the command below:

    $ eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn arn:aws:iam::<cluster-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
    

    You should see an output similar to the following:

    2023-04-04 15:29:49 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:29:49 [ℹ]  using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
    2023-04-04 15:29:49 [ℹ]  creating addon
    

    Confirm that the add-on was created using the following command:

    $ eksctl get addon --cluster <cluster-name>
    

    You should see an output similar to the following:

    2023-04-04 15:30:20 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:30:20 [ℹ]  getting all addons
    2023-04-04 15:30:21 [ℹ]  to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
    NAME                VERSION             STATUS    ISSUES  IAMROLE                                                                     UPDATE AVAILABLE  CONFIGURATION VALUES
    aws-ebs-csi-driver  v1.17.0-eksbuild.1  CREATING  0       arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
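
The role name and role ARN used in the storage steps above follow a fixed pattern, so they can be composed once and reused in both eksctl commands. This is a minimal sketch under stated assumptions: ebs_csi_role_name and iam_role_arn are hypothetical helper names, and the timestamp format in the usage example is only one possible choice.

```shell
# Hypothetical helpers for the names used in the storage steps.

# Print the EBS CSI driver role name with a suffix ($1), e.g. a timestamp.
ebs_csi_role_name() {
    printf 'AmazonEKS_EBS_CSI_DriverRole_%s\n' "$1"
}

# Print the IAM role ARN for an AWS account ID ($1) and a role name ($2).
iam_role_arn() {
    printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

# Example usage (the date format is an assumption, any unique suffix works):
#   role_name=$(ebs_csi_role_name "$(date +%Y%m%d%H%M%S)")
#   account_id=$(aws sts get-caller-identity --query 'Account' --output text)
#   iam_role_arn "$account_id" "$role_name"
```

Passing the same role_name value to --role-name and to --service-account-role-arn avoids a mismatch between the two commands.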
    

Amazon Web Services Integration

When installing CNPack on EKS, you can optionally connect some of the cluster components to AWS managed services. The components that can currently be configured this way are:

  • FluentBit

  • Prometheus

  • Cert-manager

Configuration guidance for these services is provided below. Example deployment configuration for connecting the Cloud Native Service Add-On Pack to these services is provided in the next section.

AWS CloudWatch

To enable FluentBit to ship logs to AWS CloudWatch, attach the CloudWatchAgentServerPolicy policy to the IAM role of each node group in your cluster.

  1. To attach the policy, navigate to the AWS Console, then go to the IAM > Roles page.

    Search for node, then select the role for your node group(s).

  2. Go to Add Permissions > Attach Policies.

    Search for CloudWatchAgentServerPolicy.

    Select it using the check box, and click Add Permission. The policy should then appear in the role's list of permissions.

  3. Repeat these steps for your other node groups.
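
The console steps above can also be expressed with the AWS CLI. The sketch below is an assumption-based alternative rather than part of the original procedure: cloudwatch_attach_command is a hypothetical helper that prints the aws iam attach-role-policy command for one node role, so it can be reviewed before it is run. The policy ARN is the AWS managed policy named in this section.

```shell
# ARN of the AWS managed policy named in this section.
CLOUDWATCH_POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# Hypothetical helper: print (rather than run) the attach command for one
# node role name ($1), so it can be reviewed before execution.
cloudwatch_attach_command() {
    printf 'aws iam attach-role-policy --role-name %s --policy-arn %s\n' \
        "$1" "$CLOUDWATCH_POLICY_ARN"
}

# Example usage, one call per node group role:
#   cloudwatch_attach_command <node-role-name>
#   cloudwatch_attach_command <node-role-name> | sh   # run it after review
```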

AWS Private CA

To enable the AWS Private CA issuer on the cluster, first set up an AWS Private CA from the AWS Console.

Once the CA has been created, create an IAM policy that grants access to the AWS Private CA resource (ARN) by following the steps below.

  1. Create a file named iam-pca-policy.json with the following contents:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "awspcaissuer",
          "Action": [
            "acm-pca:DescribeCertificateAuthority",
            "acm-pca:GetCertificate",
            "acm-pca:IssueCertificate"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:acm-pca:<region>:<your-aws-account-id>:certificate-authority/<private-ca-id>"
        }
      ]
    }
    
  2. Create an IAM policy called AWSPCAIssuerIAMPolicy:

    aws iam create-policy \
     --policy-name AWSPCAIssuerIAMPolicy \
     --policy-document file://iam-pca-policy.json
    
  3. Attach the new policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, to the IAM node role of the Kubernetes node group. This can be done from the AWS Console or by running the command below as admin. The --role-name option takes the name of the role, not its full ARN:

    aws iam attach-role-policy \
    --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
    --role-name <your-node-role-name>
    
  4. Update the certManager configuration in the configuration YAML for the add-on pack to the following:

    certManager:
      enabled: true
      awsPCA:
        enabled: true
        commonName: "<your common name specific to your AWS Private CA>"
        domainName: "<the domain name specific to your AWS Private CA>"
        arn: "<ARN of the AWS Private CA>"
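
The aws iam attach-role-policy command identifies the role by name, while the console and most CLI output show the role ARN. As a small illustrative sketch, the hypothetical helper role_name_from_arn below derives the name from an ARN, assuming the standard arn:aws:iam::<account-id>:role/<optional-path>/<role-name> layout in which the name is the last path segment.

```shell
# Hypothetical helper: print the role name from an IAM role ARN ($1).
# The name is taken to be everything after the last "/" in the ARN.
role_name_from_arn() {
    printf '%s\n' "${1##*/}"
}

# Example usage:
#   aws iam attach-role-policy \
#       --policy-arn arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
#       --role-name "$(role_name_from_arn <node-role-arn>)"
```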
    

AWS Managed Prometheus

  1. Create an AWS Managed Prometheus Workspace from the AWS Console.

  2. Once the Prometheus instance is ready, to set up the IAM roles needed for AWS Managed Prometheus, create an IAM helper script using the template below:

    Note

    Make sure to change the CLUSTER_NAME field to match the cluster name for the EKS cluster.

      aws-managed-prometheus-iam-setup.sh

      #!/bin/bash -e

      # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

      # Permission is hereby granted, free of charge, to any person obtaining a copy of this
      # software and associated documentation files (the "Software"), to deal in the Software
      # without restriction, including without limitation the rights to use, copy, modify,
      # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
      # permit persons to whom the Software is furnished to do so.

      # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
      # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
      # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
      # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
      # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
      # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


      CLUSTER_NAME="<cluster-name>" # Add name of your cluster or this will fail
      SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring" # Matches CNPack Deployment
      AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
      OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
      SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator # Matches CNPack Deployment
      SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role # Fine to attach to existing role
      SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy # Update this policy
      #
      # Set up a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC Idp.
      #
      cat <<EOF > TrustPolicy.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
              "StringEquals": {
                "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
              }
            }
          }
        ]
      }
      EOF
      #
      # Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
      #
      cat <<EOF > PermissionPolicyIngest.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "aps:RemoteWrite",
              "aps:GetSeries",
              "aps:GetLabels",
              "aps:GetMetricMetadata"
            ],
            "Resource": "*"
          }
        ]
      }
      EOF

      function getRoleArn() {
          OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)

          # Check for an expected exception
          if [[ $? -eq 0 ]]; then
              echo $OUTPUT
          elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
              echo ""
          else
              >&2 echo $OUTPUT
              return 1
          fi
      }

      #
      # Create the IAM Role for ingest with the above trust policy
      #
      SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
      if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ];
      then
          #
          # Create the IAM role for service account
          #
          SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
              --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
              --assume-role-policy-document file://TrustPolicy.json \
              --query "Role.Arn" --output text)
          #
          # Create an IAM permission policy
          #
          SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
              --policy-document file://PermissionPolicyIngest.json \
              --query 'Policy.Arn' --output text)
          #
          # Attach the required IAM policies to the IAM role created above
          #
          aws iam attach-role-policy \
              --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
              --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
      else
          echo "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN IAM role for ingest already exists"
      fi
      echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN
      #
      # EKS cluster hosts an OIDC provider with a public discovery endpoint.
      # Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts.
      # Doing this with eksctl is the easier and best approach.
      #
      eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
    
  3. Run the created helper script. You should see an output similar to the following:

    $ ./aws-managed-prometheus-iam-setup.sh
    arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
    2023-03-31 12:57:48 [ℹ]  IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"
    
  4. Update the configuration YAML file for the add-on pack following the template below.

    prometheus:
      enabled: true
      awsRemoteWrite:
        url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
        arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"
    
    • Set the url field to the remote write URL of the AWS Managed Prometheus workspace.

    • Set the arn field to the ARN of the IAM role created by the IAM helper script above, or to the ARN of an existing IAM role. See the AWS documentation for more information on IAM role creation.
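
The url value in the template above follows a fixed pattern, so it can be generated from the region and workspace ID instead of being typed by hand. The following is a minimal sketch: amp_remote_write_url is a hypothetical helper name, and the URL layout is taken directly from the template in step 4.

```shell
# Hypothetical helper: print the remote write URL for an AWS Managed
# Prometheus workspace, given a region ($1) and a workspace ID ($2).
amp_remote_write_url() {
    printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' \
        "$1" "$2"
}

# Example usage (requires configured AWS credentials to list workspaces):
#   aws amp list-workspaces --query 'workspaces[].workspaceId' --output text
#   amp_remote_write_url <region> <workspace-id>
```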