Setting Up Amazon EKS

Amazon EKS

This section describes how to set up an NVIDIA AI Enterprise-supported Amazon EKS instance, along with the associated Amazon Web Services, on which the Cloud Native Service Add-On Pack will be deployed and with which it will integrate.

Note

The following steps must be performed using an AWS account with admin privileges.

  1. First, using the hardware specifications from the AI Workflows documentation, provision an EKS instance meeting the minimum cluster version below, following the instructions in the NVIDIA AI Enterprise Cloud Guide.

    • Minimum Cluster Version: 1.23

    • Minimum Cloud Native Service Add-On Pack Version: 0.4.0

  2. Once your cluster has been created, ensure you can access it via the kubeconfig and eksctl on your system (a quick verification sketch is provided after this list).

  3. Retrieve the cluster name using the following command:


    $ aws eks list-clusters

    You should see an output similar to the following:


    "clusters": [ "<cluster-name>" ]

    Make a note of this cluster name, as you will reference this throughout the rest of the steps.

  4. Create an NGC API Key if you have not done so already, and ensure you can access the Enterprise Catalog.

  5. Once you have created an NGC API Key, install and configure the NGC CLI if you have not done so already, following the NGC CLI documentation (a brief configuration sketch is included after this list).
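
As a quick sanity check for steps 2 and 5, the commands below verify cluster access and configure the NGC CLI. This is a minimal sketch rather than part of the official procedure; it assumes the AWS CLI, kubectl, eksctl, and the NGC CLI are already installed, and that <cluster-name> and <region> match your environment.

# Update the local kubeconfig for the new cluster and confirm API access
$ aws eks update-kubeconfig --name <cluster-name> --region <region>
$ kubectl get nodes
$ eksctl get cluster --name <cluster-name> --region <region>

# Configure the NGC CLI interactively; paste your NGC API Key when prompted
$ ngc config set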

Follow the steps below to create and configure the required components that the Cloud Native Service Add-On Pack will be deployed on and integrate with.

IAM OIDC Provider

An IAM OIDC provider is required to manage the service accounts required for configuration of the EKS cluster and Amazon-managed services.

To create an IAM OIDC provider, run the following command with your cluster name:


eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve


You should see an output similar to the following:


2023-04-04 13:55:08 [ℹ]  will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 [✔]  created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
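
If you want to confirm the association, one way (an optional check, not part of the original steps) is to compare the cluster's OIDC issuer against the providers registered in IAM:

# The issuer URL reported by EKS should appear in the IAM provider list
$ aws eks describe-cluster --name <cluster-name> --query "cluster.identity.oidc.issuer" --output text
$ aws iam list-open-id-connect-providers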


Storage Configuration

A storage class must be available on the EKS cluster for the Cloud Native Service Add-On Pack to be configured to use. This guide uses the gp2 storage class. For detailed information on how to enable the gp2 storage class, please refer to the AWS documentation.
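
Before continuing, you can verify that the storage class is present on the cluster. A minimal check, assuming your kubeconfig points at the EKS cluster:

# The gp2 class should be listed (it is typically the default class on EKS)
$ kubectl get storageclass gp2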

Once the gp2 storage class has been created, the following configuration is required to create a service account that can create Persistent Volumes using the gp2 storage class on the cluster (an optional test PVC sketch follows these steps).

  1. First, retrieve your AWS account ID using the command below. You will use this in later steps.


    $ aws sts get-caller-identity --query 'Account' --output text

    You should see an output similar to the following:


    298485221437


  2. Next, create a service account with the EBS CSI Driver role for the EBS CSI driver add-on using the command below. To make sure the role name is unique and to avoid conflicts with existing roles, append the timestamp to the role name.


    $ eksctl create iamserviceaccount --cluster <cluster-name> --region <region> --name ebs-csi-controller-sa --namespace kube-system --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force

    You should see an output similar to the following:


    2023-04-04 15:18:05 [ℹ]  1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
    2023-04-04 15:18:05 [!]  serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
    2023-04-04 15:18:05 [ℹ]  1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
    2023-04-04 15:18:05 [ℹ]  building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 [ℹ]  deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:35 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"

    Confirm that the service account was created with the following command:


    $ eksctl get iamserviceaccount --cluster <cluster-name>

    You should see an output similar to the following:


    NAMESPACE     NAME                    ROLE ARN
    kube-system   ebs-csi-controller-sa   arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>

    Confirm the IAM Role in the AWS Console matches the images below:

    eks-iam-01.png
    eks-iam-02.png
    eks-iam-03.png

  3. Next, create the EBS CSI driver add-on for your cluster, referencing the service account role, using the command below:


    $ eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn arn:aws:iam::<account-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force

    You should see an output similar to the following:


    2023-04-04 15:29:49 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:29:49 [ℹ]  using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
    2023-04-04 15:29:49 [ℹ]  creating addon

    Confirm that the add-on was created using the following command:


    $ eksctl get addon --cluster <cluster-name>

    You should see an output similar to the following:


    2023-04-04 15:30:20 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:30:20 [ℹ]  getting all addons
    2023-04-04 15:30:21 [ℹ]  to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
    NAME                VERSION             STATUS    ISSUES  IAMROLE                                                                   UPDATE AVAILABLE  CONFIGURATION VALUES
    aws-ebs-csi-driver  v1.17.0-eksbuild.1  CREATING  0       arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
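
To sanity-check the storage setup end to end once the add-on is active, you can create a small test PersistentVolumeClaim against the gp2 class. This is an optional sketch, not part of the official procedure; the claim name test-gp2-claim is an arbitrary example:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-gp2-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 1Gi
EOF

# With the default WaitForFirstConsumer binding mode the claim stays Pending until a pod uses it.
# Clean up afterwards with: kubectl delete pvc test-gp2-claim
$ kubectl get pvc test-gp2-claim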


When installing CNPack on EKS, you have the option to configure and connect some of the cluster components to centralized AWS services. The AWS services that can currently be configured are:

  • FluentBit

  • Prometheus

  • Cert-manager

Configuration guidance for these services is provided below. Example deployment configuration for connecting the Cloud Native Service Add-On Pack to these services is provided in the next section.

AWS CloudWatch

To enable FluentBit to ship logs to AWS CloudWatch, the CloudWatchAgentServerPolicy policy needs to be attached to your cluster nodes. The console steps are shown below; a CLI alternative is sketched after this list.

  1. To attach the policy, navigate to the AWS Console, then go to the IAM > Roles page.

    Search for node:

    eks-cloudwatch-01.png

    Select your node group(s):

    eks-cloudwatch-02.png

  2. Go to Add Permissions > Attach Policies.

    Search for CloudWatchAgentServerPolicy:

    eks-cloudwatch-03.png

    Select it via the check box, and click Add permissions. You should then see:

    eks-cloudwatch-04.png

  3. Repeat these steps for your other node groups.
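
If you prefer the CLI over the console, the same policy attachment can be scripted. A minimal sketch, assuming managed node groups; <nodegroup-name> is whatever aws eks list-nodegroups returns for your cluster, and <node-role-name> is the last segment of the returned nodeRole ARN:

# Find the IAM role attached to each node group
$ aws eks list-nodegroups --cluster-name <cluster-name>
$ aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --query "nodegroup.nodeRole" --output text

# Attach the CloudWatchAgentServerPolicy to that role (repeat for each node group)
$ aws iam attach-role-policy --role-name <node-role-name> --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy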

AWS Private CA

To enable the AWS Private CA issuer on the cluster, first set up an AWS Private CA from the AWS Console.

Once the CA has been created, create an IAM policy scoped to the Private CA resource ARN by following the steps below.

  1. Create a file named iam-pca-policy.json with the following contents:


    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "awspcaissuer",
                "Action": [
                    "acm-pca:DescribeCertificateAuthority",
                    "acm-pca:GetCertificate",
                    "acm-pca:IssueCertificate"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:acm-pca:<region>:<your-aws-account-id>:certificate-authority/<arn-of-pca>"
            }
        ]
    }

  2. Create an IAM policy called AWSPCAIssuerIAMPolicy:


    aws iam create-policy \
        --policy-name AWSPCAIssuerIAMPolicy \
        --policy-document file://iam-pca-policy.json

  3. This new PCA IAM policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, needs to be attached to the node role of the Kubernetes node pool. This can be done from the AWS Console or by running this command as admin (the node role name can be looked up as in the CLI sketch in the AWS CloudWatch section above):


    aws iam attach-role-policy \
        --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
        --role-name <your-node-role-name>

  4. Update the certManager configuration in the configuration YAML for the add-on pack to the following:


    certManager:
      enabled: true
      awsPCA:
        enabled: true
        commonName: "<your common name specific to your AWS Private CA>"
        domainName: "<the domain name specific to your AWS Private CA>"
        arn: "<ARN of the AWS Private CA>"

AWS Managed Prometheus

  1. Create an AWS Managed Prometheus workspace from the AWS Console (a CLI alternative is sketched at the end of this section).

  2. Once the Prometheus workspace is ready, set up the IAM roles needed for AWS Managed Prometheus by creating an IAM helper script from the template below:

    Note

    Make sure to change the CLUSTER_NAME field to match the cluster name for the EKS cluster.


    aws-managed-prometheus-iam-setup.sh

    #!/bin/bash -e
    # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
    # Permission is hereby granted, free of charge, to any person obtaining a copy of this
    # software and associated documentation files (the "Software"), to deal in the Software
    # without restriction, including without limitation the rights to use, copy, modify,
    # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
    # permit persons to whom the Software is furnished to do so.
    #
    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
    # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
    # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
    # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
    # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
    # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    CLUSTER_NAME="<cluster-name>"                                        # Add name of your cluster or this will fail
    SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring"                        # Matches CNPack Deployment
    AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
    SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator  # Matches CNPack Deployment
    SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role         # Fine to attach to existing role
    SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy                # Update this policy

    #
    # Set up a trust policy designed for a specific combination of K8s service account and namespace
    # to sign in from a Kubernetes cluster which hosts the OIDC IdP.
    #
    cat <<EOF > TrustPolicy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
            }
          }
        }
      ]
    }
    EOF

    #
    # Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
    #
    cat <<EOF > PermissionPolicyIngest.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "aps:RemoteWrite",
            "aps:GetSeries",
            "aps:GetLabels",
            "aps:GetMetricMetadata"
          ],
          "Resource": "*"
        }
      ]
    }
    EOF

    function getRoleArn() {
      OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)
      # Check for an expected exception
      if [[ $? -eq 0 ]]; then
        echo $OUTPUT
      elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
        echo ""
      else
        >&2 echo $OUTPUT
        return 1
      fi
    }

    #
    # Create the IAM Role for ingest with the above trust policy
    #
    SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
    if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ]; then
      #
      # Create the IAM role for the service account
      #
      SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
        --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
        --assume-role-policy-document file://TrustPolicy.json \
        --query "Role.Arn" --output text)
      #
      # Create an IAM permission policy
      #
      SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
        --policy-document file://PermissionPolicyIngest.json \
        --query 'Policy.Arn' --output text)
      #
      # Attach the required IAM policies to the IAM role created above
      #
      aws iam attach-role-policy \
        --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
        --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
    else
      echo "${SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN} IAM role for ingest already exists"
    fi

    echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN

    #
    # The EKS cluster hosts an OIDC provider with a public discovery endpoint.
    # Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens
    # issued by Kubernetes to service accounts. Doing this with eksctl is the easier and best approach.
    #
    eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve


  3. Run the created helper script. You should see an output similar to the following:


    $ ./aws-managed-prometheus-iam-setup.sh
    arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
    2023-03-31 12:57:48 [ℹ]  IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"


  4. Update the configuration YAML file for the add-on pack following the template below.


    prometheus:
      enabled: true
      awsRemoteWrite:
        url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
        arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"

    • The field url is set to the remote write URL of your AWS Managed Prometheus workspace.

    • Ensure the field arn is set to the ARN of the ingest IAM role created by the IAM helper script above, or to an existing IAM role. See the AWS documentation for more information on IAM role creation.
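
For reference, the Managed Prometheus workspace from step 1 can also be created from the CLI rather than the console. A minimal sketch; the alias nvidia-cnpack-amp is an arbitrary example:

# Create the workspace and note the returned workspaceId used in the remote write URL
$ aws amp create-workspace --alias nvidia-cnpack-amp
$ aws amp list-workspaces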

© Copyright 2022-2023, NVIDIA. Last updated on May 23, 2023.