Setting Up Amazon EKS#
This section describes how to set up an NVIDIA AI Enterprise-supported Amazon EKS instance, and the associated Amazon Web Services, so that the Cloud Native Service Add-On Pack can be deployed on top of it and integrate with those services.
Prerequisites#
Note
The following steps must be performed using an AWS account with admin privileges.
First, using the hardware specifications from the AI Workflows documentation, provision an EKS instance meeting the minimum cluster version below, following the instructions in the NVIDIA AI Enterprise Cloud Guide.
Minimum Cluster Version: 1.23
Minimum Cloud Native Service Add-On Pack Version: 0.4.0
Once your cluster has been created, ensure you can access the cluster via the kubeconfig and eksctl on your system.
Retrieve the cluster name using the following command:
$ aws eks list-clusters
You should see an output similar to the following:
1"clusters": [ 2 3"<cluster-name>" 4 5]
Make a note of this cluster name, as you will reference this throughout the rest of the steps.
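For convenience in the later steps, you can also store the cluster name in a shell variable and confirm kubeconfig access up front. This is a minimal sketch; the cluster name and region placeholders are values you substitute from your own environment:

# Store the cluster name for reuse and confirm kubectl access through the kubeconfig
CLUSTER_NAME=<cluster-name>
aws eks update-kubeconfig --name $CLUSTER_NAME --region <region>
kubectl get nodes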
Create an NGC API Key if you have not done so already, and ensure you can access the Enterprise Catalog.
Once you have created an NGC API Key, install and configure the NGC CLI if you have not done so already using the instructions here.
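As a quick sanity check, the NGC CLI can be configured and verified from the shell; a sketch, assuming the CLI is already installed:

# Interactive configuration; paste your NGC API key when prompted
ngc config set
# Show the active configuration to confirm the key and org were stored
ngc config current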
EKS Configuration#
Follow the steps below to create and configure the components required for the Cloud Native Service Add-On Pack to be deployed on the cluster and integrate with AWS services.
IAM OIDC Provider#
An IAM OIDC provider is required to manage the service accounts used to configure the EKS cluster and Amazon-managed services.
To create an IAM OIDC provider, run the following command with your cluster name:
eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve
You should see an output similar to the following:
2023-04-04 13:55:08 [ℹ]  will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 [✔]  created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
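If you want to verify the association, the cluster's issuer URL and the registered IAM providers can be compared; a small sketch using standard AWS CLI calls:

# Print the cluster's OIDC issuer URL
aws eks describe-cluster --name <cluster-name> --query "cluster.identity.oidc.issuer" --output text
# The issuer ID from the URL above should appear in one of the provider ARNs listed here
aws iam list-open-id-connect-providers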
Storage Configuration#
A storage class must be available on the EKS cluster for the Cloud Native Service Add-On Pack to use. This guide uses the gp2 storage class. For detailed information on how to enable the gp2 storage class, please refer to the AWS documentation.
Once the gp2 storage class has been created, the following configuration is required to create a service account that can create Persistent Volumes using the gp2 storage class on the cluster.
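As a quick check before continuing, you can confirm the storage class is visible from the cluster; a sketch (gp2 is typically present by default on EKS):

kubectl get storageclass gp2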
First, retrieve your AWS account ID using the command below. You will use this in later steps.
$ aws sts get-caller-identity --query 'Account' --output text
You should see an output similar to the following:
298485221437
Next, create a service account with the EBS CSI Driver role for the EBS CSI driver add-on using the command below. To make sure the role name is unique and to avoid conflicts with existing roles, append the timestamp to the role name.
$ eksctl create iamserviceaccount --cluster <cluster-name> --region <region> --name ebs-csi-controller-sa --namespace kube-system --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
You should see an output similar to the following:
2023-04-04 15:18:05 [ℹ]  1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
2023-04-04 15:18:05 [!]  serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
2023-04-04 15:18:05 [ℹ]  1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
2023-04-04 15:18:05 [ℹ]  building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ]  deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:35 [ℹ]  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
Confirm that the service account was created with the following command:
$ eksctl get iamserviceaccount --cluster <cluster-name>
You should see an output similar to the following:
NAMESPACE       NAME                    ROLE ARN
kube-system     ebs-csi-controller-sa   arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
Confirm the IAM Role in the AWS Console matches the images below:
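The same check can also be made from the CLI if you prefer; a sketch using the role name created above:

# Confirm the role exists and that the EBS CSI driver policy is attached to it
aws iam get-role --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --query 'Role.Arn' --output text
aws iam list-attached-role-policies --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp>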
Next, create the add-on for the service account role for your cluster using the command below:
$ eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn arn:aws:iam::<account-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
You should see an output similar to the following:
2023-04-04 15:29:49 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:29:49 [ℹ]  using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
2023-04-04 15:29:49 [ℹ]  creating addon
Confirm that the add-on was created using the following command:
$ eksctl get addon --cluster <cluster-name>
You should see an output similar to the following:
2023-04-04 15:30:20 [ℹ]  Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:30:20 [ℹ]  getting all addons
2023-04-04 15:30:21 [ℹ]  to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
NAME                    VERSION                 STATUS          ISSUES  IAMROLE                                                                         UPDATE AVAILABLE        CONFIGURATION VALUES
aws-ebs-csi-driver      v1.17.0-eksbuild.1      CREATING        0       arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
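The add-on typically moves from CREATING to ACTIVE after a short wait; a sketch of how you might poll for that and confirm the driver pods are running:

# Re-run until STATUS shows ACTIVE
eksctl get addon --name aws-ebs-csi-driver --cluster <cluster-name>
# The EBS CSI driver pods should appear in kube-system once the add-on is active
kubectl get pods -n kube-system | grep ebs-csi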
Amazon Web Services Integration#
When installing CNPack on EKS, you have the option to configure and connect some of the cluster components to central AWS services. The AWS services that can currently be configured are:
FluentBit
Prometheus
Cert-manager
Configuration guidance for these services is provided below. Example deployment configuration for connecting the Cloud Native Service Add-On Pack to these services is provided in the next section.
AWS CloudWatch#
To enable FluentBit to send logs to AWS CloudWatch, the CloudWatchAgentServerPolicy policy needs to be attached to your cluster nodes.
To attach the policy, navigate to the AWS Console and go to the IAM > Roles page. Select the IAM role used by your node group, then go to Add Permissions > Attach Policies and attach the CloudWatchAgentServerPolicy policy.
Repeat these steps for your other node groups.
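If you prefer the CLI, the same attachment can be made with a single command per node group; this is a sketch, and <node-role-name> is a placeholder for the IAM role name used by your node group:

# Attach the AWS-managed CloudWatchAgentServerPolicy to the node group's role
aws iam attach-role-policy \
    --role-name <node-role-name> \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy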
AWS Private CA#
To enable the AWS Private CA issuer on the cluster, first set up an AWS Private CA from the AWS Console.
Once the CA has been created, create an IAM policy that grants access to the Private CA resource ARN by following the steps below.
Create a file named iam-pca-policy.json with the following contents:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "awspcaissuer",
            "Action": [
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:GetCertificate",
                "acm-pca:IssueCertificate"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:acm-pca:<region>:<your-aws-account_id>:certificate-authority/<arn-of-pca>"
        }
    ]
}
Create an IAM policy called AWSPCAIssuerIAMPolicy from that file:

aws iam create-policy \
    --policy-name AWSPCAIssuerIAMPolicy \
    --policy-document file://iam-pca-policy.json
This new PCA IAM policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, needs to be attached to the node role of the Kubernetes node pool. This can be done from the AWS Console or by running this command as admin:

aws iam attach-role-policy \
    --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
    --role-name <your-node-role-name>
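To confirm the attachment took effect, you can list the policies on the node role; a sketch, with <your-node-role-name> as a placeholder:

# AWSPCAIssuerIAMPolicy (and CloudWatchAgentServerPolicy, if attached earlier) should be listed
aws iam list-attached-role-policies --role-name <your-node-role-name>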
Update the certManager configuration in the configuration YAML for the add-on pack to the following:

certManager:
  enabled: true
  awsPCA:
    enabled: true
    commonName: "<your common name specific to your AWS Private CA>"
    domainName: "<the domain name specific to your AWS Private CA>"
    arn: "<ARN of the AWS Private CA>"
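Before deploying the add-on pack, you may also want to confirm that the Private CA referenced by the arn field is active; a sketch using the AWS CLI:

# The CA must be in the ACTIVE state before certificates can be issued from it
aws acm-pca describe-certificate-authority \
    --certificate-authority-arn <arn-of-pca> \
    --query 'CertificateAuthority.Status' --output text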
AWS Managed Prometheus#
Create an AWS Managed Prometheus Workspace from the AWS Console.
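The workspace can also be created from the CLI if you prefer; a sketch, where <workspace-alias> is a placeholder alias of your choosing:

# Create the workspace and note the workspaceId it returns
aws amp create-workspace --alias <workspace-alias>
# The workspace is ready once its status reports ACTIVE
aws amp list-workspaces --alias <workspace-alias>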
Once the Prometheus workspace is ready, set up the IAM roles needed for AWS Managed Prometheus by creating an IAM helper script from the template below:
Note
Make sure to change the CLUSTER_NAME field to match the cluster name for the EKS cluster.

aws-managed-prometheus-iam-setup.sh

#!/bin/bash -e
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

CLUSTER_NAME="<cluster-name>"                                        # Add name of your cluster or this will fail
SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring"                        # Matches CNPack Deployment
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator  # Matches CNPack Deployment
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role         # Fine to attach to existing role
SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy                # Update this policy
#
# Set up a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC IdP.
#
cat <<EOF > TrustPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
        }
      }
    }
  ]
}
EOF
#
# Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
#
cat <<EOF > PermissionPolicyIngest.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata"
      ],
      "Resource": "*"
    }
  ]
}
EOF

function getRoleArn() {
  OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)
  # Check for an expected exception
  if [[ $? -eq 0 ]]; then
    echo $OUTPUT
  elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
    echo ""
  else
    >&2 echo $OUTPUT
    return 1
  fi
}

#
# Create the IAM Role for ingest with the above trust policy
#
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ];
then
  #
  # Create the IAM role for service account
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
    --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
    --assume-role-policy-document file://TrustPolicy.json \
    --query "Role.Arn" --output text)
  #
  # Create an IAM permission policy
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
    --policy-document file://PermissionPolicyIngest.json \
    --query 'Policy.Arn' --output text)
  #
  # Attach the required IAM policies to the IAM role created above
  #
  aws iam attach-role-policy \
    --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
    --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
else
  echo "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN IAM role for ingest already exists"
fi
echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN
#
# EKS cluster hosts an OIDC provider with a public discovery endpoint.
# Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts.
# Doing this with eksctl is the easier and best approach.
#
eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
Run the created helper script. You should see an output similar to the following:
$ ./aws-managed-prometheus-iam-setup.sh
arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
2023-03-31 12:57:48 [ℹ]  IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"
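To double-check the result, the ingest role created (or reused) by the script can be looked up directly; a small sketch:

# The ARN printed here should match the ARN echoed by the helper script
aws iam get-role --role-name amp-iamproxy-ingest-role --query 'Role.Arn' --output text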
Update the configuration YAML file for the add-on pack following the template below.
prometheus:
  enabled: true
  awsRemoteWrite:
    url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
    arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"
Ensure the url field is set to the remote write URL of your AWS Managed Prometheus workspace.
Ensure the arn field is set to the ARN of the AMP ingest IAM role created by the IAM helper script above, or to an existing IAM role. See the AWS documentation for more information on IAM role creation.
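If you need to look up the workspace endpoint for the url field, it can be retrieved with the AWS CLI; a sketch, where <prometheus-workspace> is the workspace ID shown in the AMP console or by aws amp list-workspaces:

# Print the workspace's Prometheus endpoint; append api/v1/remote_write to form the url value
aws amp describe-workspace --workspace-id <prometheus-workspace> \
    --query 'workspace.prometheusEndpoint' --output text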