Setting Up Amazon EKS
Amazon EKS
This section describes how to set up an NVIDIA AI Enterprise-supported Amazon EKS instance, and the associated Amazon Web Services, on which the Cloud Native Service Add-On Pack will be deployed and with which it will integrate.
The following steps must be performed using an AWS account with admin privileges.
First, using the hardware specifications from the AI Workflows documentation, provision an EKS instance meeting the minimum cluster version below, following the instructions in the NVIDIA AI Enterprise Cloud Guide.
Minimum Cluster Version: 1.23
Minimum Cloud Native Service Add-On Pack Version: 0.4.0
Once your cluster has been created, ensure you can access the cluster via the kubeconfig and eksctl on your system.
Retrieve the cluster name using the following command:
$ aws eks list-clusters
You should see an output similar to the following:
"clusters": [ "<cluster-name>" ]
Make a note of this cluster name, as you will reference this throughout the rest of the steps.
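With the cluster name in hand, the minimum cluster version stated above can be checked from the command line. The following is a minimal sketch, not part of the original guide: the helper names version_ge and check_cluster_version are hypothetical, and the comparison assumes a sort that supports version ordering (-V), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Minimum Kubernetes version taken from this guide.
MIN_CLUSTER_VERSION="1.23"

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_cluster_version CLUSTER_NAME: hypothetical helper that reads the
# Kubernetes version of the cluster through the AWS CLI and compares it
# against MIN_CLUSTER_VERSION.
check_cluster_version() {
  local version
  version=$(aws eks describe-cluster --name "$1" \
    --query 'cluster.version' --output text) || return 1
  if version_ge "$version" "$MIN_CLUSTER_VERSION"; then
    echo "OK: cluster $1 runs Kubernetes $version"
  else
    echo "Cluster $1 runs Kubernetes $version, need $MIN_CLUSTER_VERSION or newer" >&2
    return 1
  fi
}
```

After sourcing the snippet, running check_cluster_version <cluster-name> reports whether the cluster meets the minimum version.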
Create an NGC API Key if you have not done so already, and ensure you can access the Enterprise Catalog.
Once you have created an NGC API Key, install and configure the NGC CLI if you have not done so already using the instructions here.
Follow the steps below to create and configure the components that the Cloud Native Service Add-On Pack requires for deployment and integration.
IAM OIDC Provider
An IAM OIDC provider is required to manage the service accounts required for configuration of the EKS cluster and Amazon-managed services.
To create an IAM OIDC provider, run the following command with your cluster name:
eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve
You should see an output similar to the following:
2023-04-04 13:55:08 [ℹ] will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 [✔] created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
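To double-check that the provider was registered, the OIDC ID of the cluster can be compared against the providers known to IAM. This is a sketch under assumptions: the helper names are hypothetical, and the check simply looks for the ID (the last path segment of the issuer URL) in the output of aws iam list-open-id-connect-providers.

```shell
#!/usr/bin/env bash
# oidc_id_from_issuer ISSUER_URL: prints the last path segment of the issuer
# URL, for example https://oidc.eks.<region>.amazonaws.com/id/<ID> -> <ID>.
oidc_id_from_issuer() {
  printf '%s\n' "${1##*/}"
}

# oidc_provider_registered CLUSTER_NAME: hypothetical helper that succeeds
# when an IAM OIDC provider matching the cluster's issuer ID exists.
oidc_provider_registered() {
  local issuer id
  issuer=$(aws eks describe-cluster --name "$1" \
    --query 'cluster.identity.oidc.issuer' --output text) || return 1
  id=$(oidc_id_from_issuer "$issuer")
  aws iam list-open-id-connect-providers --output text | grep -q "$id"
}
```

For example, oidc_provider_registered <cluster-name> && echo "provider found" prints a confirmation only when the association exists.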
Storage Configuration
A storage class must be available on the EKS cluster for the Cloud Native Service Add-On Pack to use. This guide uses the gp2 storage class. For detailed information on how to enable the gp2 storage class, please refer to AWS documentation.
Once the gp2 storage class has been created, the following configuration is required to create a service account that can create Persistent Volumes using the gp2 storage class on the cluster.
First, retrieve your AWS account ID using the command below. You will use this value in later steps, where it appears as the <cluster-id> placeholder.
$ aws sts get-caller-identity --query 'Account' --output text
You should see an output similar to the following:
298485221437
Next, create a service account with the EBS CSI Driver role for the EBS CSI driver add-on using the command below. To make sure the role name is unique and to avoid conflicts with existing roles, append the timestamp to the role name.
$ eksctl create iamserviceaccount --cluster <cluster-name> --region <region> --name ebs-csi-controller-sa --namespace kube-system --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
You should see an output similar to the following:
2023-04-04 15:18:05 [ℹ] 1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
2023-04-04 15:18:05 [!] serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
2023-04-04 15:18:05 [ℹ] 1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
2023-04-04 15:18:05 [ℹ] building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ] deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ] waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:35 [ℹ] waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
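One possible way to produce the <timestamp> suffix for the role name, and the role ARN needed in the following steps, is sketched below. The helper names and the timestamp format (UTC, YYYYmmddHHMMSS) are assumptions made for illustration; any suffix that keeps the role name unique serves the same purpose.

```shell
#!/usr/bin/env bash
# timestamped_role_name PREFIX: appends a UTC timestamp (YYYYmmddHHMMSS)
# to PREFIX so that the resulting role name is unlikely to collide.
timestamped_role_name() {
  printf '%s_%s\n' "$1" "$(date -u +%Y%m%d%H%M%S)"
}

# iam_role_arn ACCOUNT_ID ROLE_NAME: builds the role ARN after checking that
# ACCOUNT_ID is a 12-digit AWS account ID.
iam_role_arn() {
  case "$1" in
    '' | *[!0-9]*) echo "account ID must be numeric" >&2; return 1 ;;
  esac
  [ "${#1}" -eq 12 ] || { echo "account ID must be 12 digits" >&2; return 1; }
  printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}
```

For instance, ROLE_NAME=$(timestamped_role_name AmazonEKS_EBS_CSI_DriverRole) can be passed to --role-name, and iam_role_arn "$ACCOUNT_ID" "$ROLE_NAME" then yields the value for --service-account-role-arn.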
Confirm that the service account was created with the following command:
$ eksctl get iamserviceaccount --cluster <cluster-name>
You should see an output similar to the following:
NAMESPACE    NAME                   ROLE ARN
kube-system  ebs-csi-controller-sa  arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
Confirm the IAM Role in the AWS Console matches the images below:
Next, create the add-on for the service account role for your cluster using the command below:
$ eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn arn:aws:iam::<cluster-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
You should see an output similar to the following:
2023-04-04 15:29:49 [ℹ] Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:29:49 [ℹ] using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
2023-04-04 15:29:49 [ℹ] creating addon
Confirm that the add-on was created using the following command:
$ eksctl get addon --cluster <cluster-name>
You should see an output similar to the following:
2023-04-04 15:30:20 [ℹ] Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:30:20 [ℹ] getting all addons
2023-04-04 15:30:21 [ℹ] to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
NAME                VERSION             STATUS    ISSUES  IAMROLE                                                                  UPDATE AVAILABLE  CONFIGURATION VALUES
aws-ebs-csi-driver  v1.17.0-eksbuild.1  CREATING  0       arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
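The add-on is reported as CREATING right after creation, so it can be useful to wait until it becomes ACTIVE before continuing. The sketch below shows one way to poll for that. The helper names, attempt count, and delay are assumptions; the status is read with aws eks describe-addon.

```shell
#!/usr/bin/env bash
# wait_for_status WANTED MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD repeatedly until it prints WANTED or MAX_ATTEMPTS is reached.
wait_for_status() {
  local wanted=$1 attempts=$2 delay=$3 status i
  shift 3
  for ((i = 1; i <= attempts; i++)); do
    status=$("$@") || status="ERROR"
    echo "attempt $i: status=$status"
    if [ "$status" = "$wanted" ]; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# addon_status CLUSTER_NAME ADDON_NAME: prints the add-on status
# (for example CREATING or ACTIVE) as reported by the AWS CLI.
addon_status() {
  aws eks describe-addon --cluster-name "$1" --addon-name "$2" \
    --query 'addon.status' --output text
}
```

A call such as wait_for_status ACTIVE 30 10 addon_status <cluster-name> aws-ebs-csi-driver polls every 10 seconds for up to 30 attempts and returns a non-zero exit code if the add-on never becomes ACTIVE.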
When installing CNPack on EKS, you have the option to configure and connect some of the cluster components to AWS central services. The components that can currently be connected to AWS services are:
FluentBit
Prometheus
Cert-manager
Configuration guidance for these services is provided below. Example deployment configuration for connecting the Cloud Native Service Add-On Pack to these services is provided in the next section.
AWS CloudWatch
To enable FluentBit to send logs to AWS CloudWatch, the policy CloudWatchAgentServerPolicy needs to be attached to the IAM roles of your cluster nodes.
To attach the policy, navigate to the AWS Console, then go to the IAM > Roles page.
Search for node:
Select your node group(s):
Go to Add Permissions > Attach Policies.
Search for CloudWatchAgentServerPolicy:
Select it via the check box, and click Add Permission. You should then see:
Repeat these steps for your other node groups.
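The same attachment can also be scripted with the AWS CLI instead of the console. The following sketch assumes you already know the IAM role names of your node groups; the function name and the DRY_RUN switch are hypothetical conveniences, while the policy ARN is the AWS managed CloudWatchAgentServerPolicy.

```shell
#!/usr/bin/env bash
# ARN of the AWS managed policy referenced in the steps above.
CLOUDWATCH_POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# attach_cloudwatch_policy ROLE_NAME...: attaches the policy to every node
# role given. With DRY_RUN=1 the commands are only printed, not executed.
attach_cloudwatch_policy() {
  local role
  for role in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "aws iam attach-role-policy --role-name $role --policy-arn $CLOUDWATCH_POLICY_ARN"
    else
      aws iam attach-role-policy --role-name "$role" \
        --policy-arn "$CLOUDWATCH_POLICY_ARN"
    fi
  done
}
```

Running DRY_RUN=1 attach_cloudwatch_policy <node-role-1> <node-role-2> first makes it easy to review the commands before applying them.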
AWS Private CA
To enable the AWS Private CA issuer on the cluster, first set up an AWS Private CA from the AWS Console.
Once the CA has been created, create an IAM policy that grants access to the AWS Private CA resource (identified by its ARN) following the steps below.
Create a file named iam-pca-policy.json with the following contents:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "awspcaissuer",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:acm-pca:<region>:<your-aws-account-id>:certificate-authority/<arn-of-pca>"
    }
  ]
}
Create an IAM policy called AWSPCAIssuerIAMPolicy:
$ aws iam create-policy \
    --policy-name AWSPCAIssuerIAMPolicy \
    --policy-document file://iam-pca-policy.json
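If you prefer to generate the policy file rather than edit it by hand, a small helper can fill in the Resource field. This is only a sketch: the function name render_pca_policy is hypothetical, and it takes the full ARN of your AWS Private CA as its single argument instead of assembling it from parts.

```shell
#!/usr/bin/env bash
# render_pca_policy CA_ARN: prints the IAM policy document from this guide
# with the Resource field set to CA_ARN.
render_pca_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "awspcaissuer",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Effect": "Allow",
      "Resource": "$1"
    }
  ]
}
EOF
}
```

For example, render_pca_policy "<ARN of the AWS Private CA>" > iam-pca-policy.json writes the file that the create-policy command reads.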
This new PCA IAM policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, needs to be attached to the node IAM role of the Kubernetes node group. This can be done from the AWS Console or by running the command below as admin. Note that --role-name takes the name of the role, not its full ARN.
$ aws iam attach-role-policy \
    --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
    --role-name <your-node-role-name>
Update the certManager configuration in the configuration YAML for the add-on pack to the following:
certManager:
  enabled: true
  awsPCA:
    enabled: true
    commonName: "<yourcommonnamespecifictoyourAWSPrivateCA>"
    domainName: "<thedomainnamespecifictoyourAWSPrivateCA>"
    arn: "<ARNoftheAWSPrivateCA>"
AWS Managed Prometheus
Create an AWS Managed Prometheus Workspace from the AWS Console.
Once the Prometheus instance is ready, to set up the IAM roles needed for AWS Managed Prometheus, create an IAM helper script using the template below:
Note: Make sure to change the CLUSTER_NAME field to match the cluster name for the EKS cluster.
aws-managed-prometheus-iam-setup.sh
#!/bin/bash -e
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

CLUSTER_NAME="<cluster-name>" # Add name of your cluster or this will fail
SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring" # Matches CNPack Deployment
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator # Matches CNPack Deployment
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role # Fine to attach to existing role
SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy # Update this policy
#
# Set up a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC Idp.
#
cat <<EOF > TrustPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
        }
      }
    }
  ]
}
EOF
#
# Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
#
cat <<EOF > PermissionPolicyIngest.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata"
      ],
      "Resource": "*"
    }
  ]
}
EOF

function getRoleArn() {
  OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)
  # Check for an expected exception
  if [[ $? -eq 0 ]]; then
    echo $OUTPUT
  elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
    echo ""
  else
    >&2 echo $OUTPUT
    return 1
  fi
}

#
# Create the IAM Role for ingest with the above trust policy
#
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ]; then
  #
  # Create the IAM role for service account
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
    --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
    --assume-role-policy-document file://TrustPolicy.json \
    --query "Role.Arn" --output text)
  #
  # Create an IAM permission policy
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
    --policy-document file://PermissionPolicyIngest.json \
    --query 'Policy.Arn' --output text)
  #
  # Attach the required IAM policies to the IAM role created above
  #
  aws iam attach-role-policy \
    --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
    --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
else
  echo "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN IAM role for ingest already exists"
fi
echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN
#
# EKS cluster hosts an OIDC provider with a public discovery endpoint.
# Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts.
# Doing this with eksctl is the easier and best approach.
#
eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
Run the created helper script. You should see an output similar to the following:
$ ./aws-managed-prometheus-iam-setup.sh
arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
2023-03-31 12:57:48 [ℹ] IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"
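The trust policy written by the helper script only lets one specific Kubernetes service account assume the ingest role. The sketch below isolates the two values that the policy is built from, so they can be inspected or compared against your deployment. The function names are hypothetical; the sed expression and the subject format are taken from the script itself.

```shell
#!/usr/bin/env bash
# oidc_provider_from_issuer ISSUER_URL: strips the leading https:// from the
# issuer URL, using the same sed expression as the helper script.
oidc_provider_from_issuer() {
  printf '%s\n' "$1" | sed -e 's/^https:\/\///'
}

# trust_subject NAMESPACE SERVICE_ACCOUNT: prints the value that the trust
# policy expects in the "<oidc-provider>:sub" condition.
trust_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}
```

With the defaults from the script, trust_subject nvidia-monitoring nvidia-prometheus-kube-pro-operator gives the subject that must match the service account used by the CNPack deployment. If the namespace or service account name differs, the role cannot be assumed.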
Update the configuration YAML file for the add-on pack following the template below.
prometheus:
  enabled: true
  awsRemoteWrite:
    url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
    arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"
Ensure the field url is set to the remote write URL of the AWS Managed Prometheus workspace.
Ensure the field arn is set to the ARN of the IAM role created by the IAM helper script above, or of an existing IAM role. See the AWS Documentation for more information on IAM role creation.
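The url value follows the pattern shown in the template, so it can be assembled from the region and the workspace ID. The helper below is a hypothetical convenience for doing that consistently; it assumes you have already looked up the workspace ID, for example in the AWS Console.

```shell
#!/usr/bin/env bash
# amp_remote_write_url REGION WORKSPACE_ID: prints the remote write URL in
# the format used by the configuration template above.
amp_remote_write_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' "$1" "$2"
}
```

For example, amp_remote_write_url <region> <prometheus-workspace> prints the string to place in the url field.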