EKS Cluster Terraform (Optional)#

Warning

This guide is completely optional. If you already have a Kubernetes cluster with GPU nodes, you can skip this page and proceed directly to Helmfile Installation to install the NVCF control plane.

This guide provides instructions for deploying the Amazon EKS infrastructure foundation for a fully self-hosted NVIDIA Cloud Functions (NVCF) deployment using Terraform. This includes:

  • Amazon EKS cluster with dedicated node pools for various workloads

  • GPU nodes with automatic taint configuration for inference workloads

  • Core infrastructure components (VPC, subnets, IAM roles, security groups)

  • NVIDIA GPU Operator deployment (required for GPU workloads)

  • Infrastructure prerequisites for optional enhancements (LLS streaming, simulation caching)

Note

This guide covers infrastructure deployment only. Some Terraform options configure AWS resources (IAM policies, S3 buckets) required by optional enhancements deployed later. See orphan for details on these components.

Prerequisites#

Required Tools#

  • Terraform >= 1.0.0

  • AWS CLI configured with credentials

  • kubectl >= 1.28

  • helm >= 3.17

  • helmfile >= 0.150

  • helm-diff plugin >=3.11

  • helm-secrets plugin >=4.7.4

  • skopeo (required only if using create_sm_ecr_repos = true for automated ECR mirroring)
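A quick way to confirm the tools above are on your PATH before starting — a minimal sketch; checking minimum versions is left to you:

```shell
# check_tools: report whether each named CLI tool is installed.
# Returns non-zero if any tool is missing.
check_tools() {
  local missing=0 t
  for t in "$@"; do
    if command -v "$t" >/dev/null 2>&1; then
      echo "OK: $t ($(command -v "$t"))"
    else
      echo "MISSING: $t"
      missing=1
    fi
  done
  return "$missing"
}

# Tools required by this guide (skopeo only if create_sm_ecr_repos = true):
check_tools terraform aws kubectl helm helmfile skopeo || true
```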

Required Access#

  • AWS Account with permissions for EKS, VPC, EC2, IAM, S3

  • NGC API Key from ngc.nvidia.com authenticated with nvcf-onprem organization - See Image Mirroring for more details on required NGC Service Key scopes.

  • The nvcf-base repository must be downloaded to your local machine (see Downloading nvcf-base).

Configure AWS Credentials#

Terraform requires valid AWS credentials to create resources. Configure your AWS credentials using one of the following methods before running any terraform commands:

Configure credentials using the AWS CLI:

# Interactive configuration
aws configure

# Or use SSO login
aws sso login --profile <profile-name>

Set credentials directly as environment variables:

export AWS_ACCESS_KEY_ID="<your-access-key>"
export AWS_SECRET_ACCESS_KEY="<your-secret-key>"
export AWS_SESSION_TOKEN="<your-session-token>"  # If using temporary credentials
export AWS_REGION="<your-region>"  # e.g., us-east-1

Use a named profile from ~/.aws/credentials:

export AWS_PROFILE="<profile-name>"

Verify AWS credentials are configured correctly:

aws sts get-caller-identity

You should see output showing your AWS account ID, user ARN, and user ID. If you receive an error, your credentials are not configured correctly.

Set NGC API Key

Before proceeding, set your NGC API key as an environment variable. This is required for automated ECR mirroring and GPU Operator deployment:

export NGC_API_KEY="nvapi-xxxxxxxxxxxxx"  # Replace with your NGC API key
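Since later steps (ECR mirroring, GPU Operator) assume this variable is set, a small sanity check can save a failed terraform apply — a sketch, assuming only the "nvapi-" prefix shown above:

```shell
# check_ngc_key: classify an NGC API key value as ok / empty / warn.
check_ngc_key() {
  case "${1:-}" in
    nvapi-*) echo ok ;;     # matches the documented "nvapi-" prefix
    "")      echo empty ;;  # variable unset or empty
    *)       echo warn ;;   # set, but unexpected format
  esac
}

check_ngc_key "${NGC_API_KEY:-}"
```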

Network Planning#

  • VPC CIDR: /16 recommended for production

  • Service CIDR: /16, must not overlap with VPC CIDR

  • Egress is required for third-party registry access to pull both service artifacts and function containers

Node Pool Design#

The Terraform configuration supports flexible node pool designs for different deployment scenarios:

  • self-managed: 5 node pools (compute, GPU, control-plane, database, and secrets management). The extra compute node pool primarily supports optional simulation components and can be disabled for an inference-only self-hosted NVCF deployment.

  • byoc: 2 node pools (compute and GPU). If deploying with this configuration, nodeSelectors must be disabled in the self-hosted stack environment configuration file.

Please refer to the codebase nvcf-base/terraform/tfvars-examples for the full list of node configurations and deployment options. Though the byoc configuration may support a self-hosted stack deployment, it is primarily meant for BYOC cluster deployments with NVIDIA-managed NVCF control plane services (see Overview).

Note

You can customize node pools (instance types, capacities, and configurations) by copying one of the example tfvars files from terraform/tfvars-examples/ to your environment directory and modifying it to match your requirements.

Automatic GPU Taint Configuration

GPU nodes are automatically tainted with nvidia.com/gpu=present:NoSchedule based on instance family detection (g* or p* patterns). No manual configuration required.
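The detection boils down to a prefix match on the instance family — a sketch of the logic only (the actual implementation lives in the Terraform launch template user-data):

```shell
# detect_gpu_family: classify an EC2 instance type by its family prefix,
# mirroring the g*/p* pattern match described above.
detect_gpu_family() {
  case "${1%%.*}" in  # strip ".4xlarge" etc., keep the family ("g6e", "m5", ...)
    g*|p*) echo "gpu" ;;  # would receive nvidia.com/gpu=present:NoSchedule
    *)     echo "cpu" ;;  # no taint applied
  esac
}

detect_gpu_family "g6e.4xlarge"  # gpu
detect_gpu_family "m5.2xlarge"   # cpu
```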

Cluster Creation#

Step 1: Create Environment#

The nvcf-base repository includes a base Terraform environment under terraform/envs/byoc/ containing the required Terraform configuration files. Create your own environment by copying this folder:

cd nvcf-base
cp -r terraform/envs/byoc terraform/envs/<your-environment>
cp terraform/tfvars-examples/self-managed-full.tfvars terraform/envs/<your-environment>/terraform.tfvars

Replace <your-environment> with your environment name (e.g., nvcf-prod, staging). This copies all required Terraform files (main.tf, variables.tf, providers.tf, outputs.tf) along with the tfvars configuration template.

Step 2: Configure Environment#

Edit terraform/envs/<your-environment>/terraform.tfvars to match your requirements. The key sections are described below. Feel free to use this example terraform.tfvars directly to bring up an EKS cluster ready for NVCF self-hosted control plane deployment. LLS (Low Latency Streaming) is disabled by default; enable it only if you plan to deploy simulation or streaming VM workloads (see LLS Installation).

Example terraform.tfvars Configuration
terraform.tfvars#
# =============================================================================
# NVCF Fully Self-Managed Configuration (Co-located)
# =============================================================================
# This configuration deploys a cluster with BOTH:
#   - NVCF control plane (self-hosted)
#   - BYOC workloads
#
# Co-located architecture - both components in the same EKS cluster.
# =============================================================================

# -----------------------------------------------------------------------------
# REQUIRED: Cluster Identification
# -----------------------------------------------------------------------------
cluster_name    = "my-self-hosted-cluster" # Must be under 20 characters if enabling LLS (EA limitation)
cluster_version = "1.32"
region          = "us-west-2"
environment     = "production"

# -----------------------------------------------------------------------------
# VPC and Networking (larger for control plane + workloads)
# -----------------------------------------------------------------------------
# Default: null lets AWS auto-assign a non-colliding CIDR.
# Override with a specific CIDR if you need deterministic addressing:
#   vpc_cidr = "10.110.0.0/16"
vpc_cidr = null

availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]

# When vpc_cidr is null, leave these as null for automatic subnet calculation.
# When using a specific vpc_cidr, override with matching subnets, e.g.:
#   private_subnet_cidrs = ["10.110.0.0/19", "10.110.32.0/19", "10.110.64.0/19"]
#   public_subnet_cidrs  = ["10.110.101.0/24", "10.110.102.0/24", "10.110.103.0/24"]
private_subnet_cidrs = null
public_subnet_cidrs  = null

service_ipv4_cidr = "172.20.0.0/16"

create_nat_gateways = true

# -----------------------------------------------------------------------------
# Node Pool Configuration (Control Plane + BYOC)
# -----------------------------------------------------------------------------
node_pools = {
  # NVCF Control Plane Nodes
  "nvcf-control-plane" = {
    instance_type    = "m5.4xlarge"  # Control plane services need CPU/memory
    desired_capacity = 3
    max_capacity     = 5
    min_capacity     = 3
    labels = {
      "node-type" = "control-plane"
      "workload"  = "nvcf-control-plane"
      "nvcf.nvidia.com/workload" = "control-plane"
    }
  },

  # Compute nodes for BYOC workloads
  "compute" = {
    instance_type    = "m5.2xlarge"
    desired_capacity = 3
    max_capacity     = 10
    min_capacity     = 2
    labels = {
      "node-type" = "compute"
      "workload"  = "byoc"
    }
  },

  # GPU nodes for BYOC workloads
  # Change to appropriate GPU instance type for your workload. For single-GPU simulation workloads, this should be g6e.4xlarge.
  # For very basic workloads to test the stack, we recommend g5.4xlarge (A10G) or for inference workloads, A100, H100 or better.
  # min_capacity is 1 because the NVCF cluster agent (NVCA) will not be able to start if there are no GPU nodes.
  "gpu" = {
    instance_type    = "g6e.4xlarge"
    desired_capacity = 2
    max_capacity     = 8
    min_capacity     = 1
    labels = {
      "node-type"      = "gpu"
      "nvidia.com/gpu" = "true"
      "workload"       = "byoc-gpu"
    }
  },

  # Cassandra nodes for control plane storage
  "cassandra" = {
    instance_type    = "r5.2xlarge"  # Memory-optimized for database
    desired_capacity = 3
    max_capacity     = 5
    min_capacity     = 3
    labels = {
      "node-type" = "storage"
      "workload"  = "cassandra"
      "nvcf.nvidia.com/workload" = "cassandra"
    }
  },

  # OpenBao nodes for secrets management
  "openbao" = {
    instance_type    = "m5.xlarge"
    desired_capacity = 3
    max_capacity     = 3
    min_capacity     = 3
    labels = {
      "node-type" = "security"
      "workload"  = "openbao"
      "nvcf.nvidia.com/workload" = "vault"
    }
  }
}

# Storage configuration (larger for control plane data)
node_root_volume_size     = 100  # GB for control plane nodes
gpu_node_root_volume_size = 250  # GB for GPU nodes

# AMI Configuration
# Default (null) automatically discovers the latest Ubuntu 22.04 EKS-optimized AMI for your region
# This is RECOMMENDED for most deployments (always uses latest security patches)
# This determines the base OS image for the EKS nodes.
node_ami_id = null

# Advanced: Pin a specific AMI for compliance/reproducibility
# NOTE: AMI IDs are region-specific. Examples:
#   us-west-2: ami-0bce1583264e581a6
#   us-east-1: ami-0e70225fadb23da91
#   us-east-2: ami-0a12b3c4d5e6f7890
# Uncomment and update for your region:
# node_ami_id = "ami-0bce1583264e581a6"

# SSH access (recommended for control plane troubleshooting)
# ssh_public_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExample..."

# -----------------------------------------------------------------------------
# Feature Flags
# -----------------------------------------------------------------------------

# Set to true to create ECR repositories and copy NVCF images from NGC
# IMPORTANT: Requires NGC_API_KEY to be set in your environment
create_sm_ecr_repos = true

# =============================================================================
# Additional Configuration (Optional)
# =============================================================================

# Observability - OPTIONAL, DEPRECATED
# WARNING: The CloudWatch Observability addon is disabled to avoid conflicts
# with the stack bring-up.
enable_cloudwatch_observability = false

# S3 Buckets - OPTIONAL, REQUIRED for DDCS and UCC (Simulation components)
create_s3_buckets = true
s3_bucket_name    = "my-self-hosted-data" # REPLACE: Must be globally unique

# Autoscaling (important for handling varying workload)
enable_autoscaling = true

# -----------------------------------------------------------------------------
# Advanced Autoscaling Configuration (optional)
# -----------------------------------------------------------------------------
# Uncomment and customize for fine-grained control

# autoscaling_cooldown_period = 300
# autoscaling_polling_interval = 30
# autoscaling_scale_up_threshold = 70
# autoscaling_scale_down_threshold = 30
# gpu_autoscaling_enabled = true
# gpu_autoscaling_min_nodes = 0
# gpu_autoscaling_max_nodes = 10
# compute_autoscaling_enabled = true
# compute_autoscaling_min_nodes = 2
# compute_autoscaling_max_nodes = 15
# autoscaling_metrics = ["CPUUtilization", "MemoryUtilization"]
# enable_spot_instances = false
# spot_instance_percentage = 0
# enable_predictive_scaling = false

# -----------------------------------------------------------------------------
# Tags
# -----------------------------------------------------------------------------
tags = {
  Environment  = "production"
  Project      = "nvcf-self-hosted"
  ManagedBy    = "terraform"
  Deployment   = "co-located"
  CostCenter   = "engineering"
  Owner        = "platform-team"
  Architecture = "self-hosted-full"
}


Note

If you plan on using NVCF streaming functions, cluster_name must be less than 20 characters. Double-check this before proceeding; otherwise you will need to destroy and recreate the cluster.

Warning

AMI IDs are region-specific. The sample configuration uses node_ami_id = null which automatically discovers the correct EKS-optimized AMI for your region. This is the recommended setting.

If you need to pin a specific AMI (for compliance or reproducibility), you must use an AMI ID that exists in your target region. Using an AMI from a different region will cause terraform apply to fail with “image id does not exist” errors. See AMI Does Not Exist Error in Troubleshooting.

ECR Registry Image Mirroring

For ECR users, this Terraform module can automatically mirror all required NVCF artifacts from NGC:

create_sm_ecr_repos = true  # Enable automated mirroring

Important

Requires NGC_API_KEY environment variable set before running terraform apply. Generate this key from the nvcf-onprem organization at https://org.ngc.nvidia.com/setup/api-keys.

See Recommended for ECR Users: Automated ECR Mirroring for details on what’s included (control plane, LLS, worker components) and what’s not (simulation cache, custom streaming apps).

If you’re not using ECR or prefer manual mirroring, set create_sm_ecr_repos = false and follow Image Mirroring.

GPU Node Configuration

For GPU workloads, you must set the appropriate GPU instance type in the terraform.tfvars configuration. NVCF supports all GPU types supported by the NVIDIA GPU Operator. Ensure the instance type is available in your chosen region and availability zones (specified in availability_zones).

Note

For single-GPU simulation workloads, this should be g6e.4xlarge or better.

"gpu" = {
   instance_type    = "g6e.4xlarge"  # Change to appropriate GPU instance type for your workload.
   desired_capacity = 2
   max_capacity     = 8
   min_capacity     = 1
   labels = {
      "node-type"      = "gpu"
      "nvidia.com/gpu" = "true"
      "workload"       = "byoc-gpu"
   }
}

Deploying to Different AWS Regions

If deploying to a region other than us-west-2, you must update these three variables:

Variable             Required Change                  Example
region               Target AWS region                "us-east-1"
availability_zones   Valid AZs for that region        ["us-east-1a", "us-east-1b", "us-east-1c"]
node_ami_id          Set to null for auto-detection   null

Why these are required:

  • Availability zones are region-specific - us-west-2a doesn’t exist in us-east-1

  • AMI IDs are region-specific - setting to null auto-detects the latest EKS-optimized AMI for your region

  • Region determines resource location - all AWS resources will be created in this region

Example for US-East-1:

region = "us-east-1"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
node_ami_id = null  # Auto-detects correct AMI for us-east-1

To find availability zones for your region:

aws ec2 describe-availability-zones --region <your-region> \
  --query 'AvailabilityZones[].ZoneName' --output text

For detailed guidance in any region, see nvcf-base/terraform/tfvars-examples/README.md.

Step 3: Initialize and Validate#

Initialize Terraform and validate the configuration.

cd terraform/envs/<your-environment>
terraform init
terraform validate

Warning

  • VPC Subnets (availability_zones, private_subnet_cidrs, public_subnet_cidrs) - MUST span at least 2 AZs (AWS EKS requirement for high availability)

  • Node Placement (node_availability_zones) - MUST use only 1 AZ (LLS limitation)

Configuration:

# VPC subnets - keep multiple AZs (AWS EKS requirement)
availability_zones = [
  "us-west-2a",
  "us-west-2b",
  "us-west-2c"
]

Do NOT change availability_zones to a single zone; this will cause Terraform to fail with a “subnets not in at least two different availability zones” error.

Step 4: Deploy Cluster#

Expected Duration: 30-45 minutes

What gets deployed:

  • VPC with public/private subnets across 3 AZs

  • EKS control plane

  • Node pool configuration

  • IAM roles and policies

  • Security groups

  • S3 buckets (if enabled)

  1. Ensure you are in the environment directory and have run terraform init (see Step 3). Review the deployment plan.

terraform plan

Note

Review the plan output to verify expected resources will be created based on your configuration. Key items to check:

  • Node pools: Verify correct number and instance types (e.g., 5 node pools for self-managed deployment with optional caching components)

  • VPC/Networking: Confirm subnets match your CIDR configuration

  • S3 buckets: If create_s3_buckets = true, verify bucket name is correct

  2. Apply the configuration.

terraform apply

Verify Deployment#

  1. After deployment completes, configure kubectl:

# Replace <region> and <cluster-name> with your values from terraform.tfvars
aws eks update-kubeconfig \
  --region <region> \
  --name <cluster-name>

Note

Use the same region and cluster_name values from your terraform.tfvars configuration.

  2. Verify cluster health:

# Check all nodes are Ready
kubectl get nodes

# Verify GPU taints are applied automatically
kubectl get nodes -o=custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels.node\.kubernetes\.io/instance-type,TAINTS:.spec.taints"

Example output for GPU nodes (should match your GPU instance type):

NAME                             INSTANCE      TAINTS
ip-10-120-x-x.compute.internal   g6e.4xlarge   [map[effect:NoSchedule key:nvidia.com/gpu value:present]]

Step 5: Deploy GPU Operator#

The NVIDIA GPU Operator is required for GPU workloads. It installs GPU drivers, device plugins, and monitoring components on GPU nodes.

  1. Set NGC credentials:

export NGC_API_KEY="nvapi-xxxxxxxxxxxxx"  # Your NGC API key

  2. Deploy the GPU Operator:

# Navigate to core-apps under the nvcf-base top-level directory
cd /path/to/nvcf-base/core-apps

helmfile apply --selector component=gpu

  3. Verify deployment is proceeding.

Expected Duration: 5-10 minutes

# Check GPU operator pods are running (if some pods are in init state, wait a few minutes)
kubectl get pods -n gpu-operator

# Verify GPU resources are advertised on nodes
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

Expected output: All pods should be Running, and GPU nodes should show available GPU count (e.g., 1 for g6e.4xlarge).
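If you script this verification, the per-node GPU counts from the command above can be checked mechanically — a sketch; only the counting helper is new here:

```shell
# count_gpu_nodes: given one allocatable-GPU value per line on stdin
# (the GPU column from the kubectl command above), count the nodes that
# advertise at least one GPU. Non-GPU nodes show "<none>" or blank.
count_gpu_nodes() {
  grep -c '^[1-9]'
}

# Example with the kubectl output piped in:
# kubectl get nodes -o custom-columns="GPU:.status.allocatable.nvidia\.com/gpu" \
#   --no-headers | count_gpu_nodes
printf '1\n<none>\n1\n' | count_gpu_nodes   # prints 2
```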

The GPU Operator installs:

  • GPU driver (NVIDIA 570.x)

  • NVIDIA device plugin

  • GPU feature discovery

  • DCGM exporter (metrics)

Next Steps#

With the infrastructure and GPU Operator deployed:

  1. Begin control plane deployment by following Control Plane Installation.

  2. Deploy optional application components (including simulation components such as DDCS, UCC, Storage API, LLS) under nvcf-base/core-apps. See orphan.

Operations#

Scaling Node Pools#

Update terraform.tfvars and reapply:

# Edit desired_capacity or max_capacity for node pools
vim terraform/envs/<your-environment>/terraform.tfvars

cd terraform/envs/<your-environment>
terraform plan
terraform apply

Upgrading GPU Operator#

Re-run the deployment command to upgrade to the latest version:

helmfile apply --selector component=gpu

Note

To upgrade optional enhancements (container caching, simulation caching, LLS), re-run the corresponding make deploy-* commands from orphan.

Adding GPU Capacity#

GPU taints are applied automatically when new GPU nodes join:

  1. Increase desired_capacity for gpu node pool in tfvars

  2. Run terraform apply

  3. New nodes will automatically receive GPU taints

  4. Verify: kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"

Decommissioning#

cd terraform/envs/<your-environment>
terraform destroy

Warning

This destroys all cluster resources.

Troubleshooting#

GPU Taints Not Applied#

Symptoms: GPU nodes do not have nvidia.com/gpu taint

Diagnosis:

# SSH to GPU node
ssh ubuntu@<gpu-node-ip>

# Check cloud-init logs
sudo cat /var/log/cloud-init-output.log | grep -E "IMDSv2|GPU|TAINT"

Expected output:

DEBUG: Obtained IMDSv2 token
DEBUG: Instance Type: g5.12xlarge
DEBUG: Instance Family: g5
DEBUG: Matched GPU family (g* or p*) - adding GPU taint
DEBUG: Added GPU taint for GPU instance family: nvidia.com/gpu=present:NoSchedule

Resolution:

  • Verify instance type starts with g or p

  • Check launch template user-data rendered correctly

  • Terminate node and let ASG create replacement

AMI Does Not Exist Error#

Symptom: During terraform apply, Auto Scaling Group creation fails with:

Error: creating Auto Scaling Group (my-cluster-gpu): ValidationError: You must use a
valid fully-formed launch template. The image id '[ami-0bce1583264e581a6]' does not exist

Cause: The node_ami_id is set to an AMI that doesn’t exist in your target region. AMI IDs are region-specific — an AMI available in us-west-2 will not exist in ap-south-1 or other regions.

Resolution:

  1. Recommended: Use automatic AMI discovery

    Set node_ami_id = null in your terraform.tfvars. This automatically discovers the latest EKS-optimized Ubuntu AMI for your region:

    node_ami_id = null  # Recommended - auto-discovers correct AMI for your region
    
  2. If you must pin a specific AMI, find the correct AMI for your region:

    # Replace <your-region> and <k8s-version> with your values
    aws ec2 describe-images --region <your-region> \
      --owners amazon \
      --filters "Name=name,Values=ubuntu-eks/k8s_<k8s-version>/images/*x86_64*" \
      --query 'Images | sort_by(@, &CreationDate) | [-1].[ImageId,Name]' \
      --output text
    
    # Example for us-west-2 with Kubernetes 1.32:
    aws ec2 describe-images --region us-west-2 \
      --owners amazon \
      --filters "Name=name,Values=ubuntu-eks/k8s_1.32/images/*x86_64*" \
      --query 'Images | sort_by(@, &CreationDate) | [-1].[ImageId,Name]' \
      --output text
    
  3. After fixing, run terraform apply again.

Tip

Using node_ami_id = null is strongly recommended as it ensures you always get the latest security patches and eliminates region compatibility issues.

Pods Not Scheduling on GPU Nodes#

Symptoms: Pods with GPU requests stay in Pending state

Cause: Missing toleration for GPU taint

Resolution: Add toleration to pod spec:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

ImagePullBackOff Errors#

Symptoms: Pods fail to pull images from nvcr.io

Resolution:

# Verify NGC_API_KEY is set
echo $NGC_API_KEY

# Recreate image pull secret
kubectl create secret docker-registry nvcr-creds \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  --namespace=<namespace> \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart affected pods
kubectl rollout restart deployment <deployment-name> -n <namespace>

Resuming Installation After AWS Credential Expiration#

Symptoms: Terraform fails during long-running operations (e.g., EKS cluster creation) with authentication errors, or you see “Resource already exists” errors when trying to resume

Cause: AWS credentials expired during Terraform provisioning. AWS credentials typically expire after a certain period (often 1 hour for temporary credentials), which can occur during long-running deployments like EKS cluster creation.

Resolution:

When AWS credentials expire mid-deployment, Terraform may lose track of resources it created. You can recover by removing the resource from Terraform state and re-importing it.

Example: EKS Cluster Creation

# Remove the EKS cluster from Terraform state
terraform state rm module.eks.aws_eks_cluster.main

# Re-import the existing cluster
terraform import module.eks.aws_eks_cluster.main <cluster_name>

# Continue with your terraform operation
terraform apply

General Process:

  1. Identify the resource that was partially created (check AWS Console)

  2. Remove it from Terraform state: terraform state rm <resource_address>

  3. Import the existing resource: terraform import <resource_address> <resource_id>

  4. Continue with terraform apply

Tip

To find the correct resource address, use terraform state list to see all resources in your state file.

Note

This approach works for any Terraform resource that exists in AWS but Terraform has lost track of due to credential expiration. Investigate this process whenever you encounter “Resource already exists” errors after a failed deployment.

CSI Driver or CloudWatch Add-on Installation Timeouts#

Symptoms: Terraform times out when deploying CSI drivers (SMB CSI Driver) or CloudWatch Observability add-on, or add-on installation appears to hang

Common Causes:

  • AWS credential expiration during long-running deployments

  • Network connectivity issues between EKS and AWS services

  • IAM permissions issues preventing add-on installation

  • Resource limits or quota exhaustions in your AWS account

Resolution Steps:

  1. Retry the deployment

    AWS credential expiration is the most common cause. See Resuming Installation After AWS Credential Expiration for recovery steps.

  2. If the issue persists, consult AWS documentation

    AWS provides detailed troubleshooting guides for EKS add-on installations in the Amazon EKS documentation.

  3. Verify add-on status via AWS CLI

    # Check CloudWatch add-on status
    aws eks describe-addon \
      --cluster-name <cluster-name> \
      --addon-name amazon-cloudwatch-observability \
      --region <region>
    
    # Check all add-ons
    aws eks list-addons \
      --cluster-name <cluster-name> \
      --region <region>
    
  4. Last resort: Destroy and recreate

    If troubleshooting steps fail to resolve the issue, you may need to start fresh. See the Decommissioning section to destroy the cluster, then return to Step 4: Deploy Cluster to recreate it.

Tip

Prevention: Use long-lived AWS credentials or AWS IAM roles when possible to avoid credential expiration during deployments. If using temporary credentials, ensure they have sufficient validity period (at least 2-3 hours) before starting a cluster deployment.