EKS Cluster Terraform (Optional)
This guide is completely optional. If you already have a Kubernetes cluster with GPU nodes, you can skip this page and proceed directly to helmfile-installation to install the NVCF control plane.
This guide provides instructions for deploying the Amazon EKS infrastructure foundation for a fully self-hosted NVIDIA Cloud Functions (NVCF) deployment using Terraform. This includes:
- Amazon EKS cluster with dedicated node pools for various workloads
- GPU nodes with automatic taint configuration for inference workloads
- Core infrastructure components (VPC, subnets, IAM roles, security groups)
- NVIDIA GPU Operator deployment (required for GPU workloads)
- Infrastructure prerequisites for optional enhancements (LLS streaming, simulation caching)
This guide covers infrastructure deployment only. Some Terraform options configure AWS resources (IAM policies, S3 buckets) required by optional enhancements deployed later. See self-hosted-optional-enhancements for details on these components.
Prerequisites
Required Tools
- Terraform >= 1.0.0
- AWS CLI configured with credentials
- kubectl >= 1.28
- helm >= 3.17
- helmfile >= 0.150
- helm-diff plugin >=3.11
- helm-secrets plugin >=4.7.4
- skopeo (required only if using create_sm_ecr_repos = true for automated ECR mirroring)
Required Access
- AWS Account with permissions for EKS, VPC, EC2, IAM, S3
- NGC API Key from ngc.nvidia.com authenticated with the nvcf-onprem organization. See self-hosted-image-mirroring for more details on required NGC Service Key scopes.
- The nvcf-base repository must be downloaded to your local machine (see download-nvcf-base).
Configure AWS Credentials
Terraform requires valid AWS credentials to create resources. Configure your AWS credentials using one of the following methods before running any terraform commands:
- AWS CLI (recommended)
- Environment variables
- AWS profile
For example (values in angle brackets are placeholders):
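```bash
# Interactive configuration with the AWS CLI (recommended)
aws configure

# Alternatively, export credentials as environment variables
export AWS_ACCESS_KEY_ID=<access-key-id>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>
export AWS_SESSION_TOKEN=<session-token>   # only needed for temporary credentials

# Or select a named profile from ~/.aws/credentials
export AWS_PROFILE=<profile-name>
```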
Verify AWS credentials are configured correctly:
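```bash
aws sts get-caller-identity
```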
You should see output showing your AWS account ID, user ARN, and user ID. If you receive an error, your credentials are not configured correctly.
Set NGC API Key
Before proceeding, set your NGC API key as an environment variable. This is required for automated ECR mirroring and GPU Operator deployment:
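```bash
export NGC_API_KEY=<your-ngc-api-key>
```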
Network Planning
- VPC CIDR: /16 recommended for production
- Service CIDR: /16, must not overlap with the VPC CIDR
- Egress is required for third-party registry access to pull both service artifacts and function containers
Node Pool Design
The Terraform configuration supports flexible node pool designs for different deployment scenarios:
- self-managed: 5 node pools (compute, GPU, control-plane, database, and secrets management). The extra compute node pool primarily supports optional simulation components and can be disabled for inference-only self-hosted NVCF.
- byoc: 2 node pools (compute and GPU). If deploying with this configuration, nodeSelectors must be disabled in the self-hosted stack environment configuration file.
Please refer to the codebase nvcf-base/terraform/tfvars-examples for the full list of node configurations and deployment options. Though the byoc configuration may support a self-hosted stack deployment, it is primarily meant for BYOC cluster deployments with NVIDIA-managed NVCF control plane services (see cluster-setup-management).
You can customize node pools (instance types, capacities, and configurations) by copying one of the example tfvars files from terraform/tfvars-examples/ to your environment directory and modifying it to match your requirements.
Automatic GPU Taint Configuration
GPU nodes are automatically tainted with nvidia.com/gpu=present:NoSchedule based on instance family detection (g* or p* patterns). No manual configuration required.
Cluster Creation
Step 1: Create Environment
The nvcf-base repository includes a base Terraform environment under terraform/envs/byoc/ containing the required Terraform configuration files. Create your own environment by copying this folder:
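```bash
# From the root of the nvcf-base repository
cp -r terraform/envs/byoc terraform/envs/<your-environment>
```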
Replace <your-environment> with your environment name (e.g., nvcf-prod, staging). This copies all required Terraform files (main.tf, variables.tf, providers.tf, outputs.tf) along with the tfvars configuration template.
Step 2: Configure Environment
Edit terraform/envs/<your-environment>/terraform.tfvars to match your requirements. The key sections are described below. Feel free to use this example terraform.tfvars directly to bring up an EKS cluster ready for NVCF self-hosted control plane deployment. LLS (Low Latency Streaming) is disabled by default; enable it only if you plan to deploy simulation or streaming VM workloads (see self-hosted-lls-installation).
If you plan on using NVCF streaming functions, cluster_name must be less than 20 characters. Double-check this before proceeding; otherwise you will need to destroy and redeploy the cluster.
AMI IDs are region-specific. The sample configuration uses node_ami_id = null which automatically discovers the correct EKS-optimized AMI for your region. This is the recommended setting.
If you need to pin a specific AMI (for compliance or reproducibility), you must use an AMI ID that exists in your target region. Using an AMI from a different region will cause terraform apply to fail with “image id does not exist” errors. See [AMI Does Not Exist Error] in Troubleshooting.
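A minimal illustrative sketch of the key terraform.tfvars settings discussed in this guide (your copied template contains the full set of options, and some variable names may differ):

```hcl
cluster_name            = "nvcf-prod"       # keep under 20 characters if using streaming functions
region                  = "us-west-2"
availability_zones      = ["us-west-2a", "us-west-2b", "us-west-2c"]
node_availability_zones = ["us-west-2a"]    # single AZ (LLS limitation)
node_ami_id             = null              # auto-discover the EKS-optimized AMI (recommended)
create_sm_ecr_repos     = true              # automated ECR mirroring (requires NGC_API_KEY)
```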
ECR Registry Image Mirroring
For ECR users, this Terraform module can automatically mirror all required NVCF artifacts from NGC:
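```hcl
create_sm_ecr_repos = true
```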
Requires NGC_API_KEY environment variable set before running terraform apply. Generate this key from the nvcf-onprem organization at https://org.ngc.nvidia.com/setup/api-keys.
See ecr-automated-mirroring for details on what’s included (control plane, LLS, worker components) and what’s not (simulation cache, custom streaming apps).
If you’re not using ECR or prefer manual mirroring, set create_sm_ecr_repos = false and follow self-hosted-image-mirroring.
GPU Node Configuration
For GPU workloads, you must set the appropriate GPU instance type in the terraform.tfvars configuration. NVCF supports all GPU types supported by the NVIDIA GPU Operator.
Ensure the instance type is available in your chosen region and availability zones (specified in availability_zones).
For single-GPU simulation workloads, this should be g6e.4xlarge or better.
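As an illustration, the tfvars entry might look like the following; the exact variable name for the GPU node pool's instance type depends on your template (gpu_node_instance_type here is hypothetical):

```hcl
gpu_node_instance_type = "g6e.4xlarge"   # hypothetical variable name; check your tfvars template
```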
Deploying to Different AWS Regions
If deploying to a region other than us-west-2, you must update these three variables:
Why these are required:
- Availability zones are region-specific: us-west-2a doesn't exist in us-east-1
- AMI IDs are region-specific: setting node_ami_id to null auto-detects the latest EKS-optimized AMI for your region
- Region determines resource location: all AWS resources will be created in this region
Example for US-East-1:
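Assuming the variable names used elsewhere in this guide:

```hcl
region                  = "us-east-1"
availability_zones      = ["us-east-1a", "us-east-1b", "us-east-1c"]
node_availability_zones = ["us-east-1a"]   # single AZ (LLS limitation)
node_ami_id             = null             # auto-detect the EKS-optimized AMI for us-east-1
```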
To find availability zones for your region:
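```bash
aws ec2 describe-availability-zones --region us-east-1 \
  --query 'AvailabilityZones[].ZoneName' --output text
```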
For detailed guidance in any region, see nvcf-base/terraform/tfvars-examples/README.md.
Step 3: Initialize and Validate
Initialize Terraform and validate the configuration.
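```bash
cd terraform/envs/<your-environment>
terraform init
terraform validate
```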
- VPC Subnets (availability_zones, private_subnet_cidrs, public_subnet_cidrs): MUST span at least 2 AZs (AWS EKS requirement for high availability)
- Node Placement (node_availability_zones): MUST use only 1 AZ (LLS limitation)
Configuration:
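```hcl
# VPC subnets: at least two AZs (EKS requirement)
availability_zones      = ["us-west-2a", "us-west-2b", "us-west-2c"]
# Node placement: exactly one AZ (LLS limitation)
node_availability_zones = ["us-west-2a"]
```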
Do NOT change availability_zones to a single zone - this will cause Terraform to fail with “subnets not in at least two different availability zones” error.
Step 4: Deploy Cluster
Expected Duration: 30-45 minutes
What gets deployed:
- VPC with public/private subnets across 3 AZs
- EKS control plane
- Node pool configuration
- IAM roles and policies
- Security groups
- S3 buckets (if enabled)
Ensure you are in the environment directory and have run terraform init (see Step 3), then review the deployment plan:
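```bash
terraform plan
```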
Review the plan output to verify expected resources will be created based on your configuration. Key items to check:
- Node pools: Verify correct number and instance types (e.g., 5 node pools for self-managed deployment with optional caching components)
- VPC/Networking: Confirm subnets match your CIDR configuration
- S3 buckets: If create_s3_buckets = true, verify the bucket name is correct
Apply the configuration:
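```bash
terraform apply
```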
Verify Deployment
After deployment completes, configure kubectl:
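```bash
aws eks update-kubeconfig --region <region> --name <cluster_name>
```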
Use the same region and cluster_name values from your terraform.tfvars configuration.
Verify cluster health:
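```bash
kubectl get nodes -L node.kubernetes.io/instance-type
```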
GPU nodes should show a Ready status, and their instance-type label should match your configured GPU instance type (e.g., g6e.4xlarge).
Step 5: Deploy GPU Operator
The NVIDIA GPU Operator is required for GPU workloads. It installs GPU drivers, device plugins, and monitoring components on GPU nodes.
- Set NGC credentials
- Deploy the GPU Operator
- Verify the deployment is proceeding (see the sketch below)
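A sketch of these steps, assuming NVIDIA's public Helm chart (nvcf-base may provide its own wrapper for this deployment):

```bash
# 1. NGC credentials for pulling GPU Operator images
export NGC_API_KEY=<your-ngc-api-key>

# 2. Deploy the GPU Operator from NVIDIA's public Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Watch the rollout
kubectl get pods -n gpu-operator --watch
```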
Expected Duration: 5-10 minutes
Expected output: All pods should be Running, and GPU nodes should show available GPU count (e.g., 1 for g6e.4xlarge).
The GPU Operator installs:
- GPU driver (NVIDIA 570.x)
- NVIDIA device plugin
- GPU feature discovery
- DCGM exporter (metrics)
Next Steps
With the infrastructure and GPU Operator deployed:
- Begin control plane deployment by following helmfile-installation.
- Deploy optional application components (including simulation components such as DDCS, UCC, Storage API, LLS) under nvcf-base/core-apps. See self-hosted-optional-enhancements.
Operations
Scaling Node Pools
Update terraform.tfvars and reapply:
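A sketch, assuming a desired-capacity variable for the target node pool (the exact name depends on your tfvars template):

```hcl
# terraform.tfvars -- illustrative variable name
gpu_node_desired_capacity = 4
```

```bash
terraform apply
```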
Upgrading GPU Operator
Re-run the deployment command to upgrade to the latest version:
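Following the generic Helm sketch from Step 5:

```bash
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator
```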
To upgrade optional enhancements (container caching, simulation caching, LLS), re-run the corresponding make deploy-* commands from self-hosted-optional-enhancements.
Adding GPU Capacity
GPU taints are applied automatically when new GPU nodes join:
- Increase desired_capacity for the GPU node pool in tfvars
- Run terraform apply
- New nodes will automatically receive GPU taints
- Verify: kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
Decommissioning
This destroys all cluster resources.
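From your environment directory:

```bash
cd terraform/envs/<your-environment>
terraform destroy
```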
Troubleshooting
GPU Taints Not Applied
Symptoms: GPU nodes do not have nvidia.com/gpu taint
Diagnosis:
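```bash
kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
```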
Expected output: each GPU node should list the nvidia.com/gpu=present:NoSchedule taint.
Resolution:
- Verify the instance type starts with g or p
- Check that the launch template user-data rendered correctly
- Terminate the node and let the ASG create a replacement
AMI Does Not Exist Error
Symptom: During terraform apply, Auto Scaling Group creation fails with an error stating that the requested image id does not exist.
Cause: The node_ami_id is set to an AMI that doesn’t exist in your target region. AMI IDs are region-specific — an AMI available in us-west-2 will not exist in ap-south-1 or other regions.
Resolution:
- Recommended: use automatic AMI discovery. Set node_ami_id = null in your terraform.tfvars; this automatically discovers the latest EKS-optimized Ubuntu AMI for your region.
- If you must pin a specific AMI, find the correct AMI for your region (see the sketch below).
- After fixing, run terraform apply again.
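One way to look up Canonical's EKS-optimized Ubuntu AMIs in a target region (099720109477 is Canonical's public owner ID; adjust the region, and narrow the name filter to your Kubernetes version):

```bash
aws ec2 describe-images \
  --region ap-south-1 \
  --owners 099720109477 \
  --filters "Name=name,Values=ubuntu-eks/*" \
  --query 'sort_by(Images,&CreationDate)[-1].{ImageId:ImageId,Name:Name}' \
  --output table
```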
Using node_ami_id = null is strongly recommended as it ensures you always get the latest security patches and eliminates region compatibility issues.
Pods Not Scheduling on GPU Nodes
Symptoms: Pods with GPU requests stay in Pending state
Cause: Missing toleration for GPU taint
Resolution: Add toleration to pod spec:
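The toleration must match the automatically applied taint nvidia.com/gpu=present:NoSchedule:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```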
ImagePullBackOff Errors
Symptoms: Pods fail to pull images from nvcr.io
Resolution: Verify your NGC API key is valid and that the image pull secret for nvcr.io exists in the affected namespace; recreating the secret is a common fix.
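A remediation sketch with an illustrative secret name (for nvcr.io token authentication, the username is the literal string $oauthtoken):

```bash
kubectl create secret docker-registry ngc-registry \
  --namespace <namespace> \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"
```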
Resuming Installation After AWS Credential Expiration
Symptoms: Terraform fails during long-running operations (e.g., EKS cluster creation) with authentication errors, or you see “Resource already exists” errors when trying to resume
Cause: AWS credentials expired during Terraform provisioning. AWS credentials typically expire after a certain period (often 1 hour for temporary credentials), which can occur during long-running deployments like EKS cluster creation.
Resolution:
When AWS credentials expire mid-deployment, Terraform may lose track of resources it created. You can recover by removing the resource from Terraform state and re-importing it.
Example: EKS Cluster Creation
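A hypothetical recovery for an EKS cluster that exists in AWS but was lost from state; the resource address below is illustrative (an EKS cluster imports by its name):

```bash
# Find the address Terraform uses for the cluster
terraform state list | grep aws_eks_cluster

# Remove the stale entry, then re-import the live cluster by name
terraform state rm 'aws_eks_cluster.this'               # illustrative address
terraform import 'aws_eks_cluster.this' <cluster_name>

# Resume the deployment
terraform apply
```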
General Process:
- Identify the resource that was partially created (check the AWS Console)
- Remove it from Terraform state: terraform state rm <resource_address>
- Import the existing resource: terraform import <resource_address> <resource_id>
- Continue with terraform apply
To find the correct resource address, use terraform state list to see all resources in your state file.
This approach works for any Terraform resource that exists in AWS but that Terraform has lost track of due to credential expiration. Use this process whenever you encounter "Resource already exists" errors after a failed deployment.
CSI Driver or CloudWatch Add-on Installation Timeouts
Symptoms: Terraform times out when deploying CSI drivers (SMB CSI Driver) or CloudWatch Observability add-on, or add-on installation appears to hang
Common Causes:
- AWS credential expiration during long-running deployments
- Network connectivity issues between EKS and AWS services
- IAM permissions issues preventing add-on installation
- Resource limits or quota exhaustion in your AWS account
Resolution Steps:
- Retry the deployment. AWS credential expiration is the most common cause; see [Resuming Installation After AWS Credential Expiration] for recovery steps.
- If the issue persists, consult AWS documentation. AWS provides detailed troubleshooting guides for add-on installations:
  - CloudWatch Observability Add-on: Troubleshooting CloudWatch Observability EKS Add-on
  - CSI Driver Issues: Amazon EKS CSI Driver Troubleshooting
- Verify add-on status via the AWS CLI (see the sketch after this list).
- Last resort: destroy and recreate. If the troubleshooting steps fail to resolve the issue, you may need to start fresh. See the [Decommissioning] section to destroy the cluster, then return to [Step 4: Deploy Cluster] to recreate it.
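To check add-on status from the CLI (the addon name shown is for the CloudWatch Observability add-on):

```bash
aws eks list-addons --cluster-name <cluster_name>
aws eks describe-addon \
  --cluster-name <cluster_name> \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.{Status:status,Issues:health.issues}'
```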
Prevention: Use long-lived AWS credentials or AWS IAM roles when possible to avoid credential expiration during deployments. If using temporary credentials, ensure they have sufficient validity period (at least 2-3 hours) before starting a cluster deployment.