EKS Cluster Terraform (Optional)#
Warning
This guide is completely optional. If you already have a Kubernetes cluster with GPU nodes, you can skip this page and proceed directly to Helmfile Installation to install the NVCF control plane.
This guide provides instructions for deploying the Amazon EKS infrastructure foundation for a fully self-hosted NVIDIA Cloud Functions (NVCF) deployment using Terraform. This includes:
Amazon EKS cluster with dedicated node pools for various workloads
GPU nodes with automatic taint configuration for inference workloads
Core infrastructure components (VPC, subnets, IAM roles, security groups)
NVIDIA GPU Operator deployment (required for GPU workloads)
Infrastructure prerequisites for optional enhancements (LLS streaming, simulation caching)
Note
This guide covers infrastructure deployment only. Some Terraform options configure AWS resources (IAM policies, S3 buckets) required by optional enhancements deployed later. See orphan for details on these components.
Prerequisites#
Required Tools#
Terraform >= 1.0.0
AWS CLI configured with credentials
kubectl >= 1.28
helm >= 3.17
helmfile >= 0.150
helm-diff plugin >=3.11
helm-secrets plugin >=4.7.4
skopeo (required only if using create_sm_ecr_repos = true for automated ECR mirroring)
Required Access#
AWS Account with permissions for EKS, VPC, EC2, IAM, S3
NGC API Key from ngc.nvidia.com authenticated with the nvcf-onprem organization - see Image Mirroring for more details on required NGC Service Key scopes
The nvcf-base repository must be downloaded to your local machine (see Downloading nvcf-base)
Configure AWS Credentials#
Terraform requires valid AWS credentials to create resources. Configure your AWS credentials using one of the following methods before running any terraform commands:
Configure credentials using the AWS CLI:
# Interactive configuration
aws configure
# Or use SSO login
aws sso login --profile <profile-name>
Set credentials directly as environment variables:
export AWS_ACCESS_KEY_ID="<your-access-key>"
export AWS_SECRET_ACCESS_KEY="<your-secret-key>"
export AWS_SESSION_TOKEN="<your-session-token>" # If using temporary credentials
export AWS_REGION="<your-region>" # e.g., us-east-1
Use a named profile from ~/.aws/credentials:
export AWS_PROFILE="<profile-name>"
Verify AWS credentials are configured correctly:
aws sts get-caller-identity
You should see output showing your AWS account ID, user ARN, and user ID. If you receive an error, your credentials are not configured correctly.
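If you want to fail fast in scripts, the check above can be wrapped in a small pre-flight helper. This is a hypothetical convenience function, not part of nvcf-base:

```shell
# Hypothetical pre-flight helper: abort early if AWS credentials are not usable.
require_aws_credentials() {
  if ! aws sts get-caller-identity >/dev/null 2>&1; then
    echo "error: AWS credentials are missing or expired" >&2
    return 1
  fi
  # Print the account ID so you can confirm you are targeting the right account.
  echo "AWS credentials OK for account $(aws sts get-caller-identity --query Account --output text)"
}
```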
Set NGC API Key#
Before proceeding, set your NGC API key as an environment variable. This is required for automated ECR mirroring and GPU Operator deployment:
export NGC_API_KEY="nvapi-xxxxxxxxxxxxx" # Replace with your NGC API key
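A quick format sanity check can catch copy/paste mistakes before Terraform runs. This is a hypothetical helper; it only checks the nvapi- prefix shown in this guide:

```shell
# Hypothetical sanity check: NGC service keys in this guide start with "nvapi-".
check_ngc_key() {
  case "$1" in
    nvapi-*) echo "NGC_API_KEY format looks valid" ;;
    *)       echo "warning: NGC_API_KEY does not start with nvapi-" >&2; return 1 ;;
  esac
}
# Usage: check_ngc_key "$NGC_API_KEY"
```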
Network Planning#
VPC CIDR: /16 recommended for production
Service CIDR: /16, must not overlap with VPC CIDR
Egress is required for third-party registry access to pull both service artifacts and function containers
Node Pool Design#
The Terraform configuration supports flexible node pool designs for different deployment scenarios:
self-managed: 5 node pools (compute, GPU, control-plane, database, and secrets management); the extra compute node pool primarily supports optional simulation components and can be disabled for inference-only self-hosted NVCF
byoc: 2 node pools (compute and GPU) - if deploying with this configuration, nodeSelectors must be disabled in the self-hosted stack environment configuration file
Please refer to the codebase nvcf-base/terraform/tfvars-examples for the full list of node configurations and deployment options. Though the byoc configuration may support a self-hosted stack deployment, it is primarily meant for BYOC cluster deployments with NVIDIA-managed NVCF control plane services (see Overview).
Note
You can customize node pools (instance types, capacities, and configurations) by copying one of the example tfvars files from terraform/tfvars-examples/ to your environment directory and modifying it to match your requirements.
Automatic GPU Taint Configuration
GPU nodes are automatically tainted with nvidia.com/gpu=present:NoSchedule based on instance family detection (g* or p* patterns). No manual configuration required.
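The family match can be sketched roughly as follows. This is a simplified illustration, not the actual user-data script; the real script obtains the instance type from IMDSv2 rather than an argument:

```shell
# Simplified sketch of the GPU-family match performed by node user-data.
# The instance type is passed as an argument here for illustration only.
is_gpu_family() {
  case "${1%%.*}" in
    g*|p*) return 0 ;;  # e.g. g5, g6e, p4d -> node gets nvidia.com/gpu=present:NoSchedule
    *)     return 1 ;;
  esac
}
# Example: is_gpu_family g6e.4xlarge && echo "GPU taint will be applied"
```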
Cluster Creation#
Step 1: Create Environment#
The nvcf-base repository includes a base Terraform environment under terraform/envs/byoc/ containing the required Terraform configuration files. Create your own environment by copying this folder:
cd nvcf-base
cp -r terraform/envs/byoc terraform/envs/<your-environment>
cp terraform/tfvars-examples/self-managed-full.tfvars terraform/envs/<your-environment>/terraform.tfvars
Replace <your-environment> with your environment name (e.g., nvcf-prod, staging). This copies all required Terraform files (main.tf, variables.tf, providers.tf, outputs.tf) along with the tfvars configuration template.
Step 2: Configure Environment#
Edit terraform/envs/<your-environment>/terraform.tfvars to match your requirements. The key sections are described below. Feel free to use this example terraform.tfvars directly to bring up an EKS cluster ready for NVCF self-hosted control plane deployment. LLS (Low Latency Streaming) is disabled by default; enable it only if you plan to deploy simulation or streaming VM workloads (see LLS Installation).
Example terraform.tfvars Configuration
# =============================================================================
# NVCF Fully Self-Managed Configuration (Co-located)
# =============================================================================
# This configuration deploys a cluster with BOTH:
# - NVCF control plane (self-hosted)
# - BYOC workloads
#
# Co-located architecture - both components in the same EKS cluster.
# =============================================================================

# -----------------------------------------------------------------------------
# REQUIRED: Cluster Identification
# -----------------------------------------------------------------------------
cluster_name    = "my-self-hosted-cluster" # Must be under 20 characters if enabling LLS (EA limitation)
cluster_version = "1.32"
region          = "us-west-2"
environment     = "production"

# -----------------------------------------------------------------------------
# VPC and Networking (larger for control plane + workloads)
# -----------------------------------------------------------------------------
# Default: null lets AWS auto-assign a non-colliding CIDR.
# Override with a specific CIDR if you need deterministic addressing:
# vpc_cidr = "10.110.0.0/16"
vpc_cidr = null

availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]

# When vpc_cidr is null, leave these as null for automatic subnet calculation.
# When using a specific vpc_cidr, override with matching subnets, e.g.:
# private_subnet_cidrs = ["10.110.0.0/19", "10.110.32.0/19", "10.110.64.0/19"]
# public_subnet_cidrs  = ["10.110.101.0/24", "10.110.102.0/24", "10.110.103.0/24"]
private_subnet_cidrs = null
public_subnet_cidrs  = null

service_ipv4_cidr = "172.20.0.0/16"

create_nat_gateways = true

# -----------------------------------------------------------------------------
# Node Pool Configuration (Control Plane + BYOC)
# -----------------------------------------------------------------------------
node_pools = {
  # NVCF Control Plane Nodes
  "nvcf-control-plane" = {
    instance_type    = "m5.4xlarge" # Control plane services need CPU/memory
    desired_capacity = 3
    max_capacity     = 5
    min_capacity     = 3
    labels = {
      "node-type"                = "control-plane"
      "workload"                 = "nvcf-control-plane"
      "nvcf.nvidia.com/workload" = "control-plane"
    }
  },

  # Compute nodes for BYOC workloads
  "compute" = {
    instance_type    = "m5.2xlarge"
    desired_capacity = 3
    max_capacity     = 10
    min_capacity     = 2
    labels = {
      "node-type" = "compute"
      "workload"  = "byoc"
    }
  },

  # GPU nodes for BYOC workloads
  # Change to the appropriate GPU instance type for your workload. For single-GPU simulation workloads, this should be g6e.4xlarge.
  # For basic workloads to test the stack, we recommend g5.4xlarge (A10G); for inference workloads, A100, H100 or better.
  # min_capacity is 1 because the NVCF cluster agent (NVCA) will not be able to start if there are no GPU nodes.
  "gpu" = {
    instance_type    = "g6e.4xlarge"
    desired_capacity = 2
    max_capacity     = 8
    min_capacity     = 1
    labels = {
      "node-type"      = "gpu"
      "nvidia.com/gpu" = "true"
      "workload"       = "byoc-gpu"
    }
  },

  # Cassandra nodes for control plane storage
  "cassandra" = {
    instance_type    = "r5.2xlarge" # Memory-optimized for database
    desired_capacity = 3
    max_capacity     = 5
    min_capacity     = 3
    labels = {
      "node-type"                = "storage"
      "workload"                 = "cassandra"
      "nvcf.nvidia.com/workload" = "cassandra"
    }
  },

  # OpenBao nodes for secrets management
  "openbao" = {
    instance_type    = "m5.xlarge"
    desired_capacity = 3
    max_capacity     = 3
    min_capacity     = 3
    labels = {
      "node-type"                = "security"
      "workload"                 = "openbao"
      "nvcf.nvidia.com/workload" = "vault"
    }
  }
}

# Storage configuration (larger for control plane data)
node_root_volume_size     = 100 # GB for control plane nodes
gpu_node_root_volume_size = 250 # GB for GPU nodes

# AMI Configuration
# Default (null) automatically discovers the latest Ubuntu 22.04 EKS-optimized AMI for your region.
# This is RECOMMENDED for most deployments (always uses latest security patches).
# This determines the base OS image for the EKS nodes.
node_ami_id = null

# Advanced: Pin a specific AMI for compliance/reproducibility
# NOTE: AMI IDs are region-specific. Examples:
#   us-west-2: ami-0bce1583264e581a6
#   us-east-1: ami-0e70225fadb23da91
#   us-east-2: ami-0a12b3c4d5e6f7890
# Uncomment and update for your region:
# node_ami_id = "ami-0bce1583264e581a6"

# SSH access (recommended for control plane troubleshooting)
# ssh_public_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExample..."

# -----------------------------------------------------------------------------
# Feature Flags
# -----------------------------------------------------------------------------

# Set to true to create ECR repositories and copy NVCF images from NGC
# IMPORTANT: Requires NGC_API_KEY to be set in your environment
create_sm_ecr_repos = true

# =============================================================================
# Additional Configuration (Optional)
# =============================================================================

# Observability - OPTIONAL, DEPRECATED
# WARNING: The CloudWatch Observability addon is disabled to avoid conflicts
# with the stack bring-up.
enable_cloudwatch_observability = false

# S3 Buckets - OPTIONAL, REQUIRED for DDCS and UCC (Simulation components)
create_s3_buckets = true
s3_bucket_name    = "my-self-hosted-data" # REPLACE: Must be globally unique

# Autoscaling (important for handling varying workload)
enable_autoscaling = true

# -----------------------------------------------------------------------------
# Advanced Autoscaling Configuration (optional)
# -----------------------------------------------------------------------------
# Uncomment and customize for fine-grained control

# autoscaling_cooldown_period      = 300
# autoscaling_polling_interval     = 30
# autoscaling_scale_up_threshold   = 70
# autoscaling_scale_down_threshold = 30
# gpu_autoscaling_enabled          = true
# gpu_autoscaling_min_nodes        = 0
# gpu_autoscaling_max_nodes        = 10
# compute_autoscaling_enabled      = true
# compute_autoscaling_min_nodes    = 2
# compute_autoscaling_max_nodes    = 15
# autoscaling_metrics              = ["CPUUtilization", "MemoryUtilization"]
# enable_spot_instances            = false
# spot_instance_percentage         = 0
# enable_predictive_scaling        = false

# -----------------------------------------------------------------------------
# Tags
# -----------------------------------------------------------------------------
tags = {
  Environment  = "production"
  Project      = "nvcf-self-hosted"
  ManagedBy    = "terraform"
  Deployment   = "co-located"
  CostCenter   = "engineering"
  Owner        = "platform-team"
  Architecture = "self-hosted-full"
}
Note
If you plan on using NVCF streaming functions, cluster_name must be less than 20 characters. Double-check this before proceeding; otherwise you will need to destroy the cluster and redeploy.
Warning
AMI IDs are region-specific. The sample configuration uses node_ami_id = null which automatically discovers the correct EKS-optimized AMI for your region. This is the recommended setting.
If you need to pin a specific AMI (for compliance or reproducibility), you must use an AMI ID that exists in your target region. Using an AMI from a different region will cause terraform apply to fail with “image id does not exist” errors. See AMI Does Not Exist Error in Troubleshooting.
ECR Registry Image Mirroring
For ECR users, this Terraform module can automatically mirror all required NVCF artifacts from NGC:
create_sm_ecr_repos = true # Enable automated mirroring
Important
Requires NGC_API_KEY environment variable set before running terraform apply. Generate this key from the nvcf-onprem organization at https://org.ngc.nvidia.com/setup/api-keys.
See Recommended for ECR Users: Automated ECR Mirroring for details on what’s included (control plane, LLS, worker components) and what’s not (simulation cache, custom streaming apps).
If you’re not using ECR or prefer manual mirroring, set create_sm_ecr_repos = false and follow Image Mirroring.
GPU Node Configuration
For GPU workloads, you must set the appropriate GPU instance type in the terraform.tfvars configuration. NVCF supports all GPU types supported by the NVIDIA GPU Operator.
Ensure the instance type is available in your chosen region and availability zones (specified in availability_zones).
Note
For single-GPU simulation workloads, this should be g6e.4xlarge or better.
"gpu" = {
instance_type = "g6e.4xlarge" # Change to appropriate GPU instance type for your workload.
desired_capacity = 2
max_capacity = 8
min_capacity = 1
labels = {
"node-type" = "gpu"
"nvidia.com/gpu" = "true"
"workload" = "byoc-gpu"
}
}
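To confirm availability before applying, you can query instance type offerings with the AWS CLI. The helper function below is a hypothetical convenience wrapper around the real describe-instance-type-offerings command:

```shell
# Hypothetical helper: list the AZs where a given instance type is offered.
check_instance_offering() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}
# Example: check_instance_offering g6e.4xlarge us-west-2
```

If the output does not list all AZs in your availability_zones setting, choose a different instance type or different zones.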
Deploying to Different AWS Regions
If deploying to a region other than us-west-2, you must update these three variables:
| Variable | Required Change | Example |
|---|---|---|
| region | Target AWS region | "us-east-1" |
| availability_zones | Valid AZs for that region | ["us-east-1a", "us-east-1b", "us-east-1c"] |
| node_ami_id | Set to null | node_ami_id = null |
Why these are required:
Availability zones are region-specific - us-west-2a doesn't exist in us-east-1
AMI IDs are region-specific - setting node_ami_id to null auto-detects the latest EKS-optimized AMI for your region
Region determines resource location - all AWS resources will be created in this region
Example for US-East-1:
region = "us-east-1"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
node_ami_id = null # Auto-detects correct AMI for us-east-1
To find availability zones for your region:
aws ec2 describe-availability-zones --region <your-region> \
--query 'AvailabilityZones[].ZoneName' --output text
For detailed guidance in any region, see nvcf-base/terraform/tfvars-examples/README.md.
Step 3: Initialize and Validate#
Initialize Terraform and validate the configuration.
cd terraform/envs/<your-environment>
terraform init
terraform validate
Warning
VPC Subnets (availability_zones, private_subnet_cidrs, public_subnet_cidrs) - MUST span at least 2 AZs (AWS EKS requirement for high availability)
Node Placement (node_availability_zones) - MUST use only 1 AZ (LLS limitation)
Configuration:
# VPC subnets - keep multiple AZs (AWS EKS requirement)
availability_zones = [
"us-west-2a",
"us-west-2b",
"us-west-2c"
]
Do NOT change availability_zones to a single zone - this will cause Terraform to fail with a "subnets not in at least two different availability zones" error.
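To satisfy the node-placement constraint while keeping the VPC multi-AZ, restrict only the node placement variable. A sketch, assuming the node_availability_zones variable named in the warning above:

```hcl
# Node placement - single AZ only (LLS limitation)
node_availability_zones = ["us-west-2a"]
```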
Step 4: Deploy Cluster#
Expected Duration: 30-45 minutes
What gets deployed:
VPC with public/private subnets across 3 AZs
EKS control plane
Node pool configuration
IAM roles and policies
Security groups
S3 buckets (if enabled)
Ensure you are in the environment directory and have run terraform init (see Step 3). Review the deployment plan.
terraform plan
Note
Review the plan output to verify expected resources will be created based on your configuration. Key items to check:
Node pools: Verify correct number and instance types (e.g., 5 node pools for self-managed deployment with optional caching components)
VPC/Networking: Confirm subnets match your CIDR configuration
S3 buckets: If create_s3_buckets = true, verify the bucket name is correct
Apply the configuration.
terraform apply
Verify Deployment#
After deployment completes, configure kubectl:
# Replace <region> and <cluster-name> with your values from terraform.tfvars
aws eks update-kubeconfig \
--region <region> \
--name <cluster-name>
Note
Use the same region and cluster_name values from your terraform.tfvars configuration.
Verify cluster health:
# Check all nodes are Ready
kubectl get nodes
# Verify GPU taints are applied automatically
kubectl get nodes -o custom-columns="NAME:.metadata.name,INSTANCE:.metadata.labels.node\.kubernetes\.io/instance-type,TAINTS:.spec.taints"
Example output for GPU nodes (should match your GPU instance type):
NAME INSTANCE TAINTS
ip-10-120-x-x.compute.internal g6e.4xlarge [map[effect:NoSchedule key:nvidia.com/gpu value:present]]
Step 5: Deploy GPU Operator#
The NVIDIA GPU Operator is required for GPU workloads. It installs GPU drivers, device plugins, and monitoring components on GPU nodes.
Set NGC credentials:
export NGC_API_KEY="nvapi-xxxxxxxxxxxxx" # Your NGC API key
Deploy the GPU Operator:
# Navigate to core-apps under the nvcf-base top-level directory
cd /path/to/nvcf-base/core-apps
helmfile apply --selector component=gpu
Verify deployment is proceeding.
Expected Duration: 5-10 minutes
# Check GPU operator pods are running (if some pods are in init state, wait a few minutes)
kubectl get pods -n gpu-operator
# Verify GPU resources are advertised on nodes
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
Expected output: All pods should be Running, and GPU nodes should show available GPU count (e.g., 1 for g6e.4xlarge).
The GPU Operator installs:
GPU driver (NVIDIA 570.x)
NVIDIA device plugin
GPU feature discovery
DCGM exporter (metrics)
Next Steps#
With the infrastructure and GPU Operator deployed:
Begin control plane deployment by following Control Plane Installation.
Deploy optional application components (including simulation components such as DDCS, UCC, Storage API, LLS) under
nvcf-base/core-apps. See orphan.
Operations#
Scaling Node Pools#
Update terraform.tfvars and reapply:
# Edit desired_capacity or max_capacity for node pools
vim terraform/envs/<your-environment>/terraform.tfvars
cd terraform/envs/<your-environment>
terraform plan
terraform apply
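For example, to grow the GPU pool relative to the sample configuration, only the capacity fields change (values below are illustrative):

```hcl
"gpu" = {
  instance_type    = "g6e.4xlarge"
  desired_capacity = 4   # was 2
  max_capacity     = 10  # was 8
  min_capacity     = 1
  # labels unchanged
}
```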
Upgrading GPU Operator#
Re-run the deployment command to upgrade to the latest version:
helmfile apply --selector component=gpu
Note
To upgrade optional enhancements (container caching, simulation caching, LLS), re-run the corresponding make deploy-* commands from orphan.
Adding GPU Capacity#
GPU taints are applied automatically when new GPU nodes join:
Increase desired_capacity for the gpu node pool in tfvars
Run terraform apply
New nodes will automatically receive GPU taints
Verify:
kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
Decommissioning#
cd terraform/envs/<your-environment>
terraform destroy
Warning
This destroys all cluster resources.
Troubleshooting#
GPU Taints Not Applied#
Symptoms: GPU nodes do not have nvidia.com/gpu taint
Diagnosis:
# SSH to GPU node
ssh ubuntu@<gpu-node-ip>
# Check cloud-init logs
sudo cat /var/log/cloud-init-output.log | grep -E "IMDSv2|GPU|TAINT"
Expected output:
DEBUG: Obtained IMDSv2 token
DEBUG: Instance Type: g5.12xlarge
DEBUG: Instance Family: g5
DEBUG: Matched GPU family (g* or p*) - adding GPU taint
DEBUG: Added GPU taint for GPU instance family: nvidia.com/gpu=present:NoSchedule
Resolution:
Verify the instance type starts with g or p
Check the launch template user-data rendered correctly
Terminate node and let ASG create replacement
AMI Does Not Exist Error#
Symptom: During terraform apply, Auto Scaling Group creation fails with:
Error: creating Auto Scaling Group (my-cluster-gpu): ValidationError: You must use a
valid fully-formed launch template. The image id '[ami-0bce1583264e581a6]' does not exist
Cause: The node_ami_id is set to an AMI that doesn’t exist in your target region. AMI IDs are region-specific — an AMI available in us-west-2 will not exist in ap-south-1 or other regions.
Resolution:
Recommended: Use automatic AMI discovery
Set node_ami_id = null in your terraform.tfvars. This automatically discovers the latest EKS-optimized Ubuntu AMI for your region:
node_ami_id = null # Recommended - auto-discovers the correct AMI for your region
If you must pin a specific AMI, find the correct AMI for your region:
# Replace <your-region> and <k8s-version> with your values
aws ec2 describe-images --region <your-region> \
  --owners amazon \
  --filters "Name=name,Values=ubuntu-eks/k8s_<k8s-version>/images/*x86_64*" \
  --query 'Images | sort_by(@, &CreationDate) | [-1].[ImageId,Name]' \
  --output text

# Example for us-west-2 with Kubernetes 1.32:
aws ec2 describe-images --region us-west-2 \
  --owners amazon \
  --filters "Name=name,Values=ubuntu-eks/k8s_1.32/images/*x86_64*" \
  --query 'Images | sort_by(@, &CreationDate) | [-1].[ImageId,Name]' \
  --output text
After fixing, run terraform apply again.
Tip
Using node_ami_id = null is strongly recommended as it ensures you always get the latest security patches and eliminates region compatibility issues.
Pods Not Scheduling on GPU Nodes#
Symptoms: Pods with GPU requests stay in Pending state
Cause: Missing toleration for GPU taint
Resolution: Add toleration to pod spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
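A minimal smoke-test pod combining the toleration with a GPU resource request might look like this. The pod name and image are illustrative assumptions, not part of the NVCF stack:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test   # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # requires the GPU Operator's device plugin
```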
ImagePullBackOff Errors#
Symptoms: Pods fail to pull images from nvcr.io
Resolution:
# Verify NGC_API_KEY is set
echo $NGC_API_KEY
# Recreate image pull secret
kubectl create secret docker-registry nvcr-creds \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password="$NGC_API_KEY" \
--namespace=<namespace> \
--dry-run=client -o yaml | kubectl apply -f -
# Restart affected pods
kubectl rollout restart deployment <deployment-name> -n <namespace>
Resuming Installation After AWS Credential Expiration#
Symptoms: Terraform fails during long-running operations (e.g., EKS cluster creation) with authentication errors, or you see “Resource already exists” errors when trying to resume
Cause: AWS credentials expired during Terraform provisioning. AWS credentials typically expire after a certain period (often 1 hour for temporary credentials), which can occur during long-running deployments like EKS cluster creation.
Resolution:
When AWS credentials expire mid-deployment, Terraform may lose track of resources it created. You can recover by removing the resource from Terraform state and re-importing it.
Example: EKS Cluster Creation
# Remove the EKS cluster from Terraform state
terraform state rm module.eks.aws_eks_cluster.main
# Re-import the existing cluster
terraform import module.eks.aws_eks_cluster.main <cluster_name>
# Continue with your terraform operation
terraform apply
General Process:
Identify the resource that was partially created (check AWS Console)
Remove it from Terraform state: terraform state rm <resource_address>
Import the existing resource: terraform import <resource_address> <resource_id>
Continue with terraform apply
Tip
To find the correct resource address, use terraform state list to see all resources in your state file.
Note
This approach works for any Terraform resource that exists in AWS but that Terraform has lost track of due to credential expiration. Apply this process whenever you encounter "Resource already exists" errors after a failed deployment.
CSI Driver or CloudWatch Add-on Installation Timeouts#
Symptoms: Terraform times out when deploying CSI drivers (SMB CSI Driver) or CloudWatch Observability add-on, or add-on installation appears to hang
Common Causes:
AWS credential expiration during long-running deployments
Network connectivity issues between EKS and AWS services
IAM permissions issues preventing add-on installation
Resource limits or quota exhaustions in your AWS account
Resolution Steps:
Retry the deployment
AWS credential expiration is the most common cause. See Resuming Installation After AWS Credential Expiration for recovery steps.
If the issue persists, consult AWS documentation
AWS provides detailed troubleshooting guides for add-on installations:
CloudWatch Observability Add-on: Troubleshooting CloudWatch Observability EKS Add-on
CSI Driver Issues: Amazon EKS CSI Driver Troubleshooting
Verify add-on status via AWS CLI
# Check CloudWatch add-on status
aws eks describe-addon \
  --cluster-name <cluster-name> \
  --addon-name amazon-cloudwatch-observability \
  --region <region>

# Check all add-ons
aws eks list-addons \
  --cluster-name <cluster-name> \
  --region <region>
Last resort: Destroy and recreate
If troubleshooting steps fail to resolve the issue, you may need to start fresh. See the Decommissioning section to destroy the cluster, then return to Step 4: Deploy Cluster to recreate it.
Tip
Prevention: Use long-lived AWS credentials or AWS IAM roles when possible to avoid credential expiration during deployments. If using temporary credentials, ensure they have sufficient validity period (at least 2-3 hours) before starting a cluster deployment.