EKS Cluster Terraform (Optional)
This guide is completely optional. If you already have a Kubernetes cluster with GPU nodes, you can skip this page and proceed directly to helmfile-installation to install the NVCF control plane.
This guide provides instructions for deploying the Amazon EKS infrastructure foundation for a fully self-hosted NVIDIA Cloud Functions (NVCF) deployment using Terraform. This includes:
- Amazon EKS cluster with dedicated node pools for various workloads
- GPU nodes with automatic taint configuration for inference workloads
- Core infrastructure components (VPC, subnets, IAM roles, security groups)
- NVIDIA GPU Operator deployment (required for GPU workloads)
- Infrastructure prerequisites for optional enhancements (LLS streaming, simulation caching)
This guide covers infrastructure deployment only. Some Terraform options configure AWS resources (IAM policies, S3 buckets) required by optional enhancements deployed later. See self-hosted-optional-enhancements for details on these components.
Prerequisites
Required Tools
- Terraform >= 1.0.0
- AWS CLI configured with credentials
- kubectl >= 1.28
- helm >= 3.17
- helmfile >= 0.150
- helm-diff plugin >=3.11
- helm-secrets plugin >=4.7.4
- skopeo (required only if using create_sm_ecr_repos = true for automated ECR mirroring)
Required Access
- AWS Account with permissions for EKS, VPC, EC2, IAM, S3
- NGC API Key from ngc.nvidia.com authenticated with the nvcf-onprem organization. See self-hosted-image-mirroring for more details on required NGC Service Key scopes.
- The nvcf-base repository must be downloaded to your local machine (see download-nvcf-base).
Configure AWS Credentials
Terraform requires valid AWS credentials to create resources. Configure your AWS credentials using one of the following methods before running any terraform commands:
- AWS CLI (recommended)
- Environment variables
- AWS profile
For example (values in angle brackets are placeholders):
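```bash
# Interactive configuration with the AWS CLI (recommended)
aws configure

# Alternatively, export credentials as environment variables
export AWS_ACCESS_KEY_ID=<access-key-id>
export AWS_SECRET_ACCESS_KEY=<secret-access-key>
export AWS_SESSION_TOKEN=<session-token>   # only needed for temporary credentials

# Or select a named profile from ~/.aws/credentials
export AWS_PROFILE=<profile-name>
```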
Verify AWS credentials are configured correctly:
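```bash
aws sts get-caller-identity
```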
You should see output showing your AWS account ID, user ARN, and user ID. If you receive an error, your credentials are not configured correctly.
Set NGC API Key
Before proceeding, set your NGC API key as an environment variable. This is required for automated ECR mirroring and GPU Operator deployment:
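```bash
export NGC_API_KEY=<your-ngc-api-key>
```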
Network Planning
- VPC CIDR: /16 recommended for production
- Service CIDR: /16, must not overlap with the VPC CIDR
- Egress is required for third-party registry access to pull both service artifacts and function containers
Node Pool Design
The Terraform configuration supports flexible node pool designs for different deployment scenarios:
- self-managed: 5 node pools (compute, GPU, control-plane, database, and secrets management). The extra compute node pool primarily supports optional simulation components and can be disabled for inference-only self-hosted NVCF.
- byoc: 2 node pools (compute and GPU). If deploying with this configuration, nodeSelectors must be disabled in the self-hosted stack environment configuration file.
Please refer to the codebase nvcf-base/terraform/tfvars-examples for the full list of node configurations and deployment options. Though the byoc configuration may support a self-hosted stack deployment, it is primarily meant for BYOC cluster deployments with NVIDIA-managed NVCF control plane services (see cluster-setup-management).
You can customize node pools (instance types, capacities, and configurations) by copying one of the example tfvars files from terraform/tfvars-examples/ to your environment directory and modifying it to match your requirements.
Automatic GPU Taint Configuration
GPU nodes are automatically tainted with nvidia.com/gpu=present:NoSchedule based on instance family detection (g* or p* patterns). No manual configuration required.
Cluster Creation
Step 1: Create Environment
The nvcf-base repository includes a base Terraform environment under terraform/envs/byoc/ containing the required Terraform configuration files. Create your own environment by copying this folder:
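```bash
# From the root of the nvcf-base repository
cp -r terraform/envs/byoc terraform/envs/<your-environment>
```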
Replace <your-environment> with your environment name (e.g., nvcf-prod, staging). This copies all required Terraform files (main.tf, variables.tf, providers.tf, outputs.tf) along with the tfvars configuration template.
Step 2: Configure Environment
Edit terraform/envs/<your-environment>/terraform.tfvars to match your requirements. The key sections are described below. Feel free to use this example terraform.tfvars directly to bring up an EKS cluster ready for NVCF self-hosted control plane deployment. LLS (Low Latency Streaming) is disabled by default; enable it only if you plan to deploy simulation or streaming VM workloads (see self-hosted-lls-installation).
If you plan on using NVCF streaming functions, cluster_name must be less than 20 characters. Double-check this before proceeding; otherwise you will need to destroy and redeploy the cluster.
AMI IDs are region-specific. The sample configuration uses node_ami_id = null which automatically discovers the correct EKS-optimized AMI for your region. This is the recommended setting.
If you need to pin a specific AMI (for compliance or reproducibility), you must use an AMI ID that exists in your target region. Using an AMI from a different region will cause terraform apply to fail with “image id does not exist” errors. See [AMI Does Not Exist Error] in Troubleshooting.
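A minimal illustrative sketch of the key terraform.tfvars settings discussed in this guide (your copied template contains the full set of options, and some variable names may differ):

```hcl
cluster_name            = "nvcf-prod"       # keep under 20 characters if using streaming functions
region                  = "us-west-2"
availability_zones      = ["us-west-2a", "us-west-2b", "us-west-2c"]
node_availability_zones = ["us-west-2a"]    # single AZ (LLS limitation)
node_ami_id             = null              # auto-discover the EKS-optimized AMI (recommended)
create_sm_ecr_repos     = true              # automated ECR mirroring (requires NGC_API_KEY)
```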
ECR Registry Image Mirroring
For ECR users, this Terraform module can automatically mirror all required NVCF artifacts from NGC:
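```hcl
create_sm_ecr_repos = true
```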
Requires NGC_API_KEY environment variable set before running terraform apply. Generate this key from the nvcf-onprem organization at https://org.ngc.nvidia.com/setup/api-keys.
See ecr-automated-mirroring for details on what’s included (control plane, LLS, worker components) and what’s not (simulation cache, custom streaming apps).
If you’re not using ECR or prefer manual mirroring, set create_sm_ecr_repos = false and follow self-hosted-image-mirroring.
GPU Node Configuration
For GPU workloads, you must set the appropriate GPU instance type in the terraform.tfvars configuration. NVCF supports all GPU types supported by the NVIDIA GPU Operator.
Ensure the instance type is available in your chosen region and availability zones (specified in availability_zones).
For single-GPU simulation workloads, this should be g6e.4xlarge or better.
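As an illustration, the tfvars entry might look like the following; the exact variable name for the GPU node pool's instance type depends on your template (gpu_node_instance_type here is hypothetical):

```hcl
gpu_node_instance_type = "g6e.4xlarge"   # hypothetical variable name; check your tfvars template
```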
Deploying to Different AWS Regions
If deploying to a region other than us-west-2, you must update these three variables:
Why these are required:
- Availability zones are region-specific: us-west-2a doesn't exist in us-east-1
- AMI IDs are region-specific: setting node_ami_id to null auto-detects the latest EKS-optimized AMI for your region
- Region determines resource location: all AWS resources will be created in this region
Example for US-East-1:
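Assuming the variable names used elsewhere in this guide:

```hcl
region                  = "us-east-1"
availability_zones      = ["us-east-1a", "us-east-1b", "us-east-1c"]
node_availability_zones = ["us-east-1a"]   # single AZ (LLS limitation)
node_ami_id             = null             # auto-detect the EKS-optimized AMI for us-east-1
```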
To find availability zones for your region:
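```bash
aws ec2 describe-availability-zones --region us-east-1 \
  --query 'AvailabilityZones[].ZoneName' --output text
```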
For detailed guidance in any region, see nvcf-base/terraform/tfvars-examples/README.md.
Step 3: Initialize and Validate
Initialize Terraform and validate the configuration.
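```bash
cd terraform/envs/<your-environment>
terraform init
terraform validate
```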
- VPC Subnets (availability_zones, private_subnet_cidrs, public_subnet_cidrs): MUST span at least 2 AZs (AWS EKS requirement for high availability)
- Node Placement (node_availability_zones): MUST use only 1 AZ (LLS limitation)
Configuration:
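```hcl
# VPC subnets: at least two AZs (EKS requirement)
availability_zones      = ["us-west-2a", "us-west-2b", "us-west-2c"]
# Node placement: exactly one AZ (LLS limitation)
node_availability_zones = ["us-west-2a"]
```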
Do NOT change availability_zones to a single zone - this will cause Terraform to fail with “subnets not in at least two different availability zones” error.
Step 4: Deploy Cluster
Expected Duration: 30-45 minutes
What gets deployed:
- VPC with public/private subnets across 3 AZs
- EKS control plane
- Node pool configuration
- IAM roles and policies
- Security groups
- S3 buckets (if enabled)
Ensure you are in the environment directory and have run terraform init (see Step 3), then review the deployment plan:
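```bash
terraform plan
```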
Review the plan output to verify expected resources will be created based on your configuration. Key items to check:
- Node pools: Verify correct number and instance types (e.g., 5 node pools for self-managed deployment with optional caching components)
- VPC/Networking: Confirm subnets match your CIDR configuration
- S3 buckets: If create_s3_buckets = true, verify the bucket name is correct
Apply the configuration:
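```bash
terraform apply
```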
Verify Deployment
After deployment completes, configure kubectl:
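```bash
aws eks update-kubeconfig --region <region> --name <cluster_name>
```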
Use the same region and cluster_name values from your terraform.tfvars configuration.
Verify cluster health:
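```bash
kubectl get nodes -L node.kubernetes.io/instance-type
```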
GPU nodes should show a Ready status, and their instance-type label should match your configured GPU instance type (e.g., g6e.4xlarge).
Step 5: Deploy GPU Operator
The NVIDIA GPU Operator is required for GPU workloads. It installs GPU drivers, device plugins, and monitoring components on GPU nodes.
- Set NGC credentials
- Deploy the GPU Operator
- Verify the deployment is proceeding (see the sketch below)
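A sketch of these steps, assuming NVIDIA's public Helm chart (nvcf-base may provide its own wrapper for this deployment):

```bash
# 1. NGC credentials for pulling GPU Operator images
export NGC_API_KEY=<your-ngc-api-key>

# 2. Deploy the GPU Operator from NVIDIA's public Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Watch the rollout
kubectl get pods -n gpu-operator --watch
```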
Expected Duration: 5-10 minutes
Expected output: All pods should be Running, and GPU nodes should show available GPU count (e.g., 1 for g6e.4xlarge).
The GPU Operator installs:
- GPU driver (NVIDIA 570.x)
- NVIDIA device plugin
- GPU feature discovery
- DCGM exporter (metrics)
Next Steps
With the infrastructure and GPU Operator deployed:
- Begin control plane deployment by following helmfile-installation.
- Deploy optional application components (including simulation components such as DDCS, UCC, Storage API, LLS) under nvcf-base/core-apps. See self-hosted-optional-enhancements.
Operations
Scaling Node Pools
Update terraform.tfvars and reapply:
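A sketch, assuming a desired-capacity variable for the target node pool (the exact name depends on your tfvars template):

```hcl
# terraform.tfvars -- illustrative variable name
gpu_node_desired_capacity = 4
```

```bash
terraform apply
```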
Upgrading GPU Operator
Re-run the deployment command to upgrade to the latest version:
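Following the generic Helm sketch from Step 5:

```bash
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator
```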
To upgrade optional enhancements (container caching, simulation caching, LLS), re-run the corresponding make deploy-* commands from self-hosted-optional-enhancements.
Adding GPU Capacity
GPU taints are applied automatically when new GPU nodes join:
- Increase desired_capacity for the GPU node pool in tfvars
- Run terraform apply
- New nodes will automatically receive GPU taints
- Verify: kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
Decommissioning
This destroys all cluster resources.
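From your environment directory:

```bash
cd terraform/envs/<your-environment>
terraform destroy
```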
Troubleshooting
GPU Taints Not Applied
Symptoms: GPU nodes do not have nvidia.com/gpu taint
Diagnosis:
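```bash
kubectl get nodes -o=custom-columns="NAME:.metadata.name,TAINTS:.spec.taints"
```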
Expected output: each GPU node should list the nvidia.com/gpu=present:NoSchedule taint.
Resolution:
- Verify the instance type starts with g or p
- Check that the launch template user-data rendered correctly
- Terminate the node and let the ASG create a replacement
AMI Does Not Exist Error
Symptom: During terraform apply, Auto Scaling Group creation fails with an error stating that the requested image id does not exist.
Cause: The node_ami_id is set to an AMI that doesn’t exist in your target region. AMI IDs are region-specific — an AMI available in us-west-2 will not exist in ap-south-1 or other regions.
Resolution:
- Recommended: use automatic AMI discovery. Set node_ami_id = null in your terraform.tfvars; this automatically discovers the latest EKS-optimized Ubuntu AMI for your region.
- If you must pin a specific AMI, find the correct AMI for your region (see the sketch below).
- After fixing, run terraform apply again.
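One way to look up Canonical's EKS-optimized Ubuntu AMIs in a target region (099720109477 is Canonical's public owner ID; adjust the region, and narrow the name filter to your Kubernetes version):

```bash
aws ec2 describe-images \
  --region ap-south-1 \
  --owners 099720109477 \
  --filters "Name=name,Values=ubuntu-eks/*" \
  --query 'sort_by(Images,&CreationDate)[-1].{ImageId:ImageId,Name:Name}' \
  --output table
```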
Using node_ami_id = null is strongly recommended as it ensures you always get the latest security patches and eliminates region compatibility issues.
Pods Not Scheduling on GPU Nodes
Symptoms: Pods with GPU requests stay in Pending state
Cause: Missing toleration for GPU taint
Resolution: Add toleration to pod spec:
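The toleration must match the automatically applied taint nvidia.com/gpu=present:NoSchedule:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```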
ImagePullBackOff Errors
Symptoms: Pods fail to pull images from nvcr.io
Resolution: Verify your NGC API key is valid and that the image pull secret for nvcr.io exists in the affected namespace; recreating the secret is a common fix.
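A remediation sketch with an illustrative secret name (for nvcr.io token authentication, the username is the literal string $oauthtoken):

```bash
kubectl create secret docker-registry ngc-registry \
  --namespace <namespace> \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"
```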
Resuming Installation After AWS Credential Expiration
Symptoms: Terraform fails during long-running operations (e.g., EKS cluster creation) with authentication errors, or you see “Resource already exists” errors when trying to resume
Cause: AWS credentials expired during Terraform provisioning. AWS credentials typically expire after a certain period (often 1 hour for temporary credentials), which can occur during long-running deployments like EKS cluster creation.
Resolution:
When AWS credentials expire mid-deployment, Terraform may lose track of resources it created. You can recover by removing the resource from Terraform state and re-importing it.
Example: EKS Cluster Creation
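A hypothetical recovery for an EKS cluster that exists in AWS but was lost from state; the resource address below is illustrative (an EKS cluster imports by its name):

```bash
# Find the address Terraform uses for the cluster
terraform state list | grep aws_eks_cluster

# Remove the stale entry, then re-import the live cluster by name
terraform state rm 'aws_eks_cluster.this'               # illustrative address
terraform import 'aws_eks_cluster.this' <cluster_name>

# Resume the deployment
terraform apply
```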
General Process:
- Identify the resource that was partially created (check the AWS Console)
- Remove it from Terraform state: terraform state rm <resource_address>
- Import the existing resource: terraform import <resource_address> <resource_id>
- Continue with terraform apply
To find the correct resource address, use terraform state list to see all resources in your state file.
This approach works for any Terraform resource that exists in AWS but that Terraform has lost track of due to credential expiration. Use this process whenever you encounter "Resource already exists" errors after a failed deployment.
CSI Driver or CloudWatch Add-on Installation Timeouts
Symptoms: Terraform times out when deploying CSI drivers (SMB CSI Driver) or CloudWatch Observability add-on, or add-on installation appears to hang
Common Causes:
- AWS credential expiration during long-running deployments
- Network connectivity issues between EKS and AWS services
- IAM permissions issues preventing add-on installation
- Resource limits or quota exhaustion in your AWS account
Resolution Steps:
- Retry the deployment. AWS credential expiration is the most common cause; see [Resuming Installation After AWS Credential Expiration] for recovery steps.
- If the issue persists, consult AWS documentation. AWS provides detailed troubleshooting guides for add-on installations:
  - CloudWatch Observability Add-on: Troubleshooting CloudWatch Observability EKS Add-on
  - CSI Driver Issues: Amazon EKS CSI Driver Troubleshooting
- Verify add-on status via the AWS CLI (see the sketch after this list).
- Last resort: destroy and recreate. If the troubleshooting steps fail to resolve the issue, you may need to start fresh. See the [Decommissioning] section to destroy the cluster, then return to [Step 4: Deploy Cluster] to recreate it.
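To check add-on status from the CLI (the addon name shown is for the CloudWatch Observability add-on):

```bash
aws eks list-addons --cluster-name <cluster_name>
aws eks describe-addon \
  --cluster-name <cluster_name> \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.{Status:status,Issues:health.issues}'
```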
Prevention: Use long-lived AWS credentials or AWS IAM roles when possible to avoid credential expiration during deployments. If using temporary credentials, ensure they have sufficient validity period (at least 2-3 hours) before starting a cluster deployment.