AWS#
NIM supports deployment on Amazon Web Services (AWS) through two approaches:
EKS (Amazon Elastic Kubernetes Service): Self-managed Kubernetes deployment using Helm charts with GPU-accelerated nodes.
SageMaker: Fully managed ML platform with managed real-time inference endpoints that use S3 for model storage.
EKS Deployment#
This guide walks through deploying NIM LLM on Amazon Elastic Kubernetes Service (EKS) with GPU-accelerated nodes.
Prerequisites#
Before you begin, ensure you have the following:
An AWS account with permissions to create EKS clusters, EC2 GPU instances, and associated networking and IAM resources.
Sufficient GPU instance quota in your target region. Verify in the AWS Console under Service Quotas > Amazon EC2 > “Running On-Demand [instance-type] instances”.
An NGC account and API key.
The following CLI tools installed: aws (the AWS CLI), eksctl, kubectl, and helm. All four are used in the commands in this guide.
Configure AWS Credentials#
Set up a named AWS CLI profile and confirm that it works before you create infrastructure.
Create a Named Profile#
Use a named profile so the AWS CLI and related tools use the expected account and region.
Create the AWS credentials directory, and open the credentials file:
mkdir -p ~/.aws
nano ~/.aws/credentials
Add your profile to ~/.aws/credentials:
[my-profile]
aws_access_key_id=<your-access-key>
aws_secret_access_key=<your-secret-key>
aws_session_token=<your-session-token>
Set your profile and default region:
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-east-2
Verify that the profile works:
aws sts get-caller-identity
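On success, the command prints a small JSON document containing UserId, Account, and Arn for the active credentials. As a quick sanity check, extracting and validating the account ID can be sketched in Python (the identifier values below are placeholders, not real credentials):

```python
import json

# Placeholder for the JSON that `aws sts get-caller-identity` prints.
sample = """
{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/example"
}
"""

identity = json.loads(sample)
account_id = identity["Account"]

# A 12-digit account ID and an IAM ARN indicate the profile resolved correctly.
assert account_id.isdigit() and len(account_id) == 12
assert identity["Arn"].startswith("arn:aws:iam::")
print(account_id)
```

The same field is what later commands in this guide read with `--query Account`.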
Check GPU Instance Quota#
Confirm that your target region has enough GPU quota before you create the cluster.
Set the AWS region:
export AWS_REGION="us-east-2"
List the relevant GPU quotas:
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --region $AWS_REGION \
  --query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].{Name:QuotaName, Value:Value}" \
  --output table
Use the quota output to confirm the instance families you plan to use:
G-series (g5.xlarge, g5.12xlarge, and so on) requires quota code L-DB2E81BA.
P-series (p4d.24xlarge, p5.48xlarge) requires quota code L-417A185B.
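The family-to-quota-code mapping above can be expressed as a small lookup. The helper below is purely illustrative (it is not part of any AWS SDK), and the codes are the ones listed in this section:

```python
# Service Quotas codes for On-Demand GPU instance families (from the list above).
QUOTA_CODES = {
    "g": "L-DB2E81BA",  # G-series: g5.xlarge, g5.12xlarge, ...
    "p": "L-417A185B",  # P-series: p4d.24xlarge, p5.48xlarge, ...
}

def quota_code(instance_type: str) -> str:
    """Return the Service Quotas code for a GPU instance type name."""
    family = instance_type[0].lower()
    try:
        return QUOTA_CODES[family]
    except KeyError:
        raise ValueError(f"no GPU quota code known for {instance_type!r}")

print(quota_code("g5.xlarge"))     # L-DB2E81BA
print(quota_code("p4d.24xlarge"))  # L-417A185B
```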
Set Up the EKS Cluster#
Create the control plane first, then add GPU-enabled worker nodes and the supporting storage and GPU components.
Set Environment Variables#
Start by defining the values that the EKS setup commands use.
export AWS_REGION="us-east-2"
export EKS_CLUSTER="my-eks-cluster"
export GPU_NODEGROUP="gpu-nodegroup"
Create the Cluster#
Follow these steps to create the EKS control plane and verify connectivity:
Create the EKS control plane (without any node group). Refer to the AWS EKS documentation for additional cluster creation options.
eksctl create cluster \
  --name $EKS_CLUSTER \
  --region $AWS_REGION \
  --version 1.31 \
  --without-nodegroup
After the cluster is ready, verify connectivity:
kubectl get svc
Tip
To update kubeconfig for an existing cluster, run:
aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER
Add a GPU Node Group#
Add a GPU-capable managed node group after the EKS control plane is ready.
Create the GPU node group:
eksctl create nodegroup \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name $GPU_NODEGROUP \
  --node-type g5.xlarge \
  --nodes 1 \
  --node-ami-family AmazonLinux2 \
  --node-volume-size 200
After the node group is ready, verify that the node joined the cluster:
kubectl get nodes -o wide
The following command options determine the key characteristics of the GPU node group:
--node-type: Select a value based on your GPU requirements.
--nodes 1: Number of GPU instances to launch.
--node-ami-family AmazonLinux2: Uses the GPU-optimized AMI with pre-installed NVIDIA drivers.
--node-volume-size 200: Provides 200 GB of disk space for model storage and container images.
Install the NVIDIA GPU Operator#
The GPU Operator manages the device plugin, container runtime, and other NVIDIA components that are required for GPU scheduling. Refer to the GPU Operator documentation for details.
Install the GPU Operator after the node group is available.
Add the NVIDIA Helm repository, and update your local chart index:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
Install the GPU Operator:
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator \
  --wait --generate-name
Verify that the GPU Operator pods are ready:
kubectl -n gpu-operator get pods
Verify that the nodes advertise GPU capacity:
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
Install the Amazon EBS CSI Driver#
The Amazon EBS CSI driver is required for dynamic PersistentVolumeClaim provisioning.
Create an IAM service account:
eksctl create iamserviceaccount \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve
Get your AWS account ID:
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
Install the addon:
eksctl create addon \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole
Verify the installation:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Verify the Cluster Is Ready#
Run the following checks to confirm that the cluster is ready for deployment.
Confirm that the nodes are present:
kubectl get nodes -o wide
Confirm that the nodes advertise GPU capacity:
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
Confirm that the Amazon EBS CSI Driver pods are running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Confirm that the GPU Operator pods are running:
kubectl -n gpu-operator get pods
Deploy NIM#
After the cluster is ready, create the namespace, configure secrets and storage, and install the Helm chart.
Create the Namespace and Secrets#
Create the Kubernetes namespace and the secrets that the deployment uses.
Set the deployment variables:
export NAMESPACE="nim-llm"
export RELEASE_NAME="my-nim"
export NGC_API_KEY="<your-ngc-api-key>"
Create the namespace:
kubectl create namespace $NAMESPACE
Create the NGC registry secret:
kubectl create secret docker-registry ngc-secret \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
Create the NGC API key secret:
kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
Tip
If your model requires a gated Hugging Face repository, create an additional secret with your Hugging Face token:
kubectl create secret generic hf-token \
--namespace $NAMESPACE \
--from-literal=HF_TOKEN="<your-HF-token>"
Create a Model Cache PVC#
Create a persistent volume claim so model artifacts can be reused across pod restarts.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
  namespace: $NAMESPACE
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 200Gi
EOF
Note
gp3 is also a valid storageClassName.
Create a Helm Values File#
Save the following as custom-values-eks.yaml:
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent
model:
  name: meta/llama-3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO
env:
  - name: NIM_MODEL_PROFILE
    value: "<profile-hash>"
podSecurityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0
persistence:
  enabled: true
  existingClaim: "nvidia-nim-cache-pvc"
resources:
  limits:
    nvidia.com/gpu: 1
imagePullSecrets:
  - name: ngc-secret
nodeSelector:
  nvidia.com/gpu.present: "true"
service:
  type: LoadBalancer
  openaiPort: 8000
Note
The example above deploys a single-GPU model on a g5.xlarge instance. For larger models requiring multiple GPUs, use an instance type with sufficient GPUs (for example, p4d.24xlarge for 8x A100) and set resources.limits.nvidia.com/gpu to the corresponding GPU count in your values file.
Note
Replace <profile-hash> with a valid profile ID for your chosen model and GPU.
To list available profiles, refer to Model Profiles and Selection.
Install the Helm Chart#
Download the NIM Helm chart, and then install it.
Download the chart package:
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version>.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY
Install the chart:
helm install $RELEASE_NAME nim-llm-<version>.tgz \
  --namespace $NAMESPACE \
  -f custom-values-eks.yaml
Monitor and Verify the Deployment#
Use the following steps to monitor the rollout and verify that the service is reachable.
Watch the pod status:
kubectl -n $NAMESPACE get pods -w
Stream the NIM logs:
kubectl -n $NAMESPACE logs -l app.kubernetes.io/name=nim-llm -f --tail=100
Get the service endpoint:
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
After EXTERNAL-IP is populated, set EXTERNAL_IP:
export EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc $RELEASE_NAME-nim-llm \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
Tip
If the LoadBalancer external IP is not available, use port-forwarding instead:
kubectl -n $NAMESPACE port-forward svc/$RELEASE_NAME-nim-llm 8000:8000
export EXTERNAL_IP=localhost
Check the readiness endpoint:
curl -s "http://$EXTERNAL_IP:8000/v1/health/ready"
List the available models:
curl -s "http://$EXTERNAL_IP:8000/v1/models" | python3 -m json.tool
Send a test chat completion request:
curl -X POST "http://$EXTERNAL_IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
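The same request can be issued from Python using only the standard library. This is a sketch, not part of NIM itself: the helper names `build_chat_request` and `send_chat_request` are illustrative, and it assumes the service is reachable at the `EXTERNAL_IP` set earlier on port 8000. The payload mirrors the OpenAI-compatible chat completions schema used by the curl example above.

```python
import json
import os
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only perform the network call when EXTERNAL_IP is actually set.
if __name__ == "__main__" and "EXTERNAL_IP" in os.environ:
    base_url = f"http://{os.environ['EXTERNAL_IP']}:8000"
    payload = build_chat_request("meta/llama-3.1-8b-instruct", "Hello!")
    print(json.dumps(send_chat_request(base_url, payload), indent=2))
```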
Teardown#
Delete the deployment resources when you no longer need them.
Remove the Helm release:
helm uninstall $RELEASE_NAME -n $NAMESPACE
Delete the persistent volume claim and namespace:
kubectl delete pvc nvidia-nim-cache-pvc -n $NAMESPACE
kubectl delete namespace $NAMESPACE
Delete the EKS cluster. For additional options, refer to the AWS documentation.
eksctl delete cluster --name $EKS_CLUSTER --region $AWS_REGION
SageMaker Deployment#
This guide covers deploying NIM LLM on Amazon SageMaker using managed real-time inference endpoints. Model storage is S3 only: SageMaker cannot pull models from NGC or Hugging Face at runtime. You can either mirror an NGC model to S3 or serve your own model already stored in S3.
Prerequisites#
Before you begin, ensure you have the following:
An AWS account with permissions to create SageMaker endpoints, ECR repositories, and S3 buckets.
An NGC API key if you use an NGC-based NIM image or mirror models to S3.
The following tools installed:
AWS CLI (aws), used for the ECR and SageMaker commands in this guide
Docker (for building and pushing the NIM image to ECR)
Your SageMaker execution role must have access to the S3 bucket where model artifacts are stored, and to ECR if you push the container image to your own registry. For more information about IAM roles, refer to SageMaker execution roles.
Model Storage Options#
SageMaker endpoints use S3 for model data. Choose one of the following:
NGC model in S3: Upload a pre-built NGC model to S3 using percent-encoded keys (for example, by using the mirror s3 CLI). At runtime, set NIM_REPOSITORY_OVERRIDE=s3://${S3_BUCKET} so NIM reads the built-in manifest and downloads from S3. Refer to Model Download for S3 mirroring.
Your own model in S3: Store your model in S3 with a normal directory structure (for example, s3://${S3_BUCKET}/path/to/model). Set NIM_MODEL_PATH=s3://${S3_BUCKET}/path/to/model. NIM generates the manifest at startup and downloads from S3.
Authentication to S3 is handled by the SageMaker execution role; you do not need to pass credentials inside the container.
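For intuition about what "percent-encoded keys" means in the first option, the following sketch shows standard percent-encoding with Python's urllib.parse. The model identifier is illustrative, and the exact key scheme the mirror s3 CLI writes may differ; treat this only as a demonstration of the encoding itself.

```python
from urllib.parse import quote, unquote

# Hypothetical NGC-style model identifier (illustrative, not an actual S3 key).
model_id = "meta/llama-3.1-8b-instruct"

# Percent-encode everything, including "/": safe="" disables the default
# exemption for path separators, so the slash becomes %2F.
encoded = quote(model_id, safe="")
print(encoded)  # meta%2Fllama-3.1-8b-instruct

# Decoding restores the original identifier exactly.
assert unquote(encoded) == model_id
```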
Build and Push the NIM Image to ECR#
SageMaker runs your container from an image in Amazon ECR. Build your NIM image (or use an NGC image as the base), then push it to an ECR repository in your account.
Set the AWS region, repository name, and account ID:
export AWS_REGION="${AWS_REGION:-us-east-1}"
export ECR_REPO_NAME="nim-llm"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
Log in to Amazon ECR:
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
Create the ECR repository if it does not already exist:
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION 2>/dev/null || true
Define the source image and target ECR URI:
export NIM_IMAGE="${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1"  # or your image
export ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest"
Pull, tag, and push the image:
docker pull $NIM_IMAGE
docker tag $NIM_IMAGE $ECR_URI
docker push $ECR_URI
Create the SageMaker Model#
Create a SageMaker model resource that references your ECR image and sets the environment variables NIM needs at runtime.
Use an NGC Model Mirrored to S3#
If you have mirrored an NGC model to S3, use the NIM_REPOSITORY_OVERRIDE environment variable to specify the location.
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models) and ${YOUR_NGC_API_KEY} with your NGC API key. Keep NGC_API_KEY secret and never commit it to version control.
Set the required environment variables:
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-llama-8b"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
Create the SageMaker model:
aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NGC_API_KEY": "'"$NGC_API_KEY"'",
      "NIM_REPOSITORY_OVERRIDE": "s3://'"$S3_BUCKET"'",
      "NIM_SERVED_MODEL_NAME": "llama-3.1-8b-instruct"
    }
  }'
Use a Custom Model in S3#
If you are serving your own model from S3, use the NIM_MODEL_PATH environment variable to specify the location.
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models). The bucket must exist in the target region and your SageMaker execution role must have read access to it.
Set the required environment variables:
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-custom-model"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export S3_MODEL_URI="s3://${S3_BUCKET}/path/to/model"
Create the SageMaker model:
aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NIM_MODEL_PATH": "'"$S3_MODEL_URI"'",
      "NIM_SERVED_MODEL_NAME": "my-model"
    }
  }'
Use the same ENDPOINT_NAME for the model, endpoint config, and endpoint so they are linked.
Create Endpoint Config and Endpoint#
Create an endpoint configuration, create the endpoint, and then wait for it to become ready.
Set the instance type:
export INSTANCE_TYPE="ml.g6e.xlarge"
Create the endpoint configuration:
aws sagemaker create-endpoint-config \
  --endpoint-config-name $ENDPOINT_NAME \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "'"$ENDPOINT_NAME"'",
    "InstanceType": "'"$INSTANCE_TYPE"'",
    "InitialInstanceCount": 1
  }]'
Create the endpoint:
aws sagemaker create-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --endpoint-config-name $ENDPOINT_NAME
Check the endpoint status until it is InService (this can take several minutes):
aws sagemaker describe-endpoint --endpoint-name $ENDPOINT_NAME --query 'EndpointStatus' --output text
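The status check can be wrapped in a simple poll loop rather than run by hand. The sketch below is generic: `fetch_status` is a stand-in for running the describe-endpoint command (for example via subprocess or boto3), and the status names match those used by SageMaker endpoints.

```python
import time

def wait_for_status(fetch_status, target="InService",
                    interval_s=30, timeout_s=1800, sleep=time.sleep):
    """Poll fetch_status() until it returns `target`.

    fetch_status: zero-argument callable returning the current status string.
    Raises RuntimeError on a Failed status and TimeoutError on timeout.
    """
    waited = 0
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status == "Failed":
            raise RuntimeError("endpoint entered Failed state")
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

In practice, fetch_status would shell out to the aws sagemaker describe-endpoint command shown above and return its output.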
Verify the Deployment#
Invoke the endpoint using the SageMaker Runtime API. The request body is the same OpenAI-compatible JSON as other NIM deployments.
Invoke the endpoint:
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type "application/json" \
  --body '{"model":"llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
  /tmp/response.json
Format the response:
python3 -m json.tool /tmp/response.json
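Because the body follows the OpenAI chat completions shape, the generated text lives under choices[0].message.content. A minimal extraction sketch (the sample response below is fabricated for illustration):

```python
# Fabricated example of an OpenAI-style chat completions response body.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "llama-3.1-8b-instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
}

def completion_text(response: dict) -> str:
    """Extract the assistant text from a chat completions response."""
    return response["choices"][0]["message"]["content"]

print(completion_text(sample_response))
```

To apply it to the saved file: `import json; print(completion_text(json.load(open("/tmp/response.json"))))`.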
For health checks, use the same invoke-endpoint command with a request to your health path if your container exposes it. Otherwise, confirm that the chat completions request above returns a valid response.
Teardown#
Delete the SageMaker resources when you no longer need them so you avoid ongoing charges.
Delete the endpoint:
aws sagemaker delete-endpoint --endpoint-name $ENDPOINT_NAME
After the endpoint deletion finishes, delete the endpoint configuration:
aws sagemaker delete-endpoint-config --endpoint-config-name $ENDPOINT_NAME
Delete the model:
aws sagemaker delete-model --model-name $ENDPOINT_NAME
Optional: Remove the ECR image and S3 model data if you no longer need them.