AWS#
NIM supports deployment on Amazon Web Services (AWS) through two approaches:
EKS (Amazon Elastic Kubernetes Service): Self-managed Kubernetes deployment using Helm charts with GPU-accelerated nodes.
SageMaker: Fully managed ML platform with managed real-time inference endpoints that use S3 for model storage.
EKS Deployment#
This guide walks through deploying NIM LLM on Amazon Elastic Kubernetes Service (EKS) with GPU-accelerated nodes.
Prerequisites#
Before you begin, ensure you have the following:
An AWS account with permissions to create EKS clusters, EC2 GPU instances, and associated networking and IAM resources.
Sufficient GPU instance quota in your target region. Verify in the AWS Console under Service Quotas > Amazon EC2 > “Running On-Demand [instance-type] instances”.
An NGC account and API key.
The following CLI tools installed: the AWS CLI (aws), eksctl, kubectl, and Helm (helm), all of which are used later in this guide.
Configure AWS Credentials#
Create a Named Profile#
Add your credentials to ~/.aws/credentials:
mkdir -p ~/.aws
nano ~/.aws/credentials
[my-profile]
aws_access_key_id=<your-access-key>
aws_secret_access_key=<your-secret-key>
aws_session_token=<your-session-token>
Set your profile and default region, then verify:
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-east-2
aws sts get-caller-identity
Check GPU Instance Quota#
Verify you have sufficient GPU instance quota in your target region:
export AWS_REGION="us-east-2"
aws service-quotas list-service-quotas \
--service-code ec2 \
--region $AWS_REGION \
--query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].{Name:QuotaName, Value:Value}" \
--output table
# G-series (g5.xlarge, g5.12xlarge, and so on): Requires quota code L-DB2E81BA, minimum 4 vCPUs for g5.xlarge
# P-series (p4d.24xlarge, p5.48xlarge): Requires quota code L-417A185B
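The same quota check can be scripted. A minimal sketch using boto3 (the helper names are illustrative; the Service Quotas API calls are standard boto3, and the fetch function requires AWS credentials):

```python
def gpu_quotas(quotas):
    """Filter Service Quotas entries down to On-Demand G- and P-series quotas."""
    return [
        {"Name": q["QuotaName"], "Value": q["Value"]}
        for q in quotas
        if "On-Demand G" in q["QuotaName"] or "On-Demand P" in q["QuotaName"]
    ]

def fetch_gpu_quotas(region):
    """Page through EC2 service quotas and return the GPU-related ones."""
    import boto3  # requires AWS credentials at call time
    client = boto3.client("service-quotas", region_name=region)
    quotas = []
    for page in client.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
        quotas.extend(page["Quotas"])
    return gpu_quotas(quotas)
```

Remember that these quotas are measured in vCPUs, not instances, so a single g5.xlarge needs a quota value of at least 4.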
Set Up the EKS Cluster#
Set Environment Variables#
export AWS_REGION="us-east-2"
export EKS_CLUSTER="my-eks-cluster"
export GPU_NODEGROUP="gpu-nodegroup"
Create the Cluster#
Refer to the AWS EKS documentation for additional cluster creation options.
eksctl create cluster \
--name $EKS_CLUSTER \
--region $AWS_REGION \
--version 1.31 \
--without-nodegroup
After the cluster is ready, verify connectivity:
kubectl get svc
Tip
To update kubeconfig for an existing cluster, run:
aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER
Add a GPU Node Group#
eksctl create nodegroup \
--cluster $EKS_CLUSTER \
--region $AWS_REGION \
--name $GPU_NODEGROUP \
--node-type g5.xlarge \
--nodes 1 \
--node-ami-family AmazonLinux2 \
--node-volume-size 200
Once the node group is ready, verify the node has joined:
kubectl get nodes -o wide
--node-type: Select a value based on your GPU requirements.
--nodes 1: Number of GPU instances to launch.
--node-ami-family AmazonLinux2: Uses the GPU-optimized AMI with pre-installed NVIDIA drivers.
--node-volume-size 200: Provides 200 GB of disk space for model storage and container images.
Install the NVIDIA GPU Operator#
The GPU Operator manages the device plugin, container runtime, and other NVIDIA components that are required for GPU scheduling. Refer to the GPU Operator documentation for details.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator \
--wait --generate-name
Wait for all pods to become ready:
kubectl -n gpu-operator get pods
kubectl get nodes \
-o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
Install the Amazon EBS CSI Driver#
The EBS CSI driver is required for dynamic PersistentVolumeClaim provisioning.
Create an IAM service account:
eksctl create iamserviceaccount \
--cluster $EKS_CLUSTER \
--region $AWS_REGION \
--namespace kube-system \
--name ebs-csi-controller-sa \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
Install the addon:
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
eksctl create addon \
--cluster $EKS_CLUSTER \
--region $AWS_REGION \
--name aws-ebs-csi-driver \
--service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole
Verify:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Verify the Cluster Is Ready#
kubectl get nodes -o wide
kubectl get nodes \
-o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl -n gpu-operator get pods
Deploy NIM#
Create the Namespace and Secrets#
export NAMESPACE="nim-llm"
export RELEASE_NAME="my-nim"
export NGC_API_KEY="<your-ngc-api-key>"
kubectl create namespace $NAMESPACE
kubectl create secret docker-registry ngc-secret \
--namespace $NAMESPACE \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password="$NGC_API_KEY"
kubectl create secret generic ngc-api \
--namespace $NAMESPACE \
--from-literal=NGC_API_KEY="$NGC_API_KEY"
Tip
If your model requires a gated Hugging Face repository, create an additional secret with your Hugging Face token:
kubectl create secret generic hf-token \
--namespace $NAMESPACE \
--from-literal=HF_TOKEN="<your-HF-token>"
Create a Model Cache PVC#
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nvidia-nim-cache-pvc
namespace: $NAMESPACE
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp2
resources:
requests:
storage: 200Gi
EOF
Note
gp3 is also a valid storageClassName.
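Note that EKS does not ship a gp3 StorageClass by default; with the EBS CSI driver installed, you can create one along these lines (the name gp3 is a convention, not a requirement):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```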
Create a Helm Values File#
Save the following as custom-values-eks.yaml:
image:
repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
tag: "2.0.0"
pullPolicy: IfNotPresent
model:
name: meta/llama-3.1-8b-instruct
ngcAPISecret: ngc-api
nimCache: /model-store
openaiPort: 8000
logLevel: INFO
env:
- name: NIM_MODEL_PROFILE
value: "<profile-hash>"
podSecurityContext:
runAsUser: 0
runAsGroup: 0
fsGroup: 0
persistence:
enabled: true
existingClaim: "nvidia-nim-cache-pvc"
resources:
limits:
nvidia.com/gpu: 1
imagePullSecrets:
- name: ngc-secret
nodeSelector:
nvidia.com/gpu.present: "true"
service:
type: LoadBalancer
openaiPort: 8000
Note
The example above deploys a single-GPU model on a g5.xlarge instance. For larger models requiring multiple GPUs, use an instance type with sufficient GPUs (for example, p4d.24xlarge for 8x A100) and set resources.limits.nvidia.com/gpu to the corresponding GPU count in your values file.
Note
Replace <profile-hash> with a valid profile ID for your chosen model and GPU.
To list available profiles, refer to Model Profiles and Selection.
Install the Helm Chart#
Download the NIM Helm chart and install it:
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version>.tgz \
--username='$oauthtoken' --password=$NGC_API_KEY
helm install $RELEASE_NAME nim-llm-<version>.tgz \
--namespace $NAMESPACE \
-f custom-values-eks.yaml
Monitor and Verify the Deployment#
Watch pod status and logs:
kubectl -n $NAMESPACE get pods -w
kubectl -n $NAMESPACE logs -l app.kubernetes.io/name=nim-llm -f --tail=100
Get the service endpoint:
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
Wait for EXTERNAL-IP to be populated, then:
export EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc $RELEASE_NAME-nim-llm \
-o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
Tip
If the LoadBalancer external IP is not available, use port-forwarding instead:
kubectl -n $NAMESPACE port-forward svc/$RELEASE_NAME-nim-llm 8000:8000
export EXTERNAL_IP=localhost
Verify the deployment:
curl -s "http://$EXTERNAL_IP:8000/v1/health/ready"
curl -s "http://$EXTERNAL_IP:8000/v1/models" | python3 -m json.tool
curl -X POST "http://$EXTERNAL_IP:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 128
}'
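The same verification can be scripted. A minimal sketch using only the Python standard library (the helper names are illustrative; the payload matches the curl example above):

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=128):
    """Build the OpenAI-compatible chat completions payload NIM expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(host, model, prompt, port=8000):
    """POST the request to the NIM service; requires a reachable endpoint."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `chat(os.environ["EXTERNAL_IP"], "meta/llama-3.1-8b-instruct", "Hello!")` mirrors the curl request.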
Teardown#
Remove the NIM deployment:
helm uninstall $RELEASE_NAME -n $NAMESPACE
kubectl delete pvc nvidia-nim-cache-pvc -n $NAMESPACE
kubectl delete namespace $NAMESPACE
To delete the EKS cluster, refer to the AWS documentation.
eksctl delete cluster --name $EKS_CLUSTER --region $AWS_REGION
SageMaker Deployment#
This guide covers deploying NIM LLM on Amazon SageMaker using managed real-time inference endpoints. Model storage is S3 only: SageMaker cannot pull models from NGC or Hugging Face at runtime. You can either mirror an NGC model to S3 or serve your own model already stored in S3.
Prerequisites#
Before you begin, ensure you have:
An AWS account with permissions to create SageMaker endpoints, ECR repositories, and S3 buckets.
An NGC API key if you use an NGC-based NIM image or mirror models to S3.
The following installed:
The AWS CLI, configured with credentials for your account
Docker (for building and pushing the NIM image to ECR)
Your SageMaker execution role must have access to the S3 bucket where model artifacts are stored, and to ECR if you push the container image to your own registry. For more information about IAM roles, refer to SageMaker execution roles.
Model Storage Options#
SageMaker endpoints use S3 for model data. Choose one of the following:
NGC model in S3: Upload a pre-built NGC model to S3 using percent-encoded keys (for example, by using the mirror s3 CLI). At runtime, set NIM_REPOSITORY_OVERRIDE=s3://${S3_BUCKET} so NIM reads the baked-in manifest and downloads from S3. Refer to Model Download for S3 mirroring.
Your own model in S3: Store your model in S3 with a normal directory structure (for example, s3://${S3_BUCKET}/path/to/model). Set NIM_MODEL_PATH=s3://${S3_BUCKET}/path/to/model. NIM generates the manifest at startup and downloads from S3.
Authentication to S3 is handled by the SageMaker execution role; you do not need to pass credentials inside the container.
Build and Push the NIM Image to ECR#
SageMaker runs your container from an image in Amazon ECR. Build your NIM image (or use an NGC image as the base), then push it to an ECR repository in your account.
Create an ECR repository (if needed):
export AWS_REGION="${AWS_REGION:-us-east-1}"
export ECR_REPO_NAME="nim-llm"
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION 2>/dev/null || true
Tag and push the image:
export NIM_IMAGE="${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.0" # or your image
export ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest"
docker pull $NIM_IMAGE
docker tag $NIM_IMAGE $ECR_URI
docker push $ECR_URI
Replace ${AWS_ACCOUNT_ID} with your AWS account ID (for example, aws sts get-caller-identity --query Account --output text).
Create the SageMaker Model#
Create a SageMaker model resource that references your ECR image and sets the environment variables NIM needs at runtime.
Option A: NGC model mirrored to S3 (use NIM_REPOSITORY_OVERRIDE)
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models) and ${YOUR_NGC_API_KEY} with your NGC API key. Keep NGC_API_KEY secret and never commit it to version control.
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-llama-8b"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
aws sagemaker create-model \
--model-name $ENDPOINT_NAME \
--execution-role-arn $EXECUTION_ROLE_ARN \
--primary-container '{
"Image": "'"$ECR_URI"'",
"Environment": {
"NGC_API_KEY": "'"$NGC_API_KEY"'",
"NIM_REPOSITORY_OVERRIDE": "s3://'"$S3_BUCKET"'",
"NIM_SERVED_MODEL_NAME": "llama-3.1-8b-instruct"
}
}'
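The same model creation can be done from Python. A sketch using boto3 (the helper names are illustrative; create_model with ModelName, ExecutionRoleArn, and PrimaryContainer is the standard SageMaker API):

```python
def primary_container(ecr_uri, s3_bucket, ngc_api_key, served_name):
    """Container spec for an NGC model mirrored to S3 (Option A)."""
    return {
        "Image": ecr_uri,
        "Environment": {
            "NGC_API_KEY": ngc_api_key,
            "NIM_REPOSITORY_OVERRIDE": f"s3://{s3_bucket}",
            "NIM_SERVED_MODEL_NAME": served_name,
        },
    }

def create_model(name, role_arn, container, region=None):
    """Register the SageMaker model; requires AWS credentials."""
    import boto3
    sm = boto3.client("sagemaker", region_name=region)
    return sm.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer=container,
    )
```

For Option B, replace the Environment map with NIM_MODEL_PATH and NIM_SERVED_MODEL_NAME, as in the CLI example below.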
Option B: Your own model in S3 (use NIM_MODEL_PATH pointing to S3)
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models). The bucket must exist in the target region and your SageMaker execution role must have read access to it.
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-custom-model"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export S3_MODEL_URI="s3://${S3_BUCKET}/path/to/model"
aws sagemaker create-model \
--model-name $ENDPOINT_NAME \
--execution-role-arn $EXECUTION_ROLE_ARN \
--primary-container '{
"Image": "'"$ECR_URI"'",
"Environment": {
"NIM_MODEL_PATH": "'"$S3_MODEL_URI"'",
"NIM_SERVED_MODEL_NAME": "my-model"
}
}'
Use the same ENDPOINT_NAME for the model, endpoint config, and endpoint so they are linked.
Create Endpoint Config and Endpoint#
Create an endpoint configuration (instance type and count) and then create the endpoint.
export INSTANCE_TYPE="ml.g6e.xlarge"
aws sagemaker create-endpoint-config \
--endpoint-config-name $ENDPOINT_NAME \
--production-variants '[{
"VariantName": "AllTraffic",
"ModelName": "'"$ENDPOINT_NAME"'",
"InstanceType": "'"$INSTANCE_TYPE"'",
"InitialInstanceCount": 1
}]'
aws sagemaker create-endpoint \
--endpoint-name $ENDPOINT_NAME \
--endpoint-config-name $ENDPOINT_NAME
Wait for the endpoint to be InService (this can take several minutes). Check status:
aws sagemaker describe-endpoint --endpoint-name $ENDPOINT_NAME --query 'EndpointStatus' --output text
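Instead of polling manually, you can block until the endpoint is ready. A sketch using boto3's built-in endpoint_in_service waiter (helper names are illustrative; the waiter itself is part of the standard SageMaker client):

```python
def still_provisioning(status):
    """True while the endpoint is being created or updated (not yet terminal)."""
    return status in {"Creating", "Updating", "SystemUpdating", "RollingBack"}

def wait_until_in_service(endpoint_name, region=None):
    """Block until the endpoint reaches InService; raises if it fails."""
    import boto3  # requires AWS credentials at call time
    sm = boto3.client("sagemaker", region_name=region)
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```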
Verify the Deployment#
Invoke the endpoint using the SageMaker Runtime API. The request body is the same OpenAI-compatible JSON as other NIM deployments.
aws sagemaker-runtime invoke-endpoint \
--endpoint-name $ENDPOINT_NAME \
--content-type "application/json" \
--body '{"model":"llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
/tmp/response.json
cat /tmp/response.json | python3 -m json.tool
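The same invocation can be done with boto3's SageMaker Runtime client (helper names are illustrative; invoke_endpoint with EndpointName, ContentType, and Body is the standard API, and the response Body is a stream you must read):

```python
import json

def build_chat_body(model, prompt, max_tokens=128):
    """OpenAI-compatible request body, same shape as the CLI example above."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

def invoke(endpoint_name, body, region=None):
    """Invoke the SageMaker endpoint; requires AWS credentials."""
    import boto3
    rt = boto3.client("sagemaker-runtime", region_name=region)
    resp = rt.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    return json.loads(resp["Body"].read())
```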
SageMaker performs container health checks automatically against the container's /ping path; invoke-endpoint always routes to the inference path, so you cannot call an arbitrary health endpoint through it. To confirm the deployment is healthy, verify that the chat completions request above returns a valid response.
Teardown#
Delete the endpoint, endpoint config, and model to avoid ongoing charges:
aws sagemaker delete-endpoint --endpoint-name $ENDPOINT_NAME
aws sagemaker delete-endpoint-config --endpoint-config-name $ENDPOINT_NAME
aws sagemaker delete-model --model-name $ENDPOINT_NAME
Delete the endpoint first and wait for it to finish deleting before deleting the endpoint config. Optionally remove the ECR image and S3 model data if no longer needed.