AWS#

NIM supports deployment on Amazon Web Services (AWS) through two approaches:

  • EKS (Amazon Elastic Kubernetes Service): Self-managed Kubernetes deployment using Helm charts with GPU-accelerated nodes.

  • SageMaker: Fully managed ML platform with managed real-time inference endpoints that use S3 for model storage.

EKS Deployment#

This guide walks through deploying NIM LLM on Amazon Elastic Kubernetes Service (EKS) with GPU-accelerated nodes.

Prerequisites#

Before you begin, ensure you have the following:

  • An AWS account with permissions to create EKS clusters, EC2 GPU instances, and associated networking and IAM resources.

  • Sufficient GPU instance quota in your target region. Verify in the AWS Console under Service Quotas > Amazon EC2 > “Running On-Demand [instance-type] instances”.

  • An NGC account and API key.

  • The following CLI tools installed: the AWS CLI, eksctl, kubectl, and Helm.
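
As a quick preflight, a small helper (a sketch, not part of any official tooling) can confirm the tools are on your PATH before you start:

```shell
# Sketch: fail fast if any required CLI tool is missing from PATH.
require_tools() {
  local missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools aws eksctl kubectl helm
```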

Configure AWS Credentials#

Create a Named Profile#

Add your credentials to ~/.aws/credentials:

mkdir -p ~/.aws
nano ~/.aws/credentials

Add a named profile with your keys (replace the placeholder values):

[my-profile]
aws_access_key_id=<your-access-key>
aws_secret_access_key=<your-secret-key>
aws_session_token=<your-session-token>

Set your profile and default region, then verify:

export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-east-2

aws sts get-caller-identity

Check GPU Instance Quota#

Verify you have sufficient GPU instance quota in your target region:

export AWS_REGION="us-east-2"

aws service-quotas list-service-quotas \
  --service-code ec2 \
  --region $AWS_REGION \
  --query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].{Name:QuotaName, Value:Value}" \
  --output table

# G-series (g5.xlarge, g5.12xlarge, and so on): Requires quota code L-DB2E81BA, minimum 4 vCPUs for g5.xlarge
# P-series (p4d.24xlarge, p5.48xlarge): Requires quota code L-417A185B
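
These quota values are measured in vCPUs rather than instances. A quick sanity check (a sketch; confirm per-instance vCPU counts against the EC2 instance type tables):

```shell
# Sketch: quota needed = per-instance vCPUs * planned instance count.
vcpus_needed() {
  echo $(( $1 * $2 ))
}

vcpus_needed 4 2    # two g5.xlarge nodes (4 vCPUs each)  -> 8
vcpus_needed 48 1   # one g5.12xlarge node (48 vCPUs)     -> 48
```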

Set Up the EKS Cluster#

Set Environment Variables#

export AWS_REGION="us-east-2"
export EKS_CLUSTER="my-eks-cluster"
export GPU_NODEGROUP="gpu-nodegroup"

Create the Cluster#

Refer to the AWS EKS documentation for additional cluster creation options.

eksctl create cluster \
  --name $EKS_CLUSTER \
  --region $AWS_REGION \
  --version 1.31 \
  --without-nodegroup

After the cluster is ready, verify connectivity:

kubectl get svc

Tip

To update kubeconfig for an existing cluster, run:

aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER

Add a GPU Node Group#

eksctl create nodegroup \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name $GPU_NODEGROUP \
  --node-type g5.xlarge \
  --nodes 1 \
  --node-ami-family AmazonLinux2 \
  --node-volume-size 200

Once the node group is ready, verify the node has joined:

kubectl get nodes -o wide

Key flags in the eksctl create nodegroup command above:

  • --node-type: Select a value based on your GPU requirements.

  • --nodes 1: Number of GPU instances to launch.

  • --node-ami-family AmazonLinux2: Uses the GPU-optimized AMI with pre-installed NVIDIA drivers.

  • --node-volume-size 200: Provides 200 GB of disk space for model storage and container images.

Install the NVIDIA GPU Operator#

The GPU Operator manages the device plugin, container runtime, and other NVIDIA components that are required for GPU scheduling. Refer to the GPU Operator documentation for details.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator \
  --wait --generate-name

Wait for all pods to become ready:

kubectl -n gpu-operator get pods

kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

Install the Amazon EBS CSI Driver#

The EBS CSI driver is required for dynamic PersistentVolumeClaim provisioning.

Create an IAM service account:

eksctl create iamserviceaccount \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

Install the addon:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

eksctl create addon \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole

Verify:

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

Verify the Cluster Is Ready#

kubectl get nodes -o wide

kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

kubectl -n gpu-operator get pods
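
The checks above can be scripted. The helper below (an illustrative sketch, not official tooling) fails unless at least one node reports an allocatable GPU; it parses the two-column output of the custom-columns command:

```shell
# Sketch: succeed only if at least one node advertises >= 1 allocatable GPU.
# Feed it the custom-columns output above, minus the header row.
gpu_ready() {
  awk '$2 ~ /^[0-9]+$/ && $2 >= 1 { found = 1 } END { exit !found }'
}

# Usage:
# kubectl get nodes \
#   -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu \
#   | tail -n +2 | gpu_ready && echo "GPU capacity available"
```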

Deploy NIM#

Create the Namespace and Secrets#

export NAMESPACE="nim-llm"
export RELEASE_NAME="my-nim"
export NGC_API_KEY="<your-ngc-api-key>"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry ngc-secret \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

Tip

If your model requires a gated Hugging Face repository, create an additional secret with your Hugging Face token:

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="<your-HF-token>"

Create a Model Cache PVC#

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
  namespace: $NAMESPACE
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 200Gi
EOF

Note

gp3 is also a valid storageClassName, provided a gp3 StorageClass exists in the cluster.
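
EKS creates a gp2 StorageClass by default, but not a gp3 one. A minimal gp3 StorageClass for the EBS CSI driver installed earlier might look like this (a sketch; adjust parameters to your needs):

```yaml
# Sketch: a gp3 StorageClass backed by the EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```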

Create a Helm Values File#

Save the following as custom-values-eks.yaml:

image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.0"
  pullPolicy: IfNotPresent

model:
  name: meta/llama-3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO

env:
  - name: NIM_MODEL_PROFILE
    value: "<profile-hash>"

podSecurityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0

persistence:
  enabled: true
  existingClaim: "nvidia-nim-cache-pvc"

resources:
  limits:
    nvidia.com/gpu: 1

imagePullSecrets:
  - name: ngc-secret

nodeSelector:
  nvidia.com/gpu.present: "true"

service:
  type: LoadBalancer
  openaiPort: 8000

Note

The example above deploys a single-GPU model on a g5.xlarge instance. For larger models requiring multiple GPUs, use an instance type with sufficient GPUs (for example, p4d.24xlarge for 8x A100) and set resources.limits.nvidia.com/gpu to the corresponding GPU count in your values file.

Note

Replace <profile-hash> with a valid profile ID for your chosen model and GPU. To list available profiles, refer to Model Profiles and Selection.

Install the Helm Chart#

Download the NIM Helm chart and install it:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version>.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY

helm install $RELEASE_NAME nim-llm-<version>.tgz \
  --namespace $NAMESPACE \
  -f custom-values-eks.yaml

Monitor and Verify the Deployment#

Watch pod status and logs:

kubectl -n $NAMESPACE get pods -w

kubectl -n $NAMESPACE logs -l app.kubernetes.io/name=nim-llm -f --tail=100

Get the service endpoint:

kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm

Wait for EXTERNAL-IP to be populated, then:

export EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc $RELEASE_NAME-nim-llm \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
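
Provisioning the load balancer can take a few minutes. A small polling helper (a sketch; the command to poll is passed as arguments) saves re-running the export by hand:

```shell
# Sketch: re-run the given command until it prints a non-empty value,
# then echo that value. Gives up after ~5 minutes.
wait_for_value() {
  local tries=0 value=""
  while [ -z "$value" ]; do
    tries=$((tries + 1))
    [ "$tries" -gt 30 ] && return 1
    value=$("$@")
    [ -z "$value" ] && sleep 10
  done
  echo "$value"
}

# Usage:
# export EXTERNAL_IP=$(wait_for_value kubectl -n $NAMESPACE get svc $RELEASE_NAME-nim-llm \
#   -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
```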

Tip

If the LoadBalancer external IP is not available, use port-forwarding instead:

kubectl -n $NAMESPACE port-forward svc/$RELEASE_NAME-nim-llm 8000:8000
export EXTERNAL_IP=localhost

Verify the deployment:

curl -s "http://$EXTERNAL_IP:8000/v1/health/ready"

curl -s "http://$EXTERNAL_IP:8000/v1/models" | python3 -m json.tool

curl -X POST "http://$EXTERNAL_IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
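
To pull just the assistant text out of the chat completions response, a small filter (a sketch; it assumes the standard OpenAI-style choices[0].message.content shape) can be piped after curl:

```shell
# Sketch: print the assistant message from an OpenAI-style chat response on stdin.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage:
# curl -s -X POST "http://$EXTERNAL_IP:8000/v1/chat/completions" ... | extract_reply
```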

Teardown#

Remove the NIM deployment:

helm uninstall $RELEASE_NAME -n $NAMESPACE
kubectl delete pvc nvidia-nim-cache-pvc -n $NAMESPACE
kubectl delete namespace $NAMESPACE

To delete the EKS cluster and its node groups (refer to the AWS documentation for other cleanup considerations):

eksctl delete cluster --name $EKS_CLUSTER --region $AWS_REGION

SageMaker Deployment#

This guide covers deploying NIM LLM on Amazon SageMaker using managed real-time inference endpoints. Model storage is S3 only: SageMaker cannot pull models from NGC or Hugging Face at runtime. You can either mirror an NGC model to S3 or serve your own model already stored in S3.

Prerequisites#

Before you begin, ensure you have:

  • An AWS account with permissions to create SageMaker endpoints, ECR repositories, and S3 buckets.

  • An NGC API key if you use an NGC-based NIM image or mirror models to S3.

  • The following CLI tools installed: the AWS CLI and Docker.

Your SageMaker execution role must have access to the S3 bucket where model artifacts are stored, and to ECR if you push the container image to your own registry. For more information about IAM roles, refer to SageMaker execution roles.

Model Storage Options#

SageMaker endpoints use S3 for model data. Choose one of the following:

  • NGC model in S3: Upload a pre-built NGC model to S3 using percent-encoded keys (for example, by using the mirror s3 CLI). At runtime, set NIM_REPOSITORY_OVERRIDE=s3://${S3_BUCKET} so NIM reads the baked-in manifest and downloads from S3. Refer to Model Download for S3 mirroring.

  • Your own model in S3: Store your model in S3 with normal directory structure (for example, s3://${S3_BUCKET}/path/to/model). Set NIM_MODEL_PATH=s3://${S3_BUCKET}/path/to/model. NIM generates the manifest at startup and downloads from S3.

Authentication to S3 is handled by the SageMaker execution role; you do not need to pass credentials inside the container.

Build and Push the NIM Image to ECR#

SageMaker runs your container from an image in Amazon ECR. Build your NIM image (or use an NGC image as the base), then push it to an ECR repository in your account.

  1. Create an ECR repository (if needed):

export AWS_REGION="${AWS_REGION:-us-east-1}"
export ECR_REPO_NAME="nim-llm"
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION 2>/dev/null || true

  2. Tag and push the image:

export NIM_IMAGE="${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.0"   # or your image
export ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest"
docker pull $NIM_IMAGE
docker tag $NIM_IMAGE $ECR_URI
docker push $ECR_URI

Set ${AWS_ACCOUNT_ID} to your AWS account ID before running these commands, for example: export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text).

Create the SageMaker Model#

Create a SageMaker model resource that references your ECR image and sets the environment variables NIM needs at runtime.

Option A: NGC model mirrored to S3 (use NIM_REPOSITORY_OVERRIDE)

Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models) and ${YOUR_NGC_API_KEY} with your NGC API key. Keep NGC_API_KEY secret and never commit it to version control.

# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-llama-8b"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"

aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NGC_API_KEY": "'"$NGC_API_KEY"'",
      "NIM_REPOSITORY_OVERRIDE": "s3://'"$S3_BUCKET"'",
      "NIM_SERVED_MODEL_NAME": "llama-3.1-8b-instruct"
    }
  }'

Option B: Your own model in S3 (use NIM_MODEL_PATH pointing to S3)

Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models). The bucket must exist in the target region and your SageMaker execution role must have read access to it.

# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-custom-model"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export S3_MODEL_URI="s3://${S3_BUCKET}/path/to/model"

aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NIM_MODEL_PATH": "'"$S3_MODEL_URI"'",
      "NIM_SERVED_MODEL_NAME": "my-model"
    }
  }'

Use the same ENDPOINT_NAME for the model, endpoint config, and endpoint so they are linked.
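
The --primary-container argument splices shell variables into a JSON string, which is easy to get wrong. A local sanity check (a sketch with hypothetical placeholder values) validates the JSON before any API call:

```shell
# Sketch: build the container JSON in a variable and validate it locally
# before passing it to `aws sagemaker create-model`.
# The default values below are hypothetical placeholders.
ECR_URI="${ECR_URI:-123456789012.dkr.ecr.us-east-1.amazonaws.com/nim-llm:latest}"
S3_MODEL_URI="${S3_MODEL_URI:-s3://my-bucket/path/to/model}"

container_json='{
  "Image": "'"$ECR_URI"'",
  "Environment": {
    "NIM_MODEL_PATH": "'"$S3_MODEL_URI"'",
    "NIM_SERVED_MODEL_NAME": "my-model"
  }
}'

# Valid JSON parses cleanly; quoting mistakes fail here instead of at the API.
echo "$container_json" | python3 -m json.tool >/dev/null && echo "primary-container JSON is valid"
# Then: aws sagemaker create-model ... --primary-container "$container_json"
```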

Create Endpoint Config and Endpoint#

Create an endpoint configuration (instance type and count) and then create the endpoint.

export INSTANCE_TYPE="ml.g6e.xlarge"

aws sagemaker create-endpoint-config \
  --endpoint-config-name $ENDPOINT_NAME \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "'"$ENDPOINT_NAME"'",
    "InstanceType": "'"$INSTANCE_TYPE"'",
    "InitialInstanceCount": 1
  }]'

aws sagemaker create-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --endpoint-config-name $ENDPOINT_NAME

Wait for the endpoint status to become InService (this can take several minutes). Check the status:

aws sagemaker describe-endpoint --endpoint-name $ENDPOINT_NAME --query 'EndpointStatus' --output text

Alternatively, block until the endpoint is ready:

aws sagemaker wait endpoint-in-service --endpoint-name $ENDPOINT_NAME

Verify the Deployment#

Invoke the endpoint using the SageMaker Runtime API. The request body is the same OpenAI-compatible JSON as other NIM deployments. With AWS CLI v2, include --cli-binary-format raw-in-base64-out so the JSON body is sent as-is rather than being interpreted as base64.

aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type "application/json" \
  --cli-binary-format raw-in-base64-out \
  --body '{"model":"llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
  /tmp/response.json

python3 -m json.tool /tmp/response.json

SageMaker performs its own container health checks against the /ping path, and the endpoint does not reach InService until those pass. To verify end to end, confirm that the chat completions request above returns a valid response.

Teardown#

Delete the endpoint, endpoint config, and model to avoid ongoing charges:

aws sagemaker delete-endpoint --endpoint-name $ENDPOINT_NAME
aws sagemaker delete-endpoint-config --endpoint-config-name $ENDPOINT_NAME
aws sagemaker delete-model --model-name $ENDPOINT_NAME

Delete the endpoint first and wait for it to finish deleting before deleting the endpoint config. Optionally remove the ECR image and S3 model data if no longer needed.