AWS#
NIM supports deployment on Amazon Web Services (AWS) through two approaches:
EKS (Amazon Elastic Kubernetes Service): Self-managed Kubernetes deployment using Helm charts with GPU-accelerated nodes.
SageMaker: Fully managed ML platform with managed real-time inference endpoints that use S3 for model storage.
EKS Deployment#
This guide walks through deploying NIM LLM on Amazon Elastic Kubernetes Service (EKS) with GPU-accelerated nodes.
Prerequisites#
Before you begin, ensure you have the following:
An AWS account with permissions to create EKS clusters, EC2 GPU instances, and associated networking and IAM resources.
Sufficient GPU instance quota in your target region. Verify in the AWS Console under Service Quotas > Amazon EC2 > “Running On-Demand [instance-type] instances”.
An NGC account and API key.
The following CLI tools installed: aws (the AWS CLI), eksctl, kubectl, and helm. All four are used in the commands in this guide.
Configure AWS Credentials#
Set up a named AWS CLI profile and confirm that it works before you create infrastructure.
Create a Named Profile#
Use a named profile so the AWS CLI and related tools use the expected account and region.
Create the AWS credentials directory, and open the credentials file:
mkdir -p ~/.aws
nano ~/.aws/credentials
Add your profile to ~/.aws/credentials:
[my-profile]
aws_access_key_id=<your-access-key>
aws_secret_access_key=<your-secret-key>
aws_session_token=<your-session-token>
Set your profile and default region:
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-east-2
Verify that the profile works:
aws sts get-caller-identity
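On success, the command prints a small JSON document containing UserId, Account, and Arn for the active credentials. As a quick sanity check, extracting and validating the account ID can be sketched in Python (the identifier values below are placeholders, not real credentials):

```python
import json

# Placeholder for the JSON that `aws sts get-caller-identity` prints.
sample = """
{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/example"
}
"""

identity = json.loads(sample)
account_id = identity["Account"]

# A 12-digit account ID and an IAM ARN indicate the profile resolved correctly.
assert account_id.isdigit() and len(account_id) == 12
assert identity["Arn"].startswith("arn:aws:iam::")
print(account_id)
```

The same field is what later commands in this guide read with `--query Account`.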
Check GPU Instance Quota#
Confirm that your target region has enough GPU quota before you create the cluster.
Set the AWS region:
export AWS_REGION="us-east-2"
List the relevant GPU quotas:
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --region $AWS_REGION \
  --query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].{Name:QuotaName, Value:Value}" \
  --output table
Use the quota output to confirm the instance families you plan to use:
G-series (g5.xlarge, g5.12xlarge, and so on) requires quota code L-DB2E81BA.
P-series (p4d.24xlarge, p5.48xlarge) requires quota code L-417A185B.
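The family-to-quota-code mapping above can be expressed as a small lookup. The helper below is purely illustrative (it is not part of any AWS SDK), and the codes are the ones listed in this section:

```python
# Service Quotas codes for On-Demand GPU instance families (from the list above).
QUOTA_CODES = {
    "g": "L-DB2E81BA",  # G-series: g5.xlarge, g5.12xlarge, ...
    "p": "L-417A185B",  # P-series: p4d.24xlarge, p5.48xlarge, ...
}

def quota_code(instance_type: str) -> str:
    """Return the Service Quotas code for a GPU instance type name."""
    family = instance_type[0].lower()
    try:
        return QUOTA_CODES[family]
    except KeyError:
        raise ValueError(f"no GPU quota code known for {instance_type!r}")

print(quota_code("g5.xlarge"))     # L-DB2E81BA
print(quota_code("p4d.24xlarge"))  # L-417A185B
```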
Set Up the EKS Cluster#
Create the control plane first, then add GPU-enabled worker nodes and the supporting storage and GPU components.
Set Environment Variables#
Start by defining the values that the EKS setup commands use.
export AWS_REGION="us-east-2"
export EKS_CLUSTER="my-eks-cluster"
export GPU_NODEGROUP="gpu-nodegroup"
Create the Cluster#
Follow these steps to create the EKS control plane and verify connectivity:
Create the EKS control plane (without any node group). Refer to the AWS EKS documentation for additional cluster creation options.
eksctl create cluster \
  --name $EKS_CLUSTER \
  --region $AWS_REGION \
  --version 1.31 \
  --without-nodegroup
After the cluster is ready, verify connectivity:
kubectl get svc
Tip
To update kubeconfig for an existing cluster, run:
aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER
Add a GPU Node Group#
Add a GPU-capable managed node group after the EKS control plane is ready.
Create the GPU node group:
eksctl create nodegroup \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name $GPU_NODEGROUP \
  --node-type g5.xlarge \
  --nodes 1 \
  --node-ami-family AmazonLinux2 \
  --node-volume-size 200
After the node group is ready, verify that the node joined the cluster:
kubectl get nodes -o wide
The following command options determine the key characteristics of the GPU node group:
--node-type: Select a value based on your GPU requirements.
--nodes 1: Number of GPU instances to launch.
--node-ami-family AmazonLinux2: Uses the GPU-optimized AMI with pre-installed NVIDIA drivers.
--node-volume-size 200: Provides 200 GB of disk space for model storage and container images.
Install the NVIDIA GPU Operator#
The GPU Operator manages the device plugin, container runtime, and other NVIDIA components that are required for GPU scheduling. Refer to the GPU Operator documentation for details.
Install the GPU Operator after the node group is available.
Add the NVIDIA Helm repository, and update your local chart index:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
Install the GPU Operator:
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator \
  --wait --generate-name
Verify that the GPU Operator pods are ready:
kubectl -n gpu-operator get pods
Verify that the nodes advertise GPU capacity:
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
Install the Amazon EBS CSI Driver#
The Amazon EBS CSI driver is required for dynamic PersistentVolumeClaim provisioning.
Create an IAM service account:
eksctl create iamserviceaccount \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve
Get your AWS account ID:
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
Install the addon:
eksctl create addon \
  --cluster $EKS_CLUSTER \
  --region $AWS_REGION \
  --name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole
Verify the installation:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Verify the Cluster Is Ready#
Run the following checks to confirm that the cluster is ready for deployment.
Confirm that the nodes are present:
kubectl get nodes -o wide
Confirm that the nodes advertise GPU capacity:
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
Confirm that the Amazon EBS CSI Driver pods are running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
Confirm that the GPU Operator pods are running:
kubectl -n gpu-operator get pods
Deploy NIM#
After the cluster is ready, create the namespace, configure secrets and storage, and install the Helm chart.
Create the Namespace and Secrets#
Create the Kubernetes namespace and the secrets that the deployment uses.
Set the deployment variables:
export NAMESPACE="nim-llm"
export RELEASE_NAME="my-nim"
export NGC_API_KEY="<your-ngc-api-key>"
Create the namespace:
kubectl create namespace $NAMESPACE
Create the NGC registry secret:
kubectl create secret docker-registry ngc-secret \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
Create the NGC API key secret:
kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
Tip
If your model requires a gated Hugging Face repository, create an additional secret with your Hugging Face token:
kubectl create secret generic hf-token \
--namespace $NAMESPACE \
--from-literal=HF_TOKEN="<your-HF-token>"
Create a Model Cache PVC#
Create a persistent volume claim so model artifacts can be reused across pod restarts.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
  namespace: $NAMESPACE
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 200Gi
EOF
Note
gp3 is also a valid storageClassName.
Create a Helm Values File#
Save the following as custom-values-eks.yaml:
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent
model:
  name: meta/llama-3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO
env:
  - name: NIM_MODEL_PROFILE
    value: "<profile-hash>"
podSecurityContext:
  runAsUser: 0
  runAsGroup: 0
  fsGroup: 0
persistence:
  enabled: true
  existingClaim: "nvidia-nim-cache-pvc"
resources:
  limits:
    nvidia.com/gpu: 1
imagePullSecrets:
  - name: ngc-secret
nodeSelector:
  nvidia.com/gpu.present: "true"
service:
  type: LoadBalancer
  openaiPort: 8000
Note
The example above deploys a single-GPU model on a g5.xlarge instance. For larger models requiring multiple GPUs, use an instance type with sufficient GPUs (for example, p4d.24xlarge for 8x A100) and set resources.limits.nvidia.com/gpu to the corresponding GPU count in your values file.
Note
Replace <profile-hash> with a valid profile ID for your chosen model and GPU.
To list available profiles, refer to Model Profiles and Selection.
Install the Helm Chart#
Download the NIM Helm chart, and then install it.
Download the chart package:
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version>.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY
Install the chart:
helm install $RELEASE_NAME nim-llm-<version>.tgz \
  --namespace $NAMESPACE \
  -f custom-values-eks.yaml
Monitor and Verify the Deployment#
Use the following steps to monitor the rollout and verify that the service is reachable.
Watch the pod status:
kubectl -n $NAMESPACE get pods -w
Stream the NIM logs:
kubectl -n $NAMESPACE logs -l app.kubernetes.io/name=nim-llm -f --tail=100
Get the service endpoint:
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
After EXTERNAL-IP is populated, set EXTERNAL_IP:
export EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc $RELEASE_NAME-nim-llm \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
Tip
If the LoadBalancer external IP is not available, use port-forwarding instead:
kubectl -n $NAMESPACE port-forward svc/$RELEASE_NAME-nim-llm 8000:8000
export EXTERNAL_IP=localhost
Check the readiness endpoint:
curl -s "http://$EXTERNAL_IP:8000/v1/health/ready"
List the available models:
curl -s "http://$EXTERNAL_IP:8000/v1/models" | python3 -m json.tool
Send a test chat completion request:
curl -X POST "http://$EXTERNAL_IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
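The same request can be issued from Python using only the standard library. This is a sketch, not part of NIM itself: the helper names `build_chat_request` and `send_chat_request` are illustrative, and it assumes the service is reachable at the `EXTERNAL_IP` set earlier on port 8000. The payload mirrors the OpenAI-compatible chat completions schema used by the curl example above.

```python
import json
import os
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only perform the network call when EXTERNAL_IP is actually set.
if __name__ == "__main__" and "EXTERNAL_IP" in os.environ:
    base_url = f"http://{os.environ['EXTERNAL_IP']}:8000"
    payload = build_chat_request("meta/llama-3.1-8b-instruct", "Hello!")
    print(json.dumps(send_chat_request(base_url, payload), indent=2))
```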
Teardown#
Delete the deployment resources when you no longer need them.
Remove the Helm release:
helm uninstall $RELEASE_NAME -n $NAMESPACE
Delete the persistent volume claim and namespace:
kubectl delete pvc nvidia-nim-cache-pvc -n $NAMESPACE
kubectl delete namespace $NAMESPACE
Delete the EKS cluster. For additional options, refer to the AWS documentation.
eksctl delete cluster --name $EKS_CLUSTER --region $AWS_REGION
SageMaker Deployment#
This guide covers deploying NIM LLM on Amazon SageMaker using managed real-time inference endpoints. Model storage is S3 only: SageMaker cannot pull models from NGC or Hugging Face at runtime. You can either mirror an NGC model to S3 or serve your own model already stored in S3.
Prerequisites#
Before you begin, ensure you have the following:
An AWS account with permissions to create SageMaker endpoints, ECR repositories, and S3 buckets.
An NGC API key if you use an NGC-based NIM image or mirror models to S3.
The following tools installed:
AWS CLI (aws), used for the ECR and SageMaker commands in this guide
Docker (for building and pushing the NIM image to ECR)
Your SageMaker execution role must have access to the S3 bucket where model artifacts are stored, and to ECR if you push the container image to your own registry. For more information about IAM roles, refer to SageMaker execution roles.
Model Storage Options#
SageMaker endpoints use S3 for model data. Choose one of the following:
NGC model in S3: Upload a pre-built NGC model to S3 using percent-encoded keys (for example, by using the mirror s3 CLI). At runtime, set NIM_REPOSITORY_OVERRIDE=s3://${S3_BUCKET} so NIM reads the built-in manifest and downloads from S3. Refer to Model Download for S3 mirroring.
Your own model in S3: Store your model in S3 with a normal directory structure (for example, s3://${S3_BUCKET}/path/to/model). Set NIM_MODEL_PATH=s3://${S3_BUCKET}/path/to/model. NIM generates the manifest at startup and downloads from S3.
Authentication to S3 is handled by the SageMaker execution role; you do not need to pass credentials inside the container.
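For intuition about what "percent-encoded keys" means in the first option, the following sketch shows standard percent-encoding with Python's urllib.parse. The model identifier is illustrative, and the exact key scheme the mirror s3 CLI writes may differ; treat this only as a demonstration of the encoding itself.

```python
from urllib.parse import quote, unquote

# Hypothetical NGC-style model identifier (illustrative, not an actual S3 key).
model_id = "meta/llama-3.1-8b-instruct"

# Percent-encode everything, including "/": safe="" disables the default
# exemption for path separators, so the slash becomes %2F.
encoded = quote(model_id, safe="")
print(encoded)  # meta%2Fllama-3.1-8b-instruct

# Decoding restores the original identifier exactly.
assert unquote(encoded) == model_id
```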
Build and Push the NIM Image to ECR#
SageMaker runs your container from an image in Amazon ECR. Build your NIM image (or use an NGC image as the base), then push it to an ECR repository in your account.
Set the AWS region, repository name, and account ID:
export AWS_REGION="${AWS_REGION:-us-east-1}"
export ECR_REPO_NAME="nim-llm"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
Log in to Amazon ECR:
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
Create the ECR repository if it does not already exist:
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION 2>/dev/null || true
Define the source image and target ECR URI:
export NIM_IMAGE="${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1"  # or your image
export ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest"
Pull, tag, and push the image:
docker pull $NIM_IMAGE
docker tag $NIM_IMAGE $ECR_URI
docker push $ECR_URI
Create the SageMaker Model#
Create a SageMaker model resource that references your ECR image and sets the environment variables NIM needs at runtime.
Use an NGC Model Mirrored to S3#
If you have mirrored an NGC model to S3, use the NIM_REPOSITORY_OVERRIDE environment variable to specify the location.
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models) and ${YOUR_NGC_API_KEY} with your NGC API key. Keep NGC_API_KEY secret and never commit it to version control.
Set the required environment variables:
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-llama-8b"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
Create the SageMaker model:
aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NGC_API_KEY": "'"$NGC_API_KEY"'",
      "NIM_REPOSITORY_OVERRIDE": "s3://'"$S3_BUCKET"'",
      "NIM_SERVED_MODEL_NAME": "llama-3.1-8b-instruct"
    }
  }'
Use a Custom Model in S3#
If you are serving your own model from S3, use the NIM_MODEL_PATH environment variable to specify the location.
Replace ${YOUR_S3_BUCKET} with your S3 bucket name (for example, my-nim-models). The bucket must exist in the target region and your SageMaker execution role must have read access to it.
Set the required environment variables:
# Prerequisite: Ensure $ECR_URI is set from the "Build and Push the NIM Image to ECR" section above
export ENDPOINT_NAME="nim-custom-model"
export EXECUTION_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/YourSageMakerExecutionRole"
export S3_BUCKET="${YOUR_S3_BUCKET}"
export S3_MODEL_URI="s3://${S3_BUCKET}/path/to/model"
Create the SageMaker model:
aws sagemaker create-model \
  --model-name $ENDPOINT_NAME \
  --execution-role-arn $EXECUTION_ROLE_ARN \
  --primary-container '{
    "Image": "'"$ECR_URI"'",
    "Environment": {
      "NIM_MODEL_PATH": "'"$S3_MODEL_URI"'",
      "NIM_SERVED_MODEL_NAME": "my-model"
    }
  }'
Use the same ENDPOINT_NAME for the model, endpoint config, and endpoint so they are linked.
Create Endpoint Config and Endpoint#
Create an endpoint configuration, create the endpoint, and then wait for it to become ready.
Set the instance type:
export INSTANCE_TYPE="ml.g6e.xlarge"
Create the endpoint configuration:
aws sagemaker create-endpoint-config \
  --endpoint-config-name $ENDPOINT_NAME \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "'"$ENDPOINT_NAME"'",
    "InstanceType": "'"$INSTANCE_TYPE"'",
    "InitialInstanceCount": 1
  }]'
Create the endpoint:
aws sagemaker create-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --endpoint-config-name $ENDPOINT_NAME
Check the endpoint status until it is InService (this can take several minutes):
aws sagemaker describe-endpoint --endpoint-name $ENDPOINT_NAME --query 'EndpointStatus' --output text
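The status check can be wrapped in a simple poll loop rather than run by hand. The sketch below is generic: `fetch_status` is a stand-in for running the describe-endpoint command (for example via subprocess or boto3), and the status names match those used by SageMaker endpoints.

```python
import time

def wait_for_status(fetch_status, target="InService",
                    interval_s=30, timeout_s=1800, sleep=time.sleep):
    """Poll fetch_status() until it returns `target`.

    fetch_status: zero-argument callable returning the current status string.
    Raises RuntimeError on a Failed status and TimeoutError on timeout.
    """
    waited = 0
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status == "Failed":
            raise RuntimeError("endpoint entered Failed state")
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

In practice, fetch_status would shell out to the aws sagemaker describe-endpoint command shown above and return its output.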
Verify the Deployment#
Invoke the endpoint using the SageMaker Runtime API. The request body is the same OpenAI-compatible JSON as other NIM deployments.
Invoke the endpoint:
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type "application/json" \
  --body '{"model":"llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}' \
  /tmp/response.json
Format the response:
python3 -m json.tool /tmp/response.json
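Because the body follows the OpenAI chat completions shape, the generated text lives under choices[0].message.content. A minimal extraction sketch (the sample response below is fabricated for illustration):

```python
# Fabricated example of an OpenAI-style chat completions response body.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "llama-3.1-8b-instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
}

def completion_text(response: dict) -> str:
    """Extract the assistant text from a chat completions response."""
    return response["choices"][0]["message"]["content"]

print(completion_text(sample_response))
```

To apply it to the saved file: `import json; print(completion_text(json.load(open("/tmp/response.json"))))`.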
For health checks, use the same invoke-endpoint command with a request to your health path if your container exposes it. Otherwise, confirm that the chat completions request above returns a valid response.
Teardown#
Delete the SageMaker resources when you no longer need them so you avoid ongoing charges.
Delete the endpoint:
aws sagemaker delete-endpoint --endpoint-name $ENDPOINT_NAME
After the endpoint deletion finishes, delete the endpoint configuration:
aws sagemaker delete-endpoint-config --endpoint-config-name $ENDPOINT_NAME
Delete the model:
aws sagemaker delete-model --model-name $ENDPOINT_NAME
Optional: Remove the ECR image and S3 model data if you no longer need them.