Air Gap Deployment for NVIDIA NIM for LLMs#
NVIDIA NIM for large language models (LLMs) supports serving models in an air gap system (also known as air wall, air-gapping or disconnected network). Before you use this documentation, review all prerequisites and instructions in Getting Started, and see Serving models from local assets.
Air Gap Deployment (offline cache route)#
If NIM detects a previously loaded profile in the cache, it serves that profile from the cache.
After downloading the profiles to the cache using download-to-cache, the cache can be transferred to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.
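For reference, the cache is typically populated on an internet-connected host before the transfer. The following is a minimal sketch of that step, assuming the $IMG_NAME, $LOCAL_NIM_CACHE, and $NGC_API_KEY variables from Getting Started and the download-to-cache utility described in Serving models from local assets; packaging the cache as a tarball is only one possible way to move it across the air gap (the example below simulates the transfer with a local copy).
# On an internet-connected host: populate the local cache (requires NGC_API_KEY)
docker run -it --rm \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME \
download-to-cache
# Optionally package the cache for transfer across the air gap (for example, via removable media)
tar -czf nim-cache.tar.gz -C "$LOCAL_NIM_CACHE" .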
To confirm that the container serves entirely from the cache, do NOT provide the NGC_API_KEY, as shown in the following example.
# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"
# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"
# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
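Once the container reports that it is ready, you can verify offline serving with a request to the OpenAI-compatible API on port 8000. This is a minimal smoke test; the model name to pass in the request is whatever the /v1/models endpoint reports (for this image it is typically meta/llama-3.1-8b-instruct, but confirm with the first query below).
# List the models exposed by the running NIM
curl -s http://0.0.0.0:8000/v1/models
# Send a small chat completion request
curl -s http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 32
}'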
# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
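The profile ID passed to NIM_MODEL_PROFILE must match a profile that is already present in the transferred cache. If you are unsure which profiles were downloaded, you can list them directly from the air-gapped cache; this is a sketch assuming the list-model-profiles utility described in Serving models from local assets.
# List the profiles present in the air-gapped cache (no NGC access required)
docker run -it --rm \
--runtime=nvidia \
--gpus all \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME \
list-model-profiles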
Air Gap Deployment (local model directory route)#
Another option for air-gapped deployment is to serve a model from a local model directory. Use the create-model-store
command within the NIM container to create a repository for a single model, as shown in the following example.
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
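Because create-model-store runs inside the NIM container, mount a host directory to receive the model store. The following is a minimal sketch under the assumption that the profile is already in the mounted $LOCAL_NIM_CACHE; if it is not cached, the command also needs NGC access, so run this step on an internet-connected host and then transfer /path/to/model-repository to the air-gapped system.
# Run create-model-store inside the NIM container, writing the store to a mounted host path
docker run -it --rm \
--runtime=nvidia \
--gpus all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v /path/to/model-repository:/model-repository \
-u $(id -u) \
$IMG_NAME \
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /model-repository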
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model
# Note: For the vLLM backend, set the following environment variables
# to the required parallel sizes. The default value for each is 1.
export NIM_TENSOR_PARALLEL_SIZE=<required_value>
export NIM_PIPELINE_PARALLEL_SIZE=<required_value>
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
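After the server starts, requests use the name configured via NIM_SERVED_MODEL_NAME (my-model in this example). A minimal smoke test against the OpenAI-compatible endpoint:
# Query the model under the name set in NIM_SERVED_MODEL_NAME
curl -s http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [{"role": "user", "content": "Write a haiku about air-gapped systems."}],
"max_tokens": 64
}'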