Air Gap Deployment for NVIDIA NIM for LLMs#

NVIDIA NIM for large language models (LLMs) supports serving models in an air-gapped system (also known as an air wall, air gap, or disconnected network). Before you use this documentation, review all prerequisites and instructions in Getting Started, and see Serving models from local assets.

Air Gap Deployment (offline cache route)#

If NIM detects a previously downloaded profile in the cache, it serves that profile from the cache. After you download profiles to the cache with download-to-cache, you can transfer the cache to an air-gapped system and run NIM there without any internet connection and with no connection to the NGC registry.
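The following is a minimal sketch of the preparatory steps on an internet-connected host, assuming the download-to-cache utility inside the NIM container and a local cache at $LOCAL_NIM_CACHE (see Serving models from local assets); the archive name is illustrative:

# On the connected host: populate the local cache (requires NGC_API_KEY)
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME download-to-cache

# Package the cache for transfer to the air-gapped system,
# for example on removable media
tar -czf nim-cache.tar.gz -C "$LOCAL_NIM_CACHE" .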

To confirm that the deployment runs fully offline, do NOT provide the NGC_API_KEY when you start the container, as shown in the following example.

# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"

# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"

# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct

# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"

# Assuming the previously run command was `download-to-cache`, which downloaded the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

# Assuming the previously run command was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
  -v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
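
After the container starts from the transferred cache, you can verify the offline deployment with the server's standard endpoints; a quick check, assuming the default port mapping of 8000 used above:

# Confirm the server is ready to accept requests
curl http://0.0.0.0:8000/v1/health/ready

# List the model(s) being served
curl http://0.0.0.0:8000/v1/models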

Air Gap Deployment (local model directory route)#

Another option for air-gapped deployment is to use the create-model-store command within the NIM container to create a repository for a single model, and then deploy that model directory, as shown in the following example.

create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
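# The create-model-store utility runs inside the NIM container. A sketch of
# one way to invoke it, assuming the populated cache is mounted at
# /opt/nim/.cache and the model store is written to a mounted host directory:
docker run -it --rm \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v /path/to/model-repository:/model-repository \
  -u $(id -u) \
  $IMG_NAME \
  create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /model-repository
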
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct

# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model

# Note: For the vLLM backend, set the following environment variables
# to the required parallel sizes. The default value for each is 1.
export NIM_TENSOR_PARALLEL_SIZE=<required_value>
export NIM_PIPELINE_PARALLEL_SIZE=<required_value>

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=/model-repo \
  -e NIM_SERVED_MODEL_NAME \
  -v "$MODEL_REPO:/model-repo" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
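
As a smoke test, you can send a chat completion request to the served model; this assumes the server is up and NIM_SERVED_MODEL_NAME=my-model as set above:

curl -X POST http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64
  }'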