Air Gap Deployment for NVIDIA NIM for LLMs
NVIDIA NIM for large language models (LLMs) supports serving models in an air-gapped system (also known as an air wall, air gap, or disconnected network). In an air-gapped system, you can run a NIM with no internet connection and no connection to the NGC registry.
Before you use this documentation, review all prerequisites and instructions in Get Started with NIM, and see Serving models from local assets.
You have two options for air gap deployment: offline cache and local model directory.
Air Gap Deployment (offline cache option)
If NIM detects a previously loaded profile in the cache, it serves that profile from the cache. After downloading the profiles to the cache by using download-to-cache, you can transfer the cache to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.
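On a system with internet access, you typically populate the cache by running download-to-cache inside the NIM container before transferring it. The following is a minimal sketch; the exact invocation can vary by NIM version, and the profile ID shown is the example used later in this section (omit --profile to download the optimal profile for the detected GPUs). This step does require NGC_API_KEY.
# On the internet-connected system: populate the local cache before the transfer.
# IMG_NAME and LOCAL_NIM_CACHE are assumed to be set as in Get Started with NIM.
export NGC_API_KEY=<your_ngc_api_key>
docker run --rm \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME \
download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b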
Do NOT provide the NGC_API_KEY when you run the NIM on the air-gapped system, as shown in the following example.
# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"
# Transfer the downloaded NIM cache to the air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose an LLM NIM image from NGC
export IMG_NAME="nvcr.io/${Repository}:latest"
# Assuming the command run prior was `download-to-cache` with no profile specified, which downloads the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
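After the container starts, you can confirm that the profile is being served from the cache by querying the OpenAI-compatible API on port 8000. A minimal check against the /v1/models endpoint follows; no internet access is required.
# Verify that the NIM is serving the cached model on the air-gapped system
curl -s http://localhost:8000/v1/models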
Air Gap Deployment (local model directory option)
Another option for air gap deployment is to create a model repository for a single model by using the create-model-store command within the NIM container, and then deploy that repository, as shown in the following example.
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
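As with the offline cache option, the resulting model repository must be transferred to the air-gapped system by whatever mechanism your environment permits; the destination path below is only a placeholder.
# Transfer the model repository to the air-gapped system (destination path is a placeholder)
cp -r /path/to/model-repository /path/to/air-gap-model-repository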
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose an LLM NIM image from NGC
export IMG_NAME="nvcr.io/${Repository}:latest"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model
# Note: For the vLLM backend, specify the following environment variables
# to set the required parallel sizes. The default value for each is 1.
export NIM_TENSOR_PARALLEL_SIZE=<required_value>
export NIM_PIPELINE_PARALLEL_SIZE=<required_value>
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-e NIM_TENSOR_PARALLEL_SIZE \
-e NIM_PIPELINE_PARALLEL_SIZE \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
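Once the container is running, the model is served under the name you set in NIM_SERVED_MODEL_NAME. A minimal test request against the OpenAI-compatible chat completions endpoint follows; the prompt and token limit are illustrative only.
# Send a test request; the model name matches NIM_SERVED_MODEL_NAME (my-model in this example)
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [{"role": "user", "content": "Write a short greeting."}],
"max_tokens": 32
}'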
IMPORTANT:
The previous step results in a vLLM backend being used to run the model. Some model directories with a Hugging Face structure might not work with the vLLM backend due to version differences in NIM for LLMs. In such cases, use NIM_FT_MODEL instead of NIM_MODEL_NAME for air gap deployments. This serves the Hugging Face model after converting it to a TensorRT-LLM engine structure. More detailed steps are available in Fine-Tuned Model Support.
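For reference, a minimal sketch of the previous run command with NIM_FT_MODEL substituted for NIM_MODEL_NAME is shown below. Depending on the model, additional variables may be required; see Fine-Tuned Model Support for the complete procedure.
# Sketch: serve a Hugging Face-style model directory by converting it to a
# TensorRT-LLM engine at startup (NIM_FT_MODEL replaces NIM_MODEL_NAME)
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_FT_MODEL=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME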