Air Gap Deployment for NVIDIA NIM for LLMs#
NVIDIA NIM for large language models (LLMs) supports serving models in an air gap system (also known as air wall, air-gapping or disconnected network). Before you use this documentation, review all prerequisites and instructions in Getting Started, and see Serving models from local assets.
Air Gap Deployment (offline cache route)#
If NIM detects a previously loaded profile in the cache, it serves that profile from the cache.
After downloading the profiles to the cache using download-to-cache, the cache can be transferred to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.
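For reference, the cache is typically populated on an internet-connected host before the transfer. The following is a minimal sketch of that step, assuming the $IMG_NAME, $LOCAL_NIM_CACHE, and $NGC_API_KEY variables from Getting Started and the download-to-cache utility described in Serving models from local assets; packaging the cache as a tarball is only one possible way to move it across the air gap (the example below simulates the transfer with a local copy).
# On an internet-connected host: populate the local cache (requires NGC_API_KEY)
docker run -it --rm \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME \
download-to-cache
# Optionally package the cache for transfer across the air gap (for example, via removable media)
tar -czf nim-cache.tar.gz -C "$LOCAL_NIM_CACHE" .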
To confirm that the container serves entirely from the cache, do NOT provide the NGC_API_KEY, as shown in the following example.
# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"
# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"
# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
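Once the container reports that it is ready, you can verify offline serving with a request to the OpenAI-compatible API on port 8000. This is a minimal smoke test; the model name to pass in the request is whatever the /v1/models endpoint reports (for this image it is typically meta/llama-3.1-8b-instruct, but confirm with the first query below).
# List the models exposed by the running NIM
curl -s http://0.0.0.0:8000/v1/models
# Send a small chat completion request
curl -s http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 32
}'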
# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
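The profile ID passed to NIM_MODEL_PROFILE must match a profile that is already present in the transferred cache. If you are unsure which profiles were downloaded, you can list them directly from the air-gapped cache; this is a sketch assuming the list-model-profiles utility described in Serving models from local assets.
# List the profiles present in the air-gapped cache (no NGC access required)
docker run -it --rm \
--runtime=nvidia \
--gpus all \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME \
list-model-profiles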
Air Gap Deployment (local model directory route)#
Another option for air-gapped deployment is to serve a model from a local model directory. Use the create-model-store
command within the NIM container to create a repository for a single model, as shown in the following example.
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
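Because create-model-store runs inside the NIM container, mount a host directory to receive the model store. The following is a minimal sketch under the assumption that the profile is already in the mounted $LOCAL_NIM_CACHE; if it is not cached, the command also needs NGC access, so run this step on an internet-connected host and then transfer /path/to/model-repository to the air-gapped system.
# Run create-model-store inside the NIM container, writing the store to a mounted host path
docker run -it --rm \
--runtime=nvidia \
--gpus all \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v /path/to/model-repository:/model-repository \
-u $(id -u) \
$IMG_NAME \
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /model-repository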
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct
# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:1.2.1"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model
# Note: For the vLLM backend, set the following environment variables
# to the required parallel sizes. The default value for each is 1.
export NIM_TENSOR_PARALLEL_SIZE=<required_value>
export NIM_PIPELINE_PARALLEL_SIZE=<required_value>
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
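After the server starts, requests use the name configured via NIM_SERVED_MODEL_NAME (my-model in this example). A minimal smoke test against the OpenAI-compatible endpoint:
# Query the model under the name set in NIM_SERVED_MODEL_NAME
curl -s http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [{"role": "user", "content": "Write a haiku about air-gapped systems."}],
"max_tokens": 64
}'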