Getting Started#
Prerequisites#
Setup#
NVIDIA AI Enterprise License: NVIDIA NIM for VLMs is available for self-hosting under the NVIDIA AI Enterprise (NVAIE) License.
NVIDIA GPU(s): NVIDIA NIM for VLMs (NIM for VLMs) runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.
CPU: x86_64 architecture only for this release
OS: any Linux distribution with glibc >= 2.35 (check the output of ld -v)
CUDA Drivers: Follow the installation guide.
We recommend:
Using a network repository as part of a package manager installation and skipping the CUDA toolkit installation, as the libraries are available within the NIM container
Installing the open kernel modules for a specific version:
Major Version | EOL | Data Center & RTX/Quadro GPUs | GeForce GPUs
---|---|---|---
> 550 | TBD | X | X
550 | Feb 2025 | X | X
545 | Oct 2023 | X | X
535 | June 2026 | X |
525 | Nov 2023 | X |
470 | Sept 2024 | X |
Install Docker.
Install the NVIDIA Container Toolkit.
After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.
To ensure that your setup is correct, run the following command:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
This command should produce output similar to the following, allowing you to confirm the CUDA driver version and available GPUs.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off | 0 |
| N/A 36C P0 112W / 700W | 78489MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Installing WSL2 for Windows#
Certain downloadable NIMs can be used on an RTX Windows system with Windows Subsystem for Linux (WSL). To enable WSL2, perform the following steps.
Be sure your computer can run WSL2 as described in the Prerequisites section of the WSL2 documentation.
Enable WSL2 on your Windows computer by following the steps in Install WSL command. By default, these steps install the Ubuntu distribution of Linux. For alternative installations, see Change the default Linux distribution installed.
Launch NVIDIA NIM for VLMs#
You can download and run the NIM of your choice from either the API catalog or NGC.
From NGC#
Generate an API key#
An NGC API key is required to access NGC resources. The key can be generated here: https://org.ngc.nvidia.com/setup/personal-keys.
When creating an NGC API key, ensure that at least NGC Catalog is selected from the Services Included dropdown. If this key is to be reused for other purposes, more services can be included.

Export the API key#
Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.
If you are not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:
export NGC_API_KEY=<value>
Run one of the following commands to make the key available at startup:
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
Other more secure options include saving the value in a file, which you can retrieve with cat $NGC_API_KEY_FILE, or using a password manager.
Docker Login to NGC#
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry using the following command:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name indicating that you will authenticate with an API key, not a username and password.
List Available NIMs#
This documentation uses the NGC CLI tool in several examples. For information on downloading and configuring the tool, see the NGC CLI documentation.
Use the following command to list the available NIMs in CSV format.
ngc registry image list --format_type csv 'nvcr.io/nim/meta/*vision*'
This command should produce output in the following format:
Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
<name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
...
<nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>
Use the Repository and Latest Tag fields when you call the docker run command, as shown in the following section.
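If you want to select those fields programmatically, you can parse the CSV output with a short Python sketch such as the following (the file name nims.csv is illustrative and is assumed to hold the output of the command above):

# Parse the CSV produced by `ngc registry image list --format_type csv ...`
# and print each repository together with its latest tag.
import csv

with open("nims.csv", newline="") as f:  # illustrative file containing the CSV output
    for row in csv.DictReader(f):
        print(f'{row["Repository"]}:{row["Latest Tag"]}')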
Launch NIM#
The following command launches a Docker container for the meta/llama-3.2-11b-vision-instruct model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.
# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct
# The container name from the previous ngc registry image list command
Repository=llama-3.2-11b-vision-instruct
Latest_Tag=1.1.0
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Docker Run Parameters#
Flags | Description
---|---
-it | Run the container interactively with a terminal attached (see Docker docs).
--rm | Delete the container after it stops (see Docker docs).
--name=$CONTAINER_NAME | Give a name to the NIM container for bookkeeping (here, $CONTAINER_NAME).
--runtime=nvidia | Ensure NVIDIA drivers are accessible in the container.
--gpus all | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs.
--shm-size=16GB | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled.
-e NGC_API_KEY=$NGC_API_KEY | Provide the container with the token necessary to download the appropriate models and resources from NGC. See Export the API key.
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" | Mount a cache directory from your system ($LOCAL_NIM_CACHE) to /opt/nim/.cache inside the container so downloaded models and resources are reused across runs.
-u $(id -u) | Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models in your local cache directory.
-p 8000:8000 | Forward the port where the NIM server is published inside the container to access it from the host system. The left-hand side of the colon is the host port and the right-hand side is the container port.
$IMG_NAME | Name and version of the VLM NIM container from NGC. The VLM NIM server automatically starts if no argument is provided after this.
Note
See the Configuring a NIM topic for information about additional configuration settings.
Note
If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call.
Note
NIM automatically selects the most suitable profile based on your system specifications. For details, see Automatic Profile Selection.
Run Inference#
During startup, the NIM container downloads the required resources and serves the model behind an API endpoint. The following message indicates a successful startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Once you see this message, you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:
curl -X GET 'http://0.0.0.0:8000/v1/models'
Tip
Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example: curl -s http://0.0.0.0:8000/v1/models | jq.
This command should produce output similar to the following:
{
"object": "list",
"data": [
{
"id": "meta/llama-3.2-11b-vision-instruct",
"object": "model",
"created": 1724796510,
"owned_by": "system",
"root": "meta/llama-3.2-11b-vision-instruct",
"parent": null,
"max_model_len": 131072,
"permission": [
{
"id": "modelperm-c2e069f426cc43088eb408f388578289",
"object": "model_permission",
"created": 1724796510,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
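If you prefer to read the model name programmatically instead of copying it from the JSON above, a minimal Python sketch using only the standard library (the endpoint and response shape are exactly as shown above) is:

# Fetch the model list and extract the id of the first served model,
# which can be reused as the "model" field in the requests below.
import json
import urllib.request

with urllib.request.urlopen("http://0.0.0.0:8000/v1/models") as resp:
    models = json.load(resp)

model_id = models["data"][0]["id"]
print(model_id)  # e.g. meta/llama-3.2-11b-vision-instruct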
To check the readiness of the service, run the following command, which responds with 200 when the server is ready to accept requests:
curl -X GET 'http://0.0.0.0:8000/v1/health/ready'
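If you script your deployment, you may want to wait for this endpoint before sending requests. The following is a minimal Python sketch that polls the readiness endpoint until it returns 200 (the 5-minute timeout and 5-second interval are arbitrary illustrative values):

# Poll the readiness endpoint until the server reports ready or a timeout is hit.
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://0.0.0.0:8000/v1/health/ready", timeout_s=300):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # server not reachable yet
        time.sleep(5)
    return False

print("ready" if wait_until_ready() else "timed out")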
OpenAI Chat Completion Request#
The Chat Completions endpoint is typically used with chat or instruct tuned models designed for a conversational approach. With this endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"max_tokens": 256
}'
Alternatively, you can use the OpenAI Python SDK. Install it with:
pip install -U openai
Run the client and query the chat completion API:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
max_tokens=256,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
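To receive the response incrementally instead, set stream=True, as noted earlier. A sketch of the streaming variant with the same client, model, and message structure (the chunk handling follows the OpenAI SDK streaming interface):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# The same user message with an image URL as in the example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                },
            },
        ],
    }
]

# Stream the completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()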
Passing Images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert images to base64, you can use the base64 command, or in Python:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
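Putting this together, the following sketch encodes a local image (the file name image.png is illustrative) and sends it to the Chat Completions endpoint as a base64 data URL, using the OpenAI client shown earlier:

import base64

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Encode a local image as a base64 data URL (illustrative file name).
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=256,
)
print(chat_response.choices[0].message.content)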
Text-only support
Some clients may not support this vision extension of the chat API. NIM for VLMs exposes a way to send your images using the text-only fields, using HTML <img> tags (ensure that you correctly escape quotes):
{
"role": "user",
"content": "What is in this image? <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}
This is also compatible with the base64 representation.
{
"role": "user",
"content": "What is in this image? <img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
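The same plain-string form can also be sent from Python; for example, a brief sketch with the OpenAI client configured as in the earlier examples, passing the content as a single string:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# The image is embedded in the text content via an HTML <img> tag.
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": 'What is in this image? <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" />',
        }
    ],
    max_tokens=256,
)
print(chat_response.choices[0].message.content)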
Text-only Queries#
Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix for details on text-only query support.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
],
"max_tokens": 256
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
chat_response = client.chat.completions.create(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
max_tokens=256,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.
Important
Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
],
"max_tokens": 256
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
]
chat_response = client.chat.completions.create(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
max_tokens=256,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
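In an application, you would typically build the messages list incrementally: append the assistant reply returned by each call, then append the next user turn and call the endpoint again. A minimal sketch of that pattern, reusing client, messages, and assistant_message from the example above (the follow-up question is illustrative):

# Continue the conversation by growing the message history turn by turn.
messages.append({"role": "assistant", "content": assistant_message.content})
messages.append({"role": "user", "content": "And what should I pack for a visit in that season?"})

follow_up = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
)
print(follow_up.choices[0].message.content)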
Using LangChain#
NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).
Install LangChain using the following command:
pip install -U langchain-openai langchain-core
Query the OpenAI Chat Completions endpoint using LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
model = ChatOpenAI(
model="meta/llama-3.2-11b-vision-instruct",
openai_api_base="http://0.0.0.0:8000/v1",
openai_api_key="not-needed"
)
message = HumanMessage(
content=[
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
},
],
)
print(model.invoke([message]))
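LangChain chat models also support streaming through the standard Runnable interface; a brief sketch reusing the model and message objects defined above:

# Stream the response instead of waiting for the full message.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()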
Llama Stack Chat Completion Request#
NIM for VLMs additionally supports the Llama Stack Client inference API for Llama VLMs, such as meta/llama-3.2-11b-vision-instruct. With the Llama Stack API, developers can easily integrate Llama VLMs into their applications. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "user",
"content": [
{
"image":
{
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
}
]
}'
Alternatively, you can use the Llama Stack Client Python library. Install it with:
pip install llama-stack-client==0.0.50
Important
The examples below assume llama-stack-client version 0.0.50. Modify the requests accordingly if you choose to install a newer version.
Run the client and query the chat completion API:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://0.0.0.0:8000")
messages = [
{
"role": "user",
"content": [
{
"image": {
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
}
]
iterator = client.inference.chat_completion(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
Passing Images#
NIM for VLMs follows the Llama Stack specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
"image": {
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send them in your payload.
{
"image": {
"uri": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
Text-only support
As in the OpenAI API case, NIM for VLMs exposes a way to send your images using the text-only fields, using HTML <img>
tags:
{
"role": "user",
"content": "What is in this image?<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}
This is also compatible with the base64 representation.
{
"role": "user",
"content": "What is in this image?<img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
Text-only Queries#
Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix for details on text-only query support.
curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
}'
Or using the Llama Stack Client:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://0.0.0.0:8000")
messages = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
iterator = client.inference.chat_completion(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversation with repeated interactions between a user and the model.
Important
Multi-turn capability is not available for all VLMs. Refer to the model cards for details on multi-turn conversation support.
curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-11b-vision-instruct",
"messages": [
{
"role": "user",
"content": [
{
"image":
{
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ...",
"stop_reason": "end_of_turn"
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
]
}'
Or using the Llama Stack Client:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://0.0.0.0:8000")
messages = [
{
"role": "user",
"content": [
{
"image": {
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ...",
"stop_reason": "end_of_turn"
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
]
iterator = client.inference.chat_completion(
model="meta/llama-3.2-11b-vision-instruct",
messages=messages,
stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
Stopping the Container#
If a Docker container is launched with the --name command-line option, you can stop the running container using the following command.
# In the previous sections, the environment variable CONTAINER_NAME was
# defined using `export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct`
docker stop $CONTAINER_NAME
Use docker kill if stop is not responsive. Follow it with docker rm $CONTAINER_NAME if you do not intend to restart this container as-is (using docker start $CONTAINER_NAME); in that case, you will need to reuse the docker run instructions from the top of this section to start a new container for your NIM.
If you did not start a container with --name, look at the output of docker ps to get a container ID for the image you used.
Serving models from local assets#
NIM for VLMs provides utilities that enable downloading models to a local directory, either as a model repository or to the NIM cache. See the Utilities section for details.
Use the previous commands to launch a NIM container. From there, you can view and download models locally.
Use the list-model-profiles command to list the available profiles.
You can download any of the profiles to the NIM cache using the download-to-cache command. For example:
download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b
You can also let download-to-cache select the most suitable profile for your hardware by not specifying a profile, as shown in the following example.
download-to-cache
Air Gap Deployment (offline cache route)#
NIM supports serving models in an air-gapped system (also known as an air wall, air gap, or disconnected network).
If NIM detects a previously loaded profile in the cache, it serves that profile from the cache.
After downloading the profiles to cache using download-to-cache, the cache can be transferred to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.
To see this in action, do NOT provide the NGC_API_KEY, as shown in the following example.
# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"
# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"
# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Air Gap Deployment (local model directory route)#
Another option for air-gapped deployment is to use the create-model-store command within the NIM container to create a repository for a single model and then serve it from a local directory, as shown in the following example.
create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
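Once the container reports that startup is complete, you can query the locally served model by the name you set in NIM_SERVED_MODEL_NAME (my-model in this example). A minimal sketch with the OpenAI client, assuming the stored model accepts text-only prompts:

from openai import OpenAI

# Query the model under the name configured via NIM_SERVED_MODEL_NAME.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
chat_response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello! Which model are you serving?"}],
    max_tokens=64,
)
print(chat_response.choices[0].message.content)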