Getting Started#

Prerequisites#

Setup#

  • NVIDIA AI Enterprise License: NVIDIA NIM for VLMs is available for self-hosting under the NVIDIA AI Enterprise (NVAIE) License.

  • NVIDIA GPU(s): NVIDIA NIM for VLMs (NIM for VLMs) runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. Homogeneous multi-GPU systems with tensor parallelism enabled are also supported. See the Support Matrix for more information.

  • CPU: x86_64 architecture only for this release

  • OS: any Linux distribution supported by the NVIDIA Container Toolkit.

  • CUDA Drivers: Follow the installation guide.

    We recommend:

| Major Version | EOL | Data Center & RTX/Quadro GPUs | GeForce GPUs |
|---------------|-----------|-------------------------------|--------------|
| > 550         | TBD       | X                             | X            |
| 550           | Feb 2025  | X                             | X            |
| 545           | Oct 2023  | X                             | X            |
| 535           | June 2026 | X                             |              |
| 525           | Nov 2023  | X                             |              |
| 470           | Sept 2024 | X                             |              |

  1. Install Docker.

  2. Install the NVIDIA Container Toolkit.

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, allowing you to confirm the CUDA driver version and available GPUs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |   78489MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Installing WSL2 for Windows#

Certain downloadable NIMs can be used on an RTX Windows system with Windows Subsystem for Linux (WSL). To enable WSL2, perform the following steps.

  1. Be sure your computer can run WSL2 as described in the Prerequisites section of the WSL2 documentation.

  2. Enable WSL2 on your Windows computer by following the steps in Install WSL command. By default, these steps install the Ubuntu distribution of Linux. For alternative installations, see Change the default Linux distribution installed.

Launch NVIDIA NIM for VLMs#

You can download and run the NIM of your choice from either the API catalog or NGC.

From NGC#

Generate an API key#

An NGC API key is required to access NGC resources. The key can be generated here: https://org.ngc.nvidia.com/setup/personal-keys.

When creating an NGC API key, ensure that at least NGC Catalog is selected from the Services Included dropdown. If this key is to be reused for other purposes, more services can be included.

Generate Personal Key

Export the API key#

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you are not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:

export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Other more secure options include saving the value in a file, which you can retrieve with cat $NGC_API_KEY_FILE, or using a password manager.

Docker Login to NGC#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry using the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name indicating that you will authenticate with an API key, not a username and password.

List Available NIMs#

This documentation uses the NGC CLI tool in several examples. For information on downloading and configuring the tool, see the NGC CLI documentation.

Use the following command to list the available NIMs in CSV format.

ngc registry image list --format_type csv 'nvcr.io/nim/meta/*vision*'

This command should produce output in the following format:

Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
<name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
...
<nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>

Use the Repository and Latest Tag fields when you call the docker run command, as shown in the following section.

Launch NIM#

The following command launches a Docker container for the meta/llama-3.2-11b-vision-instruct model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct

# The repository and latest tag from the previous ngc registry image list command
Repository=llama-3.2-11b-vision-instruct
Latest_Tag=1.1.0

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Docker Run Parameters#

| Flags | Description |
|-------|-------------|
| -it | --interactive + --tty (see the Docker docs) |
| --rm | Delete the container after it stops (see the Docker docs) |
| --name=meta-llama-3-2-11b-vision-instruct | Give the NIM container a name for bookkeeping (here meta-llama-3-2-11b-vision-instruct). Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus all | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs. |
| --shm-size=16GB | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled. |
| -e NGC_API_KEY | Provide the container with the token necessary to download the appropriate models and resources from NGC. See Export the API key. |
| -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" | Mount a cache directory from your system (~/.cache/nim here) inside the container (default /opt/nim/.cache) so that downloaded models and artifacts can be reused by subsequent runs. |
| -u $(id -u) | Use the same user inside the container as your host user to avoid permission mismatches when downloading models into your local cache directory. |
| -p 8000:8000 | Forward the port where the NIM server is published inside the container so it can be reached from the host. The left-hand side of the colon (:) is the host port (8000 here); the right-hand side is the container port where the NIM server is published (default 8000). |
| $IMG_NAME | Name and version of the VLM NIM container from NGC. The VLM NIM server starts automatically if no argument is provided after this. |

Note

See the Configuring a NIM topic for information about additional configuration settings.

Note

If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call.

Note

NIM automatically selects the most suitable profile based on your system specifications. For details, see Automatic Profile Selection.

Run Inference#

During startup, the NIM container downloads the required resources and serves the model behind an API endpoint. The following message indicates a successful startup.

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message, you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:

curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip

Pipe the results of curl commands into a tool like jq or python -m json.tool to make the output of the API easier to read. For example: curl -s http://0.0.0.0:8000/v1/models | jq.

This command should produce output similar to the following:

{
  "object": "list",
  "data": [
    {
      "id": "meta/llama-3.2-11b-vision-instruct",
      "object": "model",
      "created": 1724796510,
      "owned_by": "system",
      "root": "meta/llama-3.2-11b-vision-instruct",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-c2e069f426cc43088eb408f388578289",
          "object": "model_permission",
          "created": 1724796510,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

To check the readiness of the service:

curl -X GET 'http://0.0.0.0:8000/v1/health/ready'

This endpoint returns a 200 status code when the server is ready to accept requests.
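
If you script your deployment, you can poll this endpoint before sending inference requests. The following is a minimal sketch that assumes the requests package is installed (an assumption; any HTTP client works):

import time
import requests

# Poll the readiness endpoint until the NIM server reports ready (HTTP 200)
url = "http://0.0.0.0:8000/v1/health/ready"
for _ in range(120):
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("NIM is ready")
            break
    except requests.RequestException:
        pass  # server not reachable yet
    time.sleep(5)
else:
    raise RuntimeError("NIM did not become ready in time")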

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat or instruct tuned models designed for a conversational approach. With the endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from command line:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 256
    }'

Alternatively, you can use the OpenAI Python SDK:

pip install -U openai

Run the client and query the chat completion API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
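
As noted above, you can stream the response instead of waiting for the full completion. A minimal sketch with the OpenAI SDK, reusing the client and messages objects from the previous example (set stream=True and iterate over the returned chunks):

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=True
)
for chunk in stream:
    # Each chunk carries an incremental delta; content may be None on some chunks
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()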

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert an image to base64, you can use the base64 command-line tool, or in Python:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
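
The encoded string can then be placed in the image_url entry of a user message as a data URL. A minimal sketch, reusing image_b64 from the snippet above and assuming a PNG file:

# Build the image part of the message content from the encoded bytes
image_entry = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{image_b64}"}
}

This dictionary can be used in place of the public-URL entry shown earlier.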

Text-only support

Some clients may not support this vision extension of the chat API. NIM for VLMs also lets you send images through the text-only content field by embedding HTML <img> tags (ensure that you escape the quotes correctly):

{
    "role": "user",
    "content": "What is in this image? <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}

This is also compatible with the base64 representation.

{
    "role": "user",
    "content": "What is in this image? <img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}
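
When you build these payloads programmatically, letting a JSON serializer handle the quoting avoids escaping mistakes. A minimal Python sketch (illustrative only, using the example image URL from this page):

import json

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# json.dumps escapes the inner quotes of the <img> tag for you
payload = json.dumps({
    "model": "meta/llama-3.2-11b-vision-instruct",
    "messages": [
        {
            "role": "user",
            "content": f'What is in this image? <img src="{image_url}" />'
        }
    ],
    "max_tokens": 256
})
print(payload)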

Text-only Queries#

Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Please refer to the model cards in Support Matrix for support on text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 256
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ..."
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ],
        "max_tokens": 256
    }'
Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]
chat_response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Using LangChain#

NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="meta/llama-3.2-11b-vision-instruct",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
        },
    ],
)

print(model.invoke([message]))

Llama Stack Chat Completion Request#

NIM for VLMs additionally supports the Llama Stack Client inference API for Llama VLMs, such as meta/llama-3.2-11b-vision-instruct. With the Llama Stack API, developers can easily integrate Llama VLMs into their applications. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

For example, for a meta/llama-3.2-11b-vision-instruct model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            }
        ]
    }'

Alternatively, you can use the Llama Stack Client Python library:

pip install llama-stack-client==0.0.50

Important

The examples below assume llama-stack-client version 0.0.50. Modify the requests accordingly if you choose to install a newer version.

Run the client and query the chat completion API:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)

Passing Images#

NIM for VLMs follows the Llama Stack specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "image": {
        "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send them in your payload.

{
    "image": {
        "uri": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

Text-only support

As with the OpenAI API, NIM for VLMs lets you send images through the text-only content field by embedding HTML <img> tags:

{
    "role": "user",
    "content": "What is in this image?<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\" />"
}

This is also compatible with the base64 representation.

{
    "role": "user",
    "content": "What is in this image?<img src=\"data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ==\" />"
}

Text-only Queries#

Many VLMs such as meta/llama-3.2-11b-vision-instruct support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Please consult model cards in Support Matrix for support on text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ]
    }'
Or using the Llama Stack Client Python library:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversation with repeated interactions between a user and the model.

Important

Multi-turn capability is not available for all VLMs. Please consult model cards for support on multi-turn conversation.

curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.2-11b-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ...",
                "stop_reason": "end_of_turn"
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ]
    }'
Or using the Llama Stack Client Python library:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ...",
        "stop_reason": "end_of_turn"
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]

iterator = client.inference.chat_completion(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)

Stopping the Container#

If a Docker container is launched with the --name command line option, you can stop the running container using the following command.

# In the previous sections, the environment variable CONTAINER_NAME was
# defined using `export CONTAINER_NAME=meta-llama-3-2-11b-vision-instruct`
docker stop $CONTAINER_NAME

Use docker kill if docker stop is not responsive, and follow it with docker rm $CONTAINER_NAME if you do not intend to restart this container as-is (using docker start $CONTAINER_NAME). If you remove the container, you will need to reuse the docker run ... instructions from the top of this section to start a new container for your NIM.

If you did not start a container with --name, look at the output of docker ps to get a container ID for the image you used.

Serving models from local assets#

NIM for VLMs provides utilities that let you download models to a local directory, either as a model repository or into the NIM cache. See the Utilities section for details.

Use the previous commands to launch a NIM container. From there, you can view and download models locally.

Use the list-model-profiles command to list the available profiles.

You can download any of the profiles to the NIM cache using the download-to-cache command. For example:

download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b

You can also let download-to-cache choose the most suitable profile for your hardware by not specifying a profile, as shown in the following example.

download-to-cache

Air Gap Deployment (offline cache route)#

NIM supports serving models in an Air Gap system (also known as air wall, air-gapping or disconnected network). If NIM detects a previously loaded profile in the cache, it serves that profile from the cache. After downloading the profiles to cache using download-to-cache, the cache can be transferred to an air-gapped system to run a NIM without any internet connection and with no connection to the NGC registry.

To see this in action, do NOT provide the NGC_API_KEY, as shown in the following example.

# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"

# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"

# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME

# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
-v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME

Air Gap Deployment (local model directory route)#

Another option for air-gapped deployment is to create a model repository for a single model using the create-model-store command inside the NIM container, and then deploy from that repository, as shown in the following example.

create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
Then launch the NIM container with the model repository mounted, as in the following example:

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model

docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NIM_MODEL_NAME=/model-repo \
-e NIM_SERVED_MODEL_NAME \
-v $MODEL_REPO:/model-repo \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME