Getting Started#

Prerequisites#

Setup#

  • NVIDIA AI Enterprise License: NVIDIA NIM for VLMs is available for self-hosting under the NVIDIA AI Enterprise (NVAIE) License.

  • NVIDIA GPU(s): NVIDIA NIM for VLMs runs on NVIDIA GPUs with sufficient GPU memory; optimized configurations are available for certain GPUs. See the Support Matrix for more information.

  • CPU: x86_64 architecture only for this release

  • OS: any Linux distribution supported by the NVIDIA Container Toolkit

  • CUDA Drivers: Follow the installation guide.

    NVIDIA recommends:

| Major Version | EOL       | Data Center & RTX/Quadro GPUs | GeForce GPUs |
|---------------|-----------|-------------------------------|--------------|
| > 550         | TBD       | X                             | X            |
| 550           | Feb 2025  | X                             | X            |
| 545           | Oct 2023  | X                             | X            |
| 535           | June 2026 | X                             |              |
| 525           | Nov 2023  | X                             |              |
| 470           | Sept 2024 | X                             |              |
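
If a driver is already installed, you can confirm its version before continuing; this is a minimal check using nvidia-smi (the query flags assume a reasonably recent driver).

# Print the installed NVIDIA driver version for each GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader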

Prepare your system to run the NIM container by:

  1. Installing Docker.

  2. Installing the NVIDIA Container Toolkit.

After installing the toolkit, follow the instructions in the Configure Docker section in the NVIDIA Container Toolkit documentation.

To ensure that your setup is correct, run the following command:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

This command should produce output similar to the following, where you can confirm the CUDA driver version and available GPUs.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   36C    P0            112W /  700W |   78489MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Installing WSL2 for Windows#

Certain downloadable NIMs can be used on an RTX Windows system with Windows Subsystem for Linux (WSL). To enable WSL2, perform the following steps.

  1. Be sure your computer can run WSL2 as described in the Prerequisites section of the WSL2 documentation.

  2. Enable WSL2 on your Windows computer by following the steps in the Install WSL command documentation; a minimal example is shown after this list. By default, these steps install the Ubuntu distribution of Linux. For alternative installations, see Change the default Linux distribution installed.
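
A hedged, minimal way to enable WSL2 with the default Ubuntu distribution is the wsl --install command:

# Run from an elevated PowerShell or Command Prompt on Windows
wsl --install

A reboot is typically required after the installation completes.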

Launch NVIDIA NIM for VLMs#

You can download and run the NIM of your choice from either the API catalog or NGC.

From NGC#

Generate an API key#

An NGC API key is required to access NGC resources. The key can be generated from the Personal Keys page.

When creating an NGC API key, ensure that at least NGC Catalog is selected from the Services Included dropdown. Include other services to use the key for other purposes.

Generate Personal Key

Export the API key#

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:

export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Other, more secure options include saving the value in a file, which you can retrieve with cat $NGC_API_KEY_FILE, or using a password manager.
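
For example, a minimal sketch of the file-based approach (the file path below is only an illustration):

# Hypothetical location for the key file; choose any path you prefer
export NGC_API_KEY_FILE=~/.ngc/api_key
mkdir -p "$(dirname "$NGC_API_KEY_FILE")"

# Write the key once (replace <value>), then restrict permissions
printf '%s' '<value>' > "$NGC_API_KEY_FILE"
chmod 600 "$NGC_API_KEY_FILE"

# Load the key into the environment when needed
export NGC_API_KEY="$(cat "$NGC_API_KEY_FILE")"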

Docker Login to NGC#

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates you authenticate with an API key rather than a username and password.

List Available NIMs#

This documentation uses the NGC CLI tool in several examples. For information on downloading and configuring the tool, see the NGC CLI documentation.

Use the following command to list the available NIMs in CSV format.

ngc registry image list --format_type csv 'nvcr.io/nim/nvidia/vila*'

This command should produce output in the following format:

Name,Repository,Latest Tag,Image Size,Updated Date,Permission,Signed Tag?,Access Type,Associated Products
<name1>,<repository1>,<latest tag1>,<image size1>,<updated date1>,<permission1>,<signed tag?1>,<access type1>,<associated products1>
...
<nameN>,<repositoryN>,<latest tagN>,<image sizeN>,<updated dateN>,<permissionN>,<signed tag?N>,<access typeN>,<associated productsN>

Use the Repository and Latest Tag fields when you call the `docker run` command, as shown in the following section.

Launch NIM#

The following command launches a Docker container for the nvidia-vila model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

# Choose a container name for bookkeeping
export CONTAINER_NAME=nvidia-vila

# The repository and tag from the previous ngc registry image list command
Repository=vila-1.5-35b
Latest_Tag=latest

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Docker Run Parameters#

| Flags | Description |
|-------|-------------|
| `-it` | `--interactive` + `--tty` (see Docker docs) |
| `--rm` | Delete the container after it stops (see Docker docs) |
| `--name=nvidia-vila` | Give a name to the NIM container for bookkeeping (here `nvidia-vila`). Use any preferred value. |
| `--runtime=nvidia` | Ensure NVIDIA drivers are accessible in the container. |
| `--gpus all` | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs. |
| `--shm-size=16GB` | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled. |
| `-e NGC_API_KEY` | Provide the container with the token necessary to download the appropriate models and resources from NGC. See Export the API key. |
| `-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"` | Mount a cache directory from your system (`~/.cache/nim` here) inside the NIM container (defaults to `/opt/nim/.cache`), so downloaded models and artifacts can be reused by subsequent runs. |
| `-u $(id -u)` | Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models to your local cache directory. |
| `-p 8000:8000` | Forward the port where the NIM server is published inside the container so it can be reached from the host. The left-hand side of `:` is the host port (8000 here); the right-hand side is the container port where the NIM server is published (defaults to 8000). |
| `$IMG_NAME` | Name and version of the VLM NIM container from NGC. The VLM NIM server starts automatically if no argument is provided after this. |
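
As an illustration of mounting specific GPUs rather than all of them (see the configuration page for NIM-specific options), the launch command above could be modified to expose only one device. This variant is a sketch using standard Docker GPU selection syntax:

# Illustrative variant of the launch command that exposes only GPU 0
# (all other flags are the same as in the docker run command above)
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME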

Note

See the Configuring a NIM topic for information about additional configuration settings.

Note

If you have an issue with permission mismatches when downloading models in your local cache directory, add the `-u $(id -u)` option to the `docker run` call.

Note

NIM for VLMs automatically selects the most suitable profile based on your system specifications. For details, see Automatic Profile Selection.

Run Inference#

During startup, the NIM container downloads the required resources and serves the model behind an API endpoint. The following message indicates a successful startup.

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once you see this message, you can validate the deployment of NIM by executing an inference request. In a new terminal, run the following command to show a list of models available for inference:

curl -X GET 'http://0.0.0.0:8000/v1/models'

Tip

Pipe the results of `curl` commands into a tool like `jq` or `python -m json.tool` to make the output of the API easier to read. For example: `curl -s http://0.0.0.0:8000/v1/models | jq`.

This command should produce output similar to the following:

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/vila",
      "object": "model",
      "created": 1727381869,
      "owned_by": "system",
      "root": "nvidia/vila",
      "parent": null,
      "max_model_len": 2048,
      "permission": [
        {
          "id": "modelperm-1dbf0dcbb7014450839e464a6fb04038",
          "object": "model_permission",
          "created": 1727381869,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
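
You can perform the same check with the OpenAI Python client used later in this guide; this is a small sketch (the API key value is not checked by a locally hosted NIM):

from openai import OpenAI

# Point the client at the local NIM server; the API key is not validated locally
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Print the IDs of the models served by this NIM
for model in client.models.list():
    print(model.id)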

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruct-tuned models designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which provides a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name according to your requirements. For example, for the `nvidia/vila` model, you might use the following command:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/vila",
        "messages": [
            {
                "role":"user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe the scene"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 256
    }'

You can also use the OpenAI Python API library:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe the scene"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/vila",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
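
To stream the result instead, set stream=True and iterate over the returned chunks. A minimal sketch reusing the client and messages defined above:

# Streaming variant: print tokens as they arrive
stream = client.chat.completions.create(
    model="nvidia/vila",
    messages=messages,
    max_tokens=256,
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()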

You may also provide an image as a base64 string:

import base64
import requests
from io import BytesIO

from PIL import Image
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
image = Image.open(requests.get(url, stream=True).raw)
buffer = BytesIO()
image.save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe the scene"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/vila",
    messages=messages,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="nvidia/vila",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the scene"},
        {
            "type": "image_url",
            "image_url": {"url": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"},
        },
    ],
)

print(model.invoke([message]))

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Public direct URL

Passing the direct URL of an image directs the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
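
If the image is a local file rather than a URL, you can build the data URL yourself. A brief sketch in Python, with a hypothetical file path:

import base64

# Hypothetical local file; replace with your own image path
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

image_url = f"data:image/jpeg;base64,{image_b64}"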

Text-only support

Some clients might not support this vision extension of the chat API. NIM for VLMs also accepts images in the text-only content field, embedded as HTML <img> tags:

{
    "role": "user",
    "content": "What is in this image? <img src=\"https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png\" />"
}

This is also compatible with the base64 representation.
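
For example, with the OpenAI Python client shown earlier, this becomes an ordinary text-only request; a sketch assuming the same client object:

# Text-only content with the image embedded as an HTML <img> tag
chat_response = client.chat.completions.create(
    model="nvidia/vila",
    messages=[
        {
            "role": "user",
            "content": "What is in this image? <img src=\"https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png\" />"
        }
    ],
    max_tokens=256
)
print(chat_response.choices[0].message)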

Stopping the Container#

If a Docker container is launched with the --name command line option, you can stop the running container using the following command.

# In the previous sections, the environment variable CONTAINER_NAME was
# defined using `export CONTAINER_NAME=nvidia-vila`
docker stop $CONTAINER_NAME

If docker stop is not responsive, use docker kill instead. Then run docker rm $CONTAINER_NAME if you do not intend to restart the container as-is (with docker start $CONTAINER_NAME); in that case, reuse the docker run ... instructions from the top of this section to start a new container for your NIM.
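
For reference, the corresponding commands are:

# If `docker stop` does not respond
docker kill $CONTAINER_NAME

# Remove the stopped container if you do not plan to restart it
docker rm $CONTAINER_NAME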

If you did not start a container with --name, look at the output of docker ps to get a container ID for the image you used.