Query the Nemotron Nano 12B v2 VL API#

For more information on this model, see the model card on build.nvidia.com.

Launch NIM#

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-nemotron-nano-12b-v2-vl"

# The repository name and tag from the ngc registry image list command
Repository="nemotron-nano-12b-v2-vl"
Latest_Tag="1.5.0"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME

Note

The -u $(id -u) option in the docker run command ensures that the UID inside the spawned container matches the UID of your user on the host. This is recommended when the $LOCAL_NIM_CACHE path on the host has permissions that prevent other users from writing to it.

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruct-tuned models designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which provides a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

Note

Most of the snippets below set a max_tokens value. This is mainly for illustration, where the output is unlikely to be much longer, and it ensures that generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, this upper bound is raised compared to the non-reasoning examples.

For example, for the nvidia/nemotron-nano-12b-v2-vl model, you can provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-nano-12b-v2-vl",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 1024
    }'

You can add "stream": true to the request body above to stream the response.

Alternatively, you can use the OpenAI Python SDK. Install it with the following command:

pip install -U openai

Run the client and query the chat completion API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=1024,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

The above code snippet can be adapted to handle streaming responses as follows:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=1024,
    # Take note of this param.
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Passing images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

Supported image formats are JPG, JPEG, and PNG.

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert images to base64, you can use the base64 command-line utility, or in Python:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
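
Once encoded, embed the string in a data URL and pass it like any other image_url entry. The following is a minimal, self-contained sketch assuming a local PNG file named image.png and the same NIM endpoint used in the earlier SDK examples:

import base64

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    # A data URL built from the base64 string (PNG assumed).
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(chat_response.choices[0].message.content)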

Passing videos#

NIM for VLMs extends the OpenAI specification to pass videos as part of the HTTP payload in a user message.

The format is similar to the one described in the previous section, replacing image_url with video_url. Refer to Supported codecs and video formats for the supported file formats.

Public direct URL

Passing the direct URL of a video will cause the container to download that video at runtime.

{
    "type": "video_url",
    "video_url": {
        "url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    }
}

Base64 data

Another option, useful for videos not already on the web, is to first base64-encode the video bytes and send that in your payload.

{
    "type": "video_url",
    "video_url": {
        "url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert videos to base64, you can use the base64 command-line utility, or in Python:

import base64

with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()

Efficient Video Sampling (EVS)

EVS enables pruning of video tokens using model-provided embeddings and positions to reduce compute and latency for video requests. Control EVS using the NIM_VIDEO_PRUNING_RATE environment variable with a value between 0 and 1. The value represents the fraction of video tokens removed. The default value of 0.0 disables EVS. Higher values prune more tokens, improving throughput and latency with potential accuracy trade-offs. EVS applies to video inputs only. For this model, NIM_VIDEO_PRUNING_RATE=0.75 has been shown to have minimal accuracy impact while providing a significant performance boost.

Sampling and preprocessing parameters#

NIM for VLMs extends the OpenAI API to give you finer control over sampling and preprocessing of images and videos at request time.

Video sampling

To control how frames are sampled from video inputs, sampling parameters are exposed using the top-level media_io_kwargs API field.

Either fps or num_frames can be specified (but not both at the same time).

"media_io_kwargs": {"video": { "fps": 3.0 }}

or

"media_io_kwargs": {"video": { "num_frames": 16 }}

As a general guideline, sampling more frames can result in better accuracy but hurts performance. The default sampling rate is 4.0 FPS, matching the training data.

Note

Specifying an fps or num_frames value higher than the actual frame rate or frame count of a given video results in a 400 error code.

Note

The value of fps or num_frames directly determines the temporal resolution of the model's outputs. For example, at 2 FPS, timestamp precision in the generated output is at best within +/- 0.25 seconds of the true values.

Note

The model supports a maximum of 128 frames per video, depending on the video resolution.

Image and video tiles

To control the maximum number of tiles for image and video processing, use the mm_processor_kwargs API field.

"mm_processor_kwargs": {"max_num_tiles": 3}

This parameter controls how images and videos are split into tiles for processing. Higher values allow processing of higher-resolution media at the cost of increased computation.

OpenAI Python SDK

Use the extra_body parameter to pass these parameters in the OpenAI Python SDK.

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this video?"
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=1024,
    stream=False,
    extra_body={
        # Alternatively, this can be:
        # "media_io_kwargs": {"video": {"num_frames": some_int}},
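        # The tiling parameter described above can be passed here as well, for example:
        # "mm_processor_kwargs": {"max_num_tiles": 3},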
        "media_io_kwargs": {"video": {"fps": 1.0}},
    }
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Summary table

The table below summarizes the parameters detailed above:

Preprocessing and sampling parameters#

Video FPS sampling
  Example: "media_io_kwargs": {"video": {"fps": 3.0}}
  Default: 4.0
  Notes:
  • Higher FPS can enhance accuracy at the cost of more computation.
  • Mutually exclusive with an explicit number of frames.

Sampling N frames in a video
  Example: "media_io_kwargs": {"video": {"num_frames": 16}}
  Default: N/A
  Notes:
  • A higher number of frames can enhance accuracy at the cost of more computation.
  • Mutually exclusive with explicit FPS.

Maximum number of tiles
  Example: "mm_processor_kwargs": {"max_num_tiles": 3}
  Default: N/A
  Notes:
  • Controls how images and videos are split into tiles.
  • Higher values allow higher-resolution media processing at the cost of more computation.

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).
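
The following is a minimal sketch of a tool-calling request using the OpenAI Python SDK and the standard OpenAI tools schema. The get_current_weather tool is a hypothetical example; refer to Call Functions (Tools) for the full set of supported options.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Hypothetical tool definition following the standard OpenAI function schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    max_tokens=1024,
)

# If the model decides to call a tool, the call appears in tool_calls;
# otherwise the reply is regular text content.
message = chat_response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(tool_call.function.name, tool_call.function.arguments)
else:
    print(message.content)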

Reasoning#

This model supports reasoning with text and image inputs. To turn on reasoning, add /think to the system prompt. For video inputs, reasoning is not supported.

Example with reasoning turned on:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-nano-12b-v2-vl",
        "messages": [
            {
                "role": "system",
                "content": "/think"
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 4096
    }'
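
The same request can be made with the OpenAI Python SDK. The sketch below mirrors the curl example above, turning reasoning on through the /think system prompt:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    # Adding /think to the system prompt turns reasoning on.
    {"role": "system", "content": "/think"},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                },
            },
        ],
    },
]

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    # Reasoning typically produces more output tokens, so the bound is raised.
    max_tokens=4096,
)
print(chat_response.choices[0].message)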

Text-only Queries#

Many VLMs such as nvidia/nemotron-nano-12b-v2-vl support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Please refer to the model cards in Support Matrix for support on text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-nano-12b-v2-vl",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-nano-12b-v2-vl",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ..."
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ],
        "max_tokens": 4096
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
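
To continue the conversation programmatically, append the model's actual reply and the next user turn to messages, then send another request. The follow-up question below is only illustrative; the sketch reuses the client, messages, and assistant_message objects from the snippet above.

# Code preceding this point is the same as the previous snippet.
# Append the assistant's actual reply so the model sees the full history.
messages.append({"role": "assistant", "content": assistant_message.content})
# Illustrative follow-up question.
messages.append({"role": "user", "content": "How long would it take to walk the boardwalk?"})

follow_up_response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
print(follow_up_response.choices[0].message.content)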

Using LangChain#

NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="nvidia/nemotron-nano-12b-v2-vl",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
        },
    ],
)

print(model.invoke([message]))
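
The model object also supports streaming through LangChain's standard runnable interface. A brief sketch, reusing the model and message objects defined above:

# Stream the response chunk by chunk instead of waiting for the full reply.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()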

Batched Inference#

For improved throughput when processing multiple requests, you can use asynchronous batching to send concurrent requests to the NIM endpoint.

Install the OpenAI Python SDK (the asynchronous client is included):

pip install -U openai

The following example demonstrates sending multiple concurrent requests with the same video:

import asyncio
from openai import AsyncOpenAI

async def process_video_request(client, video_url, prompt, request_id):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}}
            ]
        }
    ]

    response = await client.chat.completions.create(
        model="nvidia/nemotron-nano-12b-v2-vl",
        messages=messages,
        max_tokens=1024,
    )

    return request_id, response.choices[0].message.content

async def batch_process():
    client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

    video_url = "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    prompts = [
        "What is happening in this video?",
        "Describe the main objects in this video.",
    ]

    tasks = [
        process_video_request(client, video_url, prompt, i)
        for i, prompt in enumerate(prompts)
    ]

    results = await asyncio.gather(*tasks)

    for request_id, content in results:
        print(f"Request {request_id}: {content}")

asyncio.run(batch_process())