Query the Qwen3.5-397B-A17B API#

This page shows how to launch the NIM container and call the Chat Completions API with curl, the OpenAI Python SDK, and LangChain. It covers image inputs, text-only queries, multi-turn conversations, and function calling.

For more information on this model, refer to the model card on build.nvidia.com.

Launch NIM#

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=qwen-qwen3.5-397b-a17b

# The container name from the previous ngc registry image list command
Repository="qwen3.5-397b-a17b"
Latest_Tag="1.7.0-variant"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/qwen/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME

This NIM is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

OpenAI Chat Completions Request#

The Chat Completions endpoint is typically used with chat or instruct-tuned models designed for a conversational approach. With the endpoint, prompts are sent as messages with roles and content, providing a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Note

The snippets below use max_tokens to keep examples short. For reasoning examples, the max_tokens value is higher because reasoning output is typically longer.

For example, provide the URL of an image and query the NIM server:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "qwen/qwen3.5-397b-a17b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 1024
    }'

You can include "stream": true in the request body above for streaming responses.

Alternatively, you can use the OpenAI Python SDK. Install it with the following command:

pip install -U openai

Run the client and query the Chat Completions API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=messages,
    max_tokens=1024,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

The above code snippet can be adapted to handle streaming responses as follows:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=messages,
    max_tokens=1024,
    # Take note of this param.
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()
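
If you need the complete reply as a single string once streaming finishes, the loop above can be wrapped in a small helper. The following is a minimal sketch: `collect_stream` is a hypothetical name, not part of the SDK, and the guard for chunks with an empty `choices` list is a defensive assumption rather than documented NIM behavior.

```python
def collect_stream(stream, echo=True):
    """Accumulate the text deltas of a streamed chat completion.

    `stream` is any iterable of chunks shaped like those returned by
    client.chat.completions.create(..., stream=True).
    """
    parts = []
    for chunk in stream:
        # Defensive assumption: skip chunks that carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta and delta.content:
            parts.append(delta.content)
            if echo:
                # Print without a newline so the output updates as it streams in.
                print(delta.content, end="", flush=True)
    if echo:
        print()  # Final newline.
    return "".join(parts)
```

Call it as `full_text = collect_stream(stream)` in place of the `for` loop.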

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

The supported image formats are GIF, JPG, JPEG, and PNG.

To adjust the maximum number of images allowed per request, set the environment variable NIM_MAX_IMAGES_PER_PROMPT. The default value is 5.
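
A client can check a request against this limit before sending it. The sketch below counts `image_url` parts across all messages in a request. `count_images` and `MAX_IMAGES` are hypothetical names, and `MAX_IMAGES` has to be kept in sync by hand with the value configured on the server.

```python
MAX_IMAGES = 5  # Assumed to mirror NIM_MAX_IMAGES_PER_PROMPT on the server.


def count_images(messages):
    """Count image_url content parts across all messages in a request."""
    total = 0
    for message in messages:
        content = message.get("content")
        # Plain-string content carries no images; only list content can.
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these images."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.png"}},
        ],
    },
]

if count_images(example_messages) > MAX_IMAGES:
    raise ValueError("Too many images for a single request")
```

The `example.com` URLs are placeholders for illustration only.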

Public direct URL

Passing the direct URL of an image will cause the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert images to base64, you can use the base64 command or the following Python code:

import base64

# Read the image as raw bytes and encode it as a base64 string.
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
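
To turn the encoded bytes into the `image_url` part shown above, prefix them with a `data:` URL header that names the MIME type. The helper below is a sketch: `image_part_from_bytes` is a hypothetical name, and the caller is assumed to supply the correct MIME type (for example `image/png` or `image/jpeg`).

```python
import base64


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Build an OpenAI-style image_url content part from raw image bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# Placeholder bytes for illustration; in practice, read them from a file
# opened in "rb" mode.
part = image_part_from_bytes(b"Hello", mime_type="image/png")
print(part["image_url"]["url"])  # data:image/png;base64,SGVsbG8=
```

The returned dictionary can be placed directly in the `content` list of a user message.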

Passing Videos#

Important

Video input is not enabled by default. To enable, set NIM_MAX_VIDEOS_PER_PROMPT=1.

NIM for VLMs follows the OpenAI specification to pass videos as part of the HTTP payload in a user message.

The format is similar to the one described in the previous section, except that it replaces image_url with video_url.

Public direct URL

Passing the direct URL of a video will cause the container to download that video at runtime.

{
    "type": "video_url",
    "video_url": {
        "url": "https://download.samplelib.com/mp4/sample-5s.mp4"
    }
}

Base64 data

Another option, useful for videos not already on the web, is to first base64-encode the video bytes and send that in your payload.

{
    "type": "video_url",
    "video_url": {
        "url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert videos to base64, you can use the base64 command or the following Python code:

import base64

# Read the video as raw bytes and encode it as a base64 string.
with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()

Sampling and Preprocessing Parameters#

NIM extends the OpenAI API with additional request fields that control how images and videos are sampled and preprocessed at request time.

Video sampling

To control how frames are sampled from video inputs, sampling parameters are exposed using the top-level media_io_kwargs API field.

Specify either fps or num_frames. If you specify both, the model uses the option that results in the fewest frames.

"media_io_kwargs": {"video": { "fps": 3.0 }}

or

"media_io_kwargs": {"video": { "num_frames": 16 }}

As a general guideline, sampling more frames can result in better accuracy but hurts performance. The default sampling rate is 2.0 FPS, matching the recommended value.
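
As a rough worked example of the rule above, the function below estimates how many frames are sampled from a clip of known duration. It only illustrates the arithmetic: the function name is hypothetical, and the rounding (floor, with at least one frame) is an assumption rather than the documented behavior of the server.

```python
import math

DEFAULT_FPS = 2.0  # Default sampling rate stated above.


def estimated_frames(duration_s, fps=None, num_frames=None):
    """Estimate the number of sampled frames for a video of duration_s seconds."""
    candidates = []
    if fps is not None:
        candidates.append(max(1, math.floor(duration_s * fps)))
    if num_frames is not None:
        candidates.append(num_frames)
    if not candidates:
        # Neither option given: fall back to the default sampling rate.
        candidates.append(max(1, math.floor(duration_s * DEFAULT_FPS)))
    # When both are given, the option yielding fewer frames applies.
    return min(candidates)


print(estimated_frames(5.0))                        # 10
print(estimated_frames(5.0, fps=3.0))               # 15
print(estimated_frames(5.0, fps=3.0, num_frames=8)) # 8
```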

Note

Specifying values of fps or num_frames higher than the actual values for a given video results in an HTTP 400 error.
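
One way to avoid this error is to compare the requested values with the video's own frame rate and frame count before sending the request. The sketch below assumes you already know those two values (for example, from a media inspection tool). The function name and its parameters are hypothetical.

```python
def check_video_sampling(native_fps, total_frames, fps=None, num_frames=None):
    """Raise ValueError if the requested sampling exceeds what the video contains."""
    if fps is not None and fps > native_fps:
        raise ValueError(
            f"fps={fps} exceeds the video's frame rate ({native_fps})"
        )
    if num_frames is not None and num_frames > total_frames:
        raise ValueError(
            f"num_frames={num_frames} exceeds the video's frame count ({total_frames})"
        )


# A 5-second clip recorded at 30 FPS has 150 frames; sampling at 3 FPS is fine.
check_video_sampling(native_fps=30.0, total_frames=150, fps=3.0)
```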

Note

The value of fps or num_frames is directly correlated to the temporal resolution of the model’s outputs. For example, at 2 FPS, timestamp precision in the generated output will be at best within +/- 0.25 seconds of the true values.
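
The figure in this example follows from the spacing between sampled frames: at a sampling rate of f frames per second, consecutive frames are 1/f seconds apart, so an event can be located to within half that interval at best. A short sketch of the arithmetic (the function name is hypothetical):

```python
def best_case_timestamp_error(fps: float) -> float:
    """Half the interval between consecutive sampled frames, in seconds."""
    return 0.5 / fps


print(best_case_timestamp_error(2.0))  # 0.25
print(best_case_timestamp_error(4.0))  # 0.125
```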

Video pixel count

To balance accuracy and performance, you can specify shortest_edge and longest_edge to control frame size after preprocessing. These parameters specify the minimum and maximum number of pixels for the ensemble of sampled frames.

Set these values using the top-level mm_processor_kwargs field:

"mm_processor_kwargs": {"size": { "shortest_edge": 1568, "longest_edge": 262144 }}

Defaults are shortest_edge=65536 and longest_edge=16777216.

Each patch of 32x32x2=2048 pixels maps to a multimodal input token.

Image frame size

For image inputs, this can be specified the same way:

"mm_processor_kwargs": {"size": { "shortest_edge": 1568, "longest_edge": 262144 }}

Each patch of 32x32=1024 pixels, for each image, maps to a multimodal input token.
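
These ratios give a rough way to translate a pixel budget into a multimodal token budget. The sketch below applies them directly. It ignores any rounding the preprocessor performs when resizing and any special tokens added around the media, so treat the results as estimates. The function name is hypothetical.

```python
IMAGE_PIXELS_PER_TOKEN = 32 * 32      # 1024 pixels per token, per image
VIDEO_PIXELS_PER_TOKEN = 32 * 32 * 2  # 2048 pixels per token, across sampled frames


def estimated_media_tokens(num_pixels: int, is_video: bool = False) -> int:
    """Estimate the multimodal input tokens produced by num_pixels of media."""
    per_token = VIDEO_PIXELS_PER_TOKEN if is_video else IMAGE_PIXELS_PER_TOKEN
    return num_pixels // per_token


print(estimated_media_tokens(262144))                 # 256
print(estimated_media_tokens(262144, is_video=True))  # 128
print(estimated_media_tokens(16777216))               # 16384
```

With the example setting `longest_edge=262144`, an image therefore contributes at most about 256 tokens, compared with about 16384 at the default of 16777216.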

OpenAI Python SDK

Use the extra_body parameter to pass these parameters in the OpenAI Python SDK.

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://download.samplelib.com/mp4/sample-5s.mp4"
                }
            },
            {
                "type": "text",
                "text": "What is in this video?"
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model=model,
    messages=messages,
    max_tokens=1024,
    stream=False,
    extra_body={
        "mm_processor_kwargs": {"size": {"shortest_edge": 1568, "longest_edge": 262144}},
        # Alternatively, this can be:
        # "media_io_kwargs": {"video": {"num_frames": some_int}},
        "media_io_kwargs": {"video": {"fps": 1.0}},
    }
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
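
If you set these fields from several places in an application, a small helper can assemble the `extra_body` dictionary. This is a sketch: `build_extra_body` is a hypothetical helper, and rejecting `fps` together with `num_frames` is a client-side choice made here for clarity, not a server requirement.

```python
def build_extra_body(fps=None, num_frames=None, shortest_edge=None, longest_edge=None):
    """Assemble media_io_kwargs and mm_processor_kwargs for extra_body."""
    if fps is not None and num_frames is not None:
        raise ValueError("Specify either fps or num_frames, not both")

    extra_body = {}
    if fps is not None:
        extra_body["media_io_kwargs"] = {"video": {"fps": float(fps)}}
    elif num_frames is not None:
        extra_body["media_io_kwargs"] = {"video": {"num_frames": int(num_frames)}}

    size = {}
    if shortest_edge is not None:
        size["shortest_edge"] = int(shortest_edge)
    if longest_edge is not None:
        size["longest_edge"] = int(longest_edge)
    if size:
        extra_body["mm_processor_kwargs"] = {"size": size}
    return extra_body


extra_body = build_extra_body(fps=1.0, shortest_edge=1568, longest_edge=262144)
print(extra_body)
```

Pass the result as `extra_body=extra_body` in the call to `client.chat.completions.create`.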

Summary table

The table below summarizes the parameters detailed above:

Preprocessing and sampling params#
Name	Example	Default	Notes
Video FPS sampling	`"media_io_kwargs": {"video": {"fps": 3.0}}`	2.0 FPS	Higher FPS can enhance accuracy at the cost of more computation. Mutually exclusive with explicit number of frames.
Sampling N frames in a video	`"media_io_kwargs": {"video": {"num_frames": 16}}`	N/A (default is FPS sampling, not fixed number of frames)	A higher number of frames can enhance accuracy at the cost of more computation. Mutually exclusive with explicit FPS.
Min / max number of pixels for videos and images	`"mm_processor_kwargs": {"size": { "shortest_edge": 1568, "longest_edge": 262144 }}`	`shortest_edge=65536, longest_edge=16777216`	Higher resolutions can enhance accuracy at the cost of more computation.

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).

Reasoning#

This model supports reasoning, which is turned on by default. To turn off reasoning, add "chat_template_kwargs": { "enable_thinking": false } in the request body.

Example with reasoning turned off:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "qwen/qwen3.5-397b-a17b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                            }
                    }
                ]
            }
        ],
        "chat_template_kwargs": { "enable_thinking": false },
        "max_tokens": 4096
    }'

You can also omit the reasoning tokens from the response by setting "include_reasoning": false in the request body. The model will still reason internally. Setting "include_reasoning": false is not supported for streaming responses.
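
With the OpenAI Python SDK, these fields are not named arguments of `create`, so one option is to pass them through `extra_body`, as in the preprocessing example earlier on this page. The sketch below only builds the keyword arguments and does not contact the server. `reasoning_request_kwargs` is a hypothetical helper, and `client` in the usage note is assumed to be an `OpenAI` client pointed at the NIM server.

```python
def reasoning_request_kwargs(messages, model="qwen/qwen3.5-397b-a17b",
                             enable_thinking=True, include_reasoning=True,
                             stream=False, max_tokens=4096):
    """Build keyword arguments for client.chat.completions.create."""
    if stream and not include_reasoning:
        # Stated above: this combination is not supported.
        raise ValueError(
            "include_reasoning=False is not supported for streaming responses"
        )

    extra_body = {}
    if not enable_thinking:
        # Turns reasoning off entirely.
        extra_body["chat_template_kwargs"] = {"enable_thinking": False}
    if not include_reasoning:
        # The model still reasons internally; only the output is omitted.
        extra_body["include_reasoning"] = False

    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
        "extra_body": extra_body,
    }


kwargs = reasoning_request_kwargs(
    [{"role": "user", "content": "What is 17 * 23?"}],
    enable_thinking=False,
)
print(kwargs["extra_body"])
```

Send the request with `client.chat.completions.create(**kwargs)`.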

Text-only Queries#

Many VLMs such as qwen/qwen3.5-397b-a17b support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. To check whether a model supports text-only queries, refer to its model card in the Support Matrix.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "qwen/qwen3.5-397b-a17b",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096,
        "stream": true
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
stream = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=messages,
    max_tokens=4096,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "qwen/qwen3.5-397b-a17b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows an **NVIDIA DGX system**, which is NVIDIAs flagship line of AI supercomputers/servers. ..."
            },
            {
                "role": "user",
                "content": "When was this system released?"
            }
        ],
        "max_tokens": 4096
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows an **NVIDIA DGX system**, which is NVIDIA's flagship line of AI supercomputers/servers. ..."
    },
    {
        "role": "user",
        "content": "When was this system released?"
    }
]
chat_response = client.chat.completions.create(
    model="qwen/qwen3.5-397b-a17b",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
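
To continue the conversation, append the model's reply and the next user message to the same list and send it again. A minimal sketch, where `extend_conversation` is a hypothetical helper and the reply text would normally come from `assistant_message.content`:

```python
def extend_conversation(messages, assistant_text, user_text):
    """Return a new message list with the assistant reply and next user turn appended."""
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": user_text},
    ]


history = [{"role": "user", "content": "What is in this image?"}]
history = extend_conversation(
    history,
    "This image shows an NVIDIA DGX system.",
    "When was this system released?",
)
print([m["role"] for m in history])  # ['user', 'assistant', 'user']
```

The helper returns a new list instead of mutating its argument, so earlier snapshots of the conversation stay unchanged.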

Using LangChain#

NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="qwen/qwen3.5-397b-a17b",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
        },
    ],
)

print(model.invoke([message]))