Query the NVIDIA Nemotron 3 Nano Omni API#

For more information on this model, see the model card on build.nvidia.com.

Launch NIM#

Make sure you complete the steps in Get Started with NIM before you launch the NIM.

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-nemotron-3-nano-omni-30b-a3b-reasoning"

# The container name from the previous ngc registry image list command
Repository="nemotron-3-nano-omni-30b-a3b-reasoning"
Latest_Tag="1.7.0-variant"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Optionally make the cache world-writable to avoid permission issues if the container runs as a different user
chmod -R a+w "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME

This NIM is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat- or instruct-tuned models designed for conversational use. Prompts are sent as messages with roles and contents, which provides a natural way to track a multi-turn conversation. To stream the result, set "stream": true.

Important

Update the model name according to the model you are running.

Note

Most of the snippets below set a max_tokens value. This is mainly for illustration, to ensure that generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, the upper bound is raised relative to the non-reasoning examples.

For example, for a nvidia/nemotron-3-nano-omni-30b-a3b-reasoning model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 4096
    }'

You can include "stream": true in the request body above for streaming responses.
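If you call the HTTP endpoint directly rather than through the SDK, the streamed response arrives as OpenAI-style server-sent events: each chunk is a `data: {...}` line, terminated by `data: [DONE]`. A minimal line parser might look like the following sketch (field names follow the OpenAI streaming format):

```python
import json

def parse_sse_line(line: bytes):
    """Decode one server-sent-events line from a streaming chat completion.

    Returns the chunk as a dict, or None for blank keep-alive lines and
    the terminating [DONE] marker.
    """
    line = line.strip()
    if not line.startswith(b"data: "):
        return None
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        return None
    return json.loads(payload)
```

Feed each line of the response body through this function and read the incremental text from choices[0].delta.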

Alternatively, you can use the OpenAI Python SDK. Install it with:

pip install -U openai

Run the client and query the chat completion API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

The above code snippet can be adapted to handle streaming responses as follows:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=True,
)

printed_reasoning = False
printed_content = False
for chunk in stream:
    reasoning = getattr(chunk.choices[0].delta, "reasoning", None) or None
    content = getattr(chunk.choices[0].delta, "content", None) or None
    if reasoning is not None:
        if not printed_reasoning:
            printed_reasoning = True
            print("reasoning:", end="", flush=True)
        print(reasoning, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        print(content, end="", flush=True)
print()

Passing Images, Videos, and Audio#

This model accepts images, videos, and audio as input. Media is passed as part of the HTTP payload in a user message, following the OpenAI specification.

Supported formats#

Modality | Formats
---------|--------------------
Image    | GIF, JPG, JPEG, PNG
Video    | MP4
Audio    | WAV, MP3, FLAC

Public direct URL

Passing the direct URL of a media file will cause the container to download it at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
    }
}
{
    "type": "video_url",
    "video_url": {
        "url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    }
}

Refer to Supported Codecs and Video Formats for the supported file formats.

{
    "type": "audio_url",
    "audio_url": {
        "url": "https://sample-files.com/downloads/audio/wav/voice-sample.wav"
    }
}

Note

Some public URLs that are accessible from a browser may reject programmatic downloads (for example, based on the User-Agent header). If a URL request fails at runtime, use base64-encoded data instead.

Base64 data

For media not already on the web, base64-encode the bytes and send them inline.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
{
    "type": "video_url",
    "video_url": {
        "url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}
{
    "type": "audio_url",
    "audio_url": {
        "url": "data:audio/wav;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert a file to base64 in Python:

import base64
with open(file_path, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
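Building on that, a small helper can wrap the encoded bytes in the data URI expected by the *_url fields (a sketch; the MIME type must match the file):

```python
import base64

def to_data_uri(file_path: str, mime_type: str) -> str:
    """Read a local media file and return it as a base64 data URI."""
    with open(file_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime_type};base64,{b64}"

# The result drops into the same place a plain URL would go, for example:
# {"type": "image_url", "image_url": {"url": to_data_uri("photo.jpg", "image/jpeg")}}
```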

Full examples with the OpenAI Python SDK

Image input:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Video input:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video_url", "video_url": {"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Audio input:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe and summarize this audio."},
            {"type": "audio_url", "audio_url": {"url": "https://sample-files.com/downloads/audio/wav/voice-sample.wav"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

Audio-in Video#

The model accepts video files that contain an embedded audio track. To enable this feature, pass a video containing audio via the video_url field described above, and set "mm_processor_kwargs": {"use_audio_in_video": true} in the request body.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe what is happening in this video, including any speech."
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://assets.ngc.nvidia.com/products/api-catalog/active-speaker-detection/video_1.mp4"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 4096,
        "mm_processor_kwargs": {"use_audio_in_video": true}
    }'
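The same request can be built from Python without the SDK; a minimal standard-library sketch follows (the send is commented out because it requires a running NIM server):

```python
import json
import urllib.request

body = {
    "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this video, including any speech."},
                {"type": "video_url", "video_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/active-speaker-detection/video_1.mp4"}},
            ],
        }
    ],
    "max_tokens": 4096,
    # Enables use of the video's embedded audio track
    "mm_processor_kwargs": {"use_audio_in_video": True},
}

request = urllib.request.Request(
    "http://0.0.0.0:8000/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Accept": "application/json", "Content-Type": "application/json"},
)
# with urllib.request.urlopen(request) as resp:  # requires a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```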

Sampling and Preprocessing Parameters#

The OpenAI API is extended with request-time parameters that give finer control over sampling and preprocessing of images and videos.

Video sampling

To control how frames are sampled from video inputs, sampling parameters are exposed using the top-level media_io_kwargs API field.

Either fps or num_frames can be specified. If one is specified, set the other to -1. If both are specified and positive, the number of frames sampled will be the maximum of the two.

"media_io_kwargs": {"video": { "fps": 3.0, "num_frames": -1}}

or

"media_io_kwargs": {"video": { "num_frames": 16, "fps": -1 }}

As a general guideline, sampling more frames can improve accuracy but reduces performance. The default sampling rate is 2.0 FPS, matching the training data.
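The interplay between fps and num_frames can be illustrated with a small helper that mirrors the documented rule (an illustration only; the actual sampling happens server-side):

```python
def frames_sampled(duration_s: float, fps: float, num_frames: int) -> int:
    """Estimate how many frames the server samples from a video.

    Mirrors the documented rule: set the unused field to -1; if both
    fields are positive, the larger resulting frame count wins.
    """
    from_fps = int(duration_s * fps) if fps > 0 else 0
    from_count = num_frames if num_frames > 0 else 0
    return max(from_fps, from_count)

print(frames_sampled(10.0, 3.0, -1))   # fps only
print(frames_sampled(10.0, -1.0, 16))  # frame count only
print(frames_sampled(10.0, 3.0, 16))   # both positive: the larger wins
```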

Note

The value of fps or num_frames is directly correlated to the temporal resolution of the model’s outputs. For example, at 2 FPS, timestamp precision in the generated output will be at best within +/- 0.25 seconds of the true values.

OpenAI Python SDK

Use the extra_body parameter to pass these parameters in the OpenAI Python SDK.

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this video?"
            },
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
    extra_body={
        # Alternatively, this can be:
        # "media_io_kwargs": {"video": {"num_frames": some_int, "fps": -1}},
        "media_io_kwargs": {"video": {"fps": 1.0, "num_frames": -1}},
    }
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

Efficient Video Sampling (EVS)

EVS enables pruning of video tokens to reduce compute and latency for video requests. Control EVS using the NIM_VIDEO_PRUNING_RATE environment variable with a value between 0 and 1. The value represents the fraction of video tokens removed. The default value of 0.5 provides a good balance between accuracy and performance. Setting it to 0 disables EVS entirely. Higher values prune more tokens, improving throughput and latency with potential accuracy trade-offs. EVS applies to video inputs only.
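As a rough illustration of the pruning-rate arithmetic (the real token count depends on the video and the model's preprocessing):

```python
def video_tokens_after_evs(num_video_tokens: int, pruning_rate: float) -> int:
    """Rough illustration: EVS removes `pruning_rate` of the video tokens."""
    if not 0.0 <= pruning_rate <= 1.0:
        raise ValueError("NIM_VIDEO_PRUNING_RATE must be between 0 and 1")
    return num_video_tokens - int(num_video_tokens * pruning_rate)
```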

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).

Reasoning#

This model supports reasoning with text and vision inputs. Thinking is enabled by default. To disable it for a specific request, pass "chat_template_kwargs": {"enable_thinking": false} in the request body.

When reasoning is enabled, the response separates the reasoning trace from the final answer into two fields:

  • reasoning: the model’s internal chain-of-thought.

  • content: the final answer presented to the user.

Warning

If max_tokens is set too low, the model may exhaust the token budget during the reasoning phase, resulting in an empty content field. Use a sufficiently large value (for example, 4096 or higher) when reasoning is enabled.

Example with reasoning turned off:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 4096,
        "chat_template_kwargs": {"enable_thinking": false}
    }'

Reasoning Budget Control#

This model supports a thinking budget that limits the maximum number of tokens used for reasoning. You can control the reasoning budget by setting the thinking_token_budget parameter.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "user",
                "content": "What is the derivative of x^2?"
            }
        ],
        "max_tokens": 4096,
        "thinking_token_budget": 100
    }'

Text-only Queries#

Many VLMs, such as nvidia/nemotron-3-nano-omni-30b-a3b-reasoning, support text-only queries, in which the VLM behaves exactly like a text-only LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to check support for text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096
    }'

Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

Multi-turn Conversation#

Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.

Important

Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows a boardwalk in a field of tall grass. ..."
            },
            {
                "role": "user",
                "content": "What would be the best season to visit this place?"
            }
        ],
        "max_tokens": 4096
    }'
Or using the OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows a boardwalk in a field of tall grass. ..."
    },
    {
        "role": "user",
        "content": "What would be the best season to visit this place?"
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    stream=False
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

Using LangChain#

NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
        },
    ],
)

print(model.invoke([message]))

Batched Inference#

For improved throughput when processing multiple requests, you can use asynchronous batching to send concurrent requests to the NIM endpoint.

Install the async OpenAI client:

pip install -U openai

The following example demonstrates sending multiple concurrent requests with the same video:

import asyncio
from openai import AsyncOpenAI

async def process_video_request(client, video_url, prompt, request_id):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}}
            ]
        }
    ]

    response = await client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        messages=messages,
        max_tokens=4096,
    )

    msg = response.choices[0].message
    return request_id, msg.reasoning, msg.content

async def batch_process():
    client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

    video_url = "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    prompts = [
        "What is happening in this video?",
        "Describe the main objects in this video.",
    ]

    tasks = [
        process_video_request(client, video_url, prompt, i)
        for i, prompt in enumerate(prompts)
    ]

    results = await asyncio.gather(*tasks)

    for request_id, reasoning, content in results:
        print(f"Request {request_id} reasoning: {reasoning}")
        print(f"Request {request_id} content: {content}")

asyncio.run(batch_process())