Query the NVIDIA Nemotron 3 Nano Omni API#
For more information on this model, see the model card on build.nvidia.com.
Launch NIM#
Make sure you complete the steps in Get Started with NIM before you launch the NIM.
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-nemotron-3-nano-omni-30b-a3b-reasoning"
# The repository name and tag from the previous ngc registry image list command
Repository="nemotron-3-nano-omni-30b-a3b-reasoning"
Latest_Tag="1.7.0-variant"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Optionally make the cache world-writable to avoid issues writing to the cache if the container is running as a different user
chmod -R a+w "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME
This NIM is built with a specialized base container and is subject to
limitations. It uses the release tag 1.7.0-variant for the
container image. Refer to Notes on NIM Container Variants
for more information.
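Once the container is running, you can wait for the server to become ready before sending requests. The following is a minimal sketch that polls the standard NIM readiness endpoint (/v1/health/ready); treat the exact route as an assumption if your NIM version differs.
import time
import urllib.request

# Poll the NIM readiness endpoint until the server accepts requests.
url = "http://0.0.0.0:8000/v1/health/ready"
while True:
    try:
        with urllib.request.urlopen(url) as resp:
            if resp.status == 200:
                print("NIM server is ready")
                break
    except OSError:
        # Server not up yet or not ready; retry after a short delay.
        pass
    time.sleep(5)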
OpenAI Chat Completion Request#
The Chat Completions endpoint is typically used with chat- or instruct-tuned models designed for a conversational approach. With this endpoint, prompts are sent as messages with roles and contents, which provides a natural way to track a multi-turn conversation. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
Note
Most of the snippets below set a max_tokens value. This is mainly for illustration, where the output is unlikely to be much longer, and it ensures that generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, this upper bound is raised compared to non-reasoning examples.
For example, for the nvidia/nemotron-3-nano-omni-30b-a3b-reasoning model, you can provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
}
],
"max_tokens": 4096
}'
You can include "stream": true in the request body above for streaming responses.
Alternatively, you can use the OpenAI Python SDK. Install it with:
pip install -U openai
Run the client and query the chat completion API:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
The above code snippet can be adapted to handle streaming responses as follows:
# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=True,
)
printed_reasoning = False
printed_content = False
for chunk in stream:
    # `or None` coerces empty-string deltas to None so only real tokens are handled
    reasoning = getattr(chunk.choices[0].delta, "reasoning", None) or None
    content = getattr(chunk.choices[0].delta, "content", None) or None
    if reasoning is not None:
        if not printed_reasoning:
            printed_reasoning = True
            print("reasoning:", end="", flush=True)
        print(reasoning, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        print(content, end="", flush=True)
print()
Passing Images, Videos, and Audio#
This model accepts images, videos, and audio as input. Media is passed as part of the HTTP payload in a user message, following the OpenAI specification.
| Modality | Formats |
|---|---|
| Image | GIF, JPG, JPEG, PNG |
| Video | MP4 |
| Audio | WAV, MP3, FLAC |
Public direct URL
Passing the direct URL of a media file will cause the container to download it at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
{
"type": "video_url",
"video_url": {
"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
}
}
Refer to Supported Codecs and Video Formats for the supported file formats.
{
"type": "audio_url",
"audio_url": {
"url": "https://sample-files.com/downloads/audio/wav/voice-sample.wav"
}
}
Note
Some public URLs that are accessible from a browser may reject programmatic downloads (for example, based on the User-Agent header). If a URL request fails at runtime, use base64-encoded data instead.
Base64 data
For media not already on the web, base64-encode the bytes and send them inline.
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
{
"type": "video_url",
"video_url": {
"url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
{
"type": "audio_url",
"audio_url": {
"url": "data:audio/wav;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert a file to base64 in Python:
import base64

# Hypothetical path to your local media file
file_path = "example.jpg"
with open(file_path, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
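The encoded string is then embedded in a data URL and used in place of a web URL. A short sketch continuing from the snippet above (the image/jpeg MIME type is a placeholder; match it to your file format):
# Build a content part with an inline data URL.
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}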
Full examples with the OpenAI Python SDK for image, video, and audio inputs:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"}}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is happening in this video?"},
{"type": "video_url", "video_url": {"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"}}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe and summarize this audio."},
{"type": "audio_url", "audio_url": {"url": "https://sample-files.com/downloads/audio/wav/voice-sample.wav"}}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Audio-in Video#
The model accepts video files that contain an embedded audio track. To enable
this feature, pass a video containing audio via the video_url field
described above, and set "mm_processor_kwargs": {"use_audio_in_video": true}
in the request body.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe what is happening in this video, including any speech."
},
{
"type": "video_url",
"video_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/active-speaker-detection/video_1.mp4"
}
}
]
}
],
"max_tokens": 4096,
"mm_processor_kwargs": {"use_audio_in_video": true}
}'
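The same request can be sent with the OpenAI Python SDK. The sketch below assumes that mm_processor_kwargs, like the other NIM extension fields, is forwarded through the SDK's extra_body parameter (the mechanism shown for sampling parameters in the next section):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this video, including any speech."},
            {"type": "video_url", "video_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/active-speaker-detection/video_1.mp4"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=messages,
    max_tokens=4096,
    # Assumption: NIM extension fields pass through extra_body.
    extra_body={"mm_processor_kwargs": {"use_audio_in_video": True}},
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)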
Sampling and Preprocessing Parameters#
This NIM extends the OpenAI API with parameters that control the sampling and preprocessing of images and videos at request time.
Video sampling
To control how frames are sampled from video inputs, sampling parameters are exposed using the top-level media_io_kwargs API field.
You can specify either fps or num_frames; set the other to -1. If both are positive, the number of frames sampled is the maximum of the two.
"media_io_kwargs": {"video": {"fps": 3.0, "num_frames": -1}}
or
"media_io_kwargs": {"video": {"num_frames": 16, "fps": -1}}
As a general guideline, sampling more frames can improve accuracy but hurts performance. The default sampling rate is 2.0 FPS, matching the training data.
Note
The value of fps or num_frames is directly correlated to the temporal resolution of the model’s outputs.
For example, at 2 FPS, timestamp precision in the generated output will be at best within +/- 0.25 seconds of
the true values.
OpenAI Python SDK
Use the extra_body parameter to pass these fields when using the OpenAI Python SDK.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this video?"
},
{
"type": "video_url",
"video_url": {
"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
extra_body={
# Alternatively, this can be:
# "media_io_kwargs": {"video": {"num_frames": some_int, "fps": -1}},
"media_io_kwargs": {"video": {"fps": 1.0, "num_frames": -1}},
}
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Efficient Video Sampling (EVS)
EVS enables pruning of video tokens to reduce compute and latency for video
requests. Control EVS using the NIM_VIDEO_PRUNING_RATE environment variable
with a value between 0 and 1. The value represents the fraction of video
tokens removed. The default value of 0.5 provides a good balance between
accuracy and performance. Setting it to 0 disables EVS entirely. Higher
values prune more tokens, improving throughput and latency with potential
accuracy trade-offs. EVS applies to video inputs only.
Function (Tool) Calling#
You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).
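As a minimal sketch (the get_weather tool below is hypothetical, and the request follows the standard OpenAI tools schema; refer to Call Functions (Tools) for authoritative usage), a tool is declared in the request and any call the model decides to make appears on message.tool_calls:
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Hypothetical tool definition using the standard OpenAI tools schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }
]

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    max_tokens=4096,
)

msg = chat_response.choices[0].message
if msg.tool_calls:
    # The model requested a tool call; execute it and return the result
    # in a follow-up "tool" message.
    for call in msg.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(msg.content)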
Reasoning#
This model supports reasoning with text and vision inputs. Thinking is
enabled by default. To disable it for a specific request, pass
"chat_template_kwargs": {"enable_thinking": false} in the request body.
When reasoning is enabled, the response separates the reasoning trace from the final answer into two fields:
- reasoning: the model’s internal chain of thought.
- content: the final answer presented to the user.
Warning
If max_tokens is set too low, the model may exhaust the token budget
during the reasoning phase, resulting in an empty content field. Use a
sufficiently large value (for example, 4096 or higher) when reasoning is enabled.
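To detect this condition programmatically, inspect the choice’s finish_reason, which is part of the standard OpenAI response schema (a short sketch continuing from the chat_response object in the SDK examples above):
choice = chat_response.choices[0]
# finish_reason == "length" means generation stopped at the max_tokens limit.
if choice.finish_reason == "length" and not choice.message.content:
    print("Reasoning consumed the token budget; retry with a larger max_tokens.")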
Example with reasoning turned off:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
}
],
"max_tokens": 4096,
"chat_template_kwargs": {"enable_thinking": false}
}'
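The equivalent request with the OpenAI Python SDK, passing chat_template_kwargs through extra_body (a sketch; extra_body is the SDK's mechanism for fields outside the standard OpenAI schema):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"}}
            ]
        }
    ],
    max_tokens=4096,
    # Disable the reasoning trace for this request.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("content:", chat_response.choices[0].message.content)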
Reasoning Budget Control#
This model supports a thinking budget that limits the maximum number of tokens used for reasoning.
You can control the reasoning budget by setting the thinking_token_budget parameter.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "user",
"content": "What is the derivative of x^2?"
}
],
"max_tokens": 4096,
"thinking_token_budget": 100
}'
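The same request with the OpenAI Python SDK, again passing the extension field through extra_body (a sketch under the same assumption as above):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
    messages=[{"role": "user", "content": "What is the derivative of x^2?"}],
    max_tokens=4096,
    # Cap the number of tokens the model may spend on reasoning.
    extra_body={"thinking_token_budget": 100},
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)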
Text-only Queries#
Many VLMs, such as nvidia/nemotron-3-nano-omni-30b-a3b-reasoning, support text-only queries, in which the VLM behaves exactly like a text-only LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix to confirm support for text-only queries.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
],
"max_tokens": 4096
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.
Important
Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn support.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
],
"max_tokens": 4096
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
]
chat_response = client.chat.completions.create(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
messages=messages,
max_tokens=4096,
stream=False
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)
Using LangChain#
NIM for VLMs integrates seamlessly with LangChain, a framework for developing applications powered by large language models (LLMs).
Install LangChain using the following command:
pip install -U langchain-openai langchain-core
Query the OpenAI Chat Completions endpoint using LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
model = ChatOpenAI(
model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
openai_api_base="http://0.0.0.0:8000/v1",
openai_api_key="not-needed"
)
message = HumanMessage(
content=[
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
},
],
)
print(model.invoke([message]))
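LangChain can also stream the response. A short sketch continuing from the model and message objects above, using ChatOpenAI's standard stream method:
# Stream tokens as they are generated.
for chunk in model.stream([message]):
    print(chunk.content, end="", flush=True)
print()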
Batched Inference#
For improved throughput when processing multiple requests, you can use asynchronous batching to send concurrent requests to the NIM endpoint.
Install the async OpenAI client:
pip install -U openai
The following example demonstrates sending multiple concurrent requests with the same video:
import asyncio
from openai import AsyncOpenAI

async def process_video_request(client, video_url, prompt, request_id):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}}
            ]
        }
    ]
    response = await client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
        messages=messages,
        max_tokens=4096,
    )
    msg = response.choices[0].message
    return request_id, msg.reasoning, msg.content

async def batch_process():
    client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
    video_url = "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    prompts = [
        "What is happening in this video?",
        "Describe the main objects in this video.",
    ]
    tasks = [
        process_video_request(client, video_url, prompt, i)
        for i, prompt in enumerate(prompts)
    ]
    results = await asyncio.gather(*tasks)
    for request_id, reasoning, content in results:
        print(f"Request {request_id} reasoning: {reasoning}")
        print(f"Request {request_id} content: {content}")

asyncio.run(batch_process())