Query the Cosmos Reason1 7B API#
For more information on this model, see the model card on build.nvidia.com.
Launch NIM#
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-cosmos-reason1-7b"
# The repository name from the previous ngc registry image list command
Repository="cosmos-reason1-7b"
Latest_Tag="1.4.0"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Note
The -u $(id -u) option in the docker run command above ensures that the UID inside the spawned container matches that of the user on the host. It is recommended when the $LOCAL_NIM_CACHE path on the host has permissions that forbid other users from writing to it.
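Once the container logs indicate that the server is ready, you can sanity-check it before sending inference requests. The commands below are a minimal sketch assuming the standard NIM health and model-listing endpoints are exposed on port 8000:
# Readiness probe; returns successfully once the model is loaded and the server accepts requests
curl -s http://0.0.0.0:8000/v1/health/ready
# List the models served by this NIM; the returned id is the value to pass as "model" in requests
curl -s http://0.0.0.0:8000/v1/models | python3 -m json.tool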
OpenAI Chat Completion Request#
The Chat Completions endpoint is typically used with chat or instruct tuned models designed for a conversational approach. With this endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.
Important
Update the model name according to the model you are running.
Note
Most of the snippets below set a max_tokens value. This is mainly for illustration, where the output is unlikely to be much longer, and to ensure generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, that upper bound is raised compared to non-reasoning examples.
For example, for the nvidia/cosmos-reason1-7b model, you might provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason1-7b",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"max_tokens": 256
}'
You can add "stream": true to the request body above to receive streaming responses.
Alternatively, you can use the OpenAI Python SDK. First, install the library:
pip install -U openai
Run the client and query the chat completion API:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=256,
stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
The above code snippet can be adapted to handle streaming responses as follows:
# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=256,
# Take note of this param.
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()
Passing images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert images to base64, you can use the base64 command, or in Python:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
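The resulting string goes into a data URI in place of a public URL. Below is a minimal sketch of a complete request using it, assuming the server launched above and the image_b64 variable from the previous snippet (the PNG MIME type is only an example; adjust it to your image format):
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                # image_b64 comes from the encoding snippet above.
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }
]

chat_response = client.chat.completions.create(
    model="nvidia/cosmos-reason1-7b",
    messages=messages,
    max_tokens=256,
)
print(chat_response.choices[0].message.content)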
Passing videos#
NIM for VLMs extends the OpenAI specification to pass videos as part of the HTTP payload in a user message.
The format is similar to the one described in the previous section, but with image_url replaced by video_url.
Public direct URL
Passing the direct URL of a video will cause the container to download that video at runtime.
{
"type": "video_url",
"video_url": {
"url": "https://download.samplelib.com/mp4/sample-5s.mp4"
}
}
Base64 data
Another option, useful for videos not already on the web, is to first base64-encode the video bytes and send that in your payload.
{
"type": "video_url",
"video_url": {
"url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert videos to base64, you can use the base64 command, or in Python:
import base64

with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()
Sampling and preprocessing parameters#
The OpenAI API is extended to give better control over sampling and preprocessing of images and videos at request time.
Video sampling
To control how frames are sampled from video inputs, sampling parameters are exposed through the top-level media_io_kwargs API field. Either fps or num_frames can be specified, but not both at the same time.
"media_io_kwargs": {"video": { "fps": 3.0 }}
or
"media_io_kwargs": {"video": { "num_frames": 16 }}
As a general guideline, sampling more frames can result in better accuracy but hurts performance. The default sampling rate is 4.0 FPS, matching the training data.
Note
Specifying values of fps or num_frames higher than the actual values for a given video will result in a 400 error code.
Note
The value of fps or num_frames is directly correlated with the temporal resolution of the model's outputs. For example, at 2 FPS, timestamp precision in the generated output will be at best within +/- 0.25 seconds of the true values.
Video frame size
To balance accuracy and performance, min_pixels and max_pixels can be specified to control the size of frames after preprocessing. This is done using the top-level field mm_processor_kwargs:
"mm_processor_kwargs": {"videos_kwargs": { "min_pixels": 1568, "max_pixels": 262144 }}
Defaults are min_pixels=3136 and max_pixels=12845056.
Each patch of 28x28=784 pixels, for each group of 2 sampled frames, maps to a multimodal input token. By default, every 2 sampled frames are mapped to between 4 and 16384 tokens depending on the frame size. This model was tested to perform best with 16k multimodal tokens or less.
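As a rough illustration of this arithmetic, the sketch below estimates the multimodal token count for a video. It is an approximation only, not the exact preprocessor logic, and assumes the frame area is simply clamped to the min_pixels/max_pixels bounds:
import math

def approx_video_tokens(width, height, duration_s, fps=4.0,
                        min_pixels=3136, max_pixels=12845056):
    # Frame area is assumed to be clamped to [min_pixels, max_pixels] during preprocessing.
    pixels = min(max(width * height, min_pixels), max_pixels)
    # One token per 28x28 = 784 pixel patch, per group of 2 sampled frames.
    tokens_per_pair = pixels / 784
    frame_pairs = (duration_s * fps) / 2
    return math.ceil(frame_pairs * tokens_per_pair)

# A 10-second 1280x720 clip at the default 4 FPS lands above the ~16k-token sweet spot,
# which is when lowering max_pixels or fps becomes worthwhile.
print(approx_video_tokens(1280, 720, 10))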
Image frame size
For image inputs, this can be specified similarly:
"mm_processor_kwargs": {"images_kwargs": { "min_pixels": 1568, "max_pixels": 262144 }}
Each patch of 28x28=784 pixels, for each image, maps to a multimodal input token. By default, every input image is mapped to between 4 and 16384 tokens depending on the image size. This model was tested to perform best with 16k multimodal tokens or less.
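When calling the HTTP API directly, both media_io_kwargs and mm_processor_kwargs are passed as top-level fields of the request body. The request below is an illustrative sketch combining the two, reusing the sample video URL and parameter values from the examples above:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/cosmos-reason1-7b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this video?"},
          {"type": "video_url", "video_url": {"url": "https://download.samplelib.com/mp4/sample-5s.mp4"}}
        ]
      }
    ],
    "media_io_kwargs": {"video": {"fps": 1.0}},
    "mm_processor_kwargs": {"videos_kwargs": {"min_pixels": 1568, "max_pixels": 262144}},
    "max_tokens": 1024
  }'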
OpenAI Python SDK
Use the extra_body parameter to pass these fields with the OpenAI Python SDK.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this video?"
},
{
"type": "video_url",
"video_url": {
"url": "https://download.samplelib.com/mp4/sample-5s.mp4"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=1024,
stream=False,
extra_body={
"mm_processor_kwargs": {"videos_kwargs": {"min_pixels": 1568, "max_pixels": 262144}},
# Alternatively, this can be:
# "media_io_kwargs": {"video": {"num_frames": some_int}},
"media_io_kwargs": {"video": {"fps": 1.0}},
}
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
Summary table
The table below summarizes the parameters detailed above:
Name | Example | Default | Notes
---|---|---|---
Video FPS sampling | "media_io_kwargs": {"video": {"fps": 3.0}} | 4.0 | Cannot be combined with num_frames. Sampling more frames can improve accuracy but hurts performance.
Sampling N frames in a video | "media_io_kwargs": {"video": {"num_frames": 16}} | N/A | Cannot be combined with fps.
Min / max number of pixels for videos | "mm_processor_kwargs": {"videos_kwargs": {"min_pixels": 1568, "max_pixels": 262144}} | min_pixels=3136, max_pixels=12845056 | Higher resolutions can enhance accuracy at the cost of more computation.
Min / max number of pixels for images | "mm_processor_kwargs": {"images_kwargs": {"min_pixels": 1568, "max_pixels": 262144}} | min_pixels=3136, max_pixels=12845056 | Higher resolutions can enhance accuracy at the cost of more computation.
Reasoning#
Following the guidance from the model authors, append the following to the system prompt to encourage a long chain-of-thought reasoning response.
Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.
Example:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason1-7b",
"messages": [
{
"role": "system",
"content": "Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"max_tokens": 4096
}'
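The response content then follows the requested <think>...</think><answer>...</answer> structure, so the reasoning and the final answer can be separated client-side. A minimal parsing sketch, assuming the model complied with the format:
import re

def split_reasoning(text: str):
    """Split a <think>/<answer> formatted response into (reasoning, answer)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not answer:
        # Fall back to the raw text if the model did not follow the format.
        return None, text
    return (think.group(1).strip() if think else None), answer.group(1).strip()

# content would normally be chat_response.choices[0].message.content
content = "<think>\nA boardwalk crosses a grassy wetland.\n</think>\n\n<answer>\nA wooden boardwalk through tall grass.\n</answer>"
reasoning, answer = split_reasoning(content)
print(answer)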
Temporal localization#
nvidia/cosmos-reason1-7b can identify the timestamps of events in videos.
Warning
The model expects timestamps to be watermarked onto the video in the same way as during training. The README from the cosmos-reason1 repository shows how an input video can be modified to watermark the timestamps.
Note
The model's ability to extract timestamps within a desired temporal resolution is directly affected by the fps at which the video is processed. See the Sampling and preprocessing parameters section above for more details.
Assuming the input video has been watermarked with timestamps, we can ask the model to identify the timestamps of events as in the example below.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
_SYSTEM_PROMPT = """You are a specialized Autonomous Vehicle behavior analyst. Your task is to analyze the video and identify MULTIPLE discrete driving events with precise timestamps.
CRITICAL REQUIREMENTS:
1. Extract timestamps from the bottom of each frame
2. Segment the video into SEPARATE events (minimum 3-5 events per video)
3. Each event should be 2-5 seconds long maximum
4. Look for changes in: speed, direction, lane position, traffic interactions, signals, obstacles
MANDATORY EVENT TYPES TO IDENTIFY:
- Lane changes or lane keeping
- Speed adjustments (acceleration/deceleration)
- Turning maneuvers
- Traffic light responses
- Pedestrian/vehicle interactions
- Parking or stopping actions
- Navigation decisions
Answer the question in the following format:
<think>
I will analyze the video systematically:
1. First, identify ALL visible timestamps throughout the video
2. Break the video into 1-3 second segments
3. For EACH segment, identify what specific action occurs
4. Determine the physical reasoning for each action
5. Ensure I have at least 3-6 separate events
6. Always answer in English
Event 1: <start time> - <end time> - Action and reasoning
Event 2: <start time> - <end time> - Action and reasoning
Event 3: <start time> - <end time> - Action and reasoning
Event 4: <start time> - <end time> - Action and reasoning
[Continue for all events identified]
</think>
<answer>
<start time> - <end time> Specific driving action and detailed explanation of why the ego vehicle made this decision.
<start time> - <end time> Specific driving action and detailed explanation of why the ego vehicle made this decision.
<start time> - <end time> Specific driving action and detailed explanation of why the ego vehicle made this decision.
<start time> - <end time> Specific driving action and detailed explanation of why the ego vehicle made this decision.
[Continue for all events identified]
</answer>
"""
messages = [
{"role": "user", "content": [{"type": "text", "text": _SYSTEM_PROMPT}]},
{"role": "user", "content": [
{
"type": "text",
"text": "describe ego vehicle actions"
},
{
"type": "video_url",
"video_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos-reason1-7b/av_construction_stop_timestamped.mp4"
}
}
]},
]
response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=2048,
temperature=0.2,
top_p=0.95,
extra_body={
"media_io_kwargs": {"video": {"fps": 3}},
"nvext": {"repetition_penalty": 1.05},
},
)
assistant_message = response.choices[0].message.content
print(assistant_message)
Anomaly detection#
nvidia/cosmos-reason1-7b can reason about the plausibility and physical accuracy of videos and help identify anomalies and artifacts.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
_SYSTEM_PROMPT = """You are an assistant for video anomaly detection. The goal is to identify artifacts and anomalies in the video.
You are an expert in the physics of moving vehicles.
When analyzing a video, reason not only about what happens but also about what should happen as a natural consequence of events.
Watch carefully and focus on the following details:
* Physical accuracy (gravity, collision, object interaction, fluid dynamics, object permanence, etc.)
* Cause-and-effect
* Temporal consistency
* Spatial consistency
* Expected side-effects of impacts / collisions.
Please reason by:
* Identifying key events (e.g. collisions).
* Inferring the expected physical side effects of such events.
* Compare the actual video to these expectations.
* Highlight anomalies that arise from mismatches.
Here are some examples of things you should not include in your analysis:
+ Do not consider legality or human etiquette. Your analysis should be limited strictly to physical realism.
* Avoid ungrounded and over-general explanations such as overall impression, artistic style, or background elements.
* The video has no sound. Avoid explanations based on sound.
* Do not mention lighting, shadows, or blurring in your analysis.
* Do not try to capture the 'mood' of the scene.
Answer the question in English with provided options in the following format:
<think>
your reasoning
</think>
<answer>
your answer
</answer>
"""
messages = [
{"role": "user", "content": [{"type": "text", "text": _SYSTEM_PROMPT}]},
{"role": "user", "content": [
{
"type": "text",
"text": "Is the ego vehicle moving realistically when it drives over the curb on the median?"
},
{
"type": "video_url",
"video_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos-reason1-7b/car_curb.mp4"
}
}
]},
]
response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=4096,
temperature=0.0,
top_p=1.0,
extra_body={
"media_io_kwargs": {"video": {"fps": 4}},
},
)
assistant_message = response.choices[0].message.content
print(assistant_message)
Structured generation#
nvidia/cosmos-reason1-7b supports structured generation. This can be useful for tasks such as robotics planning:
from typing import List, Literal, Union
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
_SYSTEM_PROMPT = """You are an expert Physical AI planning model trained to understand space, time, and physics for robots and autonomous systems.
Your role: When asked to plan the next N steps for a robot or agent, analyze the video/scene and generate a specific number of sequential actions with detailed reasoning.
The user may specify how many steps to plan (e.g., "next step", "next 2 steps", "next 3 steps").
ALWAYS respond in this format in English:
<think>
1. Scene Analysis: [Describe current environment, objects, robot state]
2. Goal Understanding: [What the user wants to accomplish]
3. Step Count: [How many steps requested by user]
4. Sequential Planning: [Plan each step in logical order, considering:
- Physical constraints and safety
- Object interactions and reachability
- Temporal dependencies between actions
- Spatial relationships and object positioning
5. Physics Validation: [Check each planned step for physical plausibility]
</think>
<answer>
**Planned Actions (Next [N] Steps):**
**Step 1:** [Clear, specific action description]
- Reasoning: [Why this action first, physical considerations]
- Success condition: [How to verify completion]
**Step 2:** [Clear, specific action description]
- Reasoning: [Why this action follows step 1, dependencies]
- Success condition: [How to verify completion]
[Continue for N total steps as requested]
</answer>
"""
class MoveAction(BaseModel):
    type: Literal["move"]
    direction: str
    distance_m: float

class TurnAction(BaseModel):
    type: Literal["turn"]
    angle_deg: float

class ActionsPayload(BaseModel):
    actions: List[Union[MoveAction, TurnAction]]
# Example prompt/messages
messages = [
{"role": "user", "content": [{"type": "text", "text": _SYSTEM_PROMPT}]},
{"role": "user", "content": [
{
"type": "text",
"text": (
"The overall goal is to \"toast bread\". The agent in the video is currently "
"performing one subtask out of many to complete this instruction. For the agent in "
"the video, what are the 3 most plausible next immediate subtasks?"
),
},
{
"type": "video_url",
"video_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos-reason1-7b/planning.mp4"
}
}
]},
]
# Send the request with the new schema
response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=4096,
response_format={
"type": "json_schema",
"json_schema": {
"name": "ActionsPayload",
"schema": ActionsPayload.model_json_schema()
}
},
temperature=0.2,
top_p=0.95,
extra_body={
"media_io_kwargs": {"video": {"fps": 4}},
"nvext": {"repetition_penalty": 1.05},
},
)
assistant_message = response.choices[0].message.content
print(assistant_message)
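Because the output is constrained to the ActionsPayload JSON schema, the returned string can be validated back into the same Pydantic model. A brief follow-up sketch, assuming the request above succeeded and returned schema-conforming JSON:
# Re-validate the returned JSON string against the schema used for generation.
actions = ActionsPayload.model_validate_json(assistant_message)
for action in actions.actions:
    print(action)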
For more details on the output formatting options, including JSON schemas, regular expressions, and context-free grammars, see Structured Generation.
Text-only Queries#
Many VLMs such as nvidia/cosmos-reason1-7b support text-only queries, where a VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix for information on text-only support.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason1-7b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
],
"max_tokens": 4096
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
chat_response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=4096,
stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
Multi-turn Conversation#
Instruction-tuned VLMs may also support multi-turn conversations with repeated interactions between the user and the model.
Important
Multi-turn capability is not available for all VLMs. Please refer to the model cards for information on multi-turn conversations.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason1-7b",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
],
"max_tokens": 4096
}'
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows a boardwalk in a field of tall grass. ..."
},
{
"role": "user",
"content": "What would be the best season to visit this place?"
}
]
chat_response = client.chat.completions.create(
model="nvidia/cosmos-reason1-7b",
messages=messages,
max_tokens=4096,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
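To continue the conversation for another turn, append the assistant reply to the message list together with the next user message, then call the endpoint again. A minimal sketch continuing from the snippet above (the follow-up question is only illustrative):
# Append the assistant reply, then add the next user turn and send the request again.
messages.append({"role": "assistant", "content": assistant_message.content})
messages.append({"role": "user", "content": "Are there any hiking trails nearby?"})

next_response = client.chat.completions.create(
    model="nvidia/cosmos-reason1-7b",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
print(next_response.choices[0].message.content)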
Using LangChain#
NIM for VLMs allows seamless integration with LangChain, a framework for developing applications powered by large language models (LLMs).
Install LangChain using the following command:
pip install -U langchain-openai langchain-core
Query the OpenAI Chat Completions endpoint using LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
model = ChatOpenAI(
model="nvidia/cosmos-reason1-7b",
openai_api_base="http://0.0.0.0:8000/v1",
openai_api_key="not-needed"
)
message = HumanMessage(
content=[
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
},
],
)
print(model.invoke([message]))