Query the Cosmos Reason2 API#
This model is available in two sizes: 2B and 8B. This guide uses the 2B version as an example, but the commands are easily adapted to the 8B version, as shown below. For more information on this model, see the model cards on Hugging Face.
Launch NIM#
The following command launches a Docker container for the 2B version of this specific model:
Note
Launching the NIM might take a couple of minutes while it initializes and compiles the model to ensure the best performance.
# Choose a container name for bookkeeping
export CONTAINER_NAME="nvidia-cosmos-reason2-2b" # or "nvidia-cosmos-reason2-8b"
# The repository name from the previous ngc registry image list command
Repository="cosmos-reason2-2b" # or "cosmos-reason2-8b"
Latest_Tag="1.6.0"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Note
The -u $(id -u) option in the docker run command above ensures that the UID in the spawned container matches that of the user on the host. It is recommended whenever the $LOCAL_NIM_CACHE path on the host has permissions that forbid other users from writing to it.
To launch the Docker container for the 8B version of this model instead, change the environment variables and reuse the same docker run command:
export CONTAINER_NAME="nvidia-cosmos-reason2-8b"
export REPOSITORY="cosmos-reason2-8b"
export LATEST_TAG="1.6.0"
export IMG_NAME="nvcr.io/nim/nvidia/${REPOSITORY}:${LATEST_TAG}"
# Then start the NIM with the same docker run command as above.
OpenAI Chat Completion Request#
The Chat Completions endpoint is typically used with chat- or instruction-tuned models designed for a conversational approach. With this endpoint, prompts are sent in the form of messages with roles and contents. To stream the result,
set "stream": true.
Important
Update the model name according to the model you are running.
Note
Most of the snippets below set a max_tokens value. This is mainly for illustration, where the output is unlikely to be much longer, and it ensures that generation requests terminate at a reasonable length. For reasoning examples, which commonly produce many more output tokens, this upper bound is raised compared to non-reasoning examples.
For example, for the nvidia/cosmos-reason2-2b model, you might
provide the URL of an image and query the NIM server from the command line:
curl -X 'POST' \
'http://127.0.0.1:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason2-2b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1280px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
{
"type": "text",
"text": "What is in this image?"
}
]
}
],
"max_tokens": 256,
"stream": false
}'
You can include "stream": true in the request body for streaming responses.
Alternatively, you can use the OpenAI Python SDK. First, install the library:
pip install -U openai
Run the client and query the chat completion API:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
messages = [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1280px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
{
"type": "text",
"text": "What is in this image?"
}
]
}
]
chat_response = client.chat.completions.create(
model="nvidia/cosmos-reason2-2b",
messages=messages,
max_tokens=256,
stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
The above code snippet can be adapted to handle streaming responses as follows:
# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
model="nvidia/cosmos-reason2-2b",
messages=messages,
max_tokens=256,
# Take note of this param.
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta
if delta and delta.content:
text = delta.content
# Print immediately and without a newline to update the output as the response is
# streamed in.
print(text, end="", flush=True)
# Final newline.
print()
Passing Images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Important
Supported image formats are JPG, JPEG, and PNG.
Public direct URL
Passing the direct URL of an image will cause the container to download that image at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1280px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
Base64 data
Another option, useful for images not already on the web, is to first base64-encode the image bytes and send that in your payload.
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert images to base64, you can use the base64 command-line tool, or in Python:
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
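The encoded string can then be embedded in a data URL inside the message content, for example (a minimal sketch; adjust the MIME type to match your file):
image_content = {
    "type": "image_url",
    # The MIME type must match the actual file format (image/png here).
    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
}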
Passing Videos#
NIM for VLMs extends the OpenAI specification to pass videos as part of the HTTP payload in a user message.
The format is similar to the one described in the previous section, replacing image_url with video_url.
Public direct URL
Passing the direct URL of a video will cause the container to download that video at runtime.
{
"type": "video_url",
"video_url": {
"url": "https://download.samplelib.com/mp4/sample-5s.mp4"
}
}
Base64 data
Another option, useful for videos not already on the web, is to first base64-encode the video bytes and send that in your payload.
{
"type": "video_url",
"video_url": {
"url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert videos to base64, you can use the base64 command-line tool, or in Python:
import base64

with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()
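As a sketch, the encoded video can then be sent through the OpenAI Python SDK in the same way as the examples above (the file and the question are placeholders):
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text", "text": "Describe this video."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)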
Sampling and Preprocessing Parameters#
The OpenAI API is extended with additional parameters that give better control over sampling and preprocessing of images and videos at request time.
Video sampling
To control how frames are sampled from video inputs, sampling parameters are exposed using the top-level media_io_kwargs API field.
Either fps or num_frames can be specified. If both are specified, the server uses whichever option yields fewer frames (see the sketch after the examples below).
"media_io_kwargs": {"video": { "fps": 3.0 }}
or
"media_io_kwargs": {"video": { "num_frames": 16 }}
As a general guideline, sampling more frames can result in better accuracy but hurts performance. The default sampling rate is 4.0 FPS, matching the training data.
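The interplay between the two options can be illustrated with a quick calculation (a sketch; the duration and parameter values are hypothetical):
duration_s = 10.0           # assumed video duration
fps, num_frames = 3.0, 16   # both specified in the request
frames_from_fps = int(duration_s * fps)       # 30 frames at 3 FPS
effective = min(frames_from_fps, num_frames)  # the server keeps 16 frames
print(effective)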
Note
Specifying an fps or num_frames value higher than the video's actual frame rate or frame count results in a 400 error code.
Note
The value of fps or num_frames directly determines the temporal resolution of the model's outputs.
For example, at 2 FPS, frames are sampled 0.5 seconds apart, so timestamp precision in the generated
output will be at best within +/- 0.25 seconds of the true values.
Video pixel count
To balance accuracy and performance, shortest_edge and longest_edge can be specified to control the size of frames after preprocessing.
They specify the minimum and maximum number of pixels for the ensemble of sampled frames.
This is done using the top-level field mm_processor_kwargs:
"mm_processor_kwargs": {"size": { "shortest_edge": 1568, "longest_edge": 262144 }}
Defaults are shortest_edge=3136 and longest_edge=12845056.
Each patch of 32x32x2=2048 pixels maps to a multimodal input token. This model was tested to perform best with 16k multimodal tokens or less.
Image frame size
For image inputs, the frame size can be specified the same way:
"mm_processor_kwargs": {"size": { "shortest_edge": 1568, "longest_edge": 262144 }}
Each patch of 32x32=1024 pixels, for each image, maps to a multimodal input token. This model was tested to perform best with 16k multimodal tokens or less.
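As a back-of-the-envelope sketch of the token budget (based on the patch sizes above; it ignores any resizing applied by shortest_edge/longest_edge, so the server's exact accounting may differ):
def image_tokens(width: int, height: int) -> int:
    # Each 32x32 = 1024-pixel patch of an image maps to one multimodal token.
    return (width * height) // (32 * 32)

def video_tokens(width: int, height: int, num_frames: int) -> int:
    # Each 32x32x2 = 2048-pixel patch (two frames deep) maps to one token.
    return (width * height * num_frames) // (32 * 32 * 2)

# A 1280x720 image uses ~900 tokens; 16 frames of 1280x720 use ~7200 tokens.
# Both are comfortably under the recommended 16k multimodal token budget.
print(image_tokens(1280, 720), video_tokens(1280, 720, 16))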
OpenAI Python SDK
Use the extra_body parameter to pass these parameters in the OpenAI Python SDK.
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id
messages = [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "video_url",
"video_url": {
"url": "https://download.samplelib.com/mp4/sample-5s.mp4"
}
},
{
"type": "text",
"text": "What is in this video?"
}
]
}
]
chat_response = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=1024,
stream=False,
extra_body={
"mm_processor_kwargs": {"size": {"shortest_edge": 1568, "longest_edge": 262144}},
# Alternatively, this can be:
# "media_io_kwargs": {"video": {"num_frames": some_int}},
"media_io_kwargs": {"video": {"fps": 1.0}},
}
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
Summary table
The table below summarizes the parameters detailed above:
| Name | Example | Default | Notes |
|---|---|---|---|
| Video FPS sampling | `"media_io_kwargs": {"video": {"fps": 3.0}}` | 4 FPS | The default matches the training data. |
| Sampling N frames in a video | `"media_io_kwargs": {"video": {"num_frames": 16}}` | N/A (default is FPS sampling, not a fixed number of frames) | If both are specified, the option yielding fewer frames is used. |
| Min / max number of pixels for videos and images | `"mm_processor_kwargs": {"size": {"shortest_edge": 1568, "longest_edge": 262144}}` | `shortest_edge=3136`, `longest_edge=12845056` | Higher resolutions can enhance accuracy at the cost of more computation. |
Reasoning#
To enable reasoning, append the following instruction to the end of the user prompt. This encourages the model to produce a long chain-of-thought reasoning response.
Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.
Example:
curl -X 'POST' \
'http://127.0.0.1:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason2-2b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1280px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
{
"type": "text",
"text": "What is in this image? \nAnswer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
}
]
}
],
"max_tokens": 4096
}'
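The response contains the reasoning between <think> tags, followed by the final answer in <answer> tags. A minimal sketch for separating the two in Python (the regex pattern is an assumption based on the format above):
import re

def split_reasoning(text: str):
    """Split a response of the form <think>...</think> <answer>...</answer>.

    Real responses may deviate from the requested format, so each part
    falls back to None when its tags are missing.
    """
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return (
        think.group(1) if think else None,
        answer.group(1) if answer else None,
    )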
Synthetic Data Generation Critic#
cosmos-reason2 can work as a critic for synthetic data generation (SDG). The model can evaluate generated videos for adherence to physics, object permanence, and absence of anomalies.
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id
_USER_PROMPT = """Approve or reject this generated video for inclusion in a dataset for physical world model AI training. It must perfectly adhere to physics, object permanence, and have no anomalies. Any issue or concern causes rejection.
Answer the question using the following format:
<think>
Your reasoning.
</think>
Write your final answer immediately after the </think> tag. Answer with Approve or Reject only."""
message = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "video_url", "video_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos-reason1-7b/car_curb.mp4"}},
{"type": "text", "text": _USER_PROMPT},
]},
]
response = client.chat.completions.create(
model=model,
messages=message,
max_tokens=4096,
temperature=0.3,
top_p=0.3,
extra_body={"media_io_kwargs": {"fps": 4}},
)
print(response.choices[0].message.content)
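Since the prompt asks for the verdict immediately after the closing </think> tag, it can be extracted with a simple split (a sketch, assuming the model follows the requested format):
content = response.choices[0].message.content
# The verdict ("Approve" or "Reject") follows the closing </think> tag.
verdict = content.split("</think>")[-1].strip()
print(verdict)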
2D Trajectory Creation#
cosmos-reason2 supports point coordinates on image inputs.
Pixel coordinates are normalized to the range 0-1000 on each axis, independent of the image's aspect ratio: for a 1920x1080 image, for example, both axes still span 0-1000. The origin is the top-left corner; X increases to the right (horizontal axis) and Y increases downward (vertical axis).
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id
_USER_PROMPT = """You are given the task "Move the left bottle to far right". Specify the 2D trajectory your end effector should follow in pixel space. Return the trajectory coordinates in JSON format like this: {"point_2d": [x, y], "label": "gripper trajectory"}.
Answer the question using the following format:
<think>
Your reasoning.
</think>
Write your final answer immediately after the </think> tag."""
message = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos-reason1-7b/critic_rejection_sampling.jpg"}},
{"type": "text", "text": _USER_PROMPT},
]},
]
response = client.chat.completions.create(
model=model,
messages=message,
max_tokens=4096,
temperature=0.3,
top_p=0.3,
extra_body={"media_io_kwargs": {"fps": 2}},
)
print(response.choices[0].message.content)
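To overlay the returned point_2d values on the original image, convert them from the normalized 0-1000 range back to pixel coordinates. A minimal sketch (the helper name is illustrative):
def denormalize_point(point_2d, width, height):
    """Map a model coordinate (0-1000 per axis) to pixel coordinates."""
    x, y = point_2d
    return x / 1000 * width, y / 1000 * height

# For example, [500, 500] maps to the center of a 1920x1080 image:
print(denormalize_point([500, 500], 1920, 1080))  # (960.0, 540.0)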
Text-only Queries#
Many VLMs, such as nvidia/cosmos-reason2-2b and
nvidia/cosmos-reason2-8b, support text-only queries, where the
VLM behaves exactly like a (text-only) LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in the Support Matrix for details on text-only query support.
curl -X 'POST' \
'http://127.0.0.1:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/cosmos-reason2-2b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
],
"max_tokens": 4096
}'
Or using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
model = client.models.list().data[0].id
messages = [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
chat_response = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=4096,
stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)