Query the DiffusionGemma 26B A4B IT API#

For more information on this model, refer to the model page.

Launch NIM#

Make sure you complete the steps in Get Started with NIM before you launch the NIM.

The following command launches a Docker container for this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=google-diffusiongemma-26b-a4b-it

# The container name from the previous ngc registry image list command
Repository="diffusiongemma-26b-a4b-it"
Latest_Tag="1.7.0-variant"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/google/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME

This NIM is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
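
The container may need some time to download and load the model before it accepts requests. The following is a minimal sketch of a readiness poll, not an official client. It assumes the NIM exposes a `/v1/health/ready` endpoint on the mapped port (an assumption that this page does not state); `is_ready`, `wait_until_ready`, and their parameters are hypothetical names chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_ready(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, attempts=60, delay=5.0, probe=is_ready, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` is exhausted."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


# Assumed endpoint path; adjust it if your deployment differs.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"
```

Calling `wait_until_ready(READY_URL)` before sending the first request avoids connection errors while the model is still loading.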

OpenAI Chat Completion Request#

The Chat Completions endpoint is typically used with chat or instruct tuned models designed for a conversational approach. With this endpoint, prompts are sent in the form of messages with roles and content, giving a natural way to keep track of a multi-turn conversation. To stream the response, set "stream": true.

Note

Examples use max_tokens to keep output short. Reasoning examples use a higher max_tokens because reasoning output is typically longer.

For example, for a google/diffusiongemma-26b-a4b-it model, you might provide the URL of an image and query the NIM server from the command line:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/diffusiongemma-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 1024
    }'

You can include "stream": true in the preceding request body for streaming responses.
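
With "stream": true, an OpenAI-compatible server returns server-sent events: each event is a line that starts with `data:` followed by a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below shows one way to extract the text from such lines without the SDK. The helper name `iter_stream_text` and the sample lines are illustrative assumptions, not output captured from this NIM.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (strings)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events that follow the streaming chunk shape.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "A "}}]}',
    'data: {"choices": [{"delta": {"content": "cat."}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints: A cat.
```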

Alternatively, you can use the OpenAI Python SDK. Install it with:

pip install -U openai

Run the client and query the Chat Completions API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=1024,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)

To stream responses, pass stream=True and iterate over the response:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=1024,
    # Take note of this param.
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Passing Images and Videos#

This model accepts images and videos as input. Media is passed as part of the HTTP payload in a user message, following the OpenAI specification.

Supported formats#

Modality    Formats
Image       GIF, JPG, JPEG, PNG
Video       MP4
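
As an illustration of the formats listed above, the sketch below picks the content part type (`image_url` or `video_url`) from a URL's file extension and rejects anything else before a request is sent. The helper `media_part` and the extension check are assumptions made for this example. The server decides what it accepts, and a URL without a file extension would need different handling.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions taken from the supported formats listed above.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}
VIDEO_EXTENSIONS = {".mp4"}


def media_part(url):
    """Return a message content part for `url`, chosen by file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return {"type": "image_url", "image_url": {"url": url}}
    if suffix in VIDEO_EXTENSIONS:
        return {"type": "video_url", "video_url": {"url": url}}
    raise ValueError(f"Unsupported media extension: {suffix or '(none)'}")


print(media_part("https://example.com/media/clip.mp4")["type"])  # prints: video_url
```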

Public direct URL

Passing the direct URL of a media file will cause the container to download it at runtime.

Image:

{
    "type": "image_url",
    "image_url": {
        "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
    }
}

Video:

{
    "type": "video_url",
    "video_url": {
        "url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"
    }
}

Refer to Supported Codecs and Video Formats for the supported file formats.

Note

Some public URLs that are accessible from a browser may reject programmatic downloads (for example, based on the User-Agent header). If a URL request fails at runtime, use base64-encoded data instead.

Base64 data

For media not already on the web, base64-encode the bytes and send them inline.

Image:

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

Video:

{
    "type": "video_url",
    "video_url": {
        "url": "data:video/mp4;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert a file to base64 in Python:

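import base64
file_path = "example.jpg"  # hypothetical path; replace with your media file
with open(file_path, "rb") as f:
    # Encode the raw bytes as a base64 string for use in a data URL.
    b64 = base64.b64encode(f.read()).decode()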
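
To go from a local file to a complete content part, the base64 string still has to be wrapped in a `data:` URL with the matching MIME type. The sketch below does this with the standard library. The helper name `file_to_media_part` is hypothetical, and guessing the MIME type from the file name is a simplification that may not suit every file.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_media_part(file_path):
    """Encode a local image or video file as an inline data-URL content part."""
    path = Path(file_path)
    mime, _ = mimetypes.guess_type(path.name)
    kind = mime.split("/")[0] if mime else None
    if kind not in ("image", "video"):
        raise ValueError(f"Cannot treat {path.name!r} as an image or video")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    key = f"{kind}_url"  # "image_url" or "video_url"
    return {"type": key, key: {"url": f"data:{mime};base64,{b64}"}}
```

The returned dictionary can be placed in the `content` list of a user message, next to a `{"type": "text", ...}` part.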

Full example with the OpenAI Python SDK

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

The same request with a video input:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video_url", "video_url": {"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"}}
        ]
    }
]
chat_response = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=False,
)
msg = chat_response.choices[0].message
print("reasoning:", msg.reasoning)
print("content:", msg.content)

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).

Reasoning#

This model supports reasoning with text and image inputs. It is turned on by default. To turn off reasoning, add "chat_template_kwargs": { "thinking": false } in the request body.

Example with reasoning turned off:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/diffusiongemma-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                            }
                    }
                ]
            }
        ],
        "chat_template_kwargs": { "thinking": false },
        "max_tokens": 4096
    }'

You can also omit the reasoning tokens from the response by setting "include_reasoning": false in the request body. The model will still reason internally. Setting "include_reasoning": false is not supported for streaming responses.
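
The two reasoning switches described above can be combined in one request body. The sketch below builds such a body and enforces the stated restriction that "include_reasoning": false is not supported for streaming. `build_request` and its parameters are hypothetical names; only the JSON fields themselves come from this page.

```python
MODEL = "google/diffusiongemma-26b-a4b-it"


def build_request(messages, *, thinking=True, include_reasoning=True,
                  stream=False, max_tokens=4096):
    """Return a chat-completions request body with the reasoning options set."""
    if stream and not include_reasoning:
        raise ValueError('"include_reasoning": false is not supported with streaming')
    body = {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    if not thinking:
        # Turns reasoning off entirely.
        body["chat_template_kwargs"] = {"thinking": False}
    if not include_reasoning:
        # The model still reasons internally; only the response omits it.
        body["include_reasoning"] = False
    return body


example = build_request([{"role": "user", "content": "Hello"}], thinking=False)
print(example["chat_template_kwargs"])  # prints: {'thinking': False}
```

With the OpenAI Python SDK, non-standard fields such as these would typically be passed through the `extra_body` argument of `client.chat.completions.create`.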

Text-only Queries#

Many VLMs such as google/diffusiongemma-26b-a4b-it support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in Support Matrix to check whether a model supports text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/diffusiongemma-26b-a4b-it",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096,
        "stream": true
    }'

Using the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
stream = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Multi-turn Conversation#

This model supports multi-turn conversations: send multiple messages with alternating user and assistant roles.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversations.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/diffusiongemma-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows an **NVIDIA DGX system**, which is the flagship line of AI supercomputers/servers from NVIDIA. ..."
            },
            {
                "role": "user",
                "content": "When was this system released?"
            }
        ],
        "max_tokens": 4096
    }'

Using the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows an **NVIDIA DGX system**, which is the flagship line of AI supercomputers/servers from NVIDIA. ..."
    },
    {
        "role": "user",
        "content": "When was this system released?"
    }
]
chat_response = client.chat.completions.create(
    model="google/diffusiongemma-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
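
In an interactive application, the message list grows by one assistant reply and one user question per turn. A minimal sketch of that bookkeeping follows. `add_turn` is a hypothetical helper, and the reply text here is a placeholder rather than real model output.

```python
def add_turn(messages, assistant_text, next_user_text):
    """Return a new message list extended with one assistant/user exchange."""
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]


history = [{"role": "user", "content": "What is in this image?"}]
# In practice, assistant_text would be chat_response.choices[0].message.content.
history = add_turn(history, "(previous model reply)", "When was this system released?")
print([m["role"] for m in history])  # prints: ['user', 'assistant', 'user']
```

Returning a new list instead of mutating the input keeps earlier snapshots of the conversation intact, which can help when retrying a turn.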

Using LangChain#

You can call NIM from LangChain, a framework for building applications with large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="google/diffusiongemma-26b-a4b-it",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
        },
    ],
)

print(model.invoke([message]))