Query the Gemma 4 API#

This page shows how to launch the NIM container and call the Chat Completions API with curl, the OpenAI Python SDK, and LangChain. It covers image inputs, function calling, reasoning, text-only queries, and multi-turn conversations.

This model is available in two sizes: 26B and 31B. This guide uses the 26B size in its examples. To use the 31B size, substitute the alternative values noted in the comments of the launch command.

For more information on this model, refer to the model cards.

Launch NIM#

Make sure you complete the steps in Get Started with NIM before you launch the NIM.

The following command launches a Docker container for the 26B size of this specific model:

# Choose a container name for bookkeeping
export CONTAINER_NAME=google-gemma-4-26b-a4b-it # or google-gemma-4-31b-it

# The repository name and tag from the previous ngc registry image list command
Repository="gemma-4-26b-a4b-it"  # or "gemma-4-31b-it"
Latest_Tag="1.7.0-variant"  # or "1.7.1-variant"

# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/google/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME

To launch the 31B size instead, use the alternative values shown in the comments of the preceding command.

This NIM is built with a specialized base container and is subject to limitations. It uses release tag 1.7.0-variant for the Gemma-4-26B-A4B-IT container image and release tag 1.7.1-variant for the Gemma 4 31B Instruct container image. Refer to Notes on NIM Container Variants for more information.
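The per-size values above can be collected in one place. The following is a minimal sketch that only reuses the names and tags quoted on this page; the `NIM_VARIANTS` table and both helper functions are hypothetical conveniences, not part of NIM.

```python
# Values copied from the launch command and the release-tag note on this page.
NIM_VARIANTS = {
    "26b": {
        "container_name": "google-gemma-4-26b-a4b-it",
        "repository": "gemma-4-26b-a4b-it",
        "tag": "1.7.0-variant",
    },
    "31b": {
        "container_name": "google-gemma-4-31b-it",
        "repository": "gemma-4-31b-it",
        "tag": "1.7.1-variant",
    },
}


def image_name(size: str) -> str:
    """Return the NGC image reference for a model size ("26b" or "31b")."""
    variant = NIM_VARIANTS[size.lower()]
    return f"nvcr.io/nim/google/{variant['repository']}:{variant['tag']}"


def export_lines(size: str) -> list[str]:
    """Return shell export lines that can replace the ones in the launch command."""
    variant = NIM_VARIANTS[size.lower()]
    return [
        f"export CONTAINER_NAME={variant['container_name']}",
        f'export IMG_NAME="{image_name(size)}"',
    ]


# Print the two exports needed to launch the 31B size.
print("\n".join(export_lines("31b")))
```

The rest of the `docker run` command stays the same for both sizes.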

OpenAI Chat Completions Request#

Use the Chat Completions endpoint with chat or instruct-tuned models designed for a conversational approach. Send prompts as messages with roles and content to keep track of a multi-turn conversation.

Note

The snippets below use max_tokens to keep examples short. For reasoning examples, the max_tokens value is higher because reasoning output is typically longer.

Provide the URL of an image and query the NIM server. To stream the result, set "stream": true in the request body.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                            }
                    }
                ]
            }
        ],
        "max_tokens": 1024
    }'

You can also use the OpenAI Python SDK:

pip install -U openai

Run the client and query the Chat Completions API:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="google/gemma-4-26b-a4b-it",
    messages=messages,
    max_tokens=1024,
    stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
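Because the examples cap output with max_tokens, a reply can stop mid-sentence. The sketch below shows one way to extract the answer text and detect truncation. It assumes the response follows the standard OpenAI Chat Completions shape, where a finish_reason of "length" indicates the token limit was reached. The `summarize_choice` helper and the sample response are hypothetical.

```python
def summarize_choice(response: dict) -> dict:
    """Pull the answer text and truncation status out of a chat completion."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        # "length" means generation stopped because max_tokens was reached.
        "truncated": choice.get("finish_reason") == "length",
    }


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "The image shows"},
            "finish_reason": "length",
        }
    ]
}

summary = summarize_choice(sample_response)
print(summary["text"], "(truncated)" if summary["truncated"] else "")
```

With the SDK object from the preceding example, the same fields are available as `chat_response.choices[0].message.content` and `chat_response.choices[0].finish_reason`.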

To stream responses, pass stream=True and iterate over the response:

# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
    model="google/gemma-4-26b-a4b-it",
    messages=messages,
    max_tokens=1024,
    # Take note of this param.
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()
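Clients that do not use the SDK, such as curl with "stream": true, receive the stream as raw lines. The following sketch assumes the server emits the standard OpenAI server-sent events format, in which each event is a `data:` line carrying a JSON chunk and the stream ends with `data: [DONE]`. The `iter_stream_text` helper and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the lines of a streamed chat completion."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and non-data fields.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # Sentinel that marks the end of the stream.
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Hypothetical stream body, shortened for illustration.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "A cat"}}]}',
    'data: {"choices": [{"delta": {"content": " on a sofa."}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_lines)))
```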

Passing Images#

NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.

Important

The supported image formats are GIF, JPG, JPEG, and PNG.

To adjust the maximum number of images allowed per request, set the environment variable NIM_MAX_IMAGES_PER_PROMPT.
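A request with several images places multiple image_url entries in the same user message. The sketch below builds such a message and checks the count on the client side. The `build_image_message` helper and its `max_images` argument are hypothetical; `max_images` should mirror whatever limit the server was started with, because this page does not state the default.

```python
def build_image_message(prompt: str, image_urls: list[str], max_images: int) -> dict:
    """Build one user message containing a text prompt and several images."""
    if len(image_urls) > max_images:
        raise ValueError(
            f"{len(image_urls)} images given, but the configured limit is {max_images}."
        )
    content = [{"type": "text", "text": prompt}]
    # Each image becomes its own image_url entry after the text entry.
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return {"role": "user", "content": content}


message = build_image_message(
    "What do these two images have in common?",
    [
        "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg",
        "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg",
    ],
    max_images=2,  # Assumed value; match NIM_MAX_IMAGES_PER_PROMPT on the server.
)
print(len(message["content"]))
```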

Public direct URL

Passing the direct URL of an image causes the container to download that image at runtime.

{
    "type": "image_url",
    "image_url": {
        "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
    }
}

Base64 data

Another option, useful for images not already on the web, is to Base64-encode the image bytes and send the data in your payload.

{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
    }
}

To convert images to base64, use the base64 command or the following Python code:

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
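The encoded string still needs the `data:<mime type>;base64,` prefix before it goes in the url field. The following sketch builds the full data URL and rejects formats outside the supported list above. The helper names are hypothetical, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path

# MIME types for the supported formats listed on this page (GIF, JPG, JPEG, PNG).
SUPPORTED_MIME_TYPES = {"image/gif", "image/jpeg", "image/png"}


def bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL, using the filename to pick the MIME type."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image type for {filename!r}: {mime_type}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(path: str) -> dict:
    """Read an image file and return an image_url content entry for a user message."""
    url = bytes_to_data_url(Path(path).read_bytes(), path)
    return {"type": "image_url", "image_url": {"url": url}}
```

For example, `image_part("image.png")` returns an entry that can be placed in the content list of a user message alongside a text entry.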

Function (Tool) Calling#

You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).
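As a rough illustration of the request and response shapes involved, the sketch below follows the standard OpenAI tools format. The `get_weather` tool, its schema, and the sample assistant message are all made up. Any NIM-specific setup for tool calling is covered on the linked page, not here.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Tool schema in the OpenAI tools format. The name and parameters are made up.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

LOCAL_TOOLS = {"get_weather": get_weather}


def build_tool_request(question: str) -> dict:
    """Build a Chat Completions request body that offers the tools to the model."""
    return {
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [{"role": "user", "content": question}],
        "tools": TOOLS,
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and return tool-role messages with the results."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # Arguments arrive as a JSON string.
        output = LOCAL_TOOLS[function["name"]](**arguments)
        results.append({"role": "tool", "tool_call_id": call["id"], "content": output})
    return results


# Hypothetical assistant message, shaped like an OpenAI tool-call response.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}

print(run_tool_calls(sample_message))
```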

Reasoning#

This model supports reasoning. Reasoning is off by default. To turn it on, add "chat_template_kwargs": { "enable_thinking": true } in the request body.

Example with reasoning enabled:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is 25 * 37? Think step by step."
            }
        ],
        "chat_template_kwargs": { "enable_thinking": true },
        "include_reasoning": true,
        "max_tokens": 4096
    }'

The response contains a reasoning field with the model’s chain-of-thought and a content field with the final answer.

To omit the reasoning tokens from the response while still allowing the model to reason internally, set "include_reasoning": false in the request body.

To explicitly disable reasoning, set "chat_template_kwargs": { "enable_thinking": false }.
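The reasoning switches described above can be wrapped in a small request builder, with a second helper that separates the two response fields. This is a minimal sketch: both helper names are hypothetical, and the sample response is invented to match the reasoning and content fields described in this section.

```python
def build_reasoning_request(
    prompt: str,
    enable_thinking: bool = True,
    include_reasoning: bool = True,
    max_tokens: int = 4096,
) -> dict:
    """Build a Chat Completions request body with the reasoning switches set."""
    return {
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "include_reasoning": include_reasoning,
        "max_tokens": max_tokens,
    }


def split_reasoning(response: dict) -> tuple:
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    message = response["choices"][0]["message"]
    return message.get("reasoning"), message.get("content")


# Hypothetical response, trimmed to the two fields this section describes.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning": "25 * 37 = 25 * 40 - 25 * 3 = 1000 - 75 = 925.",
                "content": "925",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample_response)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With the OpenAI Python SDK, fields that are not part of the standard `create()` signature, such as chat_template_kwargs and include_reasoning, can be passed through the `extra_body` argument.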

Text-Only Queries#

Many VLMs such as google/gemma-4-26b-a4b-it support text-only queries, where a VLM behaves exactly like a (text-only) LLM.

Important

Text-only capability is not available for all VLMs. Refer to the model cards in Support Matrix to check whether a model supports text-only queries.

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant"
            },
            {
                "role": "user",
                "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
            }
        ],
        "max_tokens": 4096,
        "stream": true
    }'

Using the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
    }
]
stream = client.chat.completions.create(
    model="google/gemma-4-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()

Multi-Turn Conversation#

This model supports multi-turn conversations: send multiple messages with alternating user and assistant roles.

Important

Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversations.

Important

Messages must follow strict user / assistant role alternation. The first message must have the role user, and the last message must also be user. Sending a conversation that ends with an assistant message returns a 400 BadRequestError.

If you need the last message to be assistant (for example, to continue a partial response), include the following top-level request parameters:

"add_generation_prompt": false,
"continue_final_message": true

The following curl request sends a standard multi-turn conversation that ends with a user message:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "google/gemma-4-26b-a4b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                            }
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "This image shows an **NVIDIA DGX system**, which is NVIDIA'\''s flagship line of AI supercomputers/servers. ..."
            },
            {
                "role": "user",
                "content": "When was this system released?"
            }
        ],
        "max_tokens": 4096
    }'

Using the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "This image shows an **NVIDIA DGX system**, which is NVIDIA's flagship line of AI supercomputers/servers. ..."
    },
    {
        "role": "user",
        "content": "When was this system released?"
    }
]
chat_response = client.chat.completions.create(
    model="google/gemma-4-26b-a4b-it",
    messages=messages,
    max_tokens=4096,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
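The role-alternation rule in this section can be checked on the client before a request is sent, which avoids a round trip that ends in a 400 error. The following sketch uses hypothetical helpers, `check_roles` and `build_request`. It assumes a single leading system message is acceptable, as in the text-only example on this page.

```python
ALLOWED_ROLES = ("user", "assistant")


def check_roles(messages: list[dict], continue_final_message: bool = False) -> None:
    """Raise ValueError if the roles do not alternate user / assistant as required."""
    roles = [message["role"] for message in messages]
    # Assumption: one leading system message is allowed and ignored by the check.
    if roles and roles[0] == "system":
        roles = roles[1:]
    if not roles:
        raise ValueError("The conversation needs at least one user message.")
    for index, role in enumerate(roles):
        expected = ALLOWED_ROLES[index % 2]  # user, assistant, user, ...
        if role != expected:
            raise ValueError(f"Turn {index} has role {role!r}; expected {expected!r}.")
    expected_last = "assistant" if continue_final_message else "user"
    if roles[-1] != expected_last:
        raise ValueError(f"The last message must have the role {expected_last!r}.")


def build_request(
    messages: list[dict], continue_final_message: bool = False, max_tokens: int = 4096
) -> dict:
    """Validate the conversation and build the request body."""
    check_roles(messages, continue_final_message)
    request = {
        "model": "google/gemma-4-26b-a4b-it",
        "messages": messages,
        "max_tokens": max_tokens,
    }
    if continue_final_message:
        # Top-level parameters needed when the conversation ends with an assistant message.
        request["add_generation_prompt"] = False
        request["continue_final_message"] = True
    return request


# Continue a partial assistant reply.
history = [
    {"role": "user", "content": "List three uses of a GPU."},
    {"role": "assistant", "content": "1. Training neural networks\n2."},
]
request = build_request(history, continue_final_message=True)
print(request["continue_final_message"])
```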

Using LangChain#

You can call NIM from LangChain, a framework for building applications with large language models (LLMs).

Install LangChain using the following command:

pip install -U langchain-openai langchain-core

Query the OpenAI Chat Completions endpoint using LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(
    model="google/gemma-4-26b-a4b-it",
    openai_api_base="http://0.0.0.0:8000/v1",
    openai_api_key="not-needed"
)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is in this image?"},
        {
            "type": "image_url",
            "image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
        },
    ],
)

print(model.invoke([message]))