Query the Gemma 4 31B Instruct API#
This page shows how to launch the NIM container and call the Chat Completions API with curl, the OpenAI Python SDK, and LangChain. It covers image inputs, function calling, reasoning, text-only queries, and multi-turn conversations.
For more information on this model, refer to the model card.
Launch NIM#
Make sure you complete the steps in Get Started with NIM before you launch the NIM.
The following command launches a Docker container for this specific model:
# Choose a container name for bookkeeping
export CONTAINER_NAME=google-gemma-4-31b-it
# The repository name from the previous ngc registry image list command
Repository="gemma-4-31b-it"
Latest_Tag="1.7.0-variant"
# Choose a VLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/google/${Repository}:${Latest_Tag}"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the VLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=32GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8000:8000 \
$IMG_NAME
This NIM is built with a specialized base container and is subject to
limitations. It uses the release tag 1.7.0-variant for the
container image. Refer to Notes on NIM Container Variants
for more information.
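After the launch command starts, the container may need some time to download and load the model before it accepts requests. The following is a minimal sketch, not part of the official instructions, that polls the server until it answers. The `/v1/health/ready` path, the retry values, and the helper names `http_ok` and `wait_until_ready` are assumptions for illustration; check the API reference of your NIM version for the exact health endpoint.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `probe(url)` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Assumed readiness URL; adjust host, port, and path to your deployment.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"
```

Call `wait_until_ready(READY_URL)` before sending the first request. The `probe` and `sleep` parameters exist only so that the loop can be exercised without a running server.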
OpenAI Chat Completions Request#
Use the Chat Completions endpoint with chat or instruct-tuned models
designed for a conversational approach. Send prompts as messages with roles
and content to keep track of a multi-turn conversation.
Note
The snippets below use max_tokens to keep examples short. For reasoning
examples, the max_tokens value is higher because reasoning output is
typically longer.
Provide the URL of an image and query the NIM server.
To stream the result, set "stream": true in the request body.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-31b-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
}
],
"max_tokens": 1024
}'
You can also use the OpenAI Python SDK. Install it with:
pip install -U openai
Run the client and query the Chat Completions API:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="google/gemma-4-31b-it",
messages=messages,
max_tokens=1024,
stream=False,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
To stream responses, pass stream=True and iterate over the response:
# Code preceding `client.chat.completions.create` is the same.
stream = client.chat.completions.create(
model="google/gemma-4-31b-it",
messages=messages,
max_tokens=1024,
# Take note of this param.
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()
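If you need the complete text after streaming finishes (for example, to store it in a conversation history), you can accumulate the deltas while printing them. The following sketch assumes chunks shaped like those in the loop above; `collect_stream` and `fake_chunk` are hypothetical helper names, and the stand-in chunks exist only so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Concatenate the text deltas from a Chat Completions stream."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        text = getattr(delta, "content", None) if delta else None
        if text:
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like one streamed chunk (demo only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hel"), SimpleNamespace(choices=[]), fake_chunk(None), fake_chunk("lo")]
full_text = collect_stream(demo)
print(full_text)
```

With a real stream, pass the object returned by `client.chat.completions.create(..., stream=True)` and, to keep the live output, `on_text=lambda t: print(t, end="", flush=True)`.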
Passing Images#
NIM for VLMs follows the OpenAI specification to pass images as part of the HTTP payload in a user message.
Important
The supported image formats are GIF, JPG, JPEG, and PNG.
To adjust the maximum number of images allowed per request, set the
environment variable
NIM_MAX_IMAGES_PER_PROMPT.
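A single user message can carry several image parts, up to the configured limit. The sketch below builds such a message and rejects it on the client side when it exceeds a limit you pass in. The helper name `build_image_message` and the `max_images` parameter are hypothetical; set `max_images` to whatever value your deployment uses for `NIM_MAX_IMAGES_PER_PROMPT`, because this page does not state the default.

```python
def build_image_message(prompt, image_urls, max_images):
    """Return a user message with one text part and one part per image URL."""
    if len(image_urls) > max_images:
        raise ValueError(
            f"{len(image_urls)} images exceed the configured limit of {max_images}"
        )
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Image URLs taken from the examples on this page.
URLS = [
    "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg",
    "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg",
]
message = build_image_message("Compare these two images.", URLS, max_images=2)
```

The resulting `message` can be used as an element of the `messages` list in any of the request examples.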
Public direct URL
Passing the direct URL of an image causes the container to download that image at runtime.
{
"type": "image_url",
"image_url": {
"url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
}
}
Base64 data
Another option, useful for images not already on the web, is to Base64-encode the image bytes and send the data in your payload.
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,SGVsbG8gZGVh...ciBmZWxsb3chIQ=="
}
}
To convert images to base64, use the base64 command or the following Python code:
import base64
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
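The encoded string still has to be wrapped in a `data:` URL with the right media type before it goes into the payload. The following sketch does both steps for a local file. The helper names `image_to_data_url` and `image_part` are hypothetical, and the media type is guessed from the file extension, which is an assumption that the extension matches the actual file contents.

```python
import base64
import mimetypes
from pathlib import Path

# Media types for the formats listed as supported on this page.
SUPPORTED_MIME_TYPES = {"image/gif", "image/jpeg", "image/png"}


def image_to_data_url(path):
    """Read an image file and return it as a Base64 data URL."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported or unknown image type for {path!r}: {mime_type}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(path):
    """Return the `image_url` content part for a local image file."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

For example, `image_part("image.png")` yields a dictionary that can replace the URL-based image part in the request examples.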
Function (Tool) Calling#
You can connect NIM to external tools and services using function calling (also known as tool calling). For more information, refer to Call Functions (Tools).
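As a rough illustration of the request and response shapes involved, the sketch below defines one tool in the OpenAI `tools` format and turns a returned tool call into the `tool` message you would send back. The `get_weather` function, the `REGISTRY` mapping, and the `run_tool_call` helper are hypothetical, and the example call is handwritten rather than real model output. Whether this NIM needs additional configuration for tool calling is covered in Call Functions (Tools), not here.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny"}


# Tool definitions in the OpenAI Chat Completions `tools` format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one tool call (as a dict) and build the reply message."""
    function = REGISTRY[tool_call["function"]["name"]]
    arguments = json.loads(tool_call["function"]["arguments"])
    result = function(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


# Handwritten example of a tool call as it appears in a JSON response.
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}
tool_message = run_tool_call(example_call)
```

In a request, `TOOLS` goes in the top-level `tools` field. After appending the assistant message that contains the tool call and then `tool_message` to `messages`, a second request lets the model produce the final answer.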
Reasoning#
This model supports reasoning. Reasoning is off by default. To turn it on, add
"chat_template_kwargs": { "enable_thinking": true } in the request body.
Example with reasoning enabled:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-31b-it",
"messages": [
{
"role": "user",
"content": "What is 25 * 37? Think step by step."
}
],
"chat_template_kwargs": { "enable_thinking": true },
"include_reasoning": true,
"max_tokens": 4096
}'
The response contains a reasoning field with the model’s chain-of-thought
and a content field with the final answer.
To omit the reasoning tokens from the response while still allowing the model to
reason internally, set "include_reasoning": false in the request body.
To explicitly disable reasoning, set
"chat_template_kwargs": { "enable_thinking": false }.
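With the OpenAI Python SDK, request fields that are not part of the standard method signature can be passed through the `extra_body` argument. The sketch below builds the keyword arguments for such a call. The helper names `reasoning_kwargs` and `split_reasoning` are hypothetical, and reading the reasoning text through an attribute named `reasoning` is an assumption based on the field name given above; verify it against an actual response.

```python
def reasoning_kwargs(prompt, enable_thinking=True, include_reasoning=True, max_tokens=4096):
    """Build keyword arguments for `client.chat.completions.create`."""
    return {
        "model": "google/gemma-4-31b-it",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Non-standard fields travel in `extra_body` and are merged into the JSON body.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
            "include_reasoning": include_reasoning,
        },
    }


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    return getattr(message, "reasoning", None), message.content


kwargs = reasoning_kwargs("What is 25 * 37? Think step by step.")
```

With a client created as in the earlier examples, `client.chat.completions.create(**kwargs)` sends the request, and `split_reasoning(response.choices[0].message)` separates the two parts of the reply.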
Text-Only Queries#
Many VLMs, such as google/gemma-4-31b-it, support text-only queries,
in which the VLM behaves like a text-only LLM.
Important
Text-only capability is not available for all VLMs. Refer to the model cards in Support Matrix to check whether a model supports text-only queries.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-31b-it",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
],
"max_tokens": 4096,
"stream": true
}'
Using the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Create a detailed itinerary for a week-long adventure trip through Southeast Asia."
}
]
stream = client.chat.completions.create(
model="google/gemma-4-31b-it",
messages=messages,
max_tokens=4096,
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        text = delta.content
        # Print immediately and without a newline to update the output as the response is
        # streamed in.
        print(text, end="", flush=True)
# Final newline.
print()
Multi-Turn Conversation#
This model supports multi-turn conversations: send multiple messages with alternating user and assistant roles.
Important
Multi-turn capability is not available for all VLMs. Refer to the model cards for information on multi-turn conversations.
Important
Messages must follow strict user / assistant role alternation. The
first message must have the role user, and the last message must also be
user. Sending a conversation that ends with an assistant message
returns a 400 BadRequestError.
If you need the last message to be assistant (for example, to continue a
partial response), include the following top-level request parameters:
"add_generation_prompt": false,
"continue_final_message": true
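The alternation rule can be checked on the client before a request is sent, which avoids a round trip that ends in a 400 error. The following is a minimal sketch under stated assumptions: `check_roles` and `build_body` are hypothetical helpers, a single leading system message is tolerated because the text-only example on this page uses one, and tool messages are not handled.

```python
MODEL = "google/gemma-4-31b-it"


def check_roles(messages, continue_final_message=False):
    """Raise ValueError if the roles break the alternation rule."""
    roles = [m["role"] for m in messages]
    # Assumption: one leading system message does not take part in the alternation.
    if roles and roles[0] == "system":
        roles = roles[1:]
    if not roles or roles[0] != "user":
        raise ValueError("the first message must have the role user")
    for previous, current in zip(roles, roles[1:]):
        if current not in ("user", "assistant") or current == previous:
            raise ValueError("user and assistant messages must alternate")
    expected_last = "assistant" if continue_final_message else "user"
    if roles[-1] != expected_last:
        raise ValueError(f"the last message must have the role {expected_last}")


def build_body(messages, continue_final_message=False, max_tokens=1024):
    """Validate the messages and return the JSON request body as a dict."""
    check_roles(messages, continue_final_message)
    body = {"model": MODEL, "messages": messages, "max_tokens": max_tokens}
    if continue_final_message:
        # Top-level parameters needed when the last message is a partial assistant turn.
        body["add_generation_prompt"] = False
        body["continue_final_message"] = True
    return body


partial = [
    {"role": "user", "content": "List three uses of a GPU."},
    {"role": "assistant", "content": "1. Training neural networks\n2."},
]
body = build_body(partial, continue_final_message=True)
```

Serialize `body` with `json.dumps` to send it with any HTTP client. With the OpenAI Python SDK, the two extra parameters would go in `extra_body` instead of the top-level keyword arguments.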
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-31b-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows an **NVIDIA DGX system**, which is NVIDIA'\''s flagship line of AI supercomputers/servers. ..."
},
{
"role": "user",
"content": "When was this system released?"
}
],
"max_tokens": 4096
}'
Using the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/dgx-b200/dgx-b200-hero-bm-v2-l580-d.jpg"
}
}
]
},
{
"role": "assistant",
"content": "This image shows an **NVIDIA DGX system**, which is NVIDIA's flagship line of AI supercomputers/servers. ..."
},
{
"role": "user",
"content": "When was this system released?"
}
]
chat_response = client.chat.completions.create(
model="google/gemma-4-31b-it",
messages=messages,
max_tokens=4096,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
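In an interactive application the history is usually built up turn by turn instead of being written out by hand. The sketch below shows one way to do that: each call appends the user message, sends the whole history, and stores the reply so that roles keep alternating. The `ask` helper is hypothetical, and `client` is assumed to be an `OpenAI` client created as in the example above.

```python
MODEL = "google/gemma-4-31b-it"


def ask(client, history, content, max_tokens=4096):
    """Send `content` as the next user turn and record the assistant reply.

    `content` may be a plain string or a list of text and image parts.
    `history` is modified in place.
    """
    history.append({"role": "user", "content": content})
    response = client.chat.completions.create(
        model=MODEL,
        messages=history,
        max_tokens=max_tokens,
    )
    reply = response.choices[0].message.content
    # Store the reply so the next request ends with a user message again.
    history.append({"role": "assistant", "content": reply})
    return reply
```

For example, start with `history = []`, call `ask(client, history, first_question)`, and then call `ask(client, history, "When was this system released?")`; the second request automatically includes the first exchange.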
Using LangChain#
You can call NIM from LangChain, a framework for building applications with large language models (LLMs).
Install LangChain using the following command:
pip install -U langchain-openai langchain-core
Query the OpenAI Chat Completions endpoint using LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
model = ChatOpenAI(
model="google/gemma-4-31b-it",
openai_api_base="http://0.0.0.0:8000/v1",
openai_api_key="not-needed"
)
message = HumanMessage(
content=[
{"type": "text", "text": "What is in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://assets.ngc.nvidia.com/products/api-catalog/phi-3-5-vision/example1b.jpg"},
},
],
)
print(model.invoke([message]))