Structured Generation
NIM for VLMs supports structured output: you can constrain generation with a JSON schema, a regular expression, a context-free grammar, or a fixed set of choices.
This is useful when NIM is part of a larger pipeline and the VLM outputs must follow a specific format, for example:

- Ensuring a consistent output format for downstream processing
- Validating complex data structures
- Automating data extraction from unstructured text
- Improving reliability in multi-step pipelines

The following examples show how the outputs can be constrained in each of these ways.
Note: Only the OpenAI endpoint exposes the input fields for structured generation.
JSON Schema
You can constrain the output to follow a particular JSON schema by using the `response_format` parameter in the OpenAI schema, with `json_schema` as the type.
Details are available in OpenAI's documentation (section "A new option for the response_format parameter").

NVIDIA recommends that you specify a JSON schema using the type `json_schema` instead of `json_object`. The `json_object` type allows the model to generate any valid JSON, including empty JSON.
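To make the distinction concrete, the sketch below builds the two `response_format` payloads side by side; the schema is a hand-written example, used only for illustration:

```python
# A hand-written JSON schema used only for illustration.
movie_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}

# Recommended: the model must follow the named schema.
strict_format = {
    "type": "json_schema",
    "json_schema": {"name": "Movie", "schema": movie_schema},
}

# Looser: any valid JSON (including "{}") is accepted.
loose_format = {"type": "json_object"}

print(strict_format["type"])  # json_schema
print(loose_format["type"])   # json_object
```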
Example: Extracting information from a movie poster
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Define the pydantic models for the response format
class Date(BaseModel):
    day: int = Field(ge=1, le=31)
    month: int = Field(ge=1, le=12)
    year: Optional[int] = Field(default=None, ge=1895)

class MovieDetails(BaseModel):
    title: str
    release_date: Date
    publishers: List[str]

# Prepare the question and input image
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Look at the poster image. Return the title and other information about this movie in JSON format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]

# Send the request with `json_schema`
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "MovieDetails", "schema": MovieDetails.model_json_schema()}
    }
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date": {"year": 2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }
```
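Because the schema came from a Pydantic model, you can round-trip the response back through that model to get a typed, validated object. A minimal self-contained sketch, using the example response above as a hard-coded input:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

# Same models as in the request example above.
class Date(BaseModel):
    day: int = Field(ge=1, le=31)
    month: int = Field(ge=1, le=12)
    year: Optional[int] = Field(default=None, ge=1895)

class MovieDetails(BaseModel):
    title: str
    release_date: Date
    publishers: List[str]

# In the pipeline this would be `assistant_message` from the response above.
assistant_message = '{"title": "WALL-E", "release_date": {"year": 2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"]}'

# Raises pydantic.ValidationError if the JSON does not match the schema.
movie = MovieDetails.model_validate_json(assistant_message)
print(movie.title)              # WALL-E
print(movie.release_date.year)  # 2008
```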
Newer versions of the OpenAI SDK offer native support for Pydantic objects, as described in the native SDK support section. Run `pip install -U openai` to install the latest SDK version.
```python
response = client.beta.chat.completions.parse(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format=MovieDetails,
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date": {"year": 2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }
```
By using JSON schemas, you can ensure that the VLM’s output adheres to a specific structure, making it easier to process and validate the generated data in your application’s workflow.
Regular Expressions
You can specify a regular expression for the output format using the `guided_regex` parameter in the `nvext` extension to the OpenAI schema.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

regex = "[1-5]"

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Return the number of cars seen in this image"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://cdn.ebaumsworld.com/mediaFiles/picture/202553/84419818.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": regex}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 2
```
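Downstream code can still defend against malformed replies by checking the answer against the same pattern before converting it. A minimal sketch using Python's `re` module, with the sample reply hard-coded:

```python
import re

regex = "[1-5]"

# In the pipeline this would be `assistant_message` from the response above.
assistant_message = "2"

# Guard: the reply must match the constraining regex exactly.
match = re.fullmatch(regex, assistant_message)
assert match is not None, "reply did not match the constraining regex"

count = int(match.group(0))
print(count)  # 2
```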
Choices
You can specify a list of choices for the output using the `guided_choice` parameter in the `nvext` extension to the OpenAI schema.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

choices = ["Good", "Bad", "Neutral"]

# We include the list of choices in the prompt to help the model, but this is
# not strictly necessary; the model must follow one of the choices regardless.
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"What is the state of pollution in this image? It should be one of {choices}"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_choice": choices}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Bad
```
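Because the reply is guaranteed to be one of the listed strings, downstream code can branch on it directly, for example by mapping each choice to a score. A minimal sketch with a hard-coded sample reply and illustrative scores:

```python
choices = ["Good", "Bad", "Neutral"]

# In the pipeline this would be `assistant_message` from the response above.
assistant_message = "Bad"

# Map each allowed choice to a downstream value (scores are illustrative).
severity = {"Good": 0, "Neutral": 1, "Bad": 2}

# No fallback branch is needed: guided_choice restricts the reply to `choices`.
print(severity[assistant_message])  # 2
```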
Context-free Grammar
You can specify a context-free grammar, written in EBNF, using the `guided_grammar` parameter in the `nvext` extension to the OpenAI schema.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

grammar = """
?start: "There are " num " cars in this image."
?num: /[1-5]/
"""

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "What is in this image?"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_grammar": grammar}},
    stream=False
)
completion = response.choices[0].message.content
print(completion)
# Prints:
# There are 2 cars in this image.
```
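Because the grammar fixes the sentence template, the only variable part (the number) is easy to recover downstream. A minimal sketch with the sample completion hard-coded; the regular expression below mirrors the grammar by hand rather than being derived from it automatically:

```python
import re

# In the pipeline this would be `completion` from the response above.
completion = "There are 2 cars in this image."

# Hand-written pattern mirroring the grammar: a fixed template around one digit 1-5.
match = re.fullmatch(r"There are ([1-5]) cars in this image\.", completion)
assert match is not None, "completion did not match the grammar template"

num_cars = int(match.group(1))
print(num_cars)  # 2
```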