Structured Generation
NIM for VLMs supports structured output: you can constrain generation with a JSON schema, a regular expression, or a context-free grammar, or restrict the output to a fixed set of choices.
This is useful when NIM is part of a larger pipeline and the VLM's outputs must follow a specific format, for example:
- Ensuring a consistent output format for downstream processing
- Validating complex data structures
- Automating data extraction from unstructured text
- Improving reliability in multi-step pipelines
The following sections show how the output can be constrained in each of these ways.
Important
Structured generation is supported only for Llama 3.2 models, and only the OpenAI-compatible endpoint exposes the input fields for structured generation.
JSON Schema
You can constrain the output to follow a particular JSON schema by using the response_format parameter in the OpenAI schema, with json_schema as the type. Details are available in OpenAI's documentation (see the section "A new option for the response_format parameter").
NVIDIA recommends that you specify a JSON schema using the json_schema type instead of json_object. The json_object type allows the model to generate any valid JSON, including an empty object.
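For reference, a minimal sketch of the two response_format shapes (the "Answer" schema below is purely illustrative):

# Too permissive: any valid JSON is accepted, including an empty object
response_format_object = {"type": "json_object"}

# Recommended: the output must conform to the supplied schema
response_format_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "Answer",  # illustrative name
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}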
Example: Extracting information from a movie poster
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Define the pydantic models for the response format
class Date(BaseModel):
    day: int = Field(ge=1, le=31)
    month: int = Field(ge=1, le=12)
    year: Optional[int] = Field(default=None, ge=1895)

class MovieDetails(BaseModel):
    title: str
    release_date: Date
    publishers: List[str]

# Prepare the question and input image
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Look at the poster image. Return the title and other information about this movie in JSON format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]

# Send the request with `json_schema`
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "MovieDetails", "schema": MovieDetails.model_json_schema()}
    }
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date": {"year": 2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }
Newer versions of the OpenAI SDK offer native support for Pydantic objects, as described in the native SDK support section. Run pip install -U openai to install the latest SDK version.
response = client.beta.chat.completions.parse(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format=MovieDetails,
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date": {"year": 2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }
By using JSON schemas, you can ensure that the VLM’s output adheres to a specific structure, making it easier to process and validate the generated data in your application’s workflow.
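For example, the raw string returned in the first example can be parsed back into the Pydantic model; model_validate_json raises a pydantic.ValidationError if the payload is malformed, making it a convenient checkpoint in a pipeline:

# Parse the constrained output back into the Pydantic model
movie = MovieDetails.model_validate_json(assistant_message)
print(movie.title, movie.release_date.year)
# Prints:
# WALL-E 2008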
Regular Expressions
You can specify a regular expression for the output format using the guided_regex parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

regex = "[1-5]"

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Return the number of cars seen in this image"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://cdn.ebaumsworld.com/mediaFiles/picture/202553/84419818.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": regex}},
    stream=False
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 2
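Because the pattern guarantees a single digit from 1 to 5, the reply can be converted to an integer without extra validation:

car_count = int(assistant_message)  # safe: the output is constrained to [1-5]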
Choices
You can specify a list of choices for the output using the guided_choice parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

choices = ["Good", "Bad", "Neutral"]

# We send the list of choices in the prompt to help the model, but this is not
# strictly necessary; the model has to follow the choices in any case
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"What is the state of pollution in this image? It should be one of {choices}"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_choice": choices}},
    stream=False
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Bad
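Since the reply is guaranteed to be one of the supplied strings, you can use it directly as a dictionary key or branch condition; for example, a small sketch mapping each choice to a severity score:

# Safe lookup: the output is constrained to one of the three choices
severity = {"Good": 0, "Neutral": 1, "Bad": 2}[assistant_message]
print(severity)
# Prints:
# 2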
Context-free Grammar
You can specify a context-free grammar, written in the EBNF format, using the guided_grammar parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

grammar = """
?start: "There are " num " cars in this image."
?num: /[1-5]/
"""

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "What is in this image?"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_grammar": grammar}},
    stream=False
)

completion = response.choices[0].message.content
print(completion)
# Prints:
# There are 2 cars in this image.
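The grammar can encode alternatives as well as a single fixed template. A sketch of a hypothetical variant that also permits a "no cars" answer, using the same rule syntax as above (the rules are illustrative):

grammar = """
?start: "There are " num " cars in this image." | "There are no cars in this image."
?num: /[1-5]/
"""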