Structured Generation

NIM for VLMs supports structured output generation: you can constrain the model's output with a JSON schema, a regular expression, a context-free grammar, or a list of allowed choices.

This can be useful when NIM is part of a larger pipeline and the VLM outputs must follow a specific format, for example:

  • Ensuring consistent output format for downstream processing

  • Validating complex data structures

  • Automating data extraction from unstructured text

  • Improving reliability in multi-step pipelines

The following examples show how the output can be constrained in each of these ways.

Important

Structured generation is supported only for Llama 3.2 models. Only the OpenAI endpoint exposes the input fields for structured generation.

JSON Schema

You can constrain the output to follow a particular JSON schema by using the response_format parameter in the OpenAI schema, with json_schema as the type. Details are available in OpenAI’s documentation (see the section “A new option for the response_format parameter”).

NVIDIA recommends that you specify a JSON schema using the type json_schema instead of json_object. The json_object type allows the model to generate any valid JSON, including empty JSON.

Example: Extracting information from a movie poster

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Define the pydantic models for the response format
class Date(BaseModel):
    day: int = Field(ge=1, le=31)
    month: int = Field(ge=1, le=12)
    # Optional field: when a year is provided, it must be 1895 or later
    year: Optional[int] = Field(default=None, ge=1895)

class MovieDetails(BaseModel):
    title: str
    release_date: Date
    publishers: List[str]

# Prepare the question and input image
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Look at the poster image. Return the title and other information about this movie in JSON format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]
# Send the request with `json_schema`
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "MovieDetails", "schema": MovieDetails.model_json_schema()}
    }
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date":  {"year":  2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }

Newer versions of the OpenAI SDK offer native support for Pydantic models, as described in the native SDK support section. Run pip install -U openai to install the latest SDK version.

response = client.beta.chat.completions.parse(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format=MovieDetails,
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date":  {"year":  2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }

By using JSON schemas, you can ensure that the VLM’s output adheres to a specific structure, making it easier to process and validate the generated data in your application’s workflow.
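
For example, a minimal sketch that re-validates the raw assistant_message from the first example with the same Pydantic model:

from pydantic import ValidationError

try:
    details = MovieDetails.model_validate_json(assistant_message)
    print(details.title, details.release_date.year)
except ValidationError as err:
    # Guided generation should always satisfy the schema, but validating
    # also guards against truncated or otherwise malformed responses.
    print(f"Unexpected response format: {err}")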

Regular Expressions

You can specify a regular expression for the output format using the guided_regex parameter in the nvext extension to the OpenAI schema.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
regex = "[1-5]"
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Return the number of cars seen in this image."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://cdn.ebaumsworld.com/mediaFiles/picture/202553/84419818.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": regex}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 2
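
guided_regex accepts arbitrary regular expressions, so you can constrain richer patterns as well. A minimal sketch (hypothetical prompt, reusing the movie poster image from the JSON schema example) that forces a YYYY-MM-DD answer:

date_regex = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "When was this movie released? Answer with a single date in YYYY-MM-DD format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": date_regex}},
)
print(response.choices[0].message.content)
# e.g. 2008-06-27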

Choices

You can specify a list of choices for the output using the guided_choice parameter in the nvext extension to the OpenAI schema.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
choices = ["Good", "Bad", "Neutral"]
# We include the list of choices in the prompt to help the model, but this
# is not strictly necessary; the output is constrained to the choices either way.
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"What is the state of pollution in this image? It should be one of {choices}"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_choice": choices}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Bad
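
Because the output is guaranteed to be one of the supplied strings, downstream code can use it directly as a key, with no defensive parsing. A minimal sketch with a hypothetical mapping from each choice to an application action:

# Hypothetical mapping from each allowed choice to an application action
actions = {
    "Good": "no_action",
    "Bad": "flag_for_cleanup",
    "Neutral": "schedule_recheck",
}
print(actions[assistant_message])  # safe: the response is always a valid key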

Context-free Grammar

You can specify a context-free grammar using the guided_grammar parameter in the nvext extension to the OpenAI schema. The grammar is written in the EBNF format, as in the example below.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
grammar = """
    ?start: "There are " num " cars in this image."

    ?num: /[1-5]/
"""

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "What is in this image?"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_grammar": grammar}},
    stream=False
)
completion = response.choices[0].message.content
print(completion)
# Prints:
# There are 2 cars in this image.
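
Because the response is guaranteed to match the grammar, the structured fields can be recovered with a regular expression that mirrors it (a minimal sketch):

import re

# The pattern mirrors the grammar: a fixed sentence around a single digit 1-5
match = re.fullmatch(r"There are ([1-5]) cars in this image\.", completion)
num_cars = int(match.group(1))
print(num_cars)
# e.g. 2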