Structured Generation

NIM for VLMs supports structured output generation: you can constrain the model's output with a JSON schema, a regular expression, a context-free grammar, or a list of allowed choices.

This can be useful when NIM is part of a larger pipeline and the VLM outputs must follow a specific format, for example:

  • Ensuring consistent output format for downstream processing

  • Validating complex data structures

  • Automating data extraction from unstructured text

  • Improving reliability in multi-step pipelines

The following examples show how the output can be constrained in each of these ways.

Important

Structured generation is supported only for Llama 3.2 models. Only the OpenAI endpoint exposes the input fields for structured generation.

JSON Schema

You can constrain the output to follow a particular JSON schema by using the response_format parameter in the OpenAI schema, with json_schema as the type. Details are available in OpenAI’s documentation (see the section “A new option for the response_format parameter”).

NVIDIA recommends that you specify a JSON schema using the type json_schema instead of json_object. The json_object type allows the model to generate any valid JSON, including empty JSON.

Example: Extracting information from a movie poster

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# Define the pydantic models for the response format
class Date(BaseModel):
    day: int = Field(ge=1, le=31)
    month: int = Field(ge=1, le=12)
    # Optional field: when a year is provided, it must be 1895 or later
    year: Optional[int] = Field(default=None, ge=1895)

class MovieDetails(BaseModel):
    title: str
    release_date: Date
    publishers: List[str]

# Prepare the question and input image
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Look at the poster image. Return the title and other information about this movie in JSON format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]
# Send the request with `json_schema`
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "MovieDetails", "schema": MovieDetails.model_json_schema()}
    }
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date":  {"year":  2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }

Newer versions of the OpenAI SDK offer native support for Pydantic models, as described in the native SDK support section. Run pip install -U openai to install the latest SDK version.

response = client.beta.chat.completions.parse(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    response_format=MovieDetails,
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# { "title": "WALL-E", "release_date":  {"year":  2008, "month": 6, "day": 27}, "publishers": ["Walt Disney Pictures", "Pixar Animation Studios"] }

By using JSON schemas, you can ensure that the VLM’s output adheres to a specific structure, making it easier to process and validate the generated data in your application’s workflow.
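
For example, a minimal sketch that re-validates the raw assistant_message from the first example with the same Pydantic model:

from pydantic import ValidationError

try:
    details = MovieDetails.model_validate_json(assistant_message)
    print(details.title, details.release_date.year)
except ValidationError as err:
    # Guided generation should always satisfy the schema, but validating
    # also guards against truncated or otherwise malformed responses.
    print(f"Unexpected response format: {err}")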

Regular Expressions

You can specify a regular expression for the output format using the guided_regex parameter in the nvext extension to the OpenAI schema.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
regex = "[1-5]"
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "Return the number of cars seen in this image."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://cdn.ebaumsworld.com/mediaFiles/picture/202553/84419818.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": regex}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 2
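
guided_regex accepts arbitrary regular expressions, so you can constrain richer patterns as well. A minimal sketch (hypothetical prompt, reusing the movie poster image from the JSON schema example) that forces a YYYY-MM-DD answer:

date_regex = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "When was this movie released? Answer with a single date in YYYY-MM-DD format."
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://vignette1.wikia.nocookie.net/disney/images/f/f2/Walleposter.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": date_regex}},
)
print(response.choices[0].message.content)
# e.g. 2008-06-27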

Choices

You can specify a list of choices for the output using the guided_choice parameter in the nvext extension to the OpenAI schema.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
choices = ["Good", "Bad", "Neutral"]
# We include the list of choices in the prompt to help the model, but this
# is not strictly necessary; the output is constrained to the choices either way.
messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"What is the state of pollution in this image? It should be one of {choices}"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_choice": choices}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Bad
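
Because the output is guaranteed to be one of the supplied strings, downstream code can use it directly as a key, with no defensive parsing. A minimal sketch with a hypothetical mapping from each choice to an application action:

# Hypothetical mapping from each allowed choice to an application action
actions = {
    "Good": "no_action",
    "Bad": "flag_for_cleanup",
    "Neutral": "schedule_recheck",
}
print(actions[assistant_message])  # safe: the response is always a valid key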

Context-free Grammar

You can specify a context-free grammar using the guided_grammar parameter in the nvext extension to the OpenAI schema. The grammar is written in the EBNF format, as in the example below.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
grammar = """
    ?start: "There are " num " cars in this image."

    ?num: /[1-5]/
"""

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": "What is in this image?"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://m.media-amazon.com/images/I/51A5iA+lNcL._AC_.jpg"
            }
        }
    ]},
]
response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_grammar": grammar}},
    stream=False
)
completion = response.choices[0].message.content
print(completion)
# Prints:
# There are 2 cars in this image.
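
Because the response is guaranteed to match the grammar, the structured fields can be recovered with a regular expression that mirrors it (a minimal sketch):

import re

# The pattern mirrors the grammar: a fixed sentence around a single digit 1-5
match = re.fullmatch(r"There are ([1-5]) cars in this image\.", completion)
num_cars = int(match.group(1))
print(num_cars)
# e.g. 2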