Structured Generation
NIM LLM supports structured outputs: you can constrain generation with a JSON schema, a regular expression, a context-free grammar, or a fixed list of choices. This is useful when NIM is part of a larger pipeline and the LLM outputs must conform to a specific format. The examples below show how to constrain the output in each of these ways.
You can constrain the output to follow a particular JSON schema by using the guided_json parameter in the nvext extension to the OpenAI schema.
We recommend specifying a JSON schema with the guided_json parameter, as in the example below, instead of setting response_format={"type": "json_object"}, as the latter can produce empty outputs.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

json_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": "number"}
    },
    "required": ["title", "rating"]
}

prompt = (f"Return the title and the rating based on the following movie review according to this JSON schema: {json_schema}.\n"
          "Review: Inception is a really well made film. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_json": json_schema}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# {"title":"Inception", "rating":4.0}
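Because guided_json constrains decoding to the schema, the returned string can be parsed as JSON directly. A minimal client-side sketch using only the standard library (the hard-coded string below stands in for the assistant_message returned by the server above):

```python
import json

# Stand-in for the assistant_message returned by the server.
assistant_message = '{"title":"Inception", "rating":4.0}'

# guided_json guarantees the output parses as JSON conforming to the schema,
# so json.loads should not raise here.
movie = json.loads(assistant_message)

# Spot-check the "required" keys and types from the schema.
assert isinstance(movie["title"], str)
assert isinstance(movie["rating"], (int, float))
print(movie["title"], movie["rating"])  # Inception 4.0
```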
You can specify a regular expression for the output format using the guided_regex parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

regex = "[1-5]"
prompt = ("Return just the rating based on the following movie review\n"
          "Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_regex": regex}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 4
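Since guided_regex constrains decoding so the entire output matches the pattern, the response can be converted to a program value without defensive parsing. A minimal sketch with the standard library re module (the hard-coded string stands in for the assistant_message returned by the server above):

```python
import re

regex = "[1-5]"
# Stand-in for the assistant_message returned by the server.
assistant_message = "4"

# The full output is guaranteed to match the requested pattern.
assert re.fullmatch(regex, assistant_message) is not None

rating = int(assistant_message)
print(rating)  # 4
```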
You can specify a list of choices for the output using the guided_choice parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

choices = ["Good", "Bad", "Neutral"]
prompt = (f"Return the sentiment based on the following movie review. It should be one of {choices}\n"
          "Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_choice": choices}},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Good
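Because guided_choice restricts decoding to exactly the listed strings, the response maps straight onto program values. A minimal sketch (the hard-coded string and the score mapping are illustrative, standing in for the assistant_message returned by the server above):

```python
choices = ["Good", "Bad", "Neutral"]
# Stand-in for the assistant_message returned by the server.
assistant_message = "Good"

# The output is guaranteed to be one of the listed choices,
# so a direct dictionary lookup is safe -- no fallback branch needed.
assert assistant_message in choices
sentiment_score = {"Good": 1, "Neutral": 0, "Bad": -1}[assistant_message]
print(sentiment_score)  # 1
```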
You can specify a context-free grammar in the EBNF format using the guided_grammar parameter in the nvext extension to the OpenAI schema.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

grammar = """
?start: "The movie name is rated " rating " stars."
?rating: /[1-5]/
"""
prompt = ("Summarize the following movie review:\n"
          "Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"nvext": {"guided_grammar": grammar}},
    stream=False
)
completion = response.choices[0].message.content
print(completion)
# Prints:
# The movie name is rated 4 stars.
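The grammar above admits exactly the sentences "The movie name is rated N stars." with N in 1-5, so an equivalent client-side check can be written as a regular expression. A minimal sketch using only the standard library (the hard-coded string stands in for the completion returned by the server above):

```python
import re

# Stand-in for the completion returned by the server.
completion = "The movie name is rated 4 stars."

# Regular expression equivalent to the guided_grammar rules:
#   ?start: "The movie name is rated " rating " stars."
#   ?rating: /[1-5]/
pattern = r"The movie name is rated [1-5] stars\."
assert re.fullmatch(pattern, completion) is not None

# Recover the rating from the grammar-constrained text.
rating = int(re.search(r"[1-5]", completion).group())
print(rating)  # 4
```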