Structured Generation with NVIDIA NIM for LLMs#

NIM for LLMs supports structured outputs: you can constrain generation to a JSON schema, a regular expression, a context-free grammar, or a fixed set of choices. This is useful when NIM is part of a larger pipeline and the LLM output must follow a particular format. The following examples show how to constrain the output in each of these ways.

We recommend that you use guided_json for optimal performance and reliability. guided_json uses the xgrammar backend, which is the fastest option available. However, xgrammar does not support every configuration of guided decoding. If the system falls back to the outlines backend, performance can degrade, particularly on the first inference request.

Additionally, xgrammar supports a wider range of regular expressions than outlines. With the outlines backend, some regular expressions might fail to compile.

Note

The outlines guided decoding backend is not supported if a model is deployed with an sglang profile.

JSON Schema#

You can constrain the output to follow a particular JSON schema by using the guided_json parameter. This approach is particularly useful in the following scenarios:

  • Ensuring consistent output format for downstream processing

  • Validating complex data structures

  • Automating data extraction from unstructured text

  • Improving reliability in multi-step pipelines
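
If you already describe your data with Pydantic models, you can generate the schema instead of writing it by hand. The following sketch assumes Pydantic v2 is available; model_json_schema() emits a standard JSON schema that you can pass as guided_json:

from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    rating: float

# Produces a JSON schema equivalent to the hand-written one in the
# basic example below; pass it as the guided_json value.
json_schema = MovieReview.model_json_schema()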

Important

NVIDIA recommends that you specify a JSON schema using the guided_json parameter instead of setting response_format={"type": "json_object"}. Using the response_format parameter with type "json_object" enables the model to generate any valid JSON, including an empty JSON object.
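
The difference is easy to see in a minimal sketch. The request below uses only response_format, so the model is free to return any syntactically valid JSON, including {}; the guided_json requests in the examples that follow enforce the schema instead. The prompt here is illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [{"role": "user", "content": "Return a movie title and rating as JSON."}]

# Discouraged: guarantees well-formed JSON only, not any particular schema.
response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=messages,
    response_format={"type": "json_object"},
    stream=False
)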

Basic Example: Movie Review#

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
json_schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string"
        },
        "rating": {
            "type": "number"
        }
    },
    "required": [
        "title",
        "rating"
    ]
}
prompt = (f"Return the title and the rating based on the following movie review according to this JSON schema: {str(json_schema)}.\n"
          f"Review: Inception is a really well made film. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]
response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=messages,
    extra_body={"guided_json": json_schema},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# {"title":"Inception", "rating":4.0}

Advanced Example: Product Information#

This example demonstrates a more complex schema for extracting detailed product information:

json_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "features": {
            "type": "array",
            "items": {"type": "string"}
        },
        "availability": {
            "type": "object",
            "properties": {
                "in_stock": {"type": "boolean"},
                "shipping_time": {"type": "string"}
            },
            "required": ["in_stock", "shipping_time"]
        }
    },
    "required": ["product_name", "price", "features", "availability"]
}

prompt = (f"Extract product information from the following description according to this JSON schema: {str(json_schema)}.\n"
          f"Description: The XYZ Smartwatch is our latest offering, priced at $299.99. It features a heart rate monitor, "
          f"GPS tracking, and water resistance up to 50 meters. The product is currently in stock and ships within 2-3 business days.")

messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=messages,
    extra_body={"guided_json": json_schema},
    stream=False
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# {
#     "product_name": "XYZ Smartwatch",
#     "price": 299.99,
#     "features": [
#         "heart rate monitor",
#         "GPS tracking",
#         "water resistance up to 50 meters"
#     ],
#     "availability": {
#         "in_stock": true,
#         "shipping_time": "2-3 business days"
#     }
# }

Example: Nested Structures for Event Planning#

This example showcases how JSON schemas can handle nested structures, which is useful for complex data representations:

json_schema = {
    "type": "object",
    "properties": {
        "event_name": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "attendees": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"},
                    "confirmed": {"type": "boolean"}
                },
                "required": ["name", "role", "confirmed"]
            }
        },
        "venue": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "capacity": {"type": "integer"}
            },
            "required": ["name", "address", "capacity"]
        }
    },
    "required": ["event_name", "date", "attendees", "venue"]
}

prompt = (f"Create an event plan based on the following information using this JSON schema: {str(json_schema)}.\n"
          f"Information: We're planning the Annual Tech Conference on 2024-09-15. John Doe (Speaker, confirmed) and Jane Smith (Organizer, confirmed) will attend. "
          f"Alice Johnson (Volunteer, not confirmed yet) might join. The event will be held at Tech Center, 123 Innovation St., with a capacity of 500 people.")

messages = [
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=messages,
    extra_body={"guided_json": json_schema},
    stream=False
)

assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# {
#     "event_name": "Annual Tech Conference",
#     "date": "2024-09-15",
#     "attendees": [
#         {"name": "John Doe", "role": "Speaker", "confirmed": true},
#         {"name": "Jane Smith", "role": "Organizer", "confirmed": true},
#         {"name": "Alice Johnson", "role": "Volunteer", "confirmed": false}
#     ],
#     "venue": {
#         "name": "Tech Center",
#         "address": "123 Innovation St.",
#         "capacity": 500
#     }
# }

By using JSON schemas, you can ensure that the LLM’s output adheres to a specific structure, making it easier to process and validate the generated data in your application’s workflow.
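
If you want a defense-in-depth check before handing the data to downstream code, you can validate the parsed response against the same schema. This sketch assumes the third-party jsonschema package is installed (pip install jsonschema):

import json

from jsonschema import validate

data = json.loads(assistant_message)
# Raises jsonschema.ValidationError if the payload does not match.
validate(instance=data, schema=json_schema)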

Regular Expressions#

You can specify a regular expression for the output format using the guided_regex parameter.

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
regex = "[1-5]"
prompt = (f"Return just the rating based on the following movie review\n"
          f"Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]
response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"guided_regex": regex},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# 4

Choices#

You can specify a list of choices for the output using the guided_choice parameter.

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
choices = ["Good", "Bad", "Neutral"]
prompt = (f"Return the sentiment based on the following movie review. It should be one of {choices}\n"
          f"Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]
response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"guided_choice": choices},
    stream=False
)
assistant_message = response.choices[0].message.content
print(assistant_message)
# Prints:
# Good

Context-free Grammar#

You can specify a context-free grammar in the EBNF format using the guided_grammar parameter.

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
grammar = """
    root   ::= "The movie name is rated " rating " stars."
    rating ::= [1-5]
"""

prompt = (f"Summarize the following movie review:\n"
          f"Review: This movie exceeds expectations. I rate it four stars out of five.")
messages = [
    {"role": "user", "content": prompt},
]
response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    extra_body={"guided_grammar": grammar},
    stream=False
)
completion = response.choices[0].message.content
print(completion)
# Prints:
# The movie name is rated 4 stars.

The default guided decoding backend is xgrammar. If you use the outlines backend instead, you must use a different syntax for the grammar definition. The following is the equivalent grammar for the outlines backend:

grammar = """
    ?start: "The movie name is rated " rating " stars."
    ?rating: /[1-5]/
"""

We recommend the xgrammar backend for reliability and performance. For large grammars, the outlines backend can take a long time to compile on the first inference request.

If xgrammar does not recognize the grammar syntax, guided decoding can fall back to the outlines backend.