Working with Streaming Output#
The microservice supports streaming output for both chat completions and completions.
Configuration#
You must enable streaming support in your guardrails configuration. The key fields are streaming.enabled, which must be set to True, and the configurable chunk_size, context_size, and stream_first fields.
rails:
  output:
    flows:
      - ...
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True
"rails": {
"output": {
"flows": [
"..."
],
"streaming": {
"enabled": "True",
"chunk_size": 200,
"context_size": 50,
"stream_first": "True"
}
}
}
For information about the fields, refer to streaming output configuration in the NeMo Guardrails Toolkit documentation.
For information about managing guardrails configurations, refer to the demonstration guardrail configuration, demo-self-check-input-output, in Create a Guardrail Configuration.
Performance Comparison#
The primary purpose of streaming output is to reduce the time-to-first-token (TTFT) of the LLM response. The following table shows timing results, measured with a very basic timing script, for 20 requests to the OpenAI GPT-4o model with and without streaming output.
| Configuration | Mean TTFT | Median TTFT | Stdev | Min TTFT | Max TTFT |
|---|---|---|---|---|---|
| Streaming enabled | 0.5475 | 0.5208 | 0.1248 | 0.4241 | 0.9287 |
| Streaming disabled | 3.6834 | 3.6127 | 1.6949 | 0.4487 | 7.4227 |
The streaming-enabled configuration reduces the mean TTFT by 85.14% and delivers more consistent performance, as shown by the lower standard deviation (0.1248 versus 1.6949).
The streaming configuration used the default configuration values: chunk_size: 200, context_size: 50, and stream_first: True.
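The measurement script for the table is not published; the following is a minimal sketch of a comparable TTFT timing loop, assuming the OpenAI-compatible Python client and the demo-self-check-input-output configuration used in the examples later in this topic. The model name, prompt, and request parameters are placeholders that you can adjust.

import os
import time
import statistics

from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail",
    api_key="dummy-value",
    default_headers={"X-Model-Authorization": os.environ["NVIDIA_API_KEY"]},
)

ttfts = []
for _ in range(20):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."}],
        extra_body={"guardrails": {"config_id": "demo-self-check-input-output"}},
        stream=True,
        max_tokens=100,
    )
    for _ in stream:
        # Record the elapsed time when the first chunk arrives.
        ttfts.append(time.perf_counter() - start)
        break
    # Drain the remaining chunks before starting the next request.
    for _ in stream:
        pass

print(f"mean={statistics.mean(ttfts):.4f} median={statistics.median(ttfts):.4f} stdev={statistics.stdev(ttfts):.4f}")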
Detecting Blocked Content in Streaming Output#
The microservice applies guardrail checks on chunks of tokens as they are streamed from the LLM. If a chunk of tokens is blocked, the microservice returns a response in the following format:
{"error":{"message":"Blocked by <rail-name>","type":"guardrails_violation","param":"<rails-name>","code":"content_blocked"}}
Example Output
{
  "error": {
    "message": "Blocked by self check output rails.",
    "type": "guardrails_violation",
    "param": "self check output",
    "code": "content_blocked"
  }
}
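When you consume the stream programmatically, you can test each streamed fragment for this error payload before printing or forwarding it. The following helper is a minimal sketch, assuming the error arrives as a complete JSON object in a single streamed text fragment, such as the delta content in the OpenAI SDK example later in this topic; the function name is illustrative.

import json

def is_blocked(fragment: str) -> bool:
    """Return True if a streamed text fragment is a guardrails error payload."""
    try:
        payload = json.loads(fragment)
    except (TypeError, ValueError):
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    return isinstance(error, dict) and error.get("code") == "content_blocked"

For example, in the OpenAI SDK streaming loop shown later in this topic, you can call is_blocked(chunk.choices[0].delta.content) and stop reading the stream when it returns True.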
Chat Completions with Streaming#
Choose one of the following options for running chat completions with streaming output.
Set up a NeMoMicroservices client instance using the base URL of the NeMo Guardrails microservice and perform the task as follows.
import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url=os.environ["GUARDRAILS_BASE_URL"],
    inference_base_url=os.environ["NIM_BASE_URL"]
)

response = client.guardrail.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "user", "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."}
    ],
    guardrails={
        "config_id": "demo-self-check-input-output",
    },
    stream=True,
    max_tokens=100
)

for chunk in response:
    print(chunk)
import os
import json

from openai import OpenAI

x_model_authorization = {"X-Model-Authorization": os.environ["NVIDIA_API_KEY"]}

url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail"

# The api_key argument is required by the SDK, but authentication is handled
# by the X-Model-Authorization header passed in default_headers.
client = OpenAI(
    base_url=url,
    api_key="dummy-value",
    default_headers=x_model_authorization,
)

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."
        }
    ],
    extra_body={
        "guardrails": {
            "config_id": "demo-self-check-input-output"
        },
    },
    max_tokens=200,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        # Add a check if the content includes {"error": {"message": "Blocked by <rail-name>"...
        print(chunk.choices[0].delta.content, end="", flush=True)
import os
import json

from langchain_openai import ChatOpenAI

x_model_authorization = {"X-Model-Authorization": os.environ["NVIDIA_API_KEY"]}

model = ChatOpenAI(
    model_name="meta/llama-3.3-70b-instruct",
    openai_api_base=f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail",
    api_key="dummy-value",
    default_headers=x_model_authorization,
    extra_body={
        "guardrails": {
            "config_id": "demo-self-check-input-output"
        }
    },
    max_tokens=200
)

for chunk in model.stream("Tell me about Cape Hatteras National Seashore in 50 words or less."):
    print(chunk.content, end="", flush=True)
Make a POST request to the /v1/guardrail/chat/completions endpoint and specify "stream": true in the request body.
curl -X POST "${GUARDRAILS_BASE_URL}/v1/guardrail/chat/completions" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {"role": "user", "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."}
    ],
    "guardrails": {
      "config_id": "demo-self-check-input-output"
    },
    "stream": true,
    "max_tokens": 100
  }'
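If you prefer to consume the stream without an SDK, the following is a minimal sketch using the requests library. It assumes the endpoint follows the OpenAI-compatible server-sent events format, where each chunk arrives on a line prefixed with data: and the stream ends with data: [DONE]; add the X-Model-Authorization header if your deployment requires it, as in the Python examples above.

import json
import os

import requests

url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions"
body = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
        {"role": "user", "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."}
    ],
    "guardrails": {"config_id": "demo-self-check-input-output"},
    "stream": True,
    "max_tokens": 100,
}

with requests.post(url, json=body, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        # Skip keep-alive blank lines and any non-data fields.
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)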