Executing Input and Output Rails in Parallel
Run input and output rails in parallel to improve the response time of guardrail checks. This tutorial shows how to enable parallel rails using the NeMo Platform Python SDK.
When to Use Parallel Rails Execution
Parallel execution is most effective for the following:
- I/O-bound rails, such as external API calls to models or third-party integrations.
- Independent input or output rails without shared state dependencies.
- Production environments where response latency affects user experience and business metrics.
Input rail mutations can lead to erroneous results during parallel execution because of race conditions that arise from the execution order and timing of parallel operations. This can result in output divergence compared to sequential execution. For such cases, use sequential mode.
When Not to Use Parallel Rails Execution
Sequential execution is recommended for the following:
- CPU-bound rails; it might not improve performance and can introduce overhead.
- Development and testing for debugging and simpler workflows.
Prerequisites
Before you begin:
- You have access to a running NeMo Platform.
NMP_BASE_URLis set to the NeMo Platform base URL.- A
ModelProvideris configured with an LLM provider. Follow Setup if you haven’t done this yet.
This tutorial uses the following NIMs, available on build.nvidia.com:
mainmodel:meta/llama-3.1-8b-instructcontent_safetymodel:nvidia/llama-3.1-nemotron-safety-guard-8b-v3topic_controlmodel:nvidia/llama-3.1-nemoguard-8b-topic-control
Step 1: Configure the Client
Instantiate the platform client.
Step 2: Create a Guardrail Configuration
Create a configuration that enables parallel execution for input rails. This example runs both content safety and topic safety checks in parallel.
You can customize the topic safety check prompt based on your specific use case and allowed topics. The prompt is used as the system prompt for the topic control model to determine if the user message is on-topic or off-topic.
Step 3: Create a VirtualModel
Create a VirtualModel that routes inference through the guardrails middleware. The guardrails configuration is applied as both request and response middleware.
CLI
Python SDK
Step 4: Run Chat Completions via Guardrails
Test the parallel rails configuration by making both safe and off-topic requests. Inference calls go through the standard IGW endpoint using the VirtualModel.
Get a pre-configured OpenAI client from the SDK, then make a safe, on-topic request and verify the response is allowed.
Make an off-topic request that the topic control input rail blocks.
The off-topic request returns the denial message I'm sorry, I can't respond to that.
Step 5: Inspect Activated Rails
Ask the plugin to include rail-activation diagnostics in the response by setting guardrails.options.log.activated_rails. The OpenAI client forwards request fields it doesn’t natively know about through extra_body, so no SDK change is needed.
Inspect guardrails_data.log.activated_rails in the response — it’s a list, with one entry per rail that ran. Each entry carries the rail’s name, type, the decisions it made, and a stop flag indicating whether the rail terminated the request.