Parallel Execution of Input and Output Rails#
You can configure input and output rails to run in parallel. Running them in parallel can reduce latency and improve throughput.
IORails Engine#
The IORails engine is an optimized execution engine that runs NemoGuard input and output rails in parallel with dedicated model management. The IORails engine is an opt-in feature. By default, the NeMo Guardrails library uses the LLMRails engine.
Note
IORails is an early-release feature. Unlike LLMRails, it does not currently support streaming, reasoning models, or telemetry.
Supported Flows#
The IORails engine supports the following flows:
content safety check input / content safety check output
topic safety check input
jailbreak detection model
When IORails is enabled and the configuration uses only these flows, the engine runs them in parallel. Configurations that include custom flows, dialog rails, or other unsupported flows raise an error at initialization.
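Because unsupported flows raise an error at initialization, it can help to check a flow list before enabling IORails. The following is a minimal sketch of such a pre-flight check. The names `SUPPORTED_FLOWS`, `flow_name`, and `unsupported_flows` are hypothetical helpers and are not part of the NeMo Guardrails API; the sketch assumes that parameters such as `$model=content_safety` follow the flow name on the same line.

```python
# Hypothetical pre-flight check; not part of the NeMo Guardrails API.
# The set below mirrors the supported flows listed in this section.
SUPPORTED_FLOWS = {
    "content safety check input",
    "content safety check output",
    "topic safety check input",
    "jailbreak detection model",
}


def flow_name(flow: str) -> str:
    """Strip trailing parameters such as `$model=content_safety` from a flow entry."""
    return flow.split("$", 1)[0].strip()


def unsupported_flows(flows: list[str]) -> list[str]:
    """Return the names of flows that fall outside the supported set."""
    return [flow_name(f) for f in flows if flow_name(f) not in SUPPORTED_FLOWS]


configured = [
    "content safety check input $model=content_safety",
    "topic safety check input $model=topic_control",
    "self check output",
]
print(unsupported_flows(configured))  # ['self check output']
```

An empty result suggests the configuration uses only the flows listed above; any returned name is a flow that would need to stay on the LLMRails engine.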
Enabling IORails#
To enable the IORails engine, set the NEMO_GUARDRAILS_IORAILS_ENGINE environment variable to 1:
NEMO_GUARDRAILS_IORAILS_ENGINE=1 nemoguardrails chat --config examples/configs/content_safety
When using the Python API, import the Guardrails class directly and pass use_iorails=True:
from nemoguardrails import RailsConfig
from nemoguardrails.guardrails.guardrails import Guardrails
config = RailsConfig.from_path("./config")
guardrails = Guardrails(config, use_iorails=True)
YAML-Based Parallel Execution#
You can also configure existing LLMRails flows to run in parallel using the parallel: True
option in the config.yml file. This approach works with any flow type and does not require
the IORails engine.
When to Use#
Use YAML-based parallel execution:
For I/O-bound rails such as external API calls to LLMs or third-party integrations.
If you have two or more independent input or output rails without shared state dependencies.
In production environments where response latency affects user experience and business metrics.
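The benefit for I/O-bound rails can be illustrated with a small standalone sketch. It does not use NeMo Guardrails; `fake_rail` is a hypothetical stand-in for a rail that waits on a remote model, and the delays are invented for illustration. The point is only that independent waits overlap when scheduled together.

```python
import asyncio
import time


async def fake_rail(name: str, delay: float) -> str:
    """Hypothetical stand-in for an I/O-bound rail, such as a remote model call."""
    await asyncio.sleep(delay)  # simulated network wait; no CPU work
    return f"{name}: allowed"


async def run_sequential(delays: dict[str, float]) -> list[str]:
    # Each rail starts only after the previous one has finished.
    return [await fake_rail(name, d) for name, d in delays.items()]


async def run_parallel(delays: dict[str, float]) -> list[str]:
    # All rails start together; gather returns results in submission order.
    return list(await asyncio.gather(*(fake_rail(n, d) for n, d in delays.items())))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return its result with the elapsed time."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


delays = {"content safety": 0.2, "topic safety": 0.2}  # illustrative values
seq_result, seq_time = timed(run_sequential(delays))
par_result, par_time = timed(run_parallel(delays))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

With two independent waits of equal length, the parallel run takes roughly the time of one wait, while the sequential run takes roughly the sum of both.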
When Not to Use#
Avoid parallel execution:
For CPU-bound rails, where parallel execution might not improve performance and can introduce overhead.
During development and testing, where sequential execution keeps debugging and workflows simpler.
Configuration Example#
To enable parallel execution, set parallel: True in the rails.input and rails.output sections in the config.yml file.
Note
Input rails that mutate their input can produce erroneous results during parallel execution, because the outcome depends on the order and timing in which the parallel operations complete (a race condition). The output can then diverge from what sequential execution produces. For such cases, use sequential mode.
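To see why mutations are order-sensitive, consider the following standalone sketch. It does not reproduce how the library schedules rails; `mask_email` and `truncate` are hypothetical mutating rails invented for illustration. The sketch only shows that two mutations applied in different orders give different results, which is the divergence that unordered parallel execution can expose.

```python
import re


def mask_email(message: str) -> str:
    """Hypothetical mutating rail: replace e-mail addresses with a placeholder."""
    return re.sub(r"\S+@\S+", "<EMAIL>", message)


def truncate(message: str, limit: int = 20) -> str:
    """Hypothetical mutating rail: keep only the first `limit` characters."""
    return message[:limit]


message = "Contact me at alice@example.com please"

# Sequential mode fixes one order. Parallel execution gives no such guarantee,
# so either of these outcomes could be the one that reaches the model.
mask_then_truncate = truncate(mask_email(message))
truncate_then_mask = mask_email(truncate(message))

print(mask_then_truncate)
print(truncate_then_mask)
print(mask_then_truncate == truncate_then_mask)  # False
```

In the second order, truncation cuts the address before masking can match it, so part of the address survives. Rails that only read the input and return a decision do not have this problem.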
The following is an example configuration for parallel rails using models from NVIDIA Cloud Functions (NVCF). When you use NVCF models, make sure that you export NVIDIA_API_KEY to access those models.
models:
  - type: main
    engine: nim
    model: meta/llama-3.1-70b-instruct
  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
  - type: topic_control
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-topic-control
rails:
  input:
    parallel: True
    flows:
      - content safety check input $model=content_safety
      - topic safety check input $model=topic_control
  output:
    parallel: True
    flows:
      - content safety check output $model=content_safety
      - self check output
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True
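A quick consistency check on a configuration like the one above can catch a mistyped model reference before the rails run. The following is a minimal sketch that works on a Python dictionary mirroring the YAML. It assumes that `$model=<name>` in a flow refers to the `type` of an entry under `models`, as the names in the example suggest; `missing_model_types` is a hypothetical helper, not a library function.

```python
import re

# Python-dict mirror of the example configuration (only the fields the check needs).
config = {
    "models": [
        {"type": "main", "engine": "nim", "model": "meta/llama-3.1-70b-instruct"},
        {"type": "content_safety", "engine": "nim",
         "model": "nvidia/llama-3.1-nemoguard-8b-content-safety"},
        {"type": "topic_control", "engine": "nim",
         "model": "nvidia/llama-3.1-nemoguard-8b-topic-control"},
    ],
    "rails": {
        "input": {
            "parallel": True,
            "flows": [
                "content safety check input $model=content_safety",
                "topic safety check input $model=topic_control",
            ],
        },
        "output": {
            "parallel": True,
            "flows": [
                "content safety check output $model=content_safety",
                "self check output",
            ],
        },
    },
}


def missing_model_types(cfg: dict) -> set[str]:
    """Return `$model=` names used in flows that no `models` entry declares."""
    declared = {m["type"] for m in cfg.get("models", [])}
    referenced: set[str] = set()
    for direction in ("input", "output"):
        for flow in cfg.get("rails", {}).get(direction, {}).get("flows", []):
            referenced.update(re.findall(r"\$model=(\w+)", flow))
    return referenced - declared


print(missing_model_types(config))  # set()
```

An empty set means every `$model=` reference in the input and output flows matches a declared model type.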