Working with Multimodal Data#

Note

The time to complete this tutorial is approximately 30 minutes.

About Support for Multimodal Data#

Multimodal support applies to input and output guardrails only. Depending on the image reasoning model, you can specify the image to check as base64-encoded data or as a URL. Refer to the model's documentation for more information.
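
For example, with an OpenAI-compatible chat schema, the two forms differ only in the value of the image_url field. The following sketch is illustrative; the exact fields depend on the model you use:

    # Two ways to reference an image in a chat message (OpenAI-style schema).
    # Which form a model accepts depends on the model; check its documentation.
    image_as_url = {
        "type": "image_url",
        "image_url": {"url": "https://example.com/street-scene.jpg"},
    }
    image_as_base64 = {
        "type": "image_url",
        # The image bytes are embedded as a base64 data URI.
        "image_url": {"url": "data:image/jpeg;base64,<base64-encoded-bytes>"},
    }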

The safety check uses the image reasoning model as an LLM-as-a-judge to determine whether the content is safe. The OpenAI, Llama Vision, and Llama Guard models can accept multimodal input and act as the judge model.

You must ensure that the image size and prompt length together do not exceed the model's maximum context length.
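
Base64 encoding inflates the image bytes by roughly one third, so large images can dominate the request. As a rough pre-flight check, you can measure the encoded size before sending. This is a coarse sketch, not an exact token count, because how an image counts against the context window is model-specific:

    import base64

    # Coarse pre-flight check: base64 output is ~4/3 the size of the raw bytes.
    # The actual context cost of an image depends on the model.
    with open("street-scene.jpg", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    print(f"Encoded image length: {len(encoded)} characters")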

About the Tutorial#

This tutorial demonstrates how to use multimodal data with NeMo Guardrails to perform safety checks on image data. You configure a guardrail that uses an image-reasoning model and a prompt that instructs the model to check the image for unsafe content.

The tutorial uses the Meta Llama 3.2 90B Vision Instruct model as both the application LLM and the LLM-as-a-judge.

The model performs visual recognition, image reasoning, and captioning, and answers general questions about an image. Its ability to classify an image as unsafe relies on tuning the instructions and prompt that you supply to the model.

The model is available as a downloadable container from NVIDIA NGC and for interactive use from NVIDIA API Catalog.

Prerequisites#

  • The following procedure uses endpoints from NVIDIA API Catalog to simplify deployment. To use the endpoints, you must have an NVIDIA API key.

  • Refer to Deploying with Docker for information on how to start the NeMo Guardrails container.

Procedure#

  1. Set an environment variable for your NVIDIA API key:

    $ export NVIDIA_API_KEY=<nvapi-...>
    
  2. Start the NeMo Guardrails microservice container:

    docker run \
     -p 7331:7331 \
     -v <path-to-configs-folder>:/config-store \
     -e CONFIG_STORE_PATH=/config-store \
     -e NIM_ENDPOINT_API_KEY="${NVIDIA_API_KEY}" \
     nvcr.io/nvidia/nemo-microservices/guardrails:25.04
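
    The commands that follow reference a GUARDRAILS_BASE_URL environment variable. Assuming the container publishes port 7331 on the local host as shown in the preceding command, set the variable:

    export GUARDRAILS_BASE_URL=http://localhost:7331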
    
  3. Create a guardrail configuration:

    curl -X POST "${GUARDRAILS_BASE_URL}/v1/guardrail/configs" \
      -H "Accept: application/json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "demo-multimodal-stream",
        "namespace": "default",
        "description": "demo configuration for multimodal vision reasoning",
        "data": {
            "prompts": [{
                "task": "content_safety_check_input $model=vision_rails",
                "messages": [{
                    "type": "system",
                    "content": "You are an AI assistant that evaluates if a user'\''s message contains unsafe content.\nCarefully examine any text and images in the user'\''s message.\n\nTask: Check if there is unsafe content in \"User\" messages in conversations according to our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n<END UNSAFE CONTENT CATEGORIES>\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read '\''safe'\'' or '\''unsafe'\'' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories."
                },
                {
                    "type": "user",
                    "content": "{{ user_input }}"
                }],
                "stop": ["<|eot_id|>", "<|eom_id|>"],
                "output_parser": "is_content_safe",
                "max_tokens": "200"
            }],
            "models": [{
                "model": "meta/llama-3.2-90b-vision-instruct",
                "engine": "nim",
                "type": "main"
            },{
                "model": "meta/llama-3.2-90b-vision-instruct",
                "engine": "nim",
                "type": "vision_rails"
            }],
            "rails": {
                "input": {"flows": ["content safety check input $model=vision_rails"]},
                "output": {
                    "streaming": {
                        "enabled": "true",
                        "chunk_size": "200",
                        "context_size": "50"
                    }
                },
                "dialog": {
                    "single_call": {
                        "enabled": "false"
                    }
                }
            }
        }
    }' | jq
    
    import os
    import json
    import requests
    
    url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/configs"
    
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json"
    }
    
    payload = {
        "name": "demo-multimodal-stream",
        "namespace": "default",
        "description": "demo configuration for multimodal vision reasoning",
        "data": {
            "prompts": [{
                "task": "content_safety_check_input $model=vision_rails",
                "messages": [{
                    "type": "system",
                    "content": 'You are an AI assistant that evaluates if a user\'s message contains unsafe content.\nCarefully examine any text and images in the user\'s message.\n\n{%- if categories is not defined -%}\n    {%- set categories = {\n        "S1": "Violent Crimes.",\n        "S2": "Non-Violent Crimes.",\n        "S3": "Sex Crimes.",\n        "S4": "Child Exploitation.",\n        "S5": "Defamation.",\n        "S6": "Specialized Advice.",\n        "S7": "Privacy.",\n        "S8": "Intellectual Property.",\n        "S10": "Hate.",\n        "S11": "Self-Harm.",\n        "S12": "Sexual Content.",\n        "S13": "Elections."\n    } -%}\n{%- endif -%}\n\nTask: Check if there is unsafe content in "User" messages in conversations according to our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\n{%- for key, desc in categories.items() -%}\n    {%- if key not in excluded_category_keys -%}\n{{ key }}: {{ desc }}\n    {%- endif -%}\n{%- endfor -%}\n<END UNSAFE CONTENT CATEGORIES>\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read \'safe\' or \'unsafe\' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories.\n',
                },
                {
                    "type": "user",
                    "content": "{{ user_input  }}"
                }],
                "stop": ["<|eot_id|>", "<|eom_id|>"],
                "output_parser": "is_content_safe",
                "max_tokens": 200,
            }],
            "models": [{
                "model": "meta/llama-3.2-90b-vision-instruct",
                "engine": "nim",
                "type": "main"
            },{
                "model": "meta/llama-3.2-90b-vision-instruct",
                "engine": "nim",
                "type": "vision_rails",
            }],
            "rails": {
                "input": {"flows": ["content safety check input $model=vision_rails"]},
                "output": {
                    "streaming": {
                        "enabled": "true",
                        "chunk_size": 200,
                        "context_size": 50
                    }
                },
                "dialog": {
                    "single_call": {
                        "enabled": "false"
                    }
                }
            }
        }
    }
    
    response = requests.post(url, headers=headers, json=payload)
    print(json.dumps(response.json(), indent=2))
    
    Example Output
    {
      "created_at": "2025-04-14T18:54:43.862326",
      "updated_at": "2025-04-14T18:54:43.862329",
      "name": "demo-multimodal-stream",
      "namespace": "default",
      "description": "demo configuration for multimodal vision reasoning",
      "data": {
        "models": [
          {
            "type": "main",
            "engine": "nim",
            "model": "meta/llama-3.2-90b-vision-instruct",
            "reasoning_config": {
              "remove_thinking_traces": true,
              "start_token": null,
              "end_token": null
            },
            "parameters": {}
          },
          {
            "type": "vision_rails",
            "engine": "nim",
            "model": "meta/llama-3.2-90b-vision-instruct",
            "reasoning_config": {
              "remove_thinking_traces": true,
              "start_token": null,
              "end_token": null
            },
            "parameters": {}
          }
        ],
        "instructions": [
          {
            "type": "general",
            "content": "Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it truthfully says it does not know."
          }
        ],
        "actions_server_url": null,
        "sample_conversation": "user \"Hello there!\"\n  express greeting\nbot express greeting\n  \"Hello! How can I assist you today?\"\nuser \"What can you do for me?\"\n  ask about capabilities\nbot respond about capabilities\n  \"As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences.\"\nuser \"Tell me a bit about the history of NVIDIA.\"\n  ask general question\nbot response for general question\n  \"NVIDIA is a technology company that specializes in designing and manufacturing graphics processing units (GPUs) and other computer hardware. The company was founded in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem.\"\nuser \"tell me more\"\n  request more information\nbot provide more information\n  \"Initially, the company focused on developing 3D graphics processing technology for the PC gaming market. In 1999, NVIDIA released the GeForce 256, the world's first GPU, which was a major breakthrough for the gaming industry. The company continued to innovate in the GPU space, releasing new products and expanding into other markets such as professional graphics, mobile devices, and artificial intelligence.\"\nuser \"thanks\"\n  express appreciation\nbot express appreciation and offer additional help\n  \"You're welcome. If you have any more questions or if there's anything else I can help you with, please don't hesitate to ask.\"\n",
        "prompts": [
          {
            "task": "content_safety_check_input $model=vision_rails",
            "content": null,
            "messages": [
              {
                "type": "system",
                "content": "You are an AI assistant that evaluates if a user's message contains unsafe content.\nCarefully examine any text and images in the user's message.\n\n{%- if categories is not defined -%}\n    {%- set categories = {\n        \"S1\": \"Violent Crimes.\",\n        \"S2\": \"Non-Violent Crimes.\",\n        \"S3\": \"Sex Crimes.\",\n        \"S4\": \"Child Exploitation.\",\n        \"S5\": \"Defamation.\",\n        \"S6\": \"Specialized Advice.\",\n        \"S7\": \"Privacy.\",\n        \"S8\": \"Intellectual Property.\",\n        \"S10\": \"Hate.\",\n        \"S11\": \"Self-Harm.\",\n        \"S12\": \"Sexual Content.\",\n        \"S13\": \"Elections.\"\n    } -%}\n{%- endif -%}\n\nTask: Check if there is unsafe content in \"User\" messages in conversations according to our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\n{%- for key, desc in categories.items() -%}\n    {%- if key not in excluded_category_keys -%}\n{{ key }}: {{ desc }}\n    {%- endif -%}\n{%- endfor -%}\n<END UNSAFE CONTENT CATEGORIES>\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read 'safe' or 'unsafe' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories.\n"
              },
              {
                "type": "user",
                "content": "{{ user_input  }}"
              }
            ],
            "models": null,
            "output_parser": "is_content_safe",
            "max_length": 16000,
            "mode": "standard",
            "stop": [
              "<|eot_id|>",
              "<|eom_id|>"
            ],
            "max_tokens": 200
          }
        ],
        "prompting_mode": "standard",
        "lowest_temperature": 0.001,
        "enable_multi_step_generation": false,
        "colang_version": "1.0",
        "custom_data": {},
        "rails": {
          "config": null,
          "input": {
            "flows": [
              "content safety check input $model=vision_rails"
            ]
          },
          "output": {
            "flows": [],
            "streaming": {
              "enabled": true,
              "chunk_size": 200,
              "context_size": 50,
              "stream_first": true
            }
          },
          "retrieval": {
            "flows": []
          },
          "dialog": {
            "single_call": {
              "enabled": false,
              "fallback_to_multiple_calls": true
            },
            "user_messages": {
              "embeddings_only": false,
              "embeddings_only_similarity_threshold": null,
              "embeddings_only_fallback_intent": null
            }
          },
          "actions": {
            "instant_actions": null
          }
        },
        "enable_rails_exceptions": false,
        "passthrough": null
      },
      "files_url": null,
      "schema_version": "1.0",
      "project": null,
      "custom_fields": {},
      "ownership": null
    }
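
    The output_parser field names the built-in is_content_safe parser, which interprets the judge model's reply. Because the prompt instructs the model to answer safe or unsafe on the first line, optionally followed by a comma-separated list of violated categories, the parsing amounts to logic like the following sketch (illustrative only; the microservice provides its own implementation):

    def parse_safety_verdict(text: str) -> tuple[bool, list[str]]:
        # Illustrative stand-in for the built-in "is_content_safe" parser:
        # line 1 is "safe" or "unsafe"; line 2, if present, lists the
        # violated category codes separated by commas.
        lines = [line.strip() for line in text.strip().splitlines()]
        is_safe = bool(lines) and lines[0].lower() == "safe"
        categories = []
        if not is_safe and len(lines) > 1:
            categories = [c.strip() for c in lines[1].split(",")]
        return is_safe, categories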
    
  4. Send an image-reasoning request.

    1. Download an image of a street scene, a street sign, or a similar subject. You can use a website such as https://commons.wikimedia.org or download the street-scene.jpg file used to develop this documentation.

      The example image shows a street scene featuring a red octagonal stop sign mounted on a brown pole on the left side of the image.

      Save the image to a file, such as street-scene.jpg.
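
      For example, you can download your chosen image from the command line and save it under that name (replace <image-url> with the URL of the image you selected):

      curl -L -o street-scene.jpg "<image-url>"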

    2. Send the image and the request:

      if ! [ -f "street-scene.jpg" ]; then
        echo "street-scene.jpg not found, exiting..."
        exit 1
      fi
      
      # Encode the image (GNU coreutils syntax; on macOS, use: base64 -i street-scene.jpg)
      image_b64=$( base64 -w 0 street-scene.jpg )
      
      echo '{
        "model": "meta/llama-3.2-90b-vision-instruct",
        "messages": [{
          "role": "user",
          "content": [{
              "type": "text",
              "text": "Is there a traffic sign in this image?"
          }, {
              "type": "image_url",
              "image_url": {
                  "url": "data:image/png;base64,'"$image_b64"'"
              }
          }]
        }],
        "guardrails": {
          "config_id": "demo-multimodal-stream"
        },
        "max_tokens": 512,
        "temperature": 1.00,
        "stream": false
      }' > payload-street-view.json
      
      curl "${GUARDRAILS_BASE_URL}/v1/guardrail/chat/completions" \
        -H "Content-Type: application/json" \
        -H "Accept: text/event-stream" \
        -d @payload-street-view.json | jq '.choices[0].message.content'
      
      import os
      import base64
      import json
      import requests
      
      url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions"
      
      headers = {
          "Accept": "text/event-stream",
          "Content-Type": "application/json",
      }
      
      with open("street-scene.jpg", "rb") as f:
          image_b64 = base64.b64encode(f.read()).decode("utf-8")
      
      payload = {
          "model": "meta/llama-3.2-90b-vision-instruct",
          "messages": [{
              "role": "user",
              "content": [
                  {"type": "text", "text": "Is there a traffic sign in this image?"},
                  {"type": "image_url",
                     "image_url": {
                         "url": f"data:image/png;base64,{image_b64}"
                      }
                  }
              ],
          }],
          "guardrails": {
              "config_id": "demo-multimodal-stream"
          },
          "max_tokens": 512,
          "stream": True
      }
      
      with requests.post(url, headers=headers, json=payload, stream=True) as response:
          response.encoding = "utf-8"
          for chunk in response.iter_lines(decode_unicode=True):
              if not chunk:
                  continue
              if chunk == "data: [DONE]":
                  print()
                  break
              data = json.loads(chunk[len("data: "):])
              if data.get("choices", [{}])[0].get("delta", {}).get("content", None) is not None:
                  # Add a check if content includes {"error": {"message": "Blocked by <rail-name>"...
                  print(data["choices"][0]["delta"]["content"], end="", flush=True)
      

      Example Output

      Yes, there is a traffic sign in this image. The traffic sign is a red octagon with the word "STOP"
      written on it in white letters. It is attached to a brown pole and is located on the left side of
      the image.
      
  5. Send a possibly unsafe request using an image depicting car audio theft:

    The example image shows a person wearing dark clothing reaching through a broken car window, attempting to steal a car stereo system, illustrating a vehicle break-in and theft.

    curl "${GUARDRAILS_BASE_URL}/v1/guardrail/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Accept: text/event-stream" \
      -d '{
      "model": "meta/llama-3.2-90b-vision-instruct",
      "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "How can I get rich doing this?"
        }, {
            "type": "image_url",
            "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Car_audio_theft.jpg/960px-Car_audio_theft.jpg"
            }
        }]
      }],
      "guardrails": {
        "config_id": "demo-multimodal-stream"
      },
      "max_tokens": 512,
      "temperature": 1.00,
      "stream": false
    }' | jq '.choices[0].message.content'
    
    import os
    import base64
    import json
    import requests
    
    url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions"
    
    headers = {
        "Accept": "text/event-stream",
        "Content-Type": "application/json",
    }
    
    payload = {
        "model": "meta/llama-3.2-90b-vision-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "How can I get rich doing this?"},
                {"type": "image_url",
                   "image_url": {
                       "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Car_audio_theft.jpg/960px-Car_audio_theft.jpg"
                    }
                }
            ],
        }],
        "guardrails": {
            "config_id": "demo-multimodal-stream"
        },
        "max_tokens": 512,
        "stream": True
    }
    
    with requests.post(url, headers=headers, json=payload, stream=True) as response:
        response.encoding = "utf-8"
        for chunk in response.iter_lines(decode_unicode=True):
            if not chunk:
                continue
            if chunk == "data: [DONE]":
                print()
                break
            data = json.loads(chunk[len("data: "):])
            if data.get("choices", [{}])[0].get("delta", {}).get("content", None) is not None:
                # Add a check if content includes {"error": {"message": "Blocked by <rail-name>"...
                print(data["choices"][0]["delta"]["content"], end="", flush=True)
    

    Example Output

    I'm sorry, I can't respond to that.
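
The comments in the Python examples above suggest checking the streamed content for a blocked response. When an output rail blocks a response, the service can embed an error object in the stream instead of a normal content delta. The following is a minimal sketch of such a check, assuming an error payload of the form {"error": {"message": "Blocked by <rail-name> ..."}}; verify the exact schema against your deployment:

    import json

    def is_blocked(content: str) -> bool:
        # Hedged sketch: treat a chunk as a block notice if it parses as JSON
        # and carries an error message containing "Blocked by". The exact
        # schema may differ across releases.
        try:
            data = json.loads(content)
        except (json.JSONDecodeError, TypeError):
            return False
        if not isinstance(data, dict):
            return False
        error = data.get("error")
        if not isinstance(error, dict):
            return False
        return "Blocked by" in str(error.get("message", ""))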