Visual Language Models (VLM) with Jetson Platform Services

Overview

VLMs are multimodal models that accept images, video, and text, combining a large language model with a vision transformer. This capability allows them to answer text prompts about videos and images, enabling features such as chatting with a video and defining natural-language-based alerts.

The VLM AI service enables quick deployment of VLMs with Jetson Platform Services for video insight applications. The VLM service exposes REST API endpoints to configure the video stream input, set alerts, and ask questions in natural language about the input video stream.

| API Endpoint | Description |
| --- | --- |
| /api/v1/live-stream | Manage live streams the AI service has access to. |
| /api/v1/chat/completion | Chat with the VLM using OpenAI-style chat completions. Supports referencing added streams in the prompts. |
| /api/v1/alerts | Set an alert prompt the VLM will evaluate continuously on the input live stream. Can be used to trigger notifications when alert states are true. |

Additionally, the output of the VLM can be viewed as an RTSP stream, the alert states are stored by the jetson-monitoring service, and alert events are sent over a websocket so they can be integrated with other services.
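As an illustration, a minimal Python sketch of a websocket client that consumes these alert events might look like the following. It assumes the `websockets` package is installed and uses the default websocket endpoint listed later on this page; the payload format is whatever the VLM service emits.

import asyncio
import websockets

# Default alert websocket exposed by the VLM service in this example.
# Replace 0.0.0.0 with your Jetson IP.
ALERT_WS = "ws://0.0.0.0:5016/api/v1/alerts/ws"

async def listen_for_alerts():
    async with websockets.connect(ALERT_WS) as ws:
        while True:
            message = await ws.recv()  # one alert event per message
            print("Alert event:", message)

asyncio.run(listen_for_alerts())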

This AI service is provided as a prebuilt Docker container that can be launched with docker compose. It is configured through JSON config files and integrates with foundation services such as jetson-monitoring and jetson-ingress. Example compose and configuration files for easy deployment are provided in the reference workflow resource.

VLM AI Service Diagram

The following table summarizes how the VLM service can interact with the other Jetson Platform Services.

| Service | Required | Notes |
| --- | --- | --- |
| jetson-ingress | Required | Provides access to the VLM REST APIs through the API Gateway port (30080). |
| jetson-storage | Required | An external storage device must also be attached. Without it, the Jetson will likely run out of storage space. |
| jetson-monitoring | Required | Tracks alert and VLM metrics in the Prometheus dashboard. |
| jetson-firewall | Recommended | Restricts access to ports other than the API Gateway port; recommended for real deployments. |
| jetson-vst | Recommended | Manages RTSP streams that can be used as input to the AI service. |
| jetson-sys-monitoring | Recommended | Monitors system stats; not required. |
| jetson-gpu-monitoring | Recommended | Monitors GPU stats; not required. |
| jetson-networking | Optional | Only needed if using VST and IP cameras with the VLM service. |
| jetson-redis | Optional | Only needed if using VST and SDR with the VLM service. |

Getting Started

Read through the Prerequisite section carefully before getting started with this example.

Prerequisites

First follow the Quick Start Guide to set up your system with Jetson Platform Services. It is recommended to also follow the Hello World example to get familiar with Jetson Platform Services. Before continuing, bring down any previously launched JPS examples like AI-NVR with the docker compose down command.

The VLM AI service operates on RTSP streams. The RTSP stream can come from any source such as an IP camera, the Video Storage Toolkit (VST) or NVStreamer. The fastest way to get an RTSP stream for testing is to use NVStreamer which can serve video files as an RTSP stream. To learn how to use NVStreamer to make an RTSP stream, see the NVStreamer on Jetson Orin page.

Running the VLM container will require around 50 GB of storage. The container itself takes about 20 GB and the default model (VILA1.5-13b) uses 32.3 GB. It is highly recommended to use the Jetson Storage service prior to running this example. The Jetson Storage service will set up an attached external storage device and remap the container storage and /data locations to the external storage. The default storage on your Jetson device will likely not have enough space to run this example with the default configuration. View the Storage page for more details.

To get the docker compose and config files, download the Jetson Platform Services resources bundle from NGC or SDK Manager. Once downloaded, find the vlm-1.1.0.tar.gz file and place it in your home directory. The following commands assume the tar file is in your home directory.

cd ~/
tar -xvf vlm-1.1.0.tar.gz
cd ~/vlm/example_1

The VLM AI service will use the jetson-ingress and jetson-monitoring services. These two services need to be configured to integrate with the VLM AI service. Copy the provided default configurations to the appropriate service configuration directory.

sudo cp config/vlm-nginx.conf /opt/nvidia/jetson/services/ingress/config
sudo cp config/prometheus.yml /opt/nvidia/jetson/services/monitoring/config/prometheus.yml
sudo cp config/rules.yml /opt/nvidia/jetson/services/monitoring/config/rules.yml

Then start the foundation services.

sudo systemctl start jetson-ingress
sudo systemctl start jetson-monitoring
sudo systemctl start jetson-sys-monitoring
sudo systemctl start jetson-gpu-monitoring

Note

If any of the foundation services were previously launched, use the 'restart' command instead of 'start'.

Now deploy the VLM AI Service!

sudo docker compose up -d

To check if all the necessary containers have started up, you can run the following command:

sudo docker ps

The output should look similar to the following image.

VLM Docker PS

Note

The first time the VLM service is launched, it will automatically download and quantize the VLM. This will take some time. If deploying on Orin NX16, it may be necessary to mount more swap, as quantizing the model can use significant memory. For more details, see the VLM Model Fails to Load section under Troubleshooting.

To verify the VLM has launched properly, you can check the health endpoint of the VLM service. In a web browser, visit http://0.0.0.0:5015/v1/health. If the VLM is ready, it will return {"detail":"ready"}. If you are launching the VLM for the first time, it will take some time to fully load.
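The health check can also be scripted. The following is a minimal sketch, assuming the `requests` package is installed and the default health endpoint above, that polls until the VLM reports ready:

import time
import requests

# Default health endpoint of the chat server; replace 0.0.0.0 with your Jetson IP.
HEALTH_URL = "http://0.0.0.0:5015/v1/health"

while True:
    try:
        detail = requests.get(HEALTH_URL, timeout=5).json().get("detail")
        if detail == "ready":
            print("VLM is ready")
            break
        print("VLM not ready yet:", detail)
    except requests.RequestException as err:
        print("Health endpoint not reachable yet:", err)
    time.sleep(30)  # the first launch can take a while (download + quantization)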

Interact with VLM Service

Now we can interact with the VLM service in several ways:

  1. Control Stream Input via REST APIs

You can start by adding an RTSP stream for the VLM to use with the following curl command. This will use the POST method on the live-stream endpoint.

Currently the VLM supports only one stream, but in the future this API will allow for multi-stream support.

  • Replace 0.0.0.0 with your Jetson IP and replace the rtsp link with your RTSP link.

curl --location 'http://0.0.0.0:5010/api/v1/live-stream' \
--header 'Content-Type: application/json' \
--data '{
"liveStreamUrl": "rtsp://0.0.0.0:31554/nvstream/root/store/nvstreamer_videos/video.mp4"
}'

Note

In addition to the curl commands, the REST APIs can also be tested directly through the API documentation page that is served at http://0.0.0.0:5010/docs when the VLM service is brought up.

This request will return a unique stream ID that is used later to set alerts, ask follow-up questions, and remove the stream.

{
    "id": "a782e200-eb48-4d17-a1b9-5ac0696217f7"
}

You can also use the GET method on the live-stream endpoint to list the added streams and their IDs:

curl --location 'http://0.0.0.0:5010/api/v1/live-stream'
[
    {
        "id": "a782e200-eb48-4d17-a1b9-5ac0696217f7",
        "liveStreamUrl": "rtsp://0.0.0.0:31554/nvstream/root/store/nvstreamer_videos/video.mp4"
    }
]
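The same calls can be made programmatically. Here is a minimal Python sketch, assuming the `requests` package is installed; the IP address and RTSP URL are placeholders you should replace with your own:

import requests

# Base URL of the VLM REST API; replace 0.0.0.0 with your Jetson IP.
BASE_URL = "http://0.0.0.0:5010/api/v1"
RTSP_URL = "rtsp://0.0.0.0:31554/nvstream/root/store/nvstreamer_videos/video.mp4"

# POST to the live-stream endpoint returns the unique stream ID.
resp = requests.post(f"{BASE_URL}/live-stream", json={"liveStreamUrl": RTSP_URL})
resp.raise_for_status()
stream_id = resp.json()["id"]
print("Added stream:", stream_id)

# GET on the live-stream endpoint lists the added streams and their IDs.
print(requests.get(f"{BASE_URL}/live-stream").json())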
  2. Set Alerts

Alerts are questions that the VLM will continuously evaluate on the live stream input. For each alert rule that is set, the VLM will try to decide whether it is True or False based on the most recent frame from the live stream. These True and False states, as determined by the VLM, are sent to a websocket and to the jetson-monitoring service.

When setting alerts, the alert rule should be phrased as a yes/no question, such as "Is there fire?" or "Is there smoke?". The body of the request must also include the "id" field that corresponds to the stream ID returned when the RTSP stream was added.

By default, the VLM service supports up to 10 alert rules. This can be increased by adjusting the configuration files.

curl --location 'http://0.0.0.0:5010/api/v1/alerts' \
--header 'Content-Type: application/json' \
--data '{
    "alerts": ["is there a fire?", "is there smoke?"],
    "id": "a782e200-eb48-4d17-a1b9-5ac0696217f7"
}'

Once the alert is added you should see the overlay output generated on the RTSP output stream.
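Equivalently, the alerts can be set from Python. A short sketch, assuming the `requests` package and the stream ID returned earlier:

import requests

BASE_URL = "http://0.0.0.0:5010/api/v1"  # replace 0.0.0.0 with your Jetson IP
stream_id = "a782e200-eb48-4d17-a1b9-5ac0696217f7"  # ID returned when the stream was added

# Each alert rule is a yes/no question the VLM will evaluate on the live stream.
resp = requests.post(
    f"{BASE_URL}/alerts",
    json={"alerts": ["is there a fire?", "is there smoke?"], "id": stream_id},
)
resp.raise_for_status()
print(resp.status_code, resp.text)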

  3. View RTSP Stream Output

Once a stream is added, it will be passed through to the output RTSP stream. You can view this stream at rtsp://0.0.0.0:5011/out. Once a query or alert is added, the VLM responses can be viewed as an overlay on this output stream.
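Any RTSP-capable player will work. If you prefer to pull frames programmatically, a minimal sketch using OpenCV (assuming `opencv-python` is installed with FFmpeg support and the machine can reach the Jetson) is shown below:

import cv2

# Open the VLM overlay output stream; replace 0.0.0.0 with your Jetson IP.
cap = cv2.VideoCapture("rtsp://0.0.0.0:5011/out")

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("VLM output", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()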

  4. View Alert Status in Prometheus

The True/False states of the alerts are sent to the Jetson Monitoring service, which is based on Prometheus. You can view the raw metrics at http://0.0.0.0:5012. These metrics are scraped by Prometheus and saved as time series data, which is viewable on the Prometheus Dashboard at http://0.0.0.0:9090.
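To see what is exported before Prometheus scrapes it, you can dump the raw metrics endpoint directly. A small sketch, assuming the `requests` package:

import requests

# Raw Prometheus-format alert metrics exposed by the VLM service.
# Replace 0.0.0.0 with your Jetson IP.
metrics = requests.get("http://0.0.0.0:5012", timeout=5).text

for line in metrics.splitlines():
    if line and not line.startswith("#"):  # skip HELP/TYPE comment lines
        print(line)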

  5. Ask Follow-up Questions

In addition to setting alerts, you can also ask open ended questions to the VLM using the chat completions endpoint. The chat completions endpoint is similar to the OpenAI chat completions API with additional support for referencing a live stream in your prompt.

When a stream is referenced in the prompt, it will attach the most recent frame from the live stream and pass it to the VLM to be used to complete the chat.

For example, to ask a follow-up question about the live stream, you can submit the following curl command:

curl --location 'http://0.0.0.0:5010/api/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."

        },
        {
            "role": "user",
            "content":[
                {
                    "type": "stream",
                    "stream":
                    {
                        "stream_id": "a782e200-eb48-4d17-a1b9-5ac0696217f7"
                    }
                },

                {
                    "type":"text",
                    "text": "Can you describe the scene?"
                }
            ]
        }
    ],
    "min_tokens": 1,
    "max_tokens": 128
}
'

Note how the user message references the stream content type and then follows it with the text content type containing the user's question. The VILA models work best when the stream reference occurs before the text content.

The chat completions endpoint will return the following:

{
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The scene is a beautiful mountain range."
            }
        }
    ]
}
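The same request can be issued from Python. A sketch using the `requests` package; the IP address and stream ID are placeholders:

import requests

BASE_URL = "http://0.0.0.0:5010/api/v1"  # replace 0.0.0.0 with your Jetson IP
stream_id = "a782e200-eb48-4d17-a1b9-5ac0696217f7"  # ID returned when the stream was added

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {
            "role": "user",
            "content": [
                # Reference the stream first, then the text question (works best with VILA).
                {"type": "stream", "stream": {"stream_id": stream_id}},
                {"type": "text", "text": "Can you describe the scene?"},
            ],
        },
    ],
    "min_tokens": 1,
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])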

Note that if the system is rebooted while the container is running, the container will automatically come back up, but the added streams and alerts will not persist; they will need to be added again. For streams to persist across reboots, the AI service can be combined with SDR as shown on the Zero Shot Detection Workflow page.

  6. Shut Down

To shut down the example, first remove the stream using the DELETE method on the live-stream endpoint. Note that the stream ID is appended to the URL path for this.

curl --location --request DELETE 'http://0.0.0.0:5010/api/v1/live-stream/a782e200-eb48-4d17-a1b9-5ac0696217f7'
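Or, equivalently, from Python (again assuming the `requests` package and the stream ID returned earlier):

import requests

stream_id = "a782e200-eb48-4d17-a1b9-5ac0696217f7"  # ID returned when the stream was added

# The stream ID is appended to the live-stream endpoint path for the DELETE request.
resp = requests.delete(f"http://0.0.0.0:5010/api/v1/live-stream/{stream_id}")
print(resp.status_code)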

Then, from the same folder as the compose.yaml file used to launch the example, run:

sudo docker compose down

To summarize, this section covered how to launch the VLM AI service, interact with it through the REST APIs, view the RTSP output, and view the alert states in Prometheus.

Here is a summary of useful addresses when interacting with the VLM service.

Access points for the VLM Service

| Name | Local URI | API Gateway URI | Description |
| --- | --- | --- | --- |
| REST API Docs | http://0.0.0.0:5010/docs | http://0.0.0.0:30080/vlm/docs | Documentation for the VLM AI service REST API |
| REST API | http://0.0.0.0:5010/api/v1/ | http://0.0.0.0:30080/vlm/api/v1 | Control REST API for the VLM AI service |
| Web Socket Alerts | ws://0.0.0.0:5016/api/v1/alerts/ws | ws://0.0.0.0:30080/ws-vlm/api/v1/alerts/ws | Websocket that outputs alert events when the VLM determines an alert is True |
| RTSP Output | rtsp://0.0.0.0:5011/out | | Overlay output of the VLM |
| Prometheus Alert Metrics | http://0.0.0.0:5012 | | Alert metrics exposed by the VLM AI service, scraped by Prometheus |
| Prometheus Inference Metrics | http://0.0.0.0:5017 | | VLM inference metrics such as token count, decode rate, decode time, etc., scraped by Prometheus |
| Prometheus Dashboard | http://0.0.0.0:9090 | | Dashboard to view Prometheus alerts and metrics. Launched by the jetson-monitoring service. |

Configuration

All configuration files can be found under the ~/vlm/example_1/config directory. The configuration options available are split into two categories.

  • VLM Service configuration

    • chat_server_config.json

    • main_config.json

  • Foundation Service Configuration

    • prometheus.yml

    • rules.yml

    • vlm-nginx.conf

AI Service configurations are json formatted files and are assumed to be in a folder named `config` by the AI service. This `config` folder must be placed in the same directory as the compose.yaml file that is used to launch the AI service.

The foundation service configurations found in the `config` directory are for reference only and must be copied into the appropriate foundation service configuration folder found under `/opt/nvidia/jetson/services` to take effect. The respective service must be restarted with systemctl after adjusting the config for changes to take effect.

chat_server_config.json

The chat_server_config.json file configures the chat server, which loads and runs the VLM model behind an OpenAI-like REST API interface. The VLM model can also be changed in this configuration file. When you change the model, restart the service and it will automatically download and quantize the new model.

{
    "api_server_port": 5015,
    "prometheus_port": 5017,
    "model": "Efficient-Large-Model/VILA1.5-13b",
    "log_level": "INFO",
    "print_stats": true
}
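For example, to switch to a smaller model you could edit the "model" key and then restart the service so the new model is downloaded and quantized. A minimal sketch, assuming the config path used in this example (see the supported model table below for valid keys):

import json
from pathlib import Path

# Path to the chat server config used in this example.
config_path = Path.home() / "vlm/example_1/config/chat_server_config.json"

config = json.loads(config_path.read_text())
config["model"] = "Efficient-Large-Model/VILA1.5-3b"  # pick a key from the supported model table
config_path.write_text(json.dumps(config, indent=4))

# Then restart the service so the new model is downloaded and quantized:
#   sudo docker compose down && sudo docker compose up -d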

| Key | Value Type | Value Example | Description | Notes |
| --- | --- | --- | --- | --- |
| "api_server_port" | int | 5015 | Port the chat server exposes its REST API on | |
| "prometheus_port" | int | 5017 | Port to output model statistics for Prometheus | Includes metrics like embedding time, decode time, decode rate and more. |
| "model" | str | "Efficient-Large-Model/VILA-13b" | Hugging Face path or local path to the VLM model | See the table below for supported model keys. |
| "log_level" | str | "INFO" | Python-based log level | Supported values: ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"] |
| "print_stats" | bool | true | true/false value to enable printing VLM statistics | |

The VILA and LLaVA families of models are currently supported.

| Model | Configuration Key | Storage Required (GB) |
| --- | --- | --- |
| VILA-2.7b | "Efficient-Large-Model/VILA-2.7b" | 7.1 |
| VILA-7b | "Efficient-Large-Model/VILA-7b" | 17.3 |
| VILA-13b | "Efficient-Large-Model/VILA-13b" | 31.2 |
| VILA1.5-3b | "Efficient-Large-Model/VILA1.5-3b" | 7.3 |
| VILA1.5-8b | "Efficient-Large-Model/Llama-3-VILA1.5-8B" | 19.9 |
| VILA1.5-13b | "Efficient-Large-Model/VILA1.5-13b" | 32.3 |
| LLaVA1.5-7b | "liuhaotian/llava-v1.5-7b" | 17.3 |
| LLaVA1.5-13b | "liuhaotian/llava-v1.5-13b" | 31.2 |

main_config.json

The main_config.json file configures the streaming pipeline that evaluates the stream input with the VLM and exposes a REST API for configuring the stream input, alert rules, and queries.

{
    "api_server_port": 5010,
    "prometheus_port": 5012,
    "websocket_server_port": 5016,
    "stream_output": "rtsp://0.0.0.0:5011/out",
    "chat_server": "http://0.0.0.0:5015",
    "alert_system_prompt": "You are an AI assistant whose job is to look at images and evaluate a set of alerts supplied by the user ...,
    "max_alerts": 10,
    "alert_cooldown": 60,
    "log_level": "DEBUG"
}

| Key | Value Type | Value Example | Description | Notes |
| --- | --- | --- | --- | --- |
| "api_server_port" | int | 5010 | Port the main pipeline will expose its REST APIs for stream and model control | |
| "prometheus_port" | int | 5012 | Port the alert metrics for Prometheus are available on | |
| "websocket_server_port" | int | 5016 | Port the websocket for alert events will be available on | |
| "stream_output" | str | rtsp://0.0.0.0:5011/out | Output URI of the RTSP stream generated by the AI service | |
| "chat_server" | str | http://0.0.0.0:5015 | URI of the internal chat server the main pipeline uses to access the VLM | |
| "alert_system_prompt" | str | "You are an AI assistant whose job is to look at images and evaluate a set of alerts supplied by the user" | The VLM system prompt used when evaluating alert rules | |
| "max_alerts" | int | 10 | Number of supported alerts | If changed, the rules.yml must also be updated. |
| "alert_cooldown" | int | 60 | Cooldown period for alerts being sent out on the websocket | For example, if set to 60, an alert that stays true over multiple frames will send at most one alert event on the websocket every 60 seconds. |
| "log_level" | str | "INFO" | Python-based log level | Supported values: ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"] |

Foundation Service Configuration

  • prometheus.yml

  • rules.yml

  • vlm-nginx.conf

The prometheus.yml and rules.yml files configure the jetson-monitoring service for tracking the VLM alert evaluations. The documentation for jetson-monitoring is available in Monitoring.

The vlm-nginx.conf file configures the jetson-ingress service (API Gateway) to route HTTP traffic to the VLM service. The documentation for jetson-ingress is available in API Gateway (Ingress).

VLM Service Integration

The VLM service can be integrated with other Jetson Platform Services to build a full end-to-end workflow for video monitoring. To read more about the full workflow and how the VLM service integrates with VST, SDR, the cloud, and the mobile app, go to the VLM Workflow page.

Further Reading

To understand more about the VLM AI service, view the source code in the jetson-platform-services repository. Support for VLMs on Jetson comes from the NanoLLM project. Benchmarks and other generative AI examples for Jetson can be found at the Jetson AI Lab.

Troubleshooting

VLM Model Fails to Load

The first time the VLM AI service is launched, it will download both the container and the VLM model. This is around 50 GB in size with the default configuration.

Ensure your Jetson has a stable internet connection to complete the download. The download time will depend on your network speed.

Once the container and model have downloaded, the model will automatically be quantized and optimized for Jetson. This is a one-time process that takes 10-20 minutes and will use up to 32 GB of memory for the largest model (13b).

You must ensure the total available virtual memory of your Jetson (physical memory + swap) is at least 32 GB.
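A quick way to check this, assuming the `psutil` package is installed:

import psutil

# Total virtual memory = physical RAM + swap, in GB.
total_gb = (psutil.virtual_memory().total + psutil.swap_memory().total) / (1024 ** 3)
print(f"Total virtual memory: {total_gb:.1f} GB")

if total_gb < 32:
    print("Less than 32 GB available: mount additional swap before loading the 13b model.")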

For example, an Orin NX16 needs an additional 8 GB of swap to get past the quantization stage (16 GB physical memory + 8 GB default swap + 8 GB additional swap). If you have a /data drive, you can run the following commands to mount more swap.

sudo fallocate -l 8G /data/8GB.swap
sudo mkswap /data/8GB.swap
sudo swapon /data/8GB.swap

After mounting the additional swap space, you can launch the VLM service container to download and quantize the model.

If you reboot your Jetson and still require the additional swap space, then you will need to run the swapon command again.

When the model is done quantizing, if you want to reclaim the disk space used by swap then you can run the following:

sudo swapoff /data/8GB.swap
sudo rm /data/8GB.swap

Managing the Container

Once the VLM service container is launched, if it hits any issues it can be manually restarted by running the following commands from the same folder as the docker compose file:

sudo docker compose down
sudo docker compose up -d

It can also be killed by running:

sudo docker kill jps_vlm

If the container has an update available, then download the latest from NGC.

System Warnings

When running the VLM service with an alert, it will use significant resources. This may cause your Jetson to show warnings such as "System throttled due to over-current". This is expected when running at the maximum power mode.

It is recommended to set your Jetson fans to maximum for extra cooling. This can be done with the jetson_clocks command:

sudo jetson_clocks --fan

If you would like to reduce the power consumption or heat production further, this can be done by controlling the power mode with the nvpmodel command.

sudo nvpmodel --help