Real-Time VLM Microservice#

Overview#

The Real-Time VLM Server is a FastAPI-based REST API service that provides real-time video understanding capabilities using Vision Language Models (VLM). It enables users to generate captions, process live video streams, and manage live streams through a comprehensive set of REST endpoints.

The server translates between HTTP requests/responses and the underlying Real-Time VLM Microservice components, providing a clean API interface for live stream processing operations.

Key Features#

  • VLM Caption Generation: Generate captions and alerts using Vision Language Models for live streams

  • Live Stream Support: Process RTSP live streams for real-time caption generation and alert detection

  • Streaming Responses: Server-Sent Events (SSE) for streaming output or Kafka messages

  • Asset Management: Comprehensive stream lifecycle management

  • Sampling Rate Management: Manage the image frames sampling for VLM inference of the live stream

  • Health Monitoring: Health check endpoints for service monitoring

  • Metrics: Prometheus metrics endpoint for observability

Architecture#

The following diagram illustrates the DeepStream Accelerated Pipeline architecture for the RTVI VLM Server:

DeepStream Accelerated Pipeline Architecture

The pipeline processes camera streams through the following stages:

  • RTSP + Decode: Receives and decodes camera streams

  • Video Segment Creation: Creates video segments for processing

  • Infer: Performs inference using VLM models (with vLLM inference engine)

  • MsgBroker (NvSchema): Publishes messages to Kafka

The system can be configured and controlled via:

  • REST API: HTTP-based API for configuration and control

Models Supported#

The RTVI VLM Microservice supports local vLLM-compatible checkpoints, NGC model artifacts, and remote OpenAI-compatible endpoints. Use MODEL_PATH=git:<Hugging Face URL> for Hugging Face checkpoints, MODEL_PATH=ngc:<org>/<team>/<model>:<version> for NGC model artifacts, or VLM_MODEL_TO_USE=openai-compat with VIA_VLM_ENDPOINT for a remote endpoint.

The following model families are supported.

Cosmos Reason2 Family (Default)#

  • Cosmos Reason2 8B with 0303-fp8-dynamic-kv8. Use VLM_MODEL_TO_USE=cosmos-reason2 and MODEL_PATH=ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8.

  • Cosmos Reason2 8B hf-0303. Use VLM_MODEL_TO_USE=cosmos-reason2 and MODEL_PATH=ngc:nim/nvidia/cosmos-reason2-8b:hf-0303.

  • Cosmos Reason2 8B 0303-fp4-dynamic-kv8. Use VLM_MODEL_TO_USE=cosmos-reason2 and MODEL_PATH=ngc:nim/nvidia/cosmos-reason2-8b:0303-fp4-dynamic-kv8. Do not use this FP4/NVFP4 variant on GB200.

Cosmos3 Family#

Nemotron Omni Family#

Qwen Family#

  • Qwen3-VL-30B-A3B-Instruct. Use VLM_MODEL_TO_USE=vllm-compatible and MODEL_PATH=git:https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct.

  • Qwen3-Omni-30B-A3B-Instruct. Use VLM_MODEL_TO_USE=vllm-compatible and MODEL_PATH=git:https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct.

  • Qwen3.5-27B. Use VLM_MODEL_TO_USE=vllm-compatible and MODEL_PATH=git:https://huggingface.co/Qwen/Qwen3.5-27B.

Cosmos Reason1#

RT-VLM supports video and text prompts for caption generation. Remote VLM model endpoints are tested with NVIDIA NIM or OpenAI-compatible chat/completions endpoints.

What’s New in 3.2.0#

Breaking Changes#

  • Endpoint rename: /v1/generate_captions_alerts has been renamed to /v1/generate_captions. Update any 3.1.0-era client code accordingly.

  • Duplicate live-stream / camera IDs are now rejected with HTTP 409 (DuplicateCameraId / DuplicateStreamId) instead of being silently overwritten. Remove the prior stream first, or surface the 409 to the caller.

  • Repeated RTSP caption requests now start independent jobs. Repeated /v1/generate_captions calls for the same RTSP stream ID no longer reconnect to the existing live caption request. Each call returns a distinct request ID. DELETE /v1/generate_captions/{stream_id} or deleting the stream stops all active caption jobs for that stream ID.

New Models#

  • Cosmos Reason 3 reasoning VLMs. Two checkpoints are supported: Cosmos3-Nano-Reasoner and Cosmos3-Super-Reasoner. Set VLM_MODEL_TO_USE=cosmos-reason3 and MODEL_PATH=git:<HF URL>.

  • Nemotron Nano Omni with native video + audio understanding. Enable with VLM_TRUST_REMOTE_CODE=true and (for audio) VLM_MODEL_SUPPORTS_AUDIO=true. Use INSTALL_PROPRIETARY_CODECS=true when inputs contain proprietary audio codecs.

  • Qwen 3.5 and Qwen 3.5 MoE VLMs. The service auto-detects Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration and runs them against the bundled vLLM runtime. The MoE variant uses the triton MoE backend (auto-selected on B200, configurable via VLLM_MOE_BACKEND).

  • Default model is now cosmos-reason2 (ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8).

New API Capabilities#

  • URL-based processing on /v1/generate_captions and /v1/files: http://, https://, s3://, file:// (with allowlist via FILE_URL_ALLOWED_DIRS). New request fields url, media_type, creation_time, url_headers; new response field chunk_id (zero-based).

  • Native audio for Omni models via enable_audio: true; audio_transcript is returned per chunk.

  • Reasoning via enable_reasoning: true; reasoning_description is returned per chunk.

  • Reasoning propagation to Kafka/protobuf: parsed VLM reasoning is added to VisionLLM.info and Incident.info as both reasoning and reasoningDescription.

  • Alert categorization via the alert_category field on incidents.

  • CV-compatible stream API: POST /v1/stream/add, POST /v1/stream/remove, GET /v1/stream/get-stream-info alongside the existing /v1/streams/* endpoints.

  • Text-only chat completions via /v1/chat/completions (no media required); multi-turn conversations and token-level SSE streaming.

  • POST /v1/files accepts an HTTP/S URL for server-side fetch.

  • Proprietary audio codec support for Omni models via INSTALL_PROPRIETARY_CODECS=true.

Optimizations#

  • Efficient Video Sampling (EVS) via VLM_VIDEO_PRUNING_RATE. Supported on Nemotron Nano VL and Qwen 2.5 VL (not Cosmos Reason).

  • GOP-aware decode optimization via RTVI_ENABLE_GOP_DECODE_OPT (default true; file decode only).

  • New vLLM tuning knobs: VLLM_MM_PROCESSOR_CACHE_GB, VLLM_ENFORCE_EAGER, VLLM_MAX_NUM_BATCHED_TOKENS, VLLM_DISABLE_MM_PREPROCESSOR_CACHE.

New Environment Variables#

  • VLM_MAX_GENERATION_TOKENS — Cap on tokens generated per request (default 16384).

  • VLM_PROMPT_MAX_LENGTH — Maximum user-prompt length in characters (default 10240).

  • VLM_SYSTEM_PROMPT_MAX_LENGTH — Maximum system-prompt length in characters (default 10240).

  • VLM_MODEL_SUPPORTS_AUDIO — Enable native audio decoding for Omni models (default false).

  • VLM_TRUST_REMOTE_CODE — Enable trust_remote_code in vLLM and AutoProcessor. Required for Omni and some custom models (default false).

  • VLM_VIDEO_PRUNING_RATE — EVS video-token pruning rate (0.01.0). Supported on Nemotron Nano VL and Qwen 2.5 VL; not on Cosmos Reason.

  • VLLM_MOE_BACKEND — Override vLLM MoE backend (e.g., triton). Auto-defaults to triton on B200.

  • VLLM_MM_PROCESSOR_CACHE_GB — vLLM multimodal preprocessor cache size in GB (default 1).

  • VLLM_ENFORCE_EAGER — Force eager execution in vLLM (A100/Ampere CR2 FP8 workaround) (default false).

  • VLLM_MAX_NUM_BATCHED_TOKENS — Cap on the prefill batch token size to improve decode throughput (default 16384).

  • RTVI_ENABLE_GOP_DECODE_OPT — GOP-aware decode optimization for file decode (default true; no effect on RTSP).

  • KAFKA_ASYNC_SEND_QUEUE_MAXSIZE — Maximum queued in-flight async Kafka sends (default 1024).

  • FILE_URL_ALLOWED_DIRS — Comma-separated absolute directories allowlisted for file:// URL ingestion. file:// is rejected when unset.

  • ASSET_DOWNLOAD_SSL_SKIP_VERIFY_DOMAINS — Domains for which SSL verification is skipped on URL download.

  • ASSET_DOWNLOAD_MAX_REDIRECTS — Maximum redirect hops on URL download (default 0 disables; max 10).

  • ASSET_DOWNLOAD_MAX_FILE_SIZE_GB — Maximum size (GB) of files fetched via URL (default 8).

  • ASSET_DOWNLOAD_AUTH_TOKENS — Per-domain auth headers (domain1=Bearer xxx;domain2=Basic yyy).

  • ASSET_MAX_AGE_HOURS — Auto-evict uploaded assets older than this many hours (default 0 disables).

Image Tag#

  • Docker Compose x86 default: nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0.

  • Docker Compose DGX Spark/SBSA image: nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0-sbsa.

  • Helm chart default: nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0. Set image.tag to the required release tag for your deployment package.

API Reference#

For complete API documentation including all endpoints, request/response schemas, and interactive examples, see the Real-Time VLM API Reference.

The API is organized into the following categories:

  • Captions: Generate VLM captions and alerts for videos and live streams

  • Files: Upload and manage video/image files

  • Live Stream: Add, list, and manage RTSP live streams

  • Stream: CV-compatible stream add, remove, and list APIs

  • Models: List available VLM models

  • Health Check: Service health and readiness probes

  • Metrics: Prometheus metrics endpoint

  • Metadata: Service metadata and version information

  • NIM Compatible: OpenAI-compatible endpoints for interoperability

Endpoint

Method

Description

/v1/metrics

GET

Get Prometheus-format RTVI metrics

/v1/ready

GET

Get RTVI VLM Microservice readiness status

/v1/live

GET

Get RTVI VLM Microservice liveness status

/v1/startup

GET

Get RTVI VLM Microservice startup status

/v1/assets/stats

GET

Get asset storage statistics

/v1/metadata

GET

Get RTVI VLM Microservice metadata

/v1/files

POST

Upload a media file or register media by URL/path

/v1/files

GET

List uploaded files

/v1/files/{file_id}

DELETE

Delete a file

/v1/files/{file_id}

GET

Get file information

/v1/files/{file_id}/content

GET

Get file content

/v1/streams/add

POST

Add one or more live streams

/v1/streams/get-stream-info

GET

List live streams

/v1/streams/delete/{stream_id}

DELETE

Remove a live stream

/v1/streams/delete-batch

DELETE

Remove multiple live streams

/v1/stream/add

POST

Add a stream using the CV-compatible format

/v1/stream/remove

POST

Remove a stream using the CV-compatible format

/v1/stream/get-stream-info

GET

List streams using the CV-compatible format

/v1/models

GET

List available models

/v1/generate_captions

POST

Generate VLM captions and audio transcripts

/v1/generate_captions/{stream_id}

DELETE

Stop live stream caption generation

/v1/chat/completions

POST

OpenAI-compatible chat completion endpoint

/v1/completions

POST

OpenAI-compatible completions endpoint

/v1/version

GET

Get release and API versions

/v1/license

GET

Get license information

/v1/manifest

GET

Get service manifest information

/v1/health/live

GET

NIM-compatible liveness check

/v1/health/ready

GET

NIM-compatible readiness check

All endpoints are prefixed with /v1. The API is available at http://<host>:8000.

Deployment#

Prerequisites#

Validated GPUs:

The RTVI VLM Microservice has been validated and tested on the following NVIDIA GPUs:

  • NVIDIA H100

  • NVIDIA RTX PRO 6000 Blackwell

  • NVIDIA L40S

  • NVIDIA DGX SPARK

  • NVIDIA IGX Thor

  • NVIDIA AGX Thor

Software:

  • OS: Ubuntu 24.04 or compatible Linux distribution (x86); DGX OS 7.4.0 (DGX Spark); Jetson Linux BSP Rel 38.4/38.5 (Jetson Thor)

  • Docker: Version 28.2+ and earlier than 29.5.0

  • Docker Compose: Version 2.36+

  • NVIDIA Driver: 580+

  • NVIDIA Container Toolkit: Latest version

  • Git LFS: For large file handling

Quick Start#

Clone the blueprint repository and use the deployment assets shipped with the RT-VLM service.

Docker Compose Deployment#

  1. Clone the repository and change into the service Docker directory:

    git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
    cd video-search-and-summarization/services/rtvi/rt-vlm/docker
    
  2. Create a .env file with your configuration:

    cat > .env << EOF
    BACKEND_PORT=8000
    RTVI_IMAGE=nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0
    # For DGX Spark/SBSA platforms:
    #RTVI_IMAGE=nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0-sbsa
    VLM_MODEL_TO_USE=cosmos-reason2
    MODEL_PATH=ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8
    KAFKA_ENABLED=true
    #KAFKA_BOOTSTRAP_SERVERS=<Kafka_server_ip:port>
    KAFKA_TOPIC=mdx-vlm-captions
    KAFKA_INCIDENT_TOPIC=mdx-vlm-incidents
    NGC_API_KEY=nvapi-XXXXXX
    VLM_BATCH_SIZE=128
    NVIDIA_VISIBLE_DEVICES=0
    EOF
    
  3. Start the service:

    docker compose up
    

Note

The standalone Compose stack starts RT-VLM, Kafka, and Redis. To use a different Kafka server, set KAFKA_BOOTSTRAP_SERVERS. To use a different Redis server, set REDIS_HOST.

If the launch fails due to Out of Memory error, see Troubleshooting (adjust VLLM_GPU_MEMORY_UTILIZATION, reduce VLM_MAX_MODEL_LEN, or set NVIDIA_VISIBLE_DEVICES=<gpuid>).

The APIs are available at http://<HOST_IP>:<BACKEND_PORT>/docs.

Standalone Helm Chart Deployment#

Use the standalone Helm chart when running only RT-VLM on Kubernetes. The chart is in deploy/helm/services/rtvi/charts/rtvi-vlm in the video-search-and-summarization repository.

  1. Clone the repository and change into the chart directory:

    git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
    cd video-search-and-summarization/deploy/helm/services/rtvi/charts/rtvi-vlm
    
  2. Create the namespace and secrets:

    kubectl create namespace vss-rtvi
    kubectl create secret docker-registry ngc-image-pull-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="$NGC_API_KEY" \
      -n vss-rtvi
    kubectl create secret generic ngc-api \
      --from-literal=NGC_API_KEY="$NGC_API_KEY" \
      -n vss-rtvi
    # Required when MODEL_PATH points to a gated Hugging Face checkpoint.
    kubectl create secret generic hf-token-secret \
      --from-literal=HF_TOKEN="$HF_TOKEN" \
      -n vss-rtvi
    
  3. Install the chart with the standalone override:

    helm upgrade --install vss-rtvi-vlm . \
      -n vss-rtvi \
      -f overrides_rtvi_vlm.yaml
    

    When using the hf-token-secret secret, set hfTokenSecret.name=hf-token-secret and hfTokenSecret.key=HF_TOKEN in your values file or with --set.

  4. Expose the API for local testing:

    kubectl port-forward -n vss-rtvi svc/vss-rtvi-vlm 8000:8000
    

The API is available at http://localhost:8000/docs.

Source Code Updates and Custom Container Build#

Use this workflow only when you have changed RT-VLM source files under services/rtvi/rt-vlm/src and need those changes inside the runtime container. Custom container testing is supported with the standalone Docker Compose deployment.

  1. Build the custom image from the RT-VLM service directory:

    cd video-search-and-summarization/services/rtvi/rt-vlm
    docker build -f docker/Dockerfile -t <registry>/<repo>/vss-rt-vlm:3.2.0-custom .
    
  2. Test the image with Docker Compose by setting RTVI_IMAGE in docker/.env:

    RTVI_IMAGE=<registry>/<repo>/vss-rt-vlm:3.2.0-custom
    

    Then restart the service:

    cd docker
    docker compose down
    docker compose up
    

For DGX Spark/SBSA Docker Compose testing, build an ARM64/SBSA image from the rt-vlm/ directory and load it into the local Docker image store. Export IS_SBSA=true on the host shell before the build command so the Dockerfile resolves the -sbsa base image automatically:

export IS_SBSA=true
docker buildx build --platform linux/arm64 \
  --build-arg IS_SBSA \
  -f docker/Dockerfile \
  -t <registry>/<repo>/vss-rt-vlm:3.2.0-custom-sbsa \
  --load .

For Jetson AGX Thor / IGX Thor (ARM64 but not SBSA), do not set IS_SBSA. The default base image (nvcr.io/nvidia/vss-core/vss-rt-vlm:3.2.0) is multi-arch, so a linux/arm64 build pulls the Thor-compatible arm64 variant automatically:

docker buildx build --platform linux/arm64 \
  -f docker/Dockerfile \
  -t <registry>/<repo>/vss-rt-vlm:3.2.0-custom-thor \
  --load .

Required Packages for Usage Examples#

The Python examples require the following packages. Install with:

pip install requests sseclient-py
  • requests – HTTP client for all examples

  • sseclient-py – Server-Sent Events client (required only for streaming examples)

The RTVI CLI client uses additional packages (tabulate, tqdm, pyyaml) for enhanced output formatting; the examples below are self-contained and need only requests and sseclient-py.

Usage Examples#

Note

Discovering the loaded model ID: The examples below use cosmos-reason2 as the model field. When running a locally deployed model, the service loads it under its full NGC artifact ID (e.g., nim_nvidia_cosmos-reason2-8b_0303-fp8-dynamic-kv8). Use GET /v1/models to retrieve the exact ID before making inference requests:

import requests
response = requests.get("http://localhost:8000/v1/models")
model_id = response.json()["data"][0]["id"]
print(f"Loaded model: {model_id}")  # use this as the "model" field

Replace cosmos-reason2 in the examples below with the returned id.

Dense Captioning Example#

The following example demonstrates dense captioning for a stored video file. A video is uploaded, split into temporal chunks based on chunk_duration and chunk_overlap_duration, and each chunk is processed by the VLM.

Workflow: For a video of duration D seconds with chunk_duration=60 and chunk_overlap_duration=10, the video is chunked into overlapping segments (e.g., a 120-second video yields approximately 2 chunks). Each chunk is decoded, frames are sampled, and the VLM generates a caption. With stream=False (default), all chunks are processed and returned together in a single JSON response.

import requests
import json

BASE_URL = "http://localhost:8000/v1"

# Step 1: Upload a video file
with open("video.mp4", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/files",
        files={"file": f},
        data={
            "purpose": "vision",
            "media_type": "video",
            "creation_time": "2024-06-09T18:32:11.123Z"
        }
    )
response.raise_for_status()
file_info = response.json()
file_id = file_info["id"]
print(f"Uploaded file: {file_id}")

# Step 2: Generate captions (chunks returned in full response)
caption_request = {
    "id": file_id,
    "prompt": "Describe what is happening in this video",
    "model": "cosmos-reason2",
    "chunk_duration": 60,
    "chunk_overlap_duration": 10,
    "enable_audio": False
}
response = requests.post(
    f"{BASE_URL}/generate_captions",
    json=caption_request
)
response.raise_for_status()
captions = response.json()
print(f"Generated {len(captions['chunk_responses'])} caption chunks")

# Step 3: Process results
for chunk in captions["chunk_responses"]:
    print(f"[{chunk['start_time']} - {chunk['end_time']}]: {chunk['content']}")

# Step 4: Clean up
requests.delete(f"{BASE_URL}/files/{file_id}")
print("File deleted")

Dense Captioning (Streaming) Example#

The following example demonstrates dense captioning for a live RTSP stream with streaming output. As the stream is decoded in real time, it is split into chunks (e.g., 60-second segments with 10-second overlap). Each chunk is processed by the VLM as it becomes available.

Workflow: With stream=True, chunks are returned incrementally via Server-Sent Events (SSE). Each SSE event contains chunk_responses for chunks that have completed VLM inference. The client receives captions as they are generated instead of waiting for the entire stream to finish. The stream terminates with a [DONE] message.

Concurrent RTSP caption requests: each /v1/generate_captions request for an RTSP stream ID gets its own request ID, prompt, SSE response stream, and private server-side live decode pipeline. Multiple clients can caption the same RTSP stream concurrently, and repeated requests no longer reconnect to an existing live caption request. Because each request uses a separate decoder, many concurrent requests for the same stream can reduce performance. DELETE /v1/generate_captions/{stream_id} or deleting the stream stops all active caption jobs for that stream ID.

import requests
import json
import sseclient

BASE_URL = "http://localhost:8000/v1"

# Step 1: Add live stream
stream_request = {
    "streams": [{
        "liveStreamUrl": "rtsp://example.com/stream",
        "description": "Main warehouse camera"
    }]
}
response = requests.post(
    f"{BASE_URL}/streams/add",
    json=stream_request
)
response.raise_for_status()
stream_info = response.json()
if stream_info.get("errors"):
    raise RuntimeError(f"Failed to add stream: {stream_info['errors']}")
stream_id = stream_info["results"][0]["id"]
print(f"Added stream: {stream_id}")

# Step 2: Start caption generation with streaming
caption_request = {
    "id": stream_id,
    "prompt": "Describe what is happening in this live stream",
    "model": "cosmos-reason2",
    "stream": True,
    "chunk_duration": 60,
    "chunk_overlap_duration": 10
}
response = requests.post(
    f"{BASE_URL}/generate_captions",
    json=caption_request,
    stream=True
)
response.raise_for_status()

# Step 3: Process streaming responses (chunks as they complete)
client = sseclient.SSEClient(response)
for event in client.events():
    data = event.data.strip() if event.data else ""
    if data == "[DONE]":
        break
    if not data:
        continue
    result = json.loads(data)
    if "chunk_responses" in result:
        for chunk in result["chunk_responses"]:
            print(f"[{chunk['start_time']}]: {chunk['content']}")

# Step 4: Stop processing and remove stream
requests.delete(f"{BASE_URL}/generate_captions/{stream_id}")
requests.delete(f"{BASE_URL}/streams/delete/{stream_id}")
print("Stream removed")

Listing Live Streams#

Use GET /v1/streams/get-stream-info to list all currently registered live streams. This is useful for verifying that a stream was added successfully or for retrieving the stream ID of an existing stream.

import requests

BASE_URL = "http://localhost:8000/v1"

response = requests.get(f"{BASE_URL}/streams/get-stream-info")
response.raise_for_status()
streams = response.json()  # list of LiveStreamInfo objects
for stream in streams:
    print(f"ID: {stream['id']}, URL: {stream['liveStreamUrl']}")

CV-Compatible Stream API#

In addition to the /v1/streams/* endpoints, the RTVI VLM service exposes a CV-schema-compatible stream API for cross-service interoperability with RTVI-CV pipelines: POST /v1/stream/add, POST /v1/stream/remove, and GET /v1/stream/get-stream-info. The request and response payloads follow the CV schema (camera_id, url, metadata).

If metadata.prompt is supplied on POST /v1/stream/add, the service auto-starts caption inference for the stream at add time; otherwise the stream is registered without inference and can be driven later.

import requests

BASE_URL = "http://localhost:8000/v1"

# Register a stream with auto-inference enabled
payload = {
    "camera_id": "camera-entrance-east-01",
    "url": "rtsp://example.com/stream",
    "metadata": {
        "prompt": "Describe what is happening in this live stream"
    }
}
response = requests.post(f"{BASE_URL}/stream/add", json=payload)
response.raise_for_status()

# Remove the stream
requests.post(
    f"{BASE_URL}/stream/remove",
    json={"camera_id": "camera-entrance-east-01"}
)

VLM Alert Example#

The following example demonstrates alert/anomaly detection using the VLM. Use a prompt that expects a structured Yes/No response for anomaly detection. When anomalies are detected, incidents can be published to the Kafka incident topic (see Kafka Integration). The Incident message payload is defined in Incident Messages.

import requests

BASE_URL = "http://localhost:8000/v1"

# Upload video or use existing stream ID
file_id = "your-file-id"  # or stream_id for live streams

caption_request = {
    "id": file_id,
    "prompt": (
        "You are a warehouse monitoring system focused on safety and "
        "efficiency. Analyze the situation to detect any anomalies such as "
        "workers not wearing safety gear, leaving items unattended, or "
        "wasting time. Respond in the following structured format:\n"
        "Anomaly Detected: Yes/No\n"
        "Reason: [Brief explanation]"
    ),
    "system_prompt": "Answer the user's question correctly in yes or no",
    "model": "cosmos-reason2",
    "chunk_duration": 60,
    "chunk_overlap_duration": 10
}
response = requests.post(
    f"{BASE_URL}/generate_captions",
    json=caption_request
)
captions = response.json()

for chunk in captions["chunk_responses"]:
    content = chunk.get("content", "")
    if "Anomaly Detected: Yes" in content:
        print(f"ALERT [{chunk['start_time']} - {chunk['end_time']}]: {content}")

Chat Completions API Example#

The OpenAI-compatible /v1/chat/completions endpoint supports three input modes for video/image: pre-uploaded file ID, HTTP/HTTPS URL, and base64 data URL.

1. Using pre-uploaded file ID (id field):

import requests

BASE_URL = "http://localhost:8000/v1"
file_id = "your-uploaded-file-uuid"  # From POST /v1/files

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "cosmos-reason2",
        "id": file_id,
        "messages": [{"role": "user", "content": "Describe what is happening in this video."}],
    },
)
result = response.json()
print(result["choices"][0]["message"]["content"])

2. Using HTTP URL (image_url or video_url in message content):

import requests

BASE_URL = "http://localhost:8000/v1"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "cosmos-reason2",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/image.png"},
                    },
                ],
            }
        ],
    },
)
result = response.json()
print(result["choices"][0]["message"]["content"])

# For video, use "video_url" instead:
# {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}

3. Using base64 data URL (inline media in message content):

import requests
import base64

BASE_URL = "http://localhost:8000/v1"

# Read and encode image as base64
with open("image.png", "rb") as f:
    b64_data = base64.b64encode(f.read()).decode("utf-8")

data_url = f"data:image/png;base64,{b64_data}"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "cosmos-reason2",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    },
)
result = response.json()
print(result["choices"][0]["message"]["content"])

# For video: data:video/mp4;base64,<base64_string>

URL-Based Processing#

POST /v1/generate_captions and POST /v1/files accept a url field for direct server-side ingestion without a prior upload step. Supported schemes: http://, https://, s3://, and file://.

New request fields on /v1/generate_captions:

  • url — media URL (any supported scheme).

  • media_type"video" (default) or "image".

  • creation_time — ISO 8601 timestamp used as the frame-time offset baseline (useful for live-correlated archive playback).

  • url_headers — per-request authorization headers (dict). HTTPS only; cross-domain redirects strip these headers.

New response field: each chunk in chunk_responses includes a zero-based chunk_id for ordering.

Security:

  • file:// URLs are rejected unless FILE_URL_ALLOWED_DIRS is set (comma-separated absolute directories). Paths are resolved via realpath to prevent traversal escapes.

  • HTTP/HTTPS URLs are validated against SSRF rules. Per-domain SSL skip is opt-in via ASSET_DOWNLOAD_SSL_SKIP_VERIFY_DOMAINS. Redirect hops are disabled by default; raise the cap via ASSET_DOWNLOAD_MAX_REDIRECTS (max 10). Server-level auth tokens can be set per domain via ASSET_DOWNLOAD_AUTH_TOKENS (format: domain1=Bearer token1;domain2=Basic xyz).

import requests

BASE_URL = "http://localhost:8000/v1"

caption_request = {
    "url": "https://example.com/video.mp4",
    "media_type": "video",
    "creation_time": "2024-06-09T18:32:11.123Z",
    "prompt": "Describe what is happening in this video",
    "model": "cosmos-reason2",
    "chunk_duration": 60,
    "chunk_overlap_duration": 10
}
response = requests.post(
    f"{BASE_URL}/generate_captions",
    json=caption_request
)
response.raise_for_status()
for chunk in response.json()["chunk_responses"]:
    print(f"[chunk {chunk['chunk_id']}] {chunk['content']}")

Implementation Details#

Server Class#

The RT VLM Server implements the FastAPI application and manages all API endpoints.

Timestamp Handling#

The server handles timestamps differently based on media type, for live stream uses NTP timestamps (ISO 8601 format).

Streaming Implementation#

For streaming responses, the server uses Server-Sent Events (SSE):

  • Events are sent as JSON objects

  • Each event contains chunk responses as they become available

  • Final event contains usage statistics (if requested)

  • Stream terminates with [DONE] message

  • Multiple clients can caption the same RTSP stream concurrently. Each /v1/generate_captions call gets its own request ID, prompt, SSE response stream, and private server-side live decode pipeline. Multiple decoders for the same stream can reduce performance.

Kafka Integration#

The RTVI VLM Server can publish messages to Kafka topics for downstream processing. This enables integration with other microservices, analytics pipelines, and real-time alerting systems.

Kafka Topics#

The server publishes to the following Kafka topics:

  • VisionLLM Messages: Contains VisionLLM protobuf messages with VLM caption results.

  • Incidents: Contains Incident protobuf messages when anomalies or incidents are detected.

Configuration:

Kafka integration is controlled by the following environment variables:

  • KAFKA_ENABLED: Enable/disable Kafka integration (true / false). Default: true

  • KAFKA_BOOTSTRAP_SERVERS: Comma-separated list of Kafka broker addresses (e.g., localhost:9092 or kafka:9092 for Docker)

  • KAFKA_TOPIC: Topic for VisionLLM messages.

  • KAFKA_INCIDENT_TOPIC: Topic for incident messages.

How Alerts and Incidents Are Sent to Kafka#

When generate_captions processes video chunks, the server publishes messages to Kafka as each chunk completes VLM inference:

  1. VisionLLM messages are always sent to KAFKA_TOPIC. Each message contains the VLM caption, frame metadata, sensor info, and an incidentDetected flag in the info map ("true" or "false"). When reasoning is enabled and parsed from the VLM output, the reasoning is also added to the info map as both reasoning and reasoningDescription.

  2. Incident messages are sent to KAFKA_INCIDENT_TOPIC only when an anomaly is detected. The server detects incidents by checking the VLM response for trigger phrases such as "yes" or "true" (case-insensitive). Use prompts that expect structured Yes/No answers (see Enabling Incidents). For the full Incident message payload (protobuf schema and example), see Incident Messages. When reasoning is available, it is also added to Incident.info as both reasoning and reasoningDescription.

  3. Publishing flow (per chunk):

    • After VLM inference completes, the server builds a VisionLLM protobuf and optionally an Incident protobuf if the response triggers an alert.

    • Each message is sent asynchronously with a key {request_id}:{chunk_idx} for partitioning and ordering.

    • A message_type header ("vision_llm" or "incident") identifies the payload type for consumers.

    • VisionLLM and Incident messages are published independently; an incident is sent only when the VLM output indicates an anomaly.

  4. Requirement: Set KAFKA_ENABLED=true and configure KAFKA_BOOTSTRAP_SERVERS for Kafka publishing to occur.

Message Formats#

Incident Messages#

Incident messages are serialized as Protocol Buffer (protobuf) messages using the Incident message type from the protobuf schema.

Message Header:

  • message_type: "incident"

Message Structure:

message Incident {
  string sensorId = 1;
  google.protobuf.Timestamp timestamp = 2;
  google.protobuf.Timestamp end = 3;
  repeated string objectIds = 4;
  repeated string frameIds = 5;
  Place place = 6;
  AnalyticsModule analyticsModule = 7;
  string category = 8;
  repeated Embedding embeddings = 9;
  bool isAnomaly = 10;
  LLM llm = 12;
  map<string, string> info = 11;
}

Key Fields:

  • sensorId: Identifier of the sensor/stream

  • timestamp: Start timestamp of the incident

  • end: End timestamp of the incident

  • objectIds: Array of object IDs involved in the incident

  • category: Category of the incident (e.g., "safety_non_compliance")

  • isAnomaly: Boolean indicating if the incident is an anomaly

  • llm: LLM query and response information

  • info: Additional metadata map containing fields like: * request_id: Request ID associated with the incident * chunk_idx: Chunk index where the incident was detected * incident_detected: Alert flag * reasoning: Parsed VLM reasoning, when available * reasoningDescription: Parsed VLM reasoning, when available * priority: Priority level (e.g., "high")

Example Incident JSON (for reference):

{
  "sensorId": "camera-entrance-east-01",
  "timestamp": "2025-11-19T06:22:20Z",
  "end": "2025-11-19T06:22:32Z",
  "objectIds": [],
  "frameIds": ["frame-10512", "frame-10518"],
  "place": {
    "id": "dock-entrance-east",
    "name": "Dock Entrance - East",
    "type": "warehouse-bay"
  },
  "analyticsModule": {
    "id": "inc-activity-detector",
    "description": "Forklift safety compliance detector",
    "source": "VLM",
    "version": "2.0.0"
  },
  "category": "safety_non_compliance",
  "isAnomaly": true,
  "info": {
    "priority": "high",
    "request_id": "req_1234567890",
    "chunk_idx": "5"
  },
  "llm": {
    "queries": [{
      "response": "Operator entered the high-risk loading area without a high-visibility vest while a forklift was active."
    }]
  }
}

VisionLLM Messages#

VisionLLM messages contain VLM caption results and are serialized as Protocol Buffer messages using the VisionLLM message type.

Message Header:

  • message_type: "vision_llm" (default, if not specified)

Message Structure:

See the protobuf schema documentation for complete VisionLLM message structure. Key fields include:

  • version: Message version

  • timestamp: Start timestamp

  • end: End timestamp

  • startFrameId: Start frame identifier

  • endFrameId: End frame identifier

  • sensor: Sensor information

  • llm: LLM queries, responses, and embeddings

  • info: Additional metadata map. When reasoning is enabled and parsed from the VLM output, this map includes both reasoning and reasoningDescription.

Redis Error Messages#

By default, error messages are sent to Kafka. To use Redis for error messages, set the following environment variables in your .env file:

ENABLE_REDIS_ERROR_MESSAGES=true
REDIS_HOST=redis.example.com
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=your_password  # Optional, only if Redis requires authentication
ERROR_MESSAGE_TOPIC=vision-llm-errors  # Redis channel name for error messages

Error messages will be published to the Redis channel specified in ERROR_MESSAGE_TOPIC. The message format remains JSON with the following fields: streamId, timestamp, type, source, event.

Using Remote Endpoints#

The RTVI VLM Microservice supports using remote endpoints with NVIDIA NIM or OpenAI-compatible models:

NVIDIA NIM:

VLM_MODEL_TO_USE=openai-compat
OPENAI_API_KEY=nvapi-XXXXXXX
VIA_VLM_ENDPOINT="https://integrate.api.nvidia.com/v1"
VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME="nvidia/nemotron-nano-12b-v2-vl"

For local deployments, update VIA_VLM_ENDPOINT to point to your local deployment.

GPT-4o:

OPENAI_API_KEY=<openai key>
VLM_MODEL_TO_USE=openai-compat
VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME="gpt-4o"

Hugging Face Models Locally#

For models downloaded from Hugging Face and served locally via vLLM:

Qwen3-VL:

If HF_TOKEN is already set:

VLM_MODEL_TO_USE=vllm-compatible
MODEL_PATH=git:https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct

If HF_TOKEN is not set, embed the token directly in the URL:

VLM_MODEL_TO_USE=vllm-compatible
MODEL_PATH=git:https://user:<huggingface_token>@huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct

Qwen3.5 / Qwen 3.5 MoE:

The service detects the Qwen3.5 architecture from the model config and selects the bundled vLLM runtime automatically. No extra configuration is required beyond MODEL_PATH.

VLM_MODEL_TO_USE=vllm-compatible
MODEL_PATH=git:https://huggingface.co/Qwen/Qwen3.5-27B

Cosmos Reason 3 Reasoner:

Cosmos Reason 3 Reasoner is served by setting VLM_MODEL_TO_USE=cosmos-reason3 and pointing MODEL_PATH at the Hugging Face checkpoint with the git: prefix. Two checkpoint repositories are supported:

# Nano Reasoner
VLM_MODEL_TO_USE=cosmos-reason3
MODEL_PATH=git:https://huggingface.co/nvidia/Cosmos3-Nano-Reasoner
# Super Reasoner
VLM_MODEL_TO_USE=cosmos-reason3
MODEL_PATH=git:https://huggingface.co/nvidia/Cosmos3-Super-Reasoner

Omni Native Audio Support#

Omni models, including Nemotron Nano Omni, provide native single-model video + audio understanding. When VLM_MODEL_SUPPORTS_AUDIO=true is set, audio is decoded and consumed natively by the model alongside video frames.

Required environment:

VLM_MODEL_TO_USE=vllm-compatible
MODEL_PATH=<HF or NGC path to Nemotron Nano Omni>
VLM_TRUST_REMOTE_CODE=true
VLM_MODEL_SUPPORTS_AUDIO=true   # omit for video-only
INSTALL_PROPRIETARY_CODECS=true # required for proprietary audio codecs

Note

Set INSTALL_PROPRIETARY_CODECS=true when audio inputs may contain proprietary audio codecs. Without it, the service falls back to the default codec set and may fail to decode the audio track of such inputs. The flag has no effect when audio is disabled.

Request side: pass enable_audio: true on POST /v1/generate_captions to use native audio. For reasoning checkpoints, set enable_reasoning: true; the response includes a reasoning_description field in each chunk.

Response side: each chunk in chunk_responses includes an audio_transcript field (when enable_audio is set). Native audio understanding is available only on Omni models; other VLM models support video and text input only.

For LVS integration, see Optional: Use Nemotron Nano V3 Omni (audio-aware VLM).

Efficient Video Sampling (EVS)#

EVS prunes redundant video tokens at the VLM input to lower per-chunk inference cost. Configure with VLM_VIDEO_PRUNING_RATE=<0.0-1.0> (leave unset to disable).

EVS is supported on Nemotron Nano VL and Qwen 2.5 VL. It is not supported on Cosmos Reason1 or Cosmos Reason2 — setting the variable on those models has no effect.

Environment Variables Reference#

The shared list below covers variables common to the standalone Docker Compose stack and standalone Helm override.

Docker Compose and Helm Variables#

The table lists variables in standalone Docker Compose stack and the standalone Helm override.

Variable

Description

Standalone default

MODEL_PATH (Helm: modelPath)

Model source

ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8

MODEL_IMPLEMENTATION_PATH

Custom model implementation path

Empty

NGC_API_KEY

NGC API key

Compose: Empty; Helm: ngc-api/NGC_API_KEY secret

HF_TOKEN

Hugging Face token

Compose: Empty; Helm: hf-token-secret/HF_TOKEN secret

NVIDIA_API_KEY

NVIDIA API key for hosted endpoints

NOAPIKEYSET

NVIDIA_VISIBLE_DEVICES

GPU device IDs exposed to the container

all

OPENAI_API_KEY

OpenAI-compatible API key

NOAPIKEYSET

OPENAI_API_VERSION

Azure OpenAI API version

Empty

VIA_VLM_API_KEY

OpenAI-compatible VLM API key

Compose: Empty; Helm: ngc-api/NGC_API_KEY secret

VLM_MODEL_TO_USE

Backend selector

cosmos-reason2

VLM_BATCH_SIZE

VLM inference batch size

Empty

NUM_VLM_PROCS

Number of VLM inference processes

Empty

NUM_GPUS

Number of GPUs to use

Compose: Empty; Helm: 1

VSS_NUM_GPUS_PER_VLM_PROC

Number of GPUs per VLM process

Empty

VLM_INPUT_WIDTH

Input frame width

Empty

VLM_INPUT_HEIGHT

Input frame height

Empty

VLM_DEFAULT_NUM_FRAMES_PER_SECOND_OR_FIXED_FRAMES_CHUNK

Frame sampling rate or fixed frames per chunk

30

VLM_SYSTEM_PROMPT

Default system prompt

Empty

VLM_PROMPT_MAX_LENGTH

Maximum user-prompt length in characters

10240

RTVI_VLM_MAX_GENERATION_TOKENS (Helm env: VLM_MAX_GENERATION_TOKENS)

Maximum generated tokens

16384

VLM_MODEL_SUPPORTS_AUDIO

Enable native audio support for Omni models

false

VLM_TRUST_REMOTE_CODE

Enable trust of model-supplied remote code

false

INSTALL_PROPRIETARY_CODECS

Install proprietary codecs at container startup

false

FORCE_SW_AV1_DECODER

Force software AV1 decode

Empty

LOG_LEVEL

Service logging verbosity

Compose: Empty; Helm: INFO

RTVI_EXTRA_ARGS

Additional RT-VLM runtime arguments

Empty

RTVI_RTSP_LATENCY

RTSP latency override

Empty

RTVI_RTSP_TIMEOUT

RTSP timeout override

Empty

RTVI_RTSP_RECONNECTION_INTERVAL

Time to wait between RTSP reconnection attempts

5

RTVI_RTSP_RECONNECTION_WINDOW

RTSP reconnection window in seconds

60

RTVI_RTSP_RECONNECTION_MAX_ATTEMPTS

Maximum RTSP reconnection attempts

10

RTVI_RTPJITTERBUFFER_DROP_ON_LATENCY

GStreamer jitterbuffer drop-on-latency setting

false

RTVI_RTPJITTERBUFFER_FASTSTART_MIN_PACKETS

GStreamer jitterbuffer fast-start packet threshold

2

RTVI_ENABLE_LIVE_TIMESTAMP_FILTER

Enable timestamp filtering for live streams

false

RTVI_ENABLE_FILE_TIMESTAMP_FILTER

Enable timestamp filtering for file streams

true

RTVI_ADD_TIMESTAMP_TO_VLM_PROMPT

Add timestamp metadata to VLM prompts

Empty

RTVI_EMPTY_CUDA_CACHE_ON_RESULT

Empty CUDA cache after result handling

false

RTVI_STREAM_DELETE_BLOCKING_TIMEOUT_SEC

Blocking timeout for stream deletion cleanup

300

RTVI_ENABLE_GOP_DECODE_OPT

Enable GOP-aligned decode optimization

true

VSS_SKIP_INPUT_MEDIA_VERIFICATION

Skip input media validation

Empty

VLLM_GPU_MEMORY_UTILIZATION

vLLM GPU memory utilization fraction

Empty

VLM_VIDEO_PRUNING_RATE

Efficient Video Sampling pruning rate

Compose: Empty; Helm: 0.0

RTVI_VLLM_MOE_BACKEND (Helm env: VLLM_MOE_BACKEND)

vLLM MoE backend override

Empty

RTVI_VLLM_MM_PROCESSOR_CACHE_GB (Helm env: VLLM_MM_PROCESSOR_CACHE_GB)

Multimodal processor cache size

1

VLLM_MM_TENSOR_IPC

vLLM multimodal tensor IPC setting

Empty

VLLM_MULTIMODAL_TENSOR_IPC

vLLM multimodal tensor IPC setting

Empty

VLLM_MM_ENCODER_ATTN_BACKEND

vLLM multimodal encoder attention backend

Empty

VLLM_ROOT

vLLM package root used by runtime patches

/usr/local/lib/python3.12/dist-packages/vllm

VLLM_USE_NVFP4_CT_EMULATIONS

Enable NVFP4 CT emulation

0

KAFKA_ENABLED

Enable Kafka publishing

Compose: true; Helm: false

KAFKA_BOOTSTRAP_SERVERS

Kafka bootstrap servers

Compose: kafka:9092; Helm: 127.0.0.1:9092

KAFKA_TOPIC

VisionLLM message topic

mdx-vlm-captions

KAFKA_INCIDENT_TOPIC

Incident topic

mdx-vlm-incidents

RTVI_VLM_KAFKA_ASYNC_SEND_QUEUE_MAXSIZE (Helm env: KAFKA_ASYNC_SEND_QUEUE_MAXSIZE)

Bounded queue size for async Kafka sends

1024

ERROR_MESSAGE_TOPIC

Kafka topic or Redis channel for error messages

vision-llm-errors

ENABLE_REDIS_ERROR_MESSAGES

Publish errors to Redis instead of Kafka

false

REDIS_HOST

Redis host

Compose: redis; Helm: 127.0.0.1

REDIS_PORT

Redis application port

6379

REDIS_DB

Redis database number

0

ENABLE_OTEL_MONITORING

Enable OpenTelemetry

false

OTEL_RESOURCE_ATTRIBUTES

OpenTelemetry resource attributes

Empty

OTEL_TRACES_EXPORTER

OpenTelemetry traces exporter

otlp

OTEL_EXPORTER_OTLP_ENDPOINT

OpenTelemetry OTLP endpoint

http://otel-collector:4318

OTEL_METRIC_EXPORT_INTERVAL

OpenTelemetry metric export interval in milliseconds

60000

Additional Helm Chart Values#

The chart injects KAFKA_BOOTSTRAP_SERVERS, REDIS_HOST, REDIS_PORT, REDIS_DB, MODEL_PATH, NGC_API_KEY, and VIA_VLM_API_KEY from chart values or Kubernetes secrets. When useSharedNim=true, it also injects VIA_VLM_ENDPOINT and VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME.

Top-level Helm values

  • enabled: Enable the chart. Default: false; standalone override: true.

  • image.repository, image.tag, and image.pullPolicy: Image settings. Defaults: nvcr.io/nvidia/vss-core/vss-rt-vlm, 3.2.0, and IfNotPresent.

  • replicas: Number of replicas. Default: 1.

  • useSharedNim: Use a shared OpenAI-compatible NIM instead of loading a model in this pod. Default: false.

  • modelPath: Model path when useSharedNim=false. Standalone override: ngc:nim/nvidia/cosmos-reason2-8b:0303-fp8-dynamic-kv8.

  • sharedNimService, sharedNimPort, vlmBaseUrl, vlmName, viaVlmOpenAiModelDeploymentName, and nims.enabled: Shared or remote NIM routing values.

  • ngcApiSecret.name, ngcApiSecret.key, hfTokenSecret.name, and hfTokenSecret.key: Secret references for NGC and Hugging Face tokens. The standalone override uses hfTokenSecret.name=hf-token-secret.

  • global.imagePullSecrets, global.ngcApiSecret.name, global.ngcApiSecret.key, and global.useReleaseNamePrefix: Umbrella chart global values.

  • kafkaBootstrapServers, waitForKafka.enabled, waitForKafka.image.repository, waitForKafka.image.tag, waitForKafka.imagePullPolicy, waitForKafka.timeoutSeconds, and waitForKafka.topics: Kafka connection and startup ordering values.

  • redisHost, redisPort, redisDb, and redisPassword: Redis connection values.

  • service.port and service.type: Kubernetes service settings. Defaults: 8000 and ClusterIP.

  • securityContext.runAsUser, securityContext.runAsGroup, shmSize, resources.requests.nvidia.com/gpu, and resources.limits.nvidia.com/gpu: Pod security, shared memory, and GPU settings.

  • env: List of RT-VLM environment variables.

  • nodeSelector and tolerations: Scheduling controls.

Frame Selection Modes#

RTVI VLM supports two frame selection modes for sampling frames from video chunks:

FPS-based Selection:

  • Enable --use-fps-for-chunking flag

  • Set --num-frames-per-second-or-fixed-frames-chunk to the desired frames per second (e.g., 0.05 for 0.05 FPS)

  • The system will sample frames at the specified rate based on chunk duration

Fixed Frame Selection (default):

  • Do not set --use-fps-for-chunking flag (disabled by default)

  • Set --num-frames-per-second-or-fixed-frames-chunk to the desired number of frames per chunk (e.g., 8 for 8 frames)

  • The system will sample a fixed number of equally-spaced frames from each chunk

Enabling Incidents#

To enable incident detection, set appropriate --prompt or --system-prompt with clear Yes or No expectation. Incidents will be pushed to the incident Kafka topic.

Example prompt:

--prompt "You are a warehouse monitoring system focused on safety and
efficiency. Analyze the situation to detect any anomalies such as workers
not wearing safety gear, leaving items unattended, or wasting time.
Respond in the following structured format:
Anomaly Detected: Yes/No
Reason: [Brief explanation]"
--system-prompt "Answer the user's question correctly in yes or no"

Troubleshooting#

Common Issues#

Container fails to start

  • Check docker logs <container_name> for error messages

  • Verify GPU access: Ensure NVIDIA Container Toolkit is installed and nvidia-smi works

Out of Memory error

For example, larger models such as Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 may need these settings on NVIDIA RTX 4500, L40S, or Thor systems.

  • Set NVIDIA_VISIBLE_DEVICES=<gpuid> to a free GPU

  • Reduce batch size: Try VLM_BATCH_SIZE=32 Default is auto-calculated

  • Tune GPU memory utilization: Lower VLLM_GPU_MEMORY_UTILIZATION when model startup OOMs. Increase it only when KV-cache capacity is too low and free GPU memory remains.

  • Reduce max sequences: Lower VLLM_MAX_NUM_SEQS

  • Reduce concurrent processes: Lower NUM_VLM_PROCS

  • Reduce max model length on lower-memory GPUs: Reduce VLM_MAX_MODEL_LEN

  • Cap generation length: Lower VLM_MAX_GENERATION_TOKENS (default 16384) to bound peak per-request memory.

  • On Jetson Thor or DGX systems, ensure that the cache cleaner script is running.

Thor workaround for huge-model OOM

When running huge models such as Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on AGX Thor or IGX Thor, allocate host swap before starting the RTVI VLM service:

sudo fallocate -l 96G /swapfile-nemotron
sudo chmod 600 /swapfile-nemotron
sudo mkswap /swapfile-nemotron
sudo swapon --priority 10 /swapfile-nemotron

Use conservative vLLM memory settings in Docker Compose .env or Helm env values:

VLM_MAX_MODEL_LEN=2560
VLLM_GPU_MEMORY_UTILIZATION=0.55

Also ensure that the cache cleaner script is running on Thor.

System memory grows over time

For long-running workloads with many distinct videos or streams, keep the multimodal preprocessor cache disabled.

Docker Compose .env:

VLLM_DISABLE_MM_PREPROCESSOR_CACHE=true

Helm chart env values:

env:
  - name: VLLM_DISABLE_MM_PREPROCESSOR_CACHE
    value: "true"

Restart the service after changing these values. Re-enable the multimodal preprocessor cache only after validating memory behavior for your workload.

Port conflicts

  • Change BACKEND_PORT if port 8000 is already in use

Kafka Connection Issues

  • Use kafka:9092 as bootstrap server when connecting from within Docker network

  • Verify Kafka is running: docker ps | grep kafka

Health Check Failures

  • Check logs with docker compose logs rtvi-vlm

Known Issues#

High concurrency with 8K vision tokens on Jetson Thor / DGX Spark

When running more than 2 concurrent live streams with a high vision-token budget (~8K vision tokens, i.e., 448×448 at 80 frames with Cosmos Reason2), the device may reboot or the microservice may crash due to memory pressure on resource-constrained edge platforms.

  • Recommendation: Limit concurrent live streams to 2 or fewer when using 8K vision tokens on Jetson Thor or DGX Spark.

  • Alternatively, reduce the vision-token budget by lowering the input resolution from 448×448 to 372×372 at 30 frames (~2K vision tokens) using VLM_INPUT_WIDTH=372, VLM_INPUT_HEIGHT=372, VLM_DEFAULT_NUM_FRAMES_PER_SECOND_OR_FIXED_FRAMES_CHUNK=30 to support higher concurrency.

Stream deletion latency under high concurrent load

When multiple live streams are being processed concurrently and the VLM inference latency exceeds the chunk duration (e.g., processing takes longer than the configured chunk_duration), deleting a stream via DELETE /v1/streams/delete/{id} may take longer than expected. The delete operation waits for the in-flight VLM inference request to complete before the stream resources are released.

  • This is expected behavior — the server ensures the current inference cycle completes cleanly before tearing down the stream.

  • The delay is proportional to the VLM inference time at the current load. Reducing concurrency or increasing chunk_duration will reduce delete latency.

Duplicate live-stream / camera IDs rejected with HTTP 409

In 3.2.0, adding a stream with a camera_id (CV API) or streamId that is already registered returns HTTP 409 (DuplicateCameraId / DuplicateStreamId). In 3.1.0 the existing registration was silently overwritten. Update client code to either remove the prior stream first or surface the 409 to the caller.

Version Information#

The API version is v1. Check the service version using the health check endpoints or by examining the OpenAPI schema at /docs.

API Reference