Search Workflow#

The Search Workflow enables natural language queries across video archives to locate specific events, objects, or actions.

Use Cases

Event retrieval from large video archives
Cross-video search for specific objects or actions
Forensic analysis of recorded footage

Estimated Deployment Time: 15-20 minutes

The following diagram illustrates the search workflow architecture:

Key Features of the Vision Agent with Search:

Upload videos to the agent for search.
Semantic search of videos for key actions, events, and object attributes using embedding-based video indexing.
Natural language query support (e.g., “find all instances of forklifts”).
Automatic query decomposition for chat requests. The search agent extracts refined query text, source names, time windows, visual attributes, action intent, image-search context, and result-count hints.
Four search routes: embed search for actions/events, attribute search for visual descriptors, fusion search for queries that combine actions and visual descriptors, and Search by Image for finding visually similar objects from selected bounding boxes on paused video frames.
Filter and retrieve timestamped results using similarity scores, time range, video/stream names, and source types.
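
The four search routes listed above can be pictured as a small decision function. The sketch below is illustrative only: `DecomposedQuery`, its fields, and `select_route` are hypothetical names, and the actual agent relies on an LLM to make this choice rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class DecomposedQuery:
    """Hypothetical summary of what query decomposition found."""
    has_action: bool = False         # query describes an action or event
    has_attributes: bool = False     # query describes visual descriptors
    has_image_context: bool = False  # user selected a bounding box on a paused frame


def select_route(query: DecomposedQuery) -> str:
    """Return an illustrative label for one of the four search routes."""
    if query.has_image_context:
        return "search_by_image"
    if query.has_action and query.has_attributes:
        return "fusion"
    if query.has_attributes:
        return "attribute"
    # Assumption: anything else is treated as an action/event query.
    return "embed"
```

For example, "person in a yellow vest climbing a ladder" combines an action with a visual descriptor, so under these assumptions it would map to the fusion route.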

What’s being deployed#

VSS Agent: Agent service that uses a configured LLM endpoint to route requests and orchestrate tool calls to VSS microservices and model endpoints (LLM/VLM NIMs) to answer questions and generate outputs
VSS Agent UI: Web UI with chat, video upload, and different views
VSS Video IO & Storage (VIOS): Video ingestion, recording, and playback services used by the agent for video access and management
Nemotron LLM (NIM): LLM inference service used for reasoning, tool selection, and response generation
Cosmos Reason2 8B VLM (NIM): VLM inference service used by the critic agent to verify search results (enabled by default)
Phoenix: Observability and telemetry service for agent workflow monitoring
ELK: Elasticsearch, Logstash and Kibana stack to index and search embeddings of video clips
Kafka: A real-time message bus to publish embeddings, to be consumed and indexed by ELK for search
RTVI-Embed: Real Time Video Intelligence Embed Microservice to generate action/event embeddings for videos and text, based on Cosmos-Embed1-448p-Anomaly-Detection
RTVI-CV: Real Time Video Intelligence Computer Vision Microservice to generate object attribute embeddings for videos
Behavior Analytics: Microservice that performs sequential frame analysis for object detection and tracking in videos/streams

Search Workflow Data Flow#

The search workflow has two related paths: ingestion and query execution.

During ingestion, uploaded videos and RTSP streams are registered with VSS Video IO & Storage (VIOS). The agent then sends the media to RTVI-Embed to generate video/action embeddings and to RTVI-CV/Behavior Analytics to generate object-level behavior embeddings. Elasticsearch stores these embeddings in separate indices:

mdx-embed-filtered-* for video/action embeddings used by embed search.
mdx-behavior-* for object behavior embeddings used by attribute search and Search by Image.
mdx-raw-* for raw frame/object data used when frame lookup is enabled for attribute search.
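
The index layout above can be summarized as a lookup from search route to index pattern. This is a minimal sketch: the route labels and the `indices_for_route` helper are hypothetical, and the fusion entry is inferred from the fact that fusion search runs embed search followed by attribute search.

```python
# Index patterns exactly as listed above.
EMBED_INDEX = "mdx-embed-filtered-*"
BEHAVIOR_INDEX = "mdx-behavior-*"
RAW_INDEX = "mdx-raw-*"


def indices_for_route(route, frame_lookup=False):
    """Return the Elasticsearch index patterns a route reads from."""
    if route == "embed":
        return [EMBED_INDEX]
    if route == "attribute":
        # Raw frame/object data is consulted only when frame lookup is enabled.
        return [BEHAVIOR_INDEX, RAW_INDEX] if frame_lookup else [BEHAVIOR_INDEX]
    if route == "search_by_image":
        return [BEHAVIOR_INDEX]
    if route == "fusion":
        # Inferred: embed search first, then attribute search per candidate window.
        return [EMBED_INDEX, BEHAVIOR_INDEX]
    raise ValueError(f"unknown route: {route}")
```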

During query execution, the UI can call either the direct search API or the Vision Agent chat interface:

The Search text box calls /api/v1/search and is optimized for direct embedding search with explicit filters.
The Vision Agent chat interface calls the streaming search_agent through /chat/stream. This mode uses an LLM to decompose the user’s natural language query and automatically select the best search route.
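
As a rough illustration of the direct path, the sketch below builds a POST request for /api/v1/search without sending it. Only the endpoint path comes from this page. The base URL, the JSON field names (`query`, `top_k`, `source_type`), and the `build_search_request` helper are assumptions, so check the API reference for the actual request schema.

```python
import json
import urllib.request


def build_search_request(base_url, query, top_k=5, source_type=None):
    """Build (but do not send) a request to the direct search endpoint."""
    # Field names are placeholders, not the documented request schema.
    payload = {"query": query, "top_k": top_k}
    if source_type is not None:
        payload["source_type"] = source_type
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the agent address and schema are confirmed, the request can be sent with `urllib.request.urlopen(request)`.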

The Vision Agent chat path can run:

Embed search: Generates a text embedding with RTVI-Embed/Cosmos-Embed1-448p-Anomaly-Detection and searches video/action embeddings in Elasticsearch.
Attribute search: Generates text embeddings with RTVI-CV and searches behavior embeddings for visual descriptors such as clothing, color, or carried objects.
Fusion search: Runs embed search first, then runs attribute search over each candidate window and reranks with a fusion score.
Search by Image: When the user starts from a selected bounding box on a paused video frame, the workflow uses the selected object’s embedding from the behavior index to search for visually similar objects.
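
Fusion search, as described above, can be sketched as a two-stage rerank. The weighted sum used here is an assumed scoring rule chosen for illustration. This page does not specify how the fusion score is computed, and `Candidate`, `fusion_rerank`, and the `attribute_search` callback are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start_s: float                # window start, in seconds
    end_s: float                  # window end, in seconds
    embed_score: float            # similarity from embed search
    attribute_score: float = 0.0  # filled in by the attribute pass


def fusion_rerank(candidates, attribute_search, weight=0.5):
    """Score each embed-search window with attribute search, then rerank.

    `attribute_search(start_s, end_s)` is a caller-supplied function that
    returns the best attribute similarity found inside the window.
    """
    for c in candidates:
        c.attribute_score = attribute_search(c.start_s, c.end_s)

    def fused(c):
        # Assumed fusion score: weighted sum of the two similarities.
        return weight * c.embed_score + (1.0 - weight) * c.attribute_score

    return sorted(candidates, key=fused, reverse=True)
```

With equal weights, a window with a moderate action match and a strong attribute match can outrank a window that matches only the action.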

Prerequisites#

Before you begin, ensure all of the prerequisites are met. See Prerequisites for more details.

Service Endpoints#

Once deployed, the following services are available:

Service	URL
VSS UI	`http://<HOST_IP>:7777`
Kibana UI	`http://<HOST_IP>:7777/kibana/app/home#/`
NVStreamer UI	`http://<HOST_IP>:31000/#/dashboard`
VST UI	`http://<HOST_IP>:30888/vst/#/dashboard`
Phoenix UI	`http://<HOST_IP>:7777/phoenix/projects`
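
For scripting against a deployment, the table above can be expressed as a small helper that fills in the host address. The ports and paths are copied from the table, while the `service_urls` function name is only illustrative.

```python
def service_urls(host_ip):
    """Return the service URLs from the table above for a given host."""
    return {
        "VSS UI": f"http://{host_ip}:7777",
        "Kibana UI": f"http://{host_ip}:7777/kibana/app/home#/",
        "NVStreamer UI": f"http://{host_ip}:31000/#/dashboard",
        "VST UI": f"http://{host_ip}:30888/vst/#/dashboard",
        "Phoenix UI": f"http://{host_ip}:7777/phoenix/projects",
    }


if __name__ == "__main__":
    for name, url in service_urls("192.0.2.10").items():
        print(f"{name}: {url}")
```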

Using Skills#

As an alternative to running the workflow manually, you can use VSS Agent Skills from a coding agent such as Claude Code, Codex, or NemoClaw.

First, install the skills as described in Installing Skills and make them accessible to your coding agent.

Step 1: Deploy the Search Agent#

Note

The host must meet the same deployment requirements listed above (supported GPU/hardware for that profile) and must meet the Prerequisites.

Once the agent is loaded with the VSS skills, you can use it to deploy the VSS search agent:

Deploy prompt structure#

Deploy the VSS search profile.

Note

The skill will choose the <platform> to match your system, as detailed in Development Profile GPU Requirements.

What the agent does:

Reads the skill vss-deploy-profile and its search profile reference to determine the deployment recipe, GPU layout, sizing, and service list.
Runs pre-flight checks — auto-detects the repo path, validates GPU/Docker/NVIDIA runtime, and probes NGC_CLI_API_KEY against NGC auth.
Prepares the environment — copies the source .env to generated.env, writes overrides (HOST_IP, EXTERNAL_IP, NGC_CLI_API_KEY, VSS_DATA_DIR, VSS_APPS_DIR), creates the data directory tree with correct permissions.
Generates and normalizes the resolved compose — runs docker compose config, then normalize_resolved_yml.py to strip dangling dependencies, and validates the output.
Deploys with docker compose up -d and monitors until all services report healthy.
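
The "prepares the environment" step above amounts to copying an env file and applying overrides. The sketch below shows one way that could work on the file contents. It is not the skill's actual implementation, and `apply_env_overrides` is a hypothetical helper.

```python
def apply_env_overrides(env_text, overrides):
    """Return env-file text with `overrides` applied.

    Existing KEY=value lines are replaced in place, new keys are appended,
    and comments and blank lines are preserved.
    """
    remaining = dict(overrides)
    out = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Keys that were not present in the source file go at the end.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"
```

The result would be written to generated.env before the docker compose config and docker compose up -d steps listed above.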

Step 2: Video Management#

Note

For instructions on downloading sample data, see Download Sample Data From NGC in the Quickstart guide.

The downloaded videos (following the Quickstart guide) are available at ./sample-data/dev-profile-sample-data/. For this example, you will ask the agent to upload the video warehouse_sample.mp4 by providing the path to the file:

Add the video warehouse_sample.mp4 to search agent

What the agent does:

Reads the skill vss-search-archive to determine the agent-backend ingestion flow (not bare VIOS).
Gets the upload URL from the agent (POST /api/v1/videos) and POSTs the video file to that URL (chunked-upload).
Calls /api/v1/videos/{sensorId}/complete which fans out to RTVI-CV + RTVI-Embed, generating searchable embeddings in Elasticsearch.
Verifies the sensor is registered and online in VIOS.
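
The upload handshake above can be sketched as three calls. The two endpoint paths come from the steps above. Everything else is an assumption made for illustration: the `post` helper, the request body, the response keys `uploadUrl` and `sensorId`, and the use of POST for the completion call.

```python
def ingest_video(post, base_url, filename, file_bytes):
    """Run the three-step ingestion handshake and return the sensor ID.

    `post(url, json=None, data=None)` is a caller-supplied HTTP helper that
    returns the decoded JSON response as a dict.
    """
    base = base_url.rstrip("/")
    # 1. Ask the agent backend for an upload URL (body shape is a placeholder).
    created = post(f"{base}/api/v1/videos", json={"filename": filename})
    # 2. Send the file to that URL; a real client would send it in chunks.
    post(created["uploadUrl"], data=file_bytes)
    # 3. Mark the upload complete, which triggers RTVI-CV and RTVI-Embed.
    sensor_id = created["sensorId"]
    post(f"{base}/api/v1/videos/{sensor_id}/complete")
    return sensor_id
```

Injecting `post` keeps the sketch independent of any particular HTTP library and makes the call sequence easy to check with a stub.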

Ingestion may take up to a few minutes, depending on the size of the video(s).

Step 3: Search with a simple query#

Your agent can search videos given natural-language queries, without requiring a chat UI/interface. Use its search capability to find all instances of forklifts in the uploaded video warehouse_sample.mp4 by using the following prompt:

Simple search prompt#

Find all instances of forklifts in the sample warehouse video.

What the agent does:

Reads the skill to determine the search API.
Checks if the video is already registered in VIOS (sensor/list). If not, ingests it using the agent backend’s chunked-upload handshake.
Fires the natural-language query via POST /generate on the agent backend, which decomposes it into embed search and optionally attribute search.
The agent’s critic (VLM) verifies top results against decomposed criteria and returns confirmed/rejected verdicts.
Presents results as a table with time ranges, similarity scores, critic verdicts, and per-criteria breakdown.

The Critic column shows the critic agent’s verdict for each result — confirmed or rejected — while the Criteria column lists the individual conditions extracted from the query and whether each was satisfied.
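
To make the table layout concrete, the sketch below renders results with a Critic column and filters to confirmed hits. The result keys (`start`, `end`, `score`, `critic`) and both helper functions are hypothetical and may differ from the agent's actual response.

```python
def format_results(results):
    """Render search results as a plain-text table."""
    header = f"{'Start':<10}{'End':<10}{'Score':<8}Critic"
    rows = [
        f"{r['start']:<10}{r['end']:<10}{r['score']:<8.2f}{r['critic']}"
        for r in results
    ]
    return "\n".join([header] + rows)


def confirmed_only(results):
    """Keep only results the critic marked as confirmed."""
    return [r for r in results if r["critic"] == "confirmed"]


if __name__ == "__main__":
    sample = [
        {"start": "00:00:05", "end": "00:00:12", "score": 0.81, "critic": "confirmed"},
        {"start": "00:01:30", "end": "00:01:41", "score": 0.64, "critic": "rejected"},
    ]
    print(format_results(sample))
```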

Step 4: Search with additional filters#

To search with additional filters:

I have a video sample-warehouse-ladder.mp4. I need to find instances of a person climbing a ladder with source type video_file, video source sample-warehouse-ladder.mp4, and top 5 results.

Follow-up questions related to the search results can also be asked in the same coding-agent conversation. For example:

What are the durations of the top 5 results?
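
A follow-up like this one comes down to subtracting start times from end times. The sketch below assumes each result carries ISO 8601 `start` and `end` timestamps. Those key names, the timestamp format, and the `clip_durations` helper are placeholders, not the agent's documented output.

```python
from datetime import datetime


def clip_durations(results, top_n=5):
    """Return the duration in seconds of the first `top_n` results."""
    durations = []
    for r in results[:top_n]:
        start = datetime.fromisoformat(r["start"])
        end = datetime.fromisoformat(r["end"])
        durations.append((end - start).total_seconds())
    return durations
```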

Step 5: Delete Videos#

To remove an uploaded video:

Delete a particular video#

Delete the video sample-warehouse-ladder.mp4.

Step 6: Teardown the agent#

To tear down the search agent, issue the following prompt:

Teardown the VSS search profile.

This will stop and remove the agent containers.

Changing Embedding Models#

Real-Time Embedding#

The Real-Time Embedding microservice (RT-Embed) supports the Cosmos-Embed1 model, a joint video-text embedder. Cosmos-Embed1-448p-anomaly-detection is the default variant deployed with the VSS search profile.

To use a different model, for example Cosmos-Embed1-448p, set the RTVI_EMBED_MODEL and MODEL_PATH environment variables as follows:

RTVI_EMBED_MODEL=cosmos-embed1-448p
MODEL_PATH=git:https://huggingface.co/nvidia/Cosmos-Embed1-448p

RTVI_EMBED_MODEL — Identifier for the embedding model used by RT-Embed
MODEL_PATH — HuggingFace repository URL from which RT-Embed downloads the model at startup

Real-Time Object Embedding#

The Real-Time Video Intelligence CV microservice (RTVI-CV) supports CLIP-style image (vision encoder) and text embedder models. SigLIP2 is the default model deployed with the VSS search profile.

To use a different model, for example RADIO-CLIP, set the VISION_ENCODER_MODEL and VISION_ENCODER_VERSION environment variables as follows:

VISION_ENCODER_MODEL=radio-clip
VISION_ENCODER_VERSION=v1.0

VISION_ENCODER_MODEL — Identifier for the vision encoder model used by RTVI-CV
VISION_ENCODER_VERSION — The release version on NGC catalog

Output Embedding Dimension

The object embeddings are stored in Elasticsearch as vector embeddings, and the vector dimension must be specified at deployment time. The output embedding dimension is 1152 for SigLIP2 and 1536 for RADIO-CLIP. Set it with the environment variable ELASTICSEARCH_RTVI_CV_EMBEDDINGS_DIM:

ELASTICSEARCH_RTVI_CV_EMBEDDINGS_DIM=1536

ELASTICSEARCH_RTVI_CV_EMBEDDINGS_DIM — Used by RTVI-CV for object embeddings (default: SigLIP v2, 1152). If you change the RTVI-CV model, set this to the new model’s embedding dimension.
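
Because the model and the index dimension must change together, it can help to derive all three variables from one table. This is a minimal sketch using the dimensions stated above. The `siglip2` key is an illustrative label rather than a confirmed VISION_ENCODER_MODEL value, and `rtvi_cv_env` is a hypothetical helper.

```python
# Embedding dimensions as stated above. "radio-clip" matches the
# VISION_ENCODER_MODEL value shown earlier; "siglip2" is an assumed label.
EMBEDDING_DIMS = {"siglip2": 1152, "radio-clip": 1536}


def rtvi_cv_env(model, version):
    """Return matching RTVI-CV environment overrides for a known model."""
    if model not in EMBEDDING_DIMS:
        raise KeyError(f"no known embedding dimension for {model!r}")
    return {
        "VISION_ENCODER_MODEL": model,
        "VISION_ENCODER_VERSION": version,
        "ELASTICSEARCH_RTVI_CV_EMBEDDINGS_DIM": str(EMBEDDING_DIMS[model]),
    }
```

Calling `rtvi_cv_env("radio-clip", "v1.0")` yields the three values used in the RADIO-CLIP example above.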

Known Issues#

A race condition between RTVI-embed and LLM NIM during deployment can result in an unhealthy state for the RTVI-embed container. To resolve this:
1. Stop the LLM NIM.
2. Wait for RTVI-embed to become healthy.
3. Restart LLM NIM.
Queries with negative intent (e.g., “people without a yellow hat”) may return the same results as positive intent queries (e.g., “people with a yellow hat”).
The agent may occasionally return false positives (i.e., results that are not relevant to the query).
Queries with a single word (e.g., “person”) may return no results.
The duration of video clips in search results may be longer than the displayed duration.
‘Description’ is empty in the response generated by the Vision Agent chat interface.
Uploading a video through the Agent chat sidebar in the Search profile returns an Internal Server Error. An upload through the chat sidebar also sends a query to the agent, and the top-level search agent does not currently support handling it.
By default, the timestamps for uploaded videos start from 2025-01-01 00:00:00.
Renaming an RTSP stream after it has been added is not supported. To change the name, delete the existing entry from the Video Management tab and re-add the RTSP stream with the new name.
Deleting an RTSP stream that has ended may cause subsequent stream additions or video uploads to fail.
When a source video is removed from NVStreamer, the corresponding RTSP stream continues to appear as a live input in the VSS UI Video Management tab.
When the critic agent is enabled and the VLM service is not available, the search results may not appear in the main window.
For RTSP streams with H265 encoding that have been removed, the thumbnail may not be visible in the VSS UI. See Image capture failure for more details.
An ‘Index not found’ error may occur when there are no videos corresponding to the selected source type.
When uploading a video to VIOS, if the video is larger than the maximum upload size, the upload may fail. See Why do large file uploads to VIOS fail? for more details.
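
The workaround for the first issue in this list (the RTVI-embed and LLM NIM race condition) can be scripted. The sketch below assumes a Docker-based deployment in which the RTVI-embed container defines a health check. The container names are supplied by the caller, and `recover_rtvi_embed` and the `run` helper are hypothetical.

```python
import time


def recover_rtvi_embed(run, llm_container, embed_container,
                       timeout_s=600, poll_s=5, sleep=time.sleep):
    """Stop the LLM NIM, wait for RTVI-embed to be healthy, restart the LLM NIM.

    `run(args)` executes a command and returns its stdout as a string.
    """
    run(["docker", "stop", llm_container])
    waited = 0
    while True:
        status = run([
            "docker", "inspect",
            "--format", "{{.State.Health.Status}}",
            embed_container,
        ]).strip()
        if status == "healthy":
            break
        if waited >= timeout_s:
            raise TimeoutError(f"{embed_container} not healthy after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s
    run(["docker", "start", llm_container])
```

A suitable `run` could be a thin wrapper such as `lambda args: subprocess.run(args, capture_output=True, text=True, check=True).stdout`.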
