Video Reasoning Annotation

The Video Reasoning Annotation pipeline generates reasoning-style annotations for a collection of videos. Instead of bounding boxes or masks, it produces question-answer pairs with reasoning traces that describe what happens in each video, why it happens, and when. The pipeline targets video question-answering and video reasoning datasets, and it emits its output in the tao-vl-reason-v1.0 format, a schema defined by the Dataset Annotation Format Toolkit (DAFT).

The pipeline drives two model backends. A vision-language model (VLM) handles the steps that read video frames, such as filtering, captioning, and highlight extraction. A language model (LLM) handles the text-only steps, such as description synthesis and question-answer generation. You configure the two backends independently, so you can pair a video-capable model with a separate text model.

Quickstart with a TAO Skill

The quickest way to run this pipeline is through the data/tao-generate-video-reasoning-annotations skill, which an agent runs for you. The skill collects the video source, the target domain, and the VLM and LLM endpoints, then runs the full pipeline. Supported endpoints include Gemini, NVIDIA NIM, a self-hosted TAO inference microservice, and vLLM. For example:

“Run the data/tao-generate-video-reasoning-annotations skill on the videos under /data/videos, using Gemini, and write the question-answer dataset to /data/out.”

Refer to data/tao-generate-video-reasoning-annotations/SKILL.md for the supported endpoints, workflow steps, and specification keys. The rest of this page describes the configuration and the command-line path for when you need finer control.

Data Input for Video Reasoning Annotation

The pipeline reads videos from a directory, from a list of JSONL files, or from both. You must provide at least one of data.video_root or data.input_jsonl_files; when you provide both, the pipeline merges the two video lists.

When you use data.input_jsonl_files, each file holds one JSON object per line with a video_path field (a key named video is also accepted in its place):

{"video_path": "/absolute/path/to/video.mp4"}

Additional fields are allowed. When you set data.filter_field, the pipeline includes only the entries whose named boolean field is true.
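
As a rough illustration of the input rules above, the following Python sketch reads JSONL entries, accepts either video_path or video as the key, applies an optional boolean filter field, and merges the result with the files found under a directory. The helper names (load_jsonl_videos and collect_videos), the extension list, and the de-duplication step are assumptions made for this example; they do not describe the pipeline's actual implementation.

```python
import json
from pathlib import Path

# Assumption: the extensions the pipeline scans for are not documented here.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def load_jsonl_videos(jsonl_path, filter_field=None):
    """Return video paths from one JSONL file, one JSON object per line."""
    paths = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # Keep only entries whose named boolean field is true.
            if filter_field and entry.get(filter_field) is not True:
                continue
            path = entry.get("video_path") or entry.get("video")
            if path:
                paths.append(path)
    return paths


def collect_videos(video_root=None, input_jsonl_files=(), filter_field=None):
    """Merge videos from a directory and from JSONL files, without duplicates."""
    if not video_root and not input_jsonl_files:
        raise ValueError("Provide video_root, input_jsonl_files, or both")
    videos = []
    if video_root:
        videos += sorted(
            str(p) for p in Path(video_root).rglob("*")
            if p.suffix.lower() in VIDEO_EXTENSIONS
        )
    for jsonl_path in input_jsonl_files:
        videos += load_jsonl_videos(jsonl_path, filter_field)
    return list(dict.fromkeys(videos))  # preserve order, drop repeats
```

Running a check like this before you launch the pipeline can catch a missing key or an over-aggressive filter_field early, before any model calls are made.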

Backend Configuration

The pipeline reads two backend blocks: vlm for the video steps and llm for the text steps. Both blocks share the same structure as the 2D Grounding pipelines: a backend field that selects gemini or openai, and matching gemini and openai sub-blocks. For the full field reference, refer to VLM Backend Configuration.

The default backend for both blocks is gemini with the gemini-3.1-flash-lite-preview model. The vlm block defaults its media_resolution to MEDIA_RESOLUTION_LOW, which keeps video token usage low; raise it when you need finer visual detail.

Workflow Configuration

The workflow block controls which steps run, how the pipeline classifies each video, and how it chunks and samples long videos.

Parameter	Datatype	Description
`steps`	list	Pipeline steps to run, in order. Default: `0`, `1a`, `1b`, `1c`, `2`, `3`, and `4`
`mode`	string	Pipeline mode; `auto` lets the VLM classify each video, `anomaly` treats every video as containing an anomaly, and `normal` treats every video as normal activity. Default: `auto`
`max_workers`	int	Maximum number of concurrent workers for video processing. Default: 4
`max_video_length_sec`	int	Maximum video length, in seconds, that the pipeline processes. Default: 300
`chunk_duration_options`	list	Candidate chunk durations, in seconds, that the pipeline chooses from when splitting a video. Default: 5, 10, 15, 20, and 30
`max_chunks`	int	Maximum number of chunks per video. Default: 10
`highlight_before_sec`	float	Seconds to include before the anomaly timestamp in a highlight clip. Default: 3.0
`highlight_after_sec`	float	Seconds to include after the anomaly timestamp in a highlight clip. Default: 3.0
`long_video_threshold_sec`	int	Duration, in seconds, above which the pipeline samples a video as frames instead of passing it whole. Default: 60
`long_video_sample_fps`	float	Frame sampling rate for long videos, in frames per second. Default: 0.5
`long_video_max_frames`	int	Maximum number of frames to sample from a long video. Default: 60
`qa_types`	list	Question-answer task types to generate. Default: all eight task types listed after this table

By default, the pipeline generates eight question-answer task types: mcq (multiple choice), bcq (binary choice), open_qa (open-ended), causal_linkage, temporal_localization, temporal_event_desc, scene_description, and event_summary. Trim the qa_types list to generate a subset.
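
To make the chunking and sampling parameters in the table above more concrete, the following Python sketch shows one plausible way they could interact. This page does not state how the pipeline chooses among chunk_duration_options or how it applies long_video_max_frames, so both rules here are assumptions: the sketch picks the shortest candidate duration that keeps the chunk count within max_chunks, and it spreads the frame budget evenly when the cap is reached. The function names pick_chunk_duration and sample_timestamps are hypothetical.

```python
import math


def pick_chunk_duration(video_sec, options=(5, 10, 15, 20, 30), max_chunks=10):
    """Pick the shortest candidate duration that fits within max_chunks.

    Assumption: falls back to the longest candidate when none fits.
    """
    for duration in sorted(options):
        if math.ceil(video_sec / duration) <= max_chunks:
            return duration
    return max(options)


def sample_timestamps(video_sec, threshold_sec=60, sample_fps=0.5, max_frames=60):
    """Return frame timestamps for a long video, or None for a short one."""
    if video_sec <= threshold_sec:
        return None  # short video: pass it to the VLM whole
    step = 1.0 / sample_fps
    count = min(int(video_sec * sample_fps), max_frames)
    if count == max_frames:
        # Assumption: spread the capped frame budget evenly over the video.
        step = video_sec / max_frames
    return [round(i * step, 3) for i in range(count)]
```

Under these assumptions and the default values, a 120-second video would be split into 15-second chunks, and a 300-second video would be sampled at 60 frames spaced 5 seconds apart, because 0.5 frames per second would otherwise exceed the 60-frame cap.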

Output Configuration

The following top-level fields of the video_reasoning_annotation block control the output metadata and let you override the prompt templates.

Parameter	Datatype	Description
`license`	string	License string written to `metadata.license` in the `tao-vl-reason-v1.0` output, such as `CC-BY-4.0`. Default: empty
`description_extra`	string	Extra text appended to the per-task description in the output metadata; useful for naming the dataset or its source. Default: empty
`prompts_module`	string	Optional Python module path that supplies custom prompt templates. Default: empty

How the Pipeline Works

The pipeline moves a video through a sequence of vision-language and language model steps, as the following figure shows.

Figure: Video reasoning annotation pipeline, from video filtering through per-task output

The pipeline runs its steps in order and writes the per-step output under results_dir:

Step 0, filtering and classification: The VLM checks whether each video is suitable for analysis. In auto mode, it also classifies whether the video contains an anomaly or normal activity.
Step 1a, captioning: The VLM generates global and dense captions for the video.
Step 1b, chunking: The pipeline splits the video into chunks and captions each chunk.
Step 1c, highlight: The pipeline extracts and captions the anomaly moment as a highlight clip.
Step 2, description synthesis: The LLM combines the captions into structured descriptions.
Step 3, question-answer generation: The LLM generates the configured question-answer task types, each with a reasoning trace.
Step 4, output parsing: The pipeline parses the question-answer output and writes one tao-vl-reason-v1.0 JSON file per task type under step_4_output.

In auto mode, the pipeline always runs step 0 so that it can classify each video, and it runs step 1c after step 1b so that it can capture any anomaly highlight.
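
For the highlight step (1c), the highlight_before_sec and highlight_after_sec settings define a window around the anomaly timestamp. The following Python sketch shows the arithmetic, assuming the window is clamped to the start and end of the video. The clamping behavior and the function name highlight_window are assumptions for illustration; this page does not describe how the pipeline handles anomalies near the video boundaries.

```python
def highlight_window(anomaly_sec, video_sec, before_sec=3.0, after_sec=3.0):
    """Return (start, end) of the highlight clip, in seconds.

    Assumption: the window is clamped to the bounds of the video.
    """
    start = max(0.0, anomaly_sec - before_sec)
    end = min(video_sec, anomaly_sec + after_sec)
    return start, end
```

With the default values, an anomaly at 10 seconds in a 60-second video would yield a clip from 7 to 13 seconds, while an anomaly at 1 second would yield a shorter clip that starts at 0.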

Output Schema

Step 4 writes one JSON file per task type, each in the tao-vl-reason-v1.0 envelope:

{
  "format": "tao-vl-reason-v1.0",
  "metadata": {
    "type": "annotation",
    "task": "<task>",
    "date": "...",
    "description": "...",
    "license": "..."
  },
  "media_root": "<video_root>",
  "items": [
    {"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}
  ]
}

The metadata block records the task type, the generation date, the description (with any description_extra text appended), and the license string. Each entry in items holds the source video_id, the generated question and answer, and the reasoning trace. Because tao-vl-reason-v1.0 is a DAFT schema, you can validate or convert the output with the nvidia-tao-daft toolkit.
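
For a quick sanity check without any toolkit, a short script that uses only the Python standard library can confirm that a step 4 file matches the envelope shown above. In this sketch, the set of required keys is inferred from the example envelope on this page and might not match the full DAFT schema, which remains the authoritative definition. The function name check_envelope is hypothetical.

```python
import json

# Assumption: required keys are inferred from the example envelope above.
REQUIRED_METADATA = ("type", "task", "date", "description", "license")
REQUIRED_ITEM_KEYS = ("video_id", "question", "answer", "reasoning")


def check_envelope(doc):
    """Return a list of problems found in one parsed output file."""
    problems = []
    if doc.get("format") != "tao-vl-reason-v1.0":
        problems.append(f"unexpected format: {doc.get('format')!r}")
    metadata = doc.get("metadata", {})
    for key in REQUIRED_METADATA:
        if key not in metadata:
            problems.append(f"metadata missing {key!r}")
    for index, item in enumerate(doc.get("items", [])):
        missing = [k for k in REQUIRED_ITEM_KEYS if k not in item]
        if missing:
            problems.append(f"items[{index}] missing {missing}")
    return problems


def check_file(path):
    """Load one JSON output file and return its list of problems."""
    with open(path, encoding="utf-8") as handle:
        return check_envelope(json.load(handle))
```

An empty list from check_file means the file has the expected top-level shape; it does not replace full schema validation.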

Running Video Reasoning Annotation

The pipeline runs through the generate task:

export GOOGLE_API_KEY=<your-api-key>
auto_label generate \
    -e /path/to/video_reasoning_annotation.yaml \
    results_dir=/path/to/results

The following example shows a complete specification file:

results_dir: /path/to/results
autolabel_type: "video_reasoning_annotation"

video_reasoning_annotation:
  vlm:
    backend: "gemini"
    gemini:
      api_key: ""                       # Set the GOOGLE_API_KEY environment variable or fill this in
      model: "gemini-3.1-flash-lite-preview"
      media_resolution: "MEDIA_RESOLUTION_LOW"
      temperature: 0.3
      max_output_tokens: 8192
      timeout: 120
  llm:
    backend: "gemini"
    gemini:
      api_key: ""
      model: "gemini-3.1-flash-lite-preview"
      temperature: 0.3
      max_output_tokens: 8192
      timeout: 120
  workflow:
    steps: ["0", "1a", "1b", "1c", "2", "3", "4"]
    mode: "auto"
    max_workers: 4
    max_video_length_sec: 300
    chunk_duration_options: [5, 10, 15, 20, 30]
    max_chunks: 10
    qa_types: ["mcq", "bcq", "open_qa", "causal_linkage", "temporal_localization", "temporal_event_desc", "scene_description", "event_summary"]
  data:
    video_root: /path/to/videos        # Provide video_root, input_jsonl_files, or both
    input_jsonl_files: []
    filter_field: null
  license: ""
  description_extra: ""
  prompts_module: ""

Note

The pipeline calls hosted vision-language and language models. Set the GOOGLE_API_KEY environment variable, or the API key for your OpenAI-compatible endpoint, before you run the generate task.