2D Grounding#

The 2D Grounding pipelines use a vision-language model (VLM) to annotate images with referring expressions and the bounding boxes that ground them. A referring expression is a short phrase that identifies one object in an image, such as “the red car on the left.” These pipelines target vision-language grounding datasets, where each annotation pairs a natural-language phrase with the image region that it describes.

The Auto-Label service provides two complementary 2D Grounding pipelines:

  • image_grounding starts from an image and a caption, extracts the referring expressions contained in the caption, and grounds each expression to a bounding box.

  • image_referring_expression starts from an image and its bounding boxes, generates a referring expression for each region, and verifies that each expression grounds back to the correct box.

Both pipelines write a single annotations.jsonl file in a shared schema, so you can mix their outputs in the same dataset. Select a pipeline through the autolabel_type field, then configure the matching image_grounding or image_referring_expression block.

Quickstart with a TAO Skill#

The quickest way to run either pipeline is through its TAO skill, which an agent runs for you. The skill collects the input files and the vision-language model endpoint, including Gemini, NVIDIA NIM, a self-hosted TAO inference microservice, and vLLM, then runs the pipeline end to end. Use the data/tao-generate-image-grounding skill for the image grounding pipeline and the data/tao-generate-referring-expressions skill for the image referring expression pipeline. For example:

“Run the data/tao-generate-image-grounding skill on captions.jsonl with images under /data/images , using Gemini, and write the grounded annotations to /data/out .”

Refer to data/tao-generate-image-grounding/SKILL.md and data/tao-generate-referring-expressions/SKILL.md for the supported endpoints, workflow steps, and specification keys. The rest of this page describes the configuration and the command-line path for when you need finer control.

VLM Backend Configuration#

Both pipelines call a vision-language model through the vlm block. You choose the backend with the vlm.backend field and then populate the matching sub-block. The backend field accepts gemini for the Google Gemini API or openai for any OpenAI-compatible endpoint.

Parameter

Datatype

Description

backend

string

VLM backend to use; valid options are gemini and openai. Default: gemini

gemini

collection

Google Gemini API configuration; refer to the Gemini table below

openai

collection

OpenAI-compatible endpoint configuration; refer to the OpenAI table below

The gemini sub-block configures the Google Gemini backend.

Parameter

Datatype

Description

api_key

string

Google Gemini API key; if empty, the pipeline reads the GOOGLE_API_KEY environment variable. Default: empty

model

string

Gemini model name. Default: gemini-3.1-flash-lite-preview

media_resolution

string

Media resolution for image input; set MEDIA_RESOLUTION_HIGH for grounding accuracy. Default: MEDIA_RESOLUTION_LOW

temperature

float

Sampling temperature; lower values produce more deterministic output. Default: 0.3

max_output_tokens

int

Maximum number of tokens in the model response. Default: 8192

timeout

int

Request timeout in seconds. Default: 120

The openai sub-block configures any endpoint that exposes an OpenAI-compatible API.

Parameter

Datatype

Description

api_key

string

API key for the OpenAI-compatible endpoint. Default: empty

base_url

string

Base URL of the OpenAI-compatible endpoint. Default: empty

model_name

string

Model name to request from the endpoint. Default: empty

temperature

float

Sampling temperature; lower values produce more deterministic output. Default: 0.7

max_tokens

int

Maximum number of tokens in the model response. Default: 4096

timeout

int

Request timeout in seconds. Default: 60

Image Grounding#

The image_grounding pipeline turns image-caption pairs into grounded referring expressions. It is the right choice when you already have captions for your images and want bounding boxes for the phrases inside those captions.

Data Input for Image Grounding#

The pipeline reads an input JSONL file with one JSON object per line. Each object must contain an image_path and a caption field. The width, height, and image_id fields are optional; the pipeline fills them in when they are missing.

Configuring the Specification File for Image Grounding#

Set autolabel_type to image_grounding and populate the image_grounding block. The block contains the shared vlm configuration described in VLM Backend Configuration, plus the workflow and data blocks described here.

The workflow block controls pipeline execution.

Parameter

Datatype

Description

steps

list

Pipeline steps to run; 0 extracts expressions and 1 grounds them. Default: 0 and 1

max_workers

int

Maximum number of concurrent workers for per-sample API calls. Default: 4

force_reprocess

bool

Whether to ignore cached step outputs and reprocess from scratch. Default: False

The data block specifies the input file.

Parameter

Datatype

Description

input_jsonl

string

Path to the input JSONL file with image_path and caption fields

image_root

string

Optional prefix used to resolve relative image_path values. Default: empty

How the Image Grounding Pipeline Works#

The image grounding pipeline chains two vision-language model steps, as the following figure shows.

../../../_images/image_grounding_pipeline.svg

Two-step image grounding pipeline#

The pipeline runs the two steps in sequence and enriches one annotation record per image as it goes:

  1. Step 0, expression extraction: The VLM reads each image and its caption and extracts the referring expressions that the caption contains.

  2. Step 1, phrase grounding: The VLM grounds each extracted expression to one or more bounding boxes in the image.

After the final step, the pipeline copies the enriched records to annotations.jsonl in results_dir.

Image Referring Expression#

The image_referring_expression pipeline generates referring expressions for images that already have bounding boxes. It is the right choice when you have detection labels and want natural-language descriptions that ground back to each box.

Data Input for Image Referring Expression#

The pipeline reads images from a directory and their bounding boxes from matching KITTI-format label files. You can also resume a run by supplying a previously generated annotations file through input_annotations_jsonl.

Configuring the Specification File for Image Referring Expression#

Set autolabel_type to image_referring_expression and populate the image_referring_expression block. The block contains the shared vlm configuration described in VLM Backend Configuration, plus the workflow and data blocks described here.

The workflow block controls pipeline execution.

Parameter

Datatype

Description

steps

list

Pipeline steps to run; 0 extracts region expressions, 1 captions the image, 2 grounds the expressions, and 3 double-checks them. Default: 0, 1, 2, and 3

max_workers

int

Maximum number of concurrent workers for per-image API calls within each step. Default: 4

force_reprocess

bool

Whether to ignore cached step outputs and reprocess from scratch. Default: False

output_format

string

Output format to write; jsonl writes only the unified schema, legacy writes only the per-image text files, and both writes both. Default: jsonl

The data block specifies the input images and labels.

Parameter

Datatype

Description

image_dir

string

Directory that contains the input images (.jpg or .png)

kitti_label_dir

string

Directory that contains the KITTI-format bounding box labels

input_annotations_jsonl

string

Optional unified annotations file used to seed the pipeline when resuming or running on precomputed regions. Default: empty

How the Image Referring Expression Pipeline Works#

The image referring expression pipeline runs the region and caption steps in parallel, then merges them before an optional verification pass, as the following figure shows.

../../../_images/image_referring_expression_pipeline.svg

Image referring expression pipeline, with steps 0 and 1 running in parallel#

The pipeline seeds one annotation record per image from the image directory and the KITTI labels, then runs the following steps:

  1. Step 0, region expression, and step 1, image caption: The VLM describes each bounding box region with a candidate expression in step 0, and writes a caption for the whole image in step 1. The pipeline runs these two steps in parallel, because both depend only on the seed records.

  2. Step 2, grounding expression: The pipeline merges the region descriptions and the image caption, then calls the VLM to produce the final grounded expressions.

  3. Step 3, double check: The VLM verifies each expression against its region and updates the expression in place when it does not match.

After the final step, the pipeline writes annotations.jsonl to results_dir in the format selected by workflow.output_format.

Output Schema#

Both pipelines converge on the same unified schema, so downstream tools can consume their outputs interchangeably. The pipeline writes one JSON object per image to annotations.jsonl, with these fields:

  • image_id, image_path, width, and height identify the image.

  • caption and cleaned_caption hold the original and normalized captions.

  • regions lists the bounding boxes, each with a bbox, a type, a color, and a description.

  • expressions lists the referring expressions, each with an expression_id, the expression text, the grounded instances (bounding boxes with scores), and a verified flag.

  • source records which pipeline produced the record, and pipeline_steps records which steps ran.

When output_format is legacy or both, the image referring expression pipeline also writes per-image text files that match the original 2D Data Engine layout, so existing downstream consumers continue to work without changes.

Running 2D Grounding#

Both pipelines run through the generate task. The following example runs the image grounding pipeline:

export GOOGLE_API_KEY=<your-api-key>
auto_label generate \
    -e /path/to/image_grounding.yaml \
    results_dir=/path/to/results

The following example shows a complete specification file for the image grounding pipeline:

results_dir: /path/to/results
autolabel_type: "image_grounding"

image_grounding:
  vlm:
    backend: "gemini"
    gemini:
      api_key: ""                       # Set the GOOGLE_API_KEY environment variable or fill this in
      model: "gemini-3.1-flash-lite-preview"
      media_resolution: "MEDIA_RESOLUTION_HIGH"
      temperature: 0.3
      max_output_tokens: 8192
      timeout: 120
  workflow:
    steps: ["0", "1"]
    max_workers: 4
    force_reprocess: false
  data:
    input_jsonl: /path/to/input.jsonl   # One object per line with image_path and caption
    image_root: ""                      # Optional prefix for relative image_path values

To run the image referring expression pipeline, set autolabel_type to image_referring_expression, replace the image_grounding block with an image_referring_expression block, and point data.image_dir and data.kitti_label_dir at your images and labels.

Note

Both pipelines call a hosted vision-language model. Set the GOOGLE_API_KEY environment variable, or the API key for your OpenAI-compatible endpoint, before you run the generate task.