Co-DETR#

Warning

Co-DETR is an experimental model in TAO 7.0.1. It is not extensively tested and its configuration, behavior, and supported features may change in a future release. Co-DETR is a PyTorch-only model: there is no ONNX export, TensorRT engine generation, or DeepStream deployment path for this model.

Co-DETR (Collaborative DETR) is an experimental object-detection model included in TAO. It augments a DETR-style detector with auxiliary “collaborative” heads (for example, an ATSS-style dense head) that provide additional supervision during training to improve the convergence and accuracy of the DETR query head. At inference time, only the primary DETR query head is used to produce detections.

Co-DETR supports the following tasks:

  • train

  • evaluate

  • inference

Each task is explained in detail in the following sections.

Note

Because Co-DETR is PyTorch-only, the export, gen_trt_engine, and DeepStream deployment tasks documented for other DETR-family models are not supported.

The codetr console command is provided by the TAO PyTorch container. Launch it with the tao_pt runner (or the TAO Launcher) before running any of the commands on this page. Refer to the TAO Toolkit Quick Start for instructions on installing the launcher and pulling the container.

Getting Pretrained Weights#

Co-DETR is trained by fine-tuning from a pretrained backbone. Two spec fields control where the starting weights come from:

  • model.pretrained_backbone_path: an optional path to a backbone-only checkpoint that initializes the selected model.backbone.

  • train.pretrained_model_path: an optional path to a full Co-DETR (or compatible DETR-family) checkpoint to continue fine-tuning from.

The ViT spec files use the dedicated vit_large_codetr backbone, which is derived from the same ViT-Large NV-DINOv2/DINOv2 weights used by DINO. Download a supported backbone checkpoint from the TAO pretrained models on NGC and point model.pretrained_backbone_path at the downloaded .pth file. For example:

ngc registry model download-version \
    nvidia/tao/pretrained_dinov2_classification_imagenet:vit_large_patch14_dinov2 \
    --dest /path/to/weights

The downloaded checkpoint can then be referenced from the spec file:

model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth

Note

The exact backbone artifact and version depend on the backbone you select. Use a ViT-Large NV-DINOv2/DINOv2 backbone checkpoint for the vit_large_codetr backbone, and the matching Swin, FAN, ResNet, or EfficientViT checkpoint for the other supported backbones. If pretrained_backbone_path is left as null, the backbone is initialized randomly, which typically requires substantially longer training.

Data Input for Co-DETR#

Co-DETR reuses the DINO data pipeline and expects directories of images for training or validation along with annotated JSON files in COCO format.

Note

The category_id values in your COCO JSON file should start from 1 because 0 is reserved for the background class. In addition, set dataset.num_classes to the largest category_id + 1. For instance, COCO uses only 80 classes, but its largest category_id is 90, so dataset.num_classes should be set to 91. When dataset.contiguous_labels is set to True, the category IDs are remapped to a contiguous range and num_classes reflects the number of foreground classes (for example, 80 for COCO).
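The rule in the note above can be expressed as a small helper. This is a minimal sketch, not part of TAO: the function name `required_num_classes` and the example IDs are hypothetical, and it only restates the arithmetic described in the note.

```python
def required_num_classes(category_ids, contiguous_labels=False):
    """Return the dataset.num_classes value implied by the note above."""
    ids = sorted(set(category_ids))
    if not ids:
        raise ValueError("no categories found")
    if ids[0] < 1:
        raise ValueError("category_id must start from 1; 0 is the background class")
    if contiguous_labels:
        # IDs are remapped to a contiguous range, so count the foreground classes.
        return len(ids)
    # Otherwise num_classes must cover the largest raw ID plus background 0.
    return ids[-1] + 1


# Hypothetical category IDs with gaps in the range, as in COCO.
example_ids = [1, 2, 3, 5, 90]
print(required_num_classes(example_ids))                          # 91
print(required_num_classes(example_ids, contiguous_labels=True))  # 5
```

For a real dataset, the IDs can be collected from the `categories` list of the COCO JSON file, where each entry carries an `id` field.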

Creating an Experiment Specification File#

The training experiment specification file for Co-DETR includes model, train, and dataset parameters. Co-DETR reuses DINO’s dataset and train schemas, so those sections follow the same format as the DINO documentation.

The commands on this page refer to two environment variables that you should set to your own paths before running them:

# Path to the experiment specification (YAML) file shown below.
export DEFAULT_SPEC=/path/to/codetr_train.yaml
# Directory where training, evaluation, and inference outputs are written.
export RESULTS_DIR=/path/to/results

Here is a complete, copy-pasteable example specification file for training a Co-DETR model with a vit_large_codetr backbone on a COCO-format dataset. Replace the /path/to/... values with the locations of your data, backbone weights, and output directory.

results_dir: /path/to/results
model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: true
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8
dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
train:
  num_gpus: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  activation_checkpoint: true
  clip_grad_norm: 0.1
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65
evaluate:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.0
inference:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640

Note

The dataset block above sets num_classes: 80 together with contiguous_labels: true, which remaps the COCO category IDs to the 80 contiguous foreground classes. If you instead leave contiguous_labels at its default (False), set num_classes to max class_id + 1 (91 for COCO). See the data-input note for details.

The following sections describe each parameter group in detail.

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture | |
| dataset | dict config | | The configuration of the dataset | |
| train | dict config | | The configuration of the training task | |
| evaluate | dict config | | The configuration of the evaluation task | |
| inference | dict config | | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |

model#

The model parameter provides options to change the Co-DETR architecture.

model:
  backbone: vit_large_codetr
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: True
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| pretrained_backbone_path | string | None | The optional path to the pretrained backbone file | string to the path |
| backbone | string | swin_large_patch4_window7_224 | The backbone name of the model. Swin, FAN, ResNet 34/50, EfficientViT, and ViT (NV-DINOv2/DINOv2) backbones are supported. Co-DETR also provides a dedicated vit_large_codetr backbone used by the ViT spec files. | swin_large_224_22k, swin_large_384_22k, swin_large_patch4_window7_224, swin_large_patch4_window12_384, swin_base_224_22k, swin_base_384_22k, swin_tiny_224_1k, fan_tiny, fan_small, fan_base, fan_large, resnet_34, resnet_50, efficientvit_b0/b1/b2/b3, vit_large_nvdinov2, vit_large_dinov2, vit_large_codetr |
| train_backbone | bool | True | A flag specifying whether to train the backbone or not | True, False |
| num_feature_levels | unsigned int | 4 | The number of feature levels to use in the model | 1, 2, 3, 4, 5 |
| return_interm_indices | int list | [1, 2, 3, 4] | The index of feature levels to use in the model. The length must match num_feature_levels. | [0, 1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2], [1] |
| hidden_dim | unsigned int | 256 | The dimension of the transformer hidden units | >0 |
| nheads | unsigned int | 8 | The number of attention heads | >0 |
| dec_layers | unsigned int | 6 | The number of decoder layers in the transformer | >0 |
| enc_layers | unsigned int | 6 | The number of encoder layers in the transformer | >0 |
| num_queries | unsigned int | 900 | The number of object queries (detection slots) | 1 ~ 2000 |
| dim_feedforward | unsigned int | 2048 | The dimension of the feedforward network | >0 |
| dec_n_points | unsigned int | 4 | The number of deformable reference points in the decoder | >0 |
| enc_n_points | unsigned int | 4 | The number of deformable reference points in the encoder | >0 |
| num_select | unsigned int | 300 | The number of top-K predictions selected during post-processing | >0 |
| pre_norm | bool | False | A flag specifying whether to add a LayerNorm before the encoder | True, False |
| two_stage_type | string | standard | The two-stage detection type | standard, no |
| decoder_sa_type | string | sa | The type of decoder self-attention | sa, ca_label, ca_content |
| embed_init_tgt | bool | True | A flag specifying whether to add a target embedding | True, False |
| fix_refpoints_hw | signed int | -1 | If this value is -1, width and height are learned separately for each box. If this value is -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed number. | >0, -1, -2 |
| pe_temperatureH | unsigned int | 20 | The temperature applied to the height dimension of the positional sine embedding | >0 |
| pe_temperatureW | unsigned int | 20 | The temperature applied to the width dimension of the positional sine embedding | >0 |
| dropout_ratio | float | 0.0 | The probability to drop hidden units | 0.0 ~ 1.0 |
| dilation | bool | False | A flag to enable dilation in the backbone (ResNet only) | True, False |
| use_dn | bool | True | A flag specifying whether to enable contrastive de-noising training | True, False |
| dn_number | unsigned int | 100 | The number of de-noising queries | >0 |
| dn_box_noise_scale | float | 1.0 | The scale of noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied. | >=0 |
| dn_label_noise_ratio | float | 0.5 | The scale of noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied. | >=0 |
| cls_loss_coef | float | 2.0 | The relative weight of the classification error in the matching cost | >=0.0 |
| bbox_loss_coef | float | 5.0 | The relative weight of the L1 error of the bounding box coordinates in the matching cost | >=0.0 |
| giou_loss_coef | float | 2.0 | The relative weight of the GIoU loss of the bounding box in the matching cost | >=0.0 |
| focal_alpha | float | 0.25 | The alpha value in the focal loss | >0.0 |
| aux_loss | bool | True | A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer) | True, False |
| interm_loss_coef | float | 1.0 | The coefficient for the intermediate (encoder) outputs loss | >=0.0 |
| no_interm_box_loss | bool | False | A flag to disable the intermediate bounding box loss | True, False |
| loss_types | string list | ['labels', 'boxes'] | The loss types to apply | labels, boxes |
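Several constraints in the model table interact, most notably that return_interm_indices must have as many entries as num_feature_levels. The following is a minimal sketch of a pre-flight check over a model section loaded as a plain dict. The helper `check_model_section` is hypothetical and covers only a few constraints stated in the table; TAO's own validation may be stricter or behave differently.

```python
# Defaults for a subset of the ">0" fields listed in the model table.
POSITIVE_INT_DEFAULTS = {
    "hidden_dim": 256,
    "nheads": 8,
    "enc_layers": 6,
    "dec_layers": 6,
    "dim_feedforward": 2048,
    "num_select": 300,
}


def check_model_section(model):
    """Return a list of human-readable problems found in a model dict."""
    problems = []

    # The length of return_interm_indices must match num_feature_levels.
    levels = model.get("num_feature_levels", 4)
    indices = model.get("return_interm_indices", [1, 2, 3, 4])
    if len(indices) != levels:
        problems.append(
            f"return_interm_indices has {len(indices)} entries "
            f"but num_feature_levels is {levels}"
        )

    # num_queries is documented as 1 ~ 2000.
    queries = model.get("num_queries", 900)
    if not 1 <= queries <= 2000:
        problems.append(f"num_queries={queries} is outside 1 ~ 2000")

    # Fields documented as strictly positive.
    for name, default in POSITIVE_INT_DEFAULTS.items():
        if model.get(name, default) <= 0:
            problems.append(f"{name} must be > 0")

    return problems


# Values taken from the example spec on this page.
example = {
    "num_feature_levels": 5,
    "return_interm_indices": [0, 1, 2, 3, 4],
    "num_queries": 1500,
    "num_select": 1000,
}
print(check_model_section(example))  # []
```

Raising num_feature_levels to 5 without also setting return_interm_indices is the case this check is mainly meant to catch, because the default index list has only four entries.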

The following parameters are specific to Co-DETR’s collaborative auxiliary heads and to its post-processing.

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_co_heads | unsigned int | 1 | The number of collaborative auxiliary (ATSS) heads used to provide extra supervision during training | 1 ~ 3 |
| co_head_loss_weight | float | 1.0 | The loss weight applied to the collaborative auxiliary head losses | >=0.0 |
| co_head_num_convs | unsigned int | 4 | The number of convolution layers in the collaborative ATSS head towers | >0 |
| soft_nms_enabled | bool | False | A flag to apply per-class soft-NMS after top-K selection during post-processing | True, False |
| soft_nms_method | string | linear | The soft-NMS decay method | linear, gaussian |
| soft_nms_iou_threshold | float | 0.8 | The IoU threshold for linear soft-NMS; boxes with an IoU <= threshold are not suppressed | 0.0 ~ 1.0 |
| soft_nms_sigma | float | 0.5 | The Gaussian sigma for soft-NMS score decay (gaussian method only) | >=0.01 |
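To show how the soft-NMS parameters above interact, here is a minimal sketch of the score decay applied to one box that overlaps a higher-scoring box of the same class. It follows the commonly used Soft-NMS formulation (linear and Gaussian decay); the function name is hypothetical, and the exact decay used inside TAO is an assumption and may differ.

```python
import math


def soft_nms_decay(score, iou, method="linear", iou_threshold=0.8, sigma=0.5):
    """Decay `score` given the IoU with an already-kept, higher-scoring box."""
    if method == "linear":
        # Boxes with IoU <= threshold keep their score unchanged.
        return score * (1.0 - iou) if iou > iou_threshold else score
    if method == "gaussian":
        # Gaussian decay applies to every overlap; sigma controls how fast.
        return score * math.exp(-(iou * iou) / sigma)
    raise ValueError(f"unsupported soft_nms_method: {method}")


# IoU below the linear threshold: the score is left untouched.
print(soft_nms_decay(0.9, iou=0.5))
# IoU above the threshold: the score is scaled by (1 - IoU).
print(soft_nms_decay(0.9, iou=0.9))
# Gaussian decay ignores iou_threshold and uses sigma instead.
print(soft_nms_decay(0.9, iou=0.9, method="gaussian"))
```

Unlike hard NMS, neither method removes a box outright; a box disappears from the output only if its decayed score falls below the confidence threshold applied afterwards.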

train#

The train parameter defines the hyperparameters of the training process. Co-DETR reuses DINO’s training schema, including the optim subsection. Refer to the DINO train and DINO optim documentation for the full list of supported fields.

train:
  num_gpus: 1
  num_nodes: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: True
  clip_grad_norm: 0.1
  pretrained_model_path: null
  resume_training_checkpoint_path: null
  freeze: []
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| num_nodes | unsigned int | 1 | The number of nodes. If the value is larger than 1, multi-node is enabled | >0 |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 12 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The config for the optimizer, including the learning rate, learning scheduler, and weight decay | |
| clip_grad_norm | float | 0.1 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping | >=0 |
| precision | string | fp32 | Specifying "fp16" enables mixed-precision training, which can help save GPU memory | fp32, fp16 |
| distributed_strategy | string | ddp | The multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported. | ddp, fsdp |
| activation_checkpoint | bool | True | A True value instructs train to recompute activations in the backward pass to save GPU memory, rather than storing them. (See note below for an automatic override.) | True, False |
| pretrained_model_path | string | | The path to a pretrained model checkpoint to load for fine-tuning | |
| freeze | string list | [] | The list of layer names in the model to freeze. Example ["backbone", "transformer.encoder"] | |

Note

When activation_checkpoint is True but the model uses fewer than four feature levels (a smaller model) and more than one GPU is used, activation checkpointing is automatically disabled at runtime.
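The override described in the note can be written as a single boolean expression. This is only a restatement of the note as a sketch; the function name is hypothetical and the code is not taken from TAO.

```python
def activation_checkpoint_in_effect(requested, num_feature_levels, num_gpus):
    """Return whether activation checkpointing is actually used at runtime."""
    # The note's condition: fewer than four feature levels AND more than one GPU.
    auto_disabled = num_feature_levels < 4 and num_gpus > 1
    return requested and not auto_disabled


# Example spec on this page: 5 feature levels, so the setting is honored.
print(activation_checkpoint_in_effect(True, num_feature_levels=5, num_gpus=8))  # True
# A smaller model on several GPUs: checkpointing is switched off automatically.
print(activation_checkpoint_in_effect(True, num_feature_levels=3, num_gpus=2))  # False
```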

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation. Co-DETR reuses DINO’s dataset schema; refer to the DINO dataset documentation for the complete list of supported fields, including the augmentation subsection.

dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_data_sources | list dict | | The training data sources. Each entry contains an image_dir and a json_file in COCO format. | |
| val_data_sources | list dict | | The validation data sources. Each entry contains an image_dir and a json_file in COCO format. | |
| test_data_sources | dict | | The test data sources for evaluation, containing an image_dir and a json_file. | |
| infer_data_sources | dict | | The inference data sources, containing an image_dir and a classmap .txt file. | |
| num_classes | unsigned int | 91 | The number of classes in the training data | >0 |
| contiguous_labels | bool | False | A flag to remap category IDs to a contiguous range before training | True, False |
| batch_size | unsigned int | 4 | The batch size for training and validation | >0 |
| workers | unsigned int | 8 | The number of parallel workers processing data | >0 |
| dataset_type | string | serialized | If set to default, the standard CocoDetection dataset structure is used. If set to serialized, the data is serialized to reduce CPU memory usage. | serialized, default |
| augmentation | dict config | | The parameters that define the augmentation method | |

The classmap file#

Inference (infer_data_sources.classmap) requires a plain-text .txt file that lists the class names, one name per line, in category_id order. The first line corresponds to category_id 1 (the first foreground class), the second line to category_id 2, and so on. The file maps the model’s numeric predictions to human-readable names, which are then used as the keys of inference.color_map and as the values referenced by inference.category_mapping.

For a COCO-trained model, the classmap lists the 80 foreground class names:

person
bicycle
car
motorcycle
airplane
bus
train
truck
...
toothbrush

The number of lines in the classmap must match the number of foreground classes the model predicts. When dataset.contiguous_labels is True, this equals dataset.num_classes (for example, 80 for COCO); otherwise the lines correspond to the non-background category_id values in your COCO annotations.
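A classmap in the order described above can be derived from the `categories` list of the COCO annotation file. The sketch below assumes the standard COCO layout, where each category has an `id` and a `name`; the helper name `classmap_lines` and the sample categories are hypothetical. How gaps in a non-contiguous ID range should be represented is not specified on this page, so the sketch simply lists the categories that exist.

```python
def classmap_lines(categories):
    """Return class names ordered by category_id, one per classmap line."""
    ordered = sorted(categories, key=lambda category: category["id"])
    return [category["name"] for category in ordered]


# Hypothetical categories, deliberately out of order.
categories = [
    {"id": 3, "name": "car"},
    {"id": 1, "name": "person"},
    {"id": 2, "name": "bicycle"},
]
print("\n".join(classmap_lines(categories)))
# person
# bicycle
# car
```

Writing the joined string to a .txt file gives a file that can be referenced from infer_data_sources.classmap.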

Training the Model#

Checkpointing and Resuming Training

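A PyTorch Lightning checkpoint is saved every train.checkpoint_interval epochs and is named model_epoch_<epoch_num>.pth. To resume an interrupted run, set train.resume_training_checkpoint_path to one of these checkpoints. Checkpoints are written to train.results_dir, for example: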

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
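When resuming, the most recent checkpoint is usually the one to pass as train.resume_training_checkpoint_path. The following sketch picks it from a list of file names using the model_epoch_<epoch_num>.pth naming shown above; the helper name and the extra file in the example are hypothetical.

```python
import re

# Matches the checkpoint naming shown in the listing above.
CHECKPOINT_PATTERN = re.compile(r"model_epoch_(\d+)\.pth$")


def latest_checkpoint(filenames):
    """Return the checkpoint name with the highest epoch, or None if absent."""
    best = None
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # Skip files that are not epoch checkpoints.
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, name)
    return None if best is None else best[1]


# "notes.txt" is a hypothetical unrelated file in the same directory.
names = ["model_epoch_000.pth", "model_epoch_004.pth", "model_epoch_002.pth", "notes.txt"]
print(latest_checkpoint(names))  # model_epoch_004.pth
```

In practice the list would come from `os.listdir` on train.results_dir. Comparing parsed epoch numbers rather than raw strings keeps the result correct even if the zero padding changes.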

Use the following command to run Co-DETR training:

codetr train -e <experiment_spec_file> \
             results_dir=<results_dir> \
             [model.<model_option>=<model_option_value>]

Required Arguments#

  • -e, --experiment_spec_file: The path to the experiment specification file.

Optional Arguments#

You can override any value in the experiment specification file using Hydra-style overrides of the form <field>=<value>. For example, results_dir=<results_dir> sets the output directory, overriding the results_dir value in the spec file.

Note

The output directory can be supplied either as the results_dir field in the spec file or as a results_dir=<path> Hydra override on the command line. The same applies to the evaluate and inference commands. The short -r / --results_dir flag used by some other TAO models is not used for Co-DETR; always use the Hydra-style override.

Here is an example of using the Co-DETR training command:

codetr train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR
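When the commands are launched from a script, the Hydra-style overrides can be assembled programmatically. This is a minimal sketch that only builds the argument list from the syntax shown on this page; the helper name `codetr_command` is hypothetical, and the code does not run the command.

```python
import shlex

SUPPORTED_TASKS = ("train", "evaluate", "inference")


def codetr_command(task, spec_file, overrides=None):
    """Build the argv list for a codetr task with <field>=<value> overrides."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    args = ["codetr", task, "-e", spec_file]
    # Each override becomes one <field>=<value> argument, in insertion order.
    for field, value in (overrides or {}).items():
        args.append(f"{field}={value}")
    return args


cmd = codetr_command(
    "train",
    "/path/to/codetr_train.yaml",
    {"results_dir": "/path/to/results", "train.num_epochs": 24},
)
print(shlex.join(cmd))
# codetr train -e /path/to/codetr_train.yaml results_dir=/path/to/results train.num_epochs=24
```

Inside the TAO PyTorch container, the resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a single string avoids shell-quoting problems with paths that contain spaces.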

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to evaluate | |
| results_dir | string | /results/evaluate | The directory to save evaluation results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation | |
| conf_threshold | float | 0.0 | The confidence threshold to filter predictions | >=0 |
| input_width | unsigned int | None | The width of the input image tensor | >0 |
| input_height | unsigned int | None | The height of the input image tensor | >0 |

Use the following command to run Co-DETR evaluation:

codetr evaluate -e <experiment_spec_file> \
                evaluate.checkpoint=<model_to_evaluate>

Required Arguments#

  • -e, --experiment_spec_file: The path to the experiment specification file.

  • evaluate.checkpoint: The .pth model to be evaluated.

Running Inference with a Co-DETR Model#

inference#

The inference parameter defines the hyperparameters of the inference process. The inference tool for Co-DETR models can be used to visualize bounding boxes and generate frame-by-frame labels in KITTI format on a directory of images.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640
  save_annotated_images: True
  color_map:
    person: red
    car: blue
  category_mapping:
    bicycle:   ["bicycle", "motorcycle"]
    car:       ["car", "bus", "train", "truck"]
    person:    ["person"]

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to use for inference | |
| results_dir | string | /results/inference | The directory to save inference results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference | |
| conf_threshold | float | 0.5 | The confidence threshold to filter predictions | >=0 |
| input_width | unsigned int | 640 | The width of the input image tensor | >=32 |
| input_height | unsigned int | 640 | The height of the input image tensor | >=32 |
| outline_width | unsigned int | 3 | The width in pixels of the bounding box outline | >=1 |
| save_annotated_images | bool | True | If True, write annotated JPEGs alongside the KITTI label files. Set to False to write only label files (faster, with no image decode/encode). | True, False |
| color_map | dict | | The color map of the bounding boxes for each class | string dict |
| category_mapping | dict | | An optional grouping of classmap categories into output categories, applied after the model forward pass. Detections whose original class is not in any group are dropped. When soft-NMS is enabled, an additional per-output-category soft-NMS pass suppresses duplicates within a group. | string-to-list dict |


Note

Set input_width and input_height only when you want to run inference (or evaluation) at a fixed resolution that differs from the dataset augmentation resize settings. For evaluation, both fields default to None, in which case the image size is governed by the augmentation resize parameters instead.
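The grouping performed by inference.category_mapping, as described in the table above, can be illustrated with a short sketch. It assumes detections are plain dicts with a `class` name and a `score`, which is a hypothetical representation rather than TAO's internal one, and it omits the additional per-output-category soft-NMS pass.

```python
def apply_category_mapping(detections, category_mapping):
    """Rename detection classes to their output group; drop unmapped classes."""
    # Invert {output_name: [original, ...]} into {original: output_name}.
    to_output = {}
    for output_name, originals in category_mapping.items():
        for original in originals:
            to_output[original] = output_name

    remapped = []
    for detection in detections:
        output_name = to_output.get(detection["class"])
        if output_name is None:
            continue  # The original class is not in any group, so it is dropped.
        remapped.append({**detection, "class": output_name})
    return remapped


# The mapping from the inference example on this page.
mapping = {
    "bicycle": ["bicycle", "motorcycle"],
    "car": ["car", "bus", "train", "truck"],
    "person": ["person"],
}
detections = [
    {"class": "truck", "score": 0.9},
    {"class": "motorcycle", "score": 0.7},
    {"class": "toothbrush", "score": 0.6},
]
print(apply_category_mapping(detections, mapping))
# [{'class': 'car', 'score': 0.9}, {'class': 'bicycle', 'score': 0.7}]
```

Because unmapped classes are dropped rather than passed through, any class that should still appear in the output needs its own single-entry group, as person has in the example.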

Use the following command to run Co-DETR inference:

codetr inference -e <experiment_spec_file> \
                 inference.checkpoint=<model_to_infer>

Required Arguments#

  • -e, --experiment_spec_file: The path to the experiment specification file.

  • inference.checkpoint: The .pth model to use for inference.