Co-DETR#
Warning
Co-DETR is an experimental model in TAO 7.0.1. It is not extensively tested and its configuration, behavior, and supported features may change in a future release. Co-DETR is a PyTorch-only model: there is no ONNX export, TensorRT engine generation, or DeepStream deployment path for this model.
Co-DETR (Collaborative DETR) is an experimental object-detection model included in TAO. It augments a DETR-style detector with auxiliary “collaborative” heads (for example, an ATSS-style dense head) that provide additional supervision during training to improve the convergence and accuracy of the DETR query head. At inference time, only the primary DETR query head is used to produce detections.
Co-DETR supports the following tasks:
train
evaluate
inference
Each task is explained in detail in the following sections.
Note
Because Co-DETR is PyTorch-only, the export, gen_trt_engine, and DeepStream deployment
tasks documented for other DETR-family models are not supported.
The codetr console command is provided by the TAO PyTorch container. Launch it with the
tao_pt runner (or the TAO Launcher) before running any of the commands on this page. Refer to
the TAO Toolkit Quick Start for instructions on installing the
launcher and pulling the container.
Getting Pretrained Weights#
Co-DETR is trained by fine-tuning from a pretrained backbone. Two fields control where the starting weights come from:
model.pretrained_backbone_path: an optional path to a backbone-only checkpoint that initializes the selected model.backbone.
train.pretrained_model_path: an optional path to a full Co-DETR (or compatible DETR-family) checkpoint to continue fine-tuning from.
The ViT spec files use the dedicated vit_large_codetr backbone, which is derived from the same
ViT-Large NV-DINOv2/DINOv2 weights used by DINO. Download a supported backbone
checkpoint from the TAO pretrained models on NGC and point
model.pretrained_backbone_path at the downloaded .pth file. For example:
ngc registry model download-version \
nvidia/tao/pretrained_dinov2_classification_imagenet:vit_large_patch14_dinov2 \
--dest /path/to/weights
The downloaded checkpoint can then be referenced from the spec file:
model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth
Note
The exact backbone artifact and version depend on the backbone you select. Use a ViT-Large
NV-DINOv2/DINOv2 backbone checkpoint for the vit_large_codetr backbone, and the matching Swin,
FAN, ResNet, or EfficientViT checkpoint for the other supported backbones. If
pretrained_backbone_path is left as null, the backbone is initialized randomly,
which typically requires substantially longer training.
Data Input for Co-DETR#
Co-DETR reuses the DINO data pipeline and expects directories of images for training or validation along with annotated JSON files in COCO format.
Note
The category_id from your COCO JSON file should start from 1 because 0 is set as a
background class. In addition, dataset.num_classes should be set to
max class_id + 1. For instance, even though there are only 80 classes used in COCO, the
largest class_id is 90, so dataset.num_classes should be set to 91. When
dataset.contiguous_labels is set to True, the category IDs are remapped to a
contiguous range and num_classes reflects the number of foreground classes (for example,
80 for COCO).
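The rule in the note above can be checked against an annotation file before training. The following is a minimal sketch, not part of TAO: the helper name suggest_num_classes is hypothetical, and it assumes the annotations have already been loaded into a dictionary (for example with json.load) that follows the standard COCO layout with a top-level categories list.

```python
def suggest_num_classes(coco: dict, contiguous_labels: bool = False) -> int:
    """Suggest a dataset.num_classes value from loaded COCO annotations.

    Hypothetical helper: it only applies the rule stated in the note above.
    """
    category_ids = [category["id"] for category in coco["categories"]]
    if not category_ids:
        raise ValueError("The annotation file defines no categories.")
    if min(category_ids) < 1:
        # 0 is reserved for the background class.
        raise ValueError("category_id values must start from 1.")
    if contiguous_labels:
        # IDs are remapped to a contiguous range, so count foreground classes.
        return len(set(category_ids))
    # Otherwise the largest ID decides the size, plus one for background.
    return max(category_ids) + 1


# Toy categories with a gap in the IDs, similar to COCO.
toy = {
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 2, "name": "bicycle"},
        {"id": 90, "name": "toothbrush"},
    ]
}
print(suggest_num_classes(toy))                          # 91
print(suggest_num_classes(toy, contiguous_labels=True))  # 3
```

The toy data keeps only three categories to stay short; with the full COCO category list the same call would return 91 or 80, matching the values in the note.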
Creating an Experiment Specification File#
The training experiment specification file for Co-DETR includes model, train, and
dataset parameters. Co-DETR reuses DINO’s dataset and train schemas, so those
sections follow the same format as the DINO documentation.
The commands on this page refer to two environment variables that you should set to your own paths before running them:
# Path to the experiment specification (YAML) file shown below.
export DEFAULT_SPEC=/path/to/codetr_train.yaml
# Directory where training, evaluation, and inference outputs are written.
export RESULTS_DIR=/path/to/results
Here is a complete, copy-pasteable example specification file for training a Co-DETR model with a
vit_large_codetr backbone on a COCO-format dataset. Replace the /path/to/... values with the
locations of your data, backbone weights, and output directory.
results_dir: /path/to/results
model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: true
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8
dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
train:
  num_gpus: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  activation_checkpoint: true
  clip_grad_norm: 0.1
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65
evaluate:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.0
inference:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640
Note
The dataset block above sets num_classes: 80 together with
contiguous_labels: true, which remaps the COCO category IDs to the 80 contiguous foreground
classes. If you instead leave contiguous_labels at its default (False), set
num_classes to max class_id + 1 (91 for COCO). See the
data-input note for details.
The following sections describe each parameter group in detail.
| Parameter | Data Type | Default | Description | Supported Values |
| model | dict config | – | The configuration of the model architecture | |
| dataset | dict config | – | The configuration of the dataset | |
| train | dict config | – | The configuration of the training task | |
| evaluate | dict config | – | The configuration of the evaluation task | |
| inference | dict config | – | The configuration of the inference task | |
| | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
model#
The model parameter provides options to change the Co-DETR architecture.
model:
  backbone: vit_large_codetr
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: True
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8
| Parameter | Datatype | Default | Description | Supported Values |
| pretrained_backbone_path | string | None | The optional path to the pretrained backbone file | string to the path |
| backbone | string | swin_large_patch4_window7_224 | The backbone name of the model. Swin, FAN, ResNet 34/50, EfficientViT, and ViT (NV-DINOv2/DINOv2) backbones are supported. Co-DETR also provides a dedicated vit_large_codetr backbone used by the ViT spec files. | swin_large_224_22k, swin_large_384_22k, swin_large_patch4_window7_224, swin_large_patch4_window12_384, swin_base_224_22k, swin_base_384_22k, swin_tiny_224_1k, fan_tiny, fan_small, fan_base, fan_large, resnet_34, resnet_50, efficientvit_b0/b1/b2/b3, vit_large_nvdinov2, vit_large_dinov2, vit_large_codetr |
| | bool | True | A flag specifying whether to train the backbone or not | True, False |
| num_feature_levels | unsigned int | 4 | The number of feature levels to use in the model | 1, 2, 3, 4, 5 |
| return_interm_indices | int list | [1, 2, 3, 4] | The index of feature levels to use in the model. The length must match num_feature_levels. | [0, 1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2], [1] |
| hidden_dim | unsigned int | 256 | The dimension of the transformer hidden units | >0 |
| nheads | unsigned int | 8 | The number of attention heads | >0 |
| dec_layers | unsigned int | 6 | The number of decoder layers in the transformer | >0 |
| enc_layers | unsigned int | 6 | The number of encoder layers in the transformer | >0 |
| num_queries | unsigned int | 900 | The number of object queries (detection slots) | 1 ~ 2000 |
| dim_feedforward | unsigned int | 2048 | The dimension of the feedforward network | >0 |
| | unsigned int | 4 | The number of deformable reference points in the decoder | >0 |
| | unsigned int | 4 | The number of deformable reference points in the encoder | >0 |
| num_select | unsigned int | 300 | The number of top-K predictions selected during post-processing | >0 |
| | bool | False | A flag specifying whether to add a LayerNorm before the encoder | True, False |
| two_stage_type | string | standard | The two-stage detection type | standard, no |
| | string | sa | The type of decoder self-attention | sa, ca_label, ca_content |
| | bool | True | A flag specifying whether to add a target embedding | True, False |
| fix_refpoints_hw | signed int | -1 | If this value is -1, width and height are learned separately for each box. If this value is -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed number. | >0, -1, -2 |
| | unsigned int | 20 | The temperature applied to the height dimension of the positional sine embedding | >0 |
| | unsigned int | 20 | The temperature applied to the width dimension of the positional sine embedding | >0 |
| | float | 0.0 | The probability to drop hidden units | 0.0 ~ 1.0 |
| | bool | False | A flag to enable dilation in the backbone (ResNet only) | True, False |
| | bool | True | A flag specifying whether to enable contrastive de-noising training | True, False |
| | unsigned int | 100 | The number of de-noising queries | >0 |
| | float | 1.0 | The scale of noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied. | >=0 |
| | float | 0.5 | The scale of noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied. | >=0 |
| | float | 2.0 | The relative weight of the classification error in the matching cost | >=0.0 |
| | float | 5.0 | The relative weight of the L1 error of the bounding box coordinates in the matching cost | >=0.0 |
| | float | 2.0 | The relative weight of the GIoU loss of the bounding box in the matching cost | >=0.0 |
| | float | 0.25 | The alpha value in the focal loss | >0.0 |
| | bool | True | A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer) | True, False |
| | float | 1.0 | The coefficient for the intermediate (encoder) outputs loss | >=0.0 |
| | bool | False | A flag to disable the intermediate bounding box loss | True, False |
| | string list | ['labels', 'boxes'] | The loss types to apply | labels, boxes |
The following parameters are specific to Co-DETR’s collaborative auxiliary heads and to its post-processing.
| Parameter | Datatype | Default | Description | Supported Values |
| num_co_heads | unsigned int | 1 | The number of collaborative auxiliary (ATSS) heads used to provide extra supervision during training | 1 ~ 3 |
| | float | 1.0 | The loss weight applied to the collaborative auxiliary head losses | >=0.0 |
| co_head_num_convs | unsigned int | 4 | The number of convolution layers in the collaborative ATSS head towers | >0 |
| soft_nms_enabled | bool | False | A flag to apply per-class soft-NMS after top-K selection during post-processing | True, False |
| soft_nms_method | string | linear | The soft-NMS decay method | linear, gaussian |
| soft_nms_iou_threshold | float | 0.8 | The IoU threshold for linear soft-NMS; boxes with an IoU <= threshold are not suppressed | 0.0 ~ 1.0 |
| | float | 0.5 | The Gaussian sigma for soft-NMS score decay (gaussian method only) | >=0.01 |
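To make the two soft-NMS decay methods concrete, the following sketch computes the factor by which a lower-scoring box's confidence is multiplied, given its IoU with a higher-scoring box of the same class. It follows the commonly used Soft-NMS formulation and is an assumption about the general technique, not a copy of the TAO implementation. The function name soft_nms_decay and the argument name sigma are hypothetical.

```python
import math


def soft_nms_decay(iou: float, method: str = "linear",
                   iou_threshold: float = 0.8, sigma: float = 0.5) -> float:
    """Return the score multiplier for a box overlapping a stronger box.

    Illustrative only: standard Soft-NMS decay rules, not TAO's own code.
    """
    if method == "linear":
        # Boxes at or below the IoU threshold keep their score unchanged.
        return 1.0 - iou if iou > iou_threshold else 1.0
    if method == "gaussian":
        # Every overlapping box is decayed smoothly; sigma controls how fast.
        return math.exp(-(iou * iou) / sigma)
    raise ValueError(f"Unsupported soft-NMS method: {method}")


# A box with IoU 0.5 is untouched by the linear rule at threshold 0.8,
# while the gaussian rule already lowers its score.
print(soft_nms_decay(0.5, method="linear"))
print(soft_nms_decay(0.5, method="gaussian"))
```

Unlike hard NMS, neither rule removes a box outright; boxes are only dropped later if their decayed score falls below the confidence threshold.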
train#
The train parameter defines the hyperparameters of the training process. Co-DETR reuses
DINO’s training schema, including the optim subsection. Refer to the DINO train and DINO optim documentation for the full list of supported fields.
train:
  num_gpus: 1
  num_nodes: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: True
  clip_grad_norm: 0.1
  pretrained_model_path: null
  resume_training_checkpoint_path: null
  freeze: []
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65
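The optim values in the example above describe a step schedule: the learning rate stays at lr until the epoch listed in lr_steps and is then multiplied by lr_decay. The sketch below works this out per epoch. It assumes that lr_scheduler: MultiStep behaves like PyTorch's MultiStepLR with zero-based epoch counting, which this page does not state explicitly; the helper name multistep_lr is hypothetical.

```python
def multistep_lr(base_lr: float, lr_steps: list[int], lr_decay: float,
                 epoch: int) -> float:
    """Learning rate at a zero-based epoch under a multi-step schedule.

    Assumes MultiStepLR-style semantics: one decay per milestone reached.
    """
    milestones_reached = sum(1 for step in lr_steps if epoch >= step)
    return base_lr * (lr_decay ** milestones_reached)


# Values from the example spec: lr 2.0e-4, lr_steps [11], lr_decay 0.1.
for epoch in range(12):
    print(epoch, multistep_lr(2.0e-4, [11], 0.1, epoch))
```

Under this assumption, a 12-epoch run trains at the base rate for epochs 0 to 10 and at one tenth of it for the final epoch.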
| Parameter | Datatype | Default | Description | Supported Values |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| num_nodes | unsigned int | 1 | The number of nodes. If the value is larger than 1, multi-node is enabled | >0 |
| | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 12 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The config for the optimizer, including the learning rate, learning scheduler, and weight decay | |
| clip_grad_norm | float | 0.1 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping | >=0 |
| precision | string | fp32 | Specifying "fp16" enables mixed-precision training, which can help save GPU memory | fp32, fp16 |
| distributed_strategy | string | ddp | The multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported. | ddp, fsdp |
| activation_checkpoint | bool | True | A True value instructs train to recompute activations in the backward pass to save GPU memory, rather than storing them. (See note below for an automatic override.) | True, False |
| pretrained_model_path | string | | The path to a pretrained model checkpoint to load for fine-tuning | |
| freeze | string list | [] | The list of layer names in the model to freeze. Example ["backbone", "transformer.encoder"] | |
Note
When activation_checkpoint is True but the model uses fewer than four feature levels
(a smaller model) and more than one GPU is used, activation checkpointing is automatically
disabled at runtime.
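The override described in the note can be written as a small predicate. This is only a restatement of the rule above for illustration; the function name effective_activation_checkpoint is hypothetical and the real check inside TAO may be structured differently.

```python
def effective_activation_checkpoint(activation_checkpoint: bool,
                                    num_feature_levels: int,
                                    num_gpus: int) -> bool:
    """Return whether activation checkpointing stays on at runtime.

    Restates the documented rule: it is switched off when fewer than four
    feature levels are combined with more than one GPU.
    """
    if activation_checkpoint and num_feature_levels < 4 and num_gpus > 1:
        return False
    return activation_checkpoint


# The example spec (5 feature levels, 1 GPU) keeps checkpointing enabled.
print(effective_activation_checkpoint(True, num_feature_levels=5, num_gpus=1))
```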
dataset#
The dataset parameter defines the dataset source, training batch size, and augmentation.
Co-DETR reuses DINO’s dataset schema; refer to the DINO dataset documentation
for the complete list of supported fields, including the augmentation subsection.
dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048
| Parameter | Datatype | Default | Description | Supported Values |
| train_data_sources | list dict | | The training data sources. Each entry contains an image_dir and a json_file in COCO format. | |
| val_data_sources | list dict | | The validation data sources. Each entry contains an image_dir and a json_file in COCO format. | |
| test_data_sources | dict | | The test data sources for evaluation, containing an image_dir and a json_file. | |
| infer_data_sources | dict | | The inference data sources, containing an image_dir and a classmap .txt file. | |
| num_classes | unsigned int | 91 | The number of classes in the training data | >0 |
| contiguous_labels | bool | False | A flag to remap category IDs to a contiguous range before training | True, False |
| batch_size | unsigned int | 4 | The batch size for training and validation | >0 |
| | unsigned int | 8 | The number of parallel workers processing data | >0 |
| dataset_type | string | serialized | If set to default, the standard CocoDetection dataset structure is used. If set to serialized, the data is serialized to reduce CPU memory usage. | serialized, default |
| augmentation | dict config | | The parameters that define the augmentation method | |
The classmap file#
Inference (infer_data_sources.classmap) requires a plain-text .txt file that lists the
class names, one name per line, in category_id order. The first line corresponds to
category_id 1 (the first foreground class), the second line to category_id 2, and so
on. The file maps the model’s numeric predictions to human-readable names, which are then used as the
keys of inference.color_map and as the values referenced by inference.category_mapping.
For a COCO-trained model, the classmap lists the 80 foreground class names:
person
bicycle
car
motorcycle
airplane
bus
train
truck
...
toothbrush
The number of lines in the classmap must match the number of foreground classes the model predicts.
When dataset.contiguous_labels is True, this is the number of foreground classes (for
example, 80 for COCO); otherwise it corresponds to the non-background category_id values in
your COCO annotations.
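A classmap in this layout can be derived from the same COCO annotations used for training. The sketch below is one possible way to do so, not a TAO utility: classmap_lines and write_classmap are hypothetical names, and the code assumes the annotations are already loaded into a dictionary with a standard COCO categories list.

```python
def classmap_lines(coco: dict) -> list[str]:
    """Return class names ordered by category_id, one per classmap line."""
    categories = sorted(coco["categories"], key=lambda category: category["id"])
    return [category["name"] for category in categories]


def write_classmap(coco: dict, path: str) -> None:
    """Write the names to a plain-text file, one name per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(classmap_lines(coco)) + "\n")


# Categories may appear in any order in the JSON file; sorting by id
# restores the order that the classmap expects.
toy = {
    "categories": [
        {"id": 3, "name": "car"},
        {"id": 1, "name": "person"},
        {"id": 2, "name": "bicycle"},
    ]
}
print(classmap_lines(toy))  # ['person', 'bicycle', 'car']
```

After writing the file, it is worth confirming that its line count matches the number of foreground classes, as described above.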
Training the Model#
Checkpointing and Resuming Training
A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved every train.checkpoint_interval epochs. To resume an interrupted run, set train.resume_training_checkpoint_path to one of these checkpoints.
Checkpoints are saved in train.results_dir, like this:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
Use the following command to run Co-DETR training:
codetr train -e <experiment_spec_file>
             results_dir=<results_dir>
             [model.<model_option>=<model_option_value>]
Required Arguments#
-e, --experiment_spec_file: The path to the experiment specification file.
Optional Arguments#
You can override any value in the experiment specification file using Hydra-style overrides of the
form <field>=<value>. For example, results_dir=<results_dir> sets the output
directory, overriding the results_dir value in the spec file.
Note
The output directory can be supplied either as the results_dir field in the spec file or as
a results_dir=<path> Hydra override on the command line. The same applies to the
evaluate and inference commands. The short -r / --results_dir flag used by
some other TAO models is not used for Co-DETR; always use the Hydra-style override.
Here is an example of using the Co-DETR training command:
codetr train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR
Evaluating the Model#
evaluate#
The evaluate parameter defines the hyperparameters of the evaluation process.
evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
| Parameter | Datatype | Default | Description | Supported Values |
| checkpoint | string | | The path to the PyTorch model to evaluate | |
| results_dir | string | /results/evaluate | The directory to save evaluation results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0 |
| | List[int] | [0] | The indices of the GPUs to use for distributed evaluation | |
| conf_threshold | float | 0.0 | The confidence threshold to filter predictions | >=0 |
| input_width | unsigned int | None | The width of the input image tensor | >0 |
| input_height | unsigned int | None | The height of the input image tensor | >0 |
Use the following command to run Co-DETR evaluation:
codetr evaluate -e <experiment_spec_file>
                evaluate.checkpoint=<model_to_evaluate>
Required Arguments#
-e, --experiment_spec_file: The path to the experiment specification file.
evaluate.checkpoint: The .pth model to be evaluated.
Running Inference with a Co-DETR Model#
inference#
The inference parameter defines the hyperparameters of the inference process. The inference
tool for Co-DETR models can be used to visualize bounding boxes and generate frame-by-frame labels in
KITTI format on a directory of images.
inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640
  save_annotated_images: True
  color_map:
    person: red
    car: blue
  category_mapping:
    bicycle: ["bicycle", "motorcycle"]
    car: ["car", "bus", "train", "truck"]
    person: ["person"]
| Parameter | Datatype | Default | Description | Supported Values |
| checkpoint | string | | The path to the PyTorch model to use for inference | |
| results_dir | string | /results/inference | The directory to save inference results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0 |
| | List[int] | [0] | The indices of the GPUs to use for distributed inference | |
| conf_threshold | float | 0.5 | The confidence threshold to filter predictions | >=0 |
| input_width | unsigned int | 640 | The width of the input image tensor | >=32 |
| input_height | unsigned int | 640 | The height of the input image tensor | >=32 |
| | unsigned int | 3 | The width in pixels of the bounding box outline | >=1 |
| save_annotated_images | bool | True | If True, write annotated JPEGs alongside the KITTI label files. Set to False to write only label files (faster, with no image decode/encode). | True, False |
| color_map | dict | | The color map of the bounding boxes for each class | string dict |
| category_mapping | dict | | An optional grouping of classmap categories into output categories, applied after the model forward pass. Detections whose original class is not in any group are dropped. When soft-NMS is enabled, an additional per-output-category soft-NMS pass suppresses duplicates within a group. | string-to-list dict |
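The grouping behavior of category_mapping can be illustrated with a few lines of Python. This is a sketch of the relabel-and-drop step only, under the assumption that each classmap name belongs to at most one group; it does not include the extra soft-NMS pass, and the function name apply_category_mapping and the detection dictionaries are hypothetical rather than TAO data structures.

```python
def apply_category_mapping(detections: list[dict],
                           category_mapping: dict[str, list[str]]) -> list[dict]:
    """Relabel detections to their output category and drop unmapped ones."""
    # Invert {output_category: [members]} into {member: output_category}.
    member_to_output = {
        member: output
        for output, members in category_mapping.items()
        for member in members
    }
    grouped = []
    for detection in detections:
        output = member_to_output.get(detection["label"])
        if output is None:
            continue  # The original class is not part of any group.
        grouped.append({**detection, "label": output})
    return grouped


mapping = {
    "bicycle": ["bicycle", "motorcycle"],
    "car": ["car", "bus", "train", "truck"],
    "person": ["person"],
}
detections = [
    {"label": "bus", "score": 0.9},
    {"label": "dog", "score": 0.8},
    {"label": "person", "score": 0.7},
]
print(apply_category_mapping(detections, mapping))
```

With the mapping from the example spec, the bus detection is reported as car, the person detection is kept, and the dog detection is dropped because dog is not listed in any group.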
Note
Set input_width and input_height only when you want to run inference (or evaluation)
at a fixed resolution that differs from the dataset augmentation resize settings. When
input_width/input_height are left at their evaluation defaults (None), the
image size is governed by the augmentation resize parameters instead.
Use the following command to run Co-DETR inference:
codetr inference -e <experiment_spec_file>
                 inference.checkpoint=<model_to_infer>
Required Arguments#
-e, --experiment_spec_file: The path to the experiment specification file.
inference.checkpoint: The .pth model to use for inference.