Co-DETR#

Warning

Co-DETR is an experimental model in TAO 7.0.1. It is not extensively tested and its configuration, behavior, and supported features may change in a future release. Co-DETR is a PyTorch-only model: there is no ONNX export, TensorRT engine generation, or DeepStream deployment path for this model.

Co-DETR (Collaborative DETR) is an experimental object-detection model included in TAO. It augments a DETR-style detector with auxiliary “collaborative” heads (for example, an ATSS-style dense head) that provide additional supervision during training to improve the convergence and accuracy of the DETR query head. At inference time, only the primary DETR query head is used to produce detections.

Co-DETR supports the following tasks:

train
evaluate
inference

Each task is explained in detail in the following sections.

Note

Because Co-DETR is PyTorch-only, the export, gen_trt_engine, and DeepStream deployment tasks documented for other DETR-family models are not supported.

The codetr console command is provided by the TAO PyTorch container. Launch it with the tao_pt runner (or the TAO Launcher) before running any of the commands on this page. Refer to the TAO Toolkit Quick Start for instructions on installing the launcher and pulling the container.

Getting Pretrained Weights#

Co-DETR is trained by fine-tuning from a pretrained backbone. Two fields, one in the model section and one in the train section, control where the starting weights come from:

model.pretrained_backbone_path: an optional path to a backbone-only checkpoint that initializes the selected model.backbone.
train.pretrained_model_path: an optional path to a full Co-DETR (or compatible DETR-family) checkpoint to continue fine-tuning from.

The ViT spec files use the dedicated vit_large_codetr backbone, which is derived from the same ViT-Large NV-DINOv2/DINOv2 weights used by DINO. Download a supported backbone checkpoint from the TAO pretrained models on NGC and point model.pretrained_backbone_path at the downloaded .pth file. For example:

ngc registry model download-version \
    nvidia/tao/pretrained_dinov2_classification_imagenet:vit_large_patch14_dinov2 \
    --dest /path/to/weights

The downloaded checkpoint can then be referenced from the spec file:

model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth

Note

The exact backbone artifact and version depend on the backbone you select. Use a ViT-Large NV-DINOv2/DINOv2 backbone checkpoint for the vit_large_codetr backbone, and the matching Swin, FAN, ResNet, or EfficientViT checkpoint for the other supported backbones. If pretrained_backbone_path is left as null, the backbone is initialized randomly, which typically requires substantially longer training.

Data Input for Co-DETR#

Co-DETR reuses the DINO data pipeline and expects directories of images for training or validation along with annotated JSON files in COCO format.

Note

The category_id values in your COCO JSON file should start from 1 because 0 is reserved for the background class. By default, dataset.num_classes should be set to the largest category_id plus 1. For instance, although COCO uses only 80 classes, its largest category_id is 90, so dataset.num_classes should be set to 91. When dataset.contiguous_labels is set to True, the category IDs are remapped to a contiguous range and num_classes is the number of foreground classes (for example, 80 for COCO).
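The rule in this note can be checked against an annotation file before training. The following is a minimal sketch, not part of TAO: the helper name num_classes_for is hypothetical, and it only reads the standard "categories" list of a COCO JSON file.

```python
def num_classes_for(categories, contiguous_labels=False):
    """Return the dataset.num_classes value implied by the note above.

    `categories` is the "categories" list from a COCO JSON file.
    This helper is illustrative only and is not a TAO API.
    """
    ids = sorted(c["id"] for c in categories)
    if not ids:
        raise ValueError("no categories found")
    if ids[0] < 1:
        raise ValueError("category_id must start from 1; 0 is the background class")
    if contiguous_labels:
        # IDs are remapped to a contiguous range: count the foreground classes.
        return len(ids)
    # Otherwise every ID up to the largest needs a slot, plus background 0.
    return ids[-1] + 1


# Three foreground classes with a gap in the IDs, similar to COCO.
categories = [
    {"id": 1, "name": "person"},
    {"id": 2, "name": "bicycle"},
    {"id": 5, "name": "bus"},
]
sparse = num_classes_for(categories)
contiguous = num_classes_for(categories, contiguous_labels=True)
print(sparse, contiguous)
```

For the real COCO annotations, the same arithmetic gives 91 without contiguous labels and 80 with them, matching the values quoted in the note.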

Creating an Experiment Specification File#

The training experiment specification file for Co-DETR includes model, train, and dataset parameters. Co-DETR reuses DINO’s dataset and train schemas, so those sections follow the same format as the DINO documentation.

The commands on this page refer to two environment variables that you should set to your own paths before running them:

# Path to the experiment specification (YAML) file shown below.
export DEFAULT_SPEC=/path/to/codetr_train.yaml
# Directory where training, evaluation, and inference outputs are written.
export RESULTS_DIR=/path/to/results

Here is a complete, copy-pasteable example specification file for training a Co-DETR model with a vit_large_codetr backbone on a COCO-format dataset. Replace the /path/to/... values with the locations of your data, backbone weights, and output directory.

results_dir: /path/to/results
model:
  backbone: vit_large_codetr
  pretrained_backbone_path: /path/to/weights/model.pth
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: true
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8
dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
train:
  num_gpus: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  activation_checkpoint: true
  clip_grad_norm: 0.1
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65
evaluate:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.0
inference:
  checkpoint: /path/to/results/train/codetr_model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640

Note

The dataset block above sets num_classes: 80 together with contiguous_labels: true, which remaps the COCO category IDs to the 80 contiguous foreground classes. If you instead leave contiguous_labels at its default (False), set num_classes to the largest category_id plus 1 (91 for COCO). See the note in the Data Input for Co-DETR section for details.

The following sections describe each parameter group in detail.

Parameter	Datatype	Default	Description	Supported Values
`model`	dict config	–	The configuration of the model architecture
`dataset`	dict config	–	The configuration of the dataset
`train`	dict config	–	The configuration of the training task
`evaluate`	dict config	–	The configuration of the evaluation task
`inference`	dict config	–	The configuration of the inference task
`encryption_key`	string	None	The encryption key to encrypt and decrypt model files
`results_dir`	string	/results	The directory where experiment results are saved

model#

The model parameter provides options to change the Co-DETR architecture.

model:
  backbone: vit_large_codetr
  num_queries: 1500
  num_feature_levels: 5
  return_interm_indices: [0, 1, 2, 3, 4]
  two_stage_type: standard
  num_co_heads: 2
  co_head_num_convs: 1
  hidden_dim: 256
  nheads: 8
  enc_layers: 6
  dec_layers: 6
  dim_feedforward: 2048
  num_select: 1000
  soft_nms_enabled: True
  soft_nms_method: linear
  soft_nms_iou_threshold: 0.8

Parameter	Datatype	Default	Description	Supported Values
`pretrained_backbone_path`	string	None	The optional path to the pretrained backbone file	string to the path
`backbone`	string	swin_large_patch4_window7_224	The backbone name of the model. Swin, FAN, ResNet 34/50, EfficientViT, and ViT (NV-DINOv2/DINOv2) backbones are supported. Co-DETR also provides a dedicated `vit_large_codetr` backbone used by the ViT spec files.	swin_large_224_22k, swin_large_384_22k, swin_large_patch4_window7_224, swin_large_patch4_window12_384, swin_base_224_22k, swin_base_384_22k, swin_tiny_224_1k, fan_tiny, fan_small, fan_base, fan_large, resnet_34, resnet_50, efficientvit_b0/b1/b2/b3, vit_large_nvdinov2, vit_large_dinov2, vit_large_codetr
`train_backbone`	bool	True	A flag specifying whether to train the backbone or not	True, False
`num_feature_levels`	unsigned int	4	The number of feature levels to use in the model	1, 2, 3, 4, 5
`return_interm_indices`	int list	[1, 2, 3, 4]	The index of feature levels to use in the model. The length must match `num_feature_levels`.	[0, 1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2], [1]
`hidden_dim`	unsigned int	256	The dimension of the transformer hidden units	>0
`nheads`	unsigned int	8	The number of attention heads	>0
`dec_layers`	unsigned int	6	The number of decoder layers in the transformer	>0
`enc_layers`	unsigned int	6	The number of encoder layers in the transformer	>0
`num_queries`	unsigned int	900	The number of object queries (detection slots)	1 ~ 2000
`dim_feedforward`	unsigned int	2048	The dimension of the feedforward network	>0
`dec_n_points`	unsigned int	4	The number of deformable reference points in the decoder	>0
`enc_n_points`	unsigned int	4	The number of deformable reference points in the encoder	>0
`num_select`	unsigned int	300	The number of top-K predictions selected during post-processing	>0
`pre_norm`	bool	False	A flag specifying whether to add a LayerNorm before the encoder	True, False
`two_stage_type`	string	standard	The two-stage detection type	standard, no
`decoder_sa_type`	string	sa	The type of decoder self-attention	sa, ca_label, ca_content
`embed_init_tgt`	bool	True	A flag specifying whether to add a target embedding	True, False
`fix_refpoints_hw`	signed int	-1	Controls how the width and height of the reference boxes are handled. If this value is -1, width and height are learned separately for each box. If this value is -2, a single shared width and height are learned. A value greater than 0 fixes the width and height to that value.	>0, -1, -2
`pe_temperatureH`	unsigned int	20	The temperature applied to the height dimension of the positional sine embedding	>0
`pe_temperatureW`	unsigned int	20	The temperature applied to the width dimension of the positional sine embedding	>0
`dropout_ratio`	float	0.0	The probability to drop hidden units	0.0 ~ 1.0
`dilation`	bool	False	A flag to enable dilation in the backbone (ResNet only)	True, False
`use_dn`	bool	True	A flag specifying whether to enable contrastive de-noising training	True, False
`dn_number`	unsigned int	100	The number of de-noising queries	>0
`dn_box_noise_scale`	float	1.0	The scale of noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied.	>=0
`dn_label_noise_ratio`	float	0.5	The scale of noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied.	>=0
`cls_loss_coef`	float	2.0	The relative weight of the classification error in the matching cost	>=0.0
`bbox_loss_coef`	float	5.0	The relative weight of the L1 error of the bounding box coordinates in the matching cost	>=0.0
`giou_loss_coef`	float	2.0	The relative weight of the GIoU loss of the bounding box in the matching cost	>=0.0
`focal_alpha`	float	0.25	The alpha value in the focal loss	>0.0
`aux_loss`	bool	True	A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer)	True, False
`interm_loss_coef`	float	1.0	The coefficient for the intermediate (encoder) outputs loss	>=0.0
`no_interm_box_loss`	bool	False	A flag to disable the intermediate bounding box loss	True, False
`loss_types`	string list	['labels', 'boxes']	The loss types to apply	labels, boxes
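Two constraints from the table above are easy to violate when editing a spec by hand: the length of return_interm_indices must match num_feature_levels, and num_queries must lie between 1 and 2000. The following sketch checks both on a plain dictionary. The function name check_model_section is hypothetical, and the fallback defaults are the ones listed in the table; TAO performs its own validation, which may differ.

```python
def check_model_section(model):
    """Return a list of human-readable problems found in a `model` dict.

    Illustrative only: the defaults below are copied from the table above.
    """
    problems = []

    levels = model.get("num_feature_levels", 4)
    indices = model.get("return_interm_indices", [1, 2, 3, 4])
    if len(indices) != levels:
        problems.append(
            f"return_interm_indices has {len(indices)} entries "
            f"but num_feature_levels is {levels}"
        )

    num_queries = model.get("num_queries", 900)
    if not 1 <= num_queries <= 2000:
        problems.append(f"num_queries={num_queries} is outside 1 ~ 2000")

    return problems


# The ViT example from this page passes both checks.
good = {
    "num_feature_levels": 5,
    "return_interm_indices": [0, 1, 2, 3, 4],
    "num_queries": 1500,
}
# Raising num_feature_levels without updating the indices is a common slip.
bad = {"num_feature_levels": 5, "return_interm_indices": [1, 2, 3, 4]}

print(check_model_section(good))
print(check_model_section(bad))
```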

The following parameters are specific to Co-DETR’s collaborative auxiliary heads and to its post-processing.

Parameter	Datatype	Default	Description	Supported Values
`num_co_heads`	unsigned int	1	The number of collaborative auxiliary (ATSS) heads used to provide extra supervision during training	1 ~ 3
`co_head_loss_weight`	float	1.0	The loss weight applied to the collaborative auxiliary head losses	>=0.0
`co_head_num_convs`	unsigned int	4	The number of convolution layers in the collaborative ATSS head towers	>0
`soft_nms_enabled`	bool	False	A flag to apply per-class soft-NMS after top-K selection during post-processing	True, False
`soft_nms_method`	string	linear	The soft-NMS decay method	linear, gaussian
`soft_nms_iou_threshold`	float	0.8	The IoU threshold for linear soft-NMS; boxes with an IoU <= threshold are not suppressed	0.0 ~ 1.0
`soft_nms_sigma`	float	0.5	The Gaussian sigma for soft-NMS score decay (gaussian method only)	>=0.01

train#

The train parameter defines the hyperparameters of the training process. Co-DETR reuses DINO’s training schema, including the optim subsection. Refer to the DINO train and DINO optim documentation for the full list of supported fields.

train:
  num_gpus: 1
  num_nodes: 1
  num_epochs: 12
  checkpoint_interval: 1
  validation_interval: 1
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: True
  clip_grad_norm: 0.1
  pretrained_model_path: null
  resume_training_checkpoint_path: null
  freeze: []
  optim:
    optimizer: AdamW
    lr: 2.0e-4
    lr_backbone: 2.0e-5
    lr_linear_proj_mult: 0.1
    weight_decay: 1.0e-4
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
    layer_decay_rate: 0.65

Parameter	Datatype	Default	Description	Supported Values
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed training	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed training
`num_nodes`	unsigned int	1	The number of nodes. If the value is larger than 1, multi-node is enabled	>0
`seed`	unsigned int	1234	The random seed for random, NumPy, and torch	>0
`num_epochs`	unsigned int	12	The total number of epochs to run the experiment	>0
`checkpoint_interval`	unsigned int	1	The epoch interval at which the checkpoints are saved	>0
`validation_interval`	unsigned int	1	The epoch interval at which the validation is run	>0
`resume_training_checkpoint_path`	string		The intermediate PyTorch Lightning checkpoint to resume training from
`results_dir`	string	/results/train	The directory to save training results
`optim`	dict config		The config for the optimizer, including the learning rate, learning scheduler, and weight decay
`clip_grad_norm`	float	0.1	The maximum L2 norm to which gradients are clipped. A value of 0.0 disables clipping.	>=0
`precision`	string	fp32	The training precision. Specifying "fp16" enables mixed-precision training, which can help save GPU memory.	fp32, fp16
`distributed_strategy`	string	ddp	The multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported.	ddp, fsdp
`activation_checkpoint`	bool	True	A True value instructs train to recompute activations in the backward pass to save GPU memory, rather than storing them. (See note below for an automatic override.)	True, False
`pretrained_model_path`	string		The path to a pretrained model checkpoint to load for fine-tuning
`freeze`	string list	[]	The list of layer names in the model to freeze, for example ["backbone", "transformer.encoder"]
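The optim values in the example above (lr: 2.0e-4, lr_scheduler: MultiStep, lr_steps: [11], lr_decay: 0.1) describe a step schedule. The sketch below shows the learning rate per epoch under the assumption that lr_steps are zero-based epoch milestones with the same semantics as PyTorch's MultiStepLR; the function name multistep_lr is hypothetical and the exact behavior in TAO may differ.

```python
from bisect import bisect_right


def multistep_lr(base_lr, epoch, lr_steps, lr_decay):
    """Learning rate at `epoch` for a MultiStep schedule.

    Assumes milestone semantics like torch.optim.lr_scheduler.MultiStepLR:
    the rate is multiplied by lr_decay once for every milestone <= epoch.
    """
    milestones_passed = bisect_right(sorted(lr_steps), epoch)
    return base_lr * lr_decay ** milestones_passed


# Schedule for the 12-epoch example on this page.
schedule = [multistep_lr(2.0e-4, epoch, [11], 0.1) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.1e}")
```

Under these assumptions, the base rate is used for epochs 0 through 10 and is reduced by a factor of ten for the final epoch. The lr_backbone value follows the same schedule from its own starting point.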

Note

When activation_checkpoint is True but the model uses fewer than four feature levels (a smaller model) and more than one GPU is used, activation checkpointing is automatically disabled at runtime.
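The override described in this note can be written as a small predicate, which is useful when estimating GPU memory for a planned run. This is a sketch of the stated rule only; the function name is hypothetical and the runtime logic inside TAO is not shown on this page.

```python
def effective_activation_checkpoint(activation_checkpoint, num_feature_levels, num_gpus):
    """Return whether activation checkpointing stays on, per the note above.

    It is switched off automatically when it was requested, the model uses
    fewer than four feature levels, and more than one GPU is used.
    """
    if activation_checkpoint and num_feature_levels < 4 and num_gpus > 1:
        return False
    return activation_checkpoint


# The ViT example on this page: five feature levels, so the flag is kept.
print(effective_activation_checkpoint(True, num_feature_levels=5, num_gpus=8))
# A smaller model on several GPUs: the flag is overridden.
print(effective_activation_checkpoint(True, num_feature_levels=3, num_gpus=2))
```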

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation. Co-DETR reuses DINO’s dataset schema; refer to the DINO dataset documentation for the complete list of supported fields, including the augmentation subsection.

dataset:
  dataset_type: default
  num_classes: 80
  contiguous_labels: true
  batch_size: 2
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
  augmentation:
    fixed_padding: true
    pad_size_divisor: 32
    fixed_random_crop: 1536
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    test_random_resize: 1280
    random_resize_max_size: 2048

Parameter	Datatype	Default	Description	Supported Values
`train_data_sources`	list dict		The training data sources. Each entry contains an `image_dir` and a `json_file` in COCO format.
`val_data_sources`	list dict		The validation data sources. Each entry contains an `image_dir` and a `json_file` in COCO format.
`test_data_sources`	dict		The test data sources for evaluation, containing an `image_dir` and a `json_file`.
`infer_data_sources`	dict		The inference data sources, containing an `image_dir` and a `classmap` `.txt` file.
`num_classes`	unsigned int	91	The number of classes in the training data	>0
`contiguous_labels`	bool	False	A flag to remap category IDs to a contiguous range before training	True, False
`batch_size`	unsigned int	4	The batch size for training and validation	>0
`workers`	unsigned int	8	The number of parallel workers processing data	>0
`dataset_type`	string	serialized	If set to `default`, the standard `CocoDetection` dataset structure is used. If set to `serialized`, the data is serialized to reduce CPU memory usage.	serialized, default
`augmentation`	dict config		The parameters that define the augmentation method
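The augmentation fields are documented in the DINO dataset reference rather than here. As a rough illustration of what a pad_size_divisor of 32 means, the sketch below rounds an image size up to the next multiple of the divisor. This is an assumption about the arithmetic only: how TAO combines it with fixed_padding and the resize settings is not described on this page, and the function name padded_size is hypothetical.

```python
def padded_size(height, width, pad_size_divisor=32):
    """Round height and width up to the nearest multiple of the divisor.

    Assumed interpretation of pad_size_divisor; not taken from TAO code.
    """
    if pad_size_divisor <= 0:
        raise ValueError("pad_size_divisor must be positive")

    def round_up(value):
        # Ceiling division followed by multiplication, using integers only.
        return -(-value // pad_size_divisor) * pad_size_divisor

    return round_up(height), round_up(width)


print(padded_size(1000, 1333))
print(padded_size(1536, 1536))
```

Sizes that are already multiples of the divisor, such as the 1536 crop in the example spec, are left unchanged.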

The classmap file#

Inference (infer_data_sources.classmap) requires a plain-text .txt file that lists the class names, one name per line, in category_id order. The first line corresponds to category_id 1 (the first foreground class), the second line to category_id 2, and so on. The file maps the model’s numeric predictions to human-readable names, which are then used as the keys of inference.color_map and as the values referenced by inference.category_mapping.

For a COCO-trained model, the classmap lists the 80 foreground class names:

person
bicycle
car
motorcycle
airplane
bus
train
truck
...
toothbrush

The number of lines in the classmap must match the number of foreground classes the model predicts. When dataset.contiguous_labels is True, this is the number of foreground classes (for example, 80 for COCO); otherwise it corresponds to the non-background category_id values in your COCO annotations.
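A classmap in this layout can be generated from the annotation file instead of being typed by hand. The following sketch reads the standard "categories" list of a COCO JSON file and writes the names in ascending category_id order. The helper name write_classmap and the demo file names are hypothetical; if your category IDs have gaps and contiguous_labels is False, confirm that the resulting line count is what the model expects.

```python
import json
import tempfile
from pathlib import Path


def write_classmap(coco_json_path, classmap_path):
    """Write one class name per line, ordered by ascending category_id."""
    with open(coco_json_path, encoding="utf-8") as f:
        categories = json.load(f)["categories"]
    names = [c["name"] for c in sorted(categories, key=lambda c: c["id"])]
    Path(classmap_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


# Demo on a tiny, made-up annotation file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    annotations = Path(tmp) / "annotations.json"
    annotations.write_text(
        json.dumps(
            {
                "categories": [
                    {"id": 2, "name": "bicycle"},
                    {"id": 1, "name": "person"},
                    {"id": 3, "name": "car"},
                ]
            }
        ),
        encoding="utf-8",
    )
    classmap = Path(tmp) / "classmap.txt"
    names = write_classmap(annotations, classmap)
    lines = classmap.read_text(encoding="utf-8").splitlines()

print(lines)
```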

Training the Model#

Checkpointing and Resuming Training

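A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved every train.checkpoint_interval epochs. Checkpoints are written to train.results_dir. To resume an interrupted run, set train.resume_training_checkpoint_path to one of these checkpoints. For example: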

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
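When resuming, the most recent checkpoint is usually the one to pass as train.resume_training_checkpoint_path. The sketch below picks it by parsing the epoch number from the model_epoch_<epoch_num>.pth file names shown above. The helper name latest_checkpoint is hypothetical, and the demo creates empty placeholder files rather than real checkpoints.

```python
import re
import tempfile
from pathlib import Path

# File name layout taken from the listing above.
CHECKPOINT_NAME = re.compile(r"model_epoch_(\d+)\.pth")


def latest_checkpoint(results_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    found = []
    for path in Path(results_dir).iterdir():
        match = CHECKPOINT_NAME.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[0])[1]


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in range(5):
        (Path(tmp) / f"model_epoch_{epoch:03d}.pth").touch()
    (Path(tmp) / "notes.txt").touch()  # unrelated files are ignored
    latest_name = latest_checkpoint(tmp).name

print(latest_name)
```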

Use the following command to run Co-DETR training:

codetr train -e <experiment_spec_file> \
             results_dir=<results_dir> \
             [model.<model_option>=<model_option_value>]

Required Arguments#

-e, --experiment_spec_file: The path to the experiment specification file.

Optional Arguments#

You can override any value in the experiment specification file using Hydra-style overrides of the form <field>=<value>. For example, results_dir=<results_dir> sets the output directory, overriding the results_dir value in the spec file.

Note

The output directory can be supplied either as the results_dir field in the spec file or as a results_dir=<path> Hydra override on the command line. The same applies to the evaluate and inference commands. The short -r / --results_dir flag used by some other TAO models is not used for Co-DETR; always use the Hydra-style override.

Here is an example of using the Co-DETR training command:

codetr train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR
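When launching runs from a script, it can help to assemble the command line from a dictionary of overrides. The sketch below only builds the argument list in the -e <spec> <field>=<value> form described above; it does not run anything. The function name codetr_command is hypothetical, and the paths are placeholders.

```python
import shlex

# The three tasks this page documents for Co-DETR.
SUPPORTED_TASKS = ("train", "evaluate", "inference")


def codetr_command(task, spec_file, overrides=None):
    """Build the argv list for a codetr task with Hydra-style overrides."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    argv = ["codetr", task, "-e", spec_file]
    for field, value in (overrides or {}).items():
        # Each override is passed as a single <field>=<value> argument.
        argv.append(f"{field}={value}")
    return argv


argv = codetr_command(
    "train",
    "/path/to/codetr_train.yaml",
    {"results_dir": "/path/to/results", "train.num_epochs": 12},
)
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(argv))
```

Inside the TAO PyTorch container, the resulting list could be passed to subprocess.run(argv, check=True).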

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string		The path to the PyTorch model to evaluate
`results_dir`	string	/results/evaluate	The directory to save evaluation results
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed evaluation	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed evaluation
`conf_threshold`	float	0.0	The confidence threshold to filter predictions	>=0
`input_width`	unsigned int	None	The width of the input image tensor	>0
`input_height`	unsigned int	None	The height of the input image tensor	>0

Use the following command to run Co-DETR evaluation:

codetr evaluate -e <experiment_spec_file> \
                evaluate.checkpoint=<model_to_evaluate>

Required Arguments#

-e, --experiment_spec_file: The path to the experiment specification file.
evaluate.checkpoint: The .pth model to be evaluated.

Running Inference with a Co-DETR Model#

inference#

The inference parameter defines the hyperparameters of the inference process. The inference tool for Co-DETR models can be used to visualize bounding boxes and generate frame-by-frame labels in KITTI format on a directory of images.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  input_width: 640
  input_height: 640
  save_annotated_images: True
  color_map:
    person: red
    car: blue
  category_mapping:
    bicycle:   ["bicycle", "motorcycle"]
    car:       ["car", "bus", "train", "truck"]
    person:    ["person"]

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string		The path to the PyTorch model to use for inference
`results_dir`	string	/results/inference	The directory to save inference results
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed inference	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed inference
`conf_threshold`	float	0.5	The confidence threshold to filter predictions	>=0
`input_width`	unsigned int	640	The width of the input image tensor	>=32
`input_height`	unsigned int	640	The height of the input image tensor	>=32
`outline_width`	unsigned int	3	The width in pixels of the bounding box outline	>=1
`save_annotated_images`	bool	True	If True, write annotated JPEGs alongside the KITTI label files. Set to False to write only label files (faster, with no image decode/encode).	True, False
`color_map`	dict		The color map of the bounding boxes for each class	string dict
`category_mapping`	dict		An optional grouping of classmap categories into output categories, applied after the model forward pass. Detections whose original class is not in any group are dropped. When soft-NMS is enabled, an additional per-output-category soft-NMS pass suppresses duplicates within a group.	string-to-list dict
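The effect of category_mapping on a list of detections can be sketched as follows: each original class name is replaced by the output category that lists it, and detections whose class is not in any group are dropped. This is an illustration of the behavior described in the table, not the TAO implementation. The function name regroup and the (label, score) tuple layout are assumptions, and the additional soft-NMS pass is not modeled.

```python
def regroup(detections, category_mapping):
    """Map (label, score) detections to output categories.

    Detections whose label is not listed in any group are dropped.
    """
    # Invert the mapping: original class name -> output category.
    source_to_group = {}
    for group, members in category_mapping.items():
        for name in members:
            source_to_group[name] = group

    regrouped = []
    for label, score in detections:
        group = source_to_group.get(label)
        if group is not None:
            regrouped.append((group, score))
    return regrouped


# The mapping from the example spec above.
category_mapping = {
    "bicycle": ["bicycle", "motorcycle"],
    "car": ["car", "bus", "train", "truck"],
    "person": ["person"],
}
detections = [("bus", 0.9), ("person", 0.8), ("toothbrush", 0.7), ("motorcycle", 0.6)]
result = regroup(detections, category_mapping)
print(result)
```

Here the toothbrush detection disappears because no output category lists it, while bus and motorcycle are reported as car and bicycle.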

Note

Set input_width and input_height only when you want to run inference or evaluation at a fixed resolution that differs from the dataset augmentation resize settings. For evaluate, both fields default to None, in which case the image size is governed by the augmentation resize parameters. For inference, both fields default to 640.

Use the following command to run Co-DETR inference:

codetr inference -e <experiment_spec_file> \
                 inference.checkpoint=<model_to_infer>

Required Arguments#

-e, --experiment_spec_file: The path to the experiment specification file.
inference.checkpoint: The .pth model to use for inference.