RT-DETR#

RT-DETR is an object-detection model included in TAO. It supports the following tasks:

  • train

  • evaluate

  • inference

  • export

  • distill

Each task is explained in detail in the following sections.

Note

  • Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Data Input for RT-DETR#

RT-DETR expects directories of images for training or validation and annotated JSON files in COCO format.
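As a rough illustration of the annotation layout, the following sketch builds a minimal COCO-style detection dictionary in Python. The helper name `make_coco_annotation` and the choice to start image and category IDs at 1 are assumptions made for this example, not requirements confirmed by this page.

```python
import json


def make_coco_annotation(images, boxes, class_names):
    """Build a minimal COCO-style detection dictionary (hypothetical helper).

    images: list of (file_name, width, height)
    boxes: list of (image_index, class_index, x, y, w, h) in pixels
    class_names: list of category names
    """
    coco = {
        "images": [
            {"id": i + 1, "file_name": name, "width": w, "height": h}
            for i, (name, w, h) in enumerate(images)
        ],
        "categories": [
            {"id": c + 1, "name": name} for c, name in enumerate(class_names)
        ],
        "annotations": [],
    }
    for ann_id, (img_idx, cls_idx, x, y, w, h) in enumerate(boxes, start=1):
        coco["annotations"].append({
            "id": ann_id,
            "image_id": img_idx + 1,
            "category_id": cls_idx + 1,
            "bbox": [x, y, w, h],  # COCO boxes are [x_min, y_min, width, height]
            "area": w * h,
            "iscrowd": 0,
        })
    return coco


example = make_coco_annotation(
    images=[("000001.jpg", 640, 480)],
    boxes=[(0, 0, 10.0, 20.0, 100.0, 50.0)],
    class_names=["person", "car"],
)
print(json.dumps(example, indent=2))
```

The resulting JSON can be saved next to the image directory and referenced through the json_file field of a data source.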

Creating an Experiment Spec File#

The training experiment spec file for RT-DETR includes model, train, and dataset parameters. The example values in the following sections configure an RT-DETR model with a resnet_50 backbone trained on the COCO dataset.

Use the following command to get an experiment spec file for RT-DETR:

SPECS=$(tao-client rtdetr get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
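Because the FTMS Client spec is JSON, it can be adjusted programmatically before a job is launched. The sketch below is one way to do that in Python; `override_spec` is a hypothetical helper, and the `specs_text` string is a stand-in, since the actual content returned by get-spec will differ.

```python
import json


def override_spec(spec, overrides):
    """Return a copy of spec with dotted-path overrides applied."""
    updated = json.loads(json.dumps(spec))  # deep copy via JSON round-trip
    for dotted_key, value in overrides.items():
        node = updated
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return updated


# Stand-in for the JSON text returned by get-spec (assumed shape).
specs_text = (
    '{"train": {"num_epochs": 10, "optim": {"lr": 0.0001}},'
    ' "dataset": {"batch_size": 4}}'
)
specs = override_spec(json.loads(specs_text), {
    "train.num_epochs": 50,
    "train.optim.lr": 0.0002,
    "dataset.num_classes": 3,
})
print(json.dumps(specs))
```

The printed JSON string is what would be passed as the --specs argument in the commands later on this page.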

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| encryption_key | string | | | | | | FALSE |
| results_dir | string | | /results | | | | FALSE |
| wandb | collection | | | | | | FALSE |
| model | collection | Configurable parameters to construct the model for an RT-DETR experiment. | | | | | FALSE |
| dataset | collection | Configurable parameters to construct the dataset for an RT-DETR experiment. | | | | | FALSE |
| train | collection | Configurable parameters to construct the trainer for an RT-DETR experiment. | | | | | FALSE |
| evaluate | collection | Configurable parameters to construct the evaluator for an RT-DETR experiment. | | | | | FALSE |
| inference | collection | Configurable parameters to construct the inferencer for an RT-DETR experiment. | | | | | FALSE |
| export | collection | Configurable parameters to construct the exporter for an RT-DETR experiment. | | | | | FALSE |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for an RT-DETR experiment. | | | | | FALSE |
| distill | collection | Configurable parameters to construct the distiller for an RT-DETR experiment. | | | | | FALSE |

model#

The model parameter provides options to change the RT-DETR architecture.

model:
  pretrained_backbone_path: /path/to/pretrained/backbone.pth
  backbone: resnet_50
  train_backbone: true
  num_queries: 300
  num_select: 300
  num_feature_levels: 3
  return_interm_indices:
  - 1
  - 2
  - 3
  feat_strides:
  - 8
  - 16
  - 32
  feat_channels:
  - 256
  - 256
  - 256
  use_encoder_idx:
  - 2
  hidden_dim: 256
  nheads: 8
  dropout_ratio: 0.0
  enc_layers: 1
  dim_feedforward: 1024
  pe_temperature: 10000
  expansion: 1.0
  depth_mult: 1
  enc_act: gelu
  act: silu
  dec_layers: 6
  dn_number: 100
  eval_idx: -1
  vfl_loss_coef: 1.0
  bbox_loss_coef: 5.0
  giou_loss_coef: 2.0
  alpha: 0.75
  gamma: 2.0
  aux_loss: true
  loss_types:
  - vfl
  - boxes
  backbone_names:
  - backbone.0
  linear_proj_names:
  - reference_points
  - sampling_offsets
  distillation_loss_coef: 1.0
  frozen_fm:
    enabled: false
    backbone: radio_v2-l
    checkpoint: /path/to/pretrained/radio_v2-l.pth

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_backbone_path | string | [Optional] Path to a pretrained backbone file. | | | | | false |
| backbone | string | Backbone name of the model. The TAO implementation of RT-DETR supports ResNet, EfficientViT, FAN, and ConvNeXt v1/v2. | resnet_50 | | | convnext_tiny, convnext_small, convnext_base, convnext_large, convnext_xlarge, fan_tiny, fan_small, fan_base, fan_large, resnet_18, resnet_34, resnet_50, resnet_101, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge | false |
| train_backbone | bool | Flag to set backbone weights as trainable or frozen. When set to False, the backbone weights are frozen. | True | | | | false |
| num_queries | int | Number of queries. | 300 | 1 | inf | | true |
| num_select | int | Number of top-K predictions selected during post-processing. | 300 | 1 | | | true |
| num_feature_levels | int | Number of feature levels to use in the model. | 3 | 1 | 4 | | false |
| return_interm_indices | list | Index of feature levels to use in the model. The length must match num_feature_levels. | [1, 2, 3] | | | | false |
| feat_strides | list | Stride used as grid size of positional embedding at each encoder layer. | [8, 16, 32] | | | | false |
| feat_channels | list | Feature channel sizes in decoder. | [256, 256, 256] | | | | false |
| use_encoder_idx | list | Index of multi-scale backbone features to pass to encoder. | [2] | | | | false |
| hidden_dim | int | Dimension of the hidden units. | 256 | | | | false |
| nheads | int | Number of heads. | 8 | | | | false |
| dropout_ratio | float | Probability to drop hidden units. | 0.0 | 0.0 | 1.0 | | false |
| enc_layers | int | Number of encoder layers in the transformer. | 1 | 1 | | | true |
| dim_feedforward | int | Dimension of the feedforward network. | 1024 | 1 | | | false |
| pe_temperature | int | Temperature applied to the positional sine embedding. | 10000 | 1 | inf | | false |
| expansion | float | Expansion ratio for hidden dimension used in CSPRepLayer. | 1.0 | 0.0 | inf | | false |
| depth_mult | int | Number of RepVGGBlock used in CSPRepLayer. | 1 | 1 | inf | | false |
| enc_act | string | Activation used for the encoder. | gelu | | | | false |
| act | string | Activation used for top-down FPN and bottom-up PAN. | silu | | | | false |
| dec_layers | int | Number of decoder layers in the transformer. | 6 | 1 | | | true |
| dn_number | int | Number of denoising queries. | 100 | 0 | inf | | false |
| eval_idx | int | Index of decoder layer to use for evaluation. By default, use the last decoder layer. | -1 | -1 | inf | | false |
| vfl_loss_coef | float | Relative weight of the varifocal error in the matching cost. | 1.0 | 0.0 | inf | | false |
| bbox_loss_coef | float | Relative weight of the L1 error of the bounding box coordinates in the matching cost. | 5.0 | 0.0 | inf | | false |
| giou_loss_coef | float | Relative weight of the GIoU loss of the bounding box in the matching cost. | 2.0 | 0.0 | inf | | false |
| alpha | float | Alpha value in the varifocal loss. | 0.75 | | | | false |
| gamma | float | Gamma value in the varifocal loss. | 2.0 | | | | false |
| aux_loss | bool | A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer). | True | | | | false |
| loss_types | list | Losses to be used during training. | ['vfl', 'boxes'] | | | | false |
| backbone_names | list | Prefix of the tensor names corresponding to the backbone. | ['backbone.0'] | | | | false |
| linear_proj_names | list | Linear projection layer names. | ['reference_points', 'sampling_offsets'] | | | | false |
| distillation_loss_coef | float | Coefficient for the distillation loss during distillation. | 1.0 | | | | false |
| frozen_fm | collection | Configurable parameters used to construct the frozen foundation model. | | | | | false |
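Several list-valued model fields are tied to num_feature_levels. The following Python sketch checks a model spec for such inconsistencies before a job is submitted. Only the return_interm_indices rule comes from the table above. The checks on feat_strides, feat_channels, and head divisibility are assumptions added for illustration, and `check_model_spec` is a hypothetical helper.

```python
def check_model_spec(model):
    """Return a list of human-readable problems found in a model spec dict."""
    problems = []
    levels = model.get("num_feature_levels", 3)
    indices = model.get("return_interm_indices", [1, 2, 3])
    # Documented rule: the index list length must match num_feature_levels.
    if len(indices) != levels:
        problems.append(
            f"return_interm_indices has {len(indices)} entries, "
            f"but num_feature_levels is {levels}"
        )
    # Assumption: one stride and one channel size per feature level.
    for key in ("feat_strides", "feat_channels"):
        values = model.get(key)
        if values is not None and len(values) != levels:
            problems.append(f"{key} has {len(values)} entries, expected {levels}")
    # Assumption: multi-head attention needs hidden_dim divisible by nheads.
    hidden_dim = model.get("hidden_dim", 256)
    nheads = model.get("nheads", 8)
    if hidden_dim % nheads != 0:
        problems.append(
            f"hidden_dim {hidden_dim} is not divisible by nheads {nheads}"
        )
    return problems


print(check_model_spec({"num_feature_levels": 3, "return_interm_indices": [1, 2]}))
```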

frozen_fm#

The frozen_fm parameter provides options to change the Frozen RT-DETR (RT-DETR + a frozen foundation model) architecture.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enabled | bool | Flag to set frozen foundation model as enabled or disabled. When set to True, the frozen foundation model will be enabled. | True | | | | FALSE |
| backbone | string | Name of the frozen foundation model. | radio_v2-l | | | radio_v2-b, radio_v2-l, radio_v2-h | FALSE |
| checkpoint | string | The path to the pretrained frozen foundation model checkpoint. | | | | | FALSE |

Note

The pretrained weights of the frozen foundation model can be found in the TAO Model Zoo.

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    lr_linear_proj_mult: 0.1
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_decay: 0.1
    lr_steps: [40]
    optimizer: AdamW
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  clip_grad_norm: 0.1
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  seed: 1234

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the training job. | 1 | 1 | | | false |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must equal the number of GPUs in train.num_gpus. | [0] | | | | false |
| num_nodes | int | Number of nodes to run the training on. If >1, multi-node is enabled. | 1 | | | | false |
| seed | int | Seed for the initializer in PyTorch. If <0, disable fixed seed. | 1234 | -1 | inf | | false |
| cudnn | collection | | | | | | false |
| num_epochs | int | Number of epochs to run the training. | 10 | 1 | inf | | true |
| checkpoint_interval | int | Interval (in epochs) at which a checkpoint is to be saved. Helps resume training. | 1 | 1 | | | false |
| validation_interval | int | Interval (in epochs) at which an evaluation is to be triggered on the validation dataset. | 1 | 1 | | | false |
| resume_training_checkpoint_path | string | Path to the checkpoint at which to resume training. | | | | | false |
| results_dir | string | Path to the location where all the assets generated from a task are stored. | | | | | false |
| freeze | list | List of layer names to freeze. Example: ["backbone", "encoder", "decoder"]. | [] | | | | false |
| pretrained_model_path | string | Path to a pretrained RT-DETR model to initialize the current training from. | | | | | false |
| clip_grad_norm | float | Amount to clip the gradient by L2 Norm. A value of 0.0 specifies no clipping. | 0.1 | | | | false |
| is_dry_run | bool | Whether to run the trainer in Dry Run mode. This is a good way to validate the spec file and run a sanity check on the trainer without actually initializing and running the trainer. | false | | | | false |
| enable_ema | bool | Whether to enable Exponential Moving Average during training. | false | | | | false |
| ema | collection | Hyperparameters to configure the Exponential Moving Average. | | | | | false |
| optim | collection | Hyperparameters to configure the optimizer. | | | | | false |
| precision | string | Precision to run the training on. | fp32 | | | bf16, fp32, fp16 | false |
| distributed_strategy | string | The multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported. | ddp | | | ddp, fsdp | false |
| activation_checkpoint | bool | Whether training is to recompute in backward pass to save GPU memory, rather than storing activations. | true | | | | false |
| verbose | bool | Whether to enable printing of detailed learning rate scaling from the optimizer. | false | | | | false |
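The gpu_ids description above requires the list length to equal train.num_gpus. A small pre-flight check along these lines can catch a mismatch early. `check_train_spec` is a hypothetical helper, and the duplicate-ID check is an extra assumption rather than a documented rule.

```python
def check_train_spec(train):
    """Return a list of problems found in a train spec dict."""
    problems = []
    num_gpus = train.get("num_gpus", 1)
    gpu_ids = train.get("gpu_ids", [0])
    # Documented rule: len(gpu_ids) must equal num_gpus.
    if len(gpu_ids) != num_gpus:
        problems.append(
            f"gpu_ids lists {len(gpu_ids)} device(s) but num_gpus is {num_gpus}"
        )
    # Assumption: each GPU ID should appear only once.
    if len(set(gpu_ids)) != len(gpu_ids):
        problems.append("gpu_ids contains duplicate entries")
    return problems


print(check_train_spec({"num_gpus": 2, "gpu_ids": [0]}))
```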

optim#

The optim parameter defines the configuration for the optimizer in training, including the learning rate, learning rate scheduler, and weight decay.

optim:
  lr: 0.0002
  lr_backbone: 0.00002
  lr_linear_proj_mult: 0.1
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_decay: 0.1
  lr_steps: [40]
  optimizer: AdamW

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| optimizer | string | Type of optimizer used to train the network. | AdamW | | | AdamW, SGD | FALSE |
| monitor_name | string | The metric value to be monitored for the AutoReduce scheduler. | val_loss | | | val_loss, train_loss | FALSE |
| lr | float | The initial learning rate for training the model, excluding the backbone. | 0.0001 | | | | TRUE |
| lr_backbone | float | The initial learning rate for training the backbone. | 1e-05 | | | | TRUE |
| momentum | float | The momentum for the AdamW optimizer. | 0.9 | | | | TRUE |
| weight_decay | float | The weight decay coefficient. | 0.0001 | | | | TRUE |
| lr_scheduler | string | The learning rate scheduler. MultiStep: decrease the lr by lr_decay at each step in lr_steps. StepLR: decrease the lr by lr_decay at every lr_step_size. | MultiStep | | | MultiStep, StepLR | FALSE |
| lr_steps | list | The steps at which the learning rate must be decreased. This is applicable only with the MultiStep LR. | [1000] | | | | FALSE |
| lr_step_size | int | The number of steps to decrease the learning rate in the StepLR. | 1000 | | | | TRUE |
| lr_decay | float | The decreasing factor for the learning rate scheduler. | 0.1 | | | | TRUE |
| warmup_steps | int | The number of steps to perform linear learning rate warm-up. | 0 | 0 | inf | | FALSE |
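To make the two scheduler options concrete, the following Python sketch computes the learning rate that results from the lr_scheduler, lr_decay, lr_steps, and lr_step_size settings described above. It ignores warm-up, and it does not say whether a "step" is an epoch or an iteration in TAO, which is left as an assumption. `scheduled_lr` is a hypothetical helper, not part of TAO.

```python
def scheduled_lr(base_lr, step, scheduler="MultiStep", lr_decay=0.1,
                 lr_steps=(1000,), lr_step_size=1000):
    """Learning rate after `step` scheduler steps (warm-up not modeled)."""
    if scheduler == "MultiStep":
        # One decay for every milestone that has been reached.
        num_decays = sum(1 for milestone in lr_steps if step >= milestone)
    elif scheduler == "StepLR":
        # One decay every lr_step_size steps.
        num_decays = step // lr_step_size
    else:
        raise ValueError(f"unsupported lr_scheduler: {scheduler}")
    return base_lr * lr_decay ** num_decays


# With the example spec (lr: 0.0002, lr_steps: [40], lr_decay: 0.1):
for step in (0, 39, 40, 100):
    print(step, scheduled_lr(0.0002, step, lr_steps=(40,)))
```

With these example values the rate stays at 0.0002 until step 40 and is multiplied by 0.1 from that point on.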

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
  num_classes: 80
  batch_size: 4
  workers: 8

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train_data_sources | list | The list of data sources for training. image_dir: directory that contains the training images; json_file: path of the JSON file with the training annotations in COCO format. | [{'image_dir': '', 'json_file': ''}] | | | | false |
| val_data_sources | collection | The data source for validation. image_dir: directory that contains the validation images; json_file: path of the JSON file with the validation annotations in COCO format. | {'image_dir': '', 'json_file': ''} | | | | false |
| test_data_sources | collection | The data source for testing. image_dir: directory that contains the test images; json_file: path of the JSON file with the test annotations in COCO format. | {'image_dir': '', 'json_file': ''} | | | | false |
| infer_data_sources | collection | The data source for inference. image_dir: list of directories that contain the inference images; classmap: path of the .txt file that contains class names. | {'image_dir': [''], 'classmap': ''} | | | | false |
| batch_size | int | Batch size for training and validation. | 4 | 1 | inf | | true |
| workers | int | Number of parallel workers processing data. | 8 | 1 | inf | | true |
| remap_mscoco_category | bool | Enables mapping of MSCOCO 91 classes to 80. Only required if you are directly training using the original COCO annotation files. For a custom dataset, set this value to False. | False | | | | false |
| pin_memory | bool | Enables the dataloader to allocate page-locked memory for faster data transfer between the CPU and GPU. | True | | | | false |
| dataset_type | string | If set to default, follows the standard CocoDetection dataset structure from torchvision, which loads the COCO annotations in every subprocess. This creates redundant copies of the data and can exhaust RAM if workers is high. If set to serialized, the data is serialized through pickle and torch.Tensor, which allows the data to be shared across subprocesses and can greatly reduce RAM usage. | serialized | | | serialized, default | false |
| num_classes | int | The number of classes in the training data. | 80 | 1 | inf | | false |
| eval_class_ids | list | IDs of the classes for evaluation. | [1] | | | | false |
| augmentation | collection | Configuration parameters for data augmentation. | | | | | false |
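The remap_mscoco_category option exists because the original COCO annotation files use 80 category IDs spread over the range 1 to 90. The sketch below shows one way such a remapping to contiguous indices can be built. The zero-based target indices and the name `remap_category` are assumptions for illustration, and the mapping TAO applies internally may differ.

```python
# IDs in 1..90 that do not appear in the COCO 2017 detection annotations.
MISSING_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# The 80 category IDs that are actually used, in ascending order.
COCO_IDS = [i for i in range(1, 91) if i not in MISSING_IDS]

# Map each original ID to a contiguous index (zero-based here by assumption).
ORIGINAL_TO_CONTIGUOUS = {cat_id: idx for idx, cat_id in enumerate(COCO_IDS)}


def remap_category(cat_id):
    """Translate an original COCO category ID to its contiguous index."""
    return ORIGINAL_TO_CONTIGUOUS[cat_id]


print(len(COCO_IDS), remap_category(1), remap_category(13), remap_category(90))
```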

augmentation#

The augmentation parameter contains hyperparameters for augmentation.

augmentation:
  multi_scales:
  - 480
  - 512
  - 544
  - 576
  - 608
  - 640
  - 672
  - 704
  - 736
  - 768
  - 800
  train_spatial_size:
  - 640
  - 640
  eval_spatial_size:
  - 640
  - 640
  distortion_prob: 0.8
  iou_crop_prob: 0.8
  preserve_aspect_ratio: false

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| multi_scales | list | A list of sizes to perform random resize. | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] | | | | false |
| train_spatial_size | list | Input resolution used during training, in [h, w] order. | [640, 640] | | | | false |
| eval_spatial_size | list | Input resolution used for evaluation during validation and testing, in [h, w] order. | [640, 640] | | | | false |
| distortion_prob | float | The probability for RandomPhotometricDistort. | 0.8 | 0.0 | 1.0 | | true |
| iou_crop_prob | float | The probability for RandomIoUCrop. | 0.8 | 0.0 | 1.0 | | true |
| preserve_aspect_ratio | bool | Flag to enable resizing that preserves the aspect ratio. | false | | | | false |
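The effect of preserve_aspect_ratio on the resized image shape can be sketched as follows. This follows the common "fit inside the target" approach and is an assumption about the behavior, not a description of the TAO implementation. Whether the result is padded afterward is not covered, and `resize_shape` is a hypothetical helper.

```python
def resize_shape(height, width, target_hw=(640, 640), preserve_aspect_ratio=False):
    """Return the (height, width) an image would have after resizing."""
    target_h, target_w = target_hw
    if not preserve_aspect_ratio:
        # Stretch directly to the target size.
        return target_h, target_w
    # Use the largest scale that keeps both sides within the target.
    scale = min(target_h / height, target_w / width)
    return max(1, round(height * scale)), max(1, round(width * scale))


print(resize_shape(1080, 1920))
print(resize_shape(1080, 1920, preserve_aspect_ratio=True))
```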

Training the Model#

Use the following command to run RT-DETR training:

TRAIN_JOB_ID=$(tao-client rtdetr experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")


Checkpointing and Resuming Training

A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved every train.checkpoint_interval epochs. Checkpoints are written to train.results_dir, for example:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

Note

You may resume a previously aborted training job by setting the train.resume_training_checkpoint_path to the path of the intermediate checkpoint file. The checkpoint files must follow the model_epoch_*.pth or model_epoch*-EMA.pth format. You must use the *-EMA.pth file if your training spec has EMA enabled.

You can set this parameter by providing the corresponding flag on the command line.

TRAIN_JOB_ID=$(tao-client rtdetr job-resume --job $TRAIN_JOB_ID --action train --id $EXPERIMENT_ID --specs "$SPECS")

Training picks up the latest checkpoint found in the results directory, so to start a fresh training run from scratch, do one of the following:

  • Specify a new, empty results directory (Recommended), or

  • Remove the latest checkpoint from the results directory
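When choosing a value for train.resume_training_checkpoint_path, it can help to locate the newest checkpoint programmatically. The following Python sketch selects the highest-epoch file from a list of file names. It assumes EMA checkpoints are named model_epoch_<epoch_num>-EMA.pth, which is one reading of the pattern given above, and `latest_checkpoint_name` is a hypothetical helper.

```python
import re

# Matches model_epoch_004.pth and, by assumption, model_epoch_004-EMA.pth.
CHECKPOINT_PATTERN = re.compile(r"^model_epoch_(\d+)(-EMA)?\.pth$")


def latest_checkpoint_name(file_names, use_ema=False):
    """Return the checkpoint file name with the highest epoch, or None."""
    best_epoch, best_name = -1, None
    for name in file_names:
        match = CHECKPOINT_PATTERN.match(name)
        if match is None:
            continue
        is_ema = match.group(2) is not None
        if is_ema != use_ema:
            continue
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch, best_name = epoch, name
    return best_name


names = ["model_epoch_000.pth", "model_epoch_004.pth",
         "model_epoch_004-EMA.pth", "notes.txt"]
print(latest_checkpoint_name(names))
print(latest_checkpoint_name(names, use_ema=True))
```

In practice the list of names could come from os.listdir on the results directory.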

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | | ??? | | | | false |
| results_dir | string | | | | | | false |
| input_width | int | Width of the input image tensor. | | 1 | | | false |
| input_height | int | Height of the input image tensor. | | 1 | | | false |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy. | | | | | false |
| conf_threshold | float | The value of the confidence threshold to be used when filtering out the final list of boxes. | 0.0 | | | | false |

To run evaluation with an RT-DETR model, use this command:

EVAL_JOB_ID=$(tao-client rtdetr experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
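The conf_threshold setting filters the final list of boxes by score. A minimal Python sketch of that filtering step is shown below. The detection dictionary layout and the use of an inclusive comparison are assumptions, and `filter_detections` is a hypothetical helper.

```python
def filter_detections(detections, conf_threshold=0.0):
    """Keep detections whose score is at least conf_threshold."""
    return [det for det in detections if det["score"] >= conf_threshold]


detections = [
    {"label": "person", "score": 0.9, "box": [10, 20, 110, 220]},
    {"label": "car", "score": 0.3, "box": [200, 50, 400, 180]},
]
print(filter_detections(detections, conf_threshold=0.5))
```

With the evaluation default of 0.0 every box is kept, while a higher value such as 0.5 keeps only the more confident predictions.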

Running Inference with an RT-DETR Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  color_map:
    person: red
    car: blue

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | | ??? | | | | false |
| results_dir | string | | | | | | false |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with tao-deploy. | | | | | false |
| color_map | collection | Class-wise dictionary with colors to render boxes. | | | | | false |
| conf_threshold | float | The value of the confidence threshold to be used when filtering out the final list of boxes. | 0.5 | | | | false |
| is_internal | bool | Flag to render with internal directory structure. | false | | | | false |
| input_width | int | Width of the input image tensor. | 960 | 32 | | | false |
| input_height | int | Height of the input image tensor. | 544 | 32 | | | false |
| outline_width | int | Width in pixels of the bounding box outline. | 3 | 1 | | | false |

The inference tool for RT-DETR models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels for a directory of images.

INFER_JOB_ID=$(tao-client rtdetr experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
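For readers unfamiliar with KITTI labels, the sketch below formats a 2D detection as a KITTI-style label line in Python. It follows the standard KITTI field order, with the 3D fields set to zero and an optional trailing score. The exact fields and precision that TAO writes are not stated here, so treat this layout and the name `kitti_label_line` as assumptions.

```python
def kitti_label_line(class_name, box_xyxy, score=None):
    """Format one 2D detection as a KITTI-style label line."""
    x1, y1, x2, y2 = box_xyxy
    fields = [
        class_name,
        "0.00",  # truncated (unused for 2D detection)
        "0",     # occluded
        "0.00",  # alpha
        f"{x1:.2f}", f"{y1:.2f}", f"{x2:.2f}", f"{y2:.2f}",  # 2D box corners
        "0.00", "0.00", "0.00",  # 3D dimensions (unused)
        "0.00", "0.00", "0.00",  # 3D location (unused)
        "0.00",  # rotation_y (unused)
    ]
    if score is not None:
        fields.append(f"{score:.3f}")
    return " ".join(fields)


print(kitti_label_line("car", (10, 20, 110.5, 70), score=0.9))
```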

Distilling the Model#

distill#

The distill parameter defines the hyperparameters for the distillation process.

distill:
  teacher:
    backbone: convnext_large
    train_backbone: False
    num_queries: 300
    num_select: 300
    num_feature_levels: 3
    return_interm_indices:
    - 1
    - 2
    - 3
    feat_strides:
    - 8
    - 16
    - 32
    hidden_dim: 256
    nheads: 8
    dropout_ratio: 0.0
    enc_layers: 1
    dim_feedforward: 1024
    use_encoder_idx:
    - 2
    pe_temperature: 10000
    expansion: 1.0
    depth_mult: 1
    enc_act: gelu
    act: silu
    dec_layers: 6
    dn_number: 100
    feat_channels:
    - 256
    - 256
    - 256
    eval_idx: -1
    vfl_loss_coef: 1.0
    bbox_loss_coef: 5.0
    giou_loss_coef: 2.0
    alpha: 0.75
    gamma: 2.0
    clip_max_norm: 0.1
    aux_loss: true
    loss_types:
    - vfl
    - boxes
    backbone_names:
    - backbone.0
    linear_proj_names:
    - reference_points
    - sampling_offsets
  pretrained_teacher_model_path: /path/to/teacher/model_epoch_070.pth
  bindings:
  - teacher_module_name: 'srcs'
    student_module_name: 'srcs'
    criterion: IOU
    weight: 20

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| teacher | collection | Configurable parameters to construct the teacher model. (Same as the model config) | | | | | false |
| pretrained_teacher_model_path | string | Path to the pre-trained teacher model. | | | | | false |
| bindings | list dict | The list of bindings between teacher and student to use for calculating distill loss. teacher_module_name: the name of the teacher module; student_module_name: the name of the student module; criterion: the criterion to use for calculating binding loss (L1, L2, KL, IOU); weight: the value of the weight to use for the binding (default is 1.0). | | | | | false |

Note

We recommend using "IOU" as the criterion and "srcs" as both teacher_module_name and student_module_name for distillation. The total loss is total_loss = distillation_loss_coef * distillation_loss + other RT-DETR losses, where distillation_loss = sum(binding_loss).
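The loss combination in the note can be written out as a short Python sketch. It assumes that each binding's weight scales its loss before the binding losses are summed, which the note does not state explicitly. `total_distillation_loss` is a hypothetical helper operating on plain floats.

```python
def total_distillation_loss(binding_losses, binding_weights, other_losses,
                            distillation_loss_coef=1.0):
    """Combine weighted binding losses with the remaining detection losses."""
    # Assumption: each binding loss is scaled by its weight before summing.
    distillation_loss = sum(
        weight * loss for weight, loss in zip(binding_weights, binding_losses)
    )
    return distillation_loss_coef * distillation_loss + sum(other_losses)


# One binding with weight 20 (as in the example spec) plus two other losses.
print(total_distillation_loss([0.5], [20], [1.0, 2.0]))
```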

DISTILL_JOB_ID=$(tao-client rtdetr experiment-run-action --action distill --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

Exporting the Model#

export#

The export parameter defines the hyperparameters for the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 640
  input_height: 640
  batch_size: -1

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to the checkpoint file to run export. | ??? | | | | false |
| onnx_file | string | Path to the ONNX model file. | ??? | | | | false |
| on_cpu | bool | Flag to export CPU compatible model. | false | | | | false |
| input_channel | int | Number of channels in the input Tensor. | 3 | 3 | | | false |
| input_width | int | Width of the input image tensor. | 960 | 32 | | | false |
| input_height | int | Height of the input image tensor. | 544 | 32 | | | false |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine. | 17 | 1 | | | false |
| batch_size | int | The batch size of the input Tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | | | false |
| verbose | bool | Flag to enable verbose TensorRT logging. | false | | | | false |

Note

When you export an RT-DETR model with frozen_fm enabled, the .onnx file has a static batch size of 1.

EXPORT_JOB_ID=$(tao-client rtdetr experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

TensorRT Engine Generation, Validation, and INT8 Calibration#

For deployment, refer to the TAO Deploy documentation.

Deploying to DeepStream#

Refer to the Integrating a RT-DETR Model page for more information about deploying an RT-DETR model to DeepStream.