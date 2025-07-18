Use the following command to run Deformable DETR training:

tao model deformable_detr train [-h] -e <experiment_spec_file> [results_dir=<global_results_dir>] [model.<model_option>=<model_option_value>] [dataset.<dataset_option>=<dataset_option_value>] [train.<train_option>=<train_option_value>] [train.gpu_ids=<gpu indices>] [train.num_gpus=<number of gpus>]

The only required argument is the path to the experiment spec:

-e, --experiment_spec: The experiment specification file used to set up the training experiment.

You can set optional arguments to override the option values in the experiment spec file.
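As an illustration of how the overrides are passed, the following minimal sketch assembles the command line from a dictionary of dotted option names. The helper build_train_command, the spec path, and the compact list format ([0,1]) are assumptions made for this example and are not part of the TAO tooling.

```python
import shlex


def build_train_command(spec_path, overrides=None):
    """Return the argument list for a Deformable DETR training run.

    Hypothetical helper: it only formats key=value overrides after the
    required -e argument, following the usage line above.
    """
    argv = ["tao", "model", "deformable_detr", "train", "-e", spec_path]
    for key, value in (overrides or {}).items():
        # Assumption: list values are written compactly, e.g. train.gpu_ids=[0,1].
        if isinstance(value, (list, tuple)):
            value = "[" + ",".join(str(v) for v in value) + "]"
        argv.append(f"{key}={value}")
    return argv


if __name__ == "__main__":
    cmd = build_train_command(
        "/specs/train.yaml",  # hypothetical spec location
        {"results_dir": "/results", "train.num_gpus": 2, "train.gpu_ids": [0, 1]},
    )
    # shlex.join quotes any argument the shell would otherwise interpret.
    print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids shell surprises with the square brackets in list-valued overrides.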

Note: For training, evaluation, and inference, two variables are exposed for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 and gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (in this example, num_gpus is changed from 1 to 2).
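The reconciliation rule in the note can be sketched roughly as follows. This is not the TAO implementation; reconcile_gpus is a hypothetical function, and the way gpu_ids is extended when num_gpus is the larger setting (consecutive indices starting at 0) is an assumption, since the note only gives an example for the opposite case.

```python
def reconcile_gpus(num_gpus=1, gpu_ids=None):
    """Make num_gpus and gpu_ids agree, following the setting with more GPUs."""
    gpu_ids = [0] if gpu_ids is None else list(gpu_ids)
    if len(gpu_ids) > num_gpus:
        # Case from the note: num_gpus = 1, gpu_ids = [0, 1] -> num_gpus = 2.
        num_gpus = len(gpu_ids)
    elif num_gpus > len(gpu_ids):
        # Assumption: fill in consecutive device indices starting at 0.
        gpu_ids = list(range(num_gpus))
    return num_gpus, gpu_ids


if __name__ == "__main__":
    print(reconcile_gpus(1, [0, 1]))
    print(reconcile_gpus(2, [0]))
```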

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved as model_epoch_<epoch_num>.pth. Checkpoints are written to train.results_dir, like so:

$ ls /results/train
'model_epoch_000.pth' 'model_epoch_001.pth' 'model_epoch_002.pth' 'model_epoch_003.pth' 'model_epoch_004.pth'
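When scripting around these files, the naming pattern can be generated and parsed as in the sketch below. The three-digit zero padding is inferred from the listing above rather than from a documented guarantee, and both helper functions are hypothetical.

```python
import re

# Assumption: per-epoch checkpoints follow model_epoch_<digits>.pth,
# as in the directory listing above.
_CKPT_RE = re.compile(r"^model_epoch_(\d+)\.pth$")


def checkpoint_name(epoch):
    """Format a checkpoint file name with a zero-padded epoch number."""
    return f"model_epoch_{epoch:03d}.pth"


def latest_epoch_checkpoint(filenames):
    """Return the per-epoch checkpoint with the highest epoch number, or None."""
    best = None
    for name in filenames:
        match = _CKPT_RE.match(name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), name)
    return best[1] if best else None


if __name__ == "__main__":
    names = [checkpoint_name(e) for e in range(5)]
    print(latest_epoch_checkpoint(names))
```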

The latest checkpoint is also saved as dd_model_latest.pth. Training automatically resumes from dd_model_latest.pth if it exists in train.results_dir. If train.resume_training_checkpoint_path is provided, it takes precedence over dd_model_latest.pth.

As a consequence, to start a fresh training run from scratch, do one of the following:

Specify a new, empty results directory (recommended), or

Remove the latest checkpoint from the results directory.
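The resume behavior described above amounts to a simple precedence check, sketched here under stated assumptions. resolve_resume_checkpoint is a hypothetical function written for illustration; it mirrors the documented order (explicit path first, then dd_model_latest.pth, otherwise a fresh run) and is not taken from the TAO source.

```python
import tempfile
from pathlib import Path


def resolve_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint to resume from, or None for a fresh run."""
    # An explicitly provided checkpoint path takes precedence.
    if resume_training_checkpoint_path:
        return Path(resume_training_checkpoint_path)
    # Otherwise fall back to the latest checkpoint in the results directory.
    latest = Path(results_dir) / "dd_model_latest.pth"
    return latest if latest.is_file() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as results_dir:
        # Empty directory: nothing to resume, so training starts from scratch.
        print(resolve_resume_checkpoint(results_dir))
        (Path(results_dir) / "dd_model_latest.pth").write_bytes(b"")
        # The latest checkpoint now exists, so it is picked up automatically.
        print(resolve_resume_checkpoint(results_dir))
```

This is why pointing the run at a new, empty results directory is the simplest way to guarantee a fresh start: there is no dd_model_latest.pth for the check to find.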

Training Deformable DETR on a standard dataset such as COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM, as well as a large amount of CPU memory. This section outlines strategies for launching training with limited resources.

There are various ways to optimize GPU memory usage. The most obvious is to reduce dataset.batch_size, but this can make training take longer. We therefore recommend the following configurations to reduce GPU memory consumption.

Set train.precision to fp16 to enable automatic mixed precision training. This can reduce GPU memory usage by as much as 50%.

Set train.activation_checkpoint to True to enable activation checkpointing. Activations are recomputed during the backward pass instead of being cached in memory, which lowers memory usage at the cost of extra computation.

Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training. This shards model parameters, gradients, and optimizer states across processes, which helps reduce per-GPU memory usage.

Try using a more lightweight backbone such as gc_vit_xxtiny, or freeze the backbone by setting model.train_backbone to False.

Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
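The recommendations above can be collected into a spec fragment. The sketch below turns the dotted option names used in this section into a nested, YAML-style layout. The helpers nest and to_yaml are hypothetical, and the assumption that the spec file nests options exactly as the dotted names suggest should be checked against your own experiment spec. Freezing the backbone also changes what is trained, so include it only if that trade-off is acceptable.

```python
def nest(options):
    """Turn dotted keys such as 'train.precision' into nested dicts."""
    tree = {}
    for dotted, value in options.items():
        node = tree
        *parents, leaf = dotted.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


def to_yaml(tree, indent=0):
    """Render a nested dict as a list of YAML-style lines (scalars only)."""
    lines = []
    pad = "  " * indent
    for key, value in tree.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Option names and values are the ones named in the list above.
LOW_MEMORY_OPTIONS = {
    "train.precision": "fp16",
    "train.activation_checkpoint": True,
    "train.distributed_strategy": "fsdp",
    "model.train_backbone": False,
}

if __name__ == "__main__":
    print("\n".join(to_yaml(nest(LOW_MEMORY_OPTIONS))))
```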

To speed up data loading, it is common practice to set a high number of workers so that multiple processes are spawned. However, this can cause the system to run out of CPU memory if the annotation file is very large. We therefore recommend the following configurations to reduce CPU memory consumption.