Knowledge Distillation#

All deep neural network tasks supported by TAO provide a train command for training models. Training can run on one or more GPUs.

Knowledge distillation is a model compression technique in which a smaller, lightweight student model is trained to replicate the behavior of a larger, high-performing teacher model. By transferring knowledge from the teacher to the student, this approach enables efficient deployment of models in resource-constrained environments without a significant loss in accuracy.

The student model learns not only from the ground-truth labels but also from soft targets: the output probabilities that the teacher produces from its logits. These soft targets capture the teacher’s learned representations and subtle inter-class relationships, which can help the student generalize better than if it were trained on labeled data alone.
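As a minimal sketch of how hard labels and soft targets can be combined, the following pure-Python example mixes a cross-entropy term on the ground-truth label with a KL-divergence term between the teacher and student distributions. The function names, the temperature parameter, and the (1 - loss_lambda) / loss_lambda weighting are illustrative assumptions and are not taken from TAO's implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives softer targets."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      loss_lambda=0.5, temperature=2.0):
    """Weighted sum of a hard-label term and a soft-target term (assumed weighting)."""
    # Hard term: cross-entropy of the student against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: how far the student's softened distribution is from the teacher's.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1.0 - loss_lambda) * hard + loss_lambda * soft


agree = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2)
disagree = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], label=2)
```

When the student and teacher logits agree, the soft term vanishes and only the hard-label term remains, so `agree` is smaller than `disagree`.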

In addition to output-based distillation (using logits), feature distillation is another common strategy, in which the student is encouraged to match intermediate feature representations from the teacher. This allows the student to learn richer internal representations, often leading to improved performance on complex tasks.
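Feature matching can be sketched in a few lines. The example below assumes that student features are first mapped into the teacher's feature dimension with a linear projection and then compared with a mean squared error; the function names and the choice of a plain linear map with MSE are assumptions for illustration, not a description of TAO's loss.

```python
def project(features, weight):
    """Apply a linear map to one feature vector; each row of `weight` is one output channel."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def feature_distillation_loss(student_feats, teacher_feats, weight):
    """Mean squared error between projected student features and teacher features."""
    total, count = 0.0, 0
    for student_vec, teacher_vec in zip(student_feats, teacher_feats):
        projected = project(student_vec, weight)
        total += sum((p - t) ** 2 for p, t in zip(projected, teacher_vec))
        count += len(teacher_vec)
    return total / count


# A 2-channel student feature projected into a 3-channel teacher space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matched = feature_distillation_loss([[1.0, 2.0]], [[1.0, 2.0, 3.0]], weight)
mismatched = feature_distillation_loss([[1.0, 2.0]], [[1.0, 2.0, 5.0]], weight)
```

In training, the projection weights would be learned jointly with the student so that the loss can be driven down even when the two feature spaces have different widths.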

Knowledge distillation is commonly used in scenarios where fast inference, low memory usage, or deployment on edge devices is critical.

Tips and Best Practices#

When applying knowledge distillation in practice:

Given a downstream task, we recommend that you plug in the teacher backbone and fine-tune it on the downstream data first. If the model performs well with the teacher, use the fine-tuned teacher to distill a student model that fits your compute budget.
If the teacher is ViT-based and the student is ConvNet-based, the student may struggle to learn from the teacher. Distilling from a ViT teacher to a ViT student, or from a ConvNet teacher to a ConvNet or ViT student, generally yields better results. In other words, if the student must be a ConvNet, it’s better to use a ConvNet teacher.
If the student is ViT-based, consider starting with RADIO models as teachers. For image or video classification tasks, CLIP models may be more effective. For instance-level recognition or segmentation, MAE, ConvNeXtV2, or DINOv2 are strong candidates.
Choose the student model architecture based on your target compute budget. Keep in mind that smaller student models often require more training data to optimize effectively.
If training data is limited, try increasing the number of training epochs and applying more aggressive data augmentations to improve generalization.

TAO now supports knowledge distillation for several networks:

Feature distillation for object detection with RT-DETR
Backbone logits distillation over structured and unstructured data for image classification
Logits distillation for object detection with DINO

As of 6.25.09, TAO introduces spatial feature distillation and Phi-Standardization (PHI-S) in the distillation loss. PHI-S standardizes the teacher model’s feature maps to improve distillation performance.

TAO has also unified the backbone implementation for classification_pyt and all downstream tasks. This allows teacher backbones from downstream-trained models to be distilled into lighter student backbones supported by those tasks.

When choosing the student backbone to distill to, make sure the downstream task supports it. The following is an exhaustive list of options for distill.teacher.backbone.type:

Module	Supported backbones
classification_pyt	faster_vit_0_224 faster_vit_1_224 faster_vit_2_224 faster_vit_3_224 faster_vit_4_224 faster_vit_5_224 faster_vit_6_224 faster_vit_4_21k_224 faster_vit_4_21k_384 faster_vit_4_21k_512 faster_vit_4_21k_768 fan_tiny_12_p16_224 fan_small_12_p16_224_se_attn fan_small_12_p16_224 fan_base_18_p16_224 fan_large_24_p16_224 fan_tiny_8_p4_hybrid fan_small_12_p4_hybrid fan_base_16_p4_hybrid fan_large_16_p4_hybrid fan_swin_tiny_patch4_window7_224 fan_swin_small_patch4_window7_224 fan_swin_base_patch4_window7_224 fan_swin_large_patch4_window7_224 vit_large_patch14_dinov2_swiglu vit_large_patch14_dinov2_swiglu_legacy vit_giant_patch14_reg4_dinov2_swiglu vit_base_patch16 vit_large_patch16 vit_huge_patch14 efficientvit_b0 efficientvit_b1 efficientvit_b2 efficientvit_b3 efficientvit_l0 efficientvit_l1 efficientvit_l2 efficientvit_l3 convnext_tiny convnext_small convnext_base convnext_large convnext_xlarge convnextv2_atto convnextv2_femto convnextv2_pico convnextv2_nano convnextv2_tiny convnextv2_base convnextv2_large convnextv2_huge hiera_tiny_224 hiera_small_224 hiera_base_224 hiera_base_plus_224 hiera_large_224 hiera_huge_224 resnet_18 resnet_34 resnet_50 resnet_101 resnet_152 resnet_18d resnet_34d resnet_50d resnet_101d resnet_152d swin_tiny_patch4_window7_224 swin_small_patch4_window7_224 swin_base_patch4_window7_224 swin_large_patch4_window7_224 swin_base_patch4_window12_384 swin_large_patch4_window12_384 gc_vit_xxtiny gc_vit_xtiny gc_vit_tiny gc_vit_small gc_vit_base gc_vit_large gc_vit_base_384 gc_vit_large_384 edgenext_xx_small edgenext_x_small edgenext_small edgenext_base edgenext_xx_small_bn_hs edgenext_x_small_bn_hs edgenext_small_bn_hs c_radio_p1_vit_huge_patch16_mlpnorm c_radio_p2_vit_huge_patch16_mlpnorm c_radio_p3_vit_huge_patch16_mlpnorm c_radio_v2_vit_base_patch16 c_radio_v2_vit_large_patch16 c_radio_v2_vit_huge_patch16 c_radio_v3_vit_large_patch16_reg4_dinov2 c_radio_v3_vit_base_patch16_reg4_dinov2 c_radio_v3_vit_huge_patch16_reg4_dinov2 mit_b0 mit_b1 mit_b2 mit_b3 mit_b4 mit_b5 vit_l_14_siglip_clipa_224 vit_l_14_siglip_clipa_336 vit_h_14_siglip_clipa_224
dino	resnet_34 resnet_50 fan_tiny fan_small fan_base fan_large swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224) swin_base_224_22k (alias: swin_base_patch4_window7_224) swin_base_384_22k (alias: swin_base_patch4_window12_384) swin_large_224_22k (alias: swin_large_patch4_window7_224) swin_large_384_22k (alias: swin_large_patch4_window12_384) efficientvit_b0 efficientvit_b1 efficientvit_b2 efficientvit_b3 vit_large_nvdinov2 vit_large_dinov2
mal	ViT family (arch strings with `vit`; patch sizes 8/14/16; sizes tiny/small/base/large/huge) fan_tiny_12_p16_224 fan_small_12_p16_224 fan_base_18_p16_224 fan_large_24_p16_224 fan_tiny_8_p4_hybrid fan_small_12_p4_hybrid fan_base_16_p4_hybrid fan_large_16_p4_hybrid
grounding_dino	resnet_50 swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224) swin_base_224_22k (alias: swin_base_patch4_window7_224) swin_base_384_22k (alias: swin_base_patch4_window12_384) swin_large_224_22k (alias: swin_large_patch4_window7_224) swin_large_384_22k (alias: swin_large_patch4_window12_384)
mask_grounding_dino	resnet_50 swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224) swin_base_224_22k (alias: swin_base_patch4_window7_224) swin_base_384_22k (alias: swin_base_patch4_window12_384) swin_large_224_22k (alias: swin_large_patch4_window7_224) swin_large_384_22k (alias: swin_large_patch4_window12_384)
rtdetr	resnet_18 resnet_34 resnet_50 resnet_101 convnext_tiny convnext_small convnext_base convnext_large convnext_xlarge convnextv2_atto convnextv2_femto convnextv2_pico convnextv2_nano convnextv2_tiny convnextv2_small convnextv2_base convnextv2_large convnextv2_huge efficientvit_b0 efficientvit_b1 efficientvit_b2 efficientvit_b3 efficientvit_l0 efficientvit_l1 efficientvit_l2 efficientvit_l3 fan_tiny_8_p4_hybrid fan_small_12_p4_hybrid fan_base_12_p4_hybrid fan_large_12_p4_hybrid edgenext_x_small edgenext_small edgenext_base edgenext_xx_small_bn_hs edgenext_x_small_bn_hs edgenext_small_bn_hs
segformer	fan_tiny_8_p4_hybrid fan_small_12_p4_hybrid fan_base_16_p4_hybrid fan_large_16_p4_hybrid mit_b0 mit_b1 mit_b2 mit_b3 mit_b4 mit_b5 vit_large_nvdinov2 vit_giant_nvdinov2 vit_base_nvclip_16_siglip vit_huge_nvclip_14_sig c_radio_v2_vit_huge_patch16_224 c_radio_v2_vit_large_patch16_224 c_radio_v2_vit_base_patch16_224 c_radio_v3_vit_large_patch16_reg4_dinov2
visual_changenet	fan_tiny_8_p4_hybrid fan_small_12_p4_hybrid fan_base_16_p4_hybrid fan_large_16_p4_hybrid vit_large_nvdinov2 vit_large_dinov2 c_radio_p1_vit_huge_patch16_224_mlpnorm c_radio_p2_vit_huge_patch16_224_mlpnorm c_radio_p3_vit_huge_patch16_224_mlpnorm c_radio_v2_vit_huge_patch16_224 c_radio_v2_vit_large_patch16_224 c_radio_v2_vit_base_patch16_224

Note

When using a downstream model as the teacher, make sure to set num_classes to 0 and mode to spatial in the distill config.

For task-specific details, see the distillation section of the documentation for each network:

rtdetr
classification_pyt
dino

Multi-Teacher Distillation (RADIO)#

Note

The RADIO multi-teacher distillation pipeline is experimental in TAO 7.0.1. Config field names and defaults may change in future releases.

RADIO (Reduce All Domains Into One) is an agglomerative distillation pipeline: a single student backbone is trained to match, simultaneously, the representations of several foundation-model teachers (for example, DINOv2/DINOv3, CLIP/SigLIP, and SAM). The resulting student inherits the complementary strengths of each teacher in one set of weights, which can then be used as a unified backbone for downstream tasks.

The pipeline is exposed through the radio console command, with the single subtask distill:

radio distill -e /path/to/distill.yaml results_dir=/path/to/output

Unlike the single-teacher distillation in the other tasks, the distill.teacher field accepts a list of teacher configurations. Each teacher can declare its own loss, mode, input resolution, and image normalization, while the top-level distill fields provide the global defaults that are used whenever a per-teacher field is left unset.
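The fallback from per-teacher fields to the global distill fields can be pictured as a simple dictionary merge. The sketch below is a hypothetical helper, not TAO code; it assumes that loss_type, loss_lambda, and mode are the fields with global counterparts and that an unset field is represented as a missing key or None.

```python
# Fields assumed to exist both globally and per teacher.
GLOBAL_KEYS = ("loss_type", "loss_lambda", "mode")


def resolve_teacher(global_cfg, teacher_cfg):
    """Return a copy of teacher_cfg with unset fields filled from the global defaults."""
    resolved = dict(teacher_cfg)
    for key in GLOBAL_KEYS:
        if resolved.get(key) is None:
            resolved[key] = global_cfg[key]
    return resolved


distill = {
    "loss_type": "BALANCED",
    "loss_lambda": 0.5,
    "mode": "auto",
    "teacher": [
        {"mode": "spatial", "input_size": 432},  # overrides only the mode
        {"mode": "summary", "loss_type": "L2"},  # overrides mode and loss
    ],
}

resolved = [resolve_teacher(distill, teacher) for teacher in distill["teacher"]]
```

Each resolved entry keeps its own overrides, and every field that was left out inherits the top-level value.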

Data Pipeline and the Teacher Mixture#

RADIO trains over large, unlabeled image collections supplied as WebDataset-style tar shards. Each entry in dataset.train_dataset.tar_data_sources is a separate data source with its own root_dir, samples_per_file, steps_per_epoch, and scale_factor. The scale_factor values act as unnormalized sampling weights: a deterministic deficit-round-robin scheduler (the mixture emitter) draws from each source so that, within any short window, the realized proportions closely track the requested weights. This lets you blend heterogeneous corpora at fixed ratios without manually interleaving the shards.
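To illustrate how a deterministic scheduler can keep realized proportions close to the requested weights, here is a minimal credit-based sketch in the spirit of deficit round robin. It is not the TAO mixture emitter: the function name, the credit bookkeeping, and the tie-breaking rule are assumptions chosen for clarity.

```python
from collections import Counter


def mixture_schedule(scale_factors, num_steps):
    """Yield source indices so that running proportions track the given weights."""
    total = sum(scale_factors)
    credit = [0.0] * len(scale_factors)
    for _ in range(num_steps):
        # Every source earns credit in proportion to its weight.
        for index, weight in enumerate(scale_factors):
            credit[index] += weight
        # Serve the source with the most accumulated credit (ties go to the lowest index).
        chosen = max(range(len(credit)), key=lambda index: credit[index])
        # Serving a source costs it one full round of credit.
        credit[chosen] -= total
        yield chosen


order = list(mixture_schedule([2.0, 1.0], 300))
counts = Counter(order)
```

With weights 2.0 and 1.0 the schedule repeats the pattern 0, 1, 0, so every window of three draws already holds the 2:1 ratio, with no randomness involved.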

Preparing the tar shards#

Each root_dir is scanned recursively for *.tar files (the glob {root_dir}/**/*.tar), and every tar file is a standard WebDataset shard. Within a shard, files are grouped into samples by their basename (the part of the filename before the first dot); all files that share a basename form one sample, and the basename becomes the sample’s __key__. A minimal sample is a single image file. The recognized image extensions are jpg, jpeg, png, gif, webp, bmp, tiff, and img; samples that contain none of these are skipped. RADIO is self-supervised, so no label or class files are required in the shards.

A typical shard listing therefore looks like:

shard-000000.tar
  ├── 0000000.jpg
  ├── 0000001.jpg
  ├── 0000002.jpg
  └── ...
shard-000001.tar
  └── ...
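The basename grouping rule described above can be sketched as follows. This is a simplified illustration rather than the loader's actual code; the helper names are hypothetical, and keeping any directory prefix as part of the key is an assumption.

```python
import posixpath

IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "webp", "bmp", "tiff", "img"}


def split_key(member_name):
    """Split a tar member name into (key, extension) at the first dot of the filename."""
    directory, filename = posixpath.split(member_name)
    stem, _, extension = filename.partition(".")
    key = posixpath.join(directory, stem) if directory else stem
    return key, extension.lower()


def group_samples(member_names):
    """Group consecutive members that share a key; drop samples without an image."""
    samples = []
    current_key, current_exts = None, []
    for name in member_names:
        key, extension = split_key(name)
        if key != current_key:
            if current_key is not None:
                samples.append((current_key, current_exts))
            current_key, current_exts = key, []
        current_exts.append(extension)
    if current_key is not None:
        samples.append((current_key, current_exts))
    # A sample is kept only if at least one file ends in a recognized image extension.
    return [
        (key, exts) for key, exts in samples
        if any(ext.rsplit(".", 1)[-1] in IMAGE_EXTENSIONS for ext in exts)
    ]


names = ["0000000.jpg", "0000000.json", "0000001.png", "0000002.txt"]
samples = group_samples(names)
```

Because grouping only looks at consecutive members, files that belong to one sample must be adjacent in the tar archive.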

An optional per-sample .json sidecar (for example 0000000.json) is read for deduplication and captioning metadata only: a sha256, md5, or uid field is used as the sample’s de-duplication key (otherwise a hash of the image bytes is used), and a caption field, if present, is exposed as text. These fields are optional and do not affect distillation targets.
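A minimal sketch of the key selection might look like the following. The function name, the sha256, md5, uid lookup order, and the use of SHA-256 for the image-bytes fallback are all assumptions; the text above only states which fields are consulted and that a hash of the image bytes is the fallback.

```python
import hashlib
import json


def dedup_key(image_bytes, sidecar_json=None):
    """Pick a de-duplication key from sidecar metadata, else hash the image bytes."""
    if sidecar_json:
        metadata = json.loads(sidecar_json)
        # Assumed priority order among the supported metadata fields.
        for field in ("sha256", "md5", "uid"):
            value = metadata.get(field)
            if value:
                return str(value)
    # Fallback: a content hash (SHA-256 is an assumption for this sketch).
    return hashlib.sha256(image_bytes).hexdigest()


from_sidecar = dedup_key(b"raw image bytes", '{"md5": "abc123", "caption": "a cat"}')
from_bytes = dedup_key(b"raw image bytes", '{"caption": "a cat"}')
```

Two samples with identical image bytes and no identifying metadata receive the same key, so one of them can be dropped as a duplicate.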

Shards can be built with the webdataset Python package or with the tar command, as long as files belonging to one sample are written consecutively and share a basename. For example, to pack a directory of JPEGs into 10,000-image shards with the WebDataset writer:

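import glob

import webdataset as wds

# Collect the source images in a stable order.
paths = sorted(glob.glob("/raw/images/**/*.jpg", recursive=True))

# maxcount=10000 starts a new shard after every 10,000 samples.
with wds.ShardWriter("/datasets/corpus_a/shard-%06d.tar", maxcount=10000) as sink:
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            # The zero-padded index is the shared basename (__key__) of the sample.
            sink.write({"__key__": f"{i:07d}", "jpg": f.read()})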

Set samples_per_file to the number of samples written per shard (10000 above) and steps_per_epoch to the number of training steps to draw from that source per epoch. steps_per_epoch defines the epoch length (it does not have to equal the true sample count); together with the global batch size it controls how many samples a source contributes each epoch, independent of scale_factor, which controls the relative mixing ratio between sources.
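The relationship between steps_per_epoch and the number of samples drawn can be checked with simple arithmetic. The sketch below assumes that the global batch size is the per-GPU batch size multiplied by the number of GPUs; the helper names and the numeric values are illustrative.

```python
def samples_per_epoch(steps_per_epoch, batch_size_per_gpu, num_gpus=1):
    """Samples one source contributes per epoch (assumes global batch = per-GPU batch x GPUs)."""
    global_batch_size = batch_size_per_gpu * num_gpus
    return steps_per_epoch * global_batch_size


def shards_worth(samples, samples_per_file):
    """Number of full-size shards needed to hold that many samples (ceiling division)."""
    return -(-samples // samples_per_file)


drawn = samples_per_epoch(steps_per_epoch=10000, batch_size_per_gpu=32, num_gpus=1)
shards = shards_worth(drawn, samples_per_file=10000)
```

Here 10,000 steps at a global batch size of 32 draw 320,000 samples per epoch, which is 32 shards' worth of data at 10,000 samples per shard.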

An optional native_resolution_filter drops samples whose decoded native resolution is too small or too extreme before resize and crop, so that high-resolution crop training does not upsample low-resolution originals. The filter exposes four independent thresholds, each applied only when set: min_short_side, min_long_side, and min_area reject images that are too small, while max_aspect_ratio rejects images that are too elongated (long side divided by short side). There are no maximum short-side, long-side, or area limits and no minimum aspect-ratio limit.
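The four thresholds can be read as one predicate over the decoded width and height. The following sketch is illustrative only: the function name is hypothetical, and treating each threshold as inclusive (an image exactly at the limit passes) is an assumption.

```python
def passes_native_resolution_filter(width, height, min_short_side=None,
                                    min_long_side=None, min_area=None,
                                    max_aspect_ratio=None):
    """Return True if an image of positive size satisfies every threshold that is set."""
    short_side, long_side = min(width, height), max(width, height)
    if min_short_side is not None and short_side < min_short_side:
        return False  # too small along the shorter edge
    if min_long_side is not None and long_side < min_long_side:
        return False  # too small along the longer edge
    if min_area is not None and width * height < min_area:
        return False  # too few pixels overall
    if max_aspect_ratio is not None and long_side / short_side > max_aspect_ratio:
        return False  # too elongated
    return True


keep_landscape = passes_native_resolution_filter(640, 480, min_short_side=224)
keep_thumbnail = passes_native_resolution_filter(200, 1000, min_short_side=224)
keep_banner = passes_native_resolution_filter(300, 1500, max_aspect_ratio=4.0)
```

With these settings the 640x480 image is kept, the 200x1000 image is dropped for its short side, and the 300x1500 image is dropped for its 5:1 aspect ratio.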

Key Configuration Fields#

The student backbone is configured under model.backbone and the unlabeled data under dataset, exactly as in the other distillation tasks. The multi-teacher behavior is controlled by the distill section:

Field	Description
`distill.teacher`	A list of teacher configurations. Each entry distills into the student in parallel.
`distill.loss_type`	Global distillation loss used when a teacher does not set its own. One of `KL`, `CE`, `L1`, `L2`, `FD`, `CS`, `BALANCED`, `MSE`.
`distill.loss_lambda`	Global weight applied to the distillation loss (`> 0.0` and `<= 1.0`).
`distill.mode`	Global distillation mode: `logits`, `summary`, `spatial`, `auto`, or `combo`.
`distill.use_mlp`, `distill.mlp_hidden_size`, `distill.mlp_num_inner`	Configure the MLP projection head that maps student features into each teacher’s space.
`distill.vitdet`	Optional ViTDet windowed-attention augmentation applied to the student during training. Set `prob > 0` and provide `window_sizes` to enable; acts as a regularizer and can reduce training memory.
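The allowed values in the table above lend themselves to a quick sanity check before a run is launched. The helper below is a hypothetical pre-flight check written for this illustration; it is not part of TAO, which performs its own config validation.

```python
LOSS_TYPES = {"KL", "CE", "L1", "L2", "FD", "CS", "BALANCED", "MSE"}
MODES = {"logits", "summary", "spatial", "auto", "combo"}


def validate_distill(cfg):
    """Return a list of human-readable problems found in the global distill fields."""
    errors = []
    if cfg.get("loss_type") not in LOSS_TYPES:
        errors.append(f"loss_type must be one of {sorted(LOSS_TYPES)}")
    loss_lambda = cfg.get("loss_lambda")
    # loss_lambda must satisfy 0.0 < loss_lambda <= 1.0.
    if not isinstance(loss_lambda, (int, float)) or not (0.0 < loss_lambda <= 1.0):
        errors.append("loss_lambda must be > 0.0 and <= 1.0")
    if cfg.get("mode") not in MODES:
        errors.append(f"mode must be one of {sorted(MODES)}")
    return errors


ok = validate_distill({"loss_type": "BALANCED", "loss_lambda": 0.5, "mode": "auto"})
bad = validate_distill({"loss_type": "huber", "loss_lambda": 0.0, "mode": "auto"})
```

An empty list means the three global fields are within the documented ranges; each string in a non-empty list names one field to fix.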

Each entry in distill.teacher accepts the following per-teacher fields (any field left unset falls back to the corresponding global distill value):

Per-teacher field	Description
`model.backbone.type`	Teacher backbone architecture (for example a `c_radio_*`, DINOv2, or CLIP/SigLIP variant).
`pretrained_teacher_model_path`	Path to a local checkpoint holding the teacher’s pre-trained weights. This is a filesystem path inside the container, not a Hugging Face model id or URL; the file is loaded with `torch.load` (or read directly if it ends in `.safetensors`). See the note below on obtaining these weights.
`loss_type` / `loss_lambda`	Per-teacher overrides for this teacher’s distillation loss and its weight.
`mode`	Per-teacher distillation mode (`logits`, `summary`, `spatial`, `auto`, `combo`).
`input_size` / `patch_size`	Teacher input resolution and ViT patch size.
`match_student_resolution`	If true, resize teacher inputs to match the student resolution.
`stochastic_resolutions`	Map of `resolution: probability` for per-sample stochastic input resizing.
`norm_mean` / `norm_std`	Per-teacher image normalization (for example ImageNet/DINOv3 vs. SigLIP/SAM values). If empty, the dataset augmentation mean/std are used.
`summary_loss_weight` / `summary_loss_type` / `fd_loss_weight`	In `combo` mode, weight the summary (CLS) loss and the spatial/feature loss separately.
`spatial_mlp_version` / `spatial_num_inner`	Spatial projection head type (`v2` or `attn`) and optional inner-block override.
`adaptor` / `upsampler_checkpoint` / `do_upsample`	Optionally wrap the teacher with a pre-trained FeatSharp upsampler (`adaptor: featsharp`) to produce high-resolution spatial targets.
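To make the stochastic_resolutions map concrete, the sketch below draws one resolution per sample according to the configured probabilities. How TAO performs the draw internally is not specified here; the function name, the example probabilities, and the use of random.choices are assumptions.

```python
import random


def sample_resolution(stochastic_resolutions, rng):
    """Draw one input resolution according to a {resolution: probability} map."""
    resolutions = list(stochastic_resolutions)
    weights = [stochastic_resolutions[resolution] for resolution in resolutions]
    return rng.choices(resolutions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
table = {224: 0.5, 432: 0.3, 512: 0.2}
draws = [sample_resolution(table, rng) for _ in range(1000)]
```

Over many samples the share of each resolution approaches its configured probability, so the teacher sees a controlled mix of input sizes.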

Note

pretrained_teacher_model_path (and upsampler_checkpoint) must point to local checkpoint files that you have downloaded ahead of time and mounted into the container; the pipeline does not fetch weights from a model hub at runtime. Obtain the teacher weights from the foundation model’s own distribution (for example a Hugging Face repository for DINOv2/DINOv3 or SigLIP, or NGC for the C-RADIO backbones), convert them to a PyTorch state_dict checkpoint (a .pth/.pt file, or a .safetensors file), and reference that local file. The example paths below (for example /weights/teacher_dinov2.pth) are placeholders for such pre-downloaded checkpoints; substitute the paths to your own files.

Example Invocation#

The following spec distills a single student backbone from two teachers, blending two tar data sources at a 2:1 ratio:

results_dir: /workspace/radio_multiteacher
model:
  backbone:
    type: c_radio_v3_vit_base_patch16_reg4_dinov2
dataset:
  num_classes: 0
  img_size: 224
  batch_size: 32
  workers: 8
  train_dataset:
    tar_data_sources:
      - root_dir: /datasets/corpus_a
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 2.0
      - root_dir: /datasets/corpus_b
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 1.0
    data_weight_mode: inv_frequency
    native_resolution_filter:
      min_short_side: 224
train:
  num_gpus: 1
  num_epochs: 50
  precision: bf16
  optim:
    optim: adamw
    lr: 0.00006
    policy: cosine
distill:
  loss_type: BALANCED
  loss_lambda: 0.5
  mode: auto
  teacher:
    - model:
        backbone:
          type: c_radio_v3_vit_huge_patch16_reg4_dinov2
      pretrained_teacher_model_path: /weights/teacher_dinov2.pth
      mode: spatial
      input_size: 432
      norm_mean: [0.485, 0.456, 0.406]
      norm_std: [0.229, 0.224, 0.225]
    - model:
        backbone:
          type: vit_h_14_siglip_clipa_224
      pretrained_teacher_model_path: /weights/teacher_siglip.pth
      mode: summary
      norm_mean: [0.5, 0.5, 0.5]
      norm_std: [0.5, 0.5, 0.5]

Run the distillation with:

radio distill -e /workspace/specs/distill.yaml results_dir=/workspace/radio_multiteacher