Knowledge Distillation

Every deep neural network task supported by TAO provides a train command for training models on one or more GPUs.

Knowledge distillation is a model compression technique in which a smaller, lightweight student model is trained to replicate the behavior of a larger, high-performing teacher model. By transferring knowledge from the teacher to the student, this approach enables efficient deployment of models in resource-constrained environments without a significant loss in accuracy.

The student model learns not only from the ground truth labels but also from soft targets: the class probabilities that the teacher produces from its output logits. These soft targets capture the teacher’s learned representations and subtle inter-class relationships, which can help the student generalize better than if it were trained on labeled data alone.
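As a minimal sketch of how ground-truth labels and soft targets can be combined, the following plain-Python example implements the widely used temperature-scaled formulation. The names distillation_loss, temperature, and alpha are illustrative assumptions and are not TAO config fields; the losses TAO actually applies are selected through its own distill settings and may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a hard-label loss with a soft-target loss (illustrative only)."""
    # Hard-label term: cross-entropy against the ground truth class index.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL(teacher || student) on softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)
    # The temperature**2 factor keeps the soft-term gradients on a comparable scale.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft


loss = distillation_loss([1.0, 2.0, 0.5], [0.8, 2.5, 0.1], label=1)
```

When the student reproduces the teacher’s logits exactly, the soft-target term is zero and only the hard-label term remains.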

In addition to output-based distillation (using logits), feature distillation is another common strategy, in which the student is encouraged to match intermediate feature representations from the teacher. This allows the student to learn richer internal representations, often leading to improved performance on complex tasks.
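The idea of matching intermediate features can be sketched as a mean-squared error between teacher features and linearly projected student features. This is a simplified illustration in plain Python: the helper names, the single linear projection, and the choice of MSE are assumptions, not a description of TAO’s implementation.

```python
def project(features, weight):
    """Linearly map student features (N x Ds) into teacher space (N x Dt)."""
    rows, cols = len(weight), len(weight[0])
    return [[sum(f[k] * weight[k][j] for k in range(rows)) for j in range(cols)]
            for f in features]


def feature_distillation_loss(student_feats, teacher_feats, weight):
    """Mean-squared error between projected student and teacher features."""
    projected = project(student_feats, weight)
    squared = [(p - t) ** 2
               for p_row, t_row in zip(projected, teacher_feats)
               for p, t in zip(p_row, t_row)]
    return sum(squared) / len(squared)


# One 2-dimensional student feature projected into a 3-dimensional teacher space.
weight = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
loss = feature_distillation_loss([[1.0, 2.0]], [[1.0, 2.0, 5.0]], weight)
```

The projection is needed because student and teacher feature widths usually differ; in practice its weights would be learned jointly with the student.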

Knowledge distillation is commonly used in scenarios where fast inference, low memory usage, or deployment on edge devices is critical.

Tips and Best Practices

When applying knowledge distillation in practice:

  • Given a downstream task, we recommend that you plug in the teacher backbone and fine-tune it on the downstream data first. If the model performs well with the teacher, use the fine-tuned teacher to distill a student model that fits your compute budget.

  • If the teacher is ViT-based and the student is ConvNet-based, the student may struggle to learn from the teacher. ViT-to-ViT, ConvNet-to-ConvNet, or ConvNet-to-ViT distillation generally yields better results. In other words, if the student must be a ConvNet, use a ConvNet teacher.

  • If the student is ViT-based, consider starting with RADIO models as teachers. For image or video classification tasks, CLIP models may be more effective. For instance-level recognition or segmentation, MAE, ConvNeXtV2, or DINOv2 are strong candidates.

  • Choose the student model architecture based on your target compute budget. Keep in mind that smaller student models often require more training data to optimize effectively.

  • If training data is limited, try increasing the number of training epochs and applying more aggressive data augmentations to improve generalization.

TAO now supports knowledge distillation for several networks:

  • Feature distillation for object detection with RT-DETR

  • Backbone logits distillation over structured and unstructured data for image classification

  • Logits distillation for object detection with DINO

As of 6.25.09, TAO supports spatial feature distillation and Phi-Standardization (PHI-S) in the distillation loss. PHI-S standardizes the teacher model’s feature maps to improve distillation performance.

TAO has also unified the backbone implementation for classification_pyt and all the downstream tasks. This allows teacher backbones from models trained on downstream tasks to be distilled into lighter student backbones supported by those tasks.

When choosing the student backbone to distill to, make sure the downstream task supports it. The following is an exhaustive list of options for distill.teacher.backbone.type:

Module

Supported backbones

classification_pyt

  • faster_vit_0_224

  • faster_vit_1_224

  • faster_vit_2_224

  • faster_vit_3_224

  • faster_vit_4_224

  • faster_vit_5_224

  • faster_vit_6_224

  • faster_vit_4_21k_224

  • faster_vit_4_21k_384

  • faster_vit_4_21k_512

  • faster_vit_4_21k_768

  • fan_tiny_12_p16_224

  • fan_small_12_p16_224_se_attn

  • fan_small_12_p16_224

  • fan_base_18_p16_224

  • fan_large_24_p16_224

  • fan_tiny_8_p4_hybrid

  • fan_small_12_p4_hybrid

  • fan_base_16_p4_hybrid

  • fan_large_16_p4_hybrid

  • fan_swin_tiny_patch4_window7_224

  • fan_swin_small_patch4_window7_224

  • fan_swin_base_patch4_window7_224

  • fan_swin_large_patch4_window7_224

  • vit_large_patch14_dinov2_swiglu

  • vit_large_patch14_dinov2_swiglu_legacy

  • vit_giant_patch14_reg4_dinov2_swiglu

  • vit_base_patch16

  • vit_large_patch16

  • vit_huge_patch14

  • efficientvit_b0

  • efficientvit_b1

  • efficientvit_b2

  • efficientvit_b3

  • efficientvit_l0

  • efficientvit_l1

  • efficientvit_l2

  • efficientvit_l3

  • convnext_tiny

  • convnext_small

  • convnext_base

  • convnext_large

  • convnext_xlarge

  • convnextv2_atto

  • convnextv2_femto

  • convnextv2_pico

  • convnextv2_nano

  • convnextv2_tiny

  • convnextv2_base

  • convnextv2_large

  • convnextv2_huge

  • hiera_tiny_224

  • hiera_small_224

  • hiera_base_224

  • hiera_base_plus_224

  • hiera_large_224

  • hiera_huge_224

  • resnet_18

  • resnet_34

  • resnet_50

  • resnet_101

  • resnet_152

  • resnet_18d

  • resnet_34d

  • resnet_50d

  • resnet_101d

  • resnet_152d

  • swin_tiny_patch4_window7_224

  • swin_small_patch4_window7_224

  • swin_base_patch4_window7_224

  • swin_large_patch4_window7_224

  • swin_base_patch4_window12_384

  • swin_large_patch4_window12_384

  • gc_vit_xxtiny

  • gc_vit_xtiny

  • gc_vit_tiny

  • gc_vit_small

  • gc_vit_base

  • gc_vit_large

  • gc_vit_base_384

  • gc_vit_large_384

  • edgenext_xx_small

  • edgenext_x_small

  • edgenext_small

  • edgenext_base

  • edgenext_xx_small_bn_hs

  • edgenext_x_small_bn_hs

  • edgenext_small_bn_hs

  • c_radio_p1_vit_huge_patch16_mlpnorm

  • c_radio_p2_vit_huge_patch16_mlpnorm

  • c_radio_p3_vit_huge_patch16_mlpnorm

  • c_radio_v2_vit_base_patch16

  • c_radio_v2_vit_large_patch16

  • c_radio_v2_vit_huge_patch16

  • c_radio_v3_vit_large_patch16_reg4_dinov2

  • c_radio_v3_vit_base_patch16_reg4_dinov2

  • c_radio_v3_vit_huge_patch16_reg4_dinov2

  • mit_b0

  • mit_b1

  • mit_b2

  • mit_b3

  • mit_b4

  • mit_b5

  • vit_l_14_siglip_clipa_224

  • vit_l_14_siglip_clipa_336

  • vit_h_14_siglip_clipa_224

dino

  • resnet_34

  • resnet_50

  • fan_tiny

  • fan_small

  • fan_base

  • fan_large

  • swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224)

  • swin_base_224_22k (alias: swin_base_patch4_window7_224)

  • swin_base_384_22k (alias: swin_base_patch4_window12_384)

  • swin_large_224_22k (alias: swin_large_patch4_window7_224)

  • swin_large_384_22k (alias: swin_large_patch4_window12_384)

  • efficientvit_b0

  • efficientvit_b1

  • efficientvit_b2

  • efficientvit_b3

  • vit_large_nvdinov2

  • vit_large_dinov2

mal

  • ViT family (arch strings with vit; patch sizes 8/14/16; sizes tiny/small/base/large/huge)

  • fan_tiny_12_p16_224

  • fan_small_12_p16_224

  • fan_base_18_p16_224

  • fan_large_24_p16_224

  • fan_tiny_8_p4_hybrid

  • fan_small_12_p4_hybrid

  • fan_base_16_p4_hybrid

  • fan_large_16_p4_hybrid

grounding_dino

  • resnet_50

  • swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224)

  • swin_base_224_22k (alias: swin_base_patch4_window7_224)

  • swin_base_384_22k (alias: swin_base_patch4_window12_384)

  • swin_large_224_22k (alias: swin_large_patch4_window7_224)

  • swin_large_384_22k (alias: swin_large_patch4_window12_384)

mask_grounding_dino

  • resnet_50

  • swin_tiny_224_1k (alias: swin_tiny_patch4_window7_224)

  • swin_base_224_22k (alias: swin_base_patch4_window7_224)

  • swin_base_384_22k (alias: swin_base_patch4_window12_384)

  • swin_large_224_22k (alias: swin_large_patch4_window7_224)

  • swin_large_384_22k (alias: swin_large_patch4_window12_384)

rtdetr

  • resnet_18

  • resnet_34

  • resnet_50

  • resnet_101

  • convnext_tiny

  • convnext_small

  • convnext_base

  • convnext_large

  • convnext_xlarge

  • convnextv2_atto

  • convnextv2_femto

  • convnextv2_pico

  • convnextv2_nano

  • convnextv2_tiny

  • convnextv2_small

  • convnextv2_base

  • convnextv2_large

  • convnextv2_huge

  • efficientvit_b0

  • efficientvit_b1

  • efficientvit_b2

  • efficientvit_b3

  • efficientvit_l0

  • efficientvit_l1

  • efficientvit_l2

  • efficientvit_l3

  • fan_tiny_8_p4_hybrid

  • fan_small_12_p4_hybrid

  • fan_base_12_p4_hybrid

  • fan_large_12_p4_hybrid

  • edgenext_x_small

  • edgenext_small

  • edgenext_base

  • edgenext_xx_small_bn_hs

  • edgenext_x_small_bn_hs

  • edgenext_small_bn_hs

segformer

  • fan_tiny_8_p4_hybrid

  • fan_small_12_p4_hybrid

  • fan_base_16_p4_hybrid

  • fan_large_16_p4_hybrid

  • mit_b0

  • mit_b1

  • mit_b2

  • mit_b3

  • mit_b4

  • mit_b5

  • vit_large_nvdinov2

  • vit_giant_nvdinov2

  • vit_base_nvclip_16_siglip

  • vit_huge_nvclip_14_sig

  • c_radio_v2_vit_huge_patch16_224

  • c_radio_v2_vit_large_patch16_224

  • c_radio_v2_vit_base_patch16_224

  • c_radio_v3_vit_large_patch16_reg4_dinov2

visual_changenet

  • fan_tiny_8_p4_hybrid

  • fan_small_12_p4_hybrid

  • fan_base_16_p4_hybrid

  • fan_large_16_p4_hybrid

  • vit_large_nvdinov2

  • vit_large_dinov2

  • c_radio_p1_vit_huge_patch16_224_mlpnorm

  • c_radio_p2_vit_huge_patch16_224_mlpnorm

  • c_radio_p3_vit_huge_patch16_224_mlpnorm

  • c_radio_v2_vit_huge_patch16_224

  • c_radio_v2_vit_large_patch16_224

  • c_radio_v2_vit_base_patch16_224

Note

When using a downstream model as the teacher, make sure to set num_classes to 0 and mode to spatial in the distill config.

For more information on distillation for the specific tasks, please refer to the documentation under the distillation section for that network.

  • rtdetr

  • classification_pyt

  • dino

Multi-Teacher Distillation (RADIO)

Note

The RADIO multi-teacher distillation pipeline is experimental in TAO 7.0.1. Config field names and defaults may change in future releases.

RADIO (Reduce All Domains Into One) is an agglomerative distillation pipeline: a single student backbone is trained to match, simultaneously, the representations of several foundation-model teachers (for example, DINOv2/DINOv3, CLIP/SigLIP, and SAM). The resulting student inherits the complementary strengths of each teacher in one set of weights, which can then be used as a unified backbone for downstream tasks.

The pipeline is exposed through the radio console command, with the single subtask distill:

radio distill -e /path/to/distill.yaml results_dir=/path/to/output

Unlike the single-teacher distillation in the other tasks, the distill.teacher field accepts a list of teacher configurations. Each teacher can declare its own loss, mode, input resolution, and image normalization, while the top-level distill fields provide the global defaults that are used whenever a per-teacher field is left unset.

Data Pipeline and the Teacher Mixture

RADIO trains over large, unlabeled image collections supplied as WebDataset-style tar shards. Each entry in dataset.train_dataset.tar_data_sources is a separate data source with its own root_dir, samples_per_file, steps_per_epoch, and scale_factor. The scale_factor values act as unnormalized sampling weights: a deterministic deficit-round-robin scheduler (the mixture emitter) draws from each source so that, within any short window, the realized proportions closely track the requested weights. This lets you blend heterogeneous corpora at fixed ratios without manually interleaving the shards.
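The weighting behavior described above can be illustrated with a small credit-based scheduler. This is a sketch in the spirit of deficit round-robin and not TAO’s actual mixture emitter; the function name mixture_schedule and the credit update rule are assumptions made for illustration.

```python
def mixture_schedule(weights, num_draws):
    """Return a deterministic sequence of source indices that tracks the weights."""
    total = sum(weights)
    credit = [0.0] * len(weights)
    order = []
    for _ in range(num_draws):
        # Every source earns credit in proportion to its normalized weight.
        for i, w in enumerate(weights):
            credit[i] += w / total
        # Draw from the source that is owed the most, then charge it one draw.
        pick = max(range(len(weights)), key=lambda i: credit[i])
        credit[pick] -= 1.0
        order.append(pick)
    return order


# Two sources with scale_factor values 2.0 and 1.0.
order = mixture_schedule([2.0, 1.0], 6)
print(order)  # [0, 1, 0, 0, 1, 0]
```

In the six draws shown, every window of three consecutive draws contains two samples from the first source and one from the second, so the 2:1 ratio holds over short windows and not only on average.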

Preparing the tar shards

Each root_dir is scanned recursively for *.tar files (the glob {root_dir}/**/*.tar), and every tar file is a standard WebDataset shard. Within a shard, files are grouped into samples by their basename (the part of the filename before the first dot); all files that share a basename form one sample, and the basename becomes the sample’s __key__. A minimal sample is a single image file. The recognized image extensions are jpg, jpeg, png, gif, webp, bmp, tiff, and img; samples that contain none of these are skipped. RADIO is self-supervised, so no label or class files are required in the shards.

A typical shard listing therefore looks like:

shard-000000.tar
  ├── 0000000.jpg
  ├── 0000001.jpg
  ├── 0000002.jpg
  └── ...
shard-000001.tar
  └── ...
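The grouping rule can be sketched in a few lines of Python. The helper group_samples is hypothetical and only mirrors the rule stated above: it groups by the text before the first dot of the filename, keeps any directory prefix as part of the key (an assumption), and drops samples without a recognized image extension. Unlike a streaming tar reader, it does not require files of one sample to be consecutive.

```python
import posixpath

# Image extensions listed in this section.
IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "webp", "bmp", "tiff", "img"}


def group_samples(member_names):
    """Group tar member names into samples keyed by basename."""
    samples = {}
    for name in member_names:
        directory, filename = posixpath.split(name)
        # The basename is everything before the first dot of the filename.
        stem, _, ext = filename.partition(".")
        key = posixpath.join(directory, stem)
        samples.setdefault(key, {})[ext.lower()] = name
    # Keep only samples that contain at least one recognized image file.
    return {
        key: files
        for key, files in samples.items()
        if any(ext.split(".")[-1] in IMAGE_EXTS for ext in files)
    }


members = ["0000000.jpg", "0000000.json", "0000001.txt", "0000002.png"]
print(list(group_samples(members)))  # ['0000000', '0000002']
```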

An optional per-sample .json sidecar (for example 0000000.json) is read for deduplication and captioning metadata only: a sha256, md5, or uid field is used as the sample’s de-duplication key (otherwise a hash of the image bytes is used), and a caption field, if present, is exposed as text. These fields are optional and do not affect distillation targets.
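A minimal sketch of the de-duplication key selection follows. The function name dedup_key, the lookup order sha256, md5, uid, and the use of SHA-256 for the fallback hash are assumptions; the text above only states which fields are consulted and that a hash of the image bytes is the fallback.

```python
import hashlib
import json


def dedup_key(image_bytes, sidecar_json=None):
    """Pick a de-duplication key from the sidecar, else hash the image bytes."""
    meta = json.loads(sidecar_json) if sidecar_json else {}
    # Assumed priority order; the first non-empty field wins.
    for field in ("sha256", "md5", "uid"):
        if meta.get(field):
            return str(meta[field])
    # Fallback: content hash of the raw image bytes (hash choice is assumed).
    return hashlib.sha256(image_bytes).hexdigest()


key_from_sidecar = dedup_key(b"image-bytes", '{"md5": "abc123", "caption": "a cat"}')
key_from_bytes = dedup_key(b"image-bytes")
```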

Shards can be built with the webdataset Python package or with the tar command, as long as files belonging to one sample are written consecutively and share a basename. For example, to pack a directory of JPEGs into 10,000-image shards with the WebDataset writer:

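import glob

import webdataset as wds

# Collect every JPEG under the raw image directory in a stable order.
paths = sorted(glob.glob("/raw/images/**/*.jpg", recursive=True))

# ShardWriter starts a new tar shard after every maxcount samples.
with wds.ShardWriter("/datasets/corpus_a/shard-%06d.tar", maxcount=10000) as sink:
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            # The zero-padded index is the basename that identifies the sample.
            sink.write({"__key__": f"{i:07d}", "jpg": f.read()})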

Set samples_per_file to the number of samples written per shard (10000 above) and steps_per_epoch to the number of training steps to draw from that source per epoch. steps_per_epoch defines the epoch length (it does not have to equal the true sample count); together with the global batch size it controls how many samples a source contributes each epoch, independent of scale_factor, which controls the relative mixing ratio between sources.
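The relationship between steps_per_epoch and the number of samples drawn can be worked through with a short calculation. The sketch assumes that dataset.batch_size is a per-GPU value, so the global batch size is batch_size multiplied by the number of GPUs; check this against your TAO version before relying on it.

```python
def samples_per_epoch(steps_per_epoch, batch_size_per_gpu, num_gpus):
    """Samples one source contributes per epoch (scale_factor plays no part)."""
    global_batch_size = batch_size_per_gpu * num_gpus  # assumed definition
    return steps_per_epoch * global_batch_size


# 10,000 steps with 32 images per GPU on a single GPU.
print(samples_per_epoch(10000, 32, 1))  # 320000
```

Because scale_factor does not appear in the calculation, changing the mixing ratio between sources leaves each source’s per-epoch contribution unchanged.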

An optional native_resolution_filter drops samples whose decoded native resolution is too small or too extreme before resize and crop, so that high-resolution crop training does not upsample low-resolution originals. The filter exposes four independent thresholds, each applied only when set: min_short_side, min_long_side, and min_area reject images that are too small, while max_aspect_ratio rejects images that are too elongated (long side divided by short side). There are no maximum short-side, long-side, or area limits and no minimum aspect-ratio limit.
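The four thresholds can be read as a simple predicate over an image’s native width and height. The following sketch uses a hypothetical function name and assumes that a value exactly equal to a threshold is accepted; it is meant to illustrate the rule, not to reproduce TAO’s code.

```python
def passes_native_resolution_filter(width, height, min_short_side=None,
                                    min_long_side=None, min_area=None,
                                    max_aspect_ratio=None):
    """Return True if an image of the given native size should be kept."""
    short_side, long_side = min(width, height), max(width, height)
    # Each threshold is applied only when it is set (not None).
    if min_short_side is not None and short_side < min_short_side:
        return False
    if min_long_side is not None and long_side < min_long_side:
        return False
    if min_area is not None and width * height < min_area:
        return False
    # Aspect ratio is the long side divided by the short side.
    if max_aspect_ratio is not None and long_side / short_side > max_aspect_ratio:
        return False
    return True


print(passes_native_resolution_filter(640, 480, min_short_side=224))   # True
print(passes_native_resolution_filter(200, 1000, min_short_side=224))  # False
```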

Key Configuration Fields

The student backbone is configured under model.backbone and the unlabeled data under dataset, exactly as in the other distillation tasks. The multi-teacher behavior is controlled by the distill section:

  • distill.teacher: A list of teacher configurations. Each entry distills into the student in parallel.

  • distill.loss_type: Global distillation loss used when a teacher does not set its own. One of KL, CE, L1, L2, FD, CS, BALANCED, MSE.

  • distill.loss_lambda: Global weight applied to the distillation loss (> 0.0 and <= 1.0).

  • distill.mode: Global distillation mode. One of logits, summary, spatial, auto, or combo.

  • distill.use_mlp, distill.mlp_hidden_size, distill.mlp_num_inner: Configure the MLP projection head that maps student features into each teacher’s space.

  • distill.vitdet: Optional ViTDet windowed-attention augmentation applied to the student during training. Set prob > 0 and provide window_sizes to enable; acts as a regularizer and can reduce training memory.

Each entry in distill.teacher accepts the following per-teacher fields (any field left unset falls back to the corresponding global distill value):

  • model.backbone.type: Teacher backbone architecture (for example a c_radio_*, DINOv2, or CLIP/SigLIP variant).

  • pretrained_teacher_model_path: Path to a local checkpoint holding the teacher’s pre-trained weights. This is a filesystem path inside the container, not a Hugging Face model id or URL; the file is loaded with torch.load (or read directly if it ends in .safetensors). See the note below on obtaining these weights.

  • loss_type / loss_lambda: Per-teacher overrides for this teacher’s distillation loss and its weight.

  • mode: Per-teacher distillation mode (logits, summary, spatial, auto, combo).

  • input_size / patch_size: Teacher input resolution and ViT patch size.

  • match_student_resolution: If true, resize teacher inputs to match the student resolution.

  • stochastic_resolutions: Map from resolution to probability, used for per-sample stochastic input resizing.

  • norm_mean / norm_std: Per-teacher image normalization (for example ImageNet/DINOv3 vs. SigLIP/SAM values). If empty, the dataset augmentation mean/std are used.

  • summary_loss_weight / summary_loss_type / fd_loss_weight: In combo mode, weight the summary (CLS) loss and the spatial/feature loss separately.

  • spatial_mlp_version / spatial_num_inner: Spatial projection head type (v2 or attn) and optional inner-block override.

  • adaptor / upsampler_checkpoint / do_upsample: Optionally wrap the teacher with a pre-trained FeatSharp upsampler (adaptor: featsharp) to produce high-resolution spatial targets.
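The fallback from per-teacher fields to the global distill values can be pictured as a small merge step. The helper resolve_teacher and the set of inherited keys shown here are illustrative assumptions; TAO resolves its configuration internally, and the exact list of inherited fields may differ.

```python
# Keys assumed to be inheritable from the global distill section.
INHERITED_KEYS = ("loss_type", "loss_lambda", "mode")


def resolve_teacher(global_cfg, teacher_cfg):
    """Fill unset per-teacher fields from the global distill defaults."""
    resolved = dict(teacher_cfg)
    for key in INHERITED_KEYS:
        if resolved.get(key) is None:
            resolved[key] = global_cfg[key]
    return resolved


global_cfg = {"loss_type": "BALANCED", "loss_lambda": 0.5, "mode": "auto"}
teacher_cfg = {"mode": "spatial", "input_size": 432}

print(resolve_teacher(global_cfg, teacher_cfg))
# {'mode': 'spatial', 'input_size': 432, 'loss_type': 'BALANCED', 'loss_lambda': 0.5}
```

Here the teacher keeps its own mode while inheriting loss_type and loss_lambda from the global section.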

Note

pretrained_teacher_model_path (and upsampler_checkpoint) must point at local checkpoint files that you have downloaded ahead of time and mounted into the container; the pipeline does not fetch weights from a model hub at runtime. Obtain the teacher weights from the foundation model’s own distribution (for example a Hugging Face repository for DINOv2/DINOv3 or SigLIP, or NGC for the C-RADIO backbones), convert them to a PyTorch state_dict checkpoint (a .pth/.pt file, or a .safetensors file), and reference that local file. The paths in the example below, such as /weights/teacher_dinov2.pth, are placeholders for such pre-downloaded checkpoints; substitute the paths to your own files.

Example Invocation

The following spec distills a single student backbone from two teachers, blending two tar data sources at a 2:1 ratio:

results_dir: /workspace/radio_multiteacher
model:
  backbone:
    type: c_radio_v3_vit_base_patch16_reg4_dinov2
dataset:
  num_classes: 0
  img_size: 224
  batch_size: 32
  workers: 8
  train_dataset:
    tar_data_sources:
      - root_dir: /datasets/corpus_a
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 2.0
      - root_dir: /datasets/corpus_b
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 1.0
    data_weight_mode: inv_frequency
    native_resolution_filter:
      min_short_side: 224
train:
  num_gpus: 1
  num_epochs: 50
  precision: bf16
  optim:
    optim: adamw
    lr: 0.00006
    policy: cosine
distill:
  loss_type: BALANCED
  loss_lambda: 0.5
  mode: auto
  teacher:
    - model:
        backbone:
          type: c_radio_v3_vit_huge_patch16_reg4_dinov2
      pretrained_teacher_model_path: /weights/teacher_dinov2.pth
      mode: spatial
      input_size: 432
      norm_mean: [0.485, 0.456, 0.406]
      norm_std: [0.229, 0.224, 0.225]
    - model:
        backbone:
          type: vit_h_14_siglip_clipa_224
      pretrained_teacher_model_path: /weights/teacher_siglip.pth
      mode: summary
      norm_mean: [0.5, 0.5, 0.5]
      norm_std: [0.5, 0.5, 0.5]

Run the distillation with:

radio distill -e /workspace/specs/distill.yaml results_dir=/workspace/radio_multiteacher