Knowledge Distillation#
All deep neural network tasks supported by TAO provide a train command for training models.
Training can run on one or more GPUs.
Knowledge distillation is a model compression technique in which a smaller, lightweight student model is trained to replicate the behavior of a larger, high-performing teacher model. By transferring knowledge from the teacher to the student, this approach enables efficient deployment of models in resource-constrained environments without a significant loss in accuracy.
The student model learns not only from the ground-truth labels but also from soft targets: the output probabilities that the teacher computes from its logits. These soft targets capture the teacher’s learned representations and subtle inter-class relationships, which can help the student generalize better than if it were trained on labeled data alone.
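As a rough illustration of the soft-target idea, the sketch below blends a hard-label cross-entropy term with a KL-divergence term on temperature-softened outputs. The `temperature` and `alpha` parameters and the function names are assumptions made for this example; they are not TAO config fields, and TAO's own loss may be formulated differently.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    `temperature` and `alpha` are illustrative knobs, not TAO parameters.
    """
    # Hard-label term: cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL(teacher || student) on softened probabilities.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student's logits equal the teacher's, the soft term vanishes and only the weighted hard-label term remains.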
In addition to output-based distillation (using logits), feature distillation is another common strategy, in which the student is encouraged to match intermediate feature representations from the teacher. This allows the student to learn richer internal representations, often leading to improved performance on complex tasks.
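Feature distillation can be pictured as a regression problem on intermediate activations. The sketch below assumes a single linear projection that maps the student's feature vector to the teacher's width, followed by a mean-squared-error penalty. Both choices, and all names in the code, are assumptions for illustration rather than the projection head or loss that TAO uses.

```python
def project(features, weights):
    """Linear projection of a student feature vector to the teacher's width."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def feature_distillation_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_features, weights)
    if len(projected) != len(teacher_features):
        raise ValueError("projection output must match the teacher width")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(squared)

# Toy example: a 2-dim student feature projected to a 3-dim teacher feature.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = feature_distillation_loss([0.5, -1.0], [0.5, -1.0, 0.0], weights)
```

In training, the projection weights would be learned jointly with the student so that the student is not forced to share the teacher's feature dimensionality.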
Knowledge distillation is commonly used in scenarios where fast inference, low memory usage, or deployment on edge devices is critical.
Tips and Best Practices#
When applying knowledge distillation in practice:
Given a downstream task, we recommend that you plug in the teacher backbone and fine-tune it on the downstream data first. If the model performs well with the teacher, use the fine-tuned teacher to distill a student model that fits your compute budget.
If the teacher is ViT-based and the student is ConvNet-based, the student may struggle to learn from the teacher. ViT-to-ViT distillation, or distillation from a ConvNet teacher into either a ConvNet or a ViT student, generally yields better results. In other words, if the student must be a ConvNet, it’s better to use a ConvNet teacher.
If the student is ViT-based, consider starting with RADIO models as teachers. For image or video classification tasks, CLIP models may be more effective. For instance-level recognition or segmentation, MAE, ConvNeXtV2, or DINOv2 are strong candidates.
Choose the student model architecture based on your target compute budget. Keep in mind that smaller student models often require more training data to optimize effectively.
If training data is limited, try increasing the number of training epochs and applying more aggressive data augmentations to improve generalization.
TAO now supports knowledge distillation for several networks:
Feature distillation for object detection with RT-DETR
Backbone logits distillation over structured and unstructured data for image classification
Logits distillation for object detection with DINO
As of 6.25.09, TAO introduces spatial feature distillation and Phi-Standardization (PHI-S) in the distillation loss. PHI-S standardizes the teacher model's feature maps to improve distillation performance.
TAO has also unified the backbone implementation for classification_pyt and all the downstream tasks. This allows
the backbone of a teacher trained on a downstream task to be distilled into a lighter student backbone supported by that task.
When choosing the student backbone to distill to, make sure the downstream task supports it.
The following is an exhaustive list of options for distill.teacher.backbone.type:
| Module | Supported backbones |
|---|---|
| classification_pyt | |
| dino | |
| mal | |
| grounding_dino | |
| mask_grounding_dino | |
| rtdetr | |
| segformer | |
| visual_changenet | |
Note
When using a downstream model as the teacher, make sure to set num_classes to 0 and mode to
spatial in the distill config.
For more information on distillation for the specific tasks, please refer to the documentation under the
distillation section for that network.
rtdetr
classification_pyt
dino
Multi-Teacher Distillation (RADIO)#
Note
The RADIO multi-teacher distillation pipeline is experimental in TAO 7.0.1. Config field names and defaults may change in future releases.
RADIO (Reduce All Domains Into One) is an agglomerative distillation pipeline: a single student backbone is trained to match, simultaneously, the representations of several foundation-model teachers (for example, DINOv2/DINOv3, CLIP/SigLIP, and SAM). The resulting student inherits the complementary strengths of each teacher in one set of weights, which can then be used as a unified backbone for downstream tasks.
The pipeline is exposed through the radio console command, with the single subtask
distill:
radio distill -e /path/to/distill.yaml results_dir=/path/to/output
Unlike the single-teacher distillation in the other tasks, the distill.teacher field
accepts a list of teacher configurations. Each teacher can declare its own loss, mode,
input resolution, and image normalization, while the top-level distill fields provide
the global defaults that are used whenever a per-teacher field is left unset.
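The fallback rule can be sketched as a small merge step. The example assumes that the per-teacher override keys share the names of the global fields (`loss_type`, `loss_lambda`, `mode`) and uses a hypothetical `name` key to label the teachers; treat both as illustrative assumptions, not as the pipeline's actual schema or resolution code.

```python
# Global fields that a teacher entry may override (names are assumed).
GLOBAL_KEYS = ("loss_type", "loss_lambda", "mode")

def resolve_teacher(distill_cfg, teacher_cfg):
    """Return a teacher config with unset fields filled from global defaults."""
    resolved = dict(teacher_cfg)
    for key in GLOBAL_KEYS:
        if resolved.get(key) is None:
            resolved[key] = distill_cfg.get(key)
    return resolved

distill = {
    "loss_type": "BALANCED",
    "loss_lambda": 0.5,
    "mode": "auto",
    "teacher": [
        {"name": "teacher_a", "mode": "spatial"},    # overrides mode only
        {"name": "teacher_b", "loss_lambda": 1.0},   # overrides weight only
    ],
}
resolved = [resolve_teacher(distill, t) for t in distill["teacher"]]
```

Here `teacher_a` keeps its own `mode` but inherits the global loss settings, while `teacher_b` inherits the global `mode` and loss type but keeps its own weight.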
Data Pipeline and the Teacher Mixture#
RADIO trains over large, unlabeled image collections supplied as WebDataset-style tar shards.
Each entry in dataset.train_dataset.tar_data_sources is a separate data source with its
own root_dir, samples_per_file, steps_per_epoch, and scale_factor.
The scale_factor values act as unnormalized sampling weights: a deterministic
deficit-round-robin scheduler (the mixture emitter) draws from each source so that, within any
short window, the realized proportions closely track the requested weights. This lets you blend
heterogeneous corpora at fixed ratios without manually interleaving the shards.
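To make the scheduling idea concrete, here is a minimal credit-based weighted round-robin in the spirit of deficit round-robin. It is a sketch written for this page under the assumption that each draw selects one source; it is not the mixture emitter's actual implementation, and the function name is hypothetical.

```python
def mixture_schedule(weights, num_draws):
    """Yield source indices so that running proportions track the weights."""
    total = sum(weights)
    credit = [0.0] * len(weights)
    for _ in range(num_draws):
        # Every source earns credit equal to its (unnormalized) weight.
        for i, w in enumerate(weights):
            credit[i] += w
        # Serve the source that is owed the most; ties go to the lowest index.
        chosen = max(range(len(weights)), key=lambda i: (credit[i], -i))
        # Serving a source costs it one full round of credit.
        credit[chosen] -= total
        yield chosen

# Two sources with scale_factor values 2.0 and 1.0.
order = list(mixture_schedule([2.0, 1.0], 6))
```

With weights 2.0 and 1.0 the first six draws are 0, 1, 0, 0, 1, 0, so every window of three draws contains two samples from the first source and one from the second. Because there is no randomness, repeating the call reproduces the same order.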
Preparing the tar shards#
Each root_dir is scanned recursively for *.tar files (the glob
{root_dir}/**/*.tar), and every tar file is a standard WebDataset shard. Within a shard,
files are grouped into samples by their basename (the part of the filename before the first
dot); all files that share a basename form one sample, and the basename becomes the sample’s
__key__. A minimal sample is a single image file. The recognized image extensions are
jpg, jpeg, png, gif, webp, bmp, tiff, and
img; samples that contain none of these are skipped. RADIO is self-supervised, so no
label or class files are required in the shards.
A typical shard listing therefore looks like:
shard-000000.tar
├── 0000000.jpg
├── 0000001.jpg
├── 0000002.jpg
└── ...
shard-000001.tar
└── ...
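The grouping rule can be sketched in a few lines of Python. The helper below takes a list of tar member names, splits each at the first dot of its filename, and keeps only the samples that contain a recognized image extension. Keeping any directory prefix as part of the key is an assumption borrowed from the usual WebDataset convention, and the function names are hypothetical.

```python
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "webp", "bmp", "tiff", "img"}

def split_name(name):
    """Split a tar member name into (sample key, lower-cased extension)."""
    directory, _, filename = name.rpartition("/")
    # The basename is everything before the first dot of the filename.
    base, _, extension = filename.partition(".")
    key = f"{directory}/{base}" if directory else base
    return key, extension.lower()

def group_samples(member_names):
    """Group member names into samples and drop samples without an image."""
    samples = {}
    for name in member_names:
        key, extension = split_name(name)
        samples.setdefault(key, {})[extension] = name
    return {
        key: files for key, files in samples.items()
        if any(ext in IMAGE_EXTENSIONS for ext in files)
    }

grouped = group_samples(
    ["0000000.jpg", "0000000.json", "0000001.txt", "0000002.PNG"]
)
```

In this example `0000000` becomes one sample with an image and a sidecar, `0000002` becomes an image-only sample, and `0000001` is skipped because it has no image file.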
An optional per-sample .json sidecar (for example 0000000.json) is read for
deduplication and captioning metadata only: a sha256, md5, or uid field
is used as the sample’s de-duplication key (otherwise a hash of the image bytes is used), and a
caption field, if present, is exposed as text. These fields are optional and do not affect
distillation targets.
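The key selection described above might look like the following sketch. The lookup order (`sha256`, then `md5`, then `uid`) and the use of SHA-256 for the fallback hash are assumptions; the text only states that one of these fields is used when present and that a hash of the image bytes is used otherwise.

```python
import hashlib
import json

def dedup_key(image_bytes, sidecar_bytes=None):
    """Pick a de-duplication key from the sidecar, else hash the image bytes."""
    if sidecar_bytes:
        metadata = json.loads(sidecar_bytes)
        # Assumed priority order; the first non-empty field wins.
        for field in ("sha256", "md5", "uid"):
            if metadata.get(field):
                return str(metadata[field])
    # Fallback hash; SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(image_bytes).hexdigest()
```

A sidecar that carries only a `caption` field therefore falls through to the image-bytes hash.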
Shards can be built with the webdataset Python package or with the tar command,
as long as files belonging to one sample are written consecutively and share a basename. For
example, to pack a directory of JPEGs into 10,000-image shards with the WebDataset writer:
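import glob
import webdataset as wds

# Collect every JPEG under the raw image directory in a stable order.
paths = sorted(glob.glob("/raw/images/**/*.jpg", recursive=True))
# ShardWriter rolls over to a new tar file after `maxcount` samples.
with wds.ShardWriter("/datasets/corpus_a/shard-%06d.tar", maxcount=10000) as sink:
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            # The zero-padded index becomes the sample's basename (__key__).
            sink.write({"__key__": f"{i:07d}", "jpg": f.read()})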
Set samples_per_file to the number of samples written per shard (10000 above) and
steps_per_epoch to the number of training steps to draw from that source per epoch.
steps_per_epoch defines the epoch length (it does not have to equal the true sample count);
together with the global batch size it controls how many samples a source contributes each epoch,
independent of scale_factor, which controls the relative mixing ratio between sources.
An optional native_resolution_filter drops samples whose decoded native resolution is too
small or too extreme before resize and crop, so that high-resolution crop training does not
upsample low-resolution originals. The filter exposes four independent thresholds, each applied
only when set: min_short_side, min_long_side, and min_area reject images
that are too small, while max_aspect_ratio rejects images that are too elongated
(long side divided by short side). There are no maximum short-side, long-side, or area limits and
no minimum aspect-ratio limit.
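The four thresholds can be read as independent checks on the decoded image size. The predicate below is a sketch of that logic; the function name, the treatment of values exactly at a threshold as passing, and the rejection of zero-sized images are assumptions rather than documented behavior.

```python
def passes_native_resolution_filter(width, height, min_short_side=None,
                                    min_long_side=None, min_area=None,
                                    max_aspect_ratio=None):
    """Return True if an image of the given native size should be kept."""
    short_side, long_side = min(width, height), max(width, height)
    if short_side <= 0:
        return False  # assumed: degenerate images are always dropped
    # Each threshold is applied only when it is set.
    if min_short_side is not None and short_side < min_short_side:
        return False
    if min_long_side is not None and long_side < min_long_side:
        return False
    if min_area is not None and width * height < min_area:
        return False
    # Aspect ratio is the long side divided by the short side.
    if max_aspect_ratio is not None and long_side / short_side > max_aspect_ratio:
        return False
    return True
```

For example, a 640x480 image passes `min_short_side=224`, while a 300x1500 image is rejected by `max_aspect_ratio=4.0` because its ratio is 5.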
Key Configuration Fields#
The student backbone is configured under model.backbone and the unlabeled data under
dataset, exactly as in the other distillation tasks. The multi-teacher behavior is
controlled by the distill section:
| Field | Description |
|---|---|
| | A list of teacher configurations. Each entry distills into the student in parallel. |
| | Global distillation loss used when a teacher does not set its own. One of |
| | Global weight applied to the distillation loss ( |
| | Global distillation mode: |
| | Configure the MLP projection head that maps student features into each teacher’s space. |
| | Optional ViTDet windowed-attention augmentation applied to the student during training. Set |
Each entry in distill.teacher accepts the following per-teacher fields (any field left
unset falls back to the corresponding global distill value):
| Per-teacher field | Description |
|---|---|
| | Teacher backbone architecture (for example a |
| | Path to a local checkpoint holding the teacher’s pre-trained weights. This is a filesystem path inside the container, not a Hugging Face model id or URL; the file is loaded with |
| | Per-teacher overrides for this teacher’s distillation loss and its weight. |
| | Per-teacher distillation mode ( |
| | Teacher input resolution and ViT patch size. |
| | If true, resize teacher inputs to match the student resolution. |
| | Map of |
| | Per-teacher image normalization (for example ImageNet/DINOv3 vs. SigLIP/SAM values). If empty, the dataset augmentation mean/std are used. |
| | In |
| | Spatial projection head type ( |
| | Optionally wrap the teacher with a pre-trained FeatSharp upsampler ( |
Note
pretrained_teacher_model_path (and upsampler_checkpoint) must point at local
checkpoint files that you have downloaded ahead of time and mounted into the container; the
pipeline does not fetch weights from a model hub at runtime. Obtain the teacher weights from
the foundation model’s own distribution (for example a Hugging Face repository for DINOv2/DINOv3
or SigLIP, or NGC for the C-RADIO backbones), convert them to a PyTorch state_dict
checkpoint (a .pth/.pt file, or a .safetensors file), and reference that
local file. The example paths below (for example /weights/teacher_dinov2.pth) are
placeholders for such pre-downloaded checkpoints — substitute the paths to your own files.
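Because the weights are never fetched at runtime, it can help to check the paths before launching a long job. The helper below is a hypothetical pre-flight check, not part of TAO. It assumes that each teacher entry is a dictionary and that `upsampler_checkpoint` sits at the same level as `pretrained_teacher_model_path`.

```python
import os

CHECKPOINT_SUFFIXES = (".pth", ".pt", ".safetensors")
PATH_FIELDS = ("pretrained_teacher_model_path", "upsampler_checkpoint")

def find_checkpoint_problems(teachers):
    """Return a list of human-readable problems with teacher checkpoint paths."""
    problems = []
    for index, teacher in enumerate(teachers):
        for field in PATH_FIELDS:
            path = teacher.get(field)
            if not path:
                continue  # field left unset, nothing to check
            label = f"teacher[{index}].{field}"
            if "://" in path:
                problems.append(f"{label}: looks like a URL, not a local file")
            elif not path.endswith(CHECKPOINT_SUFFIXES):
                problems.append(f"{label}: unexpected checkpoint extension")
            elif not os.path.isfile(path):
                problems.append(f"{label}: file not found")
    return problems
```

An empty result means every configured path points to an existing local file with one of the expected extensions. It does not verify that the file actually contains a compatible state_dict.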
Example Invocation#
The following spec distills a single student backbone from two teachers, blending two tar data sources at a 2:1 ratio:
results_dir: /workspace/radio_multiteacher
model:
  backbone:
    type: c_radio_v3_vit_base_patch16_reg4_dinov2
dataset:
  num_classes: 0
  img_size: 224
  batch_size: 32
  workers: 8
  train_dataset:
    tar_data_sources:
      - root_dir: /datasets/corpus_a
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 2.0
      - root_dir: /datasets/corpus_b
        samples_per_file: 10000
        steps_per_epoch: 10000
        scale_factor: 1.0
    data_weight_mode: inv_frequency
    native_resolution_filter:
      min_short_side: 224
train:
  num_gpus: 1
  num_epochs: 50
  precision: bf16
  optim:
    optim: adamw
    lr: 0.00006
    policy: cosine
distill:
  loss_type: BALANCED
  loss_lambda: 0.5
  mode: auto
  teacher:
    - model:
        backbone:
          type: c_radio_v3_vit_huge_patch16_reg4_dinov2
      pretrained_teacher_model_path: /weights/teacher_dinov2.pth
      mode: spatial
      input_size: 432
      norm_mean: [0.485, 0.456, 0.406]
      norm_std: [0.229, 0.224, 0.225]
    - model:
        backbone:
          type: vit_h_14_siglip_clipa_224
      pretrained_teacher_model_path: /weights/teacher_siglip.pth
      mode: summary
      norm_mean: [0.5, 0.5, 0.5]
      norm_std: [0.5, 0.5, 0.5]
Run the distillation with:
radio distill -e /workspace/specs/distill.yaml results_dir=/workspace/radio_multiteacher