Foundation Models#
Foundation models are large-scale pretrained models that serve as the backbone for a wide range of computer vision and multi-modal applications. These models are often trained with self-supervised or semi-supervised algorithms on large-scale datasets. The main goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.
Early examples of foundation models were pretrained language models (LMs), including Google’s BERT and early GPT (Generative Pre-trained Transformer) models, notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be adapted into task-specific and domain-specific models using targeted datasets of various kinds, for example, medical codes.
Having started as models primarily for text and language applications, foundation models have evolved to support computer vision and multi-modal applications, as in DALL-E and Flamingo.
TAO v6.0.0 introduces the ability to:
Pretrain foundation models or adapt them to your domain, given a corpus of unstructured data
Fine-tune and use them for downstream computer vision tasks, such as:
Image classification
Object detection
Semantic segmentation
Change detection
The NVIDIA-trained foundation models supported for domain adaptation and downstream finetuning in TAO include:
C-RADIO (NVIDIA’s “C”ommercial RADIO, AGM: Reduce All Domains Into One) is a family of agglomerative vision foundation models that distill multiple teacher models into a single backbone, yielding general-purpose visual features suitable for both classification and dense-prediction tasks.
Note
The C-RADIOv3 and C-RADIOv4 links above point to specific Hugging Face variants: C-RADIOv3-B (the base variant) and C-RADIOv4-H (the huge variant). TAO registers additional sizes for C-RADIOv3 (base, large, huge), but the exact set selectable for any given downstream task is restricted as noted in the per-task tables below.
Which C-RADIO version should I pick?
C-RADIOv2 has the broadest task coverage in TAO: it is the only C-RADIO family selectable for every downstream task on this page (classification, object detection, semantic segmentation, and change detection).
C-RADIOv3 is the newer generation and is recommended for classification and RT-DETR object detection, where the full base/large/huge size range is available. For SegFormer semantic segmentation, only the large variant is supported. C-RADIOv3 is not available for Visual ChangeNet change detection. C-RADIOv3 also powers RADIO-CLIP (contrastive vision-language embeddings): it is the only C-RADIO generation usable with the CLIP task — neither C-RADIOv2 nor C-RADIOv4 is selectable for CLIP.
C-RADIOv4 is the latest generation (huge and SO400M variants). In this release it is selectable for image classification only; it is not yet selectable for RT-DETR object detection, SegFormer semantic segmentation, or Visual ChangeNet change detection. For those tasks, use C-RADIOv2 or C-RADIOv3.
As a rule of thumb, prefer the newest version that supports your task and pick a size (base vs. large vs. huge) by trading accuracy against latency and memory: larger variants are more accurate but slower.
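The version guidance above can be sketched as a small lookup. This is a minimal illustration only: the task keys, the `SUPPORT` table, and `newest_generation` are hypothetical names, not part of TAO, and the table simply transcribes the statements above. Where the text lists no sizes for a generation, the sketch assumes any size is acceptable.

```python
from typing import Optional

# Generation -> task -> selectable sizes, transcribed from the guidance above.
# None means the text does not list sizes (assumed: no size restriction).
# The task keys are illustrative labels, not TAO configuration values.
SUPPORT = {
    "C-RADIOv2": {
        "classification": None,
        "object_detection": None,
        "semantic_segmentation": None,
        "change_detection": None,
    },
    "C-RADIOv3": {
        "classification": ("base", "large", "huge"),
        "object_detection": ("base", "large", "huge"),
        "semantic_segmentation": ("large",),
        "clip": None,
    },
    "C-RADIOv4": {
        "classification": ("huge", "so400m"),
    },
}

NEWEST_FIRST = ("C-RADIOv4", "C-RADIOv3", "C-RADIOv2")


def newest_generation(task: str, size: Optional[str] = None) -> Optional[str]:
    """Return the newest C-RADIO generation that supports the task (and size)."""
    for generation in NEWEST_FIRST:
        tasks = SUPPORT[generation]
        if task not in tasks:
            continue
        sizes = tasks[task]
        if size is None or sizes is None or size in sizes:
            return generation
    return None
```

For example, `newest_generation("change_detection")` falls through to C-RADIOv2, because neither newer generation lists that task.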
These models simplify the use of images in larger systems by producing all-purpose visual features, that is, features that work across image distributions and tasks without finetuning.
Trained on large curated datasets, NVIDIA’s model has learned a robust, fine-grained representation that is useful for localization and classification tasks.
They can be used as foundation models for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.
Image Classification with Foundational Model#
TAO supports finetuning of the following foundational vision encoders from NVIDIA for image classification:
| Foundation Model | Classification Head |
|---|---|
|  | Linear |
|  | Linear |
|  | Linear |
|  | Linear |
|  | Linear |
The foundational models provide rich visual representations that can be effectively leveraged for classification tasks through a simple linear head. These models, pretrained on large-scale datasets, can be fine-tuned with minimal labeled data to achieve strong performance on specific classification tasks.
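To make the idea of a linear head concrete, the following is a minimal pure-Python sketch, not TAO code. It assumes the backbone has already produced a feature vector; the feature values, weights, and function names are made up for illustration.

```python
import math
from typing import List, Sequence


def linear_head(features: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Compute one logit per class: dot(weights[c], features) + bias[c]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4-dimensional "backbone feature" and a 3-class head (made-up numbers).
features = [0.5, -1.0, 2.0, 0.0]
weights = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.1, 0.0]

logits = linear_head(features, weights, bias)
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

During finetuning, only `weights` and `bias` need to be learned if the backbone is kept frozen, which is why a small labeled dataset can be enough.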
To learn more about using a foundational model as a backbone for an image classification task, refer to the section on Image Classification PyTorch.
TAO also supports finetuning of the vision encoders from multi-modal foundation models for image classification:
The supported multi-modal foundation models include:
NVIDIA CLIP Image Backbones:
| Foundation Model | Classification Head |
|---|---|
|  | Linear |
OpenAI CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|---|---|---|
| ViT-B-32 | laion400m_e31, laion400m_e32, laion2b_e16, laion2b_s34b_b79k, datacomp_m_s128m_b4k, openai | 512 |
| ViT-B-16 | laion400m_e31 | 512 |
| ViT-L-14 | laion400m_e31 | 768 |
| ViT-H-14 | laion2b_s32b_b79k | 1024 |
| ViT-g-14 | laion2b_s12b_b42k | 1024 |
EVA-CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|---|---|---|
| EVA02-L-14 | merged2b_s4b_b131k | 768 |
| EVA02-L-14-336 | laion400m_e31 | 768 |
| EVA02-E-14 | laion400m_e31 | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k | 1024 |
Object Detection with Foundational Model#
TAO supports finetuning of the following foundational models for object detection:
| Foundation Model | Detection Architecture |
|---|---|
|  | DINO |
|  | RT-DETR |
|  | RT-DETR |
|  | RT-DETR |
|  | Not supported in this release |
To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter and Frozen-DETR style architectures with DINO and RT-DETR, respectively. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Specification File for ViT Backbones.
Semantic Segmentation with Foundational Model#
TAO supports finetuning of the following foundational models for semantic segmentation:
| Foundation Model | Segmentation Architecture |
|---|---|
|  | SegFormer |
|  | SegFormer |
|  | SegFormer |
|  | SegFormer |
|  | Not supported in this release |
For SegFormer, only the large C-RADIOv3 variant (c_radio_v3_vit_large_patch16_reg4_dinov2) is
selectable; the base and huge C-RADIOv3 variants are not available for this task.
These foundational models, pretrained on large-scale datasets, provide rich visual representations that can be effectively leveraged for dense prediction tasks through the SegFormer architecture.
To learn more about using a foundational model as a backbone for a semantic segmentation task, refer to Example Specification File for ViT Backbones.
Change Detection with Foundational Model#
TAO supports finetuning of the following foundational models for Visual ChangeNet classification and segmentation:
| Foundation Model | Change Detection Architecture |
|---|---|
|  | Visual ChangeNet |
|  | Visual ChangeNet |
|  | Visual ChangeNet |
|  | Not supported in this release |
|  | Not supported in this release |
For Visual ChangeNet, only C-RADIOv2 is selectable; C-RADIOv3 and C-RADIOv4 are not available for this task in this release.
To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
To learn more about using a foundational model as a backbone for a change detection task, refer to Visual ChangeNet - Segmentation Example Specification File for ViT Backbones and Visual ChangeNet - Classification Example Specification File for ViT Backbones.
Universal Segmentation (OneFormer) with C-RADIO#
TAO supports finetuning the following foundational models as the backbone for OneFormer, the universal (semantic, instance, and panoptic) segmentation architecture:
| Foundation Model | Segmentation Architecture |
|---|---|
|  | OneFormer |
|  | OneFormer |
|  | OneFormer |
Unlike RT-DETR, SegFormer, and Visual ChangeNet, OneFormer does not gate the C-RADIO generation
behind a per-task allow-list: C-RADIO v2, v3, and v4 are all selectable today. The C-RADIO backbone
is enabled by setting model.backbone.name to D2RADIO, and the specific architecture (and
therefore the C-RADIO generation) is chosen via the model.backbone.radio.backbone field paired
with the matching pretrained weights. For example, a vit_base_patch16_224 /
vit_large_patch16_224 / vit_huge_patch16_224 string selects C-RADIOv2, the
vit_*_patch16_reg4_dinov2 strings select C-RADIOv3, and vit_so400m_patch16_224 selects
C-RADIOv4.
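The backbone selection described above can be sketched as follows. This is an illustration under stated assumptions, not the TAO implementation: `radio_generation` and `oneformer_backbone_spec` are hypothetical helpers, and the nested dictionary only mirrors the two field paths named in the text (`model.backbone.name` and `model.backbone.radio.backbone`). The field that carries the matching pretrained weights is not named here, so it is left out.

```python
from fnmatch import fnmatchcase

# Backbone-string patterns and the C-RADIO generation each one selects,
# taken from the description above.
_PATTERNS = (
    ("vit_*_patch16_reg4_dinov2", "C-RADIOv3"),
    ("vit_so400m_patch16_224", "C-RADIOv4"),
    ("vit_base_patch16_224", "C-RADIOv2"),
    ("vit_large_patch16_224", "C-RADIOv2"),
    ("vit_huge_patch16_224", "C-RADIOv2"),
)


def radio_generation(backbone: str) -> str:
    """Map a backbone architecture string to its C-RADIO generation."""
    for pattern, generation in _PATTERNS:
        if fnmatchcase(backbone, pattern):
            return generation
    raise ValueError(f"Unrecognized C-RADIO backbone string: {backbone!r}")


def oneformer_backbone_spec(backbone: str) -> dict:
    """Build the backbone portion of a spec as a nested dict (hypothetical helper)."""
    radio_generation(backbone)  # raises ValueError for unknown strings
    return {
        "model": {
            "backbone": {
                "name": "D2RADIO",
                "radio": {"backbone": backbone},
            }
        }
    }


spec = oneformer_backbone_spec("vit_large_patch16_reg4_dinov2")
```

Validating the string before building the spec catches a mismatched architecture name early, before it is paired with the wrong pretrained weights.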
To learn more about using a C-RADIO backbone for universal segmentation, refer to the OneFormer model page.
Contrastive Vision-Language (CLIP / RADIO-CLIP) with Foundational Model#
The TAO CLIP task trains and deploys contrastive vision-language models. It supports three distinct vision foundation-model families as the image encoder, each paired with a pre-aligned text encoder to produce image and text embeddings in a shared space, suitable for zero-shot classification, image-text retrieval, and embedding extraction. The family is chosen by the model variant name set in the experiment specification; the CLIP model builder dispatches to the matching family automatically.
Foundation Model |
Family / Architecture |
Weights source |
|---|---|---|
OpenCLIP / NVCLIP |
OpenCLIP CLIP |
HuggingFace |
Google SigLIP2 |
SigLIP2 CLIP |
HuggingFace |
C-RADIO CLIP (RADIO-CLIP) |
|
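As a rough illustration of how a shared image-text embedding space enables zero-shot classification, the sketch below scores one image embedding against one text embedding per label using cosine similarity. The three-dimensional vectors are toy values rather than real model outputs, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(
    image_embedding: Sequence[float],
    text_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the label whose text embedding is most similar to the image."""
    image = l2_normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, l2_normalize(embedding)))
        for label, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoder outputs (made-up numbers).
text_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], text_embeddings)
```

The same similarity scores can be ranked across a gallery instead of a label set, which is the basis of image-text retrieval.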
The selectable variants for each family (from the CLIP model configs) are:
OpenCLIP / NV-CLIP (build_openclip_model):
* ViT-L-14-SigLIP-CLIPA-84
* ViT-L-14-SigLIP-CLIPA-224
* ViT-L-14-SigLIP-CLIPA-336
* ViT-H-14-SigLIP-CLIPA-84
* ViT-H-14-SigLIP-CLIPA-224
* ViT-H-14-SigLIP-CLIPA-336
* ViT-H-14-SigLIP-CLIPA-574
Google SigLIP2 (build_siglip2_model):
* siglip2-so400m-patch14-224
* siglip2-so400m-patch14-384
* siglip2-so400m-patch16-256
* siglip2-so400m-patch16-384
* siglip2-so400m-patch16-512
* siglip2-so400m-patch16-naflex (dynamic resolution; training only, not exportable to ONNX/TRT)
C-RADIO (build_radio_model), RADIO-CLIP, C-RADIOv3 only:
* c-radio_v3-b
* c-radio_v3-l
* c-radio_v3-h
* c-radio_v3-g
For the C-RADIO family, only C-RADIOv3 is selectable; neither C-RADIOv2 nor C-RADIOv4 is
available for the CLIP task in this release. RADIO-CLIP weights are loaded via torch.hub
(NVlabs/RADIO), while the OpenCLIP / NV-CLIP and SigLIP2 weights load from HuggingFace.
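The dispatch from variant name to family can be pictured with a small lookup. This is only a sketch built from the variant lists and weight sources stated above: `VARIANTS`, `WEIGHTS_SOURCE`, and `builder_for` are hypothetical names, and the actual CLIP model builder in TAO may be organized differently.

```python
# Variant names per builder, copied from the lists above.
VARIANTS = {
    "build_openclip_model": (
        "ViT-L-14-SigLIP-CLIPA-84",
        "ViT-L-14-SigLIP-CLIPA-224",
        "ViT-L-14-SigLIP-CLIPA-336",
        "ViT-H-14-SigLIP-CLIPA-84",
        "ViT-H-14-SigLIP-CLIPA-224",
        "ViT-H-14-SigLIP-CLIPA-336",
        "ViT-H-14-SigLIP-CLIPA-574",
    ),
    "build_siglip2_model": (
        "siglip2-so400m-patch14-224",
        "siglip2-so400m-patch14-384",
        "siglip2-so400m-patch16-256",
        "siglip2-so400m-patch16-384",
        "siglip2-so400m-patch16-512",
        "siglip2-so400m-patch16-naflex",
    ),
    "build_radio_model": (
        "c-radio_v3-b",
        "c-radio_v3-l",
        "c-radio_v3-h",
        "c-radio_v3-g",
    ),
}

# Where each family's weights are loaded from, as stated above.
WEIGHTS_SOURCE = {
    "build_openclip_model": "HuggingFace",
    "build_siglip2_model": "HuggingFace",
    "build_radio_model": "torch.hub (NVlabs/RADIO)",
}


def builder_for(variant: str) -> str:
    """Return the name of the builder that handles a given variant name."""
    for builder, names in VARIANTS.items():
        if variant in names:
            return builder
    raise ValueError(f"Unknown CLIP model variant: {variant!r}")


builder = builder_for("c-radio_v3-b")
source = WEIGHTS_SOURCE[builder]
```

Because the variant names do not overlap across families, an exact-match lookup is enough to pick the family unambiguously.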
RADIO-CLIP offers two pre-aligned text adaptors, selected with model.adaptor_name:
* siglip: a SigLIP2 text encoder (resolves internally to the version-specific siglip2 / siglip2-g adaptor).
* clip: a DFN CLIP text encoder.
To learn more about using these backbones as a CLIP image encoder, refer to CLIP Introduction and CLIP Training and Deployment.