Foundation Models#

Foundation models are large-scale pretrained models that serve as the backbone for a wide range of computer vision and multi-modal applications. They are typically trained with self-supervised or semi-supervised algorithms over large-scale datasets. The main goal of a foundation model is to serve as a starter model that can be adapted to a variety of downstream tasks.

Early examples of foundation models were pretrained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-trained Transformer) models, notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be adapted into task-specific and domain-specific models using targeted datasets of various kinds, for example, medical codes.

Having started as models primarily for text and language-based applications, foundation models have evolved to support computer vision and multi-modal applications, with models such as DALL-E and Flamingo.

TAO v6.0.0 introduces the ability to:

  • Pretrain or adapt foundation models to your domain, given a corpus of unstructured data, or

  • Fine-tune and use them for downstream computer vision tasks, such as:

    • Image classification

    • Object detection

    • Semantic segmentation

    • Change detection

The NVIDIA-trained foundation models supported for domain adaptation and downstream finetuning in TAO include:

C-RADIO (NVIDIA’s “C”ommercial RADIO, AGM: Reduce All Domains Into One) is a family of agglomerative vision foundation models that distill multiple teacher models into a single backbone, yielding general-purpose visual features suitable for both classification and dense-prediction tasks.

Note

The C-RADIOv3 and C-RADIOv4 links above point to specific Hugging Face variants: C-RADIOv3-B (the base variant) and C-RADIOv4-H (the huge variant). TAO registers additional sizes for C-RADIOv3 (base, large, huge), but the exact set selectable for any given downstream task is restricted as noted in the per-task tables below.

Which C-RADIO version should I pick?

  • C-RADIOv2 has the broadest task coverage in TAO: it is the only C-RADIO generation selectable for all four of the classification, object detection, semantic segmentation, and change detection tasks.

  • C-RADIOv3 is the newer generation and is recommended for classification and RT-DETR object detection, where the full base/large/huge size range is available. For SegFormer semantic segmentation, only the large variant is supported. C-RADIOv3 is not available for Visual ChangeNet change detection. C-RADIOv3 also powers RADIO-CLIP (contrastive vision-language embeddings): it is the only C-RADIO generation usable with the CLIP task — neither C-RADIOv2 nor C-RADIOv4 is selectable for CLIP.

  • C-RADIOv4 is the latest generation (huge and SO400M variants). In this release it is selectable for image classification only; it is not yet selectable for RT-DETR object detection, SegFormer semantic segmentation, or Visual ChangeNet change detection. For those tasks, use C-RADIOv2 or C-RADIOv3.

As a rule of thumb, prefer the newest version that supports your task and pick a size (base vs. large vs. huge) by trading accuracy against latency and memory: larger variants are more accurate but slower.
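The guidance above can be encoded as a small lookup. The following is a minimal Python sketch, not a TAO API: the task keys and the `newest_supported` helper are hypothetical names chosen for illustration, and the table reflects only the support statements made in this section for the four tasks listed.

```python
# Hypothetical support matrix transcribed from the guidance above; this is
# not a TAO API. Keys are illustrative task names, and each value lists the
# selectable C-RADIO generations ordered from oldest to newest.
CRADIO_SUPPORT = {
    "classification": ["C-RADIOv2", "C-RADIOv3", "C-RADIOv4"],
    "object_detection_rtdetr": ["C-RADIOv2", "C-RADIOv3"],
    # For SegFormer, only the large C-RADIOv3 variant is selectable.
    "semantic_segmentation_segformer": ["C-RADIOv2", "C-RADIOv3"],
    "change_detection_visual_changenet": ["C-RADIOv2"],
}


def newest_supported(task: str) -> str:
    """Return the newest C-RADIO generation listed for the given task."""
    try:
        return CRADIO_SUPPORT[task][-1]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


if __name__ == "__main__":
    for task in CRADIO_SUPPORT:
        print(task, "->", newest_supported(task))
```

Choosing the model size (base, large, or huge) remains a separate decision that depends on your accuracy, latency, and memory budget.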

These models can simplify the use of images in systems by producing all-purpose visual features, that is, features that work across image distributions and tasks without finetuning.

Trained on large curated datasets, NVIDIA’s model has learned robust, fine-grained representations that are useful for localization and classification tasks.

They can be used as foundation models for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.

Image Classification with Foundational Model#

TAO supports finetuning of the following foundational vision encoders from NVIDIA for image classification:

| Foundation Model | Classification Head |
| --- | --- |
| NvDINOv2 | Linear |
| RADIOv2.5 | Linear |
| C-RADIOv2 | Linear |
| C-RADIOv3 | Linear |
| C-RADIOv4 | Linear |

The foundational models provide rich visual representations that can be effectively leveraged for classification tasks through a simple linear head. These models, pretrained on large-scale datasets, can be fine-tuned with minimal labeled data to achieve strong performance on specific classification tasks.
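A linear head is a single affine layer applied to the backbone's feature vector, followed by a softmax to obtain class probabilities. The pure-Python sketch below illustrates only that computation. The function name, the toy feature vector, and the weights are made up for illustration and are not TAO code.

```python
import math
from typing import List


def linear_head(
    features: List[float],
    weights: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Map one backbone feature vector to class probabilities.

    `weights` holds one row per class, each as long as `features`.
    """
    # Affine layer: one logit per class.
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy 2-dimensional "feature" and a 2-class head (illustrative values).
    probs = linear_head([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
    print(probs)
```

In practice the feature vector comes from the pretrained encoder, and only the head (and optionally the encoder) is updated during finetuning.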

To learn more about using a foundational model as a backbone for an image classification task, refer to the section on Image Classification PyTorch.

TAO also supports finetuning of the vision encoders from multi-modal foundation models for image classification. The supported multi-modal foundation models include:

  • NVIDIA CLIP Image Backbones:

| Foundation Model | Classification Head |
| --- | --- |
| NVCLIP | Linear |

  • OpenAI CLIP Image Backbones:

| Arch | Pretrained Dataset | in_channels |
| --- | --- | --- |
| ViT-B-32 | laion400m_e31, laion400m_e32, laion2b_e16, laion2b_s34b_b79k, datacomp_m_s128m_b4k, openai | 512 |
| ViT-B-16 | laion400m_e31 | 512 |
| ViT-L-14 | laion400m_e31 | 768 |
| ViT-H-14 | laion2b_s32b_b79k | 1024 |
| ViT-g-14 | laion2b_s12b_b42k | 1024 |

  • EVA-CLIP Image Backbones:

| Arch | Pretrained Dataset | in_channels |
| --- | --- | --- |
| EVA02-L-14 | merged2b_s4b_b131k | 768 |
| EVA02-L-14-336 | laion400m_e31 | 768 |
| EVA02-E-14 | laion400m_e31 | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k | 1024 |

Object Detection with Foundational Model#

TAO supports finetuning of the following foundational models for object detection:

| Foundation Model | Detection Architecture |
| --- | --- |
| NvDINOv2 | DINO |
| RADIOv2.5 | RT-DETR |
| C-RADIOv2 | RT-DETR |
| C-RADIOv3 | RT-DETR |
| C-RADIOv4 | Not supported in this release |

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter and Frozen-DETR style architectures with DINO and RT-DETR, respectively. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.

To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Specification File for ViT Backbones.

Semantic Segmentation with Foundational Model#

TAO supports finetuning of the following foundational models for semantic segmentation:

| Foundation Model | Segmentation Architecture |
| --- | --- |
| NvDINOv2 | SegFormer |
| RADIOv2.5 | SegFormer |
| C-RADIOv2 | SegFormer |
| C-RADIOv3 | SegFormer |
| C-RADIOv4 | Not supported in this release |

For SegFormer, only the large C-RADIOv3 variant (c_radio_v3_vit_large_patch16_reg4_dinov2) is selectable; the base and huge C-RADIOv3 variants are not available for this task.

These foundational models, pretrained on large-scale datasets, provide rich visual representations that can be effectively leveraged for dense prediction tasks through the SegFormer architecture.

To learn more about using a foundational model as a backbone for a semantic segmentation task, refer to Example Specification File for ViT Backbones.

Change Detection with Foundational Model#

TAO supports finetuning of the following foundational models for Visual ChangeNet (classification and segmentation):

| Foundation Model | Change Detection Architecture |
| --- | --- |
| NvDINOv2 | Visual ChangeNet |
| RADIOv2.5 | Visual ChangeNet |
| C-RADIOv2 | Visual ChangeNet |
| C-RADIOv3 | Not supported in this release |
| C-RADIOv4 | Not supported in this release |

For Visual ChangeNet, only C-RADIOv2 is selectable; C-RADIOv3 and C-RADIOv4 are not available for this task in this release.

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.

To learn more about using a foundational model as a backbone for a change detection task, refer to Visual ChangeNet - Segmentation Example Specification File for ViT Backbones and Visual ChangeNet - Classification Example Specification File for ViT Backbones.

Universal Segmentation (OneFormer) with C-RADIO#

TAO supports finetuning the following foundational models as the backbone for OneFormer, the universal (semantic, instance, and panoptic) segmentation architecture:

| Foundation Model | Segmentation Architecture |
| --- | --- |
| C-RADIOv2 | OneFormer |
| C-RADIOv3 | OneFormer |
| C-RADIOv4 | OneFormer |

Unlike RT-DETR, SegFormer, and Visual ChangeNet, OneFormer does not gate the C-RADIO generation behind a per-task allow-list: C-RADIO v2, v3, and v4 are all selectable today. The C-RADIO backbone is enabled by setting model.backbone.name to D2RADIO, and the specific architecture (and therefore the C-RADIO generation) is chosen via the model.backbone.radio.backbone field paired with the matching pretrained weights. For example, a vit_base_patch16_224 / vit_large_patch16_224 / vit_huge_patch16_224 string selects C-RADIOv2, the vit_*_patch16_reg4_dinov2 strings select C-RADIOv3, and vit_so400m_patch16_224 selects C-RADIOv4.
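The mapping from backbone string to C-RADIO generation described above can be sketched as a small helper. This is an illustrative Python sketch, not TAO code: the `cradio_generation` function is hypothetical, the nested dictionary only mirrors the `model.backbone.name` and `model.backbone.radio.backbone` field paths named above, and `vit_large_patch16_reg4_dinov2` is assumed to be one instance of the `vit_*_patch16_reg4_dinov2` pattern.

```python
import fnmatch
from typing import Optional

# Patterns transcribed from the paragraph above. The first matching
# pattern wins; strings not mentioned on this page map to None.
_BACKBONE_PATTERNS = [
    ("vit_*_patch16_reg4_dinov2", "C-RADIOv3"),
    ("vit_so400m_patch16_224", "C-RADIOv4"),
    ("vit_base_patch16_224", "C-RADIOv2"),
    ("vit_large_patch16_224", "C-RADIOv2"),
    ("vit_huge_patch16_224", "C-RADIOv2"),
]


def cradio_generation(backbone: str) -> Optional[str]:
    """Return the C-RADIO generation implied by a backbone string."""
    for pattern, generation in _BACKBONE_PATTERNS:
        if fnmatch.fnmatchcase(backbone, pattern):
            return generation
    return None


if __name__ == "__main__":
    # Illustrative fragment of an experiment spec, written as a Python dict.
    spec = {
        "model": {
            "backbone": {
                "name": "D2RADIO",
                "radio": {"backbone": "vit_large_patch16_reg4_dinov2"},
            }
        }
    }
    chosen = spec["model"]["backbone"]["radio"]["backbone"]
    print(chosen, "->", cradio_generation(chosen))
```

Whichever generation is selected, the pretrained weights supplied alongside it must match that architecture.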

To learn more about using a C-RADIO backbone for universal segmentation, refer to the OneFormer model page.

Contrastive Vision-Language (CLIP / RADIO-CLIP) with Foundational Model#

The TAO CLIP task trains and deploys contrastive vision-language models. It supports three distinct vision foundation-model families as the image encoder, each paired with a pre-aligned text encoder to produce image and text embeddings in a shared space, suitable for zero-shot classification, image-text retrieval, and embedding extraction. The family is chosen by the model variant name set in the experiment specification; the CLIP model builder dispatches to the matching family automatically.

| Foundation Model | Family / Architecture | Weights source |
| --- | --- | --- |
| OpenCLIP / NVCLIP | OpenCLIP CLIP | HuggingFace |
| Google SigLIP2 | SigLIP2 CLIP | HuggingFace |
| C-RADIOv3 | C-RADIO CLIP (RADIO-CLIP) | torch.hub |
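Because image and text embeddings live in a shared space, zero-shot classification reduces to picking the label whose text embedding is most similar to the image embedding. The following pure-Python sketch illustrates that idea with cosine similarity. The function names and the toy three-dimensional vectors are invented for illustration; real embeddings would come from the image and text encoders.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    image_embedding: List[float],
    text_embeddings: Dict[str, List[float]],
) -> str:
    """Return the label whose text embedding best matches the image."""
    return max(
        text_embeddings,
        key=lambda label: cosine(image_embedding, text_embeddings[label]),
    )


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (illustrative values).
    prompts = {
        "a photo of a cat": [1.0, 0.0, 0.0],
        "a photo of a dog": [0.0, 1.0, 0.0],
    }
    print(zero_shot_classify([0.9, 0.1, 0.0], prompts))
```

Image-text retrieval uses the same similarity score, ranking many candidates instead of taking only the best one.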

The selectable variants for each family (from the CLIP model configs) are:

  • OpenCLIP / NV-CLIP (build_openclip_model):

    • ViT-L-14-SigLIP-CLIPA-84

    • ViT-L-14-SigLIP-CLIPA-224

    • ViT-L-14-SigLIP-CLIPA-336

    • ViT-H-14-SigLIP-CLIPA-84

    • ViT-H-14-SigLIP-CLIPA-224

    • ViT-H-14-SigLIP-CLIPA-336

    • ViT-H-14-SigLIP-CLIPA-574

  • Google SigLIP2 (build_siglip2_model):

    • siglip2-so400m-patch14-224

    • siglip2-so400m-patch14-384

    • siglip2-so400m-patch16-256

    • siglip2-so400m-patch16-384

    • siglip2-so400m-patch16-512

    • siglip2-so400m-patch16-naflex (dynamic resolution; training only — not exportable to ONNX/TRT)

  • C-RADIO (build_radio_model) — RADIO-CLIP, C-RADIOv3 only:

    • c-radio_v3-b

    • c-radio_v3-l

    • c-radio_v3-h

    • c-radio_v3-g

For the C-RADIO family, only C-RADIOv3 is selectable; neither C-RADIOv2 nor C-RADIOv4 is available for the CLIP task in this release. RADIO-CLIP weights are loaded via torch.hub (NVlabs/RADIO), while the OpenCLIP / NV-CLIP and SigLIP2 weights load from HuggingFace.
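The dispatch from variant name to model family can be pictured as a lookup over the lists above. The sketch below is a hypothetical illustration in Python, not TAO's implementation: the variant and builder names are taken from this page, while `builder_for` and `NOT_EXPORTABLE` are invented for the example.

```python
# Variant names and builder names are transcribed from this page. The
# lookup logic itself is a hypothetical illustration, not TAO's code.
OPENCLIP_VARIANTS = (
    "ViT-L-14-SigLIP-CLIPA-84",
    "ViT-L-14-SigLIP-CLIPA-224",
    "ViT-L-14-SigLIP-CLIPA-336",
    "ViT-H-14-SigLIP-CLIPA-84",
    "ViT-H-14-SigLIP-CLIPA-224",
    "ViT-H-14-SigLIP-CLIPA-336",
    "ViT-H-14-SigLIP-CLIPA-574",
)
SIGLIP2_VARIANTS = (
    "siglip2-so400m-patch14-224",
    "siglip2-so400m-patch14-384",
    "siglip2-so400m-patch16-256",
    "siglip2-so400m-patch16-384",
    "siglip2-so400m-patch16-512",
    "siglip2-so400m-patch16-naflex",
)
RADIO_VARIANTS = (
    "c-radio_v3-b",
    "c-radio_v3-l",
    "c-radio_v3-h",
    "c-radio_v3-g",
)

# Per this page, the naflex variant is training only (no ONNX/TRT export).
NOT_EXPORTABLE = {"siglip2-so400m-patch16-naflex"}


def builder_for(variant: str) -> str:
    """Return the name of the builder associated with a variant name."""
    if variant in OPENCLIP_VARIANTS:
        return "build_openclip_model"
    if variant in SIGLIP2_VARIANTS:
        return "build_siglip2_model"
    if variant in RADIO_VARIANTS:
        return "build_radio_model"
    raise ValueError(f"unsupported CLIP variant: {variant!r}")


if __name__ == "__main__":
    for name in ("ViT-H-14-SigLIP-CLIPA-224", "c-radio_v3-l"):
        print(name, "->", builder_for(name))
```

A check against `NOT_EXPORTABLE` before export is one way to surface the naflex limitation early, rather than at export time.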

RADIO-CLIP offers two pre-aligned text adaptors, selected with model.adaptor_name:

  • siglip — a SigLIP2 text encoder (resolves internally to the version-specific siglip2 / siglip2-g adaptor).

  • clip — a DFN CLIP text encoder.

To learn more about using these backbones as a CLIP image encoder, refer to CLIP Introduction and CLIP Training and Deployment.