Foundation Models

TAO Toolkit v5.2.0

Foundation models are large-scale machine learning models trained on vast quantities of data. They are often trained with some form of self-supervised or semi-supervised learning algorithm. The main goal of foundational models is to serve as a starting point that can be adapted to a variety of downstream tasks.

Early examples of foundation models were pre-trained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-trained Transformer) models, notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be adapted into task-specific and domain-specific models using targeted datasets of various kinds, for example, medical codes.

Having started as models primarily for text and language applications, foundation models have evolved to support computer vision and multi-modal applications, with models such as DALL-E and Flamingo.

TAO Toolkit v5.1.0 introduces the ability to finetune the image backbone of foundation models for the following downstream tasks.

The supported backbones include:

  • OpenAI CLIP Image Backbones:

* laion400m_e31
* laion400m_e32
* laion2b_e16
* laion2b_s34b_b79k
* datacomp_m_s128m_b4k
* laion2b_s34b_b79k
* laion2b_s34b_b79k
* laion2b_s34b_b79k
* openai

Arch             Pretrained Dataset    in_channels
ViT-B-16         laion400m_e31         512
ViT-L-14         laion400m_e31         768
ViT-H-14         laion2b_s32b_b79k     1024
ViT-g-14         laion2b_s12b_b42k     1024

  • EVA-CLIP Image Backbones:

Arch             Pretrained Dataset    in_channels
EVA02-L-14       merged2b_s4b_b131k    768
EVA02-L-14-336   laion400m_e31         768
EVA02-E-14       laion400m_e31         1024
EVA02-E-14-plus  laion2b_s32b_b79k     1024
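
When wiring one of these backbones to a downstream head, the in_channels value has to match the backbone you picked. The following is a minimal sketch of that lookup, not part of TAO Toolkit: the dictionary simply restates the complete rows listed above, the helper name head_input_width is hypothetical, and it assumes that in_channels denotes the feature width the backbone hands to the downstream head.

```python
# Rows restated from the backbone lists above: arch -> (pretrained dataset, in_channels).
# This table and the helper below are illustrative; they are not a TAO Toolkit API.
BACKBONES = {
    "ViT-B-16": ("laion400m_e31", 512),
    "ViT-L-14": ("laion400m_e31", 768),
    "ViT-H-14": ("laion2b_s32b_b79k", 1024),
    "ViT-g-14": ("laion2b_s12b_b42k", 1024),
    "EVA02-L-14": ("merged2b_s4b_b131k", 768),
    "EVA02-L-14-336": ("laion400m_e31", 768),
    "EVA02-E-14": ("laion400m_e31", 1024),
    "EVA02-E-14-plus": ("laion2b_s32b_b79k", 1024),
}


def head_input_width(arch: str) -> int:
    """Return the in_channels value listed for `arch` (hypothetical helper)."""
    try:
        return BACKBONES[arch][1]
    except KeyError:
        # Fail loudly instead of silently building a head with the wrong width.
        raise ValueError(f"unsupported backbone: {arch!r}") from None


if __name__ == "__main__":
    for arch, (dataset, width) in BACKBONES.items():
        print(f"{arch:16s} {dataset:20s} {width}")
```

Keeping the table in one place means a mismatch between the chosen architecture and the head width surfaces as an explicit error rather than a shape failure at training time.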

NVIDIA has also released a foundational model called NV-DINOv2, which is available through the NVIDIA AI Enterprise program. NV-DINOv2 is a visual foundational model trained on an NVIDIA proprietary large-scale dataset. DINOv2 is a self-supervised learning (SSL) method that combines two SSL techniques:

  • DINO

  • iBOT
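
As a rough illustration of the DINO-style objective, the sketch below computes the self-distillation loss for a single pair of views: the teacher output is centered and sharpened with a low temperature, and the student is trained to match it through a cross-entropy term. iBOT applies the same kind of objective to patch tokens, with some of the student's input patches masked. This is a simplified sketch in plain Python, not NVIDIA's training code; the function names and the temperature values are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              tau_student=0.1, tau_teacher=0.04):
    """Cross-entropy between the teacher and student distributions for one view pair.

    The teacher logits are centered (to discourage collapse onto one dimension)
    and sharpened with a lower temperature than the student's. The temperature
    defaults are illustrative values, not tuned settings.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(v) for v in log_softmax(centered, tau_teacher)]
    log_p_student = log_softmax(student_logits, tau_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    center = [0.0, 0.0, 0.0]
    print("student agrees with teacher:   ", dino_loss([2.0, 0.5, -1.0], teacher, center))
    print("student disagrees with teacher:", dino_loss([-1.0, 0.5, 2.0], teacher, center))
```

In the full method the teacher is not trained by gradient descent; its weights track an exponential moving average of the student's, and the loss is summed over many pairs of augmented views.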

These models simplify the use of images in a system by producing all-purpose visual features, that is, features that work across image distributions and tasks without finetuning.

Trained on large curated datasets, NVIDIA’s model has learned robust, fine-grained representations that are useful for localization and classification tasks.

This model can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, see DINOv2.
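
One way to picture "few labeled examples" is a nearest-centroid classifier built on frozen features: the backbone is never updated, and each class is represented by the mean of a handful of labeled feature vectors. The sketch below assumes the features have already been extracted; the toy three-dimensional vectors stand in for real backbone outputs, and the function names are hypothetical rather than part of TAO Toolkit.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(features, labels):
    """Average the normalized features of each class into one centroid per class."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        feat = l2_normalize(feat)
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {
        label: l2_normalize([x / counts[label] for x in acc])
        for label, acc in sums.items()
    }


def predict(centroids, feature):
    """Return the label whose centroid has the highest cosine similarity."""
    feature = l2_normalize(feature)
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(centroids[label], feature)),
    )


# Toy stand-ins for frozen backbone features: two labeled examples per class.
train_features = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.1]]
train_labels = ["cat", "cat", "dog", "dog"]
centroids = fit_centroids(train_features, train_labels)

if __name__ == "__main__":
    print(predict(centroids, [0.8, 0.1, 0.0]))
    print(predict(centroids, [0.1, 0.9, 0.0]))
```

Because only the centroids are computed, adding a new class needs a few labeled images and no gradient updates; finetuning the backbone remains an option when this simple baseline is not accurate enough.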

TAO Toolkit versions 5.2 and later support some of the foundational models for object detection. NV-DINOv2 can now be used as the backbone for the DINO object detection model.

To mitigate the inferior performance of a standard vision transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.

To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Spec File for ViT Backbones.

© Copyright 2024, NVIDIA. Last updated on Mar 18, 2024.