Foundation Models
Foundation models are large-scale machine learning models trained on vast quantities of data, often using some form of self-supervised or semi-supervised learning algorithm. The main goal of a foundation model is to serve as a starting point that can be adapted to a wide variety of downstream tasks.
Early examples of foundation models were pre-trained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-trained Transformer) models, most notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be specialized into task- and domain-specific models using targeted datasets of various kinds, for example, datasets of medical codes.
Having started as models primarily for text- and language-based applications, foundation models have since evolved to support computer vision and multi-modal applications, such as DALL-E and Flamingo.
TAO Toolkit v5.1.0 introduces the ability to fine-tune the image backbone of foundation models for downstream tasks.
The supported backbones include:
OpenAI CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|------|--------------------|-------------|
| ViT-B-32 | laion400m_e31 | 512 |
| ViT-B-16 | laion400m_e31 | 512 |
| ViT-L-14 | laion400m_e31 | 768 |
| ViT-H-14 | laion2b_s32b_b79k | 1024 |
| ViT-g-14 | laion2b_s12b_b42k | 1024 |
EVA-CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|------|--------------------|-------------|
| EVA02-L-14 | merged2b_s4b_b131k | 768 |
| EVA02-L-14-336 | laion400m_e31 | 768 |
| EVA02-E-14 | laion400m_e31 | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k | 1024 |
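Both families of checkpoints can be inspected outside of TAO with the open-source open_clip library. The following is a minimal sketch, assuming the architecture and dataset names above map directly to open_clip's pretrained registry (and that the open-clip-torch package, plus timm for the EVA02 towers, is installed); it loads a backbone and prints the feature width that a downstream head would see as `in_channels`:

```python
# Minimal sketch: load two of the backbones listed above via open_clip and
# check the image-embedding width (the `in_channels` seen by a TAO head).
# Assumes the arch/dataset names map to open_clip's pretrained registry.
import torch
import open_clip

for arch, pretrained in [
    ("ViT-B-16", "laion400m_e31"),         # expected width: 512
    ("EVA02-L-14", "merged2b_s4b_b131k"),  # expected width: 768
]:
    model, _, _ = open_clip.create_model_and_transforms(arch, pretrained=pretrained)
    model.eval()
    with torch.no_grad():
        feats = model.encode_image(torch.randn(1, 3, 224, 224))
    print(f"{arch} ({pretrained}): embedding dim = {feats.shape[-1]}")
```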
NVIDIA has also released a foundational model called NV-DINOv2, which is available through the NVIDIA AI Enterprise program. NV-DINOv2 is a visual foundational model trained on an NVIDIA proprietary large-scale dataset. DINOv2 is a self-supervised learning (SSL) method that uses a combination of two SSL techniques (a simplified sketch of the combined objective follows the list):
* DINO
* iBOT
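The sketch below shows, in much-simplified form, how these two objectives combine during training: DINO matches the student's image-level [CLS] distribution to the teacher's, while iBOT predicts the teacher's patch tokens at masked positions. All tensor names, temperatures, and weights are illustrative placeholders, not the actual DINOv2 (or NV-DINOv2) implementation.

```python
# Simplified sketch of the combined DINO + iBOT objective (illustrative only).
import torch
import torch.nn.functional as F

def dino_loss(student_cls, teacher_cls, temp_s=0.1, temp_t=0.04):
    # Image-level loss: cross-entropy between teacher and student [CLS]
    # distributions; gradients never flow through the teacher.
    t = F.softmax(teacher_cls / temp_t, dim=-1).detach()
    s = F.log_softmax(student_cls / temp_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def ibot_loss(student_patches, teacher_patches, mask, temp_s=0.1, temp_t=0.04):
    # Patch-level loss: predict the teacher's token distribution at the
    # positions that were masked out in the student's input.
    t = F.softmax(teacher_patches / temp_t, dim=-1).detach()
    s = F.log_softmax(student_patches / temp_s, dim=-1)
    per_patch = -(t * s).sum(dim=-1)              # (batch, num_patches)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Total objective, with an illustrative weighting:
# loss = dino_loss(s_cls, t_cls) + 1.0 * ibot_loss(s_patch, t_patch, mask)
```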
These models can simplify the use of images in downstream systems by producing all-purpose visual features: features that work across image distributions and tasks without fine-tuning.
Trained on large curated datasets, NVIDIA’s model has learned robust, fine-grained representations that are useful for both localization and classification tasks.
This model can be used as a foundation for a variety of downstream tasks with only a few labeled examples. For more details on the method, see DINOv2.
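As an illustration of such frozen, all-purpose features, the sketch below extracts a global image embedding with the publicly released DINOv2 checkpoint from Meta; NV-DINOv2 itself is distributed through NVIDIA AI Enterprise, so the public model stands in here:

```python
# Sketch: a DINOv2-style backbone as a frozen feature extractor. Uses the
# public Meta checkpoint for illustration; NV-DINOv2 ships via NVIDIA AI
# Enterprise.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

# Input sides must be divisible by the 14-pixel patch size.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(image)   # (1, 768) global image embedding
print(features.shape)
```

A lightweight head, such as a linear classifier, trained on these frozen embeddings is often sufficient for the few-label downstream tasks mentioned above.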
TAO Toolkit versions 5.2 and later support some of these foundational models for object detection: NV-DINOv2 can now be used as the backbone for the DINO object detection model.
To mitigate the inferior performance of a standard Vision Transformer (ViT) on dense prediction tasks, TAO supports the ViT-Adapter architecture. This allows a powerful ViT that has learned rich semantic representations from a large corpus of data to achieve performance comparable to vision-specific transformers on dense prediction tasks.
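The sketch below illustrates only the core problem that ViT-Adapter addresses: a plain ViT emits a single-stride grid of patch tokens, while dense-prediction heads expect a multi-scale feature pyramid. This is a deliberately simplified resampling stand-in; the real ViT-Adapter additionally injects convolutional spatial priors through cross-attention interactions, which this sketch omits.

```python
# Deliberately simplified: turn a ViT's single-scale patch-token grid into a
# multi-scale pyramid for a dense-prediction head. Not the actual ViT-Adapter.
import torch
import torch.nn as nn

class SimplePyramid(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.up4 = nn.ConvTranspose2d(dim, dim, kernel_size=4, stride=4)  # 1/16 -> 1/4
        self.up2 = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)  # 1/16 -> 1/8
        self.down2 = nn.Conv2d(dim, dim, kernel_size=2, stride=2)         # 1/16 -> 1/32

    def forward(self, tokens, h, w):
        # tokens: (batch, h*w, dim) patch tokens from a stride-16 ViT
        x = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)
        return [self.up4(x), self.up2(x), x, self.down2(x)]  # strides 4/8/16/32

# A 1024x1024 image with 16x16 patches yields a 64x64 token grid:
feats = SimplePyramid()(torch.randn(1, 64 * 64, 768), 64, 64)
print([tuple(f.shape) for f in feats])
```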
To learn more about using a foundational model as a backbone for an object-detection task, refer to Example Spec File for ViT Backbones.