Foundation models are large-scale machine learning models trained on vast quantities of data. These models are often trained using some form of self-supervised or semi-supervised learning. The main goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.
Early examples of foundation models were pre-trained language models (LMs), including Google’s BERT and the early GPT (Generative Pre-trained Transformer) models, notably OpenAI’s “GPT-n” series. Such broad models can, in turn, be adapted into task- and domain-specific models using targeted datasets of various kinds, for example, medical codes.
Having started as models primarily for text- and language-based applications, foundation models have evolved to support computer vision and multi-modal applications, with models such as DALL-E and Flamingo.
TAO Toolkit v5.1.0 introduces the ability to finetune the image backbone of foundation models for the following downstream tasks.
The supported backbones include:
OpenAI CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|---|---|---|
|  | laion400m_e31 | 512 |
| ViT-B-16 | laion400m_e31 | 512 |
| ViT-L-14 | laion400m_e31 | 768 |
| ViT-H-14 | laion2b_s32b_b79k | 1024 |
| ViT-g-14 | laion2b_s12b_b42k | 1024 |
EVA-CLIP Image Backbones:
| Arch | Pretrained Dataset | in_channels |
|---|---|---|
| EVA02-L-14 | merged2b_s4b_b131k | 768 |
| EVA02-L-14-336 | laion400m_e31 | 768 |
| EVA02-E-14 | laion400m_e31 | 1024 |
| EVA02-E-14-plus | laion2b_s32b_b79k | 1024 |
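The backbone tables above can be captured as a small lookup structure. The following is a minimal sketch, not a TAO Toolkit API: the `Backbone` type, the `BACKBONES` dictionary, and the `head_weight_shape` helper are hypothetical names introduced for illustration. It also assumes that `in_channels` is the width of the feature vector handed to a downstream head, which the tables do not state explicitly. The row whose architecture name is missing from the first table is omitted.

```python
from typing import NamedTuple


class Backbone(NamedTuple):
    pretrained: str   # value from the "Pretrained Dataset" column
    in_channels: int  # value from the "in_channels" column


# Rows transcribed from the tables above; illustrative only, not a TAO API.
BACKBONES = {
    # OpenAI CLIP image backbones
    "ViT-B-16": Backbone("laion400m_e31", 512),
    "ViT-L-14": Backbone("laion400m_e31", 768),
    "ViT-H-14": Backbone("laion2b_s32b_b79k", 1024),
    "ViT-g-14": Backbone("laion2b_s12b_b42k", 1024),
    # EVA-CLIP image backbones
    "EVA02-L-14": Backbone("merged2b_s4b_b131k", 768),
    "EVA02-L-14-336": Backbone("laion400m_e31", 768),
    "EVA02-E-14": Backbone("laion400m_e31", 1024),
    "EVA02-E-14-plus": Backbone("laion2b_s32b_b79k", 1024),
}


def head_weight_shape(arch: str, num_classes: int) -> tuple:
    """Return the weight shape of a linear head placed on top of `arch`.

    Assumes in_channels is the feature width seen by the head.
    """
    if arch not in BACKBONES:
        raise KeyError(f"unsupported backbone: {arch}")
    return (BACKBONES[arch].in_channels, num_classes)


if __name__ == "__main__":
    print(head_weight_shape("ViT-L-14", 10))
```

Keeping the pretrained tag and the channel count together means a downstream head can be sized from the architecture name alone, which avoids mismatches when switching backbones.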
NVIDIA has also released a foundation model called NV-Dinov2, which is available through the NVIDIA AI Enterprise program. NV-Dinov2 is a visual foundation model trained on an NVIDIA proprietary large-scale dataset. Dinov2 is a self-supervised learning (SSL) method that combines two SSL techniques:
DINO
iBOT
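To make the combination concrete, the sketch below shows the general shape of such an objective: a DINO-style self-distillation loss on an image-level token plus an iBOT-style loss of the same form on masked patch tokens. This is a simplified illustration and not the NV-Dinov2 training code. The function names, the temperatures, and the equal-weight sum are assumptions, and details such as the teacher's moving-average update and multi-crop augmentation are left out.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student.

    The temperature defaults are illustrative assumptions.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def combined_loss(student_cls, teacher_cls, student_patches, teacher_patches,
                  mask, cls_center, patch_center):
    """Image-level loss plus the mean patch-level loss over masked patches."""
    image_loss = distillation_loss(student_cls, teacher_cls, cls_center)
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    patch_loss = sum(
        distillation_loss(student_patches[i], teacher_patches[i], patch_center)
        for i in masked
    ) / max(len(masked), 1)
    # Equal weighting of the two terms is an assumption of this sketch.
    return image_loss + patch_loss
```

The image-level term encourages a consistent global representation, while the patch-level term asks the student to predict the teacher's output for patches it could not see, which pushes the features toward local detail.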
These models can simplify the use of images in systems by producing all-purpose visual features, that is, features that work across image distributions and tasks without finetuning.
Trained on large curated datasets, NVIDIA’s model has learned robust, fine-grained representations that are useful for localization and classification tasks.
This model can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, see: Dinov2
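One way to picture the use of frozen features with few labeled examples is a nearest-centroid classifier built on top of precomputed feature vectors. The sketch below assumes the features have already been extracted by some backbone; the short vectors used here are made-up stand-ins, and no TAO Toolkit or NV-Dinov2 API is called. The function names are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(features, labels):
    """Average the normalized features of each class into one centroid."""
    grouped = defaultdict(list)
    for feat, label in zip(features, labels):
        grouped[label].append(l2_normalize(feat))
    centroids = {}
    for label, feats in grouped.items():
        mean = [sum(column) / len(feats) for column in zip(*feats)]
        centroids[label] = l2_normalize(mean)
    return centroids


def predict(centroids, feature):
    """Return the label whose centroid has the highest cosine similarity."""
    query = l2_normalize(feature)

    def cosine(centroid):
        return sum(q * c for q, c in zip(query, centroid))

    return max(centroids, key=lambda label: cosine(centroids[label]))


if __name__ == "__main__":
    # Made-up 3-dimensional "features"; real backbone outputs are much wider.
    train_features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
                      [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
    train_labels = ["cat", "cat", "dog", "dog"]
    centroids = fit_centroids(train_features, train_labels)
    print(predict(centroids, [0.7, 0.3, 0.0]))
```

Because the backbone stays frozen, the only thing "trained" here is one centroid per class, which is why a handful of labeled examples can be enough when the underlying features are robust.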