Overview

TAO Toolkit provides an extensive model zoo containing pretrained models for both computer vision and conversational AI use cases.

Computer Vision Model Zoo

There are two types of pre-trained models that you can start with:

  • General-purpose vision models: The pre-trained weights for these models merely act as a starting point to build more complex models. For computer vision use cases, these weights are trained on the Open Images dataset, and they provide a much better starting point for training than a random initialization of weights.

  • Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can either be used directly for inference or can be used with TAO Toolkit for transfer learning on your own dataset.

../../_images/tao_toolkit_models_tree.png

* New in TAO Toolkit 3.0-21.08 GA

You can choose from 100+ permutations of model architecture and backbone with the general-purpose vision models. For more information on fine-tuning models for conversational AI use cases, see the pretrained models section for Conversational AI.

Purpose-built models

Purpose-built models are built for high accuracy and performance. You can deploy these models out of the box for applications such as smart city, retail, public safety, and healthcare, or you can retrain them with your own data. All models are trained on thousands of proprietary images and achieve very high accuracy on NVIDIA test data. More information about each of these models is available in the individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail, hospitals, etc. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

| Model Name | Network Architecture | Number of classes | Accuracy | Use Case |
|---|---|---|---|---|
| TrafficCamNet | DetectNet_v2-ResNet18 | 4 | 84% mAP | Detect and track cars. |
| PeopleNet | DetectNet_v2-ResNet18/34 | 3 | 84% mAP | People counting, heatmap generation, social distancing. |
| DashCamNet | DetectNet_v2-ResNet18 | 4 | 80% mAP | Identify objects from a moving object. |
| FaceDetectIR | DetectNet_v2-ResNet18 | 1 | 96% mAP | Detect faces in a dark environment with an IR camera. |
| VehicleMakeNet | ResNet18 | 20 | 91% mAP | Classifying car models. |
| VehicleTypeNet | ResNet18 | 6 | 96% mAP | Classifying type of cars as coupe, sedan, truck, etc. |
| PeopleSegNet | MaskRCNN-ResNet50 | 1 | 85% mAP | Creates segmentation masks around people, provides pixel |
| PeopleSemSegNet | Vanilla Unet Dynamic | 2 | 92% mIOU | Creates semantic segmentation masks for people. |
| PeopleSemSegNet | Shuffle Unet | 2 | 87% mIOU | Creates semantic segmentation masks for people. |
| License Plate Detection | DetectNet_v2-ResNet18 | 1 | 98% mAP | Detecting and localizing license plates on vehicles |
| License Plate Recognition | Tuned ResNet18 | 36(US) / 68(CH) | 97%(US)/99%(CH) | Recognize license plate numbers |
| Gaze Estimation | Four branch AlexNet based model | NA | 6.5 RMSE | Detects person’s eye gaze |
| Facial Landmark | Recombinator networks | NA | 6.1 pixel error | Estimates key points on person’s face |
| Heart Rate Estimation | Two branch model with attention | NA | 0.7 BPM | Estimates person’s heart rate from RGB video |
| Gesture Recognition | ResNet18 | 6 | 0.85 F1 score | Recognize hand gestures |
| Emotion Recognition | 5 Fully Connected Layers | 6 | 0.91 F1 score | Recognize facial emotion |
| FaceDetect | DetectNet_v2-ResNet18 | 1 | 85.3 mAP | Detect faces from RGB or grayscale image |
| BodyPoseNet | Single shot bottom-up | 18 | 56.1% mAP* | Estimates body key points for persons in the image |
| PoseClassificationNet | ST-Graph Convolutional Network | 6 | 89.62 | Classify poses of people from their skeletons |
| PointPillarNet | PointPillars | | 65.22 mAP | Detect objects from Lidar point cloud |

Note

The accuracy reported for BodyPoseNet is based on a model trained using the COCO dataset. To reproduce the same accuracy, use the sample notebook.

Performance Metrics

The performance of these pretrained models across various NVIDIA platforms is summarized in the table below. The numbers in the table are inference throughput figures measured with the trtexec tool from the TensorRT samples.

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 8 | 218 | 8 | 128 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 8 | 169 | 8 | 94 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 8 | 79 | 8 | 46 |
| TrafficCamNet | 960x544x3 | INT8 | 8 | 251 | 8 | 174 |
| DashCamNet | 960x544x3 | INT8 | 16 | 251 | 32 | 172 |
| FaceDetect-IR | 384x240x3 | INT8 | 32 | 1407 | 32 | 974 |
| VehicleMakeNet | 224x224x3 | INT8 | 32 | 2434 | 32 | 1166 |
| VehicleTypeNet | 224x224x3 | INT8 | 32 | 1781 | 32 | 1064 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 16 | 395 | 16 | 268 |
| License Plate Detection | 640x480x3 | INT8 | 16 | 784 | 16 | 388 |
| License Plate Recognition | 96x48x3 | FP16 | 16 | 706 | | |
| Facial landmark | 80x80x1 | FP16 | 16 | 1105 | | |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 32 | 812 | | |
| GestureNet | 160x160x3 | FP16 | 32 | 2585 | | |
| BodyPose | 288x384x3 | INT8 | 4 | 104 | | |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 245 | | |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 21 | | |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 317 | | |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 8 | 25 | | |
| Point Pillar | | FP16 | 1 | 25 | | |
| Pose classification | | FP16 | 8 | 87 | | |
| 3D Pose - Accuracy | | FP16 | 16 | 117 | | |
| 3D Pose - Performance | | FP16 | 16 | 147 | | |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 199 | | |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 15 | | |
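The table reports throughput (FPS) at a given batch size (BS). As a rough illustration, and assuming that FPS counts individual frames per second rather than batches, the time to process one batch can be estimated as BS / FPS. The helper name `batch_latency_ms` is hypothetical, and the sample values are copied from the GPU columns of the table above.

```python
def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Estimate the time to process one batch, in milliseconds.

    Assumes `fps` counts frames (not batches) per second.
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return 1000.0 * batch_size / fps


# (model, GPU batch size, GPU FPS) copied from the performance table
rows = [
    ("PeopleNet-ResNet18", 8, 218),
    ("TrafficCamNet", 8, 251),
    ("FaceDetect-IR", 32, 1407),
]

for name, bs, fps in rows:
    print(f"{name}: ~{batch_latency_ms(bs, fps):.1f} ms per batch of {bs}")
```

This is only an estimate: it ignores pipeline overlap and data-transfer time, so latency measured directly with trtexec may differ.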

General purpose computer vision models

With general purpose models, you can train an image classification model, object detection model, or an instance segmentation model.

  • For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.

  • For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.

  • For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, lightweight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the table below and see the Open Model Architectures section.

../../_images/tao_matrix.png

TAO Toolkit 3.0-22.05

Computer Vision Feature Summary

The table below summarizes the computer vision models and the features enabled.

Feature Summary

| CV Task | Model | New in 22-04 | Pruning | QAT | REST API | Channel-wise QAT | Class weighting | Visualization | BYOM | Multi-node | Multi-GPU | AMP | Early Stopping | Framework | Annotation Format | DLA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classification | ResNet10/18/34/50/101 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | VGG16/19 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | GoogleNet | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | MobileNet_v1/v2 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | SqueezeNet | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | DarkNet19/53 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | EfficientNet_B0-B7 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | CSPDarkNet19/53 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | CSPDarkNet-Tiny | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Object Detection | YoloV3 | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | KITTI/COCO | Yes |
| Object Detection | YoloV4 | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | YoloV4 - Tiny | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | FasterRCNN | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | EfficientDet | No | Yes | No | Yes | No | No | No | No | Yes | Yes | Yes | No | tf1 | COCO | Yes |
| Object Detection | RetinaNet | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | DetectNet_v2 | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | No | tf1 | KITTI/COCO | Yes |
| Object Detection | SSD | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | DSSD | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Multitask classification | All classification | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | Custom | Yes |
| Instance Segmentation | MaskRCNN | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | COCO | No |
| Semantic Segmentation | UNET | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | CityScape - PNG | No |
| Character Recognition | LPRNet | No | No | No | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | Custom - txt file | No |
| Key Points | 2D body pose | No | Yes | No, but PTQ | No | No | No | No | No | Yes | Yes | Yes | No | tf1 | COCO | No |
| Point Cloud | PointPillars | Yes | Yes | No | No | No | No | No | No | Yes | Yes | Yes | No | pyt | KITTI | No |
| Action Recognition | 2D action recognition RGB | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 3D action recognition RGB | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 2D action recognition OF | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 3D action recognition OF | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Other | Pose action classification | Yes | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | COCO | No |
| Other | HeartRateNet | No | No | No | No | No | No | No | No | No | Yes | Yes | No | tf1 | NVIDIA Defined | No |
| Other | GazeNet | No | No | No | No | No | No | No | No | No | Yes | Yes | No | tf1 | NVIDIA Defined | No |
| Other | EmotionNet | No | No | No | No | No | No | Yes | No | No | No | Yes | No | tf1 | NVIDIA Defined | No |
| Other | GestureNet | No | No | No | No | No | No | No | No | Yes | Yes | Yes | No | tf1 | NVIDIA Defined | No |
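When choosing a model, it can help to treat the feature summary as data and filter it by the features a project needs. The sketch below is illustrative only: the dictionary layout and the helper `models_with` are assumptions made for this example and are not part of any TAO Toolkit API. The flags are transcribed from a handful of rows in the table above.

```python
# Feature flags transcribed from a few rows of the feature summary table.
# The structure and key names are hypothetical, chosen for illustration.
FEATURES = {
    "YoloV4":       {"pruning": True,  "qat": True,  "early_stopping": True,  "dla": True},
    "EfficientDet": {"pruning": True,  "qat": False, "early_stopping": False, "dla": True},
    "UNET":         {"pruning": True,  "qat": True,  "early_stopping": False, "dla": False},
    "LPRNet":       {"pruning": False, "qat": False, "early_stopping": True,  "dla": False},
    "PointPillars": {"pruning": True,  "qat": False, "early_stopping": False, "dla": False},
}


def models_with(*required: str) -> list[str]:
    """Return the sorted names of models that support every required feature."""
    return sorted(
        name
        for name, flags in FEATURES.items()
        if all(flags[feature] for feature in required)
    )


# Example query: models that support both pruning and QAT.
print(models_with("pruning", "qat"))
```

The same pattern extends to the remaining rows and columns if the full table is transcribed.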

Conversational AI

Purpose Built Models for Conversational AI

| Model Name | Base Architecture | Dataset | Purpose |
|---|---|---|---|
| Speech to Text English Jasper | Jasper | ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours | Speech Transcription |
| Speech to Text English QuartzNet | QuartzNet | ASR Set 1.2 | Speech Transcription |
| Speech to Text English CitriNet | CitriNet | ASR Set 1.4 | Speech Transcription |
| Speech to Text English Conformer | Conformer | ASR Set 1.4 | Speech Transcription |
| Question Answering SQUAD2.0 Bert | BERT | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Question Answering SQUAD2.0 Bert - Large | BERT Large | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Question Answering SQUAD2.0 Megatron | Megatron | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Named Entity Recognition Bert | BERT | GMB (Groningen Meaning Bank) | Identifying entities in a given text (Supported Categories: Geographical Entity, Organization, Person, Geopolitical Entity, Time Indicator, Natural Phenomenon/Event) |
| Joint Intent and Slot Classification Bert | BERT | Proprietary | Classifying an intent and detecting all relevant slots (entities) for this intent in a query. Intent and slot names are usually task specific. This model recognizes weather-related intents like weather, temperature, rainfall, etc. and entities like place, time, unit of temperature, etc. For a comprehensive list, please check the corresponding model card. |
| Punctuation and Capitalization Bert | BERT | Tatoeba sentences, books from Project Gutenberg that were used as part of the LibriSpeech corpus, transcripts from Fisher English Training Speech | Add punctuation and capitalization to text. |
| Domain Classification English Bert | BERT | Proprietary | For domain classification of queries into the 4 supported domains: weather, meteorology, personality, and none. |