NVIDIA TAO Toolkit v30.2205

Overview

TAO Toolkit provides an extensive model zoo containing pretrained models for both computer vision and conversational AI use cases.

There are two types of pre-trained models that you can start with:

  • General-purpose vision models: The pre-trained weights for these models act as a starting point for building more complex models. For computer vision use cases, these weights are trained on the Open Images dataset, and they provide a much better starting point for training than a random initialization of weights.

  • Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can either be used directly for inference or can be used with TAO Toolkit for transfer learning on your own dataset.

[Figure: tao_toolkit_models_tree.png]

* New in TAO Toolkit 3.0-21.08 GA

You can choose from 100+ permutations of model architecture and backbone with the general-purpose vision models. For more information on fine-tuning models for conversational AI use cases, see the pretrained models section for Conversational AI.

Purpose-built models

Purpose-built models are built for high accuracy and performance. You can deploy these models out of the box for applications such as smart city, retail, public safety, and healthcare, or you can retrain them with your own data. All models are trained on thousands of proprietary images and achieve very high accuracy on NVIDIA test data. More information about each of these models is available in the individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail, hospitals, etc. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

| Model Name | Network Architecture | Number of classes | Accuracy | Use Case |
|---|---|---|---|---|
| TrafficCamNet | DetectNet_v2-ResNet18 | 4 | 84% mAP | Detect and track cars. |
| PeopleNet | DetectNet_v2-ResNet18/34 | 3 | 84% mAP | People counting, heatmap generation, social distancing. |
| DashCamNet | DetectNet_v2-ResNet18 | 4 | 80% mAP | Identify objects from a moving object. |
| FaceDetectIR | DetectNet_v2-ResNet18 | 1 | 96% mAP | Detect face in a dark environment with IR camera. |
| VehicleMakeNet | ResNet18 | 20 | 91% mAP | Classifying car models. |
| VehicleTypeNet | ResNet18 | 6 | 96% mAP | Classifying type of cars as coupe, sedan, truck, etc. |
| PeopleSegNet | MaskRCNN-ResNet50 | 1 | 85% mAP | Creates segmentation masks around people, provides pixel |
| PeopleSemSegNet | Vanilla Unet Dynamic | 2 | 92% mIOU | Creates semantic segmentation masks for people. |
| PeopleSemSegNet | Shuffle Unet | 2 | 87% mIOU | Creates semantic segmentation masks for people. |
| License Plate Detection | DetectNet_v2-ResNet18 | 1 | 98% mAP | Detecting and localizing License plates on vehicles |
| License Plate Recognition | Tuned ResNet18 | 36(US) / 68(CH) | 97%(US)/99%(CH) | Recognize License plates numbers |
| Gaze Estimation | Four branch AlexNet based model | NA | 6.5 RMSE | Detects person’s eye gaze |
| Facial Landmark | Recombinator networks | NA | 6.1 pixel error | Estimates key points on person’s face |
| Heart Rate Estimation | Two branch model with attention | NA | 0.7 BPM | Estimates person’s heartrate from RGB video |
| Gesture Recognition | ResNet18 | 6 | 0.85 F1 score | Recognize hand gestures |
| Emotion Recognition | 5 Fully Connected Layers | 6 | 0.91 F1 score | Recognize facial Emotion |
| FaceDetect | DetectNet_v2-ResNet18 | 1 | 85.3 mAP | Detect faces from RGB or grayscale image |
| BodyPoseNet | Single shot bottom-up | 18 | 56.1% mAP* | Estimates body key points for persons in the image |
| PoseClassificationNet | ST-Graph Convolutional Network | 6 | 89.53% | Classify poses of people from their skeletons |
| PointPillarNet | PointPillars | | 65.22 mAP | Detect objects from Lidar point cloud |

Note

The accuracy reported for BodyPoseNet is based on a model trained using the COCO dataset. To reproduce the same accuracy, use the sample notebook.
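The segmentation rows in the table above report accuracy as mIOU (mean intersection over union). The sketch below illustrates that metric under the standard definition: per-class IoU, averaged over the classes that occur. It is an illustration only; the function name `mean_iou` and the toy labels are assumptions, and this is not necessarily how the published numbers were computed.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union over classes, for flat label sequences."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both sequences
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (0 = background, 1 = person) over six pixels.
truth = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, truth, num_classes=2))  # 0.5
```

Each class here has an intersection of 2 pixels and a union of 4, so both per-class IoU values are 0.5.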


Performance Metrics

The performance of these pretrained models across various NVIDIA platforms is summarized in the table below. The numbers in the table are the inference performance measured using the trtexec tool in TensorRT samples.
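One way to read the throughput figures is to convert them into per-batch latency and into a rough count of camera streams one device could serve. The sketch below assumes that FPS means total images per second across the whole batch and that each camera delivers 30 frames per second. Both are assumptions, not statements from the benchmark, and the helper names are illustrative.

```python
import math


def batch_latency_ms(batch_size, fps):
    """Approximate time to process one batch, in milliseconds."""
    return 1000.0 * batch_size / fps


def max_streams(fps, stream_fps=30):
    """Whole number of streams that fit into the measured throughput."""
    return math.floor(fps / stream_fps)


# PeopleNet-ResNet18 row of the first table: GPU BS 8, GPU FPS 218.
print(round(batch_latency_ms(8, 218), 1))  # 36.7
print(max_streams(218))                    # 7
```

Real pipelines add decode, pre-processing, and tracking overhead, so treat the stream count as an upper bound.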


| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 8 | 218 | 8 | 128 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 8 | 169 | 8 | 94 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 8 | 79 | 8 | 46 |
| TrafficCamNet | 960x544x3 | INT8 | 8 | 251 | 8 | 174 |
| DashCamNet | 960x544x3 | INT8 | 16 | 251 | 32 | 172 |
| FaceDetect-IR | 384x240x3 | INT8 | 32 | 1407 | 32 | 974 |
| VehicleMakeNet | 224x224x3 | INT8 | 32 | 2434 | 32 | 1166 |
| VehicleTypeNet | 224x224x3 | INT8 | 32 | 1781 | 32 | 1064 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 16 | 395 | 16 | 268 |
| License Plate Detection | 640x480x3 | INT8 | 16 | 784 | 16 | 388 |
| License Plate Recognition | 96x48x3 | FP16 | 16 | 706 | – | – |
| Facial landmark | 80x80x1 | FP16 | 16 | 1105 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 32 | 812 | – | – |
| GestureNet | 160x160x3 | FP16 | 32 | 2585 | – | – |
| BodyPose | 288x384x3 | INT8 | 4 | 104 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 245 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 21 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 317 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 8 | 25 | – | – |
| Point Pillar | | FP16 | 1 | 25 | – | – |
| Pose classification | | FP16 | 8 | 87 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 117 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 147 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 199 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 15 | – | – |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 16 | 390 | 16 | 164 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 16 | 296 | 16 | 122 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 8 | 136 | 4 | 58 |
| TrafficCamNet | 960x544x3 | INT8 | 16 | 458 | 16 | 220 |
| DashCamNet | 960x544x3 | INT8 | 16 | 442 | 16 | 228 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 2575 | 64 | 1266 |
| VehicleMakeNet | 224x224x3 | INT8 | 64 | 4342 | 64 | 1508 |
| VehicleTypeNet | 224x224x3 | INT8 | 64 | 3281 | 64 | 1412 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 719 | 32 | 354 |
| License Plate Detection | 640x480x3 | INT8 | 32 | 1370 | 32 | 512 |
| License Plate Recognition | 96x48x3 | FP16 | 32 | 1190 | – | – |
| Facial landmark | 80x80x1 | FP16 | 32 | 2069 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 64 | 1387 | – | – |
| GestureNet | 160x160x3 | FP16 | 64 | 4429 | – | – |
| BodyPose | 288x384x3 | INT8 | 8 | 172 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 471 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 32 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 658 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 41 | – | – |
| Point Pillar | | FP16 | 1 | 40 | – | – |
| Pose classification | | FP16 | 8 | 150 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 188 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 235 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 356 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 25 | – | – |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 16 | 400 | 16 | 300 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 314 | 32 | 226 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 16 | 140 | 32 | 70 |
| TrafficCamNet | 960x544x3 | INT8 | 16 | 457 | 16 | 352 |
| DashCamNet | 960x544x3 | INT8 | 32 | 479 | 64 | 358 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 2588 | 64 | 1700 |
| VehicleMakeNet | 224x224x3 | INT8 | 64 | 4261 | 64 | 2218 |
| VehicleTypeNet | 224x224x3 | INT8 | 64 | 3391 | 64 | 2044 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 613 | 32 | 492 |
| License Plate Detection | 640x480x3 | INT8 | 32 | | 32 | |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 1498 | – | – |
| Facial landmark | 80x80x1 | FP16 | 32 | 1606 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 64 | 1241 | – | – |
| GestureNet | 160x160x3 | FP16 | 64 | 5420 | – | – |
| BodyPose | 288x384x3 | INT8 | 16 | 195 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 32 | 577 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 38 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 826 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 42 | – | – |
| Point Pillar | | FP16 | 1 | 38 | – | – |
| Pose classification | | FP16 | 8 | 105 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 241 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 295 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 289 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 27 | – | – |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 32 | 1116 | 32 | 528 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 890 | 32 | 404 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 16 | 421 | 32 | 104 |
| TrafficCamNet | 960x544x3 | INT8 | 32 | 1268 | 32 | 594 |
| DashCamNet | 960x544x3 | INT8 | 32 | 1308 | 64 | 587 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 7462 | 128 | 2720 |
| VehicleMakeNet | 224x224x3 | INT8 | 128 | 11872 | 128 | 3956 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 9815 | 128 | 3494 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 1700 | 64 | 870 |
| License Plate Detection | 640x480x3 | INT8 | 64 | | 64 | |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 4118 | – | – |
| Facial landmark | 80x80x1 | FP16 | 64 | | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 128 | 3226 | – | – |
| GestureNet | 160x160x3 | FP16 | 128 | 15133 | – | – |
| BodyPose | 288x384x3 | INT8 | 16 | 559 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 64 | 1577 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 8 | 105 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 1702 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 109 | – | – |
| Point Pillar | | FP16 | 1 | 90 | – | – |
| Pose classification | | FP16 | 16 | 262 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 597 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 711 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 32 | 703 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 75 | – | – |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 1379 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 1064 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 465 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 1725 |
| DashCamNet | 960x544x3 | INT8 | 64 | 1676 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 9810 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 16500 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 12500 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 2578 |
| License Plate Detection | 640x480x3 | INT8 | 128 | 6123 |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 3959 |
| Facial landmark | 80x80x1 | FP16 | 128 | 4622 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 4563 |
| GestureNet | 160x160x3 | FP16 | 512 | 15377 |
| BodyPose | 288x384x3 | INT8 | 32 | 598 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 1897 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 139 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 3320 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 192 |
| Point Pillar | | FP16 | 1 | 111 |
| Pose classification | | FP16 | 64 | 376.4 |
| 3D Pose - Accuracy | | FP16 | 32 | 614.98 |
| 3D Pose - Performance | | FP16 | 32 | 712.94 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 1027.85 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 79.08 |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 128 | 8500 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 64 | 6245 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 64 | 3291 |
| TrafficCamNet | 960x544x3 | INT8 | 256 | 9717 |
| DashCamNet | 960x544x3 | INT8 | 256 | 9500 |
| FaceDetect-IR | 384x240x3 | INT8 | 256 | 51600 |
| VehicleMakeNet | 224x224x3 | INT8 | 1024 | 88300 |
| VehicleTypeNet | 224x224x3 | INT8 | 512 | 72300 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 256 | 14900 |
| License Plate Detection | 640x480x3 | INT8 | 256 | 23200 |
| License Plate Recognition | 96x48x3 | FP16 | 256 | 27200 |
| Facial landmark | 80x80x1 | FP16 | 256 | 19600 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 1024 | 25394 |
| GestureNet | 160x160x3 | FP16 | 1024 | 94555 |
| BodyPose | 288x384x3 | INT8 | 16 | 3180 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 32 | 12600 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 16 | 797 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 64 | 17535 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 899 |
| Point Pillar | | FP16 | 1 | 425 |
| Pose classification | | FP16 | 64 | 2144.84 |
| 3D Pose - Accuracy | | FP16 | 32 | 3466.34 |
| 3D Pose - Performance | | FP16 | 32 | 4176.37 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 5745.79 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 496.34 |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 4228 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 3160 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 1603 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 5082 |
| DashCamNet | 960x544x3 | INT8 | 64 | 4900 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 27100 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 46200 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 37200 |
| PeopleSegNet | 960x576x3 | INT8 | 8 | 158529 |
| FaceDetect | 736x416x3 | INT8 | 64 | 7700 |
| LPD | 640x480x3 | INT8 | 128 | 12500 |
| LPR | 96x48x3 | FP16 | 128 | 12400 |
| Facial landmark | 80x80x1 | FP16 | 128 | 12400 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 12321 |
| GestureNet | 160x160x3 | FP16 | 512 | 47361 |
| BodyPose | 288x384x3 | INT8 | 32 | 1596 |
| AR 2D | 224x224x96 | FP16 | 16 | 6000 |
| AR 3D | 224x224x32x3 | FP16 | 4 | 380 |
| AR 2D OF | 224x224x96 | FP16 | 32 | 8940 |
| AR 3D OF | 224x224x32x3 | FP16 | 16 | 461 |
| Point Pillar | | FP16 | 1 | 271 |
| Pose classification | | FP16 | 64 | 1121.68 |
| 3D Pose - Accuracy | | FP16 | 32 | 1913.92 |
| 3D Pose - Performance | | FP16 | 32 | 2241.83 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 2862.76 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 253.77 |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 3819 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 2568 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 1007 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 4754 |
| DashCamNet | 960x544x3 | INT8 | 64 | 4600 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 26900 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 44800 |
| VehicleTypeNet | 224x224x3 | INT8 | 256 | 31500 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 6000 |
| License Plate Detection | 640x480x3 | INT8 | 256 | 13900 |
| License Plate Recognition | 96x48x3 | FP16 | 256 | 9000 |
| Facial landmark | 80x80x1 | FP16 | 512 | 9600 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 10718 |
| GestureNet | 160x160x3 | FP16 | 512 | 35371 |
| BodyPose | 288x384x3 | INT8 | 32 | 1334 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 4600 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 265 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 6500 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 284 |
| Point Pillar | | FP16 | 1 | 246 |
| Pose classification | | FP16 | 64 | 825.75 |
| 3D Pose - Accuracy | | FP16 | 32 | 1286.05 |
| 3D Pose - Performance | | FP16 | 32 | 1558.21 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 2429.62 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 180.04 |

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 32 | 749 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 581 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 231 |
| TrafficCamNet | 960x544x3 | INT8 | 32 | 916 |
| DashCamNet | 960x544x3 | INT8 | 32 | 865 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 4982 |
| VehicleMakeNet | 224x224x3 | INT8 | 128 | 8000 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 6302 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 1174 |
| License Plate Detection | 640x480x3 | INT8 | 128 | 2570 |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 2180 |
| Facial landmark | 80x80x1 | FP16 | 256 | 2800 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 256 | 2488 |
| GestureNet | 160x160x3 | FP16 | 256 | 7690 |
| BodyPose | 288x384x3 | INT8 | 16 | 278 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 8 | 1044 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 56 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 1419 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 2 | 58 |
| Point Pillar | | FP16 | 1 | 63 |
| Pose classification | | FP16 | 64 | 211.5 |
| 3D Pose - Accuracy | | FP16 | 32 | 370.13 |
| 3D Pose - Performance | | FP16 | 32 | 471.81 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 631.31 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 44.09 |

General purpose computer vision models

With general purpose models, you can train an image classification model, object detection model, or an instance segmentation model.

  • For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.

  • For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.

  • For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, light-weight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the table below and see the Open Model Architectures section.

[Figure: tao_matrix.png. Caption: TAO Toolkit 3.0-22.05]

Computer Vision Feature Summary

The table below summarizes the computer vision models and the features enabled.

Feature Summary

| CV Task | Model | New in 22-04 | Pruning | QAT | REST API | Channel-wise QAT | Class weighting | Visualization | BYOM | Multi-node | Multi-GPU | AMP | Early Stopping | Framework | Annotation Format | DLA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classification | ResNet10/18/34/50/101 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | VGG16/19 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | GoogleNet | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | MobileNet_v1/v2 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | SqueezeNet | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | DarkNet19/53 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | EfficientNet_B0-B7 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | CSPDarkNet19/53 | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Classification | CSPDarkNet-Tiny | No | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | ImageNet | Yes |
| Object Detection | YoloV3 | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | KITTI/COCO | Yes |
| Object Detection | YoloV4 | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | YoloV4 - Tiny | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | FasterRCNN | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | EfficientDet | No | Yes | No | Yes | No | No | No | No | Yes | Yes | Yes | No | tf1 | COCO | Yes |
| Object Detection | RetinaNet | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | DetectNet_v2 | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | No | tf1 | KITTI/COCO | Yes |
| Object Detection | SSD | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Object Detection | DSSD | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | KITTI/COCO | Yes |
| Multitask classification | All classification | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | Custom | Yes |
| Instance Segmentation | MaskRCNN | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No | tf1 | COCO | No |
| Semantic Segmentation | UNET | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | tf1 | CityScape - PNG | No |
| Character Recognition | LPRNet | No | No | No | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | tf1 | Custom - txt file | No |
| Key Points | 2D body pose | No | Yes | No, but PTQ | No | No | No | No | No | Yes | Yes | Yes | No | tf1 | COCO | No |
| Key Points | 2D body pose | No | Yes | No, but PTQ | No | No | No | No | No | Yes | Yes | Yes | No | tf1 | COCO | No |
| Point Cloud | PointPillars | Yes | Yes | No | No | No | No | No | No | Yes | Yes | Yes | No | pyt | KITTI | No |
| Action Recognition | 2D action recognition RGB | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 3D action recognition RGB | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 2D action recognition OF | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Action Recognition | 3D action recognition OF | No | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | Custom | No |
| Other | Pose action classification | Yes | No | No | No | No | No | No | No | No | Yes | Yes | No | pyt | COCO | No |
| Other | HeartRateNet | No | No | No | No | No | No | No | No | No | Yes | Yes | No | tf1 | NVIDIA Defined | No |
| Other | GazeNet | No | No | No | No | No | No | No | No | No | Yes | Yes | No | tf1 | NVIDIA Defined | No |
| Other | EmotionNet | No | No | No | No | No | No | Yes | No | No | No | Yes | No | tf1 | NVIDIA Defined | No |
| Other | GestureNet | No | No | No | No | No | No | No | No | Yes | Yes | Yes | No | tf1 | NVIDIA Defined | No |
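When picking a model for a deployment target, it can help to treat the feature summary as data and filter it. The sketch below is a minimal illustration that encodes a handful of rows from the table above, using only the Pruning, QAT, DLA, and Framework columns. The `ModelFeatures` class and the `select` helper are hypothetical names for illustration and are not part of TAO Toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFeatures:
    task: str
    model: str
    pruning: bool
    qat: bool
    dla: bool
    framework: str


# A subset of rows copied from the feature summary table.
MODELS = [
    ModelFeatures("Classification", "ResNet10/18/34/50/101", True, False, True, "tf1"),
    ModelFeatures("Object Detection", "YoloV4", True, True, True, "tf1"),
    ModelFeatures("Object Detection", "EfficientDet", True, False, True, "tf1"),
    ModelFeatures("Object Detection", "DetectNet_v2", True, True, True, "tf1"),
    ModelFeatures("Instance Segmentation", "MaskRCNN", True, False, False, "tf1"),
    ModelFeatures("Semantic Segmentation", "UNET", True, True, False, "tf1"),
    ModelFeatures("Point Cloud", "PointPillars", True, False, False, "pyt"),
]


def select(models, **required):
    """Return names of models whose fields match every required value."""
    return [
        m.model
        for m in models
        if all(getattr(m, key) == value for key, value in required.items())
    ]


# Models in this subset that support both QAT and DLA deployment.
print(select(MODELS, qat=True, dla=True))  # ['YoloV4', 'DetectNet_v2']
```

The same helper accepts any field, for example `select(MODELS, framework="pyt")` to list the PyTorch-based entries in the subset.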

Purpose Built Models for Conversational AI

| Model Name | Base Architecture | Dataset | Purpose |
|---|---|---|---|
| Speech to Text English Jasper | Jasper | ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours | Speech Transcription |
| Speech to Text English QuartzNet | QuartzNet | ASR Set 1.2 | Speech Transcription |
| Speech to Text English CitriNet | CitriNet | ASR Set 1.4 | Speech Transcription |
| Speech to Text English Conformer | Conformer | ASR Set 1.4 | Speech Transcription |
| Question Answering SQUAD2.0 Bert | BERT | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Question Answering SQUAD2.0 Bert - Large | BERT Large | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Question Answering SQUAD2.0 Megatron | Megatron | SQuAD 2.0 | Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles. |
| Named Entity Recognition Bert | BERT | GMB (Groningen Meaning Bank) | Identifying entities in a given text (Supported Categories: Geographical Entity, Organization, Person, Geopolitical Entity, Time Indicator, Natural Phenomenon/Event) |
| Joint Intent and Slot Classification Bert | BERT | Proprietary | Classifying an intent and detecting all relevant slots (entities) for this intent in a query. Intent and slot names are usually task specific. This model recognizes weather-related intents like weather, temperature, rainfall, etc. and entities like place, time, unit of temperature, etc. For a comprehensive list, check the corresponding model card. |
| Punctuation and Capitalization Bert | BERT | Tatoeba sentences, books from Project Gutenberg that were used as part of the LibriSpeech corpus, transcripts from Fisher English Training Speech | Add punctuation and capitalization to text. |
| Domain Classification English Bert | BERT | Proprietary | For domain classification of queries into the 4 supported domains: weather, meteorology, personality, and none. |

© Copyright 2022, NVIDIA. Last updated on Dec 13, 2022.