Overview

TAO Toolkit provides an extensive model zoo containing pretrained models for both computer vision and conversational AI use cases.

Computer Vision Model Zoo

There are two types of pre-trained models that you can start with:

General-purpose vision models: The pre-trained weights for these models act as a starting point for building more complex models. For computer vision use cases, these weights are trained on the Open Images dataset, and they provide a much better starting point for training than a random initialization of weights.
Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can either be used directly for inference or can be used with TAO Toolkit for transfer learning on your own dataset.

* New in TAO Toolkit 3.0-21.08 GA

You can choose from 100+ permutations of model architecture and backbone with the general-purpose vision models. For more information on fine-tuning models for conversational AI use cases, see the pretrained models section for Conversational AI.

Purpose-built models

Purpose-built models are built for high accuracy and performance. You can deploy these models out of the box for applications such as smart city, retail, public safety, and healthcare, or you can retrain them with your own data. All models are trained on thousands of proprietary images and achieve very high accuracy on NVIDIA test data. More information about each of these models is available in the individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail, hospitals, etc. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

Model Name	Network Architecture	Number of classes	Accuracy	Use Case
TrafficCamNet	DetectNet_v2-ResNet18	4	84% mAP	Detect and track cars.
PeopleNet	DetectNet_v2-ResNet18/34	3	84% mAP	People counting, heatmap generation, social distancing.
DashCamNet	DetectNet_v2-ResNet18	4	80% mAP	Identify objects from a moving object.
FaceDetectIR	DetectNet_v2-ResNet18	1	96% mAP	Detect face in a dark environment with IR camera.
VehicleMakeNet	ResNet18	20	91% mAP	Classifying car models.
VehicleTypeNet	ResNet18	6	96% mAP	Classifying type of cars as coupe, sedan, truck, etc.
PeopleSegNet	MaskRCNN-ResNet50	1	85% mAP	Creates segmentation masks around people, provides pixel
PeopleSemSegNet	Vanilla Unet Dynamic	2	92% mIOU	Creates semantic segmentation masks for people.
PeopleSemSegNet	Shuffle Unet	2	87% mIOU	Creates semantic segmentation masks for people.
License Plate Detection	DetectNet_v2-ResNet18	1	98% mAP	Detect and localize license plates on vehicles
License Plate Recognition	Tuned ResNet18	36(US) / 68(CH)	97%(US)/99%(CH)	Recognize license plate numbers
Gaze Estimation	Four branch AlexNet based model	NA	6.5 RMSE	Detects a person’s eye gaze
Facial Landmark	Recombinator networks	NA	6.1 pixel error	Estimates key points on a person’s face
Heart Rate Estimation	Two branch model with attention	NA	0.7 BPM	Estimates a person’s heart rate from RGB video
Gesture Recognition	ResNet18	6	0.85 F1 score	Recognize hand gestures
Emotion Recognition	5 Fully Connected Layers	6	0.91 F1 score	Recognize facial emotion
FaceDetect	DetectNet_v2-ResNet18	1	85.3 mAP	Detect faces from RGB or grayscale image
BodyPoseNet	Single shot bottom-up	18	56.1% mAP*	Estimates body key points for persons in the image
PoseClassificationNet	ST-Graph Convolutional Network	6	89.53%	Classify poses of people from their skeletons
PointPillarNet	PointPillars		65.22 mAP	Detect objects from Lidar point cloud

Note

The accuracy reported for BodyPoseNet is based on a model trained using the COCO dataset. To reproduce the same accuracy, use the sample notebook.
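
The accuracy column above mixes several metrics (mAP, mIOU, F1 score, RMSE). As a rough illustration of two of them, the sketch below computes mean intersection-over-union for per-pixel class labels and an F1 score from raw counts. The function names and toy data are hypothetical, and the exact evaluation protocol behind the numbers in the table is not described on this page.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], truth: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes for flattened per-pixel class labels."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy 2-class example (0 = background, 1 = person) over 8 pixels.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 0]

miou = mean_iou(pred, truth, num_classes=2)  # each class has IoU 3/5
f1 = f1_score(tp=3, fp=1, fn=1)              # precision = recall = 0.75
print(f"mIoU = {miou:.2f}, F1 = {f1:.2f}")
```

On this toy input both classes overlap on 3 of 5 pixels in their union, so the mean IoU is 0.60, and the F1 score is 0.75.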

Performance Metrics

The performance of these pretrained models across various NVIDIA platforms is summarized in the table below. The numbers in the table are the inference performance measured using the trtexec tool in TensorRT samples.
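
For orientation, a benchmark run with trtexec might be assembled as in the sketch below. This is only an assumption about how such a measurement could look: the page does not give the exact command line used, the engine file name is hypothetical, and `--batch` applies to implicit-batch engines in older TensorRT releases (newer releases select input shapes with `--shapes` instead). The helper only builds the command; it does not run it.

```python
import shlex
from typing import List, Optional


def trtexec_command(engine_path: str, batch_size: int,
                    dla_core: Optional[int] = None) -> List[str]:
    """Build (but do not run) a trtexec command for a throughput test.

    Precision (INT8/FP16) is fixed when the engine is built, so it is
    not passed again when loading a serialized engine.
    """
    cmd = ["trtexec", f"--loadEngine={engine_path}", f"--batch={batch_size}"]
    if dla_core is not None:
        # Run the engine on a specific DLA core instead of the GPU.
        cmd.append(f"--useDLACore={dla_core}")
    return cmd


# Hypothetical engine file; batch size taken from the first table row.
gpu_cmd = trtexec_command("peoplenet_resnet18_int8.engine", batch_size=8)
dla_cmd = trtexec_command("peoplenet_resnet18_int8_dla.engine", batch_size=8, dla_core=0)

command_line = shlex.join(gpu_cmd)
dla_command_line = shlex.join(dla_cmd)
print(command_line)
print(dla_command_line)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.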

Model arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet-ResNet18	960x544x3	INT8	8	218	8	128
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	8	169	8	94
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	8	79	8	46
TrafficCamNet	960x544x3	INT8	8	251	8	174
DashCamNet	960x544x3	INT8	16	251	32	172
FaceDetect-IR	384x240x3	INT8	32	1407	32	974
VehicleMakeNet	224x224x3	INT8	32	2434	32	1166
VehicleTypeNet	224x224x3	INT8	32	1781	32	1064
FaceDetect (pruned)	736x416x3	INT8	16	395	16	268
License Plate Detection	640x480x3	INT8	16	784	16	388
License Plate Recognition	96x48x3	FP16	16	706	–	–
Facial landmark	80x80x1	FP16	16	1105	–	–
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	32	812	–	–
GestureNet	160x160x3	FP16	32	2585	–	–
BodyPose	288x384x3	INT8	4	104	–	–
Action Recognition 2D RGB	224x224x96	FP16	16	245	–	–
Action Recognition 3D RGB	224x224x32x3	FP16	4	21	–	–
Action Recognition 2D OF	224x224x96	FP16	16	317	–	–
Action Recognition 3D OF	224x224x32x3	FP16	8	25	–	–
Point Pillar		FP16	1	25	–	–
Pose classification		FP16	8	87	–	–
3D Pose - Accuracy		FP16	16	117	–	–
3D Pose - Performance		FP16	16	147	–	–
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	16	199	–	–
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	4	15	–	–

Model arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet-ResNet18	960x544x3	INT8	16	390	16	164
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	16	296	16	122
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	8	136	4	58
TrafficCamNet	960x544x3	INT8	16	458	16	220
DashCamNet	960x544x3	INT8	16	442	16	228
FaceDetect-IR	384x240x3	INT8	64	2575	64	1266
VehicleMakeNet	224x224x3	INT8	64	4342	64	1508
VehicleTypeNet	224x224x3	INT8	64	3281	64	1412
FaceDetect (pruned)	736x416x3	INT8	32	719	32	354
License Plate Detection	640x480x3	INT8	32	1370	32	512
License Plate Recognition	96x48x3	FP16	32	1190	–	–
Facial landmark	80x80x1	FP16	32	2069	–	–
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	64	1387	–	–
GestureNet	160x160x3	FP16	64	4429	–	–
BodyPose	288x384x3	INT8	8	172	–	–
Action Recognition 2D RGB	224x224x96	FP16	16	471	–	–
Action Recognition 3D RGB	224x224x32x3	FP16	4	32	–	–
Action Recognition 2D OF	224x224x96	FP16	16	658	–	–
Action Recognition 3D OF	224x224x32x3	FP16	4	41	–	–
Point Pillar		FP16	1	40	–	–
Pose classification		FP16	8	150	–	–
3D Pose - Accuracy		FP16	16	188	–	–
3D Pose - Performance		FP16	16	235	–	–
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	16	356	–	–
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	4	25	–	–

Model arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet-ResNet18	960x544x3	INT8	16	400	16	300
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	314	32	226
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	16	140	32	70
TrafficCamNet	960x544x3	INT8	16	457	16	352
DashCamNet	960x544x3	INT8	32	479	64	358
FaceDetect-IR	384x240x3	INT8	64	2588	64	1700
VehicleMakeNet	224x224x3	INT8	64	4261	64	2218
VehicleTypeNet	224x224x3	INT8	64	3391	64	2044
FaceDetect (pruned)	736x416x3	INT8	32	613	32	492
License Plate Detection	640x480x3	INT8	32		32
License Plate Recognition	96x48x3	FP16	128	1498	–	–
Facial landmark	80x80x1	FP16	32	1606	–	–
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	64	1241	–	–
GestureNet	160x160x3	FP16	64	5420	–	–
BodyPose	288x384x3	INT8	16	195	–	–
Action Recognition 2D RGB	224x224x96	FP16	32	577	–	–
Action Recognition 3D RGB	224x224x32x3	FP16	4	38	–	–
Action Recognition 2D OF	224x224x96	FP16	16	826	–	–
Action Recognition 3D OF	224x224x32x3	FP16	4	42	–	–
Point Pillar		FP16	1	38	–	–
Pose classification		FP16	8	105	–	–
3D Pose - Accuracy		FP16	16	241	–	–
3D Pose - Performance		FP16	16	295	–	–
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	16	289	–	–
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	4	27	–	–

Model arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet-ResNet18	960x544x3	INT8	32	1116	32	528
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	890	32	404
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	16	421	32	104
TrafficCamNet	960x544x3	INT8	32	1268	32	594
DashCamNet	960x544x3	INT8	32	1308	64	587
FaceDetect-IR	384x240x3	INT8	128	7462	128	2720
VehicleMakeNet	224x224x3	INT8	128	11872	128	3956
VehicleTypeNet	224x224x3	INT8	128	9815	128	3494
FaceDetect (pruned)	736x416x3	INT8	64	1700	64	870
License Plate Detection	640x480x3	INT8	64		64
License Plate Recognition	96x48x3	FP16	128	4118	–	–
Facial landmark	80x80x1	FP16	64		–	–
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	128	3226	–	–
GestureNet	160x160x3	FP16	128	15133	–	–
BodyPose	288x384x3	INT8	16	559	–	–
Action Recognition 2D RGB	224x224x96	FP16	64	1577	–	–
Action Recognition 3D RGB	224x224x32x3	FP16	8	105	–	–
Action Recognition 2D OF	224x224x96	FP16	32	1702	–	–
Action Recognition 3D OF	224x224x32x3	FP16	4	109	–	–
Point Pillar		FP16	1	90	–	–
Pose classification		FP16	16	262	–	–
3D Pose - Accuracy		FP16	16	597	–	–
3D Pose - Performance		FP16	16	711	–	–
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	32	703	–	–
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	4	75	–	–

Model arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet-ResNet18	960x544x3	INT8	64	1379
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	1064
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	32	465
TrafficCamNet	960x544x3	INT8	64	1725
DashCamNet	960x544x3	INT8	64	1676
FaceDetect-IR	384x240x3	INT8	128	9810
VehicleMakeNet	224x224x3	INT8	256	16500
VehicleTypeNet	224x224x3	INT8	128	12500
FaceDetect (pruned)	736x416x3	INT8	64	2578
License Plate Detection	640x480x3	INT8	128	6123
License Plate Recognition	96x48x3	FP16	128	3959
Facial landmark	80x80x1	FP16	128	4622
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	4563
GestureNet	160x160x3	FP16	512	15377
BodyPose	288x384x3	INT8	32	598
Action Recognition 2D RGB	224x224x96	FP16	16	1897
Action Recognition 3D RGB	224x224x32x3	FP16	4	139
Action Recognition 2D OF	224x224x96	FP16	32	3320
Action Recognition 3D OF	224x224x32x3	FP16	16	192
Point Pillar		FP16	1	111
Pose classification		FP16	64	376.4
3D Pose - Accuracy		FP16	32	614.98
3D Pose - Performance		FP16	32	712.94
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	64	1027.85
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	16	79.08

Model arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet-ResNet18	960x544x3	INT8	128	8500
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	64	6245
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	64	3291
TrafficCamNet	960x544x3	INT8	256	9717
DashCamNet	960x544x3	INT8	256	9500
FaceDetect-IR	384x240x3	INT8	256	51600
VehicleMakeNet	224x224x3	INT8	1024	88300
VehicleTypeNet	224x224x3	INT8	512	72300
FaceDetect (pruned)	736x416x3	INT8	256	14900
License Plate Detection	640x480x3	INT8	256	23200
License Plate Recognition	96x48x3	FP16	256	27200
Facial landmark	80x80x1	FP16	256	19600
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	1024	25394
GestureNet	160x160x3	FP16	1024	94555
BodyPose	288x384x3	INT8	16	3180
Action Recognition 2D RGB	224x224x96	FP16	32	12600
Action Recognition 3D RGB	224x224x32x3	FP16	16	797
Action Recognition 2D OF	224x224x96	FP16	64	17535
Action Recognition 3D OF	224x224x32x3	FP16	16	899
Point Pillar		FP16	1	425
Pose classification		FP16	64	2144.84
3D Pose - Accuracy		FP16	32	3466.34
3D Pose - Performance		FP16	32	4176.37
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	64	5745.79
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	16	496.34

Model arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet-ResNet18	960x544x3	INT8	64	4228
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	3160
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	32	1603
TrafficCamNet	960x544x3	INT8	64	5082
DashCamNet	960x544x3	INT8	64	4900
FaceDetect-IR	384x240x3	INT8	128	27100
VehicleMakeNet	224x224x3	INT8	256	46200
VehicleTypeNet	224x224x3	INT8	128	37200
PeopleSegNet	960x576x3	INT8	8	158529
FaceDetect	736x416x3	INT8	64	7700
LPD	640x480x3	INT8	128	12500
LPR	96x48x3	FP16	128	12400
Facial landmark	80x80x1	FP16	128	12400
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	12321
GestureNet	160x160x3	FP16	512	47361
BodyPose	288x384x3	INT8	32	1596
AR 2D	224x224x96	FP16	16	6000
AR 3D	224x224x32x3	FP16	4	380
AR 2D OF	224x224x96	FP16	32	8940
AR 3D OF	224x224x32x3	FP16	16	461
Point Pillar		FP16	1	271
Pose classification		FP16	64	1121.68
3D Pose - Accuracy		FP16	32	1913.92
3D Pose - Performance		FP16	32	2241.83
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	64	2862.76
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	16	253.77

Model arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet-ResNet18	960x544x3	INT8	64	3819
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	2568
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	32	1007
TrafficCamNet	960x544x3	INT8	64	4754
DashCamNet	960x544x3	INT8	64	4600
FaceDetect-IR	384x240x3	INT8	128	26900
VehicleMakeNet	224x224x3	INT8	256	44800
VehicleTypeNet	224x224x3	INT8	256	31500
FaceDetect (pruned)	736x416x3	INT8	64	6000
License Plate Detection	640x480x3	INT8	256	13900
License Plate Recognition	96x48x3	FP16	256	9000
Facial landmark	80x80x1	FP16	512	9600
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	10718
GestureNet	160x160x3	FP16	512	35371
BodyPose	288x384x3	INT8	32	1334
Action Recognition 2D RGB	224x224x96	FP16	16	4600
Action Recognition 3D RGB	224x224x32x3	FP16	4	265
Action Recognition 2D OF	224x224x96	FP16	32	6500
Action Recognition 3D OF	224x224x32x3	FP16	16	284
Point Pillar		FP16	1	246
Pose classification		FP16	64	825.75
3D Pose - Accuracy		FP16	32	1286.05
3D Pose - Performance		FP16	32	1558.21
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	64	2429.62
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	16	180.04

Model arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet-ResNet18	960x544x3	INT8	32	749
PeopleNet-ResNet34 (v2.3)	960x544x3	INT8	32	581
PeopleNet-ResNet34 (v2.5 unpruned)	960x544x3	INT8	32	231
TrafficCamNet	960x544x3	INT8	32	916
DashCamNet	960x544x3	INT8	32	865
FaceDetect-IR	384x240x3	INT8	64	4982
VehicleMakeNet	224x224x3	INT8	128	8000
VehicleTypeNet	224x224x3	INT8	128	6302
FaceDetect (pruned)	736x416x3	INT8	32	1174
License Plate Detection	640x480x3	INT8	128	2570
License Plate Recognition	96x48x3	FP16	128	2180
Facial landmark	80x80x1	FP16	256	2800
GazeNet	224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	256	2488
GestureNet	160x160x3	FP16	256	7690
BodyPose	288x384x3	INT8	16	278
Action Recognition 2D RGB	224x224x96	FP16	8	1044
Action Recognition 3D RGB	224x224x32x3	FP16	4	56
Action Recognition 2D OF	224x224x96	FP16	16	1419
Action Recognition 3D OF	224x224x32x3	FP16	2	58
Point Pillar		FP16	1	63
Pose classification		FP16	64	211.5
3D Pose - Accuracy		FP16	32	370.13
3D Pose - Performance		FP16	32	471.81
PeopleSemSegNet_v2 - Shuffle	960x544x3	FP16	16	631.31
PeopleSemSegNet_v2 - Vanilla	960x544x3	FP16	16	44.09
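
As a rough way to read the throughput tables above, the sketch below converts a batch size and FPS figure into an average per-batch latency and an estimate of how many camera streams one device could serve. The 30 FPS per-stream rate and the helper names are assumptions for illustration. Because trtexec benchmarks the inference engine on its own, a full pipeline with decoding and post-processing will likely sustain fewer streams.

```python
import math


def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Average time to process one batch at steady-state throughput."""
    return 1000.0 * batch_size / fps


def max_streams(fps: float, stream_fps: float = 30.0) -> int:
    """Whole number of streams that fit in the measured throughput."""
    return math.floor(fps / stream_fps)


# First row of the first table: PeopleNet-ResNet18, INT8, GPU BS 8, 218 FPS.
latency = batch_latency_ms(batch_size=8, fps=218)
streams = max_streams(fps=218)

print(f"{latency:.1f} ms per batch, about {streams} streams at 30 FPS")
```

For that row this works out to about 36.7 ms per batch of 8 frames and 7 streams at 30 FPS.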

General purpose computer vision models

With general purpose models, you can train an image classification model, object detection model, or an instance segmentation model.

For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.
For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.
For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, light-weight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the table below and see the Open Model Architectures section.

TAO Toolkit 3.0-22.05

Computer Vision Feature Summary

The table below summarizes the computer vision models and the features enabled.

*Feature Summary*
CV Task	Model	New in 22-04	Pruning	QAT	REST API	Channel-wise QAT	Class weighting	Visualization	BYOM	Multi-node	Multi-GPU	AMP	Early Stopping	Framework	Annotation Format	DLA
Classification	ResNet10/18/34/50/101	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	VGG16/19	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	GoogleNet	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	MobileNet_v1/v2	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	SqueezeNet	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	DarkNet19/53	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	EfficientNet_B0-B7	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	CSPDarkNet19/53	No	yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Classification	CSPDarkNet-Tiny	No	Yes	No	yes	no	no	yes	yes	yes	yes	yes	No	tf1	ImageNet	yes
Object Detection	YoloV3	No	yes	yes	yes	no	no	yes	No	yes	yes	yes	No	tf1	KITTI/COCO	yes
Object Detection	YoloV4	No	yes	yes	yes	no	yes	yes	No	yes	yes	yes	Yes	tf1	KITTI/COCO	yes
Object Detection	YoloV4 - Tiny	No	yes	yes	yes	no	yes	yes	No	yes	yes	yes	Yes	tf1	KITTI/COCO	yes
Object Detection	FasterRCNN	No	yes	yes	yes	no	no	yes	No	yes	yes	yes	yes	tf1	KITTI/COCO	yes
Object Detection	EfficientDet	No	yes	no	yes	no	no	no	No	yes	yes	yes	no	tf1	COCO	yes
Object Detection	RetinaNet	No	yes	yes	yes	no	yes	yes	No	yes	yes	yes	yes	tf1	KITTI/COCO	yes
Object Detection	DetectNet_v2	No	yes	yes	yes	no	yes	yes	No	yes	yes	yes	no	tf1	KITTI/COCO	yes
Object Detection	SSD	No	yes	yes	yes	no	no	yes	No	yes	yes	yes	yes	tf1	KITTI/COCO	yes
Object Detection	DSSD	No	yes	yes	yes	no	no	yes	No	yes	yes	yes	yes	tf1	KITTI/COCO	yes
Multitask classification	All classification	No	yes	no	yes	no	no	yes	No	yes	yes	yes	no	tf1	Custom	yes
Instance Segmentation	MaskRCNN	No	yes	no	yes	no	no	yes	No	yes	yes	yes	no	tf1	COCO	no
Semantic Segmentation	UNET	No	yes	yes	yes	no	no	yes	yes	yes	yes	yes	no	tf1	CityScape - PNG	no
Character Recognition	LPRNet	No	no	no	yes	no	no	yes	no	yes	yes	yes	yes	tf1	Custom - txt file	no
Key Points	2D body pose	No	yes	no, but PTQ	no	no	no	no	no	yes	yes	yes	no	tf1	COCO	no
Point Cloud	PointPillars	Yes	Yes	no	no	no	no	no	no	yes	yes	yes	no	pyt	KITTI	no
Action Recognition	2D action recognition RGB	No	no	no	no	no	no	no	no	no	yes	yes	no	pyt	Custom	no
Action Recognition	3D action recognition RGB	No	no	no	no	no	no	no	no	no	yes	yes	no	pyt	Custom	no
Action Recognition	2D action recognition OF	No	no	no	no	no	no	no	no	no	yes	yes	no	pyt	Custom	no
Action Recognition	3D action recognition OF	No	no	no	no	no	no	no	no	no	yes	yes	no	pyt	Custom	no
Other	Pose action classification	Yes	no	no	no	no	no	no	no	no	yes	yes	no	pyt	COCO	no
Other	HeartRateNet	No	no	no	no	no	no	no	no	no	yes	yes	no	tf1	NVIDIA Defined	no
Other	GazeNet	No	no	no	no	no	no	no	no	no	yes	yes	no	tf1	NVIDIA Defined	no
Other	EmotionNet	No	no	no	no	no	no	yes	no	no	no	yes	no	tf1	NVIDIA Defined	no
Other	GestureNet	No	no	no	no	no	no	no	no	yes	yes	yes	no	tf1	NVIDIA Defined	no
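
When choosing a model, it can help to filter the feature table programmatically. The sketch below copies five rows and a subset of columns from the table above into a small in-memory structure and selects models by feature. The variable and function names are hypothetical, and the comparison is case-insensitive because the table mixes "yes" and "Yes".

```python
from typing import Dict, List

COLUMNS = ("Model", "Pruning", "QAT", "Framework", "DLA")

# Values copied from the feature summary table (subset of rows and columns).
ROWS = [
    ("YoloV4", "yes", "yes", "tf1", "yes"),
    ("EfficientDet", "yes", "no", "tf1", "yes"),
    ("MaskRCNN", "yes", "no", "tf1", "no"),
    ("UNET", "yes", "yes", "tf1", "no"),
    ("PointPillars", "Yes", "no", "pyt", "no"),
]

records: List[Dict[str, str]] = [dict(zip(COLUMNS, row)) for row in ROWS]


def supports(record: Dict[str, str], *features: str) -> bool:
    """True if every named feature column reads 'yes' (any letter case)."""
    return all(record[f].strip().lower() == "yes" for f in features)


# Models that can be trained with QAT and deployed on DLA.
qat_and_dla = [r["Model"] for r in records if supports(r, "QAT", "DLA")]

# Models that support pruning but not QAT.
prune_only = [r["Model"] for r in records
              if supports(r, "Pruning") and not supports(r, "QAT")]

print(qat_and_dla)
print(prune_only)
```

Among these five rows, only YoloV4 supports both QAT and DLA, while EfficientDet, MaskRCNN, and PointPillars support pruning without QAT.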

Conversational AI

*Purpose Built Models for Conversational AI*
Model Name	Base Architecture	Dataset	Purpose
Speech to Text English Jasper	Jasper	ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours	Speech Transcription
Speech to Text English QuartzNet	QuartzNet	ASR Set 1.2	Speech Transcription
Speech to Text English CitriNet	CitriNet	ASR Set 1.4	Speech Transcription
Speech to Text English Conformer	Conformer	ASR Set 1.4	Speech Transcription
Question Answering SQUAD2.0 Bert	BERT	SQuAD 2.0	Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles.
Question Answering SQUAD2.0 Bert - Large	BERT Large	SQuAD 2.0	Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles.
Question Answering SQUAD2.0 Megatron	Megatron	SQuAD 2.0	Answering questions in SQuADv2.0, a reading comprehension dataset consisting of Wikipedia articles.
Named Entity Recognition Bert	BERT	GMB (Groningen Meaning Bank)	Identifying entities in a given text (Supported Categories: Geographical Entity, Organization, Person, Geopolitical Entity, Time Indicator, Natural Phenomenon/Event)
Joint Intent and Slot Classification Bert	BERT	Proprietary	Classifying an intent and detecting all relevant slots (entities) for this intent in a query. Intent and slot names are usually task specific. This model recognizes weather-related intents such as weather, temperature, and rainfall, and entities such as place, time, and unit of temperature. For a comprehensive list, please check the corresponding model card.
Punctuation and Capitalization Bert	BERT	Tatoeba sentences, Books from the Project Gutenberg that were used as part of the LibriSpeech corpus, Transcripts from Fisher English Training Speech	Add punctuation and capitalization to text.
Domain Classification English Bert	BERT	Proprietary	For domain classification of queries into the 4 supported domains: weather, meteorology, personality, and none.