Date: 2023-08-02

There are two types of pre-trained models that you can start with:

General-purpose vision models: The pre-trained weights for these models serve as a starting point for building more complex models. For computer vision use cases, these weights are trained on the Open Images dataset, and they provide a much better starting point for training than a random initialization of weights.

Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can either be used directly for inference or can be used with TAO Toolkit for transfer learning on your own dataset.

* New in TAO Toolkit 3.0-21.08 GA

You can choose from 100+ permutations of model architecture and backbone with the general-purpose vision models. For more information on fine-tuning models for conversational AI use cases, see the pretrained models section for Conversational AI.

Purpose-built models are built for high accuracy and performance. You can deploy these models out of the box for applications such as smart city, retail, public safety, and healthcare, or you can retrain them with your own data. All models are trained on thousands of proprietary images and achieve very high accuracy on NVIDIA test data. More information about each of these models is available in the individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail stores, hospitals, and similar settings. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

Note: The accuracy reported for BodyPoseNet is based on a model trained using the COCO dataset. To reproduce the same accuracy, use the sample notebook.





The performance of these pretrained models across various NVIDIA platforms is summarized in the tables below. The numbers in the tables are the inference performance measured using the trtexec tool from the TensorRT samples.
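As a rough aid for reading the BS (batch size) and FPS columns, the sketch below shows how a frames-per-second figure relates to batch size and per-batch latency. It assumes that batches run back to back with no overlap and that throughput is counted in images per second; the function names and the 40 ms latency are hypothetical and are not taken from trtexec or from the tables.

```python
def batches_per_second_from_latency(mean_latency_ms: float) -> float:
    """Batches completed per second if batches run back to back (assumption)."""
    if mean_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / mean_latency_ms


def frames_per_second(batch_size: int, batches_per_second: float) -> float:
    """Images processed per second when each inference call handles one batch."""
    return batch_size * batches_per_second


# Hypothetical measurement: batch size 8 and 40 ms mean latency per batch.
rate = batches_per_second_from_latency(40.0)
print(f"{frames_per_second(8, rate):.1f} FPS")  # prints "200.0 FPS"
```

Larger batches usually raise FPS at the cost of higher latency per batch, which is one reason the batch size is listed next to each FPS value.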

Platforms (one table each, in this order): Xavier NX, AGX Xavier, Orin NX, Orin, T4, A100, A30, A10, A2



Xavier NX

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 8 | 218 | 8 | 128 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 8 | 169 | 8 | 94 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 8 | 79 | 8 | 46 |
| TrafficCamNet | 960x544x3 | INT8 | 8 | 251 | 8 | 174 |
| DashCamNet | 960x544x3 | INT8 | 16 | 251 | 32 | 172 |
| FaceDetect-IR | 384x240x3 | INT8 | 32 | 1407 | 32 | 974 |
| VehicleMakeNet | 224x224x3 | INT8 | 32 | 2434 | 32 | 1166 |
| VehicleTypeNet | 224x224x3 | INT8 | 32 | 1781 | 32 | 1064 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 16 | 395 | 16 | 268 |
| License Plate Detection | 640x480x3 | INT8 | 16 | 784 | 16 | 388 |
| License Plate Recognition | 96x48x3 | FP16 | 16 | 706 | – | – |
| Facial landmark | 80x80x1 | FP16 | 16 | 1105 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 32 | 812 | – | – |
| GestureNet | 160x160x3 | FP16 | 32 | 2585 | – | – |
| BodyPose | 288x384x3 | INT8 | 4 | 104 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 245 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 21 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 317 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 8 | 25 | – | – |
| Point Pillar | | FP16 | 1 | 25 | – | – |
| Pose classification | | FP16 | 8 | 87 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 117 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 147 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 199 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 15 | – | – |
| PeopleNet Transformer | 960x544x3 | FP16 | 2 | 9 | – | – |
| ReidentificationNet | 3x256x18 | FP16 | 32 | 502 | – | – |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 8 | 29 | – | – |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 8 | 27 | – | – |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 259 | – | – |
| CitySemSegFormer | | FP16 | 1 | 0.4 | – | – |

AGX Xavier

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 16 | 390 | 16 | 164 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 16 | 296 | 16 | 122 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 8 | 136 | 4 | 58 |
| TrafficCamNet | 960x544x3 | INT8 | 16 | 458 | 16 | 220 |
| DashCamNet | 960x544x3 | INT8 | 16 | 442 | 16 | 228 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 2575 | 64 | 1266 |
| VehicleMakeNet | 224x224x3 | INT8 | 64 | 4342 | 64 | 1508 |
| VehicleTypeNet | 224x224x3 | INT8 | 64 | 3281 | 64 | 1412 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 719 | 32 | 354 |
| License Plate Detection | 640x480x3 | INT8 | 32 | 1370 | 32 | 512 |
| License Plate Recognition | 96x48x3 | FP16 | 32 | 1190 | – | – |
| Facial landmark | 80x80x1 | FP16 | 32 | 2069 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 64 | 1387 | – | – |
| GestureNet | 160x160x3 | FP16 | 64 | 4429 | – | – |
| BodyPose | 288x384x3 | INT8 | 8 | 172 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 471 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 32 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 658 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 41 | – | – |
| Point Pillar | | FP16 | 1 | 40 | – | – |
| Pose classification | | FP16 | 8 | 150 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 188 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 235 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 356 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 25 | – | – |
| PeopleNet Transformer | 960x544x3 | FP16 | 2 | 15 | – | – |
| ReidentificationNet | 3x256x18 | FP16 | 64 | 854 | – | – |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 16 | 48 | – | – |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 16 | 45 | – | – |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 423 | – | – |
| CitySemSegFormer | | FP16 | 1 | 0.7 | – | – |

Orin NX

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 16 | 400 | 16 | 300 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 314 | 32 | 226 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 16 | 140 | 32 | 70 |
| TrafficCamNet | 960x544x3 | INT8 | 16 | 457 | 16 | 352 |
| DashCamNet | 960x544x3 | INT8 | 32 | 479 | 64 | 358 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 2588 | 64 | 1700 |
| VehicleMakeNet | 224x224x3 | INT8 | 64 | 4261 | 64 | 2218 |
| VehicleTypeNet | 224x224x3 | INT8 | 64 | 3391 | 64 | 2044 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 613 | 32 | 492 |
| License Plate Detection | 640x480x3 | INT8 | 32 | | 32 | |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 1498 | – | – |
| Facial landmark | 80x80x1 | FP16 | 32 | 1606 | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 64 | 1241 | – | – |
| GestureNet | 160x160x3 | FP16 | 64 | 5420 | – | – |
| BodyPose | 288x384x3 | INT8 | 16 | 195 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 32 | 577 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 38 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 826 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 42 | – | – |
| Point Pillar | | FP16 | 1 | 38 | – | – |
| Pose classification | | FP16 | 8 | 105 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 241 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 295 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 289 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 27 | – | – |
| PeopleNet Transformer | 960x544x3 | FP16 | 2 | 13 | – | – |
| ReidentificationNet | 3x256x18 | FP16 | 64 | 1049 | – | – |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 16 | 48 | – | – |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 16 | 45 | – | – |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 544 | – | – |
| CitySemSegFormer | | FP16 | 1 | 0.6 | – | – |

Orin

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 32 | 1116 | 32 | 528 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 890 | 32 | 404 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 16 | 421 | 32 | 104 |
| TrafficCamNet | 960x544x3 | INT8 | 32 | 1268 | 32 | 594 |
| DashCamNet | 960x544x3 | INT8 | 32 | 1308 | 64 | 587 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 7462 | 128 | 2720 |
| VehicleMakeNet | 224x224x3 | INT8 | 128 | 11872 | 128 | 3956 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 9815 | 128 | 3494 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 1700 | 64 | 870 |
| License Plate Detection | 640x480x3 | INT8 | 64 | | 64 | |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 4118 | – | – |
| Facial landmark | 80x80x1 | FP16 | 64 | | – | – |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 128 | 3226 | – | – |
| GestureNet | 160x160x3 | FP16 | 128 | 15133 | – | – |
| BodyPose | 288x384x3 | INT8 | 16 | 559 | – | – |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 64 | 1577 | – | – |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 8 | 105 | – | – |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 1702 | – | – |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 4 | 109 | – | – |
| Point Pillar | | FP16 | 1 | 90 | – | – |
| Pose classification | | FP16 | 16 | 262 | – | – |
| 3D Pose - Accuracy | | FP16 | 16 | 597 | – | – |
| 3D Pose - Performance | | FP16 | 16 | 711 | – | – |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 32 | 703 | – | – |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 4 | 75 | – | – |
| PeopleNet Transformer | 960x544x3 | FP16 | 2 | 33 | – | – |
| ReidentificationNet | 3x256x18 | FP16 | 64 | 2772 | – | – |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 16 | 121 | – | – |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 16 | 114 | – | – |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 1355 | – | – |
| CitySemSegFormer | | FP16 | 1 | 1.5 | – | – |

T4

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 1379 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 1064 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 465 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 1725 |
| DashCamNet | 960x544x3 | INT8 | 64 | 1676 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 9810 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 16500 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 12500 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 2578 |
| License Plate Detection | 640x480x3 | INT8 | 128 | 6123 |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 3959 |
| Facial landmark | 80x80x1 | FP16 | 128 | 4622 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 4563 |
| GestureNet | 160x160x3 | FP16 | 512 | 15377 |
| BodyPose | 288x384x3 | INT8 | 32 | 598 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 1897 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 139 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 3320 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 192 |
| Point Pillar | | FP16 | 1 | 111 |
| Pose classification | | FP16 | 64 | 376.4 |
| 3D Pose - Accuracy | | FP16 | 32 | 614.98 |
| 3D Pose - Performance | | FP16 | 32 | 712.94 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 1027.85 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 79.08 |
| PeopleNet Transformer | 960x544x3 | FP16 | 4 | 48 |
| ReidentificationNet | 3x256x18 | FP16 | 64 | 2586 |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 32 | 162 |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 32 | 151 |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 1270 |
| CitySemSegFormer | | FP16 | 1 | 2 |

A100

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 128 | 8500 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 64 | 6245 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 64 | 3291 |
| TrafficCamNet | 960x544x3 | INT8 | 256 | 9717 |
| DashCamNet | 960x544x3 | INT8 | 256 | 9500 |
| FaceDetect-IR | 384x240x3 | INT8 | 256 | 51600 |
| VehicleMakeNet | 224x224x3 | INT8 | 1024 | 88300 |
| VehicleTypeNet | 224x224x3 | INT8 | 512 | 72300 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 256 | 14900 |
| License Plate Detection | 640x480x3 | INT8 | 256 | 23200 |
| License Plate Recognition | 96x48x3 | FP16 | 256 | 27200 |
| Facial landmark | 80x80x1 | FP16 | 256 | 19600 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 1024 | 25394 |
| GestureNet | 160x160x3 | FP16 | 1024 | 94555 |
| BodyPose | 288x384x3 | INT8 | 16 | 3180 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 32 | 12600 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 16 | 797 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 64 | 17535 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 899 |
| Point Pillar | | FP16 | 1 | 425 |
| Pose classification | | FP16 | 64 | 2144.84 |
| 3D Pose - Accuracy | | FP16 | 32 | 3466.34 |
| 3D Pose - Performance | | FP16 | 32 | 4176.37 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 5745.79 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 496.34 |
| PeopleNet Transformer | 960x544x3 | FP16 | 16 | 267 |
| ReidentificationNet | 3x256x18 | FP16 | 256 | 20781 |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 64 | 1012 |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 64 | 947 |
| RetailObjectEmbedding | 3x224x224 | FP16 | 64 | 9851 |
| CitySemSegFormer | | FP16 | 4 | 13 |

A30

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 4228 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 3160 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 1603 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 5082 |
| DashCamNet | 960x544x3 | INT8 | 64 | 4900 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 27100 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 46200 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 37200 |
| PeopleSegNet | 960x576x3 | INT8 | 8 | 158529 |
| FaceDetect | 736x416x3 | INT8 | 64 | 7700 |
| LPD | 640x480x3 | INT8 | 128 | 12500 |
| LPR | 96x48x3 | FP16 | 128 | 12400 |
| Facial landmark | 80x80x1 | FP16 | 128 | 12400 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 12321 |
| GestureNet | 160x160x3 | FP16 | 512 | 47361 |
| BodyPose | 288x384x3 | INT8 | 32 | 1596 |
| AR 2D | 224x224x96 | FP16 | 16 | 6000 |
| AR 3D | 224x224x32x3 | FP16 | 4 | 380 |
| AR 2D OF | 224x224x96 | FP16 | 32 | 8940 |
| AR 3D OF | 224x224x32x3 | FP16 | 16 | 461 |
| Point Pillar | | FP16 | 1 | 271 |
| Pose classification | | FP16 | 64 | 1121.68 |
| 3D Pose - Accuracy | | FP16 | 32 | 1913.92 |
| 3D Pose - Performance | | FP16 | 32 | 2241.83 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 2862.76 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 253.77 |
| PeopleNet Transformer | 960x544x3 | FP16 | 16 | 135 |
| ReidentificationNet | 3x256x18 | FP16 | 128 | 9649 |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 64 | 485 |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 64 | 450 |
| RetailObjectEmbedding | 3x224x224 | FP16 | 64 | 4791 |
| CitySemSegFormer | | FP16 | 1 | 6.4 |

A10

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 64 | 3819 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 2568 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 1007 |
| TrafficCamNet | 960x544x3 | INT8 | 64 | 4754 |
| DashCamNet | 960x544x3 | INT8 | 64 | 4600 |
| FaceDetect-IR | 384x240x3 | INT8 | 128 | 26900 |
| VehicleMakeNet | 224x224x3 | INT8 | 256 | 44800 |
| VehicleTypeNet | 224x224x3 | INT8 | 256 | 31500 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 64 | 6000 |
| License Plate Detection | 640x480x3 | INT8 | 256 | 13900 |
| License Plate Recognition | 96x48x3 | FP16 | 256 | 9000 |
| Facial landmark | 80x80x1 | FP16 | 512 | 9600 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 512 | 10718 |
| GestureNet | 160x160x3 | FP16 | 512 | 35371 |
| BodyPose | 288x384x3 | INT8 | 32 | 1334 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 16 | 4600 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 265 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 32 | 6500 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 16 | 284 |
| Point Pillar | | FP16 | 1 | 246 |
| Pose classification | | FP16 | 64 | 825.75 |
| 3D Pose - Accuracy | | FP16 | 32 | 1286.05 |
| 3D Pose - Performance | | FP16 | 32 | 1558.21 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 64 | 2429.62 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 180.04 |
| PeopleNet Transformer | 960x544x3 | FP16 | 16 | 106 |
| ReidentificationNet | 3x256x18 | FP16 | 128 | 6129 |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 64 | 367 |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 64 | 341 |
| RetailObjectEmbedding | 3x224x224 | FP16 | 64 | 3105 |
| CitySemSegFormer | | FP16 | 1 | 4.5 |

A2

| Model arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|
| PeopleNet-ResNet18 | 960x544x3 | INT8 | 32 | 749 |
| PeopleNet-ResNet34 (v2.3) | 960x544x3 | INT8 | 32 | 581 |
| PeopleNet-ResNet34 (v2.5 unpruned) | 960x544x3 | INT8 | 32 | 231 |
| TrafficCamNet | 960x544x3 | INT8 | 32 | 916 |
| DashCamNet | 960x544x3 | INT8 | 32 | 865 |
| FaceDetect-IR | 384x240x3 | INT8 | 64 | 4982 |
| VehicleMakeNet | 224x224x3 | INT8 | 128 | 8000 |
| VehicleTypeNet | 224x224x3 | INT8 | 128 | 6302 |
| FaceDetect (pruned) | 736x416x3 | INT8 | 32 | 1174 |
| License Plate Detection | 640x480x3 | INT8 | 128 | 2570 |
| License Plate Recognition | 96x48x3 | FP16 | 128 | 2180 |
| Facial landmark | 80x80x1 | FP16 | 256 | 2800 |
| GazeNet | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 256 | 2488 |
| GestureNet | 160x160x3 | FP16 | 256 | 7690 |
| BodyPose | 288x384x3 | INT8 | 16 | 278 |
| Action Recognition 2D RGB | 224x224x96 | FP16 | 8 | 1044 |
| Action Recognition 3D RGB | 224x224x32x3 | FP16 | 4 | 56 |
| Action Recognition 2D OF | 224x224x96 | FP16 | 16 | 1419 |
| Action Recognition 3D OF | 224x224x32x3 | FP16 | 2 | 58 |
| Point Pillar | | FP16 | 1 | 63 |
| Pose classification | | FP16 | 64 | 211.5 |
| 3D Pose - Accuracy | | FP16 | 32 | 370.13 |
| 3D Pose - Performance | | FP16 | 32 | 471.81 |
| PeopleSemSegNet_v2 - Shuffle | 960x544x3 | FP16 | 16 | 631.31 |
| PeopleSemSegNet_v2 - Vanilla | 960x544x3 | FP16 | 16 | 44.09 |
| PeopleNet Transformer | 960x544x3 | FP16 | 4 | 27 |
| ReidentificationNet | 3x256x18 | FP16 | 64 | 1570 |
| RetailObjectDetection-100 class | 416x416x3 | FP16 | 32 | 106 |
| RetailObjectDetection-binary | 416x416x3 | FP16 | 32 | 99 |
| RetailObjectEmbedding | 3x224x224 | FP16 | 32 | 803 |
| CitySemSegFormer | | FP16 | 1 | 1.3 |
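One way to read these figures is to translate GPU FPS into a rough count of camera streams. The sketch below does this for PeopleNet-ResNet18, using the GPU FPS values from the tables above. The platform order is assumed to follow the platform list that precedes the tables, and the 30 FPS per stream is a hypothetical target. Because the tables report inference-only performance, the results are best treated as upper bounds; decoding, preprocessing, and tracking are not accounted for.

```python
# GPU FPS for PeopleNet-ResNet18 (INT8), copied from the tables above.
# Platform labels assume the tables follow the order of the platform list.
PEOPLENET_RESNET18_GPU_FPS = {
    "Xavier NX": 218,
    "AGX Xavier": 390,
    "Orin NX": 400,
    "Orin": 1116,
    "T4": 1379,
    "A100": 8500,
    "A30": 4228,
    "A10": 3819,
    "A2": 749,
}


def max_streams(model_fps: float, stream_fps: float = 30.0) -> int:
    """Whole number of streams that fit in the inference-only throughput."""
    if stream_fps <= 0:
        raise ValueError("stream_fps must be positive")
    return int(model_fps // stream_fps)


for platform, fps in PEOPLENET_RESNET18_GPU_FPS.items():
    print(f"{platform:>10}: {fps:>5} FPS -> up to {max_streams(fps)} streams at 30 FPS")
```

Under these assumptions, 218 FPS on Xavier NX corresponds to at most 7 streams and 8500 FPS on A100 to at most 283.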

With general-purpose models, you can train an image classification model, an object detection model, or a segmentation model (instance or semantic).

For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.

For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.

For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, lightweight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the table below and see the Open Model Architectures section.
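The task-to-architecture choices described above can be captured in a small lookup table, for example when scripting experiments. This is only an illustrative sketch: the dictionary keys and helper names are hypothetical and are not part of TAO Toolkit, and backbones are left out because they are listed in the Open Model Architectures section rather than here.

```python
from typing import Dict, List

# Architectures named in this section, grouped by task (illustration only).
ARCHITECTURES_BY_TASK: Dict[str, List[str]] = {
    "classification": [
        "ResNet", "EfficientNet", "VGG", "MobileNet",
        "GoogLeNet", "SqueezeNet", "DarkNet",
    ],
    "object_detection": [
        "YOLOv3", "YOLOv4", "YOLOv4-tiny", "FasterRCNN",
        "SSD", "RetinaNet", "DSSD", "DetectNet_v2",
    ],
    "instance_segmentation": ["MaskRCNN"],
    "semantic_segmentation": ["UNET"],
}


def architectures_for(task: str) -> List[str]:
    """Return the architectures listed for a task, or raise for unknown tasks."""
    try:
        return ARCHITECTURES_BY_TASK[task]
    except KeyError:
        known = ", ".join(sorted(ARCHITECTURES_BY_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}")


def task_of(architecture: str) -> str:
    """Find the task an architecture is listed under (case-insensitive)."""
    wanted = architecture.lower()
    for task, names in ARCHITECTURES_BY_TASK.items():
        if wanted in (name.lower() for name in names):
            return task
    raise ValueError(f"architecture {architecture!r} is not listed")


print(architectures_for("object_detection"))
print(task_of("unet"))
```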

The table below summarizes the computer vision models and the features enabled.