NVIDIA TAO v5.5.0

Overview

TAO provides an extensive model zoo containing pre-trained models for computer vision use cases.

There are three types of pre-trained models that you can start with:

  • Foundation models: Large-scale machine learning models trained on vast quantities of data. They are often trained with some form of self-supervised or semi-supervised learning. The primary goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.

  • Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can either be used directly for inference or can be used with TAO for transfer learning on your own dataset.

  • Pre-trained weights: These weights act as a starting point for building more complex models. For computer vision use cases, they are trained on the Open Images dataset and provide a much better starting point for training than a random initialization of weights.

[Figure: TAO Toolkit models tree (tao_toolkit_models_tree.png)]

You can choose from 100+ permutations of model architecture and backbone with the general purpose vision models.

Foundation Models

| Model name | Description | NGC Instance | TAO Finetuning |
|---|---|---|---|
| TAO Commercial Pretrained NV-CLIP Model | TAO Commercial Pretrained NV-CLIP ViT-H Model | nvidia/tao/nvclip_vit | No |
| TAO Commercial Pretrained NV-Dinov2 Model | TAO Commercial Pretrained NV-Dinov2 Model ViT-G backbone | nvidia/tao/nvdinov2_vitg | Downstream finetuning with classification_pyt, DINO and visual changenet |
| SegIC | In-context segmentation model trained on commercial data. | nvidia/tao/segic | No |
| Mask Grounding DINO | Open vocabulary multi-modal instance segmentation model trained on commercial data. | nvidia/tao/mask_grounding_dino | Yes |
| Grounding DINO | Open vocabulary multi-modal object detection model trained on commercial data. | nvidia/tao/grounding_dino | Yes |
| TAO Toolkit ODISE 1.1 | Open-vocabulary panoptic segmentation model. | nvidia/tao/odise | Yes (from source only) |
| Mask Auto Label | Pretrained model to generate semantic segmentation labels. | nvidia/tao/mask_auto_label | Yes |
| FoundationPose | 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box. | nvidia/tao/foundationpose | No |

Purpose-Built Models

| Model name | Description | NGC Instance | TAO Finetuning |
|---|---|---|---|
| BEVFusion for 3D Object Detection | BEVFusion model to detect 3D objects from point cloud and RGB data. | nvidia/tao/bevfusion | Yes |
| Action Recognition Net | 5-class action recognition network to recognize what people do in an image. | nvidia/tao/actionrecognitionnet | Yes |
| ReIdentificationNet Transformer | Swin Transformer-based re-identification network to generate embeddings for identifying persons in different scenes. | nvidia/tao/reidentificationnet_transformer | Yes |
| ReIdentificationNet | Re-identification network to generate embeddings for identifying persons in different scenes. | nvidia/tao/reidentificationnet | Yes |
| CenterPose - Isaac ROS | 3D pose detection model for retail objects. | nvidia/tao/centerpose_ros | Yes |
| Optical Character Recognition | Model to recognize characters from a preceding OCDNet model. | nvidia/tao/ocrnet | Yes |
| Retail Object Detection | DINO (DETR with Improved DeNoising Anchor Boxes) based object detection network to detect retail objects on a checkout counter. | nvidia/tao/retail_object_detection | Yes |
| Pose Classification | Pose classification network to classify poses of people from their skeletons. | nvidia/tao/poseclassificationnet | Yes |
| PointPillarNet | Model to detect one or more objects from a LiDAR point cloud file and return 3D bounding boxes. | nvidia/tao/pointpillarnet | Yes |
| PeopleSemSegNet - AMR | Semantic segmentation of persons in an image. | nvidia/tao/peoplesemsegnet_amr | Yes |
| PeopleSemSegNet | Semantic segmentation of persons in an image. | nvidia/tao/peoplesemsegnet | Yes |
| PeopleSegNet | 1-class instance segmentation network to detect and segment instances of people in an image. | nvidia/tao/peoplesegnet | Yes |
| PeopleNet Transformer | 3-class object detection network to detect people in an image. | nvidia/tao/peoplenet_transformer | Yes |
| PeopleNet - AMR | 3-class object detection network to detect people in an image. | nvidia/tao/peoplenet_amr | Yes |
| PeopleNet | 3-class object detection network to detect people in an image. | nvidia/tao/peoplenet | Yes |
| Multiple 3D CenterPose | Single-stage, keypoint-based method for category-level object pose estimation. | nvidia/tao/multiclass_3d_centerpose | Yes |
| Retail Object Recognition | Embedding generator model to recognize objects on a checkout counter. | nvidia/tao/retail_object_recognition | Yes |
| PeopleSemSegformer | Model to segment persons in an image. | nvidia/tao/peoplesemsegformer | Yes |
| PeopleNet Transformer v2.0 | 3-class object detection network to detect people in an image. | nvidia/tao/peoplenet_transformer_v2 | Yes |
| Visual ChangeNet - Segmentation | Visual ChangeNet - Segmentation | nvidia/tao/visual_changenet_segmentation_landsatscd | Yes |
| CenterPose | 3D pose detection model for retail objects. | nvidia/tao/centerpose | Yes |
| Visual ChangeNet Segmentation - MvTEC | Change segmentation model. | nvidia/tao/visual_changenet_segmentation_mvtec | Yes |
| Optical Inspection | Model to detect defects in soldered components on a Printed Circuit Board. | nvidia/tao/optical_inspection | Yes |
| PCB Defect Classification | Model to classify defects in soldered components on a Printed Circuit Board. | nvidia/tao/pcb_classification | Yes |
| Visual ChangeNet Segmentation - (Research-only) | Visual ChangeNet-Segmentation (Research-only) | nvidia/tao/visual_changenet_segmentation_levircd | Yes |
| Visual ChangeNet Classification | Visual ChangeNet - Classification Models | nvidia/tao/visual_changenet_classification | Yes |
| FaceDetect | Detect faces from an image. | nvidia/tao/facenet | Yes |
| FaceDetectIR | 1-class object detection network to detect faces in an image. | nvidia/tao/facedetectir | Yes |
| HeartRateNet | Estimate heart rate non-invasively from RGB facial videos. | nvidia/tao/heartratenet | Yes |
| Gaze Estimation | Detect a person's eye gaze point of regard and gaze vector. | nvidia/tao/gazenet | Yes |
| Facial Landmarks Estimation | Detect fiducial keypoints from an image of a face. | nvidia/tao/fpenet | Yes |
| EmotionNet | Network to classify emotions from a face. | nvidia/tao/emotionnet | Yes |
| GestureNet | Classify gestures from hand crop images. | nvidia/tao/gesturenet | Yes |
| License Plate Recognition | Model to recognize characters from the image crop of a license plate. | nvidia/tao/lprnet | Yes |
| LPDNet | Object detection network to detect license plates in an image of a car. | nvidia/tao/lpdnet | Yes |
| DashCamNet | 4-class object detection network to detect cars in an image. | nvidia/tao/dashcamnet | Yes |
| CitySemSegFormer | Semantic segmentation of persons in an image. | nvidia/tao/citysemsegformer | Yes |
| BodyPoseNet | Detect body pose from an image. | nvidia/tao/bodyposenet | Yes |
| Optical Character Detection | Network to detect characters in an image. | nvidia/tao/ocdnet | Yes |
| VehicleTypeNet | ResNet18 model to classify a car crop into 1 out of 6 car types. | nvidia/tao/vehicletypenet | Yes |
| VehicleMakeNet | ResNet18 model to classify a car crop into 1 out of 20 car brands. | nvidia/tao/vehiclemakenet | Yes |
| TrafficCamNet | 4-class object detection network to detect cars in an image. | nvidia/tao/trafficcamnet | Yes |
| BodyPose3DNet | 3D human pose estimation network to predict 34 keypoints in 3D of a person in an image. | nvidia/tao/bodypose3dnet | No |

Pre-trained Models

| Model name | Description | NGC Instance |
|---|---|---|
| Mask2Former | Binary instance segmentation model trained on COCO data. | nvidia/tao/mask2former |
| TAO Pretrained Semantic Segmentation | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_semantic_segmentation |
| TAO Pretrained Instance Segmentation | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_instance_segmentation |
| TAO Pretrained EfficientDet | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_efficientdet |
| TAO Pretrained DetectNet V2 | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_detectnet_v2 |
| TAO Pretrained Classification | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_classification |
| TAO Pretrained Object Detection | Pretrained weights to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_object_detection |
| Pre-trained SegFormer ImageNet weights | Pre-trained SegFormer weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_segformer_imagenet |
| Pre-trained EfficientDet NvImageNet backbones | Pre-trained EfficientNet backbone weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_efficientdet_tf2_nvimagenet |
| Pre-trained EfficientDet Model trained on COCO | Pre-trained EfficientDet models trained on COCO to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_efficientdet_tf2_coco |
| Deformable DETR | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_deformable_detr_coco |
| DINO | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_dino_coco |
| TAO Pretrained DINO with Foundational Model Backbone | TAO Pretrained DINO with Foundational Model Backbone | nvidia/tao/dino_with_fm_backbone |
| DINO with NVDinoV2 backbone on COCO | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_nvdinov2_dino_coco |
| Pre-trained FasterViT based ImageNet Classification weights | Pre-trained FasterViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fastervit_classification_imagenet |
| Pre-trained FasterViT based NVImageNet Classification weights | Pre-trained FasterViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fastervit_classification_nvimagenet |
| Pre-trained GCViT ImageNet Classification weights | Pre-trained GCViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_gcvit_classification_imagenet |
| Pre-trained Deformable DETR NvImageNet weights | Pre-trained deformable_detr weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_deformable_detr_nvimagenet |
| Pre-trained FAN based ImageNet Classification weights | Pre-trained FAN weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fan_classification_imagenet |
| Pre-trained FAN based NVImageNet Classification weights | Pre-trained FAN weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fan_classification_nvimagenet |
| Pre-trained DINO ImageNet weights | Pre-trained DINO weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_dino_imagenet |
| Pre-trained Segformer - CityScapes | Pre-trained SegFormer models trained on CityScapes. | nvidia/tao/pretrained_segformer_cityscapes |
| Pre-trained SegFormer NvImageNet weights | Pre-trained SegFormer weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_segformer_nvimagenet |
| Pre-trained GCViT NVImageNet Classification weights | Pre-trained GCViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_gcvit_classification_nvimagenet |
| Pre-trained DINO NvImageNet weights | Pre-trained DINO weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_dino_nvimagenet |
| Pre-trained Deformable DETR ImageNet weights | Pre-trained deformable_detr weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_deformable_detr_imagenet |
| Pretrained Mask Auto Label | Pretrained model to facilitate transfer learning for MAL on TAO Toolkit. | nvidia/tao/pretrained_mask_auto_label |
| Pre-trained ImageNet Backbones weights | Pre-trained weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_imagenet_backbones |
| Pre-trained NVImageNet Backbones weights | Pre-trained weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_nvimagenet_backbones |
| TAO Pretrained EfficientDet-TF2 | Pretrained EfficientNet backbones for TAO Toolkit's efficientdet-tf2. | nvidia/tao/pretrained_efficientdet_tf2 |
| TAO Pretrained Classification-TF2 | Pretrained backbones for TAO Toolkit TF2 image classification. | nvidia/tao/pretrained_classification_tf2 |
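
The NGC Instance values listed above follow an org/team/model pattern, for example nvidia/tao/peoplenet. As a minimal sketch, the Python helpers below split such a string and build, without running them, NGC CLI commands to list a model's versions or download one. The helper names are hypothetical, the version tag and destination directory are placeholders that must come from the listing and your own setup, and the NGC CLI is assumed to be installed and configured.

```python
def parse_ngc_instance(instance):
    """Split an NGC instance string such as 'nvidia/tao/peoplenet' into its parts."""
    parts = instance.strip().split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'org/team/model', got {instance!r}")
    org, team, model = parts
    return {"org": org, "team": team, "model": model}


def list_versions_command(instance):
    """Build (but do not run) an NGC CLI command that lists a model's versions."""
    parse_ngc_instance(instance)  # validate the shape before building the command
    return ["ngc", "registry", "model", "list", f"{instance}:*"]


def download_command(instance, version, dest):
    """Build (but do not run) an NGC CLI command that downloads one model version.

    `version` is a placeholder: take the real tag from the listing output.
    """
    parse_ngc_instance(instance)
    return ["ngc", "registry", "model", "download-version",
            f"{instance}:{version}", "--dest", dest]


if __name__ == "__main__":
    print(parse_ngc_instance("nvidia/tao/peoplenet"))
    print(" ".join(list_versions_command("nvidia/tao/peoplenet")))
```

The commands are returned as argument lists so they can be passed to subprocess.run without shell quoting issues, once a real version tag is known.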

Performance Metrics

The inference performance of these pretrained models across various NVIDIA platforms is summarized in the tables below. The figures were measured with the trtexec tool from the TensorRT samples. In the column headers, BS is the batch size and FPS is the throughput in frames per second.
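
Throughput and batch size together imply a per-batch latency, which is often the more useful number when sizing a real-time pipeline. The sketch below assumes that the reported FPS counts all images processed per second across the whole batch; the function name is hypothetical and the example values are taken from the PeopleNet row of the first table (batch size 16, 400 FPS).

```python
def batch_latency_ms(batch_size, fps):
    """Approximate time to process one batch, in milliseconds.

    Assumes `fps` is total images per second across the batch.
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return 1000.0 * batch_size / fps


if __name__ == "__main__":
    # PeopleNet, first table: GPU batch size 16 at 400 FPS
    print(batch_latency_ms(16, 400))  # 40.0
```

This is only an estimate: it ignores pre-processing, post-processing, and data transfer outside the measured inference.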

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 16 400 16 300
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 314 32 226
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 16 140 32 70
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 16 457 16 352
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 479 64 358
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 64 2588 64 1700
VehicleMakeNet ResNet18 224x224x3 INT8 64 4261 64 2218
VehicleTypeNet ResNet18 224x224x3 INT8 64 3391 64 2044
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 32 613 32 492
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 32 32
License Plate Recognition ResNet 96x48x3 FP16 128 1498
Facial landmark 80x80x1 FP16 32 1606
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 64 1241
GestureNet 160x160x3 FP16 64 5420
BodyPose 288x384x3 INT8 16 195
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 32 577
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 4 38
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 16 826
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 4 42
Point Pillar Point Pillar FP16 1 38
Pose classification ST-GCN FP16 8 105
3D Pose - Accuracy FP16 16 241
3D Pose - Performance FP16 16 295
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 16 289
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 4 27
PeopleNet Transformer D-DETR 960x544x3 FP16 2 13
ReidentificationNet ResNet50 3x256x18 FP16 64 1049
RetailObjectDetection v1.0 - 100 class EfficientNet-D5 416x416x3 FP16 16 48
RetailObjectDetection v1.0 - binary EfficientNet-D5 416x416x3 FP16 16 45
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 32 544
CitySemSegFormer v1.0 SegFormer FP16 1 0.6
ImageNet Classification FAN-T-H 224x224x3 FP16 4 146.6
ImageNet Classification FAN-S-H 224x224x3 FP16 4 94.5
ImageNet Classification FAN-B-H 224x224x3 FP16 4 62
ImageNet Classification FAN-L-H 224x224x3 FP16 4 43.9
ImageNet Classification FAN-XL-H 224x224x3 FP16 4 39.9
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 8 212
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 8 163
ImageNet Classification GC-ViT-T 224x224x3 FP16 8 110
ImageNet Classification GC-ViT-S 224x224x3 FP16
ImageNet Classification GC-ViT-B 224x224x3 FP16 4 51.2
ImageNet Classification GC-ViT-L 224x224x3 FP16 4 28.8
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 1 10.4
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 4 23.4
ImageNet Classification FAN-L-H-384 384x384x3 FP16
ImageNet Classification EfficientNetB0 224x224x3 FP16 64 870
ImageNet Classification EfficientNetB1 224x224x3 FP16 64 344
ImageNet Classification EfficientNetB2 224x224x3 FP16 32 313
ImageNet Classification EfficientNetB3 224x224x3 FP16 32 234
ImageNet Classification EfficientNetB4 224x224x3 FP16 32 175
ImageNet Classification EfficientNetB5 224x224x3 FP16 16 123
COCO Object Detection DDETR+RN50 960x544x3 FP16 1 9.6
COCO Object Detection DDETR + GCViT-T 960x544x3 FP16 1 5.2
COCO Object Detection DINO + RN50 960x544x3 FP16 1 8.4
COCO Object Detection DINO + FAN-S 960x544x3 FP16 1 4.4
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 1 4.9
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 2.6
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16
RetailObjectDetection v2.0 - binary Efficientdet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 3.4
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 3.4
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 4 59.8
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 4 41.9
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 575
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 16 482
OCDNet DCN-RN18 640x640x3 FP16 2 45
OCDNet DCN-RN50 640x640x3 FP16 2 22
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 64 1373
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 64 1051
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 4 198
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 1.9
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 1.78
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 44.7
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 21.92
CenterPose DLA34 512x512x3 FP16 1 27.74
CenterPose FAN-S-H 512x512x3 FP16 1 11.35

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 1116 32 528
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 890 32 404
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 16 421 32 104
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 1268 32 594
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 1308 64 587
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 128 7462 128 2720
VehicleMakeNet ResNet18 224x224x3 INT8 128 11872 128 3956
VehicleTypeNet ResNet18 224x224x3 INT8 128 9815 128 3494
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 64 1700 64 870
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 64 64
License Plate Recognition ResNet 96x48x3 FP16 128 4118
Facial landmark 80x80x1 FP16 64
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 128 3226
GestureNet 160x160x3 FP16 128 15133
BodyPose 288x384x3 INT8 16 559
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 64 1577
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 8 105
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 32 1702
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 4 109
Point Pillar Point Pillar FP16 1 90
Pose classification ST-GCN FP16 16 262
3D Pose - Accuracy FP16 16 597
3D Pose - Performance FP16 16 711
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 32 703
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 4 75
PeopleNet Transformer D-DETR 960x544x3 FP16 2 33
ReidentificationNet ResNet50 3x256x18 FP16 64 2772
RetailObjectDetection v1.0 - 100 class EfficientNet-D5 416x416x3 FP16 16 121
RetailObjectDetection v1.0 - binary EfficientNet-D5 416x416x3 FP16 16 114
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 32 1355
CitySemSegFormer v1.0 SegFormer FP16 1 1.5
ImageNet Classification FAN-T-H 224x224x3 FP16 8 385
ImageNet Classification FAN-S-H 224x224x3 FP16 8 255
ImageNet Classification FAN-B-H 224x224x3 FP16 8 166
ImageNet Classification FAN-L-H 224x224x3 FP16 8 117
ImageNet Classification FAN-XL-H 224x224x3 FP16 8 106
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 16 559
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 16 431
ImageNet Classification GC-ViT-T 224x224x3 FP16 16 293
ImageNet Classification GC-ViT-S 224x224x3 FP16
ImageNet Classification GC-ViT-B 224x224x3 FP16 8 138
ImageNet Classification GC-ViT-L 224x224x3 FP16 8 80.7
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 4 28.1
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 8 61.2
ImageNet Classification FAN-L-H-384 384x384x3 FP16
ImageNet Classification EfficientNetB0 224x224x3 FP16 64 2152
ImageNet Classification EfficientNetB1 224x224x3 FP16 64 860
ImageNet Classification EfficientNetB2 224x224x3 FP16 64 811
ImageNet Classification EfficientNetB3 224x224x3 FP16 64 609
ImageNet Classification EfficientNetB4 224x224x3 FP16 64 451
ImageNet Classification EfficientNetB5 224x224x3 FP16 32 318
COCO Object Detection DDETR+RN50 960x544x3 FP16 4 25.8
COCO Object Detection DDETR + GCViT-T 960x544x3 FP16 4 14.2
COCO Object Detection DINO + RN50 960x544x3 FP16 4 22
COCO Object Detection DINO + FAN-S 960x544x3 FP16 4 11.2
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 4 13
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 6.2
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16
RetailObjectDetection v2.0 - binary Efficientdet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 8.1
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 8.1
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 8 161
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 8 112
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 32 1541
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 32 1538
OCDNet DCN-RN18 640x640x3 FP16 4 120
OCDNet DCN-RN50 640x640x3 FP16
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 3876
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 128 2950
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 16 560
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 4.8
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 4.4
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 113.2
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 55
CenterPose DLA34 512x512x3 FP16 1 57.04
CenterPose FAN-S-H 512x512x3 FP16 1 28.03
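
Because the same models appear in every table, per-model throughput ratios give a quick sense of how two platforms compare. The sketch below uses the GPU FPS values of three DetectNet_v2 models from the first and second tables above. The dictionary and function names are hypothetical, and the batch sizes differ between the two tables for these rows, so the ratios are indicative rather than a like-for-like benchmark.

```python
# GPU FPS copied from the first and second performance tables above.
first_table_fps = {"PeopleNet": 400, "TrafficCamNet": 457, "DashCamNet": 479}
second_table_fps = {"PeopleNet": 1116, "TrafficCamNet": 1268, "DashCamNet": 1308}


def throughput_ratios(baseline, other):
    """Return other/baseline FPS for every model present in both tables."""
    return {
        name: other[name] / baseline[name]
        for name in baseline
        if name in other and baseline[name] > 0
    }


if __name__ == "__main__":
    for name, ratio in throughput_ratios(first_table_fps, second_table_fps).items():
        print(f"{name}: {ratio:.2f}x")
```

Rows with a blank FPS cell should simply be left out of the dictionaries so that no ratio is computed for them.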

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 4228
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 3160
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 32 1603
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 5082
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 4900
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 128 27100
VehicleMakeNet ResNet18 224x224x3 INT8 256 46200
VehicleTypeNet ResNet18 224x224x3 INT8 128 37200
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 64 7700
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 128 12500
License Plate Recognition ResNet 96x48x3 FP16 128 12400
Facial landmark 80x80x1 FP16 128 12400
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 512 12321
GestureNet 160x160x3 FP16 512 47361
BodyPose 288x384x3 INT8 32 1596
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 16 6000
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 4 380
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 32 8940
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 16 461
Point Pillar Point Pillar FP16 1 271
Pose classification ST-GCN FP16 64 1121.68
3D Pose - Accuracy FP16 32 1913.92
3D Pose - Performance FP16 32 2241.83
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 64 2862.76
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 16 253.77
PeopleNet Transformer D-DETR 960x544x3 FP16 16 135
ReidentificationNet ResNet50 3x256x18 FP16 128 9649
RetailObjectDetection v1.0 - 100 class EfficientNet-D5 416x416x3 FP16 64 485
RetailObjectDetection v1.0 - binary EfficientNet-D5 416x416x3 FP16 64 450
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 64 4791
CitySemSegFormer v1.0 SegFormer FP16 1 6.4
ImageNet Classification FAN-T-H 224x224x3 FP16 32 1806
ImageNet Classification FAN-S-H 224x224x3 FP16 32 1185
ImageNet Classification FAN-B-H 224x224x3 FP16 16 769
ImageNet Classification FAN-L-H 224x224x3 FP16 16 548
ImageNet Classification FAN-XL-H 224x224x3 FP16 16 423
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 32 3105
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 32 2339
ImageNet Classification GC-ViT-T 224x224x3 FP16 32 1658
ImageNet Classification GC-ViT-S 224x224x3 FP16 16 1077
ImageNet Classification GC-ViT-B 224x224x3 FP16 16 711
ImageNet Classification GC-ViT-L 224x224x3 FP16 16 388
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 8 151
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 16 260
ImageNet Classification FAN-L-H-384 384x384x3 FP16 8 179
ImageNet Classification EfficientNetB0 224x224x3 FP16 64 8361
ImageNet Classification EfficientNetB1 224x224x3 FP16 64 3245
ImageNet Classification EfficientNetB2 224x224x3 FP16 64 3078
ImageNet Classification EfficientNetB3 224x224x3 FP16 64 2320
ImageNet Classification EfficientNetB4 224x224x3 FP16 64 1669
ImageNet Classification EfficientNetB5 224x224x3 FP16 64 1203
COCO Object Detection DDETR+RN50 960x544x3 FP16 8 141
COCO Object Detection DDETR + GCViT-T 960x544x3 FP16 8 87
COCO Object Detection DINO + RN50 960x544x3 FP16 8 115
COCO Object Detection DINO + FAN-S 960x544x3 FP16 4 56
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 8 77
COCO Object Detection DINO + FAN-L 960x544x3 FP16 8 33.4
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 16 1098
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 16 834
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 16 606
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 16 465
RetailObjectDetection v2.0 - binary Efficientdet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 8 44.2
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 8 44
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 16 734
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 16 522
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 5886
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 32 6318
OCDNet DCN-RN18 640x640x3 FP16 16 387
OCDNet DCN-RN50 640x640x3 FP16 8 186.6
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 12122
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 128 9656
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 32 3221
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 4 29.3
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 4 23.7
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 8 116.8
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 436
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 204
CenterPose DLA34 512x512x3 FP16 1 57.46
CenterPose FAN-S-H 512x512x3 FP16 1 26.73

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 3819
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 2568
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 32 1007
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 4754
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 4600
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 128 26900
VehicleMakeNet ResNet18 224x224x3 INT8 256 44800
VehicleTypeNet ResNet18 224x224x3 INT8 256 31500
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 64 6000
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 256 13900
License Plate Recognition ResNet 96x48x3 FP16 256 9000
Facial landmark 80x80x1 FP16 512 9600
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 512 10718
GestureNet 160x160x3 FP16 512 35371
BodyPose 288x384x3 INT8 32 1334
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 16 4600
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 4 265
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 32 6500
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 16 284
Point Pillar Point Pillar FP16 1 246
Pose classification ST-GCN FP16 64 825.75
3D Pose - Accuracy FP16 32 1286.05
3D Pose - Performance FP16 32 1558.21
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 64 2429.62
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 16 180.04
PeopleNet Transformer D-DETR 960x544x3 FP16 16 106
ReidentificationNet ResNet50 3x256x18 FP16 128 6129
RetailObjectDetection v1.0 - 100 class EfficientNet-D5 416x416x3 FP16 64 367
RetailObjectDetection v1.0 - binary EfficientNet-D5 416x416x3 FP16 64 341
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 64 3105
CitySemSegFormer v1.0 SegFormer FP16 1 4.5

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

ImageNet Classification FAN-T-H 224x224x3 FP16 4 99.1
ImageNet Classification FAN-S-H 224x224x3 FP16 4 64.5
ImageNet Classification FAN-B-H 224x224x3 FP16 4 42.3
ImageNet Classification FAN-L-H 224x224x3 FP16 4 29.8
ImageNet Classification FAN-XL-H 224x224x3 FP16 4 27
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 8 146
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 8 111
ImageNet Classification GC-ViT-T 224x224x3 FP16 8 75
ImageNet Classification GC-ViT-S 224x224x3 FP16
ImageNet Classification GC-ViT-B 224x224x3 FP16 4 35.2
ImageNet Classification GC-ViT-L 224x224x3 FP16 4 19.8
ImageNet Classification GC-ViT-L-384 384x384x3 FP16
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 4 16
ImageNet Classification FAN-L-H-384 384x384x3 FP16
ImageNet Classification EfficientNetB0 224x224x3 FP16 32 571
ImageNet Classification EfficientNetB1 224x224x3 FP16 32 227
ImageNet Classification EfficientNetB2 224x224x3 FP16 16 208
ImageNet Classification EfficientNetB3 224x224x3 FP16 16 155
ImageNet Classification EfficientNetB4 224x224x3 FP16 16 116
ImageNet Classification EfficientNetB5 224x224x3 FP16 8 81
COCO Object Detection DDETR+RN50 960x544x3 FP16 1 6.6
COCO Object Detection DDETR + GCViT-T 960x544x3 FP16 1 3.6
COCO Object Detection DINO + RN50 960x544x3 FP16 1 5.7
COCO Object Detection DINO + FAN-S 960x544x3 FP16 1 3.1
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 1 3.3
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 1.8
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16
RetailObjectDetection v2.0 - binary Efficientdet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 2.3
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 2.3
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 4 40.2
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 4 28.4
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 393
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 16 399
OCDNet DCN-RN18 640x640x3 FP16
OCDNet DCN-RN50 640x640x3 FP16
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 64 935
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 64 715
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 4 133.9
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 1.36
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 1.2
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 31
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 15.2
CenterPose DLA34 512x512x3 FP16 1 19.16
CenterPose FAN-S-H 512x512x3 FP16 1 7.94

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 128 8500
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 64 6245
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 64 3291
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 256 9717
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 256 9500
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 256 51600
VehicleMakeNet ResNet18 224x224x3 INT8 1024 88300
VehicleTypeNet ResNet18 224x224x3 INT8 512 72300
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 256 14900
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 256 23200
License Plate Recognition ResNet 96x48x3 FP16 256 27200
Facial landmark 80x80x1 FP16 256 19600
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 1024 25394
GestureNet 160x160x3 FP16 1024 94555
BodyPose 288x384x3 INT8 16 3180
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 32 12600
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 16 797
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 64 17535
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 16 899
Point Pillar Point Pillar FP16 1 425
Pose classification ST-GCN FP16 64 2144.84
3D Pose - Accuracy FP16 32 3466.34
3D Pose - Performance FP16 32 4176.37
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 64 5745.79
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 16 496.34
PeopleNet Transformer D-DETR 960x544x3 FP16 16 267
ReidentificationNet ResNet50 3x256x18 FP16 256 20781
RetailObjectDetection v1.0 - 100 class EfficientDet-D5 416x416x3 FP16 64 1012
RetailObjectDetection v1.0 - binary EfficientDet-D5 416x416x3 FP16 64 947
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 64 9851
CitySemSegFormer v1.0 SegFormer FP16 4 13
ImageNet Classification FAN-T-H 224x224x3 FP16 128 3867
ImageNet Classification FAN-S-H 224x224x3 FP16 128 2576
ImageNet Classification FAN-B-H 224x224x3 FP16 64 1692
ImageNet Classification FAN-L-H 224x224x3 FP16 64 1196
ImageNet Classification FAN-XL-H 224x224x3 FP16 32 1046
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 128 6863
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 128 5162
ImageNet Classification GC-ViT-T 224x224x3 FP16 128 3600
ImageNet Classification GC-ViT-S 224x224x3 FP16 64 2435
ImageNet Classification GC-ViT-B 224x224x3 FP16 64 1663
ImageNet Classification GC-ViT-L 224x224x3 FP16 64 917
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 32 344
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 64 577
ImageNet Classification FAN-L-H-384 384x384x3 FP16 64 402
ImageNet Classification EfficientNetB0 224x224x3 FP16 256 17909
ImageNet Classification EfficientNetB1 224x224x3 FP16 256 6994
ImageNet Classification EfficientNetB2 224x224x3 FP16 256 6665
ImageNet Classification EfficientNetB3 224x224x3 FP16 256 4987
ImageNet Classification EfficientNetB4 224x224x3 FP16 64 3577
ImageNet Classification EfficientNetB5 224x224x3 FP16 256 2583
COCO Object Detection DDETR + RN50 960x544x3 FP16 32 296
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 32 184
COCO Object Detection DINO + RN50 960x544x3 FP16 32 244
COCO Object Detection DINO + FAN-S 960x544x3 FP16 32 121
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 32 165
COCO Object Detection DINO + FAN-L 960x544x3 FP16 16 70.6
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 64 2381
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 64 1808
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 64 1338
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 64 1000
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 32 94.3
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 32 94.2
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 64 1621
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 64 1155
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 64 15584
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 128 15660
OCDNet DCN-RN18 640x640x3 FP16 64 724
OCDNet DCN-RN50 640x640x3 FP16 64 351
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 512 28297
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 512 21212
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 128 7095
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 8 62.2
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 8 50.4
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 32 254
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 920
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 435
CenterPose DLA34 512x512x3 FP16 1 82.16
CenterPose FAN-S-H 512x512x3 FP16 1 49.99

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 1379
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 1064
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 32 465
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 1725
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 64 1676
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 128 9810
VehicleMakeNet ResNet18 224x224x3 INT8 256 16500
VehicleTypeNet ResNet18 224x224x3 INT8 128 12500
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 64 2578
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 128 6123
License Plate Recognition ResNet 96x48x3 FP16 128 3959
Facial landmark 80x80x1 FP16 128 4622
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 512 4563
GestureNet 160x160x3 FP16 512 15377
BodyPose 288x384x3 INT8 32 598
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 16 1897
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 4 139
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 32 3320
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 16 192
Point Pillar Point Pillar FP16 1 111
Pose classification ST-GCN FP16 64 376.4
3D Pose - Accuracy FP16 32 614.98
3D Pose - Performance FP16 32 712.94
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 64 1027.85
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 16 79.08
PeopleNet Transformer D-DETR 960x544x3 FP16 4 48
ReidentificationNet ResNet50 3x256x18 FP16 64 2586
RetailObjectDetection v1.0 - 100 class EfficientDet-D5 416x416x3 FP16 32 162
RetailObjectDetection v1.0 - binary EfficientDet-D5 416x416x3 FP16 32 151
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 32 1270
CitySemSegFormer v1.0 SegFormer FP16 1 2
ImageNet Classification FAN-T-H 224x224x3 FP16 16 624
ImageNet Classification FAN-S-H 224x224x3 FP16 8 403
ImageNet Classification FAN-B-H 224x224x3 FP16 8 259
ImageNet Classification FAN-L-H 224x224x3 FP16 16 188
ImageNet Classification FAN-XL-H 224x224x3 FP16 16 166
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 16 999
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 32 739
ImageNet Classification GC-ViT-T 224x224x3 FP16 16 517
ImageNet Classification GC-ViT-S 224x224x3 FP16 8 335
ImageNet Classification GC-ViT-B 224x224x3 FP16 4 223
ImageNet Classification GC-ViT-L 224x224x3 FP16 16 122
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 4 45.1
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 8 91
ImageNet Classification FAN-L-H-384 384x384x3 FP16 4 62
ImageNet Classification EfficientNetB0 224x224x3 FP16 64 2735
ImageNet Classification EfficientNetB1 224x224x3 FP16 64 1099
ImageNet Classification EfficientNetB2 224x224x3 FP16 32 1006
ImageNet Classification EfficientNetB3 224x224x3 FP16 32 753
ImageNet Classification EfficientNetB4 224x224x3 FP16 16 526
ImageNet Classification EfficientNetB5 224x224x3 FP16 16 375
COCO Object Detection DDETR + RN50 960x544x3 FP16 1 46.5
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 4 30.3
COCO Object Detection DINO + RN50 960x544x3 FP16 4 38.9
COCO Object Detection DINO + FAN-S 960x544x3 FP16 4 20
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 8 26.7
COCO Object Detection DINO + FAN-L 960x544x3 FP16 4 10.9
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 16 374
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 8 272
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 8 198
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 16 156.7
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 2 15.2
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 4 15.4
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 8 253
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 16 184
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 1838
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 32 2314
OCDNet DCN-RN18 640x640x3 FP16 16 155
OCDNet DCN-RN50 640x640x3 FP16 1 72.5
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 3649
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 128 2673
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 16 1012
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 9.4
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 7.3
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 4 39.6
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 146
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 59.7
CenterPose DLA34 512x512x3 FP16 1 24.27
CenterPose FAN-S-H 512x512x3 FP16 1 9.77

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

ImageNet Classification FAN-T-H 224x224x3 FP16 16 4249
ImageNet Classification FAN-S-H 224x224x3 FP16 8 2647
ImageNet Classification FAN-B-H 224x224x3 FP16 8 1720
ImageNet Classification FAN-L-H 224x224x3 FP16 8 1154
ImageNet Classification FAN-XL-H 224x224x3 FP16 8 1005
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 16 6282
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 16 4632
ImageNet Classification GC-ViT-T 224x224x3 FP16 16 3118
ImageNet Classification GC-ViT-S 224x224x3 FP16 8 2022
ImageNet Classification GC-ViT-B 224x224x3 FP16 8 1282
ImageNet Classification GC-ViT-L 224x224x3 FP16 4 628
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 2 247
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 4 558
ImageNet Classification FAN-L-H-384 384x384x3 FP16 4 366
ImageNet Classification EfficientNetB0 224x224x3 FP16 32 19577
ImageNet Classification EfficientNetB1 224x224x3 FP16 16 7723
ImageNet Classification EfficientNetB2 224x224x3 FP16 16 7234
ImageNet Classification EfficientNetB3 224x224x3 FP16 16 5063
ImageNet Classification EfficientNetB4 224x224x3 FP16 16 3740
ImageNet Classification EfficientNetB5 224x224x3 FP16 16 2671
COCO Object Detection DDETR + RN50 960x544x3 FP16 1 283
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 1 182
COCO Object Detection DINO + RN50 960x544x3 FP16 1 215
COCO Object Detection DINO + FAN-S 960x544x3 FP16 1 119.5
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 1 151
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 68.5
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 8 1985
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 8 1510
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 8 1144
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 8 840
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 88.5
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 88.6
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 8 1648
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 8 1124
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 32 9407
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 128 10390
OCDNet DCN-RN18 640x640x3 FP16 4 940
OCDNet DCN-RN50 640x640x3 FP16 4 445
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 18971
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 256 13950
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 16 6619
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 47.3
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 40.9
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 2 210
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 4 980
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 8 364
CenterPose DLA34 512x512x3 FP16 1 106.70
CenterPose FAN-S-H 512x512x3 FP16 1 56.65

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

ImageNet Classification FAN-T-H 224x224x3 FP16 8 1604
ImageNet Classification FAN-S-H 224x224x3 FP16 8 949
ImageNet Classification FAN-B-H 224x224x3 FP16 4 632
ImageNet Classification FAN-L-H 224x224x3 FP16 4 442
ImageNet Classification FAN-XL-H 224x224x3 FP16 4 365
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 8 2440
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 8 1797
ImageNet Classification GC-ViT-T 224x224x3 FP16 16 1266
ImageNet Classification GC-ViT-S 224x224x3 FP16 8 817
ImageNet Classification GC-ViT-B 224x224x3 FP16 8 546
ImageNet Classification GC-ViT-L 224x224x3 FP16 8 268
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 4 101
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 4 207
ImageNet Classification FAN-L-H-384 384x384x3 FP16 4 145
ImageNet Classification EfficientNetB0 224x224x3 FP16 16 7072
ImageNet Classification EfficientNetB1 224x224x3 FP16 16 2454
ImageNet Classification EfficientNetB2 224x224x3 FP16 16 2306
ImageNet Classification EfficientNetB3 224x224x3 FP16 16 1527
ImageNet Classification EfficientNetB4 224x224x3 FP16 16 1134
ImageNet Classification EfficientNetB5 224x224x3 FP16 16 780
COCO Object Detection DDETR + RN50 960x544x3 FP16 1 107
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 1 70
COCO Object Detection DINO + RN50 960x544x3 FP16 1 79.6
COCO Object Detection DINO + FAN-S 960x544x3 FP16 1 44
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 1 56.6
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 26.4
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 4 750
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 4 581
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 4 442
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 4 341
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 34.1
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 34.1
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 4 603
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 4 430
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 4375
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 32 2821
OCDNet DCN-RN18 640x640x3 FP16 1 333
OCDNet DCN-RN50 640x640x3 FP16 1 169
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 8036
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 64 6045
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 8 2543
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 17.8
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 15.7
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 1 83.4
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 4 344
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 8 131.48
CenterPose DLA34 512x512x3 FP16 1 40.68
CenterPose FAN-S-H 512x512x3 FP16 1 18.69

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

PeopleNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 749
PeopleNet (v2.3) DetectNet_v2 - ResNet34 960x544x3 INT8 32 581
PeopleNet (v2.5 unpruned) DetectNet_v2 - ResNet34 960x544x3 INT8 32 231
TrafficCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 916
DashCamNet DetectNet_v2 - ResNet18 960x544x3 INT8 32 865
FaceDetect-IR DetectNet_v2 - ResNet18 384x240x3 INT8 64 4982
VehicleMakeNet ResNet18 224x224x3 INT8 128 8000
VehicleTypeNet ResNet18 224x224x3 INT8 128 6302
FaceDetect (pruned) DetectNet_v2 - ResNet18 736x416x3 INT8 32 1174
License Plate Detection v1.0 DetectNet_v2 - ResNet18 640x480x3 INT8 128 2570
License Plate Recognition ResNet 96x48x3 FP16 128 2180
Facial landmark 80x80x1 FP16 256 2800
GazeNet 224x224x1, 224x224x1, 224x224x1, 25x25x1 FP16 256 2488
GestureNet 160x160x3 FP16 256 7690
BodyPose 288x384x3 INT8 16 278
Action Recognition 2D RGB 2D CNN 224x224x96 FP16 8 1044
Action Recognition 3D RGB 3D CNN 224x224x32x3 FP16 4 56
Action Recognition 2D OF 2D Optical Flow 224x224x96 FP16 16 1419
Action Recognition 3D OF 3D Optical Flow 224x224x32x3 FP16 2 58
Point Pillar Point Pillar FP16 1 63
Pose classification ST-GCN FP16 64 211.5
3D Pose - Accuracy FP16 32 370.13
3D Pose - Performance FP16 32 471.81
PeopleSemSegNet_v2 - Shuffle UNET - Shuffle 960x544x3 FP16 16 631.31
PeopleSemSegNet_v2 - Vanilla UNET - Vanilla 960x544x3 FP16 16 44.09
PeopleNet Transformer D-DETR 960x544x3 FP16 4 27
ReidentificationNet ResNet50 3x256x18 FP16 64 1570
RetailObjectDetection v1.0 - 100 class EfficientDet-D5 416x416x3 FP16 32 106
RetailObjectDetection v1.0 - binary EfficientDet-D5 416x416x3 FP16 32 99
RetailObjectEmbedding v1.0 ResNet101 3x224x224 FP16 32 803
CitySemSegFormer v1.0 SegFormer FP16 1 1.3
ImageNet Classification FAN-T-H 224x224x3 FP16 16 399
ImageNet Classification FAN-S-H 224x224x3 FP16 16 256
ImageNet Classification FAN-B-H 224x224x3 FP16 16 168
ImageNet Classification FAN-L-H 224x224x3 FP16 16 118
ImageNet Classification FAN-XL-H 224x224x3 FP16 8 103
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 16 666
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 32 492
ImageNet Classification GC-ViT-T 224x224x3 FP16 32 336
ImageNet Classification GC-ViT-S 224x224x3 FP16 16 213
ImageNet Classification GC-ViT-B 224x224x3 FP16 16 137
ImageNet Classification GC-ViT-L 224x224x3 FP16 16 67
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 2 26
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 8 55.5
ImageNet Classification FAN-L-H-384 384x384x3 FP16 1 38
ImageNet Classification EfficientNetB0 224x224x3 FP16 64 1870
ImageNet Classification EfficientNetB1 224x224x3 FP16 64 726
ImageNet Classification EfficientNetB2 224x224x3 FP16 64 690
ImageNet Classification EfficientNetB3 224x224x3 FP16 64 518
ImageNet Classification EfficientNetB4 224x224x3 FP16 32 363
ImageNet Classification EfficientNetB5 224x224x3 FP16 32 265
COCO Object Detection DDETR + RN50 960x544x3 FP16 1 27.7
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 1 17.9
COCO Object Detection DINO + RN50 960x544x3 FP16 1 22.5
COCO Object Detection DINO + FAN-S 960x544x3 FP16 1 11.7
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 1 15.7
COCO Object Detection DINO + FAN-L 960x544x3 FP16 1 6.7
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 16 230
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 16 174
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 16 129
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 16 97.7
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 1 8.9
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 1 8.8
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 16 161
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 16 113
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 16 1200
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 32 1391
OCDNet DCN-RN18 640x640x3 FP16 8 93
OCDNet DCN-RN50 640x640x3 FP16 1 45.5
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 128 2094
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 128 1482
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 32 688
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 1 5.8
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 1 4.4
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 1 23.3
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 16 95.81
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 16 36.02
CenterPose DLA34 512x512x3 FP16 1 15.69
CenterPose FAN-S-H 512x512x3 FP16 1 5.80

Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS

ImageNet Classification FAN-T-H 224x224x3 FP16 128 6555
ImageNet Classification FAN-S-H 224x224x3 FP16 128 4393
ImageNet Classification FAN-B-H 224x224x3 FP16 64 2833
ImageNet Classification FAN-L-H 224x224x3 FP16 64 1982
ImageNet Classification FAN-XL-H 224x224x3 FP16 32 1692
ImageNet Classification GC-ViT-xxTiny 224x224x3 FP16 128 11942
ImageNet Classification GC-ViT-xTiny 224x224x3 FP16 128 9094
ImageNet Classification GC-ViT-T 224x224x3 FP16 128 6381
ImageNet Classification GC-ViT-S 224x224x3 FP16 64 4145
ImageNet Classification GC-ViT-B 224x224x3 FP16 64 3040
ImageNet Classification GC-ViT-L 224x224x3 FP16 64 1618
ImageNet Classification GC-ViT-L-384 384x384x3 FP16 32 613
ImageNet Classification FAN-B-H-384 (384 resolution) 384x384x3 FP16 64 985
ImageNet Classification FAN-L-H-384 384x384x3 FP16 64 681
ImageNet Classification EfficientNetB0 224x224x3 FP16 256 28765
ImageNet Classification EfficientNetB1 224x224x3 FP16 256 11363
ImageNet Classification EfficientNetB2 224x224x3 FP16 256 10830
ImageNet Classification EfficientNetB3 224x224x3 FP16 256 8115
ImageNet Classification EfficientNetB4 224x224x3 FP16 64 5915
ImageNet Classification EfficientNetB5 224x224x3 FP16 256 4323
COCO Object Detection DDETR + RN50 960x544x3 FP16 32 526
COCO Object Detection DDETR + GC-ViT-T 960x544x3 FP16 32 317
COCO Object Detection DINO + RN50 960x544x3 FP16 32 442
COCO Object Detection DINO + FAN-S 960x544x3 FP16 32 213
COCO Object Detection DINO + GC-ViT-T 960x544x3 FP16 32 290
COCO Object Detection DINO + FAN-L 960x544x3 FP16 32 125.5
Cityscapes Segmentation SegFormer + FAN-T-H 224x224x3 FP16 64 4189
Cityscapes Segmentation SegFormer + FAN-S-H 224x224x3 FP16 64 3143
Cityscapes Segmentation SegFormer + FAN-B-H 224x224x3 FP16 64 2306
Cityscapes Segmentation SegFormer + FAN-L-H 224x224x3 FP16 64 1732
RetailObjectDetection v2.0 - binary EfficientDet-D5 960x544x3 FP16
RetailObjectDetection v2.0 - binary DINO-FAN_base 960x544x3 FP16 32 167
RetailObjectDetection v2.0 - Meta DINO-FAN_base 960x544x3 FP16 32 167
RetailObjectEmbedding v2.0 FAN-B-H 224x224x3 FP16 64 2686
RetailObjectEmbedding v2.0 FAN-L-H 224x224x3 FP16 64 1919
RetailObjectEmbedding v2.0 NVCLIP-B 224x224x3 FP16 64 29070
Siamese Optical Inspection Siamese CNN 2x512x128x3 FP16 128 24107
OCDNet DCN-RN18 640x640x3 FP16 64 1468
OCDNet DCN-RN50 640x640x3 FP16 64 706
OCRNet ResNet50 + Bi-LSTM pruned 32x100x3 FP16 512 55717
OCRNet ResNet50 + Bi-LSTM unpruned 32x100x3 FP16 512 48492
PCB Inspection GC-ViT-xxTiny 224x224x3 FP16 128 12273
CitySemSegFormer v2.0 SegFormer + MIT 1024x1024x3 FP16 8 108
CitySemSegFormer v2.0 SegFormer + FAN-B-H 1024x1024x3 FP16 8 89.5
PeopleSemSegFormer v2.0 SegFormer + FAN-B-H 512x512x3 FP16 32 454
Visual ChangeNet Classification Visual ChangeNet + FAN-S-H 512x128x3 FP16 64 1581
Visual ChangeNet Segmentation Visual ChangeNet + FAN-B-H 256x256x3 FP16 32 841
CenterPose DLA34 512x512x3 FP16 1 105.84
CenterPose FAN-S-H 512x512x3 FP16 1 76.26
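
The GPU BS and GPU FPS columns can be combined to estimate how long a single batch takes to process. The sketch below assumes that GPU FPS is the aggregate number of frames processed per second at the listed batch size; the tables do not state this explicitly, and the function name `batch_latency_ms` is a hypothetical helper introduced only for illustration.

```python
def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Estimate the time to process one batch, in milliseconds.

    Assumes `fps` is aggregate throughput (frames per second) measured
    at `batch_size`. This is an interpretation of the tables above,
    not a documented definition.
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return 1000.0 * batch_size / fps


# Two rows copied from the last table above: (model, GPU BS, GPU FPS).
rows = [
    ("CenterPose DLA34", 1, 105.84),
    ("OCDNet DCN-RN50", 64, 706.0),
]

for name, bs, fps in rows:
    print(f"{name}: ~{batch_latency_ms(bs, fps):.1f} ms per batch of {bs}")
```

Under this assumption, a larger batch size raises throughput but also lengthens the wait for each batch, which matters when comparing rows measured at different batch sizes.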

General purpose computer vision models

With general-purpose models, you can train an image classification, object detection, or segmentation model.

  • For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.

  • For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.

  • For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for a wide range of applications, from smaller, lightweight models for edge GPUs to larger models for more complex tasks. For the supported combinations of backbone and model, refer to the tables below.
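
The architecture choices listed above can be written down as a small lookup table, which can be handy for checking a requested task and architecture pair before launching an experiment. This is only an illustrative sketch: the task keys and the `is_supported` helper are hypothetical names and are not part of the TAO API.

```python
# Architectures named in the list above, grouped by task.
# The task keys and this helper are illustrative, not TAO identifiers.
GENERAL_PURPOSE_ARCHITECTURES = {
    "classification": {
        "ResNet", "EfficientNet", "VGG", "MobileNet",
        "GoogLeNet", "SqueezeNet", "DarkNet",
    },
    "object_detection": {
        "YOLOv3", "YOLOv4", "YOLOv4-tiny", "FasterRCNN",
        "SSD", "RetinaNet", "DSSD", "DetectNet_v2",
    },
    "instance_segmentation": {"MaskRCNN"},
    "semantic_segmentation": {"UNET"},
}


def is_supported(task: str, architecture: str) -> bool:
    """Return True if `architecture` is listed for `task` above."""
    return architecture in GENERAL_PURPOSE_ARCHITECTURES.get(task, set())


print(is_supported("object_detection", "RetinaNet"))
print(is_supported("classification", "MaskRCNN"))
```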

Backbone | Re-identification | Metric Learning Recognition

NvDINOv2 X
GcViT
ViT X
FAN
FasterViT
ResNet X X
Swin X
EfficientNet
ST-GCN (graph convolutional network)
MIT-b

Backbone | Image classification

NvDINOv2 X
GcViT X
ViT X
FAN X
FasterViT X
ResNet X
Swin X
EfficientNet X
ST-GCN (graph convolutional network)
MIT-b

Backbone | DINO | D-DETR | Grounding DINO | EfficientDet

NvDINOv2 X
GcViT X X
ViT X X
FAN X
FasterViT
ResNet X X
Swin X
EfficientNet X
ST-GCN (graph convolutional network)
MIT-b

Backbone | MAL | Mask GroundingDINO | Mask2Former

NvDINOv2
GcViT
ViT X
FAN
FasterViT
ResNet
Swin X X
EfficientNet
ST-GCN (graph convolutional network)
MIT-b

Backbone | SegFormer | Mask2Former

NvDINOv2
GcViT
ViT
FAN X
FasterViT
ResNet
Swin X
EfficientNet
ST-GCN (graph convolutional network)
MIT-b X

Backbone | Mask2Former

NvDINOv2
GcViT
ViT
FAN
FasterViT
ResNet
Swin X
EfficientNet
ST-GCN (graph convolutional network)
MIT-b

Backbone | OCD | OCR

NvDINOv2
GcViT
ViT
FAN X X
FasterViT
ResNet X X
Swin
EfficientNet
ST-GCN (graph convolutional network)
MIT-b

Backbone | Classification | Segmentation

NvDINOv2 X X
GcViT
ViT X X
FAN X X
FasterViT
ResNet
Swin
EfficientNet
ST-GCN (graph convolutional network)
MIT-b

Backbone | Pose Classification

NvDINOv2
GcViT
ViT
FAN
FasterViT
ResNet
Swin
EfficientNet
ST-GCN (graph convolutional network) X
MIT-b

Computer Vision Feature Summary

The table below summarizes the computer vision models and the features that each one supports.

Feature Summary

CV Task | Model | New in 22-05 | New in TAO 4.0 | New in TAO 5.0 | New in TAO 5.5 | Pruning | Knowledge Distillation | QAT | AutoML | REST API | Channel-wise QAT | Class weighting | Visualization (TB) | MLOps integration (W&B/ClearML) | BYOM | Multi-node | Multi-GPU | AMP | Early Stopping | Framework | Annotation Format | DLA

Classification ResNet10/18/34/50/101 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification VGG16/19 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification GoogLeNet No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification MobileNet_v1/v2 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification SqueezeNet No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification DarkNet19/53 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification EfficientNet_B0-B7 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification CSPDarkNet19/53 No No No No yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification CSPDarkNet-Tiny No No No No Yes No No yes yes no no yes clearml,wandb yes yes yes yes No tf1 ImageNet yes
Classification EfficientNet_B0-B5 No No No No yes No No yes yes yes no yes clearml,wandb yes yes yes yes No tf2 ImageNet yes
Classification GcViT No No Yes No No No No yes yes no yes no no no yes yes yes No pyt ImageNet no
Classification FAN No No Yes No No No No yes yes no yes no no no yes yes yes No pyt ImageNet no
Detection EfficientDet No No No No yes No no yes no yes no yes clearml,wandb No yes yes yes no tf2 COCO yes
Detection RetinaNet No No No No yes No yes yes yes no yes yes clearml,wandb No yes yes yes yes tf1 KITTI/COCO yes
Detection DetectNet_v2 No No No No yes No yes yes yes no yes yes clearml,wandb No yes yes yes yes tf1 KITTI/COCO yes
Detection SSD No No No No yes No yes yes yes no no yes clearml,wandb No yes yes yes yes tf1 KITTI/COCO yes
Detection DSSD No No No No yes No yes yes yes no no yes clearml,wandb No yes yes yes yes tf1 KITTI/COCO yes
Detection Deformable DETR Yes no No No no No no yes yes no no no wandb No yes yes yes no pyt COCO no
Detection DINO No no Yes No no yes no yes yes no no no wandb No yes yes yes no pyt COCO no
Detection GroundingDINO No no No Yes no No no yes yes no no no wandb No yes yes yes no pyt ODVG/COCO no
Detection BevFusion No No No Yes no No no no no no no no wandb No yes yes yes no pyt KITTI/nuScenes no
Multitask classification All classification No No No No yes No no yes yes no no yes clearml No yes yes yes no tf1 Custom yes
Instance/Panoptic/Semantic Segmentation Mask2Former No No No Yes no No no no no no no no wandb No yes yes yes no pyt COCO no
Instance Segmentation MaskRCNN No No No No yes No no yes yes no no yes clearml,wandb No yes yes yes no tf1 COCO no
Semantic Segmentation SegFormer Yes no No No no No no yes yes no no no no No yes no no no pyt Cityscapes - PNG no
Semantic Segmentation UNET No No No No yes No yes yes yes no no yes clearml,wandb yes yes yes yes no tf1 Cityscapes - PNG no
OCR LPR No No No No no No no yes yes no no yes no no yes yes yes yes tf1 Custom - txt file no
Key Points 2D body pose No No No No yes No no, but PTQ yes yes no no no no no yes yes yes no tf1 COCO no
Point Cloud PointPillars Yes No No No Yes No no yes yes no no no no no yes yes yes no pyt KITTI no
Action Recognition 2D action recognition RGB No No No No no No no yes yes no no no wandb no no yes yes no pyt Custom no
Action Recognition 3D action recognition RGB No No No No no No no yes yes no no no wandb no no yes yes no pyt Custom no
Action Recognition 2D action recognition OF No No No No no No no yes yes no no no wandb no no yes yes no pyt Custom no
Action Recognition 3D action recognition OF No No No No no No no yes yes no no no wandb no no yes yes no pyt Custom no
Other Pose action classification Yes No No No no No no yes yes no no no wandb no no yes yes no pyt COCO no
Other HeartRateNet No No No No no No no yes yes no no no no no no yes yes no tf1 NVIDIA Defined no
Other GazeNet No No No No no No no yes yes no no no no no no yes yes no tf1 NVIDIA Defined no
Other EmotionNet No No No No no No no yes no no no yes no no no no yes no tf1 NVIDIA Defined no
Other GestureNet No No No No no No no yes no no no no no no yes yes yes no tf1 NVIDIA Defined no
Feature Embedding ResNet50 Yes no No No no No no yes no no no no no no yes no no no pyt Market1501 No
Other OpticalInspection No No Yes No No No No yes yes no Yes (False Positive Rate Sampling) yes wandb no no yes no no pyt NVIDIA Defined Yes
Other Optical Character Detection No No Yes No Yes No No yes yes no no no wandb No Yes Yes no No pyt NVIDIA Defined no
OCR Optical Character Recognition No No Yes No Yes No No yes yes no no no wandb No yes Yes no No pyt NVIDIA Defined no
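
A table this wide is easier to query programmatically than to scan by eye. As a minimal sketch, the snippet below copies the Pruning, QAT, and Framework cells for a handful of rows and filters on them. The `ModelFeatures` class and the `select` function are hypothetical names introduced here for illustration; only the cell values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFeatures:
    task: str
    model: str
    pruning: bool
    qat: bool
    framework: str


# A few rows transcribed from the feature summary table above.
ROWS = [
    ModelFeatures("Detection", "EfficientDet", pruning=True, qat=False, framework="tf2"),
    ModelFeatures("Detection", "RetinaNet", pruning=True, qat=True, framework="tf1"),
    ModelFeatures("Detection", "DetectNet_v2", pruning=True, qat=True, framework="tf1"),
    ModelFeatures("Detection", "DINO", pruning=False, qat=False, framework="pyt"),
    ModelFeatures("Semantic Segmentation", "UNET", pruning=True, qat=True, framework="tf1"),
]


def select(rows, *, pruning=None, qat=None, framework=None):
    """Return model names whose features match every filter that is not None."""
    return [
        r.model
        for r in rows
        if (pruning is None or r.pruning == pruning)
        and (qat is None or r.qat == qat)
        and (framework is None or r.framework == framework)
    ]


print(select(ROWS, pruning=True, qat=True))
print(select(ROWS, framework="pyt"))
```

The same pattern extends to the other columns by adding fields to the dataclass and matching keyword arguments to `select`.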
© Copyright 2024, NVIDIA. Last updated on Oct 15, 2024.