Overview

TAO provides an extensive model zoo containing both foundation models and purpose-built models for numerous computer vision use cases.

Computer Vision Model Zoo

There are three types of pre-trained models that you can start with:

  • Foundation models: Large-scale machine learning models that are trained on vast quantities of data. These models are often trained using some form of self-supervised or semi-supervised learning. The primary goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.

  • Purpose-built pre-trained models: Highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can be used directly for inference or with TAO for transfer learning on your own dataset.

  • Pre-trained weights: These weights act only as a starting point for building more complex models. For computer vision use cases, they are trained on the Open Images dataset and provide a much better starting point for training than a random initialization of weights.

[Figure: TAO Toolkit models tree]

With the general-purpose vision models, you can choose from more than 100 combinations of model architecture and backbone.

Foundation Models

| Model name | Description | NGC Instance | TAO Finetuning |
|---|---|---|---|
| TAO Commercial C-RADIOv2 Model | RADIOv2 model distilled from multiple teachers on commercial data. | nvidia/tao/cradiov2 | Downstream finetuning with classification_pyt, RT-DETR, SegFormer, and visual changenet. |
| TAO Commercial ConvNext Model | ConvNext series of models pretrained using FC-MAE on commercial data. | nvidia/tao/convnext_v2 | Domain adaptation and linear probe with mae, and downstream finetuning with rtdetr for object detection. |
| TAO Commercial Pretrained NV-Dinov2 Model | TAO Commercial Pretrained NV-Dinov2 Model ViT-G backbone | nvidia/tao/nvdinov2_vitg | Yes |
| TAO Commercial Pretrained NV-CLIP Model | TAO Commercial Pretrained NV-CLIP ViT-H Model | nvidia/tao/nvclip_vit | No |
| SegIC | In-context segmentation model trained on commercial data. | nvidia/tao/segic | No |
| Mask Grounding DINO | Open vocabulary multi-modal instance segmentation model trained on commercial data. | nvidia/tao/mask_grounding_dino | Yes |
| Grounding DINO | Open vocabulary multi-modal object detection model trained on commercial data. | nvidia/tao/grounding_dino | Yes |
| Mask Auto Label | Pretrained model to generate semantic segmentation labels. | nvidia/tao/mask_auto_label | Yes |
| FoundationPose | 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box. | nvidia/tao/foundationpose | No |
| TAO Toolkit ODISE 1.1 | Open vocabulary multi-modal instance segmentation model trained on non-commercial datasets. | nvidia/tao/odise | Yes (from source only) |
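Each NGC Instance value in these tables has the form org/team/model. As a minimal sketch, and assuming the NGC CLI is installed and configured, the helper below turns such a value into an `ngc registry model download-version` call. The function name, the `<version>` tag, and the destination directory are illustrative placeholders, not values from this page; look up the actual version tags on the model's NGC page.

```python
import shlex
from typing import List


def build_download_command(ngc_instance: str, version: str, dest: str = ".") -> List[str]:
    """Assemble an NGC CLI call that downloads one version of a model.

    ngc_instance: a value from the "NGC Instance" column, e.g. "nvidia/tao/segic".
    version: hypothetical placeholder; real tags are listed on the model's NGC page.
    dest: hypothetical local directory to download into.
    """
    parts = ngc_instance.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected org/team/model, got {ngc_instance!r}")
    target = f"{ngc_instance}:{version}"
    return ["ngc", "registry", "model", "download-version", target, "--dest", dest]


# Print the command instead of running it, so the sketch has no side effects.
command = build_download_command("nvidia/tao/grounding_dino", "<version>", dest="./models")
print(shlex.join(command))
```

Returning an argument list rather than a single string keeps the call safe to pass to `subprocess.run` without shell quoting issues.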

Purpose-built models

| Model name | Description | NGC Instance | TAO Finetuning |
|---|---|---|---|
| BEVFusion for 3D Object Detection | BEVFusion model to detect 3D objects from point cloud and RGB data. | nvidia/tao/bevfusion | Yes |
| Action Recognition Net | 5 class action recognition network to recognize what people do in an image. | nvidia/tao/actionrecognitionnet | Yes |
| ReIdentificationNet Transformer | SWIN Transformer based Re-Identification network to generate embeddings for identifying persons in different scenes. | nvidia/tao/reidentificationnet_transformer | Yes |
| ReIdentificationNet | Re-Identification network to generate embeddings for identifying persons in different scenes. | nvidia/tao/reidentificationnet | Yes |
| CenterPose - Isaac ROS | 3D pose detection model for retail objects. | nvidia/tao/centerpose_ros | Yes |
| Optical Character Recognition | Model to recognize characters from a preceding OCDNet model. | nvidia/tao/ocrnet | Yes |
| Retail Object Detection | DINO (DETR with Improved DeNoising Anchor Boxes) based object detection network to detect retail objects on a checkout counter. | nvidia/tao/retail_object_detection | Yes |
| Pose Classification | Pose classification network to classify poses of people from their skeletons. | nvidia/tao/poseclassificationnet | Yes |
| PointPillarNet | Model to detect one or more objects from a LIDAR point cloud file and return 3D bounding boxes. | nvidia/tao/pointpillarnet | Yes |
| PeopleSegNet | 1 class instance segmentation network to detect and segment instances of people in an image. | nvidia/tao/peoplesegnet | Yes |
| Multiple 3D CenterPose | Single-stage, keypoint-based method for category-level object pose estimation. | nvidia/tao/multiclass_3d_centerpose | Yes |
| Retail Object Recognition | Embedding generator model to recognize objects on a checkout counter. | nvidia/tao/retail_object_recognition | Yes |
| PeopleSemSegformer | Model to segment persons in an image. | nvidia/tao/peoplesemsegformer | Yes |
| PeopleNet Transformer v2.0 | 3 class object detection network to detect people in an image. | nvidia/tao/peoplenet_transformer_v2 | Yes |
| Visual ChangeNet - Segmentation | Visual ChangeNet - Segmentation | nvidia/tao/visual_changenet_segmentation_landsatscd | Yes |
| CenterPose | 3D pose detection model for retail objects. | nvidia/tao/centerpose | Yes |
| Visual ChangeNet Segmentation - MvTEC | Change segmentation model. | nvidia/tao/visual_changenet_segmentation_mvtec | Yes |
| PCB Defect Classification | Model to classify defects in soldered components on a printed circuit board. | nvidia/tao/pcb_classification | Yes |
| Visual ChangeNet Segmentation - (Research-only) | Visual ChangeNet-Segmentation (research-only) | nvidia/tao/visual_changenet_segmentation_levircd | Yes |
| Visual ChangeNet Classification | Visual ChangeNet - Classification Models | nvidia/tao/visual_changenet_classification | Yes |
| Optical Inspection | Model to detect defects in soldered components on a printed circuit board. | nvidia/tao/optical_inspection | Yes |
| CitySemSegFormer | Semantic segmentation of persons in an image. | nvidia/tao/citysemsegformer | Yes |
| BodyPoseNet | Detect body pose from an image. | nvidia/tao/bodyposenet | Yes |
| Optical Character Detection | Network to detect characters in an image. | nvidia/tao/ocdnet | Yes |
| TrafficCamNet | 4 class object detection network to detect cars in an image. | nvidia/tao/trafficcamnet | Yes |
| BodyPose3DNet | 3D human pose estimation network to predict 34 keypoints in 3D of a person in an image. | nvidia/tao/bodypose3dnet | No |

Pre-trained Models

| Model name | Description | NGC Instance |
|---|---|---|
| Mask2Former | Binary instance segmentation model trained on COCO data. | nvidia/tao/mask2former |
| Pre-trained SegFormer ImageNet weights | Pre-trained SegFormer weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_segformer_imagenet |
| Pre-trained EfficientDet NvImageNet backbones | Pre-trained EfficientNet backbone weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_efficientdet_tf2_nvimagenet |
| Pre-trained EfficientDet Model trained on COCO | Pre-trained EfficientDet models trained on COCO to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_efficientdet_tf2_coco |
| Deformable DETR | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_deformable_detr_coco |
| DINO | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_dino_coco |
| TAO Pretrained DINO with Foundational Model Backbone | TAO Pretrained DINO with Foundational Model Backbone | nvidia/tao/dino_with_fm_backbone |
| DINO with NVDinoV2 backbone on COCO | Model trained on COCO2017 to detect and classify objects. | nvidia/tao/pretrained_nvdinov2_dino_coco |
| Pre-trained FasterViT based ImageNet Classification weights | Pre-trained FasterViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fastervit_classification_imagenet |
| Pre-trained FasterViT based NVImageNet Classification weights | Pre-trained FasterViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fastervit_classification_nvimagenet |
| Pre-trained GCViT ImageNet Classification weights | Pre-trained GCViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_gcvit_classification_imagenet |
| Pre-trained Deformable DETR NvImageNet weights | Pre-trained deformable_detr weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_deformable_detr_nvimagenet |
| Pre-trained FAN based ImageNet Classification weights | Pre-trained FAN weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fan_classification_imagenet |
| Pre-trained FAN based NVImageNet Classification weights | Pre-trained FAN weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_fan_classification_nvimagenet |
| Pre-trained DINO ImageNet weights | Pre-trained DINO weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_dino_imagenet |
| Pre-trained SegFormer - CityScapes | Pre-trained SegFormer models trained on CityScapes. | nvidia/tao/pretrained_segformer_cityscapes |
| Pre-trained SegFormer NvImageNet weights | Pre-trained SegFormer weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_segformer_nvimagenet |
| Pre-trained GCViT NVImageNet Classification weights | Pre-trained GCViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_gcvit_classification_nvimagenet |
| Pre-trained DINO NvImageNet weights | Pre-trained DINO weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_dino_nvimagenet |
| Pre-trained Deformable DETR ImageNet weights | Pre-trained deformable_detr weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_deformable_detr_imagenet |
| Pretrained Mask Auto Label | Pretrained model to facilitate transfer learning for MAL on TAO Toolkit. | nvidia/tao/pretrained_mask_auto_label |
| Pre-trained ImageNet backbone weights | Pre-trained weights trained on ImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_imagenet_backbones |
| Pre-trained NVImageNet backbone weights | Pre-trained weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit. | nvidia/tao/pretrained_nvimagenet_backbones |
| TAO Pretrained EfficientDet-TF2 | Pretrained efficientnet backbones for TAO Toolkit’s efficientdet-tf2 | nvidia/tao/pretrained_efficientdet_tf2 |
| TAO Pretrained Classification-TF2 | Pretrained backbones for TAO Toolkit TF2 image classification | nvidia/tao/pretrained_classification_tf2 |

Performance Metrics

The performance of these pretrained models across various NVIDIA platforms is summarized in the tables below. The numbers are inference throughput measured with the trtexec tool from the TensorRT samples. In the tables, BS is the batch size and FPS is frames per second.
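The exact trtexec invocations behind these numbers are not given here. As a rough sketch of how a comparable measurement might be set up, the helper below assembles a trtexec argument list for a given batch size, resolution, and precision. The ONNX file name, the input tensor name `input`, and the NCHW layout are assumptions for illustration only; check them against the model you export.

```python
from typing import List, Optional


def build_trtexec_command(onnx_path: str, input_name: str, batch_size: int,
                          width: int, height: int, channels: int,
                          precision: str = "fp16",
                          dla_core: Optional[int] = None) -> List[str]:
    """Assemble a trtexec call for a throughput measurement.

    onnx_path and input_name are hypothetical; the shape is written as
    NxCxHxW, which assumes a channels-first model input.
    """
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision!r}")
    shape = f"{input_name}:{batch_size}x{channels}x{height}x{width}"
    command = ["trtexec", f"--onnx={onnx_path}", f"--shapes={shape}", f"--{precision}"]
    if dla_core is not None:
        # Run on a DLA core and let unsupported layers fall back to the GPU.
        command += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return command


# A 960x544x3 input at batch size 16 in INT8, as in a typical detection row.
print(" ".join(build_trtexec_command("model.onnx", "input", 16, 960, 544, 3, "int8")))
```

Passing `dla_core=0` or `dla_core=1` adds the DLA flags, which is one plausible way to approximate the DLA columns; the GPU columns correspond to leaving it unset.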

| Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|---|
| PeopleNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 16 | 400 | 16 | 300 |
| PeopleNet (v2.3) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 32 | 314 | 32 | 226 |
| PeopleNet (v2.5 unpruned) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 16 | 140 | 32 | 70 |
| TrafficCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 16 | 457 | 16 | 352 |
| DashCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 32 | 479 | 64 | 358 |
| FaceDetect-IR | DetectNet_v2 - ResNet18 | 384x240x3 | INT8 | 64 | 2588 | 64 | 1700 |
| VehicleMakeNet | ResNet18 | 224x224x3 | INT8 | 64 | 4261 | 64 | 2218 |
| VehicleTypeNet | ResNet18 | 224x224x3 | INT8 | 64 | 3391 | 64 | 2044 |
| FaceDetect (pruned) | DetectNet_v2 - ResNet18 | 736x416x3 | INT8 | 32 | 613 | 32 | 492 |
| License Plate Detection v1.0 | DetectNet_v2 - ResNet18 | 640x480x3 | INT8 | 32 |  | 32 |  |
| License Plate Recognition | ResNet | 96x48x3 | FP16 | 128 | 1498 |  |  |
| Facial landmark |  | 80x80x1 | FP16 | 32 | 1606 |  |  |
| GazeNet |  | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 64 | 1241 |  |  |
| GestureNet |  | 160x160x3 | FP16 | 64 | 5420 |  |  |
| BodyPose |  | 288x384x3 | INT8 | 16 | 195 |  |  |
| Action Recognition 2D RGB | 2D CNN | 224x224x96 | FP16 | 32 | 577 |  |  |
| Action Recognition 3D RGB | 3D CNN | 224x224x32x3 | FP16 | 4 | 38 |  |  |
| Action Recognition 2D OF | 2D Optical Flow | 224x224x96 | FP16 | 16 | 826 |  |  |
| Action Recognition 3D OF | 3D Optical Flow | 224x224x32x3 | FP16 | 4 | 42 |  |  |
| Point Pillar | Point Pillar |  | FP16 | 1 | 38 |  |  |
| Pose classification | ST-GCN |  | FP16 | 8 | 105 |  |  |
| 3D Pose - Accuracy |  |  | FP16 | 16 | 241 |  |  |
| 3D Pose - Performance |  |  | FP16 | 16 | 295 |  |  |
| PeopleSemSegNet_v2 - Shuffle | UNET - Shuffle | 960x544x3 | FP16 | 16 | 289 |  |  |
| PeopleSemSegNet_v2 - Vanilla | UNET - Vanilla | 960x544x3 | FP16 | 4 | 27 |  |  |
| PeopleNet Transformer | D-DETR | 960x544x3 | FP16 | 2 | 13 |  |  |
| ReidentificationNet | ResNet50 | 3x256x18 | FP16 | 64 | 1049 |  |  |
| RetailObjectDetection v1.0 - 100 class | EfficientNet-D5 | 416x416x3 | FP16 | 16 | 48 |  |  |
| RetailObjectDetection v1.0 - binary | EfficientNet-D5 | 416x416x3 | FP16 | 16 | 45 |  |  |
| RetailObjectEmbedding v1.0 | ResNet101 | 3x224x224 | FP16 | 32 | 544 |  |  |
| CitySemSegFormer v1.0 | SegFormer |  | FP16 | 1 | 0.6 |  |  |
| ImageNet Classification | FAN-T-H | 224x224x3 | FP16 | 4 | 146.6 |  |  |
| ImageNet Classification | FAN-S-H | 224x224x3 | FP16 | 4 | 94.5 |  |  |
| ImageNet Classification | FAN-B-H | 224x224x3 | FP16 | 4 | 62 |  |  |
| ImageNet Classification | FAN-L-H | 224x224x3 | FP16 | 4 | 43.9 |  |  |
| ImageNet Classification | FAN-XL-H | 224x224x3 | FP16 | 4 | 39.9 |  |  |
| ImageNet Classification | GC-ViT-xxTiny | 224x224x3 | FP16 | 8 | 212 |  |  |
| ImageNet Classification | GC-ViT-xTiny | 224x224x3 | FP16 | 8 | 163 |  |  |
| ImageNet Classification | GC-ViT-T | 224x224x3 | FP16 | 8 | 110 |  |  |
| ImageNet Classification | GC-ViT-S | 224x224x3 | FP16 |  |  |  |  |
| ImageNet Classification | GC-ViT-B | 224x224x3 | FP16 | 4 | 51.2 |  |  |
| ImageNet Classification | GC-ViT-L | 224x224x3 | FP16 | 4 | 28.8 |  |  |
| ImageNet Classification | GC-ViT-L-384 | 384x384x3 | FP16 | 1 | 10.4 |  |  |
| ImageNet Classification | FAN-B-H-384 (384 resolution) | 384x384x3 | FP16 | 4 | 23.4 |  |  |
| ImageNet Classification | FAN-L-H-384 | 384x384x3 | FP16 |  |  |  |  |
| ImageNet Classification | EfficientNetB0 | 224x224x3 | FP16 | 64 | 870 |  |  |
| ImageNet Classification | EfficientNetB1 | 224x224x3 | FP16 | 64 | 344 |  |  |
| ImageNet Classification | EfficientNetB2 | 224x224x3 | FP16 | 32 | 313 |  |  |
| ImageNet Classification | EfficientNetB3 | 224x224x3 | FP16 | 32 | 234 |  |  |
| ImageNet Classification | EfficientNetB4 | 224x224x3 | FP16 | 32 | 175 |  |  |
| ImageNet Classification | EfficientNetB5 | 224x224x3 | FP16 | 16 | 123 |  |  |
| COCO Object Detection | DDETR + RN50 | 960x544x3 | FP16 | 1 | 9.6 |  |  |
| COCO Object Detection | DDETR + GC-ViT-T | 960x544x3 | FP16 | 1 | 5.2 |  |  |
| COCO Object Detection | DINO + RN50 | 960x544x3 | FP16 | 1 | 8.4 |  |  |
| COCO Object Detection | DINO + FAN-S | 960x544x3 | FP16 | 1 | 4.4 |  |  |
| COCO Object Detection | DINO + GC-ViT-T | 960x544x3 | FP16 | 1 | 4.9 |  |  |
| COCO Object Detection | DINO + FAN-L | 960x544x3 | FP16 | 1 | 2.6 |  |  |
| Cityscapes Segmentation | SegFormer + FAN-T-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-S-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-B-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-L-H | 224x224x3 | FP16 |  |  |  |  |
| RetailObjectDetection v2.0 - binary | EfficientDet-D5 | 960x544x3 | FP16 |  |  |  |  |
| RetailObjectDetection v2.0 - binary | DINO-FAN_base | 960x544x3 | FP16 | 1 | 3.4 |  |  |
| RetailObjectDetection v2.0 - Meta | DINO-FAN_base | 960x544x3 | FP16 | 1 | 3.4 |  |  |
| RetailObjectEmbedding v2.0 | FAN-B-H | 224x224x3 | FP16 | 4 | 59.8 |  |  |
| RetailObjectEmbedding v2.0 | FAN-L-H | 224x224x3 | FP16 | 4 | 41.9 |  |  |
| RetailObjectEmbedding v2.0 | NVCLIP-B | 224x224x3 | FP16 | 16 | 575 |  |  |
| Siamese Optical Inspection | Siamese CNN | 2x512x128x3 | FP16 | 16 | 482 |  |  |
| OCDNet | DCN-RN18 | 640x640x3 | FP16 | 2 | 45 |  |  |
| OCDNet | DCN-RN50 | 640x640x3 | FP16 | 2 | 22 |  |  |
| OCRNet | ResNet50 + Bi-LSTM pruned | 32x100x3 | FP16 | 64 | 1373 |  |  |
| OCRNet | ResNet50 + Bi-LSTM unpruned | 32x100x3 | FP16 | 64 | 1051 |  |  |
| PCB Inspection | GC-ViT-xxTiny | 224x224x3 | FP16 | 4 | 198 |  |  |
| CitySemSegFormer v2.0 | SegFormer + MIT | 1024x1024x3 | FP16 | 1 | 1.9 |  |  |
| CitySemSegFormer v2.0 | SegFormer + FAN-B-H | 1024x1024x3 | FP16 | 1 | 1.78 |  |  |
| PeopleSemSegFormer v2.0 | SegFormer + FAN-B-H | 512x512x3 | FP16 |  |  |  |  |
| Visual ChangeNet Classification | Visual ChangeNet + FAN-S-H | 512x128x3 | FP16 | 16 | 44.7 |  |  |
| Visual ChangeNet Segmentation | Visual ChangeNet + FAN-B-H | 256x256x3 | FP16 | 16 | 21.92 |  |  |
| CenterPose | DLA34 | 512x512x3 | FP16 | 1 | 27.74 |  |  |
| CenterPose | FAN-S-H | 512x512x3 | FP16 | 1 | 11.35 |  |  |
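Batch size and FPS together imply a time per batch. As a small worked sketch, and assuming FPS counts individual images per second across the whole batch (an assumption; it is not spelled out here), the time to process one batch is 1000 × BS / FPS milliseconds. For PeopleNet in the table above (BS 16, 400 FPS) that works out to 40 ms per batch. The function name is illustrative.

```python
def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Approximate time to process one batch, in milliseconds.

    Assumes fps is the number of images processed per second at this batch size.
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return 1000.0 * batch_size / fps


# (model, GPU BS, GPU FPS) taken from the first three detection rows above.
rows = [("PeopleNet", 16, 400.0), ("PeopleNet (v2.3)", 32, 314.0), ("TrafficCamNet", 16, 457.0)]
for name, batch_size, fps in rows:
    print(f"{name}: {batch_latency_ms(batch_size, fps):.1f} ms per batch")
```

Larger batches raise throughput but also raise the time before the first result of a batch is available, so the batch size that gives the best FPS is not always the best choice for latency-sensitive pipelines.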

| Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS | DLA1 + DLA2 BS | DLA1 + DLA2 FPS |
|---|---|---|---|---|---|---|---|
| PeopleNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 32 | 1116 | 32 | 528 |
| PeopleNet (v2.3) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 32 | 890 | 32 | 404 |
| PeopleNet (v2.5 unpruned) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 16 | 421 | 32 | 104 |
| TrafficCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 32 | 1268 | 32 | 594 |
| DashCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 32 | 1308 | 64 | 587 |
| FaceDetect-IR | DetectNet_v2 - ResNet18 | 384x240x3 | INT8 | 128 | 7462 | 128 | 2720 |
| VehicleMakeNet | ResNet18 | 224x224x3 | INT8 | 128 | 11872 | 128 | 3956 |
| VehicleTypeNet | ResNet18 | 224x224x3 | INT8 | 128 | 9815 | 128 | 3494 |
| FaceDetect (pruned) | DetectNet_v2 - ResNet18 | 736x416x3 | INT8 | 64 | 1700 | 64 | 870 |
| License Plate Detection v1.0 | DetectNet_v2 - ResNet18 | 640x480x3 | INT8 | 64 |  | 64 |  |
| License Plate Recognition | ResNet | 96x48x3 | FP16 | 128 | 4118 |  |  |
| Facial landmark |  | 80x80x1 | FP16 | 64 |  |  |  |
| GazeNet |  | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 128 | 3226 |  |  |
| GestureNet |  | 160x160x3 | FP16 | 128 | 15133 |  |  |
| BodyPose |  | 288x384x3 | INT8 | 16 | 559 |  |  |
| Action Recognition 2D RGB | 2D CNN | 224x224x96 | FP16 | 64 | 1577 |  |  |
| Action Recognition 3D RGB | 3D CNN | 224x224x32x3 | FP16 | 8 | 105 |  |  |
| Action Recognition 2D OF | 2D Optical Flow | 224x224x96 | FP16 | 32 | 1702 |  |  |
| Action Recognition 3D OF | 3D Optical Flow | 224x224x32x3 | FP16 | 4 | 109 |  |  |
| Point Pillar | Point Pillar |  | FP16 | 1 | 90 |  |  |
| Pose classification | ST-GCN |  | FP16 | 16 | 262 |  |  |
| 3D Pose - Accuracy |  |  | FP16 | 16 | 597 |  |  |
| 3D Pose - Performance |  |  | FP16 | 16 | 711 |  |  |
| PeopleSemSegNet_v2 - Shuffle | UNET - Shuffle | 960x544x3 | FP16 | 32 | 703 |  |  |
| PeopleSemSegNet_v2 - Vanilla | UNET - Vanilla | 960x544x3 | FP16 | 4 | 75 |  |  |
| PeopleNet Transformer | D-DETR | 960x544x3 | FP16 | 2 | 33 |  |  |
| ReidentificationNet | ResNet50 | 3x256x18 | FP16 | 64 | 2772 |  |  |
| RetailObjectDetection v1.0 - 100 class | EfficientNet-D5 | 416x416x3 | FP16 | 16 | 121 |  |  |
| RetailObjectDetection v1.0 - binary | EfficientNet-D5 | 416x416x3 | FP16 | 16 | 114 |  |  |
| RetailObjectEmbedding v1.0 | ResNet101 | 3x224x224 | FP16 | 32 | 1355 |  |  |
| CitySemSegFormer v1.0 | SegFormer |  | FP16 | 1 | 1.5 |  |  |
| ImageNet Classification | FAN-T-H | 224x224x3 | FP16 | 8 | 385 |  |  |
| ImageNet Classification | FAN-S-H | 224x224x3 | FP16 | 8 | 255 |  |  |
| ImageNet Classification | FAN-B-H | 224x224x3 | FP16 | 8 | 166 |  |  |
| ImageNet Classification | FAN-L-H | 224x224x3 | FP16 | 8 | 117 |  |  |
| ImageNet Classification | FAN-XL-H | 224x224x3 | FP16 | 8 | 106 |  |  |
| ImageNet Classification | GC-ViT-xxTiny | 224x224x3 | FP16 | 16 | 559 |  |  |
| ImageNet Classification | GC-ViT-xTiny | 224x224x3 | FP16 | 16 | 431 |  |  |
| ImageNet Classification | GC-ViT-T | 224x224x3 | FP16 | 16 | 293 |  |  |
| ImageNet Classification | GC-ViT-S | 224x224x3 | FP16 |  |  |  |  |
| ImageNet Classification | GC-ViT-B | 224x224x3 | FP16 | 8 | 138 |  |  |
| ImageNet Classification | GC-ViT-L | 224x224x3 | FP16 | 8 | 80.7 |  |  |
| ImageNet Classification | GC-ViT-L-384 | 384x384x3 | FP16 | 4 | 28.1 |  |  |
| ImageNet Classification | FAN-B-H-384 (384 resolution) | 384x384x3 | FP16 | 8 | 61.2 |  |  |
| ImageNet Classification | FAN-L-H-384 | 384x384x3 | FP16 |  |  |  |  |
| ImageNet Classification | EfficientNetB0 | 224x224x3 | FP16 | 64 | 2152 |  |  |
| ImageNet Classification | EfficientNetB1 | 224x224x3 | FP16 | 64 | 860 |  |  |
| ImageNet Classification | EfficientNetB2 | 224x224x3 | FP16 | 64 | 811 |  |  |
| ImageNet Classification | EfficientNetB3 | 224x224x3 | FP16 | 64 | 609 |  |  |
| ImageNet Classification | EfficientNetB4 | 224x224x3 | FP16 | 64 | 451 |  |  |
| ImageNet Classification | EfficientNetB5 | 224x224x3 | FP16 | 32 | 318 |  |  |
| COCO Object Detection | DDETR + RN50 | 960x544x3 | FP16 | 4 | 25.8 |  |  |
| COCO Object Detection | DDETR + GC-ViT-T | 960x544x3 | FP16 | 4 | 14.2 |  |  |
| COCO Object Detection | DINO + RN50 | 960x544x3 | FP16 | 4 | 22 |  |  |
| COCO Object Detection | DINO + FAN-S | 960x544x3 | FP16 | 4 | 11.2 |  |  |
| COCO Object Detection | DINO + GC-ViT-T | 960x544x3 | FP16 | 4 | 13 |  |  |
| COCO Object Detection | DINO + FAN-L | 960x544x3 | FP16 | 1 | 6.2 |  |  |
| Cityscapes Segmentation | SegFormer + FAN-T-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-S-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-B-H | 224x224x3 | FP16 |  |  |  |  |
| Cityscapes Segmentation | SegFormer + FAN-L-H | 224x224x3 | FP16 |  |  |  |  |
| RetailObjectDetection v2.0 - binary | EfficientDet-D5 | 960x544x3 | FP16 |  |  |  |  |
| RetailObjectDetection v2.0 - binary | DINO-FAN_base | 960x544x3 | FP16 | 1 | 8.1 |  |  |
| RetailObjectDetection v2.0 - Meta | DINO-FAN_base | 960x544x3 | FP16 | 1 | 8.1 |  |  |
| RetailObjectEmbedding v2.0 | FAN-B-H | 224x224x3 | FP16 | 8 | 161 |  |  |
| RetailObjectEmbedding v2.0 | FAN-L-H | 224x224x3 | FP16 | 8 | 112 |  |  |
| RetailObjectEmbedding v2.0 | NVCLIP-B | 224x224x3 | FP16 | 32 | 1541 |  |  |
| Siamese Optical Inspection | Siamese CNN | 2x512x128x3 | FP16 | 32 | 1538 |  |  |
| OCDNet | DCN-RN18 | 640x640x3 | FP16 | 4 | 120 |  |  |
| OCDNet | DCN-RN50 | 640x640x3 | FP16 |  |  |  |  |
| OCRNet | ResNet50 + Bi-LSTM pruned | 32x100x3 | FP16 | 128 | 3876 |  |  |
| OCRNet | ResNet50 + Bi-LSTM unpruned | 32x100x3 | FP16 | 128 | 2950 |  |  |
| PCB Inspection | GC-ViT-xxTiny | 224x224x3 | FP16 | 16 | 560 |  |  |
| CitySemSegFormer v2.0 | SegFormer + MIT | 1024x1024x3 | FP16 | 1 | 4.8 |  |  |
| CitySemSegFormer v2.0 | SegFormer + FAN-B-H | 1024x1024x3 | FP16 | 1 | 4.4 |  |  |
| PeopleSemSegFormer v2.0 | SegFormer + FAN-B-H | 512x512x3 | FP16 |  |  |  |  |
| Visual ChangeNet Classification | Visual ChangeNet + FAN-S-H | 512x128x3 | FP16 | 16 | 113.2 |  |  |
| Visual ChangeNet Segmentation | Visual ChangeNet + FAN-B-H | 256x256x3 | FP16 | 16 | 55 |  |  |
| CenterPose | DLA34 | 512x512x3 | FP16 | 1 | 57.04 |  |  |
| CenterPose | FAN-S-H | 512x512x3 | FP16 | 1 | 28.03 |  |  |
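One common use of these throughput figures is a first estimate of how many video streams a single model instance could serve. The sketch below divides a model's FPS by a per-stream frame rate; the 30 FPS default is a hypothetical camera rate, and the estimate ignores decode, pre-processing, post-processing, and any other models sharing the device, so treat it as an upper bound at best.

```python
import math


def max_streams(model_fps: float, stream_fps: float = 30.0) -> int:
    """Upper-bound estimate of concurrent streams for one model.

    model_fps: throughput from the table, in frames per second.
    stream_fps: hypothetical frame rate of each input stream.
    """
    if model_fps < 0 or stream_fps <= 0:
        raise ValueError("model_fps must be >= 0 and stream_fps must be > 0")
    return math.floor(model_fps / stream_fps)


# PeopleNet from the table above: 1116 FPS on GPU, 528 FPS on DLA1 + DLA2.
print("GPU streams:", max_streams(1116))
print("DLA streams:", max_streams(528))
```

Running inference on every second or third frame, as many analytics pipelines do, would be modeled here by lowering `stream_fps` accordingly.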

| Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|---|
| ImageNet Classification | FAN-T-H | 224x224x3 | FP16 | 4 | 99.1 |
| ImageNet Classification | FAN-S-H | 224x224x3 | FP16 | 4 | 64.5 |
| ImageNet Classification | FAN-B-H | 224x224x3 | FP16 | 4 | 42.3 |
| ImageNet Classification | FAN-L-H | 224x224x3 | FP16 | 4 | 29.8 |
| ImageNet Classification | FAN-XL-H | 224x224x3 | FP16 | 4 | 27 |
| ImageNet Classification | GC-ViT-xxTiny | 224x224x3 | FP16 | 8 | 146 |
| ImageNet Classification | GC-ViT-xTiny | 224x224x3 | FP16 | 8 | 111 |
| ImageNet Classification | GC-ViT-T | 224x224x3 | FP16 | 8 | 75 |
| ImageNet Classification | GC-ViT-S | 224x224x3 | FP16 |  |  |
| ImageNet Classification | GC-ViT-B | 224x224x3 | FP16 | 4 | 35.2 |
| ImageNet Classification | GC-ViT-L | 224x224x3 | FP16 | 4 | 19.8 |
| ImageNet Classification | GC-ViT-L-384 | 384x384x3 | FP16 |  |  |
| ImageNet Classification | FAN-B-H-384 (384 resolution) | 384x384x3 | FP16 | 4 | 16 |
| ImageNet Classification | FAN-L-H-384 | 384x384x3 | FP16 |  |  |
| ImageNet Classification | EfficientNetB0 | 224x224x3 | FP16 | 32 | 571 |
| ImageNet Classification | EfficientNetB1 | 224x224x3 | FP16 | 32 | 227 |
| ImageNet Classification | EfficientNetB2 | 224x224x3 | FP16 | 16 | 208 |
| ImageNet Classification | EfficientNetB3 | 224x224x3 | FP16 | 16 | 155 |
| ImageNet Classification | EfficientNetB4 | 224x224x3 | FP16 | 16 | 116 |
| ImageNet Classification | EfficientNetB5 | 224x224x3 | FP16 | 8 | 81 |
| COCO Object Detection | DDETR + RN50 | 960x544x3 | FP16 | 1 | 6.6 |
| COCO Object Detection | DDETR + GC-ViT-T | 960x544x3 | FP16 | 1 | 3.6 |
| COCO Object Detection | DINO + RN50 | 960x544x3 | FP16 | 1 | 5.7 |
| COCO Object Detection | DINO + FAN-S | 960x544x3 | FP16 | 1 | 3.1 |
| COCO Object Detection | DINO + GC-ViT-T | 960x544x3 | FP16 | 1 | 3.3 |
| COCO Object Detection | DINO + FAN-L | 960x544x3 | FP16 | 1 | 1.8 |
| Cityscapes Segmentation | SegFormer + FAN-T-H | 224x224x3 | FP16 |  |  |
| Cityscapes Segmentation | SegFormer + FAN-S-H | 224x224x3 | FP16 |  |  |
| Cityscapes Segmentation | SegFormer + FAN-B-H | 224x224x3 | FP16 |  |  |
| Cityscapes Segmentation | SegFormer + FAN-L-H | 224x224x3 | FP16 |  |  |
| RetailObjectDetection v2.0 - binary | EfficientDet-D5 | 960x544x3 | FP16 |  |  |
| RetailObjectDetection v2.0 - binary | DINO-FAN_base | 960x544x3 | FP16 | 1 | 2.3 |
| RetailObjectDetection v2.0 - Meta | DINO-FAN_base | 960x544x3 | FP16 | 1 | 2.3 |
| RetailObjectEmbedding v2.0 | FAN-B-H | 224x224x3 | FP16 | 4 | 40.2 |
| RetailObjectEmbedding v2.0 | FAN-L-H | 224x224x3 | FP16 | 4 | 28.4 |
| RetailObjectEmbedding v2.0 | NVCLIP-B | 224x224x3 | FP16 | 16 | 393 |
| Siamese Optical Inspection | Siamese CNN | 2x512x128x3 | FP16 | 16 | 399 |
| OCDNet | DCN-RN18 | 640x640x3 | FP16 |  |  |
| OCDNet | DCN-RN50 | 640x640x3 | FP16 |  |  |
| OCRNet | ResNet50 + Bi-LSTM pruned | 32x100x3 | FP16 | 64 | 935 |
| OCRNet | ResNet50 + Bi-LSTM unpruned | 32x100x3 | FP16 | 64 | 715 |
| PCB Inspection | GC-ViT-xxTiny | 224x224x3 | FP16 | 4 | 133.9 |
| CitySemSegFormer v2.0 | SegFormer + MIT | 1024x1024x3 | FP16 | 1 | 1.36 |
| CitySemSegFormer v2.0 | SegFormer + FAN-B-H | 1024x1024x3 | FP16 | 1 | 1.2 |
| PeopleSemSegFormer v2.0 | SegFormer + FAN-B-H | 512x512x3 | FP16 |  |  |
| Visual ChangeNet Classification | Visual ChangeNet + FAN-S-H | 512x128x3 | FP16 | 16 | 31 |
| Visual ChangeNet Segmentation | Visual ChangeNet + FAN-B-H | 256x256x3 | FP16 | 16 | 15.2 |
| CenterPose | DLA34 | 512x512x3 | FP16 | 1 | 19.16 |
| CenterPose | FAN-S-H | 512x512x3 | FP16 | 1 | 7.94 |

| Model Name | Arch | Inference resolution | Precision | GPU BS | GPU FPS |
|---|---|---|---|---|---|
| PeopleNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 128 | 8500 |
| PeopleNet (v2.3) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 64 | 6245 |
| PeopleNet (v2.5 unpruned) | DetectNet_v2 - ResNet34 | 960x544x3 | INT8 | 64 | 3291 |
| TrafficCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 256 | 9717 |
| DashCamNet | DetectNet_v2 - ResNet18 | 960x544x3 | INT8 | 256 | 9500 |
| FaceDetect-IR | DetectNet_v2 - ResNet18 | 384x240x3 | INT8 | 256 | 51600 |
| VehicleMakeNet | ResNet18 | 224x224x3 | INT8 | 1024 | 88300 |
| VehicleTypeNet | ResNet18 | 224x224x3 | INT8 | 512 | 72300 |
| FaceDetect (pruned) | DetectNet_v2 - ResNet18 | 736x416x3 | INT8 | 256 | 14900 |
| License Plate Detection v1.0 | DetectNet_v2 - ResNet18 | 640x480x3 | INT8 | 256 | 23200 |
| License Plate Recognition | ResNet | 96x48x3 | FP16 | 256 | 27200 |
| Facial landmark |  | 80x80x1 | FP16 | 256 | 19600 |
| GazeNet |  | 224x224x1, 224x224x1, 224x224x1, 25x25x1 | FP16 | 1024 | 25394 |
| GestureNet |  | 160x160x3 | FP16 | 1024 | 94555 |
| BodyPose |  | 288x384x3 | INT8 | 16 | 3180 |
| Action Recognition 2D RGB | 2D CNN | 224x224x96 | FP16 | 32 | 12600 |
| Action Recognition 3D RGB | 3D CNN | 224x224x32x3 | FP16 | 16 | 797 |
| Action Recognition 2D OF | 2D Optical Flow | 224x224x96 | FP16 | 64 | 17535 |
| Action Recognition 3D OF | 3D Optical Flow | 224x224x32x3 | FP16 | 16 | 899 |
| Point Pillar | Point Pillar |  | FP16 | 1 | 425 |
| Pose classification | ST-GCN |  | FP16 | 64 | 2144.84 |
| 3D Pose - Accuracy |  |  | FP16 | 32 | 3466.34 |
| 3D Pose - Performance |  |  | FP16 | 32 | 4176.37 |
| PeopleSemSegNet_v2 - Shuffle | UNET - Shuffle | 960x544x3 | FP16 | 64 | 5745.79 |
| PeopleSemSegNet_v2 - Vanilla | UNET - Vanilla | 960x544x3 | FP16 | 16 | 496.34 |
| PeopleNet Transformer | D-DETR | 960x544x3 | FP16 | 16 | 267 |
| ReidentificationNet | ResNet50 | 3x256x18 | FP16 | 256 | 20781 |
| RetailObjectDetection v1.0 - 100 class | EfficientNet-D5 | 416x416x3 | FP16 | 64 | 1012 |
| RetailObjectDetection v1.0 - binary | EfficientNet-D5 | 416x416x3 | FP16 | 64 | 947 |
| RetailObjectEmbedding v1.0 | ResNet101 | 3x224x224 | FP16 | 64 | 9851 |
| CitySemSegFormer v1.0 | SegFormer |  | FP16 | 4 | 13 |
| ImageNet Classification | FAN-T-H | 224x224x3 | FP16 | 128 | 3867 |
| ImageNet Classification | FAN-S-H | 224x224x3 | FP16 | 128 | 2576 |
| ImageNet Classification | FAN-B-H | 224x224x3 | FP16 | 64 | 1692 |
| ImageNet Classification | FAN-L-H | 224x224x3 | FP16 | 64 | 1196 |
| ImageNet Classification | FAN-XL-H | 224x224x3 | FP16 | 32 | 1046 |
| ImageNet Classification | GC-ViT-xxTiny | 224x224x3 | FP16 | 128 | 6863 |
| ImageNet Classification | GC-ViT-xTiny | 224x224x3 | FP16 | 128 | 5162 |
| ImageNet Classification | GC-ViT-T | 224x224x3 | FP16 | 128 | 3600 |
| ImageNet Classification | GC-ViT-S | 224x224x3 | FP16 | 64 | 2435 |
| ImageNet Classification | GC-ViT-B | 224x224x3 | FP16 | 64 | 1663 |
| ImageNet Classification | GC-ViT-L | 224x224x3 | FP16 | 64 | 917 |
| ImageNet Classification | GC-ViT-L-384 | 384x384x3 | FP16 | 32 | 344 |
| ImageNet Classification | FAN-B-H-384 (384 resolution) | 384x384x3 | FP16 | 64 | 577 |
| ImageNet Classification | FAN-L-H-384 | 384x384x3 | FP16 | 64 | 402 |
| ImageNet Classification | EfficientNetB0 | 224x224x3 | FP16 | 256 | 17909 |
| ImageNet Classification | EfficientNetB1 | 224x224x3 | FP16 | 256 | 6994 |
| ImageNet Classification | EfficientNetB2 | 224x224x3 | FP16 | 256 | 6665 |
| ImageNet Classification | EfficientNetB3 | 224x224x3 | FP16 | 256 | 4987 |
| ImageNet Classification | EfficientNetB4 | 224x224x3 | FP16 | 64 | 3577 |
| ImageNet Classification | EfficientNetB5 | 224x224x3 | FP16 | 256 | 2583 |
| COCO Object Detection | DDETR + RN50 | 960x544x3 | FP16 | 32 | 296 |
| COCO Object Detection | DDETR + GC-ViT-T | 960x544x3 | FP16 | 32 | 184 |
| COCO Object Detection | DINO + RN50 | 960x544x3 | FP16 | 32 | 244 |
| COCO Object Detection | DINO + FAN-S | 960x544x3 | FP16 | 32 | 121 |
| COCO Object Detection | DINO + GC-ViT-T | 960x544x3 | FP16 | 32 | 165 |
| COCO Object Detection | DINO + FAN-L | 960x544x3 | FP16 | 16 | 70.6 |
| Cityscapes Segmentation | SegFormer + FAN-T-H | 224x224x3 | FP16 | 64 | 2381 |
| Cityscapes Segmentation | SegFormer + FAN-S-H | 224x224x3 | FP16 | 64 | 1808 |
| Cityscapes Segmentation | SegFormer + FAN-B-H | 224x224x3 | FP16 | 64 | 1338 |
| Cityscapes Segmentation | SegFormer + FAN-L-H | 224x224x3 | FP16 | 64 | 1000 |
| RetailObjectDetection v2.0 - binary | EfficientDet-D5 | 960x544x3 | FP16 |  |  |
| RetailObjectDetection v2.0 - binary | DINO-FAN_base | 960x544x3 | FP16 | 32 | 94.3 |
| RetailObjectDetection v2.0 - Meta | DINO-FAN_base | 960x544x3 | FP16 | 32 | 94.2 |
| RetailObjectEmbedding v2.0 | FAN-B-H | 224x224x3 | FP16 | 64 | 1621 |
| RetailObjectEmbedding v2.0 | FAN-L-H | 224x224x3 | FP16 | 64 | 1155 |
| RetailObjectEmbedding v2.0 | NVCLIP-B | 224x224x3 | FP16 | 64 | 15584 |
| Siamese Optical Inspection | Siamese CNN | 2x512x128x3 | FP16 | 128 | 15660 |
| OCDNet | DCN-RN18 | 640x640x3 | FP16 | 64 | 724 |
| OCDNet | DCN-RN50 | 640x640x3 | FP16 | 64 | 351 |
| OCRNet | ResNet50 + Bi-LSTM pruned | 32x100x3 | FP16 | 512 | 28297 |
| OCRNet | ResNet50 + Bi-LSTM unpruned | 32x100x3 | FP16 | 512 | 21212 |
| PCB Inspection | GC-ViT-xxTiny | 224x224x3 | FP16 | 128 | 7095 |
| CitySemSegFormer v2.0 | SegFormer + MIT | 1024x1024x3 | FP16 | 8 | 62.2 |
| CitySemSegFormer v2.0 | SegFormer + FAN-B-H | 1024x1024x3 | FP16 | 8 | 50.4 |
| PeopleSemSegFormer v2.0 | SegFormer + FAN-B-H | 512x512x3 | FP16 | 32 | 254 |
| Visual ChangeNet Classification | Visual ChangeNet + FAN-S-H | 512x128x3 | FP16 | 16 | 920 |
| Visual ChangeNet Segmentation | Visual ChangeNet + FAN-B-H | 256x256x3 | FP16 | 16 | 435 |
| CenterPose | DLA34 | 512x512x3 | FP16 | 1 | 82.16 |
| CenterPose | FAN-S-H | 512x512x3 | FP16 | 1 | 49.99 |

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

PeopleNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

4228

PeopleNet (v2.3)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

3160

PeopleNet (v2.5 unpruned)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

1603

TrafficCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

5082

DashCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

4900

FaceDetect-IR

DetectNet_v2 - ResNet18

384x240x3

INT8

128

27100

VehilceMakeNet

ResNet18

224x224x3

INT8

256

46200

VehicleTypeNet

ResNet18

224x224x3

INT8

128

37200

FaceDetect (pruned)

DetectNet_v2 - ResNet18

736x416x3

INT8

64

7700

License Plate Detection v1.0

DetectNet_v2 - ResNet18

640x480x3

INT8

128

12500

License Plate Recognition

ResNet

96x48x3

FP16

128

12400

Facial landmark

80x80x1

FP16

128

12400

GazeNet

224x224x1, 224x224x1, 224x224x1, 25x25x1

FP16

512

12321

GestureNet

160x160x3

FP16

512

47361

BodyPose

288x384x3

INT8

32

1596

Action Recognition 2D RGB

2D CNN

224x224x96

FP16

16

6000

Action Recognition 3D RGB

3D CNN

224x224x32x3

FP16

4

380

Action Recognition 2D OF

2D Optical Flow

224x224x96

FP16

32

8940

Action Recognition 3D OF

3D Optical Flow

224x224x32x3

FP16

16

461

Point Pillar

Point Pillar

FP16

1

271

Pose classification

ST-GCN

FP16

64

1121.68

3D Pose - Accuracy

FP16

32

1913.92

3D Pose - Performance

FP16

32

2241.83

PeopleSemSegNet_v2 - Shuffle

UNET - Shuffle

960x544x3

FP16

64

2862.76

PeopleSemSegNet_v2 - Vanilla

UNET - Vanilla

960x544x3

FP16

16

253.77

PeopleNet Transformer

D-DETR

960x544x3

FP16

16

135

ReidentificationNet

ResNet50

3x256x18

FP16

128

9649

RetailObjectDetection v1.0 - 100 class

EfficientNet-D5

416x416x3

FP16

64

485

RetailObjectDetection v1.0 - binary

EfficientNet-D5

416x416x3

FP16

64

450

RetailObjectEmbedding v1.0

ResNet101

3x224x224

FP16

64

4791

CitySemSegFormer v1.0

SegFormer

FP16

1

6.4

ImageNet Classification

FAN-T-H

224x224x3

FP16

32

1806

ImageNet Classification

FAN-S-H

224x224x3

FP16

32

1185

ImageNet Classification

FAN-B-H

224x224x3

FP16

16

769

ImageNet Classification

FAN-L-H

224x224x3

FP16

16

548

ImageNet Classification

FAN-XL-H

224x224x3

FP16

16

423

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

32

3105

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

32

2339

ImageNet Classification

GC-ViT-T

224x224x3

FP16

32

1658

ImageNet Classification

GC-ViT-S

224x224x3

FP16

16

1077

ImageNet Classification

GC-ViT-B

224x224x3

FP16

16

711

ImageNet Classification

GC-ViT-L

224x224x3

FP16

16

388

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

8

151

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

16

260

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

8

179

ImageNet Classification

EfficientNetB0

224x224x3

FP16

64

8361

ImageNet Classification

EfficientNetB1

224x224x3

FP16

64

3245

ImageNet Classification

EfficientNetB2

224x224x3

FP16

64

3078

ImageNet Classification

EfficientNetB3

224x224x3

FP16

64

2320

ImageNet Classification

EfficientNetB4

224x224x3

FP16

64

1669

ImageNet Classification

EfficientNetB5

224x224x3

FP16

64

1203

COCO Object Detection

DDETR+RN50

960x544x3

FP16

8

141

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

8

87

COCO Object Detection

DINO + RN50

960x544x3

FP16

8

115

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

4

56

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

8

77

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

8

33.4

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

16

1098

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

16

834

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

16

606

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

16

465

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

8

44.2

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

8

44

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

16

734

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

16

522

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

16

5886

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

32

6318

OCDNet

DCN-RN18

640x640x3

FP16

16

387

OCDNet

DCN-RN50

640x640x3

FP16

8

186.6

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

128

12122

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

128

9656

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

32

3221

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

4

29.3

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

4

23.7

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

8

116.8

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

16

436

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

16

204

CenterPose

DLA34

512x512x3

FP16

1

57.46

CenterPose

FAN-S-H

512x512x3

FP16

1

26.73

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

PeopleNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

3819

PeopleNet (v2.3)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

2568

PeopleNet (v2.5 unpruned)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

1007

TrafficCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

4754

DashCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

4600

FaceDetect-IR

DetectNet_v2 - ResNet18

384x240x3

INT8

128

26900

VehicleMakeNet

ResNet18

224x224x3

INT8

256

44800

VehicleTypeNet

ResNet18

224x224x3

INT8

256

31500

FaceDetect (pruned)

DetectNet_v2 - ResNet18

736x416x3

INT8

64

6000

License Plate Detection v1.0

DetectNet_v2 - ResNet18

640x480x3

INT8

256

13900

License Plate Recognition

ResNet

96x48x3

FP16

256

9000

Facial landmark

80x80x1

FP16

512

9600

GazeNet

224x224x1, 224x224x1, 224x224x1, 25x25x1

FP16

512

10718

GestureNet

160x160x3

FP16

512

35371

BodyPose

288x384x3

INT8

32

1334

Action Recognition 2D RGB

2D CNN

224x224x96

FP16

16

4600

Action Recognition 3D RGB

3D CNN

224x224x32x3

FP16

4

265

Action Recognition 2D OF

2D Optical Flow

224x224x96

FP16

32

6500

Action Recognition 3D OF

3D Optical Flow

224x224x32x3

FP16

16

284

Point Pillar

Point Pillar

FP16

1

246

Pose classification

ST-GCN

FP16

64

825.75

3D Pose - Accuracy

FP16

32

1286.05

3D Pose - Performance

FP16

32

1558.21

PeopleSemSegNet_v2 - Shuffle

UNET - Shuffle

960x544x3

FP16

64

2429.62

PeopleSemSegNet_v2 - Vanilla

UNET - Vanilla

960x544x3

FP16

16

180.04

PeopleNet Transformer

D-DETR

960x544x3

FP16

16

106

ReidentificationNet

ResNet50

3x256x18

FP16

128

6129

RetailObjectDetection v1.0 - 100 class

EfficientNet-D5

416x416x3

FP16

64

367

RetailObjectDetection v1.0 - binary

EfficientNet-D5

416x416x3

FP16

64

341

RetailObjectEmbedding v1.0

ResNet101

3x224x224

FP16

64

3105

CitySemSegFormer v1.0

SegFormer

FP16

1

4.5

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

PeopleNet

DetectNet_v2 - ResNet18

960x544x3

INT8

32

749

PeopleNet (v2.3)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

581

PeopleNet (v2.5 unpruned)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

231

TrafficCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

32

916

DashCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

32

865

FaceDetect-IR

DetectNet_v2 - ResNet18

384x240x3

INT8

64

4982

VehicleMakeNet

ResNet18

224x224x3

INT8

128

8000

VehicleTypeNet

ResNet18

224x224x3

INT8

128

6302

FaceDetect (pruned)

DetectNet_v2 - ResNet18

736x416x3

INT8

32

1174

License Plate Detection v1.0

DetectNet_v2 - ResNet18

640x480x3

INT8

128

2570

License Plate Recognition

ResNet

96x48x3

FP16

128

2180

Facial landmark

80x80x1

FP16

256

2800

GazeNet

224x224x1, 224x224x1, 224x224x1, 25x25x1

FP16

256

2488

GestureNet

160x160x3

FP16

256

7690

BodyPose

288x384x3

INT8

16

278

Action Recognition 2D RGB

2D CNN

224x224x96

FP16

8

1044

Action Recognition 3D RGB

3D CNN

224x224x32x3

FP16

4

56

Action Recognition 2D OF

2D Optical Flow

224x224x96

FP16

16

1419

Action Recognition 3D OF

3D Optical Flow

224x224x32x3

FP16

2

58

Point Pillar

Point Pillar

FP16

1

63

Pose classification

ST-GCN

FP16

64

211.5

3D Pose - Accuracy

FP16

32

370.13

3D Pose - Performance

FP16

32

471.81

PeopleSemSegNet_v2 - Shuffle

UNET - Shuffle

960x544x3

FP16

16

631.31

PeopleSemSegNet_v2 - Vanilla

UNET - Vanilla

960x544x3

FP16

16

44.09

PeopleNet Transformer

D-DETR

960x544x3

FP16

4

27

ReidentificationNet

ResNet50

3x256x18

FP16

64

1570

RetailObjectDetection v1.0 - 100 class

EfficientNet-D5

416x416x3

FP16

32

106

RetailObjectDetection v1.0 - binary

EfficientNet-D5

416x416x3

FP16

32

99

RetailObjectEmbedding v1.0

ResNet101

3x224x224

FP16

32

803

CitySemSegFormer v1.0

SegFormer

FP16

1

1.3

ImageNet Classification

FAN-T-H

224x224x3

FP16

16

399

ImageNet Classification

FAN-S-H

224x224x3

FP16

16

256

ImageNet Classification

FAN-B-H

224x224x3

FP16

16

168

ImageNet Classification

FAN-L-H

224x224x3

FP16

16

118

ImageNet Classification

FAN-XL-H

224x224x3

FP16

8

103

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

16

666

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

32

492

ImageNet Classification

GC-ViT-T

224x224x3

FP16

32

336

ImageNet Classification

GC-ViT-S

224x224x3

FP16

16

213

ImageNet Classification

GC-ViT-B

224x224x3

FP16

16

137

ImageNet Classification

GC-ViT-L

224x224x3

FP16

16

67

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

2

26

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

8

55.5

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

1

38

ImageNet Classification

EfficientNetB0

224x224x3

FP16

64

1870

ImageNet Classification

EfficientNetB1

224x224x3

FP16

64

726

ImageNet Classification

EfficientNetB2

224x224x3

FP16

64

690

ImageNet Classification

EfficientNetB3

224x224x3

FP16

64

518

ImageNet Classification

EfficientNetB4

224x224x3

FP16

32

363

ImageNet Classification

EfficientNetB5

224x224x3

FP16

32

265

COCO Object Detection

DDETR+RN50

960x544x3

FP16

1

27.7

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

1

17.9

COCO Object Detection

DINO + RN50

960x544x3

FP16

1

22.5

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

1

11.7

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

1

15.7

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

1

6.7

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

16

230

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

16

174

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

16

129

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

16

97.7

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

1

8.9

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

1

8.8

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

16

161

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

16

113

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

16

1200

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

32

1391

OCDNet

DCN-RN18

640x640x3

FP16

8

93

OCDNet

DCN-RN50

640x640x3

FP16

1

45.5

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

128

2094

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

128

1482

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

32

688

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

1

5.8

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

1

4.4

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

1

23.3

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

16

95.81

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

16

36.02

CenterPose

DLA34

512x512x3

FP16

1

15.69

CenterPose

FAN-S-H

512x512x3

FP16

1

5.80

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

ImageNet Classification

FAN-T-H

224x224x3

FP16

128

6555

ImageNet Classification

FAN-S-H

224x224x3

FP16

128

4393

ImageNet Classification

FAN-B-H

224x224x3

FP16

64

2833

ImageNet Classification

FAN-L-H

224x224x3

FP16

64

1982

ImageNet Classification

FAN-XL-H

224x224x3

FP16

32

1692

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

128

11942

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

128

9094

ImageNet Classification

GC-ViT-T

224x224x3

FP16

128

6381

ImageNet Classification

GC-ViT-S

224x224x3

FP16

64

4145

ImageNet Classification

GC-ViT-B

224x224x3

FP16

64

3040

ImageNet Classification

GC-ViT-L

224x224x3

FP16

64

1618

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

32

613

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

64

985

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

64

681

ImageNet Classification

EfficientNetB0

224x224x3

FP16

256

28765

ImageNet Classification

EfficientNetB1

224x224x3

FP16

256

11363

ImageNet Classification

EfficientNetB2

224x224x3

FP16

256

10830

ImageNet Classification

EfficientNetB3

224x224x3

FP16

256

8115

ImageNet Classification

EfficientNetB4

224x224x3

FP16

64

5915

ImageNet Classification

EfficientNetB5

224x224x3

FP16

256

4323

COCO Object Detection

DDETR+RN50

960x544x3

FP16

32

526

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

32

317

COCO Object Detection

DINO + RN50

960x544x3

FP16

32

442

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

32

213

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

32

290

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

32

125.5

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

64

4189

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

64

3143

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

64

2306

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

64

1732

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

32

167

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

32

167

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

64

2686

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

64

1919

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

64

29070

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

128

24107

OCDNet

DCN-RN18

640x640x3

FP16

64

1468

OCDNet

DCN-RN50

640x640x3

FP16

64

706

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

512

55717

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

512

48492

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

128

12273

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

8

108

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

8

89.5

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

32

454

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

64

1581

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

32

841

CenterPose

DLA34

512x512x3

FP16

1

105.84

CenterPose

FAN-S-H

512x512x3

FP16

1

76.26

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

ImageNet Classification

FAN-T-H

224x224x3

FP16

16

4249

ImageNet Classification

FAN-S-H

224x224x3

FP16

8

2647

ImageNet Classification

FAN-B-H

224x224x3

FP16

8

1720

ImageNet Classification

FAN-L-H

224x224x3

FP16

8

1154

ImageNet Classification

FAN-XL-H

224x224x3

FP16

8

1005

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

16

6282

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

16

4632

ImageNet Classification

GC-ViT-T

224x224x3

FP16

16

3118

ImageNet Classification

GC-ViT-S

224x224x3

FP16

8

2022

ImageNet Classification

GC-ViT-B

224x224x3

FP16

8

1282

ImageNet Classification

GC-ViT-L

224x224x3

FP16

4

628

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

2

247

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

4

558

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

4

366

ImageNet Classification

EfficientNetB0

224x224x3

FP16

32

19577

ImageNet Classification

EfficientNetB1

224x224x3

FP16

16

7723

ImageNet Classification

EfficientNetB2

224x224x3

FP16

16

7234

ImageNet Classification

EfficientNetB3

224x224x3

FP16

16

5063

ImageNet Classification

EfficientNetB4

224x224x3

FP16

16

3740

ImageNet Classification

EfficientNetB5

224x224x3

FP16

16

2671

COCO Object Detection

DDETR+RN50

960x544x3

FP16

1

283

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

1

182

COCO Object Detection

DINO + RN50

960x544x3

FP16

1

215

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

1

119.5

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

1

151

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

1

68.5

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

8

1985

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

8

1510

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

8

1144

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

8

840

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

1

88.5

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

1

88.6

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

8

1648

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

8

1124

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

32

9407

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

128

10390

OCDNet

DCN-RN18

640x640x3

FP16

4

940

OCDNet

DCN-RN50

640x640x3

FP16

4

445

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

128

18971

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

256

13950

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

16

6619

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

1

47.3

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

1

40.9

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

2

210

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

4

980

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

8

364

CenterPose

DLA34

512x512x3

FP16

1

106.70

CenterPose

FAN-S-H

512x512x3

FP16

1

56.65

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

ImageNet Classification

FAN-T-H

224x224x3

FP16

8

1604

ImageNet Classification

FAN-S-H

224x224x3

FP16

8

949

ImageNet Classification

FAN-B-H

224x224x3

FP16

4

632

ImageNet Classification

FAN-L-H

224x224x3

FP16

4

442

ImageNet Classification

FAN-XL-H

224x224x3

FP16

4

365

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

8

2440

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

8

1797

ImageNet Classification

GC-ViT-T

224x224x3

FP16

16

1266

ImageNet Classification

GC-ViT-S

224x224x3

FP16

8

817

ImageNet Classification

GC-ViT-B

224x224x3

FP16

8

546

ImageNet Classification

GC-ViT-L

224x224x3

FP16

8

268

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

4

101

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

4

207

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

4

145

ImageNet Classification

EfficientNetB0

224x224x3

FP16

16

7072

ImageNet Classification

EfficientNetB1

224x224x3

FP16

16

2454

ImageNet Classification

EfficientNetB2

224x224x3

FP16

16

2306

ImageNet Classification

EfficientNetB3

224x224x3

FP16

16

1527

ImageNet Classification

EfficientNetB4

224x224x3

FP16

16

1134

ImageNet Classification

EfficientNetB5

224x224x3

FP16

16

780

COCO Object Detection

DDETR+RN50

960x544x3

FP16

1

107

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

1

70

COCO Object Detection

DINO + RN50

960x544x3

FP16

1

79.6

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

1

44

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

1

56.6

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

1

26.4

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

4

750

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

4

581

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

4

442

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

4

341

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

1

34.1

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

1

34.1

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

4

603

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

4

430

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

16

4375

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

32

2821

OCDNet

DCN-RN18

640x640x3

FP16

1

333

OCDNet

DCN-RN50

640x640x3

FP16

1

169

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

128

8036

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

64

6045

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

8

2543

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

1

17.8

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

1

15.7

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

1

83.4

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

4

344

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

8

131.48

CenterPose

DLA34

512x512x3

FP16

1

40.68

CenterPose

FAN-S-H

512x512x3

FP16

1

18.69

Model Name

Arch

Inference resolution

Precision

GPU BS

GPU FPS

PeopleNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

1379

PeopleNet (v2.3)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

1064

PeopleNet (v2.5 unpruned)

DetectNet_v2 - ResNet34

960x544x3

INT8

32

465

TrafficCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

1725

DashCamNet

DetectNet_v2 - ResNet18

960x544x3

INT8

64

1676

FaceDetect-IR

DetectNet_v2 - ResNet18

384x240x3

INT8

128

9810

VehicleMakeNet

ResNet18

224x224x3

INT8

256

16500

VehicleTypeNet

ResNet18

224x224x3

INT8

128

12500

FaceDetect (pruned)

DetectNet_v2 - ResNet18

736x416x3

INT8

64

2578

License Plate Detection v1.0

DetectNet_v2 - ResNet18

640x480x3

INT8

128

6123

License Plate Recognition

ResNet

96x48x3

FP16

128

3959

Facial landmark

80x80x1

FP16

128

4622

GazeNet

224x224x1, 224x224x1, 224x224x1, 25x25x1

FP16

512

4563

GestureNet

160x160x3

FP16

512

15377

BodyPose

288x384x3

INT8

32

598

Action Recognition 2D RGB

2D CNN

224x224x96

FP16

16

1897

Action Recognition 3D RGB

3D CNN

224x224x32x3

FP16

4

139

Action Recognition 2D OF

2D Optical Flow

224x224x96

FP16

32

3320

Action Recognition 3D OF

3D Optical Flow

224x224x32x3

FP16

16

192

Point Pillar

Point Pillar

FP16

1

111

Pose classification

ST-GCN

FP16

64

376.4

3D Pose - Accuracy

FP16

32

614.98

3D Pose - Performance

FP16

32

712.94

PeopleSemSegNet_v2 - Shuffle

UNET - Shuffle

960x544x3

FP16

64

1027.85

PeopleSemSegNet_v2 - Vanilla

UNET - Vanilla

960x544x3

FP16

16

79.08

PeopleNet Transformer

D-DETR

960x544x3

FP16

4

48

ReidentificationNet

ResNet50

3x256x18

FP16

64

2586

RetailObjectDetection v1.0 - 100 class

EfficientNet-D5

416x416x3

FP16

32

162

RetailObjectDetection v1.0 - binary

EfficientNet-D5

416x416x3

FP16

32

151

RetailObjectEmbedding v1.0

ResNet101

3x224x224

FP16

32

1270

CitySemSegFormer v1.0

SegFormer

FP16

1

2

ImageNet Classification

FAN-T-H

224x224x3

FP16

16

624

ImageNet Classification

FAN-S-H

224x224x3

FP16

8

403

ImageNet Classification

FAN-B-H

224x224x3

FP16

8

259

ImageNet Classification

FAN-L-H

224x224x3

FP16

16

188

ImageNet Classification

FAN-XL-H

224x224x3

FP16

16

166

ImageNet Classification

GC-ViT-xxTiny

224x224x3

FP16

16

999

ImageNet Classification

GC-ViT-xTiny

224x224x3

FP16

32

739

ImageNet Classification

GC-ViT-T

224x224x3

FP16

16

517

ImageNet Classification

GC-ViT-S

224x224x3

FP16

8

335

ImageNet Classification

GC-ViT-B

224x224x3

FP16

4

223

ImageNet Classification

GC-ViT-L

224x224x3

FP16

16

122

ImageNet Classification

GC-ViT-L-384

384x384x3

FP16

4

45.1

ImageNet Classification

FAN-B-H-384 (384 resolution)

384x384x3

FP16

8

91

ImageNet Classification

FAN-L-H-384

384x384x3

FP16

4

62

ImageNet Classification

EfficientNetB0

224x224x3

FP16

64

2735

ImageNet Classification

EfficientNetB1

224x224x3

FP16

64

1099

ImageNet Classification

EfficientNetB2

224x224x3

FP16

32

1006

ImageNet Classification

EfficientNetB3

224x224x3

FP16

32

753

ImageNet Classification

EfficientNetB4

224x224x3

FP16

16

526

ImageNet Classification

EfficientNetB5

224x224x3

FP16

16

375

COCO Object Detection

DDETR+RN50

960x544x3

FP16

1

46.5

COCO Object Detection

DDETR + GCViT-T

960x544x3

FP16

4

30.3

COCO Object Detection

DINO + RN50

960x544x3

FP16

4

38.9

COCO Object Detection

DINO + FAN-S

960x544x3

FP16

4

20

COCO Object Detection

DINO + GC-ViT-T

960x544x3

FP16

8

26.7

COCO Object Detection

DINO + FAN-L

960x544x3

FP16

4

10.9

Cityscapes Segmentation

SegFormer + FAN-T-H

224x224x3

FP16

16

374

Cityscapes Segmentation

SegFormer + FAN-S-H

224x224x3

FP16

8

272

Cityscapes Segmentation

SegFormer + FAN-B-H

224x224x3

FP16

8

198

Cityscapes Segmentation

SegFormer + FAN-L-H

224x224x3

FP16

16

156.7

RetailObjectDetection v2.0 - binary

Efficientdet-D5

960x544x3

FP16

RetailObjectDetection v2.0 - binary

DINO-FAN_base

960x544x3

FP16

2

15.2

RetailObjectDetection v2.0 - Meta

DINO-FAN_base

960x544x3

FP16

4

15.4

RetailObjectEmbedding v2.0

FAN-B-H

224x224x3

FP16

8

253

RetailObjectEmbedding v2.0

FAN-L-H

224x224x3

FP16

16

184

RetailObjectEmbedding v2.0

NVCLIP-B

224x224x3

FP16

16

1838

Siamese Optical Inspection

Siamese CNN

2x512x128x3

FP16

32

2314

OCDNet

DCN-RN18

640x640x3

FP16

16

155

OCDNet

DCN-RN50

640x640x3

FP16

1

72.5

OCRNet

ResNet50 + Bi-LSTM pruned

32x100x3

FP16

128

3649

OCRNet

ResNet50 + Bi-LSTM unpruned

32x100x3

FP16

128

2673

PCB Inspection

GC-ViT-xxTiny

224x224x3

FP16

16

1012

CitySemSegFormer v2.0

SegFormer + MIT

1024x1024x3

FP16

1

9.4

CitySemSegFormer v2.0

SegFormer + FAN-B-H

1024x1024x3

FP16

1

7.3

PeopleSemSegFormer v2.0

SegFormer + FAN-B-H

512x512x3

FP16

4

39.6

Visual ChangeNet Classification

Visual ChangeNet + FAN-S-H

512x128x3

FP16

16

146

Visual ChangeNet Segmentation

Visual ChangeNet + FAN-B-H

256x256x3

FP16

16

59.7

CenterPose

DLA34

512x512x3

FP16

1

24.27

CenterPose

FAN-S-H

512x512x3

FP16

1

9.77
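
The GPU BS and GPU FPS columns in the performance tables can be combined into a rough per-batch latency. The sketch below assumes that GPU FPS is the aggregate throughput measured at the listed batch size; the tables do not state how FPS was measured, so treat the result as an estimate only. The helper name `batch_latency_ms` is hypothetical and not part of TAO.

```python
def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Approximate time to process one batch, in milliseconds.

    Assumes `fps` is aggregate throughput measured at `batch_size`
    (an assumption; the tables do not state the measurement method).
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return batch_size / fps * 1000.0


# Two rows copied from the last performance table above: (name, GPU BS, GPU FPS).
rows = [
    ("PeopleNet", 64, 1379.0),
    ("CenterPose (FAN-S-H)", 1, 9.77),
]

for name, bs, fps in rows:
    print(f"{name}: ~{batch_latency_ms(bs, fps):.1f} ms per batch of {bs}")
```

A high FPS at a large batch size can still mean tens of milliseconds per batch, so compare rows at the batch size you plan to deploy with.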

General purpose computer vision models#

With general purpose models, you can train an image classification, object detection, or segmentation model.

  • For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.

  • For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.

  • For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, lightweight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the tables below.

Backbone

Image classification

NVCLIP

C-RADIOv2

NvDINOv2

GcViT

ViT

FAN

FasterViT

ResNet

Swin

EfficientNet

Backbone

DINO

D-DETR

Grounding DINO

RT-DETR

EfficientDet

C-RADIOv2

ConvNext

NvDINOv2

GcViT

ViT

FAN

ResNet

Swin

EfficientNet

Backbone

MAL

Mask GroundingDINO

Mask2Former

ViT

Swin

Backbone

SegFormer

Mask2Former

C-RADIOv2

x

NvDINOv2

x

FAN

Swin

MIT-b

Backbone

Mask2Former

Swin

Backbone

OCD

OCR

FAN

ResNet

Backbone

Classification

Segmentation

C-RADIOv2

NvDINOv2

ViT

FAN

Backbone

Pose Classification

ST-GCN (graph convolutional network)

Backbone

Re-identification

Metric Learning Recognition

NvDINOv2

ViT

ResNet

Swin

Computer Vision Feature Summary#

The table below summarizes the computer vision models and the features enabled.

Feature Summary#

CV Task

Model

New in 22-05

New in TAO 4.0

New in TAO 5.0

New in TAO 5.5

New in TAO 6.0

Pruning

Knowledge Distillation

QAT

AutoML

REST API

Channel-wise QAT

Class weighting

Visualization (TB)

MLOps integration (W&B/ClearML)

Multi-node

Multi-GPU

AMP

Framework

Annotation Format

DLA

Self Supervised Learning

Mask Auto Encoder

No

No

No

No

Yes

No

yes

No

yes

yes

no

no

yes

wandb

yes

yes

yes

pyt

Unstructured/ImageNet

no

Self Supervised Learning

NvDINOv2

No

No

No

No

Yes

No

yes

No

yes

yes

no

no

yes

wandb

yes

yes

yes

pyt

Unstructured/ImageNet

no

Classification

ResNet18/34/50

No

No

No

No

No

yes

No

No

yes

yes

no

no

yes

clearml,wandb

yes

yes

yes

tf2

ImageNet

yes

Classification

EfficientNet_B0-B5

No

No

No

No

No

yes

No

No

yes

yes

yes

no

yes

clearml,wandb

yes

yes

yes

tf2

ImageNet

yes

Classification

GcViT

No

No

Yes

No

No

No

yes

No

yes

yes

no

yes

no

wandb

yes

yes

yes

pyt

ImageNet

no

Classification

FAN

No

No

Yes

No

No

No

yes

No

yes

yes

no

yes

no

wandb

yes

yes

yes

pyt

ImageNet

no

Classification

NvDINOv2

No

No

Yes

No

No

No

yes

No

yes

yes

no

yes

no

wandb

yes

yes

yes

pyt

ImageNet

no

Classification

NVCLIP

No

No

Yes

No

No

No

yes

No

yes

yes

no

yes

no

wandb

yes

yes

yes

pyt

ImageNet

no

Classification

C-RADIOv2

No

No

Yes

No

No

No

yes

No

yes

yes

no

yes

no

wandb

yes

yes

yes

pyt

ImageNet

no

Detection

EfficientDet

No

No

No

No

No

yes

No

no

yes

no

yes

no

yes

wandb

yes

yes

yes

tf2

COCO

yes

Detection

RT-DETR

Yes

no

No

No

Yes

no

yes

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

COCO

no

Detection

Deformable DETR

Yes

no

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

COCO

no

Detection

DINO

No

no

Yes

No

No

no

yes

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

COCO

no

Detection

GroundingDINO

No

no

No

Yes

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

ODVG/COCO

no

Detection

BevFusion

No

No

No

Yes

No

no

No

no

yes

no

no

no

no

wandb

yes

yes

yes

pyt

KITTI/nuScenes

no

Instance/Panoptic/Semantic Segmentation

Mask2Former

No

No

No

Yes

No

no

No

no

yes

no

no

no

no

wandb

yes

yes

yes

pyt

COCO

no

Semantic Segmentation

SegFormer

Yes

no

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

Cityscapes - PNG

no

Point Cloud

PointPillars

Yes

No

No

No

No

Yes

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

KITTI

no

Action Recognition

2D action recognition RGB

No

No

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

Custom

no

Action Recognition

3D action recognition RGB

No

No

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

Custom

no

Action Recognition

2D action recognition OF

No

No

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

Custom

no

Action Recognition

3D action recognition OF

No

No

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

Custom

no

Other

Pose action classification

Yes

No

No

No

No

no

No

no

yes

yes

no

no

no

wandb

yes

yes

yes

pyt

COCO

no

Feature Embedding

ResNet50

Yes

no

No

No

No

no

No

no

yes

no

no

no

no

no

yes

yes

yes

pyt

Market1501

No

Other

OpticalInspection

No

No

Yes

No

No

No

No

No

yes

yes

no

Yes (False Positive Rate Sampling)

yes

wandb

yes

yes

no

pyt

NVIDIA Defined

Yes

Other

Optical Character Detection

No

No

Yes

No

No

Yes

No

No

yes

yes

no

no

no

wandb

Yes

Yes

no

pyt

NVIDIA Defined

no

OCR

Optical Character Recognition

No

No

Yes

No

No

Yes

No

No

yes

yes

no

no

no

wandb

yes

Yes

no

pyt

NVIDIA Defined

no

Anomaly Detection

Visual ChangeNet

No

No

Yes

No

No

Yes

No

No

yes

yes

no

no

no

wandb

yes

Yes

no

pyt

NVIDIA Defined

no