Overview

TAO provides an extensive model zoo containing both foundational models and purpose-built models for numerous computer vision use cases.

Computer Vision Model Zoo

There are three types of pre-trained models that you can start with:

Foundation models: Foundation models are large-scale machine learning models trained on vast quantities of data. These models are often trained using some form of self-supervised or semi-supervised training algorithm. The primary goal of a foundation model is to serve as a starting point that can be adapted to a variety of downstream tasks.
Purpose-built pre-trained models: These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain-focused models can be used either directly for inference or with TAO for transfer learning on your own dataset.
Pre-trained weights: The pre-trained weights for these models merely act as a starting point for building more complex models. For computer vision use cases, these weights are trained on the Open Images dataset, and they provide a much better starting point for training than a random initialization of weights.

Figure: TAO Toolkit models tree (tao_toolkit_models_tree.png)

With the general-purpose vision models, you can choose from more than 100 combinations of model architecture and backbone.

Foundation Models

Model name	Description	NGC Instance	TAO Finetuning
TAO Commercial C-RADIOv2 Model	RADIOv2 model distilled from multiple teachers on commercial data.	nvidia/tao/cradiov2	Downstream finetuning with classification_pyt, RT-DETR, SegFormer and Visual ChangeNet
TAO Commercial ConvNext Model	ConvNext series of models pretrained using FC-MAE on commercial data.	nvidia/tao/convnext_v2	Domain adaptation and linear probe with `mae` and downstream finetuning with `rtdetr` for object detection.
TAO Commercial Pretrained NV-Dinov2 Model	TAO Commercial Pretrained NV-Dinov2 Model ViT-G backbone	nvidia/tao/nvdinov2_vitg	Yes
TAO Commercial Pretrained NV-CLIP Model	TAO Commercial Pretrained NV-CLIP ViT-H Model	nvidia/tao/nvclip_vit	No
SegIC	In-context segmentation model trained on commercial data.	nvidia/tao/segic	No
Mask Grounding DINO	Open vocabulary multi-modal instance segmentation model trained on commercial data.	nvidia/tao/mask_grounding_dino	Yes
Grounding DINO	Open vocabulary multi-modal object detection model trained on commercial data.	nvidia/tao/grounding_dino	Yes
Mask Auto Label	Pretrained model to generate semantic segmentation labels.	nvidia/tao/mask_auto_label	Yes
FoundationPose	6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box.	nvidia/tao/foundationpose	No
TAO Toolkit ODISE 1.1	Open vocabulary multi-modal instance segmentation model trained on non-commercial datasets.	nvidia/tao/odise	Yes (from source only)

Purpose-built Models

Model name	Description	NGC Instance	TAO Finetuning
BEVFusion for 3D Object Detection	BEVFusion model to detect 3D objects from point cloud and RGB data.	nvidia/tao/bevfusion	Yes
Action Recognition Net	5 class action recognition network to recognize what people do in an image.	nvidia/tao/actionrecognitionnet	Yes
ReIdentificationNet Transformer	SWIN Transformer based Re-Identification network to generate embeddings for identifying persons in different scenes.	nvidia/tao/reidentificationnet_transformer	Yes
ReIdentificationNet	Re-Identification network to generate embeddings for identifying persons in different scenes.	nvidia/tao/reidentificationnet	Yes
CenterPose - Isaac ROS	3D pose detection model for retail objects.	nvidia/tao/centerpose_ros	Yes
Optical Character Recognition	Model to recognize characters from a preceding OCDNet model.	nvidia/tao/ocrnet	Yes
Retail Object Detection	DINO (DETR with Improved DeNoising Anchor Boxes) based object detection network to detect retail objects on a checkout counter.	nvidia/tao/retail_object_detection	Yes
Pose Classification	Pose classification network to classify poses of people from their skeletons.	nvidia/tao/poseclassificationnet	Yes
PointPillarNet	Model to detect one or more objects from a LIDAR point cloud file and return 3D bounding boxes.	nvidia/tao/pointpillarnet	Yes
PeopleSegNet	1 class instance segmentation network to detect and segment instances of people in an image.	nvidia/tao/peoplesegnet	Yes
Multiple 3D CenterPose	Single-stage, keypoint-based method for category-level object pose estimation.	nvidia/tao/multiclass_3d_centerpose	Yes
Retail Object Recognition	Embedding generator model to recognize objects on a checkout counter.	nvidia/tao/retail_object_recognition	Yes
PeopleSemSegformer	Model to segment persons in an image.	nvidia/tao/peoplesemsegformer	Yes
PeopleNet Transformer v2.0	3 class object detection network to detect people in an image.	nvidia/tao/peoplenet_transformer_v2	Yes
Visual ChangeNet - Segmentation	Visual ChangeNet - Segmentation	nvidia/tao/visual_changenet_segmentation_landsatscd	Yes
CenterPose	3D pose detection model for retail objects.	nvidia/tao/centerpose	Yes
Visual ChangeNet Segmentation - MvTEC	Change segmentation model.	nvidia/tao/visual_changenet_segmentation_mvtec	Yes
PCB Defect Classification	Model to classify defects in soldered components on a printed circuit board.	nvidia/tao/pcb_classification	Yes
Visual ChangeNet Segmentation - (Research-only)	Visual ChangeNet-Segmentation (research-only)	nvidia/tao/visual_changenet_segmentation_levircd	Yes
Visual ChangeNet Classification	Visual ChangeNet - Classification Models	nvidia/tao/visual_changenet_classification	Yes
Optical Inspection	Model to detect defects in soldered components on a printed circuit board.	nvidia/tao/optical_inspection	Yes
CitySemSegFormer	Semantic segmentation of persons in an image.	nvidia/tao/citysemsegformer	Yes
BodyPoseNet	Detect body pose from an image.	nvidia/tao/bodyposenet	Yes
Optical Character Detection	Network to detect characters in an image.	nvidia/tao/ocdnet	Yes
TrafficCamNet	4 class object detection network to detect cars in an image.	nvidia/tao/trafficcamnet	Yes
BodyPose3DNet	3D human pose estimation network to predict 34 keypoints in 3D of a person in an image.	nvidia/tao/bodypose3dnet	No

Pre-trained Models

Model name	Description	NGC Instance
Mask2Former	Binary instance segmentation model trained on COCO data.	nvidia/tao/mask2former
Pre-trained SegFormer ImageNet weights	Pre-trained SegFormer weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_segformer_imagenet
Pre-trained EfficientDet NvImageNet backbones	Pre-trained EfficientNet backbone weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_efficientdet_tf2_nvimagenet
Pre-trained EfficientDet Model trained on COCO	Pre-trained EfficientDet models trained on COCO to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_efficientdet_tf2_coco
Deformable DETR	Model trained on COCO2017 to detect and classify objects.	nvidia/tao/pretrained_deformable_detr_coco
DINO	Model trained on COCO2017 to detect and classify objects.	nvidia/tao/pretrained_dino_coco
TAO Pretrained DINO with Foundational Model Backbone	TAO Pretrained DINO with Foundational Model Backbone	nvidia/tao/dino_with_fm_backbone
DINO with NVDinoV2 backbone on COCO	Model trained on COCO2017 to detect and classify objects.	nvidia/tao/pretrained_nvdinov2_dino_coco
Pre-trained FasterViT based ImageNet Classification weights	Pre-trained FasterViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_fastervit_classification_imagenet
Pre-trained FasterViT based NVImageNet Classification weights	Pre-trained FasterViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_fastervit_classification_nvimagenet
Pre-trained GCViT ImageNet Classification weights	Pre-trained GCViT weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_gcvit_classification_imagenet
Pre-trained Deformable DETR NvImageNet weights	Pre-trained deformable_detr weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_deformable_detr_nvimagenet
Pre-trained FAN based ImageNet Classification weights	Pre-trained FAN weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_fan_classification_imagenet
Pre-trained FAN based NVImageNet Classification weights	Pre-trained FAN weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_fan_classification_nvimagenet
Pre-trained DINO ImageNet weights	Pre-trained DINO weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_dino_imagenet
Pre-trained SegFormer - Cityscapes	Pre-trained SegFormer models trained on Cityscapes.	nvidia/tao/pretrained_segformer_cityscapes
Pre-trained SegFormer NvImageNet weights	Pre-trained SegFormer weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_segformer_nvimagenet
Pre-trained GCViT NVImageNet Classification weights	Pre-trained GCViT weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_gcvit_classification_nvimagenet
Pre-trained DINO NvImageNet weights	Pre-trained DINO weights trained on NvImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_dino_nvimagenet
Pre-trained Deformable DETR ImageNet weights	Pre-trained deformable_detr weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_deformable_detr_imagenet
Pretrained Mask Auto Label	Pretrained model to facilitate transfer learning for MAL on TAO Toolkit	nvidia/tao/pretrained_mask_auto_label
Pre-trained ImageNet backbone weights	Pre-trained weights trained on ImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_imagenet_backbones
Pre-trained NVImageNet backbone weights	Pre-trained weights trained on NVImageNet to facilitate transfer learning using TAO Toolkit.	nvidia/tao/pretrained_nvimagenet_backbones
TAO Pretrained EfficientDet-TF2	Pretrained efficientnet backbones for TAO Toolkit’s efficientdet-tf2	nvidia/tao/pretrained_efficientdet_tf2
TAO Pretrained Classification-TF2	Pretrained backbones for TAO Toolkit TF2 image classification	nvidia/tao/pretrained_classification_tf2
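
The NGC Instance values in the tables above follow an org/team/model pattern. As a minimal sketch of how such a value might be turned into an NGC CLI download command, the Python below only builds the command string and does not run it. The version tag is not given in the tables, so `<version>` is a placeholder to fill in from the model's NGC page, and the helper names and the `./models` destination are hypothetical.

```python
import shlex


def split_ngc_instance(instance: str) -> dict:
    """Split an 'org/team/model' NGC instance string into its parts."""
    org, team, model = instance.split("/")
    return {"org": org, "team": team, "model": model}


def ngc_download_command(instance: str, version: str, dest: str = ".") -> str:
    """Build (but do not execute) an NGC CLI command for one model version.

    The version tag is an assumption supplied by the caller; the tables in
    this page list only the instance path.
    """
    args = [
        "ngc", "registry", "model", "download-version",
        f"{instance}:{version}",
        "--dest", dest,
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


parts = split_ngc_instance("nvidia/tao/grounding_dino")
cmd = ngc_download_command("nvidia/tao/grounding_dino", "<version>", dest="./models")
print(parts)
print(cmd)
```

Building the command as a list and joining it with `shlex.join` keeps the placeholder safely quoted until a real version tag is substituted.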

Performance Metrics

The tables below summarize the inference performance of these pretrained models across various NVIDIA platforms. The numbers were measured with the trtexec tool from the TensorRT samples. In the tables, BS is the batch size and FPS is the throughput in frames per second.
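
As a rough illustration of what a trtexec measurement might look like, the sketch below assembles a command line from a precision, a batch size, and an input resolution. It is not the exact command used to produce these numbers. The file name `model.onnx` and the tensor name `input` are hypothetical, the resolution `960x544x3` is read here as width x height x channels, an NCHW input layout is assumed, and `--shapes` applies only to models exported with dynamic input shapes.

```python
import shlex


def trtexec_command(onnx_path, input_name, batch, channels, height, width,
                    precision="fp16", dla_core=None):
    """Build (but do not execute) a trtexec benchmarking command."""
    # Assumed NCHW layout: batch x channels x height x width.
    shape = f"{input_name}:{batch}x{channels}x{height}x{width}"
    args = ["trtexec", f"--onnx={onnx_path}", f"--shapes={shape}"]
    if precision in ("fp16", "int8"):
        args.append(f"--{precision}")
    if dla_core is not None:
        # Run on a DLA core and let unsupported layers fall back to the GPU.
        args += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return shlex.join(args)


# Hypothetical model file and tensor name; batch size 16 at 960x544, INT8.
gpu_cmd = trtexec_command("model.onnx", "input", 16, 3, 544, 960, precision="int8")
dla_cmd = trtexec_command("model.onnx", "input", 16, 3, 544, 960,
                          precision="int8", dla_core=0)
print(gpu_cmd)
print(dla_cmd)
```

The throughput that trtexec reports depends on the TensorRT version, clock settings, and power mode, so numbers reproduced this way may differ from the tables.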

Orin NX 16GB

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	16	400	16	300
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	314	32	226
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	16	140	32	70
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	16	457	16	352
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	479	64	358
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	64	2588	64	1700
VehicleMakeNet	ResNet18	224x224x3	INT8	64	4261	64	2218
VehicleTypeNet	ResNet18	224x224x3	INT8	64	3391	64	2044
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	32	613	32	492
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	32		32
License Plate Recognition	ResNet	96x48x3	FP16	128	1498	–	–
Facial landmark		80x80x1	FP16	32	1606	–	–
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	64	1241	–	–
GestureNet		160x160x3	FP16	64	5420	–	–
BodyPose		288x384x3	INT8	16	195	–	–
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	32	577	–	–
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	4	38	–	–
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	16	826	–	–
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	4	42	–	–
Point Pillar	Point Pillar		FP16	1	38	–	–
Pose classification	ST-GCN		FP16	8	105	–	–
3D Pose - Accuracy			FP16	16	241	–	–
3D Pose - Performance			FP16	16	295	–	–
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	16	289	–	–
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	4	27	–	–
PeopleNet Transformer	D-DETR	960x544x3	FP16	2	13	–	–
ReidentificationNet	ResNet50	3x256x18	FP16	64	1049	–	–
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	16	48	–	–
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	16	45	–	–
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	32	544	–	–
CitySemSegFormer v1.0	SegFormer		FP16	1	0.6	–	–
ImageNet Classification	FAN-T-H	224x224x3	FP16	4	146.6	–	–
ImageNet Classification	FAN-S-H	224x224x3	FP16	4	94.5	–	–
ImageNet Classification	FAN-B-H	224x224x3	FP16	4	62	–	–
ImageNet Classification	FAN-L-H	224x224x3	FP16	4	43.9	–	–
ImageNet Classification	FAN-XL-H	224x224x3	FP16	4	39.9	–	–
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	8	212	–	–
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	8	163	–	–
ImageNet Classification	GC-ViT-T	224x224x3	FP16	8	110	–	–
ImageNet Classification	GC-ViT-S	224x224x3	FP16			–	–
ImageNet Classification	GC-ViT-B	224x224x3	FP16	4	51.2	–	–
ImageNet Classification	GC-ViT-L	224x224x3	FP16	4	28.8	–	–
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	1	10.4	–	–
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	4	23.4	–	–
ImageNet Classification	FAN-L-H-384	384x384x3	FP16			–	–
ImageNet Classification	EfficientNetB0	224x224x3	FP16	64	870	–	–
ImageNet Classification	EfficientNetB1	224x224x3	FP16	64	344	–	–
ImageNet Classification	EfficientNetB2	224x224x3	FP16	32	313	–	–
ImageNet Classification	EfficientNetB3	224x224x3	FP16	32	234	–	–
ImageNet Classification	EfficientNetB4	224x224x3	FP16	32	175	–	–
ImageNet Classification	EfficientNetB5	224x224x3	FP16	16	123	–	–
COCO Object Detection	DDETR+RN50	960x544x3	FP16	1	9.6	–	–
COCO Object Detection	DDETR + GCViT-T	960x544x3	FP16	1	5.2	–	–
COCO Object Detection	DINO + RN50	960x544x3	FP16	1	8.4	–	–
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	1	4.4	–	–
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	1	4.9	–	–
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	2.6	–	–
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16			–	–
RetailObjectDetection v2.0 - binary	Efficientdet-D5	960x544x3	FP16			–	–
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	3.4	–	–
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	3.4	–	–
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	4	59.8	–	–
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	4	41.9	–	–
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	575	–	–
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	16	482	–	–
OCDNet	DCN-RN18	640x640x3	FP16	2	45	–	–
OCDNet	DCN-RN50	640x640x3	FP16	2	22	–	–
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	64	1373	–	–
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	64	1051	–	–
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	4	198	–	–
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	1.9	–	–
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	1.78	–	–
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16			–	–
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	44.7	–	–
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	21.92	–	–
CenterPose	DLA34	512x512x3	FP16	1	27.74	–	–
CenterPose	FAN-S-H	512x512x3	FP16	1	11.35	–	–
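
To relate the BS and FPS columns, the following sketch converts a row's throughput into an approximate time per batch. It assumes FPS is the sustained number of frames processed per second at the listed batch size, so the time for one batch is BS / FPS; this ignores any overlap between batches that the benchmark may use. The two rows are copied from the Orin NX 16GB table above, and the function names are hypothetical.

```python
# Two rows copied from the Orin NX 16GB table (tab-separated cells).
ROWS = (
    "PeopleNet\tDetectNet_v2 - ResNet18\t960x544x3\tINT8\t16\t400\t16\t300\n"
    "TrafficCamNet\tDetectNet_v2 - ResNet18\t960x544x3\tINT8\t16\t457\t16\t352\n"
)


def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Approximate time to process one batch, in milliseconds."""
    return 1000.0 * batch_size / fps


def parse_row(line: str) -> dict:
    """Pick the model name and the GPU/DLA batch size and FPS cells."""
    cells = line.split("\t")
    return {
        "model": cells[0],
        "gpu_bs": int(cells[4]),
        "gpu_fps": float(cells[5]),
        "dla_bs": int(cells[6]),
        "dla_fps": float(cells[7]),
    }


results = {}
for line in ROWS.splitlines():
    row = parse_row(line)
    results[row["model"]] = {
        "gpu_ms_per_batch": batch_latency_ms(row["gpu_bs"], row["gpu_fps"]),
        "dla_ms_per_batch": batch_latency_ms(row["dla_bs"], row["dla_fps"]),
    }

for model, timing in results.items():
    print(model, timing)
```

For PeopleNet on the GPU, a batch of 16 at 400 FPS works out to 40 ms per batch under this assumption.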

AGX Orin 64GB

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS	DLA1 + DLA2 BS	DLA1 + DLA2 FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	1116	32	528
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	890	32	404
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	16	421	32	104
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	1268	32	594
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	1308	64	587
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	128	7462	128	2720
VehicleMakeNet	ResNet18	224x224x3	INT8	128	11872	128	3956
VehicleTypeNet	ResNet18	224x224x3	INT8	128	9815	128	3494
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	64	1700	64	870
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	64		64
License Plate Recognition	ResNet	96x48x3	FP16	128	4118	–	–
Facial landmark		80x80x1	FP16	64		–	–
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	128	3226	–	–
GestureNet		160x160x3	FP16	128	15133	–	–
BodyPose		288x384x3	INT8	16	559	–	–
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	64	1577	–	–
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	8	105	–	–
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	32	1702	–	–
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	4	109	–	–
Point Pillar	Point Pillar		FP16	1	90	–	–
Pose classification	ST-GCN		FP16	16	262	–	–
3D Pose - Accuracy			FP16	16	597	–	–
3D Pose - Performance			FP16	16	711	–	–
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	32	703	–	–
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	4	75	–	–
PeopleNet Transformer	D-DETR	960x544x3	FP16	2	33	–	–
ReidentificationNet	ResNet50	3x256x18	FP16	64	2772	–	–
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	16	121	–	–
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	16	114	–	–
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	32	1355	–	–
CitySemSegFormer v1.0	SegFormer		FP16	1	1.5	–	–
ImageNet Classification	FAN-T-H	224x224x3	FP16	8	385	–	–
ImageNet Classification	FAN-S-H	224x224x3	FP16	8	255	–	–
ImageNet Classification	FAN-B-H	224x224x3	FP16	8	166	–	–
ImageNet Classification	FAN-L-H	224x224x3	FP16	8	117	–	–
ImageNet Classification	FAN-XL-H	224x224x3	FP16	8	106	–	–
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	16	559	–	–
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	16	431	–	–
ImageNet Classification	GC-ViT-T	224x224x3	FP16	16	293	–	–
ImageNet Classification	GC-ViT-S	224x224x3	FP16			–	–
ImageNet Classification	GC-ViT-B	224x224x3	FP16	8	138	–	–
ImageNet Classification	GC-ViT-L	224x224x3	FP16	8	80.7	–	–
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	4	28.1	–	–
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	8	61.2	–	–
ImageNet Classification	FAN-L-H-384	384x384x3	FP16			–	–
ImageNet Classification	EfficientNetB0	224x224x3	FP16	64	2152	–	–
ImageNet Classification	EfficientNetB1	224x224x3	FP16	64	860	–	–
ImageNet Classification	EfficientNetB2	224x224x3	FP16	64	811	–	–
ImageNet Classification	EfficientNetB3	224x224x3	FP16	64	609	–	–
ImageNet Classification	EfficientNetB4	224x224x3	FP16	64	451	–	–
ImageNet Classification	EfficientNetB5	224x224x3	FP16	32	318	–	–
COCO Object Detection	DDETR+RN50	960x544x3	FP16	4	25.8	–	–
COCO Object Detection	DDETR + GCViT-T	960x544x3	FP16	4	14.2	–	–
COCO Object Detection	DINO + RN50	960x544x3	FP16	4	22	–	–
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	4	11.2	–	–
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	4	13	–	–
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	6.2	–	–
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16			–	–
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16			–	–
RetailObjectDetection v2.0 - binary	Efficientdet-D5	960x544x3	FP16			–	–
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	8.1	–	–
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	8.1	–	–
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	8	161	–	–
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	8	112	–	–
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	32	1541	–	–
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	32	1538	–	–
OCDNet	DCN-RN18	640x640x3	FP16	4	120	–	–
OCDNet	DCN-RN50	640x640x3	FP16			–	–
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	3876	–	–
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	128	2950	–	–
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	16	560	–	–
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	4.8	–	–
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	4.4	–	–
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16			–	–
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	113.2	–	–
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	55	–	–
CenterPose	DLA34	512x512x3	FP16	1	57.04	–	–
CenterPose	FAN-S-H	512x512x3	FP16	1	28.03	–	–

Orin Nano 8GB

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
ImageNet Classification	FAN-T-H	224x224x3	FP16	4	99.1
ImageNet Classification	FAN-S-H	224x224x3	FP16	4	64.5
ImageNet Classification	FAN-B-H	224x224x3	FP16	4	42.3
ImageNet Classification	FAN-L-H	224x224x3	FP16	4	29.8
ImageNet Classification	FAN-XL-H	224x224x3	FP16	4	27
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	8	146
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	8	111
ImageNet Classification	GC-ViT-T	224x224x3	FP16	8	75
ImageNet Classification	GC-ViT-S	224x224x3	FP16
ImageNet Classification	GC-ViT-B	224x224x3	FP16	4	35.2
ImageNet Classification	GC-ViT-L	224x224x3	FP16	4	19.8
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	4	16
ImageNet Classification	FAN-L-H-384	384x384x3	FP16
ImageNet Classification	EfficientNetB0	224x224x3	FP16	32	571
ImageNet Classification	EfficientNetB1	224x224x3	FP16	32	227
ImageNet Classification	EfficientNetB2	224x224x3	FP16	16	208
ImageNet Classification	EfficientNetB3	224x224x3	FP16	16	155
ImageNet Classification	EfficientNetB4	224x224x3	FP16	16	116
ImageNet Classification	EfficientNetB5	224x224x3	FP16	8	81
COCO Object Detection	DDETR+RN50	960x544x3	FP16	1	6.6
COCO Object Detection	DDETR + GCViT-T	960x544x3	FP16	1	3.6
COCO Object Detection	DINO + RN50	960x544x3	FP16	1	5.7
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	1	3.1
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	1	3.3
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	1.8
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16
RetailObjectDetection v2.0 - binary	Efficientdet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	2.3
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	2.3
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	4	40.2
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	4	28.4
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	393
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	16	399
OCDNet	DCN-RN18	640x640x3	FP16
OCDNet	DCN-RN50	640x640x3	FP16
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	64	935
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	64	715
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	4	133.9
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	1.36
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	1.2
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	31
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	15.2
CenterPose	DLA34	512x512x3	FP16	1	19.16
CenterPose	FAN-S-H	512x512x3	FP16	1	7.94

A100

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	128	8500
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	64	6245
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	64	3291
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	256	9717
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	256	9500
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	256	51600
VehicleMakeNet	ResNet18	224x224x3	INT8	1024	88300
VehicleTypeNet	ResNet18	224x224x3	INT8	512	72300
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	256	14900
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	256	23200
License Plate Recognition	ResNet	96x48x3	FP16	256	27200
Facial landmark		80x80x1	FP16	256	19600
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	1024	25394
GestureNet		160x160x3	FP16	1024	94555
BodyPose		288x384x3	INT8	16	3180
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	32	12600
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	16	797
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	64	17535
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	16	899
Point Pillar	Point Pillar		FP16	1	425
Pose classification	ST-GCN		FP16	64	2144.84
3D Pose - Accuracy			FP16	32	3466.34
3D Pose - Performance			FP16	32	4176.37
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	64	5745.79
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	16	496.34
PeopleNet Transformer	D-DETR	960x544x3	FP16	16	267
ReidentificationNet	ResNet50	3x256x18	FP16	256	20781
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	64	1012
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	64	947
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	64	9851
CitySemSegFormer v1.0	SegFormer		FP16	4	13
ImageNet Classification	FAN-T-H	224x224x3	FP16	128	3867
ImageNet Classification	FAN-S-H	224x224x3	FP16	128	2576
ImageNet Classification	FAN-B-H	224x224x3	FP16	64	1692
ImageNet Classification	FAN-L-H	224x224x3	FP16	64	1196
ImageNet Classification	FAN-XL-H	224x224x3	FP16	32	1046
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	128	6863
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	128	5162
ImageNet Classification	GC-ViT-T	224x224x3	FP16	128	3600
ImageNet Classification	GC-ViT-S	224x224x3	FP16	64	2435
ImageNet Classification	GC-ViT-B	224x224x3	FP16	64	1663
ImageNet Classification	GC-ViT-L	224x224x3	FP16	64	917
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	32	344
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	64	577
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	64	402
ImageNet Classification	EfficientNetB0	224x224x3	FP16	256	17909
ImageNet Classification	EfficientNetB1	224x224x3	FP16	256	6994
ImageNet Classification	EfficientNetB2	224x224x3	FP16	256	6665
ImageNet Classification	EfficientNetB3	224x224x3	FP16	256	4987
ImageNet Classification	EfficientNetB4	224x224x3	FP16	64	3577
ImageNet Classification	EfficientNetB5	224x224x3	FP16	256	2583
COCO Object Detection	DDETR+RN50	960x544x3	FP16	32	296
COCO Object Detection	DDETR + GCViT-T	960x544x3	FP16	32	184
COCO Object Detection	DINO + RN50	960x544x3	FP16	32	244
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	32	121
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	32	165
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	16	70.6
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	64	2381
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	64	1808
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	64	1338
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	64	1000
RetailObjectDetection v2.0 - binary	Efficientdet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	32	94.3
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	32	94.2
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	64	1621
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	64	1155
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	64	15584
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	128	15660
OCDNet	DCN-RN18	640x640x3	FP16	64	724
OCDNet	DCN-RN50	640x640x3	FP16	64	351
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	512	28297
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	512	21212
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	128	7095
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	8	62.2
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	8	50.4
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	32	254
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	920
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	435
CenterPose	DLA34	512x512x3	FP16	1	82.16
CenterPose	FAN-S-H	512x512x3	FP16	1	49.99

A30

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	4228
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	3160
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	1603
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	5082
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	4900
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	128	27100
VehicleMakeNet	ResNet18	224x224x3	INT8	256	46200
VehicleTypeNet	ResNet18	224x224x3	INT8	128	37200
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	64	7700
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	128	12500
License Plate Recognition	ResNet	96x48x3	FP16	128	12400
Facial landmark		80x80x1	FP16	128	12400
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	12321
GestureNet		160x160x3	FP16	512	47361
BodyPose		288x384x3	INT8	32	1596
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	16	6000
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	4	380
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	32	8940
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	16	461
Point Pillar	Point Pillar		FP16	1	271
Pose classification	ST-GCN		FP16	64	1121.68
3D Pose - Accuracy			FP16	32	1913.92
3D Pose - Performance			FP16	32	2241.83
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	64	2862.76
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	16	253.77
PeopleNet Transformer	D-DETR	960x544x3	FP16	16	135
ReidentificationNet	ResNet50	3x256x18	FP16	128	9649
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	64	485
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	64	450
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	64	4791
CitySemSegFormer v1.0	SegFormer		FP16	1	6.4
ImageNet Classification	FAN-T-H	224x224x3	FP16	32	1806
ImageNet Classification	FAN-S-H	224x224x3	FP16	32	1185
ImageNet Classification	FAN-B-H	224x224x3	FP16	16	769
ImageNet Classification	FAN-L-H	224x224x3	FP16	16	548
ImageNet Classification	FAN-XL-H	224x224x3	FP16	16	423
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	32	3105
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	32	2339
ImageNet Classification	GC-ViT-T	224x224x3	FP16	32	1658
ImageNet Classification	GC-ViT-S	224x224x3	FP16	16	1077
ImageNet Classification	GC-ViT-B	224x224x3	FP16	16	711
ImageNet Classification	GC-ViT-L	224x224x3	FP16	16	388
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	8	151
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	16	260
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	8	179
ImageNet Classification	EfficientNetB0	224x224x3	FP16	64	8361
ImageNet Classification	EfficientNetB1	224x224x3	FP16	64	3245
ImageNet Classification	EfficientNetB2	224x224x3	FP16	64	3078
ImageNet Classification	EfficientNetB3	224x224x3	FP16	64	2320
ImageNet Classification	EfficientNetB4	224x224x3	FP16	64	1669
ImageNet Classification	EfficientNetB5	224x224x3	FP16	64	1203
COCO Object Detection	DDETR+RN50	960x544x3	FP16	8	141
COCO Object Detection	DDETR + GCViT-T	960x544x3	FP16	8	87
COCO Object Detection	DINO + RN50	960x544x3	FP16	8	115
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	4	56
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	8	77
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	8	33.4
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	16	1098
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	16	834
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	16	606
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	16	465
RetailObjectDetection v2.0 - binary	Efficientdet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	8	44.2
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	8	44
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	16	734
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	16	522
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	5886
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	32	6318
OCDNet	DCN-RN18	640x640x3	FP16	16	387
OCDNet	DCN-RN50	640x640x3	FP16	8	186.6
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	12122
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	128	9656
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	32	3221
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	4	29.3
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	4	23.7
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	8	116.8
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	436
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	204
CenterPose	DLA34	512x512x3	FP16	1	57.46
CenterPose	FAN-S-H	512x512x3	FP16	1	26.73

A10

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	3819
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	2568
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	1007
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	4754
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	4600
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	128	26900
VehicleMakeNet	ResNet18	224x224x3	INT8	256	44800
VehicleTypeNet	ResNet18	224x224x3	INT8	256	31500
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	64	6000
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	256	13900
License Plate Recognition	ResNet	96x48x3	FP16	256	9000
Facial landmark		80x80x1	FP16	512	9600
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	10718
GestureNet		160x160x3	FP16	512	35371
BodyPose		288x384x3	INT8	32	1334
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	16	4600
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	4	265
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	32	6500
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	16	284
Point Pillar	Point Pillar		FP16	1	246
Pose classification	ST-GCN		FP16	64	825.75
3D Pose - Accuracy			FP16	32	1286.05
3D Pose - Performance			FP16	32	1558.21
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	64	2429.62
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	16	180.04
PeopleNet Transformer	D-DETR	960x544x3	FP16	16	106
ReidentificationNet	ResNet50	3x256x18	FP16	128	6129
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	64	367
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	64	341
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	64	3105
CitySemSegFormer v1.0	SegFormer		FP16	1	4.5

A2

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	749
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	581
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	231
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	916
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	32	865
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	64	4982
VehicleMakeNet	ResNet18	224x224x3	INT8	128	8000
VehicleTypeNet	ResNet18	224x224x3	INT8	128	6302
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	32	1174
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	128	2570
License Plate Recognition	ResNet	96x48x3	FP16	128	2180
Facial landmark		80x80x1	FP16	256	2800
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	256	2488
GestureNet		160x160x3	FP16	256	7690
BodyPose		288x384x3	INT8	16	278
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	8	1044
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	4	56
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	16	1419
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	2	58
Point Pillar	Point Pillar		FP16	1	63
Pose classification	ST-GCN		FP16	64	211.5
3D Pose - Accuracy			FP16	32	370.13
3D Pose - Performance			FP16	32	471.81
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	16	631.31
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	16	44.09
PeopleNet Transformer	D-DETR	960x544x3	FP16	4	27
ReidentificationNet	ResNet50	3x256x18	FP16	64	1570
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	32	106
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	32	99
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	32	803
CitySemSegFormer v1.0	SegFormer		FP16	1	1.3
ImageNet Classification	FAN-T-H	224x224x3	FP16	16	399
ImageNet Classification	FAN-S-H	224x224x3	FP16	16	256
ImageNet Classification	FAN-B-H	224x224x3	FP16	16	168
ImageNet Classification	FAN-L-H	224x224x3	FP16	16	118
ImageNet Classification	FAN-XL-H	224x224x3	FP16	8	103
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	16	666
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	32	492
ImageNet Classification	GC-ViT-T	224x224x3	FP16	32	336
ImageNet Classification	GC-ViT-S	224x224x3	FP16	16	213
ImageNet Classification	GC-ViT-B	224x224x3	FP16	16	137
ImageNet Classification	GC-ViT-L	224x224x3	FP16	16	67
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	2	26
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	8	55.5
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	1	38
ImageNet Classification	EfficientNetB0	224x224x3	FP16	64	1870
ImageNet Classification	EfficientNetB1	224x224x3	FP16	64	726
ImageNet Classification	EfficientNetB2	224x224x3	FP16	64	690
ImageNet Classification	EfficientNetB3	224x224x3	FP16	64	518
ImageNet Classification	EfficientNetB4	224x224x3	FP16	32	363
ImageNet Classification	EfficientNetB5	224x224x3	FP16	32	265
COCO Object Detection	DDETR + RN50	960x544x3	FP16	1	27.7
COCO Object Detection	DDETR + GC-ViT-T	960x544x3	FP16	1	17.9
COCO Object Detection	DINO + RN50	960x544x3	FP16	1	22.5
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	1	11.7
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	1	15.7
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	6.7
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	16	230
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	16	174
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	16	129
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	16	97.7
RetailObjectDetection v2.0 - binary	EfficientDet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	8.9
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	8.8
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	16	161
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	16	113
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	1200
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	32	1391
OCDNet	DCN-RN18	640x640x3	FP16	8	93
OCDNet	DCN-RN50	640x640x3	FP16	1	45.5
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	2094
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	128	1482
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	32	688
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	5.8
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	4.4
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	1	23.3
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	95.81
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	36.02
CenterPose	DLA34	512x512x3	FP16	1	15.69
CenterPose	FAN-S-H	512x512x3	FP16	1	5.80

H100

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
ImageNet Classification	FAN-T-H	224x224x3	FP16	128	6555
ImageNet Classification	FAN-S-H	224x224x3	FP16	128	4393
ImageNet Classification	FAN-B-H	224x224x3	FP16	64	2833
ImageNet Classification	FAN-L-H	224x224x3	FP16	64	1982
ImageNet Classification	FAN-XL-H	224x224x3	FP16	32	1692
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	128	11942
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	128	9094
ImageNet Classification	GC-ViT-T	224x224x3	FP16	128	6381
ImageNet Classification	GC-ViT-S	224x224x3	FP16	64	4145
ImageNet Classification	GC-ViT-B	224x224x3	FP16	64	3040
ImageNet Classification	GC-ViT-L	224x224x3	FP16	64	1618
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	32	613
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	64	985
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	64	681
ImageNet Classification	EfficientNetB0	224x224x3	FP16	256	28765
ImageNet Classification	EfficientNetB1	224x224x3	FP16	256	11363
ImageNet Classification	EfficientNetB2	224x224x3	FP16	256	10830
ImageNet Classification	EfficientNetB3	224x224x3	FP16	256	8115
ImageNet Classification	EfficientNetB4	224x224x3	FP16	64	5915
ImageNet Classification	EfficientNetB5	224x224x3	FP16	256	4323
COCO Object Detection	DDETR + RN50	960x544x3	FP16	32	526
COCO Object Detection	DDETR + GC-ViT-T	960x544x3	FP16	32	317
COCO Object Detection	DINO + RN50	960x544x3	FP16	32	442
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	32	213
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	32	290
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	32	125.5
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	64	4189
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	64	3143
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	64	2306
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	64	1732
RetailObjectDetection v2.0 - binary	EfficientDet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	32	167
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	32	167
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	64	2686
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	64	1919
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	64	29070
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	128	24107
OCDNet	DCN-RN18	640x640x3	FP16	64	1468
OCDNet	DCN-RN50	640x640x3	FP16	64	706
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	512	55717
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	512	48492
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	128	12273
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	8	108
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	8	89.5
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	32	454
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	64	1581
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	32	841
CenterPose	DLA34	512x512x3	FP16	1	105.84
CenterPose	FAN-S-H	512x512x3	FP16	1	76.26

L40

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
ImageNet Classification	FAN-T-H	224x224x3	FP16	16	4249
ImageNet Classification	FAN-S-H	224x224x3	FP16	8	2647
ImageNet Classification	FAN-B-H	224x224x3	FP16	8	1720
ImageNet Classification	FAN-L-H	224x224x3	FP16	8	1154
ImageNet Classification	FAN-XL-H	224x224x3	FP16	8	1005
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	16	6282
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	16	4632
ImageNet Classification	GC-ViT-T	224x224x3	FP16	16	3118
ImageNet Classification	GC-ViT-S	224x224x3	FP16	8	2022
ImageNet Classification	GC-ViT-B	224x224x3	FP16	8	1282
ImageNet Classification	GC-ViT-L	224x224x3	FP16	4	628
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	2	247
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	4	558
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	4	366
ImageNet Classification	EfficientNetB0	224x224x3	FP16	32	19577
ImageNet Classification	EfficientNetB1	224x224x3	FP16	16	7723
ImageNet Classification	EfficientNetB2	224x224x3	FP16	16	7234
ImageNet Classification	EfficientNetB3	224x224x3	FP16	16	5063
ImageNet Classification	EfficientNetB4	224x224x3	FP16	16	3740
ImageNet Classification	EfficientNetB5	224x224x3	FP16	16	2671
COCO Object Detection	DDETR + RN50	960x544x3	FP16	1	283
COCO Object Detection	DDETR + GC-ViT-T	960x544x3	FP16	1	182
COCO Object Detection	DINO + RN50	960x544x3	FP16	1	215
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	1	119.5
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	1	151
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	68.5
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	8	1985
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	8	1510
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	8	1144
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	8	840
RetailObjectDetection v2.0 - binary	EfficientDet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	88.5
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	88.6
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	8	1648
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	8	1124
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	32	9407
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	128	10390
OCDNet	DCN-RN18	640x640x3	FP16	4	940
OCDNet	DCN-RN50	640x640x3	FP16	4	445
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	18971
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	256	13950
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	16	6619
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	47.3
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	40.9
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	2	210
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	4	980
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	8	364
CenterPose	DLA34	512x512x3	FP16	1	106.70
CenterPose	FAN-S-H	512x512x3	FP16	1	56.65

L4

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
ImageNet Classification	FAN-T-H	224x224x3	FP16	8	1604
ImageNet Classification	FAN-S-H	224x224x3	FP16	8	949
ImageNet Classification	FAN-B-H	224x224x3	FP16	4	632
ImageNet Classification	FAN-L-H	224x224x3	FP16	4	442
ImageNet Classification	FAN-XL-H	224x224x3	FP16	4	365
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	8	2440
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	8	1797
ImageNet Classification	GC-ViT-T	224x224x3	FP16	16	1266
ImageNet Classification	GC-ViT-S	224x224x3	FP16	8	817
ImageNet Classification	GC-ViT-B	224x224x3	FP16	8	546
ImageNet Classification	GC-ViT-L	224x224x3	FP16	8	268
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	4	101
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	4	207
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	4	145
ImageNet Classification	EfficientNetB0	224x224x3	FP16	16	7072
ImageNet Classification	EfficientNetB1	224x224x3	FP16	16	2454
ImageNet Classification	EfficientNetB2	224x224x3	FP16	16	2306
ImageNet Classification	EfficientNetB3	224x224x3	FP16	16	1527
ImageNet Classification	EfficientNetB4	224x224x3	FP16	16	1134
ImageNet Classification	EfficientNetB5	224x224x3	FP16	16	780
COCO Object Detection	DDETR + RN50	960x544x3	FP16	1	107
COCO Object Detection	DDETR + GC-ViT-T	960x544x3	FP16	1	70
COCO Object Detection	DINO + RN50	960x544x3	FP16	1	79.6
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	1	44
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	1	56.6
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	1	26.4
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	4	750
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	4	581
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	4	442
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	4	341
RetailObjectDetection v2.0 - binary	EfficientDet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	1	34.1
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	1	34.1
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	4	603
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	4	430
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	4375
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	32	2821
OCDNet	DCN-RN18	640x640x3	FP16	1	333
OCDNet	DCN-RN50	640x640x3	FP16	1	169
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	8036
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	64	6045
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	8	2543
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	17.8
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	15.7
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	1	83.4
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	4	344
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	8	131.48
CenterPose	DLA34	512x512x3	FP16	1	40.68
CenterPose	FAN-S-H	512x512x3	FP16	1	18.69

T4

Model Name	Arch	Inference resolution	Precision	GPU BS	GPU FPS
PeopleNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	1379
PeopleNet (v2.3)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	1064
PeopleNet (v2.5 unpruned)	DetectNet_v2 - ResNet34	960x544x3	INT8	32	465
TrafficCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	1725
DashCamNet	DetectNet_v2 - ResNet18	960x544x3	INT8	64	1676
FaceDetect-IR	DetectNet_v2 - ResNet18	384x240x3	INT8	128	9810
VehicleMakeNet	ResNet18	224x224x3	INT8	256	16500
VehicleTypeNet	ResNet18	224x224x3	INT8	128	12500
FaceDetect (pruned)	DetectNet_v2 - ResNet18	736x416x3	INT8	64	2578
License Plate Detection v1.0	DetectNet_v2 - ResNet18	640x480x3	INT8	128	6123
License Plate Recognition	ResNet	96x48x3	FP16	128	3959
Facial landmark		80x80x1	FP16	128	4622
GazeNet		224x224x1, 224x224x1, 224x224x1, 25x25x1	FP16	512	4563
GestureNet		160x160x3	FP16	512	15377
BodyPose		288x384x3	INT8	32	598
Action Recognition 2D RGB	2D CNN	224x224x96	FP16	16	1897
Action Recognition 3D RGB	3D CNN	224x224x32x3	FP16	4	139
Action Recognition 2D OF	2D Optical Flow	224x224x96	FP16	32	3320
Action Recognition 3D OF	3D Optical Flow	224x224x32x3	FP16	16	192
Point Pillar	Point Pillar		FP16	1	111
Pose classification	ST-GCN		FP16	64	376.4
3D Pose - Accuracy			FP16	32	614.98
3D Pose - Performance			FP16	32	712.94
PeopleSemSegNet_v2 - Shuffle	UNET - Shuffle	960x544x3	FP16	64	1027.85
PeopleSemSegNet_v2 - Vanilla	UNET - Vanilla	960x544x3	FP16	16	79.08
PeopleNet Transformer	D-DETR	960x544x3	FP16	4	48
ReidentificationNet	ResNet50	3x256x18	FP16	64	2586
RetailObjectDetection v1.0 - 100 class	EfficientNet-D5	416x416x3	FP16	32	162
RetailObjectDetection v1.0 - binary	EfficientNet-D5	416x416x3	FP16	32	151
RetailObjectEmbedding v1.0	ResNet101	3x224x224	FP16	32	1270
CitySemSegFormer v1.0	SegFormer		FP16	1	2
ImageNet Classification	FAN-T-H	224x224x3	FP16	16	624
ImageNet Classification	FAN-S-H	224x224x3	FP16	8	403
ImageNet Classification	FAN-B-H	224x224x3	FP16	8	259
ImageNet Classification	FAN-L-H	224x224x3	FP16	16	188
ImageNet Classification	FAN-XL-H	224x224x3	FP16	16	166
ImageNet Classification	GC-ViT-xxTiny	224x224x3	FP16	16	999
ImageNet Classification	GC-ViT-xTiny	224x224x3	FP16	32	739
ImageNet Classification	GC-ViT-T	224x224x3	FP16	16	517
ImageNet Classification	GC-ViT-S	224x224x3	FP16	8	335
ImageNet Classification	GC-ViT-B	224x224x3	FP16	4	223
ImageNet Classification	GC-ViT-L	224x224x3	FP16	16	122
ImageNet Classification	GC-ViT-L-384	384x384x3	FP16	4	45.1
ImageNet Classification	FAN-B-H-384 (384 resolution)	384x384x3	FP16	8	91
ImageNet Classification	FAN-L-H-384	384x384x3	FP16	4	62
ImageNet Classification	EfficientNetB0	224x224x3	FP16	64	2735
ImageNet Classification	EfficientNetB1	224x224x3	FP16	64	1099
ImageNet Classification	EfficientNetB2	224x224x3	FP16	32	1006
ImageNet Classification	EfficientNetB3	224x224x3	FP16	32	753
ImageNet Classification	EfficientNetB4	224x224x3	FP16	16	526
ImageNet Classification	EfficientNetB5	224x224x3	FP16	16	375
COCO Object Detection	DDETR + RN50	960x544x3	FP16	1	46.5
COCO Object Detection	DDETR + GC-ViT-T	960x544x3	FP16	4	30.3
COCO Object Detection	DINO + RN50	960x544x3	FP16	4	38.9
COCO Object Detection	DINO + FAN-S	960x544x3	FP16	4	20
COCO Object Detection	DINO + GC-ViT-T	960x544x3	FP16	8	26.7
COCO Object Detection	DINO + FAN-L	960x544x3	FP16	4	10.9
Cityscapes Segmentation	SegFormer + FAN-T-H	224x224x3	FP16	16	374
Cityscapes Segmentation	SegFormer + FAN-S-H	224x224x3	FP16	8	272
Cityscapes Segmentation	SegFormer + FAN-B-H	224x224x3	FP16	8	198
Cityscapes Segmentation	SegFormer + FAN-L-H	224x224x3	FP16	16	156.7
RetailObjectDetection v2.0 - binary	EfficientDet-D5	960x544x3	FP16
RetailObjectDetection v2.0 - binary	DINO-FAN_base	960x544x3	FP16	2	15.2
RetailObjectDetection v2.0 - Meta	DINO-FAN_base	960x544x3	FP16	4	15.4
RetailObjectEmbedding v2.0	FAN-B-H	224x224x3	FP16	8	253
RetailObjectEmbedding v2.0	FAN-L-H	224x224x3	FP16	16	184
RetailObjectEmbedding v2.0	NVCLIP-B	224x224x3	FP16	16	1838
Siamese Optical Inspection	Siamese CNN	2x512x128x3	FP16	32	2314
OCDNet	DCN-RN18	640x640x3	FP16	16	155
OCDNet	DCN-RN50	640x640x3	FP16	1	72.5
OCRNet	ResNet50 + Bi-LSTM pruned	32x100x3	FP16	128	3649
OCRNet	ResNet50 + Bi-LSTM unpruned	32x100x3	FP16	128	2673
PCB Inspection	GC-ViT-xxTiny	224x224x3	FP16	16	1012
CitySemSegFormer v2.0	SegFormer + MIT	1024x1024x3	FP16	1	9.4
CitySemSegFormer v2.0	SegFormer + FAN-B-H	1024x1024x3	FP16	1	7.3
PeopleSemSegFormer v2.0	SegFormer + FAN-B-H	512x512x3	FP16	4	39.6
Visual ChangeNet Classification	Visual ChangeNet + FAN-S-H	512x128x3	FP16	16	146
Visual ChangeNet Segmentation	Visual ChangeNet + FAN-B-H	256x256x3	FP16	16	59.7
CenterPose	DLA34	512x512x3	FP16	1	24.27
CenterPose	FAN-S-H	512x512x3	FP16	1	9.77
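
The GPU BS and GPU FPS columns can be combined into an approximate per-batch latency. The sketch below assumes that GPU FPS is the aggregate throughput measured at the listed batch size, which the tables do not state explicitly. The helper name `batch_latency_ms` is illustrative and is not part of TAO.

```python
def batch_latency_ms(batch_size: int, fps: float) -> float:
    """Approximate time to process one batch, in milliseconds.

    Assumes `fps` is aggregate throughput at `batch_size`; this is an
    assumption, not something the benchmark tables state.
    """
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return batch_size / fps * 1000.0


# (model, arch, GPU BS, GPU FPS) rows copied from the T4 table above.
T4_ROWS = [
    ("PeopleNet", "DetectNet_v2 - ResNet18", 64, 1379.0),
    ("ImageNet Classification", "FAN-T-H", 16, 624.0),
    ("CenterPose", "DLA34", 1, 24.27),
]

for model, arch, bs, fps in T4_ROWS:
    latency = batch_latency_ms(bs, fps)
    print(f"{model} ({arch}): ~{latency:.1f} ms per batch of {bs}")
```

Under this assumption, a larger batch size raises throughput but also raises per-batch latency, so a high FPS figure does not by itself imply a low response time.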

General purpose computer vision models#

With general-purpose models, you can train an image classification, object detection, or segmentation model.

For classification, you can train using one of the available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet.
For object detection tasks, you can choose from the popular YOLOv3/v4/v4-tiny, FasterRCNN, SSD, RetinaNet, and DSSD architectures, as well as NVIDIA’s own DetectNet_v2 architecture.
For segmentation, you can use MaskRCNN for instance segmentation or UNET for semantic segmentation.

This gives you the flexibility and control to build AI models for any number of applications, from smaller, lightweight models for edge GPUs to larger models for more complex tasks. For the supported combinations of backbone and architecture, refer to the tables below.

Image Classification

Backbone	Image classification
NVCLIP	✓
C-RADIOv2	✓
NvDINOv2	✓
GcViT	✓
ViT	✓
FAN	✓
FasterViT	✓
ResNet	✓
Swin	✓
EfficientNet	✓

Object Detection

Backbone	DINO	D-DETR	Grounding DINO	RT-DETR	EfficientDet
C-RADIOv2				✓
ConvNext				✓
NvDINOv2	✓
GcViT	✓	✓
ViT	✓	✓
FAN	✓
ResNet	✓	✓		✓
Swin			✓
EfficientNet					✓
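
The object detection matrix above can be encoded as a lookup so that a training script can reject an unsupported backbone early. This is a minimal sketch: the dictionary is a hand transcription of the table, and `SUPPORTED_DETECTORS`, `is_supported`, and `backbones_for` are hypothetical names, not TAO APIs.

```python
# Hand transcription of the Object Detection table: backbone -> architectures.
SUPPORTED_DETECTORS = {
    "C-RADIOv2": {"RT-DETR"},
    "ConvNext": {"RT-DETR"},
    "NvDINOv2": {"DINO"},
    "GcViT": {"DINO", "D-DETR"},
    "ViT": {"DINO", "D-DETR"},
    "FAN": {"DINO"},
    "ResNet": {"DINO", "D-DETR", "RT-DETR"},
    "Swin": {"Grounding DINO"},
    "EfficientNet": {"EfficientDet"},
}


def is_supported(backbone: str, architecture: str) -> bool:
    """Return True if the table lists this backbone/architecture pair."""
    return architecture in SUPPORTED_DETECTORS.get(backbone, set())


def backbones_for(architecture: str) -> list:
    """List, in sorted order, every backbone marked for the architecture."""
    return sorted(
        backbone
        for backbone, architectures in SUPPORTED_DETECTORS.items()
        if architecture in architectures
    )


print(is_supported("ResNet", "RT-DETR"))
print(is_supported("Swin", "DINO"))
print(backbones_for("D-DETR"))
```

Unknown backbones fall through to an empty set, so the check returns `False` for them instead of raising an error.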

Instance Segmentation

Backbone	MAL	Mask GroundingDINO	Mask2Former
ViT	✓
Swin		✓	✓

Semantic Segmentation

Backbone	SegFormer	Mask2Former
C-RADIOv2	✓
NvDINOv2	✓
FAN	✓
Swin		✓
MIT-b	✓

Panoptic Segmentation

Backbone	Mask2Former
Swin	✓

Character Recognition

Backbone	OCD	OCR
FAN	✓	✓
ResNet	✓	✓

Visual ChangeNet

Backbone	Classification	Segmentation
C-RADIOv2	✓
NvDINOv2	✓
ViT	✓	✓
FAN	✓	✓

Pose Classification

Backbone	Pose Classification
ST-GCN (graph convolutional network)	✓

Object Recognition

Backbone	Re-identification	Metric Learning Recognition
NvDINOv2		✓
ViT		✓
ResNet	✓	✓
Swin	✓

Computer Vision Feature Summary#

The table below summarizes the computer vision models and the features enabled.

Feature Summary#
CV Task	Model	New in 22-05	New in TAO 4.0	New in TAO 5.0	New in TAO 5.5	New in TAO 6.0	Pruning	Knowledge Distillation	QAT	AutoML	REST API	Channel-wise QAT	Class weighting	Visualization (TB)	MLOps integration (W&B/ClearML)	Multi-node	Multi-GPU	AMP	Framework	Annotation Format	DLA
Self Supervised Learning	Masked Autoencoder	No	No	No	No	Yes	No	yes	No	yes	yes	no	no	yes	wandb	yes	yes	yes	pyt	Unstructured/ImageNet	no
Self Supervised Learning	NvDINOv2	No	No	No	No	Yes	No	yes	No	yes	yes	no	no	yes	wandb	yes	yes	yes	pyt	Unstructured/ImageNet	no
Classification	ResNet18/34/50	No	No	No	No	No	yes	No	No	yes	yes	no	no	yes	clearml,wandb	yes	yes	yes	tf2	ImageNet	yes
Classification	EfficientNet_B0-B5	No	No	No	No	No	yes	No	No	yes	yes	yes	no	yes	clearml,wandb	yes	yes	yes	tf2	ImageNet	yes
Classification	GcViT	No	No	Yes	No	No	No	yes	No	yes	yes	no	yes	no	wandb	yes	yes	yes	pyt	ImageNet	no
Classification	FAN	No	No	Yes	No	No	No	yes	No	yes	yes	no	yes	no	wandb	yes	yes	yes	pyt	ImageNet	no
Classification	NvDINOv2	No	No	Yes	No	No	No	yes	No	yes	yes	no	yes	no	wandb	yes	yes	yes	pyt	ImageNet	no
Classification	NVCLIP	No	No	Yes	No	No	No	yes	No	yes	yes	no	yes	no	wandb	yes	yes	yes	pyt	ImageNet	no
Classification	C-RADIOv2	No	No	Yes	No	No	No	yes	No	yes	yes	no	yes	no	wandb	yes	yes	yes	pyt	ImageNet	no
Detection	EfficientDet	No	No	No	No	No	yes	No	no	yes	no	yes	no	yes	wandb	yes	yes	yes	tf2	COCO	yes
Detection	RT-DETR	Yes	no	No	No	Yes	no	yes	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	COCO	no
Detection	Deformable DETR	Yes	no	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	COCO	no
Detection	DINO	No	no	Yes	No	No	no	yes	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	COCO	no
Detection	GroundingDINO	No	no	No	Yes	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	ODVG/COCO	no
Detection	BevFusion	No	No	No	Yes	No	no	No	no	yes	no	no	no	no	wandb	yes	yes	yes	pyt	KITTI/nuScenes	no
Instance/Panoptic/Semantic Segmentation	Mask2Former	No	No	No	Yes	No	no	No	no	yes	no	no	no	no	wandb	yes	yes	yes	pyt	COCO	no
Semantic Segmentation	SegFormer	Yes	no	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	Cityscapes - PNG	no
Point Cloud	PointPillars	Yes	No	No	No	No	Yes	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	KITTI	no
Action Recognition	2D action recognition RGB	No	No	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	Custom	no
Action Recognition	3D action recognition RGB	No	No	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	Custom	no
Action Recognition	2D action recognition OF	No	No	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	Custom	no
Action Recognition	3D action recognition OF	No	No	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	Custom	no
Other	Pose action classification	Yes	No	No	No	No	no	No	no	yes	yes	no	no	no	wandb	yes	yes	yes	pyt	COCO	no
Feature Embedding	ResNet50	Yes	no	No	No	No	no	No	no	yes	no	no	no	no	no	yes	yes	yes	pyt	Market1501	No
Other	OpticalInspection	No	No	Yes	No	No	No	No	No	yes	yes	no	Yes (False Positive Rate Sampling)	yes	wandb	yes	yes	no	pyt	NVIDIA Defined	Yes
Other	Optical Character Detection	No	No	Yes	No	No	Yes	No	No	yes	yes	no	no	no	wandb	Yes	Yes	no	pyt	NVIDIA Defined	no
OCR	Optical Character Recognition	No	No	Yes	No	No	Yes	No	No	yes	yes	no	no	no	wandb	yes	Yes	no	pyt	NVIDIA Defined	no
Anomaly Detection	Visual ChangeNet	No	No	Yes	No	No	Yes	No	No	yes	yes	no	no	no	wandb	yes	Yes	no	pyt	NVIDIA Defined	no
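
One way to use the feature summary is to filter it for the capabilities a project needs. The sketch below copies the Pruning, Knowledge Distillation, and Framework cells for a handful of rows from the table above. The `FeatureRow` type and `select` helper are illustrative names, not part of TAO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRow:
    task: str
    model: str
    pruning: bool
    knowledge_distillation: bool
    framework: str


# Subset of the feature summary table, transcribed by hand.
ROWS = [
    FeatureRow("Classification", "ResNet18/34/50", True, False, "tf2"),
    FeatureRow("Classification", "GcViT", False, True, "pyt"),
    FeatureRow("Detection", "EfficientDet", True, False, "tf2"),
    FeatureRow("Detection", "RT-DETR", False, True, "pyt"),
    FeatureRow("Detection", "DINO", False, True, "pyt"),
    FeatureRow("Point Cloud", "PointPillars", True, False, "pyt"),
]


def select(rows, **required):
    """Return the model names whose fields equal every required value."""
    return [
        row.model
        for row in rows
        if all(getattr(row, name) == value for name, value in required.items())
    ]


print(select(ROWS, pruning=True))
print(select(ROWS, knowledge_distillation=True, framework="pyt"))
```

Extending `ROWS` with the remaining table rows and columns follows the same pattern.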