Purpose-Built Pretrained Models
===============================

.. _purpose-built_models:

The purpose-built AI models packaged with TLT fall into two broad categories:

1. :ref:`Computer Vision<computer_vision>`
2. :ref:`Conversational AI<conversational_ai>` 


Computer Vision
---------------

.. _computer_vision:

The purpose-built models shipped with the TLT-CV package are trained on thousands of images and can be used
in smart city, retail, public safety, and healthcare applications. There are two versions of these models:
trainable and deployable. Both versions are available on NGC.

The trainable models, also referred to as :code:`trainable` or :code:`unpruned` models, are used
with TLT to re-train with your dataset.
The :code:`deployable` or :code:`pruned` models, on the other hand, are deployment ready and can be
deployed directly on your edge device. In addition, a deployable model may contain a calibration
table for running inference in INT8 precision. The pruned INT8 model provides the highest inference throughput.

The table below shows the network architecture, number of classes, and accuracy of each model, measured on our dataset.

+--------------------------+---------------------------------+-----------------------+----------------+
|**Model Name**            | **Network Architecture**        | **Number of classes** | **Accuracy**   |
+--------------------------+---------------------------------+-----------------------+----------------+
|TrafficCamNet             | DetectNet_v2-ResNet18           | 4                     | 83.5% mAP      |
+--------------------------+---------------------------------+-----------------------+----------------+
|PeopleNet                 | DetectNet_v2-ResNet34           | 3                     | 84% mAP        |
|                          +---------------------------------+-----------------------+----------------+
|                          | DetectNet_v2-ResNet18           | 3                     | 80% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|DashCamNet                | DetectNet_v2-ResNet18           | 4                     | 80% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|FaceDetect-IR             | DetectNet_v2-ResNet18           | 1                     | 96% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|VehicleMakeNet            | ResNet18                        | 20                    | 91% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|VehicleTypeNet            | ResNet18                        | 6                     | 96% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|Emotion Recognition       | 5 Fully Connected Layers        | 6                     | 0.91 F1 score  |
+--------------------------+---------------------------------+-----------------------+----------------+
|Gesture Recognition       | ResNet18                        | 6                     | 0.85 F1 score  |
+--------------------------+---------------------------------+-----------------------+----------------+
|License Plate Detection   | DetectNet_v2-ResNet18           | 1                     | 98% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|License Plate Recognition | Tuned ResNet18                  | 36(US) / 68(CH)       | 97%(US)/99%(CH)|
+--------------------------+---------------------------------+-----------------------+----------------+
|Gaze Estimation           | Four branch AlexNet based model | NA                    | 6.5 RMSE       |
+--------------------------+---------------------------------+-----------------------+----------------+
|Facial Landmark           | Recombinator networks           | NA                    | 6.1 pixel error|
+--------------------------+---------------------------------+-----------------------+----------------+
|FaceDetect                | DetectNet_v2-ResNet18           | 1                     | 85.3 mAP       |
+--------------------------+---------------------------------+-----------------------+----------------+
|PeopleSegNet              | MaskRCNN-ResNet50               | 1                     | 85% mAP        |
+--------------------------+---------------------------------+-----------------------+----------------+
|Heart Rate Estimation     | Two branch model with attention | NA                    | 0.7 BP         |
+--------------------------+---------------------------------+-----------------------+----------------+

Training
^^^^^^^^

PeopleNet, TrafficCamNet, DashCamNet, FaceDetect-IR, and License Plate Detection are detection models based on
DetectNet_v2 with either a ResNet18 or a ResNet34 backbone. To re-train these models with your data,
use the unpruned model from NGC and follow the DetectNet_v2 object detection training guidelines
from chapters :ref:`Preparing the Input Data Structure <dummy_header>` to
:ref:`Exporting the model<dummy_header>`. The entire training workflow is given in the
prior section. You can also download the DetectNet_v2 Jupyter notebook from NGC resources.

VehicleMakeNet and VehicleTypeNet are classification models based on the ResNet18 backbone.
To re-train these models, use the unpruned model from NGC and follow the image classification
training guidelines from chapters :ref:`Preparing the Input Data Structure
<dummy_header>` to :ref:`Exporting the model <dummy_header>`. You can also use the classification Jupyter notebook from NGC resources.

The table below provides more information about the trainability of the pre-trained models.
It shows the image requirement, annotation format, and model output format, and whether the training pipeline supports pruning and INT8 quantization.

+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|**Model Name**            | **Image Type**         | **Annotation format** | **Model output format** |**Prunable**|**INT8 supported**|
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|TrafficCamNet             | RGB                    | KITTI                 | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|PeopleNet                 | RGB                    | KITTI                 | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|DashCamNet                | RGB                    | KITTI                 | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|FaceDetect-IR             | IR                     | KITTI                 | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|VehicleMakeNet            | RGB                    |                       | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|VehicleTypeNet            | RGB                    |                       | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|Emotion Recognition       | 68 facial points       | JSON file (NV format) | Encrypted ONNX (.etlt)  |   No       |     No           |
|                          | (no image required)    |                       |                         |            |                  |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|Gesture Recognition       | RGB                    | JSON file (NV format) | Encrypted ONNX (.etlt)  |   No       |     No           |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|License Plate Detection   | RGB                    | KITTI                 | Encrypted UFF (.etlt)   |   Yes      |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|License Plate Recognition | RGB                    | TXT                   | Encrypted ONNX (.etlt)  |   No       |     No           | 
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|Gaze Estimation           | Grayscale (1 channel)  | JSON file (NV format) | Encrypted ONNX (.etlt)  |   No       |     No           |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|Facial Landmark           | Grayscale (1 channel)  | JSON file (NV format) | Encrypted ONNX (.etlt)  |   No       |     No           |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|FaceDetect                | Grayscale (3 channels) | JSON file (NV format) | Encrypted UFF (.etlt)   |   No       |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|PeopleSegNet              | RGB                    | COCO                  | Encrypted UFF (.etlt)   |   No       |     Yes          |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
|Heart Rate Estimation     | BGR                    | JSON file (NV format) | Encrypted ONNX (.etlt)  |   No       |     No           |
+--------------------------+------------------------+-----------------------+-------------------------+------------+------------------+
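
Several of the detection models above use KITTI-format annotations. As a rough illustration of what such
a label file contains, the sketch below reads the class name and 2D bounding box from each line. It assumes
the standard 15-field KITTI label layout (type, truncated, occluded, alpha, four bounding-box coordinates,
three dimensions, three location values, rotation). The names ``KittiBox`` and ``parse_kitti_labels`` are
hypothetical helpers introduced for this example and are not part of TLT.

.. code-block:: python

   from typing import List, NamedTuple


   class KittiBox(NamedTuple):
       """Class name and 2D box (pixels) taken from one KITTI label line."""
       label: str
       left: float
       top: float
       right: float
       bottom: float


   def parse_kitti_labels(text: str) -> List[KittiBox]:
       """Parse KITTI label text, keeping only the class name and 2D box.

       Assumes the standard 15-field layout; fields 4 to 7 (0-based) hold
       the left, top, right, and bottom box coordinates.
       """
       boxes = []
       for line in text.splitlines():
           fields = line.split()
           if not fields:
               continue  # skip blank lines
           if len(fields) < 15:
               raise ValueError(f"expected at least 15 fields, got {len(fields)}")
           left, top, right, bottom = (float(v) for v in fields[4:8])
           boxes.append(KittiBox(fields[0], left, top, right, bottom))
       return boxes


   # Two made-up label lines; unused 3D fields are set to zero.
   example = (
       "car 0.00 0 0.00 100.0 120.0 300.0 260.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00\n"
       "person 0.00 0 0.00 50.0 60.0 90.0 200.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00\n"
   )
   for box in parse_kitti_labels(example):
       print(box.label, box.left, box.top, box.right, box.bottom)

A check like this can help catch malformed label lines before starting a training run.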

Deployment
^^^^^^^^^^

You can deploy your own trained model, or the provided pruned (deployable) model, on any edge device using DeepStream or the TLT CV inference pipeline.
For deployment on DeepStream, see the individual model sections. For deployment using the TLT CV inference pipeline, see the TLT CV inference pipeline section in this guide.

The performance across various NVIDIA platforms is summarized in the table below. The figures are inference
performance measured using the `trtexec`_ tool in TensorRT samples.

.. _trtexec: https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

.. image:: ../content/perf_table.png
   :align: center

The CV models below can be deployed on any NVIDIA datacenter GPU, such as A100, T4, or Quadro, or on an embedded platform such as NVIDIA Jetson.

The table below shows the inference pipeline for each model and whether inference is supported on the DLA (Deep Learning Accelerator) on Jetson AGX Xavier or Xavier NX.

+--------------------------+---------------------------+-------------------+
|**Model Name**            | **Inference Pipeline**    | **DLA supported** |
+--------------------------+---------------------------+-------------------+
|TrafficCamNet             | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|PeopleNet                 | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|DashCamNet                | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|FaceDetect-IR             | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|VehicleMakeNet            | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|VehicleTypeNet            | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|Emotion Recognition       | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+
|Gesture Recognition       | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+
|License Plate Detection   | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|License Plate Recognition | DeepStream                |   No              |
+--------------------------+---------------------------+-------------------+
|Gaze Estimation           | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+
|Facial Landmark           | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+
|FaceDetect                | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+
|PeopleSegNet              | DeepStream                |   Yes             |
+--------------------------+---------------------------+-------------------+
|Heart Rate Estimation     | TLT CV Inference pipeline |   No              |
+--------------------------+---------------------------+-------------------+

TrafficCamNet
~~~~~~~~~~~~~

`TrafficCamNet`_ is a 4-class object detection network built on NVIDIA’s detectnet_v2 architecture
with ResNet18 as the backbone feature extractor. It’s trained on 544x960 RGB images to detect
cars, persons, road signs, and two-wheelers. The dataset contains images from real traffic
intersections in US cities (at a vantage point of about 20 ft). This model is trained to
overcome the problem of separating a line of cars as they come to a stop at a red traffic light or
a stop sign. This model is ideal for smart city applications, where you want to count the number
of cars on the road and understand the flow of traffic.

.. _TrafficCamNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_trafficcamnet

PeopleNet
~~~~~~~~~

`PeopleNet`_ is a 3-class object detection network built on NVIDIA’s detectnet_v2 architecture with
ResNet34 as the backbone feature extractor. It’s trained on 544x960 RGB images to detect person,
bag, and face. Several million images of both indoor and outdoor scenes were labeled in-house to
adapt to a variety of use cases, such as airports, shopping malls and retail stores. This
dataset contains images from various vantage points. PeopleNet can be used for smart places or
building applications where you need to accurately count people in a crowded environment for
security or higher level business insights.

.. _PeopleNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet

DashCamNet
~~~~~~~~~~

`DashCamNet`_ is a 4-class object detection network built on NVIDIA’s detectnet_v2 architecture
with ResNet18 as the backbone feature extractor. It’s trained on 544x960 RGB images to detect
cars, pedestrians, traffic signs, and two-wheelers. The training data for this network contains
real images collected, annotated, and curated in-house from different dashboard cameras in cars
at a vantage point about 4-5 ft high. Unlike the other models, the camera in this case is
moving. The use case for this model is to identify objects from a moving platform, such as
a car or a robot.

.. _DashCamNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_dashcamnet

FaceDetect-IR
~~~~~~~~~~~~~

`FaceDetect_IR`_ is a single class face detection network built on NVIDIA’s detectnet_v2
architecture with ResNet18 as the backbone feature extractor. The model is trained on 384x240x3
IR (infrared) images augmented with synthetic noise and is intended for use cases where the
person’s face is close to the camera, such as a laptop camera during video conferencing or a
camera placed inside a vehicle to observe a distracted driver. When infrared illuminators are
used, this model can continue to work even when visible light conditions are too dark
for normal color cameras.

.. _FaceDetect_IR: https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetectir

VehicleMakeNet
~~~~~~~~~~~~~~

`VehicleMakeNet`_ is a classification network based on ResNet18, which aims to classify car images
of size 224 x 224. This model can identify 20 popular car makes. VehicleMakeNet is generally
cascaded with DashCamNet or TrafficCamNet for smart city applications. For example, DashCamNet
or TrafficCamNet acts as a primary detector, detecting the objects of interest, and for each
detected car VehicleMakeNet acts as a secondary classifier, determining the make of the car.
Businesses such as smart parking or gas stations can use insights into the make of vehicles
to understand their customers.

.. _VehicleMakeNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_vehiclemakenet
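
The primary-detector and secondary-classifier cascade described above can be sketched in a few lines.
This is a minimal, framework-agnostic illustration, not the DeepStream implementation: ``detect`` and
``classify`` are hypothetical callables standing in for a detector such as TrafficCamNet and a classifier
such as VehicleMakeNet, and the image is modeled as a plain list of pixel rows.

.. code-block:: python

   from typing import Callable, List, Tuple

   Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
   Image = List[List[int]]          # stand-in image type: rows of pixel values


   def crop(image: Image, box: Box) -> Image:
       """Return the sub-image covered by the box."""
       left, top, right, bottom = box
       return [row[left:right] for row in image[top:bottom]]


   def cascade(
       image: Image,
       detect: Callable[[Image], List[Tuple[str, Box]]],
       classify: Callable[[Image], str],
       target_label: str = "car",
   ) -> List[Tuple[Box, str]]:
       """Run the primary detector, then classify each detected target crop."""
       results = []
       for label, box in detect(image):
           if label != target_label:
               continue  # only detected cars go to the secondary classifier
           results.append((box, classify(crop(image, box))))
       return results


   # Stub models (hypothetical) so the sketch runs end to end.
   def fake_detect(image: Image) -> List[Tuple[str, Box]]:
       return [("car", (0, 0, 2, 2)), ("person", (1, 1, 3, 3))]


   def fake_classify(car_crop: Image) -> str:
       return f"make-for-{len(car_crop)}x{len(car_crop[0])}-crop"


   frame = [[0] * 4 for _ in range(4)]
   print(cascade(frame, fake_detect, fake_classify))

Only the crops labeled as cars reach the secondary model, which keeps the classifier's workload
proportional to the number of detected vehicles rather than to the frame size.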

VehicleTypeNet
~~~~~~~~~~~~~~

`VehicleTypeNet`_ is a classification network based on ResNet18, which aims to classify cropped
vehicle images of size 224 x 224 into 6 classes: Coupe, Large Vehicle, Sedan, SUV, Truck, and
Vans. The typical use case for this model is in smart city applications such as a smart garage
or toll booth, where you can charge based on the size of the vehicle.

.. _VehicleTypeNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_vehicletypenet

Emotion Recognition (EmotionNet)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`EmotionNet`_ is a classification network based on 5 Fully Connected Layers, which aims to classify
human emotion into 6 classes: disgust, happy, neutral, scream, squint, and surprise. One use case for
this model is retail vision and analytics. With person detection,
speech recognition, and emotion detection, customers can build a solution to understand store activity.
The goal is for retailers to analyze and optimize activity in physical store locations.

.. _EmotionNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_emotionnet

Gaze Estimation (GazeNet)
~~~~~~~~~~~~~~~~~~~~~~~~~

`GazeNet`_ is a network that predicts the eye gaze point of regard and the gaze vector. The network requires four inputs:
face, left eye, right eye, and facegrid. The architecture of the model has four branches. The face, left eye,
and right eye branches are AlexNet-based. The facegrid branch has two fully connected layers. One use case of
GazeNet in deployment is to use gaze values as a threshold to determine when to enable speech recognition. This
"Looking At" feature enables speech recognition only when the subjects are looking at the camera.

.. _GazeNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_gazenet
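
The "Looking At" gating described above amounts to a threshold test on the predicted gaze. The sketch
below is one possible reading, not the shipped implementation: it assumes the point of regard is expressed
in a camera-centered coordinate system with the camera at the origin, and it treats the subject as looking
at the camera when that point falls within a chosen radius. The function name ``is_looking_at_camera`` and
the default radius are hypothetical.

.. code-block:: python

   import math
   from typing import Tuple


   def is_looking_at_camera(
       point_of_regard: Tuple[float, float],
       radius: float = 100.0,
   ) -> bool:
       """Return True when the gaze point lies within `radius` of the camera.

       Assumes the camera sits at (0, 0) and that `point_of_regard` and
       `radius` use the same length unit.
       """
       x, y = point_of_regard
       return math.hypot(x, y) <= radius


   # Enable speech recognition only while the subject looks at the camera.
   for gaze_point in [(30.0, 40.0), (300.0, 400.0)]:
       enabled = is_looking_at_camera(gaze_point, radius=60.0)
       print(gaze_point, "speech recognition enabled:", enabled)

In practice the radius would need tuning for the camera placement and the accuracy of the gaze estimate.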

Heart Rate Estimation (HeartRateNet)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`HeartRateNet`_ is a heart rate estimation network, which aims to estimate the heart rate
from RGB facial videos. This is a two-branch model with an attention mechanism that takes in a
motion map of size 72 x 72 x 3 and an appearance map of size 72 x 72 x 3, both derived from RGB
face videos.

.. _HeartRateNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_heartratenet

Facial Landmark (FPENet)
~~~~~~~~~~~~~~~~~~~~~~~~

`FPENet`_ is a fiducial point estimation network, which aims to predict the (x,y) location of keypoints for a given input face image. FPENet is generally used in conjunction with a face detector, and the output is commonly used for face alignment, head pose estimation, emotion detection, eye blink detection, and gaze estimation, among others.

.. _FPENet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_fpenet

Gesture Recognition (GestureNet)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`GestureNet`_ is a classification network based on ResNet18, which aims to classify cropped
hand images of size 160 x 160 into 6 classes: Thumbs Up, Fist, Stop, Ok, Two and Random. One deployment use case for
this model is for human machine interaction.

.. _GestureNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_gesturenet

License Plate Detection (LPDNet)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`LPDNet`_ is a license plate detection network. 

.. _LPDNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_lpdnet

License Plate Recognition (LPRNet)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`LPRNet`_ is a license plate recognition network trained on license plates in the US and China.

.. _LPRNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_lprnet

FaceDetect 
~~~~~~~~~~

`FaceDetect`_ is a face detection model that takes in 3-channel RGB images and detects a person's face.

.. _FaceDetect: https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetect

PeopleSegNet
~~~~~~~~~~~~

`PeopleSegNet`_ is an instance segmentation network to detect and localize people in a crowded environment.

.. _PeopleSegNet: https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplesegnet


Conversational AI
-----------------

.. _conversational_ai:

The purpose-built models shipped with the TLT Conversational AI package can be used directly in tasks such as
answering questions across multiple domains and improving sentence semantics. They can also be re-trained or
fine-tuned to deploy a Conversational AI application, such as a virtual assistant that serves customers in fields like
financial services, legal services, insurance, and customer service.

The table below shows the network architecture and the application area in which each model is trained. These
models can be re-trained or fine-tuned to change the domain or language according to the user's requirements.

.. csv-table:: Purpose Built Models for Conversational AI
   :file: ../content/conversational_ai_pbm.csv
   :widths: 30,30,40,60
   :class: longtable
   :header-rows: 1