Overview¶

The NVIDIA Transfer Learning Toolkit (TLT) is used with NVIDIA pre-trained models to create custom Computer Vision (CV) and Conversational AI models with the user’s own data. Training AI models using TLT does not require expertise in AI or deep learning. A simplified Command Line Interface (CLI) abstracts away AI framework complexity enabling users to build production quality AI models using a simple spec file and one of the NVIDIA pre-trained models. With a basic understanding of deep learning and minimal to zero coding required, TLT users are able to:

Fine-tune models for CV use cases such as object detection, image classification, segmentation, CR, Key point estimation using NVIDIA pre-trained CV models.
Fine-tune models for Conversational AI use cases such as Automatic Speech Recognition (ASR) or Natural Language Processing (NLP) using NVIDIA pre-trained Conversational AI models.
Add new classes to an existing pre-trained model.
Re-train a model to adapt to different use cases.
Use the model pruning capability on CV models to reduce the overall size of the model.

TLT users create custom AI models by modifying the training hyperparameters defined in each spec file. This guide includes sample spec files and paramter definitions for all TLT supported models. Refer to the corresponding Computer Vision or Conversational AI section for desired model. Along with creating accurate AI models, the TLT is also capable of optimizing models for inference to achieve the highest throughput for deployment.

Note

What is Transfer Learning?
Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where a model trained on one task is re-trained for use on a different task. This works surprisingly well as many of the early layers in a neural network are the same for similar tasks. For example, many of the early layers in a convolutional neural network used for a CV model are primarily used to identify outlines, curves, and other features in an image. The learned features from these layers can be applied to similar tasks carrying out the same identification in other domains. With transfer learning, less data is required to accurately train a model compared to training that same model from scratch. To learn more about transfer learning, read this blog.

The TLT is a Python package hosted on the NVIDIA Python Package Index. It interacts with lower-level TLT dockers available from the NVIDIA GPU Accelerated Container Registry (NGC); TLT containers come pre-installed with all dependencies required for training. The CLI is run from Jupyter notebooks packaged inside each docker container and consists of a few simple commands such as data augmentation, train, evaluate, infer, prune, and export. The output of the TLT workflow is a trained model that can be deployed for inference on NVIDIA devices using DeepStream, TensorRT, Jarvis, and the TLT CV Inference Pipeline.

The TLT application layer is built on top of CUDA-X which contains all the lower level NVIDIA libraries. These include NVIDIA Container Runtime for GPU acceleration, CUDA and cuDNN for deep learning (DL) operations, and TensorRT for optimization and generating TensorRT compatible models for deployment. TensorRT is the NVIDIA inference optimization and runtime engine. Models that are generated with TLT are completely accelerated for TensorRT, so users can expect maximum inference performance without any extra effort.

With TLT, users can train CV models for tasks such as object detection, image classification instance and semantic segmentation, optical character recognition (OCR), Facial landmark estimation, gaze estimation and more. Vision modality is used to extract insights from images or videos. For more information about all the CV models refer to the Computer Vision section of this documentation.

Users can also train Conversational AI models. Coversational AI is a term used to refer to a set of technologies which enable building intelligence capable of having a free flowing conversation. Building Conversational AI systems and applications is generally hard, and tailoring them to the needs of your enterprise is even harder. If you are interested in using TLT for Conversational AI applications, you can find more details in the ASR and NLP sections of this documentation. To learn more about deployment with Jarvis, refer to the Jarvis documentation and the section on Integrating TLT Trained Models to Jarvis; here you can find more details on how to use Jarvis for deploying Conversational AI models trained using TLT.

Model pruning is one of the key differentiators for the TLT. Pruning involves removing nodes in the neural network which contribute less to the overall accuracy of the model. With pruning, users are able to reduce the overall size of the model significantly resulting in a much lower memory footprint and higher inference throughput, which are very important for edge deployment. Currently pruning is only supported on most CV models and not on Conversational AI models. The following graph provides an example of performance gains achieved when going from an unpruned CV model to a pruned CV model (inference was run on an NVIDIA T4; TrafficCamNet, DashCamNet and PeopleNet are three of the custom pre-trained models that are available on NGC).

Pre-trained Models¶

There are two types of pre-trained models that users can start with:

Purpose-built pre-trained models. These are highly accurate models that are trained on thousands of data inputs for a specific task. These domain focused models can either be used directly for inference or can be used with TLT for transfer learning on users own dataset.
General purpose vision models. The pre-trained weights for these models merely act as a starting point to build more complex models. For computer vision usecases, these pre-trained weights are trained on Open image datasets and they provide a much better starting point for training versus starting from random initialization of weights.

Users are able to choose from 100+ permutations of model architecture and backbone with the general purpose vision models. For more information on finetuning models for conversational AI use cases, see the pretrained models section for Conversational AI.

Purpose-built models are built for high accuracy and performance. These models can be deployed out of the box for applications in smart city, retail, public safety, healthcare, and others or can also be used to re-train with user’s own data. All models are trained on thousands of proprietary images and can achieve very high accuracy on NVIDIA test data. More information about each of these models is available in individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail, hospitals, etc. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

Model Name	Network Architecture	Number of classes	Accuracy	Use Case
TrafficCamNet	DetectNet_v2-ResNet18	4	84% mAP	Detect and track cars.
PeopleNet	DetectNet_v2-ResNet18/34	3	84% mAP	People counting, heatmap generation, social distancing.
DashCamNet	DetectNet_v2-ResNet18	4	80% mAP	Identify objects from a moving object.
FaceDetectIR	DetectNet_v2-ResNet18	1	96% mAP	Detect face in a dark environment with IR camera.
VehicleMakeNet	ResNet18	20	91% mAP	Classifying car models.
VehicleTypeNet	ResNet18	6	96% mAP	Classifying type of cars as coupe, sedan, truck, etc.
PeopleSegNet	MaskRCNN-ResNet50	1	85% mAP	Creates segmentation masks around people, provides pixel
License Plate Detection	DetectNet_v2-ResNet18	1	98% mAP	Detecting and localizing License plates on vehicles
License Plate Recognition	Tuned ResNet18	36(US) / 68(CH)	97%(US)/99%(CH)	Recognize License plates numbers
Gaze Estimation	Four branch AlexNet based model	NA	6.5 RMSE	Detects person’s eye gaze
Facial Landmark	Recombinator networks	NA	6.1 pixel error	Estimates key points on person’s face
Heart Rate Estimation	Two branch model with attention	NA	0.7 BPM	Estimates person’s heartrate from RGB video
Gesture Recognition	ResNet18	6	0.85 F1 score	Recognize hand gestures
Emotion Recognition	5 Fully Connected Layers	6	0.91 F1 score	Recognize facial Emotion
FaceDetect	DetectNet_v2-ResNet18	1	85.3 mAP	Detect faces from RGB or grayscale image

With the general purpose models, users can train an image classification model, object detection model or an instance segmentation model. For classification, users can train using one of 15 available architectures such as ResNet, EfficientNet, VGG, MobileNet, GoogLeNet, SqueezeNet or DarkNet architecture. For object detection tasks, users can choose from wildly popular YOLOV3/V4, FasterRCNN, SSD as well as RetinaNet, DSSD and NVIDIA’s own DetectNet_v2 architecture. Finally, for instance segmentation, users can use the MaskRCNN for instance segmentation or UNET for semantic segmentation. This gives users the flexibility and control to build AI models for any number of applications, from smaller light weight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, refer to the table below and see the Open Model Architectures section.

TLT Workflow Overview¶

The goal of TLT is to train and fine-tune a model using the user’s own dataset. In the workflow diagram shown below, a user typically starts with a pre-trained model from NGC; either the highly accurate purpose-built model or just the pre-trained weights of the architecture of their choice. The other input is the user’s own dataset. The dataset is fed into the data converter, which can augment the data while training to introduce variations in the dataset. This is very important in training as the data variation improves the overall quality of the model and prevents overfitting. Users can also do offline augmentation with TLT, where the dataset is augmented before training. More information about offline augmentation is provided in Augmenting a Dataset.

Once the dataset is prepared and augmented, the next step in the training process is to start training. The training hyperparameters are chosen through the spec file. More information about these hyperparameters can be found in the respective model section below. After the first training phase, users evaluate the model against a test set to see how the model works on the data it has never seen before. Once the model is deemed accurate, the next step is model pruning. If accuracy is not as expected, then the user might have to tune some hyperparameters and re-train. Training is a very iterative process, so you might have to try a few times before converging on the right model.

In model pruning, TLT will algorithmically remove neurons from the neural network which does not contribute significantly to the overall accuracy. Model pruning is only supported on CV models, so if you are training a Conv AI model, you can skip this step and go directly to model export. The model pruning step will inadvertently reduce the accuracy of the model. So after pruning, the next step is to re-train the model on the same dataset to recover the lost accuracy. After re-train, the user will evaluate the model on the same test set. If the accuracy is back to what was before pruning, then the user can move to the model export step. At this point, the user feels confident in accuracy of the model as well as inference performance. The exported model will be in ‘.etlt’ format which can be deployed directly on any NVIDIA GPU using DeepStream, Jarvis and TensorRT. In the export step, users can optionally generate an INT8 calibration cache that quantizes the floating-point weights to integer. Running inference at INT8 precision can provide more than 2x performance over FP16 or FP32 precision without sacrificing the accuracy of the model. To learn more about model export and deployment, see the Exporting the model and Deploying to DeepStream sections per task.

Once exported, any computer vision model in TLT may be deployed to a TensorRT optimized engine using the TLT converter. The following table outlines the compatibility of TLT converter with TensorRT across the x86 and jetson platforms.

TLT Converter Support Matrix for x86¶
CUDA/CUDNN	TensorRT	Platform
10.2/8.0	7.2	cuda102-cudnn80-trt72
11.0/8.0	7.2	cuda110-cudnn80-trt72
11.1/8.0	7.2	cuda111-cudnn80-trt72
10.2/8.0	7.1	cuda102-cudnn80-trt71
11.0/8.0	7.1	cuda110-cudnn80-trt71

TLT Converter Support Matrix for Jetson¶
JetPack Version	Availability
4.4	cuda102-trt71-jp44
4.5	cuda102-trt71-jp45

To learn more about how to use TLT, read the technical blogs which provide step-by-step guide to training with TLT:

Learn to Train with PeopleNet and other pre-trained model using TLT.
Learn how to train Instance segmentation model using MaskRCNN with TLT.
Learn how to improve INT8 accuracy using Quantization aware training(QAT) with TLT.
Learn how to Create a real time license plate detection and recognition app
Learn how to Prepare state of the art models for classification and object detection with TLT
Learn more on Building and Deploying Conversational AI models Using the NVIDIA Transfer Learning Toolkit

If you have any questions when using TLT to train a model and deploy to Jarvis or DeepStream, please log them at the