NVIDIA TAO Release tlt.20

The Transfer Learning Toolkit for Intelligent Video Analytics Guide provides instructions on using transfer learning for video and image analysis.

“Transfer learning” is the process of transferring learned features from one application to another. It is a commonly used training technique in which you take a model trained on one task and re-train it for a different task. It works well because many of the early layers in a neural network primarily identify outlines, curves, and other low-level features in an image, and these features transfer readily to other domains. For example, suppose you want to identify different breeds of dogs but have only a few images per breed. You can take a model that was trained to recognize animals and apply transfer learning to train it to recognize dog breeds using your own images. The features learned for recognizing animals carry over to your use case.
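The idea of reusing early layers can be illustrated with a toy sketch in plain Python. This is not how TLT is implemented; `frozen_features`, `train_head`, and `predict` are hypothetical names, and the "pre-trained layers" are simulated by a fixed function whose parameters are never updated. Only the small classification head on top is trained on the new data.

```python
import math

def frozen_features(x):
    """Stand-in for pre-trained early layers; nothing in here is ever updated."""
    return [x[0] + x[1], x[0] - x[1]]

def train_head(samples, labels, epochs=200, lr=0.1):
    """Train only a logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# A tiny "new task" dataset: label is 1 when the first value exceeds the second.
samples = [(2, -1), (-1, -3), (-2, 1), (1, 3), (3, 1), (-3, -1)]
labels = [1, 1, 0, 0, 1, 0]

w, b = train_head(samples, labels)
print([predict(w, b, x) for x in samples])
```

Because the feature extractor is fixed, only three numbers are learned here, which is why a handful of samples is enough. The same reasoning explains why transfer learning needs less data than training every layer from scratch.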

Transfer learning is very useful when data collection and annotation are difficult or expensive. It requires less data to reach a given accuracy than training from scratch. Because you train on a smaller dataset, training is faster and the cost of collecting and annotating data is lower. To learn more about transfer learning, read this blog.


NVIDIA Transfer Learning Toolkit (TLT) is a simple, easy-to-use training toolkit that requires little or no coding to create vision AI models from the user’s own data. With TLT, users can transfer learning from NVIDIA pre-trained models to their own models. They can add new classes to an existing pre-trained model or re-train the model to adapt it to their use case. They can also use the model pruning capability to reduce the overall size of the model.

Getting started with TLT is easy. Training AI models with TLT does not require expertise in AI or deep learning. Users with a basic knowledge of deep learning can start building their own custom models using a simple spec file and a pre-trained model.


Transfer Learning Toolkit is a simplified toolkit in which users start with NVIDIA pre-trained models and their own custom dataset. The toolkit is available as a Docker container that can be downloaded from NGC, the NVIDIA GPU Cloud registry. The container comes with all the dependencies required for training. For more information about TLT requirements and installation, see TLT Requirements and Installation. The pre-trained models can also be downloaded from NGC. The toolkit provides a command line interface (CLI) that can be run from the Jupyter notebooks packaged inside the Docker container. The CLI consists of a few simple commands for tasks such as data augmentation, training, pruning, and model export. The output of TLT is a trained model that can be deployed for inference on NVIDIA edge devices using DeepStream and TensorRT.

TLT builds on top of the CUDA-X stack, which contains all the lower-level NVIDIA libraries. These are the NVIDIA Container Runtime for GPU acceleration from within containers, CUDA and cuDNN for many deep learning operations, and TensorRT for generating TensorRT-compatible models for deployment. TensorRT is NVIDIA’s inference runtime, which optimizes the runtime model for the targeted hardware. Models generated with TLT are fully accelerated with TensorRT, so users can expect high inference performance without extra effort.

TLT is designed to run on x86 systems with an NVIDIA GPU, such as a GPU-powered workstation or a DGX system, and can also be run in any cloud with an NVIDIA GPU. For inference, models can be deployed on edge devices such as the embedded Jetson platform or on data center GPUs such as the T4.

Model pruning is one of the key differentiators for TLT. Pruning means removing nodes in the neural network that contribute less to the overall accuracy of the model. With pruning, users can significantly reduce the overall size of the model, which results in a much lower memory footprint and higher inference throughput. Both are very important for edge deployment. The graph below shows the performance gain from going from an unpruned model to a pruned model on NVIDIA T4. TrafficCamNet, DashCamNet, and PeopleNet are three of the pre-trained models available on NGC. More on these models below.
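To make the pruning idea concrete, here is a minimal sketch of one common criterion: rank each neuron by the L2 norm of its weights, normalize by the largest norm in the layer, and drop neurons that fall below a threshold. This criterion is an assumption chosen for illustration and is not necessarily what TLT does internally; `prune_neurons` and `threshold` are hypothetical names.

```python
import math

def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))

def prune_neurons(layer, threshold):
    """Keep neurons whose weight norm, relative to the largest norm
    in the layer, is at least `threshold`."""
    norms = [l2_norm(neuron) for neuron in layer]
    largest = max(norms)
    if largest == 0:
        return []  # every neuron is all zeros; nothing is worth keeping
    return [n for n, norm in zip(layer, norms) if norm / largest >= threshold]

# Each inner list holds the incoming weights of one neuron.
layer = [
    [0.9, -1.2, 0.4],
    [0.01, 0.02, -0.01],
    [0.5, 0.5, -0.7],
    [0.0, 0.03, 0.0],
]

pruned = prune_neurons(layer, threshold=0.1)
print(len(layer), "neurons before,", len(pruned), "after")
```

A higher threshold removes more neurons and shrinks the model further, at the cost of a larger accuracy drop to recover during re-training.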


There are two types of pre-trained models that users can start with. The first type is the purpose-built pre-trained models. These are highly accurate models that are trained on millions of objects for a specific task. The second type is the meta-architecture vision models. The pre-trained weights for these models act only as a starting point for building more complex models. These weights are trained on the Open Images dataset and provide a much better starting point for training than random weights. With the second type, users can choose from 80+ permutations of model architecture and backbone. See the illustration below.


The purpose-built models are built for high accuracy and performance. These models can be deployed out of the box for smart city or smart places applications, or they can be re-trained with the user’s own data. All 6 models are trained on millions of objects and can achieve more than 80% accuracy on our test data. More information about each of these models is available in the Purpose-Built Models section or in the individual model cards. Typical use cases and some model KPIs are provided in the table below. PeopleNet can be used for detecting and counting people in smart buildings, retail, hospitals, etc. For smart traffic applications, TrafficCamNet and DashCamNet can be used to detect and track vehicles on the road.

| Model Name | Network Architecture | Number of classes | Use Case |
|  |  |  | Detect and track cars. |
|  |  |  | People counting, heatmap generation, social distancing. |
|  |  |  | Identify objects from a moving object. |
|  |  |  | Detect face in a dark environment with IR camera. |
|  |  |  | Classifying car models. |
|  |  |  | Classifying type of cars as coupe, sedan, truck, etc. |

In the architecture-specific models bucket, users can train an image classification model, an object detection model, or an instance segmentation model. For classification, users can train with one of 13 available architectures, such as ResNet, VGG, MobileNet, GoogLeNet, SqueezeNet, or DarkNet. For object detection, users can choose from the popular YOLOV3, FasterRCNN, and SSD architectures as well as RetinaNet, DSSD, and NVIDIA’s own DetectNet_v2. Finally, for instance segmentation, users can use the MaskRCNN architecture. This gives users the flexibility and control to build AI models for a wide range of applications, from small, lightweight models for edge GPUs to larger models for more complex tasks. For all the permutations and combinations, see the table below and the Supported Model Architectures section.


The goal of TLT is to train and fine-tune a model using the user’s own dataset. In the workflow diagram shown below, a user typically starts with a pre-trained model from NGC: either a highly accurate purpose-built model or just the pre-trained weights of the architecture of their choice. The other input is the user’s own dataset. The dataset is fed into the data converter, which can augment the data during training to introduce variations in the dataset. This is very important in training because data variation improves the overall quality of the model and helps prevent overfitting. Users can also do offline augmentation with TLT, where the dataset is augmented before training. More information about offline augmentation is provided in Augmenting a Dataset.
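As an illustration of what augmentation during training involves for object detection, the sketch below randomly flips an image horizontally and adjusts its bounding boxes to match. It is a simplified stand-in and not TLT's augmentation code; `hflip`, `augment`, and `flip_prob` are hypothetical names, images are plain nested lists, and boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates.

```python
import random

def hflip(image, boxes):
    """Mirror an image left to right and move its boxes accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    flipped_boxes = [(width - x_max, y_min, width - x_min, y_max)
                     for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, flipped_boxes

def augment(image, boxes, rng, flip_prob=0.5):
    """Apply a horizontal flip with probability `flip_prob`."""
    if rng.random() < flip_prob:
        return hflip(image, boxes)
    return image, boxes

image = [[1, 2, 3],
         [4, 5, 6]]
boxes = [(0, 0, 1, 2)]  # a box covering the left-most column

rng = random.Random(0)  # seeded so a run is repeatable
for _ in range(3):
    print(augment(image, boxes, rng))
```

Applied on the fly, a transform like this gives the model a slightly different view of each sample every epoch. Applied once before training and saved to disk, the same transform would correspond to the offline augmentation mentioned above.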


Once the dataset is prepared and augmented, the next step is to start training. The training hyperparameters are set in the spec file. To learn about all the knobs that users can tune, see the Creating an Experiment Spec File section. After the first training phase, users evaluate the model against a test set to see how it performs on data it has never seen before. If accuracy is not as expected, the user might have to tune some hyperparameters and re-train. Training is an iterative process, so it might take a few attempts to converge on the right model. Once the model is deemed accurate, the next step is model pruning.

In model pruning, TLT algorithmically removes neurons that do not contribute significantly to the overall accuracy of the network. Pruning usually reduces the accuracy of the model as a side effect, so the next step is to re-train the pruned model on the same dataset to recover the lost accuracy. After re-training, the user evaluates the model on the same test set. If the accuracy is back to its pre-pruning level, the user can move on to the model export step, confident in both the accuracy and the inference performance of the model. The exported model is in ‘.etlt’ format, which can be deployed directly on any NVIDIA GPU using DeepStream and TensorRT. In the export step, users can optionally generate an INT8 calibration cache, which is used to quantize the floating-point weights to integers. Running inference at INT8 precision can provide more than 2x the performance of FP16 or FP32 precision without sacrificing the accuracy of the model. To learn more about model export and deployment, see the Exporting the model and Deploying to DeepStream sections.
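The role of calibration in INT8 quantization can be sketched with a simplified symmetric scheme: derive a scale from the largest absolute value seen, then map each float to an integer in the range -127 to 127. This max-abs rule is an assumption made for illustration; the calibration that TensorRT performs is more involved, and `calibrate_scale`, `quantize`, and `dequantize` are hypothetical names.

```python
def calibrate_scale(values):
    """Pick a scale so that the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map floats to integers in [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.27]
scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)

print(q_weights)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

The rounding error per value is at most half the scale, which is why a well-chosen scale lets INT8 inference stay close to the floating-point result.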

To learn more about how to use TLT, read the technical blogs, which provide step-by-step guides to training with TLT.

Use the Transfer Learning Toolkit to perform these tasks:

  1. Download the model: Download pre-trained models.

  2. Prepare the dataset: Convert and, optionally, augment your data for training.

  3. Train the model: Train or re-train on your data to create and refine models.

  4. Evaluate the model: Evaluate models for target predictions.

  5. Prune the model: Prune models to reduce size.

  6. Export the model: Export models for TensorRT inference.
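The order of these tasks, including the re-train step after pruning, can be summarized as a short control-flow sketch. Every callable here (`train`, `evaluate`, `prune`, `export`) is a hypothetical placeholder for the corresponding toolkit step, not a real TLT API, and the toy functions only simulate accuracy and model size so that the sketch can run on its own.

```python
def run_workflow(train, evaluate, prune, export, target, max_attempts=3):
    """Train until the target is met, prune, re-train, then export."""
    for attempt in range(1, max_attempts + 1):
        model = train(None, attempt)  # start from the pre-trained model
        if evaluate(model) >= target:
            break
    else:
        raise RuntimeError("target not reached; tune hyperparameters and retry")

    pruned = prune(model)           # smaller, but usually less accurate
    retrained = train(pruned, 1)    # recover the lost accuracy
    if evaluate(retrained) < target:
        raise RuntimeError("accuracy not recovered after pruning")
    return export(retrained)

# Toy stand-ins: a "model" is just a dict with an accuracy and a size.
def toy_train(model, attempt):
    if model is None:
        return {"acc": 0.70 + 0.06 * attempt, "size": 100}
    return {"acc": model["acc"] + 0.04, "size": model["size"]}

def toy_evaluate(model):
    return model["acc"]

def toy_prune(model):
    return {"acc": model["acc"] - 0.03, "size": model["size"] // 2}

def toy_export(model):
    return "model_size{}.etlt".format(model["size"])

result = run_workflow(toy_train, toy_evaluate, toy_prune, toy_export, target=0.80)
print(result)
```

In practice each step is a separate command driven by a spec file, and the decision to re-train or move on is made by the user after inspecting the evaluation results.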

© Copyright 2020, NVIDIA. Last updated on Nov 18, 2020.