Release Notes
NVIDIA TAO Toolkit is a Python package that lets NVIDIA customers fine-tune pretrained models with their own data and export them for TensorRT-based inference on an edge device.
NVIDIA Transfer Learning Toolkit has been renamed to TAO Toolkit. For a detailed migration guide, see this section.
Key Features
Bring your own models into TAO Toolkit using TAO BYOM converter.
Deploy TAO as a service on a Kubernetes cluster, as detailed in this section.
Integrate TAO into your workflow using REST APIs.
TensorBoard visualization is available for select models, as detailed in this section.
Train object detection networks from a pointcloud data file via PointPillars.
Train a classification network to classify poses from a pose skeleton via a Graph convolutional network.
Intermediate checkpointing is available for ASR and TTS models.
Support Conformer-CTC for ASR: train, finetune, evaluate, infer, and export.
Compute Stack
TF 1.15.4 Container
container name: nvcr.io/nvidia/tao/tao-toolkit-tf tag: v3.22.05-tf1.15.4-py3
Software | Version
python | 3.6
CUDA | 11.4
CuDNN | 8.2.1.32
TensorRT | 8.2.5.1
TF 1.15.5 Container
container name: nvcr.io/nvidia/tao/tao-toolkit-tf tag: v3.22.05-tf1.15.5-py3
Software | Version
python | 3.6
CUDA | 11.6
CuDNN | 8.2.1.32
TensorRT | 8.2.5.1
PyTorch Container
container name: nvcr.io/nvidia/tao/tao-toolkit-pyt tag: v3.22.05-py3
Software | Version
python | 3.8
CUDA | 11.5
CuDNN | 8.2.1.32
TensorRT | 8.2.5.1
Language Model Container
container name: nvcr.io/nvidia/tao/tao-toolkit-lm tag: v3.22.05-py3
Software | Version
python | 3.8
CUDA | 11.5
CuDNN | 8.2.1.32
TensorRT | 8.2.5.1
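Each container name and tag listed in this section combines into a pull reference in the usual name:tag form. The following is a minimal sketch of that composition; the CONTAINERS mapping, its keys, and the image_reference helper are illustrative names and are not part of TAO Toolkit.

```python
# Container names and tags copied from the Compute Stack section.
# The dictionary keys are hypothetical labels chosen for this sketch.
CONTAINERS = {
    "tf1.15.4": ("nvcr.io/nvidia/tao/tao-toolkit-tf", "v3.22.05-tf1.15.4-py3"),
    "tf1.15.5": ("nvcr.io/nvidia/tao/tao-toolkit-tf", "v3.22.05-tf1.15.5-py3"),
    "pyt": ("nvcr.io/nvidia/tao/tao-toolkit-pyt", "v3.22.05-py3"),
    "lm": ("nvcr.io/nvidia/tao/tao-toolkit-lm", "v3.22.05-py3"),
}


def image_reference(key: str) -> str:
    """Join a container name and tag into a single name:tag reference."""
    name, tag = CONTAINERS[key]
    return f"{name}:{tag}"


if __name__ == "__main__":
    # Print one pull command per container without running it.
    for key in CONTAINERS:
        print(f"docker pull {image_reference(key)}")
```

Keeping the name and tag as separate fields mirrors how they are listed here and makes it easy to swap a tag without retyping the registry path.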
Model Updates
Computer Vision
Image Classification
Add verification for custom classmap file input.
Add classmap file input to train.
Add classmap file as optional input for evaluate.
Add status callback and results_dir command line argument for evaluate and inference.
Support TensorBoard visualization for the train endpoint.
Perform initial updates for BYOM custom layer.
Add EFF package.
Add EFF package and model loading.
Enable BYOM in image classification.
DetectNet_v2
Limit GPU memory usage during tao detectnet_v2 evaluate.
Add native support to convert COCO Dataset to TFRecords.
Bring sampling mode parameter out in the spec file under dataset_config.
Enable TensorBoard visualization.
Add configuration element for visualizer in dataset_config.
Fix success state for TFRecords generation.
Add status logging to all tasks as long as the --results_dir argument is set via command line.
UNet
Update the --gen_ds_config option during UNet export.
Add the dataset_convert endpoint to UNet.
Add support for converting COCO Dataset to TFRecords.
Support evaluation on a pruned model.
Add graph collect for functions to improve memory consumption.
Optimize ONNX for UNet inference.
Fix bugs for re-training a pruned model.
Add unified status_logging to UNet endpoints.
Support custom layer pruning and direct evaluate from .tltb via BYOM.
Enable Bring Your Own Model for UNet.
Implement support for Quantization Aware Training (QAT).
Add end-to-end support for ShuffleNet.
Enable status logging during training via StatusCallBack.
Improve the operation of dataloader during training.
Enable TensorBoard visualization during training.
Add a warning for output_width.
Enable support for training with early stopping.
BYOM
Enable custom layer pruning for Bring Your Own Model (BYOM).
Common features
Fix error handling in model_io.
Support COCO TFRecord conversion for object detection and segmentation networks.
Fix a typo in SoftStartAnnealingLearningRateScheduler.
Implement status-logging callback.
YOLOv4
Enable smoothing to object loss.
Support exponential moving average (EMA).
Fix the YOLOv4 neck and head structure.
Configure NMS per data-loader configuration.
Fix YOLOv3 and YOLOv4 shapes.
Enable manually setting class weighting.
Enable TensorBoard visualization.
MaskRCNN
Enable skip_crowd_during_training=False.
Add an evaluation summary and patch exporter.
Enable TensorBoard visualization.
EfficientDet
Fix a typo in TRT inferencer.
SSD
Enable status logging for all endpoints when --results_dir is added to the command line.
Enable support for training with early stopping.
DSSD
Enable status logging for all endpoints when --results_dir is added to the command line.
Enable support for training with early stopping.
RetinaNet
Enable support for training with early stopping.
Enable status logging for all endpoints when --results_dir is added to the command line.
Fix a bug with resume checkpoint via sequence dataloader.
Enable backward compatibility for a TLT 2.0 trained model.
Enable TensorBoard visualization during training.
Enable manually setting class weights.
FasterRCNN
Enable status logging for all endpoints when --results_dir is added to the command line.
Enable model as a CLI argument of evaluation and inference for TAO API.
Enable TensorBoard visualization during training.
Conversational AI
Generic
Add status logging to TTS models similar to TAO Toolkit CV models
Fix issue in QA model evaluation for Chinese SQuAD-style dataset
Fix bug where create_tokenizer always silently used the old corpus
Update backend to use NeMo 1.7.0
TTS
Remove duration check for TTS dataset from Riva Custom Voice Recorder
Fix infer onnx endpoint when running infer from finetuned model
Fix error handling for Vocoder
Enable intermediate .tlt model checkpoint
PointPillars
Enabled transfer learning with pretrained models
Use TensorRT oss 22.02 from GitHub
Action Recognition
Update metrics module
ASR
Support Early Stopping
Finetune on NeMo models
Enable intermediate .tlt model checkpoint
Pretrained models
New models
PointPillarNet
PoseClassificationNet
Updated models
PeopleNet
PeopleSemSegNet
PeopleSegNet
LPDNet
Known Issues/Limitations
TAO DSSD/FasterRCNN/RetinaNet/YOLOv3/YOLOv4 can have intermittent illegal memory access errors with export or converter CLI commands. The root cause is unknown. If this occurs, rerun the command.
The TAO BYOM Semantic Segmentation workflow is only supported with UNet and Image Classification.
TAO Image Classification networks require driver 510 or greater for training.
TAO Toolkit as a Service doesn’t support user authentication and per-user workspace management.
TTS Finetuning is only supported for data originating from the NVIDIA Custom Voice Recorder.
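The driver requirement for Image Classification training can be checked before launching a job. The following is a minimal sketch that assumes "driver 510 or greater" refers to the major component of the NVIDIA driver version string and that nvidia-smi is available on the host; the helper names are illustrative and are not part of TAO Toolkit.

```python
import subprocess

# Minimum driver stated in the known issues; treating it as the major
# version component is an assumption of this sketch.
MIN_DRIVER_MAJOR = 510


def driver_major(version: str) -> int:
    """Return the major component of a version string such as '510.47.03'."""
    return int(version.strip().split(".")[0])


def meets_minimum(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version is at least the minimum."""
    return driver_major(version) >= minimum


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    try:
        version = query_driver_version()
    except (FileNotFoundError, subprocess.CalledProcessError, IndexError):
        print("Could not query the driver version with nvidia-smi.")
    else:
        status = "meets" if meets_minimum(version) else "is below"
        print(f"Driver {version} {status} the minimum of {MIN_DRIVER_MAJOR}.")
```

The version parsing is kept separate from the nvidia-smi call so that it can be exercised on machines without a GPU.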
Key Features
Features included in this release
TAO Resources
Jupyter notebook example for showing the end-to-end workflow for the following model
TAO Conversational AI
Support for finetuning a FastPitch and HiFiGAN from a pretrained model
Update FastPitch and HiFiGAN export and infer endpoint to interface with RIVA
Known Issues/Limitations
TAO FastPitch finetuning is only supported on text transcripts that are defined in the NVIDIA Custom Voice Recorder.
The data from the NVIDIA Custom Voice Recorder can only be used for finetuning a FastPitch or HiFiGAN model.
For finetuning FastPitch, you are required to resample the new speaker data to the sampling rate of the dataset used to train the pretrained model.
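Before finetuning FastPitch, it can help to confirm which recordings actually need resampling. The following is a minimal sketch using Python's standard wave module; the helper names and the 22050 Hz target in the demo are hypothetical, and the real target is whatever sampling rate the pretrained model's dataset used. The sketch only detects a mismatch and does not perform the resampling itself.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Read the sampling rate (in Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int) -> bool:
    """True when the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path: str, rate: int, frames: int = 100) -> None:
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * frames)


if __name__ == "__main__":
    target_rate = 22050  # hypothetical rate of the pretrained model's dataset
    demo_path = os.path.join(tempfile.mkdtemp(), "speaker.wav")
    write_silent_wav(demo_path, 44100)
    print(demo_path, "needs resampling:", needs_resampling(demo_path, target_rate))
```

Files flagged by this check would then be resampled with an audio tool of your choice before being added to the finetuning dataset.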
Key Features
Features included in this release:
TAO Resources:
Jupyter notebook examples showing the end-to-end workflow for the following models
ActionRecognitionNet
EfficientDet
Text-To-Speech using FastPitch and HiFiGAN
TAO CV:
Pretrained models for several public architectures and reference applications serving computer vision related object classification, detection and segmentation use cases.
Support for YOLOv4-tiny and EfficientDet object detection models.
Support for pruning EfficientDet models
New pretrained models released on NGC
Converter utility to generate device specific optimized TensorRT engines
Jetson JP4.6
x86 + dGPU - TensorRT 8.0.1.6 with CUDA 11.4
TAO Conversational AI:
Support for training FastPitch and HiFiGAN model from scratch
Adding new encoders for Natural Language Processing tasks
DistilBERT
BioMegatron-BERT
Known Issues/Limitations
TAO CV
Transfer Learning is not supported on pruned models across all applications.
When training with multiple GPUs, you might need to scale down the batch_size and/or scale up the learning rate to get the same accuracy seen in single GPU training.
When training DetectNet_v2 for object detection use-cases with more than 10 classes, you may need to either update the cost_weight parameter in the cost_function_config, or balance the number of samples per class in the dataset for better training.
When training a DetectNet_v2 network for datasets with fewer than 20,000 images, please use smaller batch sizes (1, 2 or 4) to get better accuracy.
The infer subtask of DetectNet_v2 doesn’t output confidence and generates 0. as the value. You may ignore these values and only consider the bbox and class labels as valid outputs.
ResNet101 pre-trained weights from NGC are not supported on YOLOv3, YOLOv4, YOLOv4-tiny, SSD, DSSD and RetinaNet.
When generating an INT8 engine with tao-converter, please use -s if there is a TensorRT error message saying weights are outside of the FP16 range.
Due to the complexity of larger EfficientDet models, the pruning process will take significantly longer to finish. For example, pruning the EfficientDet-D5 model may take at least 25 minutes on a V100 server.
When generating a TensorRT INT8 engine on A100 GPUs using the tao-converter for MaskRCNN, enable --strict_data_type.
Our EfficientDet codebase has source code taken from the automl GitHub repo.
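The multi-GPU note in this list (scale down batch_size and/or scale up the learning rate) can be illustrated with a small calculation. The following is a minimal sketch that assumes batch_size is applied per GPU, so the effective batch grows with the GPU count, and that uses the common linear scaling heuristic for the learning rate. Neither assumption is stated in these notes, the function names are illustrative, and the resulting values are starting points to tune rather than guaranteed settings.

```python
def per_gpu_batch_size(single_gpu_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size that keeps the effective batch near the single-GPU value.

    Assumes the configured batch size is applied on every GPU.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Integer division, but never drop below one sample per GPU.
    return max(1, single_gpu_batch // num_gpus)


def scaled_learning_rate(single_gpu_lr: float, num_gpus: int) -> float:
    """Linear scaling heuristic: grow the learning rate with the GPU count.

    Use this instead of shrinking the batch size when the per-GPU batch
    size is kept unchanged.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return single_gpu_lr * num_gpus


if __name__ == "__main__":
    # Example: a single-GPU recipe with batch size 16 and learning rate 1e-3.
    print("Option A, smaller batch per GPU:", per_gpu_batch_size(16, 4))
    print("Option B, larger learning rate:", scaled_learning_rate(1e-3, 4))
```

The two options are alternatives: shrinking the per-GPU batch preserves the original effective batch size, while scaling the learning rate compensates for a larger one.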
TAO Conversational AI
When running convAI models on a cloud VM, users should have root access to the VM
Text-To-Speech pipelines only support training from scratch for a single speaker
Text-To-Speech training pipeline requires the audio files to be in .wav format
TAO Toolkit 3.0-21.11 exported .riva files will not be supported in RIVA < 21.09
BioMegatron-BERT and Megatron-based NLP tasks don’t support resuming a previously completed model with more epochs than the previously completed experiment
When running the end-to-end sample of Text-to-Speech, you may have to expand abbreviations
Resolved Issues
TAO CV
YOLOv4, YOLOv3, UNet and LPRNet exported .etlt model files can be integrated directly into DeepStream 6.0.
TAO Conversational AI
ASR models support generating intermediate .tlt model files during training
Deprecated Features
The TAO Computer Vision Inference Pipeline is deprecated. Users can now use DeepStream to deploy the following out-of-the-box models via reference applications provided here:
Release Contents
Components included in this release:
TAO Launcher pip package
TAO - TF docker
TAO - Pytorch Docker
TAO - Language Model Docker
Jupyter notebook with sample workflows
Getting Started Guide containing usage and installation instructions
tao-converter for x86 + discrete GPU platforms
tao-converter for Jetson (ARM64) available here.
Pre-trained weights trained on Open Image dataset available on NGC
Unpruned and Pruned models for Purpose-built models - Pruned models can be deployed out-of-box with DeepStream and unpruned models can be used for re-training.
Trainable and out-of-box Deployable models for:
Key Features
Transfer Learning Toolkit has been renamed to TAO Toolkit
TAO Toolkit Launcher:
Python3 pip package as a unified Command Line Interface (CLI)
Support for docker hosted from different registries
TAO Resources:
Jupyter notebook examples showing the end-to-end workflow for the following models
N-Gram Language model
TAO CV:
Support for MaskRCNN Instance segmentation model
Support for pruning MaskRCNN models
Support for serializing a template DeepStream config and labels file
Support for training highly accurate purpose-built models:
BodyPose Estimation
Instructions for running TAO in the cloud with Azure
Converter utility to generate device specific optimized TensorRT engines
New backbones added to UNet training
Vanilla UNet Dynamic
Efficient UNet
TAO Conversational AI:
Added support for validating an exported model for compliance with RIVA
Training an N-Gram language model implemented in KenLM
Known Issues/Limitations
TAO CV
Transfer Learning is not supported on pruned models across all applications.
When training with multiple GPUs, you might need to scale down the batch_size and/or scale up the learning rate to get the same accuracy seen in single GPU training.
When training DetectNet_v2 for object detection use-cases with more than 10 classes, you may need to either update the cost_weight parameter in the cost_function_config, or balance the number of samples per class in the dataset for better training.
When training a DetectNet_v2 network for datasets with fewer than 20,000 images, please use smaller batch sizes (1, 2 or 4) to get better accuracy.
The infer subtask of DetectNet_v2 doesn’t output confidence and generates 0. as the value. You may ignore these values and only consider the bbox and class labels as valid outputs.
ResNet101 pre-trained weights from NGC are not supported on YOLOv3, YOLOv4, YOLOv4-tiny, SSD, DSSD and RetinaNet.
When generating an INT8 engine with tao-converter, please use -s if there is a TensorRT error message saying weights are outside of the FP16 range.
TAO Conversational AI
When running convAI models on a cloud VM, users should have root access to the VM.
TAO Conv AI models cannot generate intermediate .tlt model files.