TensorFlow Release 21.08

The NVIDIA container image of TensorFlow, release 21.08, is available on NGC.

Contents of the TensorFlow container

This container image includes the complete source of the NVIDIA version of TensorFlow in /opt/tensorflow. It is pre-built and installed as a system Python module.

To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates efficient training of convolutional neural networks (CNNs). The sample script may need to be modified to fit your application.

The container also includes the following:

Ubuntu 20.04

Note:

Container images 21.08-tf1-py3 and 21.08-tf2-py3 contain Python 3.8.
NVIDIA CUDA 11.4.1
cuBLAS 11.5.4
NVIDIA cuDNN 8.2.2.26
NVIDIA NCCL 2.10.3 (optimized for NVLink™)
Horovod 0.22.0
rdma-core 36.0
OpenMPI 4.1.1+
OpenUCX 1.11.0rc1
GDRCopy 2.2
NVIDIA HPC-X 2.9
Nsight Systems 2021.2.4.12
TensorRT 8.0.1.6
TensorBoard
- 21.08-tf1-py3 includes version 1.15.0
- 21.08-tf2-py3 includes version 2.6.0
OpenSeq2Seq at commit 8f040a49
- Included only in 21.08-tf1-py3
DALI 1.4
DLProf 1.4.0
- Included only in 21.08-tf1-py3
XLA-Lite
JupyterLab 2.3.1 including Jupyter-TensorBoard
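
The two image variants named above differ in a few components. As a rough illustration, the sketch below parses a tag such as 21.08-tf2-py3 and reports the variant-specific items from the list above. The helper names (parse_image_tag, variant_details) and the tag pattern are assumptions made for this example; they are not part of the container or of any NVIDIA tooling.

```python
import re
from typing import NamedTuple


class ImageTag(NamedTuple):
    release: str   # container release, e.g. "21.08"
    tf_major: int  # TensorFlow major version, 1 or 2
    py_major: int  # Python major version


# Assumed tag layout, inferred from the two tags named in this section.
_TAG_PATTERN = re.compile(r"^(\d{2}\.\d{2})-tf([12])-py(\d+)$")

# Variant-specific contents of release 21.08, copied from the list above.
TENSORBOARD_21_08 = {1: "1.15.0", 2: "2.6.0"}
TF1_ONLY_21_08 = ("OpenSeq2Seq", "DLProf")


def parse_image_tag(tag: str) -> ImageTag:
    """Split a tag like '21.08-tf2-py3' into its three parts."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return ImageTag(match.group(1), int(match.group(2)), int(match.group(3)))


def variant_details(tag: str) -> dict:
    """Return the components that differ between the 21.08 variants."""
    parsed = parse_image_tag(tag)
    if parsed.release != "21.08":
        raise ValueError("this table only covers release 21.08")
    extras = TF1_ONLY_21_08 if parsed.tf_major == 1 else ()
    return {
        "tensorboard": TENSORBOARD_21_08[parsed.tf_major],
        "tf1_only_components": extras,
    }


if __name__ == "__main__":
    for tag in ("21.08-tf1-py3", "21.08-tf2-py3"):
        print(tag, variant_details(tag))
```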

Driver Requirements

Release 21.08 is based on NVIDIA CUDA 11.4.1, which requires NVIDIA Driver release 470 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
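
The driver rule above can be expressed as a small check. This is a minimal sketch, not an official compatibility test: the function names are hypothetical, and it reads the rule conservatively by accepting only the R418, R440, R450, and R460 branches listed in the paragraph for Data Center GPUs. The driver version string itself would typically come from a tool such as nvidia-smi.

```python
from typing import Tuple

# Release 21.08 needs driver branch 470 or later in the general case.
DEFAULT_MIN_BRANCH = 470

# Minimum (branch, minor) versions accepted on Data Center GPUs,
# copied from the paragraph above.
DATA_CENTER_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
}


def parse_driver_version(version: str) -> Tuple[int, ...]:
    """Turn '450.51.06' into (450, 51, 6)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_is_supported(version: str, data_center_gpu: bool = False) -> bool:
    """Check a driver version against the 21.08 requirements."""
    parsed = parse_driver_version(version)
    branch = parsed[0]
    if branch >= DEFAULT_MIN_BRANCH:
        return True
    if not data_center_gpu:
        return False
    # Conservative assumption: only the branches named above qualify.
    minimum = DATA_CENTER_MINIMUMS.get(branch)
    return minimum is not None and parsed[:2] >= minimum


if __name__ == "__main__":
    print(driver_is_supported("470.57.02"))
    print(driver_is_supported("450.51.06", data_center_gpu=True))
    print(driver_is_supported("450.51.06", data_center_gpu=False))
```

The authoritative list of supported drivers remains the CUDA Application Compatibility topic.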

GPU Requirements

Release 21.08 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the NVIDIA Pascal, Volta, Turing, and Ampere Architecture GPU families. For a list of GPUs with these compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
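
As a small illustration of the threshold above, the sketch below compares a compute capability against the 6.0 minimum. The function names are hypothetical, and the capability is passed in as a "major.minor" string; how you obtain it for a given GPU is outside this sketch.

```python
from typing import Tuple

# Minimum CUDA compute capability supported by release 21.08.
MIN_COMPUTE_CAPABILITY = (6, 0)


def parse_compute_capability(text: str) -> Tuple[int, int]:
    """Turn a string like '7.5' into the tuple (7, 5)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def gpu_is_supported(compute_capability: str) -> bool:
    """True if the capability meets the 6.0 minimum."""
    return parse_compute_capability(compute_capability) >= MIN_COMPUTE_CAPABILITY


if __name__ == "__main__":
    # 5.2 falls below the minimum; 6.0, 7.5, and 8.0 satisfy it.
    for cc in ("5.2", "6.0", "7.5", "8.0"):
        print(cc, gpu_is_supported(cc))
```

Comparing (major, minor) tuples rather than floats avoids mis-ordering values if a minor version ever has two digits.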

Key Features and Enhancements

This TensorFlow release includes the following key features and enhancements.

TensorFlow container images version 21.08 are based on TensorFlow 1.15.5 and 2.5.0.
Experimental integration of the cuTENSOR library for einsum operations is included in the 21.08-tf2-py3 container. This should improve performance for many einsum operations. To enable, export TF_ENABLE_CUTENSOR_EINSUM=1.
Added XLA feature to de-select compilation candidates based on shape inference. To enable this feature, use the environment variable TF_XLA_DO_NOT_COMPILE_POSSIBLE_DYNAMIC_OPS.
Bug fixes for the cudaMallocAsync GPU memory allocator.
MKL is enabled for better performance in CPU-only workloads. To enable, set OMP_NUM_THREADS to a value >= 1.
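
The opt-in settings above are plain environment variables, so they can be exported in the shell or set from Python. The sketch below shows the Python route for TF_ENABLE_CUTENSOR_EINSUM and OMP_NUM_THREADS. The helper name configure_tf_env is hypothetical, and calling it before TensorFlow is imported is an assumption that the variables are read at startup. TF_XLA_DO_NOT_COMPILE_POSSIBLE_DYNAMIC_OPS is left out because the expected value is not stated here.

```python
import os
from typing import MutableMapping, Optional


def configure_tf_env(
    enable_cutensor_einsum: bool = True,
    omp_num_threads: Optional[int] = None,
    environ: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set the opt-in variables from the 21.08 release notes.

    `environ` defaults to os.environ; pass a dict to preview the
    settings without touching the real environment.
    """
    env = os.environ if environ is None else environ
    if enable_cutensor_einsum:
        # Experimental cuTENSOR einsum path (21.08-tf2-py3 only).
        env["TF_ENABLE_CUTENSOR_EINSUM"] = "1"
    if omp_num_threads is not None:
        if omp_num_threads < 1:
            raise ValueError("OMP_NUM_THREADS must be >= 1")
        # Enables MKL for CPU-only workloads.
        env["OMP_NUM_THREADS"] = str(omp_num_threads)
    return env


if __name__ == "__main__":
    preview = configure_tf_env(omp_num_threads=4, environ={})
    print(preview)
    # In a real script, call configure_tf_env() first and only then
    # run `import tensorflow as tf`.
```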

Announcements

The TensorCore example models are no longer provided in the core container (previously shipped in /workspace/nvidia-examples). Instead, they can be obtained from GitHub or the NVIDIA GPU Cloud (NGC). Some Python packages, included in previous containers to support these example models, have also been removed. Depending on their specific use cases, users may need to add some packages that were previously pre-installed.

NVIDIA TensorFlow Container Versions

The following table shows what versions of Ubuntu, CUDA, TensorFlow, and TensorRT are supported in each of the NVIDIA containers for TensorFlow. For older container versions, refer to the Frameworks Support Matrix.

Container Version	Ubuntu	CUDA Toolkit	TensorFlow	TensorRT
21.08	20.04	NVIDIA CUDA 11.4.1	2.5.0 1.15.5	TensorRT 8.0.1.6
21.07		NVIDIA CUDA 11.4.0	2.5.0 1.15.5	TensorRT 8.0.1.6
21.06		NVIDIA CUDA 11.3.1	2.5.0 1.15.5	TensorRT 7.2.3.4
21.05		NVIDIA CUDA 11.3.0	2.4.0 1.15.5
21.04		NVIDIA CUDA 11.3.0
21.03		NVIDIA CUDA 11.2.1		TensorRT 7.2.2.3
21.02		NVIDIA CUDA 11.2.0		TensorRT 7.2.2.3+cuda11.1.0.024
20.12		NVIDIA CUDA 11.1.1	2.3.1 1.15.4	TensorRT 7.2.2
20.11	18.04	NVIDIA CUDA 11.1.0		TensorRT 7.2.1
20.10		NVIDIA CUDA 11.1.0		TensorRT 7.2.1
20.09		NVIDIA CUDA 11.0.3	2.3.0 1.15.3	TensorRT 7.1.3
20.08		NVIDIA CUDA 11.0.3	2.2.0 1.15.3
20.07		NVIDIA CUDA 11.0.194	2.2.0 1.15.3
20.06		NVIDIA CUDA 11.0.167	2.2.0 1.15.2	TensorRT 7.1.2
20.03 20.02		NVIDIA CUDA 10.2.89	2.1.0 1.15.2	TensorRT 7.0.0
20.01			2.0.0 1.15.0	TensorRT 7.0.0
19.12 19.11			2.0.0 1.15.0	TensorRT 6.0.1
19.10		NVIDIA CUDA 10.1.243	1.14.0
19.09
19.08				TensorRT 5.1.5

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback. Specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of user ID, item ID pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
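
To make the ResNet-50 v1 versus v1.5 stride difference concrete, the sketch below traces the spatial size of a feature map through the three convolutions of a downsampling bottleneck block (1x1, 3x3, 1x1). It is an illustration only: the helper names are hypothetical, and the padding (0 for the 1x1 convolutions, 1 for the 3x3) is an assumption about a typical implementation, not taken from the model scripts.

```python
from typing import List

# Kernel sizes of the three convolutions in a bottleneck block.
BOTTLENECK_KERNELS = (1, 3, 1)

# Where the stride-2 downsampling sits in each variant.
BOTTLENECK_STRIDES = {
    "v1": (2, 1, 1),    # stride 2 in the first 1x1 convolution
    "v1.5": (1, 2, 1),  # stride 2 in the 3x3 convolution
}


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output size along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_sizes(size: int, variant: str) -> List[int]:
    """Spatial size after each of the three convolutions."""
    sizes = []
    for kernel, stride in zip(BOTTLENECK_KERNELS, BOTTLENECK_STRIDES[variant]):
        padding = kernel // 2  # assumed: 0 for 1x1, 1 for 3x3
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print("v1  ", bottleneck_sizes(56, "v1"))    # [28, 28, 28]
    print("v1.5", bottleneck_sizes(56, "v1.5"))  # [56, 28, 28]
```

Both variants end at the same output size. In v1.5 the 3x3 convolution still sees the full-resolution input, whereas in v1 the strided 1x1 convolution halves the resolution before the 3x3 convolution runs.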

Known Issues

Note:

If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.
The OpenSeq2Seq toolkit is deprecated and will be removed starting in the 21.09-tf1-py3 release. This only affects the TensorFlow 1.x release.
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.08 includes a feature that works around this issue, but TF1 21.08 does not include that feature and may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40. This will be fixed in the next version of TensorRT.
A known regression can reduce the training performance of VGG-16 by up to 12% at certain batch sizes.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.