TensorFlow Release 20.02
The NVIDIA container image of TensorFlow, release 20.02, is available on NGC.
Contents of the TensorFlow container
This container image contains the complete source of the version of NVIDIA TensorFlow in /opt/tensorflow. It is pre-built and installed as a system Python module.
To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates efficient training of convolutional neural networks (CNNs). The sample script may need to be modified to fit your application. The container also includes the following:
- Ubuntu 18.04
Note: Container image 20.02-tf2-py3 contains Python 3.6
- NVIDIA CUDA 10.2.89 including cuBLAS 10.2.2.89
- NVIDIA cuDNN 7.6.5
- NVIDIA NCCL 2.5.6 (optimized for NVLink™)
- Horovod 0.19.0
- OpenMPI 3.1.4
- OpenSeq2Seq at commit a81babd
- Included only in
- Included only in
- TensorRT 7.0.0
- DALI 0.18.0 Beta
- DLProf 20.02
- Included only in
- Included only in
- Nsight Compute 2019.5.0
- Nsight Systems 2020.1.1
- Tensor Core optimized examples: (Included only in
- Jupyter and JupyterLab:
- Jupyter Client 5.3.4
- Jupyter Core 4.6.1
- Jupyter Notebook
- JupyterLab Server
Release 20.02 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA Driver release 440.33.01. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
Release 20.02 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs that this compute capability corresponds to, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
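The driver and compute capability requirements above can be sketched as a small pre-flight check. This is a minimal illustration, not an NVIDIA tool: the function names are hypothetical, and the reading of the Tesla exceptions (whole 396, 410, and 418 branches, 384.111 or later, and the 440.30 prefix) is an assumption drawn from the wording of this section. The driver version string could be obtained, for example, from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
def parse_version(text):
    """Turn a dotted version string such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


MIN_DRIVER = parse_version("440.33.01")  # required by CUDA 10.2.89
MIN_COMPUTE_CAPABILITY = (6, 0)          # Pascal and newer


def driver_supported(driver, is_tesla=False):
    """Check a driver version against the requirements stated for 20.02."""
    version = parse_version(driver)
    if version >= MIN_DRIVER:
        return True
    if not is_tesla:
        return False
    # Tesla boards may use older branches through the compatibility package.
    major = version[0]
    if major in (396, 410, 418):
        return True
    if major == 384:
        return version >= (384, 111)
    return version[:2] == (440, 30)


def compute_capability_supported(capability):
    """Check a compute capability string such as '7.0' against the 6.0 minimum."""
    return parse_version(capability) >= MIN_COMPUTE_CAPABILITY


if __name__ == "__main__":
    print(driver_supported("440.33.01"))                # True
    print(driver_supported("418.87.00"))                # False
    print(driver_supported("418.87.00", is_tesla=True)) # True
    print(compute_capability_supported("5.2"))          # False
```

For the authoritative list of supported drivers, the CUDA Application Compatibility topic referenced above remains the source of truth.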
Key Features and Enhancements
This TensorFlow release includes the following key features and enhancements.
- TensorFlow container image version 20.02 is based on TensorFlow 1.15.2 and TensorFlow 2.1.0.
- Latest version of DLProf 20.02
- Latest version of DALI 0.18.0 Beta
20.02-tf2-py3 includes version 2.1.0
- Latest version of Nsight Systems 2020.1.1
- Latest version of Horovod 0.19.0
- Ubuntu 18.04 with January 2020 updates
- Improved AMP logging messages to include instructions for tweaking AMP lists.
- Added NVTX markers in the TF 2.1 eager execution path for improved profiling with NVTX.
- Python 2.7 is no longer supported in this TensorFlow container release.
- The TF_ENABLE_AUTO_MIXED_PRECISION environment variables are no longer supported in the tf2 container because, in many cases, it is not possible to automatically enable loss scaling in the TF 2.x API. Instead, tf.train.experimental.enable_mixed_precision_graph_rewrite() should be used to enable AMP.
- Deep learning framework containers 19.11 and later include experimental support for Singularity v3.0.
NVIDIA TensorFlow Container Versions
The following table shows what versions of Ubuntu, CUDA, TensorFlow, and TensorRT are supported in each of the NVIDIA containers for TensorFlow. For older container versions, refer to the Frameworks Support Matrix.
|Container Version||Ubuntu||CUDA Toolkit||TensorFlow||TensorRT|
|NVIDIA CUDA 10.2.89||TensorRT 7.0.0|
|19.10||NVIDIA CUDA 10.1.243||1.14.0|
Accelerating Inference In TensorFlow With TensorRT (TF-TRT)
For step-by-step instructions on how to use TF-TRT, see Accelerating Inference In TensorFlow With TensorRT User Guide.
- Known Issues
We have observed a regression in the performance of certain TF-TRT benchmarks in TensorFlow 1.15 including image classification models with precision INT8. We are still investigating this. Since 19.11 comes with a new version of TensorFlow (1.15), which includes a lot of changes in the TensorFlow backend, it’s very possible that the regression is caused by a change in the TensorFlow backend.
CUDA 10.2 and NCCL 2.5.x libraries require slightly more device memory than previous releases. As a result, some models that ran previously may exhaust device memory.
The accuracy of Faster RCNN with the ResNet-50 backbone using TensorRT 6.0 INT8 calibration is lower than expected. This will be fixed in future releases of TensorRT.
The following warning is issued when the method build() from the API is not called. This warning can be ignored.
OP_REQUIRES failed at trt_engine_resource_ops.cc:183 : Not found: Container TF-TRT does not exist. (Could not find resource: TF-TRT/TRTEngineOp_...
The following warning is issued because internally TensorFlow calls the TensorRT optimizer for certain objects unnecessarily. This warning can be ignored.
TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
We have seen failures when using INT8 calibration (post-training) within the same process that does FP32/FP16 conversion. We recommend using separate processes for different precisions until this issue gets resolved.
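The recommendation to use separate processes for different precisions can be sketched with the Python standard library. This is only an illustration of the process isolation: `WORKER_CODE` and `convert_in_own_process` are hypothetical names, and the worker is a placeholder that prints a message where a real script would run the TF-TRT conversion for the given precision.

```python
import subprocess
import sys

# Placeholder worker: a real script would build and save the TF-TRT engine
# for the precision passed as its first argument.
WORKER_CODE = "import sys; print('converted with precision ' + sys.argv[1])"


def convert_in_own_process(precision):
    """Run one conversion in a fresh Python interpreter and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, precision],
        capture_output=True,
        text=True,
        check=True,  # raise if the worker process fails
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    # Each precision gets its own interpreter, so no state is shared.
    for precision in ("FP32", "FP16", "INT8"):
        print(convert_in_own_process(precision))
```

Because each call starts a new interpreter, nothing from the FP32 or FP16 conversion is carried into the process that performs INT8 calibration.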
We have seen failures when calling the TensorRT optimizer on models that are already optimized by TensorRT. This issue will be fixed in a future release.
In case you import nets from models/slim, you might see the following error:
AttributeError: module 'tensorflow_core.contrib' has no attribute 'tensorrt'
Changing the order of imports can fix the issue. Therefore, import TensorRT before importing nets as follows:
import tensorflow.contrib.tensorrt as trt
import nets.nets_factory
Automatic Mixed Precision (AMP)
Automatic mixed precision converts certain float32 operations to operate in float16, which can run much faster on Tensor Cores. Automatic mixed precision is built on two components:
- a loss scaling optimizer
- a graph rewriter
For models already using an optimizer from tf.keras.optimizers for both gradient computation and apply_gradients() operations (for example, by calling model.fit()), automatic mixed precision can be enabled by wrapping the optimizer with tf.train.experimental.enable_mixed_precision_graph_rewrite().
For more information on this function, see the TensorFlow documentation here.
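Wrapping an optimizer for AMP might look like the following minimal sketch. It assumes the TensorFlow 2.1 build shipped in the 20.02-tf2-py3 container, where tf.train.experimental.enable_mixed_precision_graph_rewrite() is available; the helper name `build_amp_optimizer`, the choice of SGD, and the learning rate are illustrative assumptions.

```python
def build_amp_optimizer(learning_rate=0.01):
    """Return a Keras SGD optimizer wrapped for automatic mixed precision."""
    # Imported lazily so this sketch can be loaded on machines without TensorFlow.
    import tensorflow as tf

    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    # The wrapper enables the graph rewrite and adds loss scaling to the optimizer.
    return tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
```

The returned optimizer can then be passed to model.compile() in place of the original one, so that a later model.fit() call trains with AMP.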
For backward compatibility with previous container releases, AMP can also be enabled for tf.train optimizers by defining the following environment variable:
export TF_ENABLE_AUTO_MIXED_PRECISION=1
For more information about how to access and enable Automatic mixed precision for TensorFlow, see Automatic Mixed Precision Training In TensorFlow from the TensorFlow User Guide, along with Training With Mixed Precision.
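To see why AMP pairs the graph rewriter with a loss scaling optimizer, the following standard-library sketch shows a small gradient underflowing in float16 and surviving once it is scaled. It is a conceptual illustration only, not TensorFlow's implementation; the scale factor of 1024 and the gradient value are assumptions chosen for the demonstration.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


LOSS_SCALE = 1024.0   # illustrative scale factor (an assumption)
true_gradient = 1e-8  # below the smallest value float16 can represent

# Without scaling, the gradient is flushed to zero in float16.
unscaled = to_float16(true_gradient)

# Scaling the loss scales the gradient, which keeps it representable.
scaled = to_float16(true_gradient * LOSS_SCALE)

# The gradient is unscaled in higher precision before the weight update.
recovered = scaled / LOSS_SCALE

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```

The recovered value differs slightly from the true gradient because of float16 rounding, but it is no longer lost entirely.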
Tensor Core Examples
The Tensor Core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision on Volta Tensor Cores, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.
- U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback. Specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of user ID, item ID pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
- ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
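The stride placement that separates ResNet-50 v1 from v1.5 can be illustrated with a little arithmetic. The sketch below works along one spatial dimension and reports the output size and the fraction of input positions that each strided convolution actually reads. The input width of 56 and the padding of 1 for the 3x3 convolution are assumptions made for the example, and the helper names are hypothetical.

```python
def conv_output_size(n, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def input_coverage(n, kernel, stride, padding):
    """Fraction of the n input positions read by at least one window."""
    touched = set()
    for out in range(conv_output_size(n, kernel, stride, padding)):
        start = out * stride - padding
        for pos in range(start, start + kernel):
            if 0 <= pos < n:  # skip positions that fall in the padding
                touched.add(pos)
    return len(touched) / n


N = 56  # assumed feature-map width
v1 = dict(kernel=1, stride=2, padding=0)   # v1: stride 2 in the first 1x1 conv
v15 = dict(kernel=3, stride=2, padding=1)  # v1.5: stride 2 in the 3x3 conv

print(conv_output_size(N, **v1), input_coverage(N, **v1))    # 28 0.5
print(conv_output_size(N, **v15), input_coverage(N, **v15))  # 28 1.0
```

Both variants halve the feature map, but the strided 1x1 convolution skips every other position in each dimension, while the strided 3x3 convolution still reads every input position.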
There are known issues since the 19.11 release for NCF inference with XLA and VGG16 training without XLA; these benchmarks have performance that is lower than expected.
There are known issues regarding TF-TRT INT8 accuracy. See the Accelerating Inference In TensorFlow With TensorRT (TF-TRT) section above for more information.
For BERT Large training with the 19.08 release on Tesla V100 boards with 16 GB memory, performance with batch size 3 per GPU is lower than expected; batch size 2 per GPU may be a better choice for this model on these GPUs with the 19.08 release. 32 GB GPUs are not affected.
TensorBoard has a bug in its IPv6 support which can result in the following error:
Tensorboard could not bind to unsupported address family ::.
To work around this error, pass the --host <IP> flag when starting TensorBoard.
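As a minimal sketch of the workaround, the helper below assembles a TensorBoard command line with an explicit IPv4 host. The function name, the log directory `/workspace/logs`, and the default host `0.0.0.0` (all IPv4 interfaces) are assumptions for illustration; substitute the address that suits your setup.

```python
import shlex


def tensorboard_command(logdir, host="0.0.0.0"):
    """Build the argument list for starting TensorBoard on an explicit host."""
    return ["tensorboard", "--logdir", logdir, "--host", host]


if __name__ == "__main__":
    command = tensorboard_command("/workspace/logs")
    # Print a shell-ready version of the command.
    print(" ".join(shlex.quote(arg) for arg in command))
    # tensorboard --logdir /workspace/logs --host 0.0.0.0
```

The returned list can also be passed directly to subprocess.run() to launch TensorBoard from a script.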
A known issue in TensorFlow results in the error Cannot take the length of Shape with unknown rank when training variable sized images with the Keras model.fit API. Details are provided here and a fix will be available in a future release.
Support for cuDNN float32 Tensor Op Math mode, first introduced in the 18.09 release, is now deprecated in favor of Automatic Mixed Precision. It is scheduled to be removed after the 19.11 release.
Since the 19.10 release, when your NVIDIA driver release is older than 418.xx, the Nsight Systems profiling tool (for example, nsys) might cause a CUDA runtime API error. A fix will be included in a future release.