TensorFlow Release 19.06

The NVIDIA container image of TensorFlow, release 19.06, is available on NGC.

Contents of the TensorFlow container

This container image contains the complete source of the version of NVIDIA TensorFlow in /opt/tensorflow. It is pre-built and installed as a system Python module.

To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates efficient training of convolutional neural networks (CNNs). The sample script may need to be modified to fit your application.

Driver Requirements

Release 19.06 is based on NVIDIA CUDA 10.1.168, which requires NVIDIA Driver release 418.xx. However, if you are running on Tesla (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you may use NVIDIA driver release 384.111+ or 410. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
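The driver rule above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `driver_supported` is hypothetical, "418.xx" is read as "418 or newer", "384.111+" is read as "the 384 branch at minor version 111 or later", and "410" is read as "any 410 driver". The CUDA compatibility documentation remains the authoritative list.

```python
def driver_supported(version, tesla=False):
    """Return True if a driver version string such as "418.67" satisfies
    the requirement as interpreted here (an assumption, not an official check)."""
    parts = [int(p) for p in version.split(".")]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0

    # Assumption: any driver from the 418 branch onward works with CUDA 10.1.168.
    if major >= 418:
        return True

    # Older branches are accepted only on Tesla GPUs, via the compatibility package.
    if tesla:
        return (major == 384 and minor >= 111) or major == 410

    return False


print(driver_supported("418.67"))               # non-Tesla GPU, 418 branch
print(driver_supported("410.104", tesla=True))  # Tesla GPU, 410 branch
```

On a real system, the version string could come from a tool such as nvidia-smi; that lookup is left out of the sketch.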

GPU Requirements

Release 19.06 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs with these compute capabilities, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
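As a rough illustration of the compute capability rule, the following sketch compares a (major, minor) pair against 6.0 and names the GPU family. The function names are hypothetical, and the family mapping (6.x Pascal, 7.0 and 7.2 Volta, 7.5 Turing) is an assumption added here rather than something stated in these notes.

```python
MIN_COMPUTE_CAPABILITY = (6, 0)  # from the requirement above


def is_supported(major, minor):
    """True if the compute capability is 6.0 or higher."""
    return (major, minor) >= MIN_COMPUTE_CAPABILITY


def gpu_family(major, minor):
    """Map a compute capability to a family name, or None if unknown here."""
    if major == 6:
        return "Pascal"
    if (major, minor) in ((7, 0), (7, 2)):
        return "Volta"
    if (major, minor) == (7, 5):
        return "Turing"
    return None


for cc in [(5, 2), (6, 1), (7, 0), (7, 5)]:
    print(cc, is_supported(*cc), gpu_family(*cc))
```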

Key Features and Enhancements

This TensorFlow release includes the following key features and enhancements.
  • TensorFlow container image version 19.06 is based on TensorFlow 1.13.1.
  • Latest version of NVIDIA CUDA 10.1.168 including cuBLAS 10.2.0.168
  • Latest version of NVIDIA NCCL 2.4.7
  • Latest version of DALI 0.10.0 Beta
  • Latest version of JupyterLab 0.35.6
  • Latest version of Horovod 0.16.2
  • Latest version of Nsight Compute 10.1.168
  • Latest OpenSeq2Seq at commit 27346d1
  • Added DLProf 19.06 software. Deep Learning Profiler (DLProf) is a tool for profiling deep learning models to help data scientists understand and improve performance of their models visually via TensorBoard or by analyzing text reports.
  • Determinism - Setting the environment variable TF_CUDNN_DETERMINISM=1 forces the selection of deterministic cuDNN convolution and max-pooling algorithms. When this is enabled, the algorithm selection procedure itself is also deterministic.

    Alternatively, setting TF_DETERMINISTIC_OPS=1 has the same effect and additionally makes any bias addition that is based on tf.nn.bias_add() (for example, in Keras layers) operate deterministically on GPU. If you set TF_DETERMINISTIC_OPS=1 then there is no need to also set TF_CUDNN_DETERMINISM=1.

    Selecting these deterministic options may reduce performance.

  • Ubuntu 16.04 with May 2019 updates (see Announcements)
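The determinism settings described in the list above can also be applied from Python rather than from the shell. The sketch below is a minimal example, not an official recipe: the helper name `enable_determinism` is hypothetical, and setting the variables before TensorFlow is imported is a conservative assumption about when they are read.

```python
import os


def enable_determinism(env=None, include_bias_add=True):
    """Set one of the determinism variables in `env` (default: os.environ).

    include_bias_add=True  -> TF_DETERMINISTIC_OPS=1 (also covers tf.nn.bias_add)
    include_bias_add=False -> TF_CUDNN_DETERMINISM=1 (cuDNN convolution and max-pooling only)
    """
    if env is None:
        env = os.environ
    if include_bias_add:
        # Per the notes above, this makes TF_CUDNN_DETERMINISM=1 unnecessary.
        env["TF_DETERMINISTIC_OPS"] = "1"
    else:
        env["TF_CUDNN_DETERMINISM"] = "1"
    return env


# Assumption: call this before `import tensorflow` in the training script.
enable_determinism()
```

Because these options may reduce performance, it can be useful to enable them only for debugging or reproducibility runs.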

Accelerating Inference In TensorFlow With TensorRT (TF-TRT)

For step-by-step instructions on how to use TF-TRT, see the Accelerating Inference In TensorFlow With TensorRT User Guide.

Key Features And Enhancements

  • Integrated TensorRT 5.1.5 into TensorFlow. See the TensorRT 5.1.5 Release Notes for a full list of new features.

  • Improved examples at GitHub: TF-TRT, including README files, build scripts, benchmark mode, and ResNet models from the TensorFlow official model zoo.

Automatic Mixed Precision (AMP)

Automatic mixed precision converts certain float32 operations to operate in float16, which can run much faster on Tensor Cores. Automatic mixed precision is built on two components:
  • a loss scaling optimizer
  • a graph rewriter

For models already using a tf.train.Optimizer for both compute_gradients() and apply_gradients() operations, automatic mixed precision can be enabled by defining the following environment variable before calling the usual float32 training script:

    export TF_ENABLE_AUTO_MIXED_PRECISION=1

Models implementing their own optimizers can use the graph rewriter on its own (while implementing loss scaling manually) with the following environment variable:

    export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1
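
The choice between the two variables can be sketched as a helper that builds the environment for a training subprocess. The helper name `amp_environment` and the script name `train.py` are hypothetical, introduced only for illustration.

```python
import os


def amp_environment(uses_tf_optimizer, base_env=None):
    """Return a copy of the environment with the matching AMP variable set."""
    env = dict(os.environ if base_env is None else base_env)
    if uses_tf_optimizer:
        # Loss scaling optimizer plus graph rewriter.
        env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    else:
        # Graph rewriter only; loss scaling is left to the model's own optimizer.
        env["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
    return env


env = amp_environment(uses_tf_optimizer=True)
print(env["TF_ENABLE_AUTO_MIXED_PRECISION"])
```

The result could then be passed to a launcher, for example subprocess.run(["python", "train.py"], env=env), so that the variable is set only for that training run.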

For more information about how to access and enable automatic mixed precision for TensorFlow, see Automatic Mixed Precision Training In TensorFlow from the TensorFlow User Guide, along with Training With Mixed Precision.
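
To illustrate why loss scaling accompanies float16 computation, the toy sketch below rounds a small gradient value to half precision with and without scaling. It uses only the Python standard library and is not the AMP implementation; the function names and the scale factor 65536 are assumptions chosen for the example. Scaling the loss by a constant scales every gradient by the same constant (by the chain rule), so the gradients can be divided by that constant before the weight update.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scaled_gradient(grad, loss_scale):
    """Simulate a gradient that is scaled, stored in float16, then unscaled
    in higher precision before the weight update."""
    return to_float16(grad * loss_scale) / loss_scale


tiny_grad = 1e-8  # below the smallest positive float16 value

print(to_float16(tiny_grad))                # 0.0: the gradient underflows
print(scaled_gradient(tiny_grad, 65536.0))  # close to 1e-8: the gradient survives
```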

Tensor Core Examples

These examples focus on achieving the best performance and convergence from NVIDIA Volta Tensor Cores by using the latest deep learning example networks for training.

Each example model trains with mixed precision on Volta Tensor Cores, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.
  • U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration.

  • SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository.

  • Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it.

  • BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.

  • U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007.

  • GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep.

  • ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training.
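
The stride difference between ResNet-50 v1 and v1.5 described above can be made concrete by tracing the spatial size through one downsampling bottleneck block. This is a small arithmetic sketch, not code from the example repository; the function names, the 56-pixel input size, and the padding of kernel // 2 are assumptions chosen for illustration.

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a convolution with padding = kernel // 2."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1


def bottleneck_sizes(version, size):
    """Spatial size after each of the 1x1, 3x3, 1x1 convolutions."""
    # v1 downsamples in the first 1x1 conv; v1.5 downsamples in the 3x3 conv.
    strides = {"v1": (2, 1, 1), "v1.5": (1, 2, 1)}[version]
    kernels = (1, 3, 1)
    sizes = []
    for kernel, stride in zip(kernels, strides):
        size = conv_out(size, kernel, stride)
        sizes.append(size)
    return sizes


print(bottleneck_sizes("v1", 56))    # [28, 28, 28]
print(bottleneck_sizes("v1.5", 56))  # [56, 28, 28]
```

Both variants end at the same output size; in v1.5 the 3x3 convolution operates on the full-resolution input instead of an input already subsampled by the strided 1x1 convolution.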

Announcements

In the next release, we will no longer support Ubuntu 16.04. Release 19.07 will instead support Ubuntu 18.04.

Known Issues

  • There is a known performance regression with TensorFlow 1.13.1 for some networks when run with small batch sizes. As a workaround, increase the batch size.
  • The AMP preview implementation is not compatible with Distributed Strategies. We recommend using Horovod for parallel training with AMP.
  • AMP is not compatible with models that use ResourceVariables for the global_step passed to tf.train.Optimizer.apply_gradients(). This will be fixed in the 19.07 NGC release.
  • A known issue in TensorFlow results in the error "Cannot take the length of Shape with unknown rank" when training on variable-sized images with the Keras model.fit API. Details are provided here and a fix will be available in a future release.
  • Support for the cuDNN float32 Tensor Op Math mode, first introduced in the 18.09 release, is now deprecated in favor of Automatic Mixed Precision. It is scheduled to be removed after the 19.11 release.
  • DLProf and Nsight Systems in the container will not work with GPU drivers newer than r418.
  • There is a known issue when running the 19.06 TensorFlow container on a DGX-2 (or other systems having more than 8 GPUs) with RHEL 7.x (as opposed to Ubuntu) as the operating system. In some circumstances, the following message is displayed:
    E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS