TensorFlow Wheel Platform

This TensorFlow Wheel release is intended for use on NVIDIA Ampere Architecture GPUs, NVIDIA Turing Architecture GPUs, NVIDIA Volta Architecture GPUs, and NVIDIA Pascal Architecture GPUs.

TensorFlow Wheel Release 21.01

The NVIDIA container image release for TensorFlow Wheel 21.01 has been canceled. The next release will be 21.02, which is expected at the end of February.

TensorFlow Wheel Release 20.12

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 1. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Package	Version
NVIDIA CUDA cuBLAS	nvidia-cublas	11.3.0.106
NVIDIA CUDA CUPTI	nvidia-cuda-cupti	11.1.105
NVIDIA CUDA NVCC	nvidia-cuda-nvcc	11.1.105
NVIDIA CUDA NVRTC	nvidia-cuda-nvrtc	11.*
NVIDIA CUDA Runtime	nvidia-cuda-runtime	11.1.74
NVIDIA CUDA cuDNN	nvidia-cudnn	8.0.5.43
NVIDIA CUDA cuFFT	nvidia-cufft	10.3.0.105
NVIDIA CUDA cuRAND	nvidia-curand	10.2.2.105
NVIDIA CUDA cuSOLVER	nvidia-cusolver	11.0.1.105
NVIDIA CUDA cuSPARSE	nvidia-cusparse	11.3.0.10
NVIDIA DALI TensorFlow Plugin for CUDA 11.0	nvidia-dali-nvtf-plugin	0.28.0+nv20.12
NVIDIA DALI for CUDA 11.0	nvidia-dali-cuda110	0.28.0
NVIDIA DLprof binary installation	nvidia-dlprof	0.18.0
NVIDIA Nsight Systems, a system-wide performance analysis tool	nvidia-nsys-cli	2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch	nvidia-horovod	0.20.2+nv20.12
NVIDIA CUDA NCCL	nvidia-nccl	2.8.3
DLprof TensorBoard plugin	nvidia-tensorboard-plugin-dlprof	0.10
TensorBoard lets you watch Tensors Flow	nvidia-tensorboard	1.15.0+nv20.12
NVIDIA TensorFlow	nvidia-tensorflow	1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library	nvidia-tensorrt	7.2.2.1
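
The versions in Table 1 can be compared against what is actually installed. The following is a minimal sketch, not an NVIDIA-provided tool: the helper names (`matches`, `installed_version`, `check_components`) are hypothetical, only a subset of the table is listed, and it assumes that the version string reported by the package metadata has the same form as the table entry. A trailing `*` in the table is treated as "any suffix".

```python
from importlib import metadata

# Subset of Table 1 (package name -> version listed for release 20.12).
EXPECTED_20_12 = {
    "nvidia-cublas": "11.3.0.106",
    "nvidia-cuda-nvrtc": "11.*",
    "nvidia-cudnn": "8.0.5.43",
    "nvidia-nccl": "2.8.3",
    "nvidia-tensorrt": "7.2.2.1",
}


def matches(expected, installed):
    """True if `installed` equals `expected`; a trailing '*' matches any suffix."""
    if installed is None:
        return False
    if expected.endswith("*"):
        return installed.startswith(expected[:-1])
    return installed == expected


def installed_version(package):
    """Version reported by the package metadata, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_components(expected, lookup=installed_version):
    """Return {package: (expected, installed)} for every mismatch."""
    mismatches = {}
    for package, want in expected.items():
        have = lookup(package)
        if not matches(want, have):
            mismatches[package] = (want, have)
    return mismatches


if __name__ == "__main__":
    for package, (want, have) in check_components(EXPECTED_20_12).items():
        print(f"{package}: expected {want}, found {have or 'not installed'}")
```

The `lookup` parameter exists so the comparison can be exercised without the packages being installed.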

Driver Requirements

Release 20.12 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx, 440.xx, or 450.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
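
The driver rule above can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, only the major driver branch is compared, and the caller decides whether the board counts as Tesla. The authoritative list of supported drivers remains the CUDA Application Compatibility topic. The driver version string can be obtained, for example, with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
def driver_branch(version):
    """Return the major branch of a driver version string such as '455.23.05'."""
    return int(version.split(".")[0])


def driver_supported(version, tesla=False, minimum=455,
                     tesla_branches=(418, 440, 450)):
    """Apply the 20.12 rule: 455 or later, or a listed older branch on Tesla."""
    branch = driver_branch(version)
    if branch >= minimum:
        return True
    # Older branches are only acceptable on Tesla boards.
    return tesla and branch in tesla_branches


if __name__ == "__main__":
    print(driver_supported("455.23.05"))              # non-Tesla, new driver
    print(driver_supported("450.80.02", tesla=True))  # Tesla, older branch
```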

Software Requirements

The 20.12 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 20.04 (64-bit)
Python 3.8
Pip 19.09 or later
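
A quick pre-installation check of the Python and pip requirements might look like the following. This is an illustrative sketch only: the helper names are hypothetical, "19.09" is compared numerically component by component (so it is read as 19.9), and the operating system requirement is not checked.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '19.09' or '20.3.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_requirements(python_version, pip_version,
                       required_python=(3, 8), minimum_pip="19.09"):
    """True if Python is the required minor version and pip is new enough."""
    python_ok = tuple(python_version[:2]) == required_python
    pip_ok = parse_version(pip_version) >= parse_version(minimum_pip)
    return python_ok and pip_ok


if __name__ == "__main__":
    ok = meets_requirements(sys.version_info, metadata.version("pip"))
    print("requirements met" if ok else "requirements not met")
```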

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.12 is based on 1.15.4.
CUDA 11.1.1 including cuBLAS 11.3.0
NVIDIA cuDNN 8.0.5
NCCL 2.8.3 (optimized for NVLink)
TensorRT 7.2.2
Nsight Systems 2020.4.1.117
OpenMPI 4.0.5
DLProf 0.18.0
Ubuntu 20.04

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
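
The stride difference between ResNet-50 v1 and v1.5 described in the list above can be illustrated with plain shape arithmetic. The sketch below is not taken from the model scripts: the function names are hypothetical, and it assumes each convolution pads by `kernel // 2`. It only tracks the spatial size after the 1x1, 3x3, and 1x1 convolutions of one downsampling bottleneck block.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution that pads each side by kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_sizes(size, variant):
    """Spatial size after the 1x1, 3x3, and 1x1 convolutions of the block."""
    # v1 downsamples in the first 1x1 convolution, v1.5 in the 3x3 convolution.
    strides = {"v1": (2, 1, 1), "v1.5": (1, 2, 1)}[variant]
    sizes = []
    for kernel, stride in zip((1, 3, 1), strides):
        size = conv_out(size, kernel, stride)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print("v1:  ", bottleneck_sizes(56, "v1"))
    print("v1.5:", bottleneck_sizes(56, "v1.5"))
```

Both variants end at the same output size; in v1.5 the 3x3 convolution still receives the full-resolution feature map, whereas in v1 it receives an already downsampled one.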

Known Issues

In certain cases, running on Pascal GPUs may result in out-of-memory errors, which may present as apparent job hangs. This can be worked around by exporting the following environment variable:
```
export TF_DEVICE_MIN_SYS_MEMORY_IN_MB=550
```
A regression in cuDNN’s fused Convolution+Bias+Activation implementation can cause performance regressions of up to 24% in some models such as UNet Medical. This will be fixed in a future cuDNN release.
Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.
A few models see performance regressions compared to the 20.08 release. Training WideAndDeep sees regressions of up to 30% on A100. In FP32, TF1 U-Net Industrial and BERT fine-tuning training regress by 10-20%. The TF2 U-Net Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.
There are several known performance regressions compared to 20.07. U-Net Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. GoogLeNet can be up to 20% slower on V100. ResNet-50 inference can be up to 30% slower on A100 and Turing GPUs.
An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50 and ResNeXt) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:
```
export TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
There is a known out-of-memory (OOM) issue when training UNET3D models with batch size = 1 in the TensorFlow (TF1) container.
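
The Pascal out-of-memory workaround listed above can also be applied from inside a Python script instead of the shell. The sketch below rests on two assumptions: the helper name is hypothetical, and the variable is assumed to need setting before TensorFlow is imported, which is why the import appears only as a comment after the call.

```python
import os


def apply_pascal_oom_workaround(env=os.environ, reserve_mb=550):
    """Set TF_DEVICE_MIN_SYS_MEMORY_IN_MB unless a value is already present."""
    # setdefault keeps any value the user exported in the shell.
    env.setdefault("TF_DEVICE_MIN_SYS_MEMORY_IN_MB", str(reserve_mb))
    return env["TF_DEVICE_MIN_SYS_MEMORY_IN_MB"]


# Typical use at the top of a training script:
#   apply_pascal_oom_workaround()
#   import tensorflow as tf
```

Using `setdefault` means a value exported in the shell takes precedence over the value hard-coded in the script.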

TensorFlow Wheel Release 20.11

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 2. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Package	Version
NVIDIA CUDA cuBLAS	nvidia-cublas	11.2.1.74
NVIDIA CUDA CUPTI	nvidia-cuda-cupti	11.1.69
NVIDIA CUDA NVCC	nvidia-cuda-nvcc	11.1.74
NVIDIA CUDA NVRTC	nvidia-cuda-nvrtc	11.*
NVIDIA CUDA Runtime	nvidia-cuda-runtime	11.1.74
NVIDIA CUDA cuDNN	nvidia-cudnn	8.0.4.30
NVIDIA CUDA cuFFT	nvidia-cufft	10.3.0.74
NVIDIA CUDA cuRAND	nvidia-curand	10.2.2.74
NVIDIA CUDA cuSOLVER	nvidia-cusolver	11.0.0.74
NVIDIA CUDA cuSPARSE	nvidia-cusparse	11.2.0.275
NVIDIA DALI TensorFlow Plugin for CUDA 11.0	nvidia-dali-nvtf-plugin	0.27.0+nv20.11
NVIDIA DALI for CUDA 11.0	nvidia-dali-cuda110	0.27.0
NVIDIA DLprof binary installation	nvidia-dlprof	0.17.0
NVIDIA Nsight Systems, a system-wide performance analysis tool	nvidia-nsys-cli	2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch	nvidia-horovod	0.20.2+nv20.11
NVIDIA CUDA NCCL	nvidia-nccl	2.8.2
DLprof TensorBoard plugin	nvidia-tensorboard-plugin-dlprof	0.9
TensorBoard lets you watch Tensors Flow	nvidia-tensorboard	1.15.0+nv20.11
NVIDIA TensorFlow	nvidia-tensorflow	1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library	nvidia-tensorrt	7.2.1.6

Driver Requirements

Release 20.11 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx, 440.xx, or 450.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.11 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.11 is based on 1.15.4 and 2.3.1.
CUDA 11.1.0
cuDNN 8.0.4
TensorRT 7.2.1
DALI 0.27
DLProf 0.17.0

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
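
The NCF entry in the list above describes training data as a sequence of user ID and item ID pairs. As a rough illustration of that implicit-feedback format (not the data pipeline of the NCF model script; the function names and sample IDs are hypothetical), the pairs can be grouped per user and turned into 0/1 labels:

```python
from collections import defaultdict


def build_interactions(pairs):
    """Group (user_id, item_id) pairs into {user_id: set of item_ids}."""
    seen = defaultdict(set)
    for user_id, item_id in pairs:
        seen[user_id].add(item_id)
    return dict(seen)


def label(interactions, user_id, item_id):
    """Implicit-feedback label: 1 if the interaction was observed, else 0."""
    return 1 if item_id in interactions.get(user_id, set()) else 0


if __name__ == "__main__":
    # Made-up sample: user 1 interacted with items 10 and 11, user 2 with item 10.
    pairs = [(1, 10), (1, 11), (2, 10), (1, 10)]
    interactions = build_interactions(pairs)
    print(interactions)
    print(label(interactions, 2, 11))
```

Repeated pairs collapse into a single observed interaction, which matches the idea that implicit feedback records only whether a user interacted with an item.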

Known Issues

In certain cases, running on Pascal GPUs may result in out-of-memory errors, which may present as apparent job hangs. This can be worked around by exporting the following environment variable:
```
export TF_DEVICE_MIN_SYS_MEMORY_IN_MB=550
```
Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.
A few models see performance regressions compared to the 20.08 release. Training WideAndDeep sees regressions of up to 30% on A100. In FP32, TF1 U-Net Industrial and BERT fine-tuning training regress by 10-20%. The TF2 U-Net Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.
There are several known performance regressions compared to 20.07. U-Net Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. GoogLeNet can be up to 20% slower on V100. ResNet-50 inference can be up to 30% slower on A100 and Turing GPUs.
An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50 and ResNeXt) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:
```
export TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
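
When `TF_XLA_FLAGS` already carries other flags, setting it to a single value as in the XLA workaround above would discard them. The following sketch shows one way to merge the lazy-compilation setting into an existing value. The helper name is hypothetical, and it assumes flags are separated by whitespace; `--tf_xla_auto_jit=2` is used only as an example of a pre-existing flag.

```python
import os

LAZY_OFF = "--tf_xla_enable_lazy_compilation=false"


def with_lazy_compilation_disabled(current_flags):
    """Return a TF_XLA_FLAGS value that keeps other flags and disables lazy compilation."""
    # Drop any earlier setting of the same flag so the value is not ambiguous.
    kept = [flag for flag in current_flags.split()
            if not flag.startswith("--tf_xla_enable_lazy_compilation")]
    return " ".join(kept + [LAZY_OFF])


if __name__ == "__main__":
    merged = with_lazy_compilation_disabled(os.environ.get("TF_XLA_FLAGS", ""))
    os.environ["TF_XLA_FLAGS"] = merged
    print(merged)
```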

TensorFlow Wheel Release 20.10

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 3. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Package	Version
NVIDIA CUDA cuBLAS	nvidia-cublas	11.2.1.74
NVIDIA CUDA CUPTI	nvidia-cuda-cupti	11.1.69
NVIDIA CUDA NVCC	nvidia-cuda-nvcc	11.1.74
NVIDIA CUDA NVRTC	nvidia-cuda-nvrtc	11.*
NVIDIA CUDA Runtime	nvidia-cuda-runtime	11.1.74
NVIDIA CUDA cuDNN	nvidia-cudnn	8.0.4.30
NVIDIA CUDA cuFFT	nvidia-cufft	10.3.0.74
NVIDIA CUDA cuRAND	nvidia-curand	10.2.2.74
NVIDIA CUDA cuSOLVER	nvidia-cusolver	11.0.0.74
NVIDIA CUDA cuSPARSE	nvidia-cusparse	11.2.0.275
NVIDIA DALI TensorFlow Plugin for CUDA 11.0	nvidia-dali-nvtf-plugin	0.26.0+nv20.10
NVIDIA DALI for CUDA 11.0	nvidia-dali-cuda110	0.26.0
NVIDIA DLprof binary installation	nvidia-dlprof	0.16.0
NVIDIA Nsight Systems, a system-wide performance analysis tool	nvidia-nsys-cli	2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch	nvidia-horovod	0.20.0+nv20.10
NVIDIA CUDA NCCL	nvidia-nccl	2.7.8
DLprof TensorBoard plugin	nvidia-tensorboard-plugin-dlprof	0.8
TensorBoard lets you watch Tensors Flow	nvidia-tensorboard	1.15.0+nv20.10
NVIDIA TensorFlow	nvidia-tensorflow	1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library	nvidia-tensorrt	7.2.1.4

Driver Requirements

Release 20.10 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.10 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.10 is based on 1.15.4 and 2.3.1.
CUDA 11.1.0
cuDNN 8.0.4
TensorRT 7.2.1
DALI 0.26
DLProf 0.16.0

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
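
Static loss scaling, listed among the ResNet-50 v1.5 features above, can be illustrated without a GPU. The sketch below is a toy numeric demonstration, not the model's training code: the function names are hypothetical, the gradient value and the scale of 1024 are arbitrary choices, and half precision is emulated with the `e` format of Python's `struct` module. It shows a small gradient that rounds to zero in FP16 surviving when it is multiplied by a fixed scale before the FP16 store and divided by the same scale afterwards.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def fp16_gradient(true_gradient, loss_scale=1.0):
    """Store a scaled gradient in FP16, then unscale it in full precision."""
    return to_fp16(true_gradient * loss_scale) / loss_scale


if __name__ == "__main__":
    tiny = 1e-8  # arbitrary gradient below the FP16 representable range
    print("without scaling:", fp16_gradient(tiny))
    print("with scale 1024:", fp16_gradient(tiny, loss_scale=1024.0))
```

Because the scale is a constant chosen up front, this is the static variant; it does not adapt if scaled values overflow.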

Known Issues

Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.
A few models see performance regressions compared to the 20.08 release. Training WideAndDeep sees regressions of up to 30% on A100. In FP32, TF1 U-Net Industrial and BERT fine-tuning training regress by 10-20%. The TF2 U-Net Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.
There are several known performance regressions compared to 20.07. U-Net Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. GoogLeNet can be up to 20% slower on V100. ResNet-50 inference can be up to 30% slower on A100 and Turing GPUs.
An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50 and ResNeXt) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:
```
export TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100 and Turing GPUs. This will be addressed in a future release.
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

TensorFlow Wheel Release 20.09

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 4. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Package	Version
NVIDIA CUDA cuBLAS	nvidia-cublas	11.2.0.252
NVIDIA CUDA CUPTI	nvidia-cuda-cupti	11.0.221
NVIDIA CUDA NVCC	nvidia-cuda-nvcc	11.0.221
NVIDIA CUDA NVRTC	nvidia-cuda-nvrtc	11.0.*
NVIDIA CUDA Runtime	nvidia-cuda-runtime	11.0.221
NVIDIA CUDA cuDNN	nvidia-cudnn	8.0.4.12
NVIDIA CUDA cuFFT	nvidia-cufft	10.2.1.245
NVIDIA CUDA cuRAND	nvidia-curand	10.2.1.245
NVIDIA CUDA cuSOLVER	nvidia-cusolver	10.6.0.245
NVIDIA CUDA cuSPARSE	nvidia-cusparse	11.1.1.245
NVIDIA DALI for CUDA 11.0	nvidia-dali	0.25.1
NVIDIA DALI TensorFlow Plugin for CUDA 11.0	nvidia-dali-nvtf-plugin	0.25.1+nv20.09
Distributed training framework for TensorFlow, Keras, and PyTorch	nvidia-horovod	0.19.2+nv20.09
NVIDIA CUDA NCCL	nvidia-nccl	2.7.8
TensorBoard lets you watch Tensors Flow	nvidia-tensorboard	1.15.0+nv20.09
NVIDIA TensorFlow	nvidia-tensorflow	1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library	nvidia-tensorrt	7.1.3.4

Driver Requirements

Release 20.09 is based on NVIDIA CUDA 11.0.3, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.09 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.09 is based on 1.15.3.
NVIDIA cuDNN 8.0.4
DALI 0.25

NVIDIA TensorFlow Wheel Versions

20.09
20.08
20.07
20.06

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
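
The stride difference between ResNet-50 v1 and v1.5 described above can be illustrated by tracking the spatial size through a downsampling bottleneck block. This is a sketch, not the model script: the helper names are hypothetical, and the padding values (0 for the 1x1 convolution, 1 for the 3x3 convolution) and the 56-pixel input are assumptions.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_sizes(size, version):
    """Return the spatial size after the first 1x1 and after the 3x3 convolution.

    Only the downsampling bottleneck block is modeled; padding values are
    assumptions (0 for the 1x1 convolution, 1 for the 3x3 convolution).
    """
    stride_1x1, stride_3x3 = (2, 1) if version == "v1" else (1, 2)
    after_1x1 = conv_output_size(size, kernel=1, stride=stride_1x1, padding=0)
    after_3x3 = conv_output_size(after_1x1, kernel=3, stride=stride_3x3, padding=1)
    return after_1x1, after_3x3


print(bottleneck_sizes(56, "v1"))    # (28, 28)
print(bottleneck_sizes(56, "v1.5"))  # (56, 28)
```

Both variants end at the same spatial size, but in v1.5 the 3x3 convolution still sees the full-resolution input because the downsampling happens one layer later.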

Known Issues

There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. GoogLeNet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.
An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:
```
TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100 and Turing GPUs. This will be addressed in a future release.
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
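
The TF_XLA_FLAGS workaround from the out-of-memory issue above can also be applied from a Python launcher script. The following is a minimal sketch: the helper name add_xla_flag is hypothetical, and it assumes that multiple XLA flags are separated by spaces and that the variable must be set before TensorFlow is imported.

```python
import os


def add_xla_flag(env, flag="--tf_xla_enable_lazy_compilation=false"):
    """Append `flag` to TF_XLA_FLAGS in `env`, keeping any flags already set."""
    current = env.get("TF_XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    env["TF_XLA_FLAGS"] = " ".join(current)
    return env["TF_XLA_FLAGS"]


# Assumption: the variable is read when TensorFlow starts, so call this
# before the first `import tensorflow` in the process.
add_xla_flag(os.environ)
print(os.environ["TF_XLA_FLAGS"])
```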

TensorFlow Wheel Release 20.08

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 5. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Version
nvidia-cublas	11.2.0.252
nvidia-cublas-cupti	11.0.221
nvidia-cuda-nvcc	11.0.221
nvidia-cuda-nvrtc	11.0.*
nvidia-cuda-runtime	11.0.221
nvidia-cudnn	8.0.2.39
nvidia-cufft	10.2.1.245
nvidia-curand	10.2.1.245
nvidia-cusolver	10.6.0.245
nvidia-cusparse	11.1.1.245
nvidia-dali	0.24
nvidia-dali-tf-plugin	0.24
nvidia-horovod	0.19.5
nvidia-nccl	2.7.8
nvidia-tensorflow	1.15.3
nvidia-tensorrt	7.1.3.4

Driver Requirements

Release 20.08 is based on NVIDIA CUDA 11.0.3, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.08 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.08 is based on TensorFlow 1.15.3.
Ubuntu 18.04 with July 2020 updates

NVIDIA TensorFlow Wheel Versions

20.08
20.07
20.06

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
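
The NCF training data format described in the list above, a sequence of user ID and item ID pairs, can be pictured with a short sketch. The function names and the sample pairs are hypothetical and are not taken from the NCF model script; the sketch only shows how such pairs express implicit feedback.

```python
from collections import defaultdict


def build_interactions(pairs):
    """Group (user ID, item ID) pairs into the set of items each user interacted with."""
    interactions = defaultdict(set)
    for user_id, item_id in pairs:
        interactions[user_id].add(item_id)
    return dict(interactions)


def implicit_label(interactions, user_id, item_id):
    """Implicit feedback: 1 if the user interacted with the item, else 0."""
    return 1 if item_id in interactions.get(user_id, set()) else 0


# Hypothetical sample data: each pair means "this user rated or clicked on this item".
pairs = [(0, 10), (0, 12), (1, 10), (2, 11)]
interactions = build_interactions(pairs)
print(implicit_label(interactions, 0, 12))
print(implicit_label(interactions, 1, 12))
```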

Known Issues

The memory required to train MaskRCNN with a given batch size has increased from 20.07 to 20.08. As a result, the batch size may need to be decreased.
There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. GoogLeNet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.
An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:
```
TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
There is a known performance regression of 15% compared to the 20.03 release when training the JoC V-Net Medical models with small batch size and fp32 data type on Turing GPUs. This will be addressed in a future release.
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

TensorFlow Wheel Release 20.07

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 6. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Version
nvidia-cublas	11.1.0
nvidia-cublas-cupti	11.0.194
nvidia-cuda-nvcc	11.0.194
nvidia-cuda-nvrtc	11.0.194
nvidia-cuda-runtime	11.0.194
nvidia-cudnn	8.0.1
nvidia-cufft	10.2.0.218
nvidia-curand	10.2.1.218
nvidia-cusolver	10.5.0.218
nvidia-cusparse	11.1.0.218
nvidia-dali	0.23
nvidia-dali-tf-plugin	0.23
nvidia-horovod	0.19.5
nvidia-nccl	2.7.6
nvidia-tensorflow	1.15.3
nvidia-tensorrt	7.1.3
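
A few of the versions from the table above can be compared against an environment with a sketch like the one below. The helper names and the installed versions are hypothetical (for example, the "+nv20.07" suffix is an assumed local version tag), and only a subset of the table is copied into PINNED.

```python
# Subset of the component versions listed in the table above.
PINNED = {
    "nvidia-cudnn": "8.0.1",
    "nvidia-nccl": "2.7.6",
    "nvidia-tensorflow": "1.15.3",
    "nvidia-tensorrt": "7.1.3",
}


def matches(installed, pinned):
    """True when `installed` starts with the dot-separated components of `pinned`."""
    have = installed.split("+")[0].split(".")
    want = pinned.split(".")
    return have[:len(want)] == want


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (pinned, installed)} for missing or mismatched packages."""
    mismatches = {}
    for name, want in pinned.items():
        have = installed.get(name)
        if have is None or not matches(have, want):
            mismatches[name] = (want, have)
    return mismatches


# Hypothetical installed versions, for example collected from `pip list`.
installed = {
    "nvidia-cudnn": "8.0.1.13",
    "nvidia-nccl": "2.7.6",
    "nvidia-tensorflow": "1.15.3+nv20.07",
    "nvidia-tensorrt": "7.1.2.8",
}
print(find_mismatches(installed))
```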

Driver Requirements

Release 20.07 is based on NVIDIA CUDA 11.0.194, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.07 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.07 is based on TensorFlow 1.15.3.
Improved XLA to avoid excessive recompilations
Enhancements for Automatic Mixed Precision with einsum, 3D Convolutions, and list operations
Improved 3D Convolutions to support NDHWC format
Default TF32 support

NVIDIA TensorFlow Wheel Versions

20.07
20.06

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
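
Static loss scaling, listed among the ResNet-50 v1.5 features above, multiplies the loss by a fixed factor so that small gradients stay representable in half precision, then divides the gradients by the same factor before the weight update. The sketch below uses only the standard library; the gradient value and the scale of 1024 are hypothetical, and to_fp16 merely emulates half-precision rounding rather than showing how TensorFlow implements mixed precision.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def scaled_gradient(gradient, loss_scale):
    """Emulate static loss scaling for a single gradient value.

    Scaling the loss by `loss_scale` scales the gradient by the same factor;
    the result is stored in fp16 and unscaled in higher precision.
    """
    stored = to_fp16(gradient * loss_scale)
    return stored / loss_scale


tiny_gradient = 1e-8  # hypothetical value too small for fp16
print(to_fp16(tiny_gradient))                  # underflows to 0.0
print(scaled_gradient(tiny_gradient, 1024.0))  # nonzero, close to 1e-8
```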

Known Issues

There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100. This will be addressed in a future release.
An out-of-memory condition can occur in TensorFlow (TF1) 20.07 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy: TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
There is a known performance regression of 15% compared to the 20.03 release when training the JoC V-Net Medical models with small batch size and fp32 data type on Turing GPUs. This will be addressed in a future release.
There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
There is a known performance regression of 5-15% on the VAE-CF model when using the pip wheel compared to the corresponding NGC Docker container. This will be addressed in a future release.

TensorFlow Wheel Release 20.06

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 7. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product	Version
nvidia-cublas	11.1.0.213
nvidia-cublas-cupti	11.0.167
nvidia-cuda-nvcc	11.0.167
nvidia-cuda-nvrtc	11.0.167
nvidia-cuda-runtime	11.0.167
nvidia-cudnn	8.0.1.13
nvidia-cufft	10.1.3.191
nvidia-curand	10.2.0.191
nvidia-cusolver	10.4.0.191
nvidia-cusparse	11.0.0.191
nvidia-dali	0.22.0
nvidia-dali-tf-plugin	0.22.0
nvidia-horovod	0.19.1
nvidia-nccl	2.7.5
nvidia-tensorflow	1.15.2 + nv20.06
nvidia-tensorrt	7.1.2.8

Driver Requirements

Release 20.06 is based on NVIDIA CUDA 11.0.167, which requires NVIDIA Driver release 450.36. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.06 release of TensorFlow Wheel requires the following software to be installed:

Ubuntu 18.04 (64-bit) or Ubuntu 20.04 (64-bit)

Note: Ubuntu 18.04 defaults to Python 3.6; however, Ubuntu 20.04 requires the user to install Python 3.6.
Python 3.6
Pip 19.09 or later

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

TensorFlow Wheel version 20.06 is based on TensorFlow 1.15.2.
Integrated latest NVIDIA Deep Learning SDK to support NVIDIA A100 using CUDA 11 and cuDNN 8
Improved NVTX annotations for XLA clusters for use with DLProf
Improved XLA to avoid excessive recompilations
Enhancements for Automatic Mixed Precision with einsum, 3D Convolutions, and list operations
Improved 3D Convolutions to support NDHWC format
Default TF32 support
Ubuntu 18.04 with May 2020 updates

NVIDIA TensorFlow Wheel Versions

20.06 is the first release of the TensorFlow Wheel.

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on A100 and Volta architectures, so you can get results much faster than training without Tensor Cores. These models are tested against each NVIDIA Optimized Frameworks monthly container release to ensure consistent accuracy and performance over time.

U-Net Medical model. The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper U-Net: Convolutional Networks for Biomedical Image Segmentation, without any alteration. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
SSD320 v1.2 model. The SSD320 v1.2 model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as “a method for detecting objects in images using a single deep neural network”. Our implementation is based on the existing model from the TensorFlow models repository. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
Neural Collaborative Filtering (NCF) model. The NCF model is a neural network that provides collaborative filtering based on implicit feedback; specifically, it provides product recommendations based on user and item interactions. The training data for this model should contain a sequence of (user ID, item ID) pairs indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
BERT model. BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
U-Net Industrial Defect Segmentation model. This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
GNMT v2 model. The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper. The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).
ResNet-50 v1.5 model. The ResNet-50 v1.5 model is a modified version of the original ResNet-50 v1 model. The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model: data-parallel multi-GPU training with Horovod, Tensor Cores (mixed precision) training, and static loss scaling for Tensor Cores (mixed precision) training. This model script is available on GitHub as well as NVIDIA GPU Cloud (NGC).

Known Issues

An out-of-memory condition can occur in TensorFlow (TF1) 20.06 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA in TensorFlow 20.05, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the XLA Best Practices Guide, running XLA with the following environment variable opts in to that strategy:
```
TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false
```
TensorFlow Wheel 20.06 NCF will not work until CuPy is updated to support CUDA 11.
There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100. This will be addressed in a future release.